October 20, 2016
Today I was playing with the count data from a small RNA-Seq experiment performed in Arabidopsis thaliana.
At some point, I decided to look at the mean-variance relationship for the fragment counts. As I said, the dataset is small; there are only 3 replicates per condition from which to estimate the variance. Moreover, each sample is from a different batch. I wasn’t expecting to see much.
But there was a pattern in the mean-variance plot that was impossible to miss.
It is a nice straight line that many points lie on, but none dare to cross. A ceiling.
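The ceiling looked mysterious at first, but then I found a simple explanation. The sample variance of \(n\) numbers \(a_1,\ldots,a_n\) can be written as
\[\sigma^2=\frac{1}{n-1}\sum_{i=1}^n (a_i-\mu)^2=\frac{1}{n-1}\left(\sum_{i=1}^n a_i^2-2\mu\sum_{i=1}^n a_i+n\mu^2\right),\]
where \(\mu\) is the sample mean. Thus, since \(\sum a_i=n\mu\),
\[\sigma^2=\frac{\sum a_i^2-n\mu^2}{n-1}.\]
For non-negative numbers, \(n^2\mu^2=(\sum a_i)^2\geq \sum a_i^2\), and
\[\sigma^2\leq\frac{n^2\mu^2-n\mu^2}{n-1}=n\mu^2.\]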
This means that on a log-log plot, all points \((\mu,\sigma^2)\) lie on or below the line \(y=2x+\log n\).
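A quick way to convince yourself of the bound is a brute-force check. The sketch below is in Python and is not the code used for the analysis above: the simulated counts, the `variance_ceiling` helper and the seed are all made up for illustration. It assumes the variance is the usual unbiased estimate (dividing by \(n-1\)), which is what `statistics.variance` computes.

```python
import random
import statistics


def variance_ceiling(counts):
    """Return n * mean^2, the claimed upper bound on the sample variance."""
    n = len(counts)
    return n * statistics.mean(counts) ** 2


random.seed(0)  # arbitrary seed, only for reproducibility

violations = 0
for _ in range(10_000):
    # Hypothetical data: 3 replicates of non-negative integer counts.
    counts = [random.randint(0, 50) for _ in range(3)]
    var = statistics.variance(counts)  # unbiased estimate, divides by n - 1
    # Small tolerance to absorb floating-point rounding.
    if var > variance_ceiling(counts) + 1e-9:
        violations += 1

print(violations)
```

If the inequality holds, no simulated triple should ever land above the ceiling, so the counter should stay at zero.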
Moreover, the points that lie exactly on the line correspond to the samples where all \(a_i\) but one are zero. In other words, those are gene-condition combinations where the gene’s transcripts were registered in a single replicate for that condition.
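The equality case can be checked on a toy example as well. Again, this is only an illustrative Python sketch with invented counts: one triple where a gene shows up in a single replicate, and one where it shows up in all three.

```python
import math
import statistics


def log_gap(counts):
    """Vertical distance from the ceiling line y = 2x + log(n) on a log-log plot."""
    n = len(counts)
    mu = statistics.mean(counts)
    var = statistics.variance(counts)  # unbiased estimate, divides by n - 1
    return (2 * math.log(mu) + math.log(n)) - math.log(var)


single_replicate = [0, 0, 12]  # hypothetical: counts in one replicate only
all_replicates = [3, 4, 5]     # hypothetical: counts in every replicate

gap_single = log_gap(single_replicate)
gap_all = log_gap(all_replicates)

print(math.isclose(gap_single, 0.0, abs_tol=1e-12))  # point sits on the line
print(gap_all > 0)                                   # point sits below the line
```

For `[0, 0, 12]` the mean is 4 and the variance is 48, which equals \(n\mu^2=3\cdot 16\), so the point lies exactly on the ceiling. For `[3, 4, 5]` the variance is 1, far below \(n\mu^2=48\).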