RNA-Seq normalization explained

Published on November 28, 2016; tags: Bioinformatics, Probability

RNA-Seq (short for RNA sequencing) is a type of experiment that lets us measure gene expression. The sequencing step produces a large number (tens of millions) of cDNA1 fragment sequences called reads. Every read represents a part of some RNA molecule in the sample2.

Then we assign (“map”) every read to one of the isoforms and count how many reads each isoform has got.

All else being equal, the more abundant an isoform is, the more fragments from it are likely to be sequenced. Therefore, we can use read counts as a proxy for isoform abundance.

However, since “all else” is never equal, the counts need to be adjusted to be comparable across isoforms, samples, and experiments. Here we will explore these adjustments and why they are necessary.

RPK

Consider the following mapped reads from an RNA-Seq experiment. Which isoform is more abundant, the red one or the yellow one?

The yellow isoform has got more reads assigned to it, but it is also much longer than the red one. The longer the isoform, the more fragments (and, therefore, reads) we should expect it to generate.

To be able to compare read counts across isoforms, we divide the counts by the isoform length. It is also customary to multiply the number by $1000$, obtaining reads per kilobase:

$RPK_i=10^3\cdot\frac{n_i}{l_i},$

where $n_i$ is the number of reads mapped to isoform $i$, and $l_i$ is the length of that isoform.

RPKM and RPM

Consider the data from two RNA-Seq experiments. In which one is the red isoform more expressed?

If we rely on raw counts, the first experiment has produced more red fragments, and so we may conclude that the red isoform is more expressed there.

Computing $RPK$ won’t change anything since we are comparing expression of the same isoform.

But the first experiment has produced more total fragments sequenced, and the higher the overall number of reads is, the higher count we should expect for any given isoform.

To compare counts across experiments, we should further normalize by the total fragment count (usually expressed in millions). Thus, reads per kilobase per million is computed as

$RPKM_i=10^9\cdot\frac{n_i}{l_i\cdot \sum_j n_j}.$

If we do not intend to compare abundances across isoforms, then simply reads per million will do:

$RPM_i=10^6\cdot\frac{n_i}{\sum_j n_j}.$

TPM

Here we consider RNA-Seq data from two different tissues. For simplicity, let’s make a (completely unrealistic) assumption that in each tissue, only two isoforms are expressed: red and yellow in tissue 1, and red and green in tissue 2. We are interested in the difference in expression of the red isoform in the two tissues.

The number of reads for the red isoform is the same in both cases. Since we are comparing the expression of the same isoform, the $RPK$ values will be identical too. Furthermore, the total number of reads is the same between the experiments, so even the $RPKM$ values won’t show any difference. But is there one?

Imagine for a second that there is an equal number of red and yellow transcripts in the tissue 1 sample. Since the red isoform is longer than the yellow one, it yields more fragments, and so we should observe more red reads than yellow ones. Yet we see more yellow reads. This means that the number of red transcripts is significantly lower than the number of yellow transcripts.

If we conduct the same thought experiment with tissue 2, we would expect somewhat more green fragments than red ones, and this is what we observe. The relative abundance of the red isoform in tissue 2 is probably close to $50\%$ and thus is higher than in tissue 1.

The formula for $RPKM_i$ takes into account the length of isoform $i$ only. But the lengths of other isoforms clearly have an impact on the relative number of isoform $i$’s fragments.

The metric called $TPM$, or transcripts per million, directly measures the relative abundance of transcripts. To estimate it from the count data, notice that the $RPK$ values are proportional to the abundances of isoforms within a single experiment. Thus the abundance of isoform $i$ per million transcripts can be estimated as

$TPM_i=10^6\cdot\frac{RPK_i}{\sum_j RPK_j}=10^6\cdot\frac{n_i/l_i}{\sum_j n_j/l_j}.$

Relative vs absolute expression

So far we have been estimating the relative abundance, i.e. what proportion of transcripts in a sample belong to a particular isoform. Can we estimate the absolute abundance from RNA-Seq data?

Consider the two cells drawn below. The colored squiggly lines represent individual transcripts of the corresponding isoforms.

Cell B has twice as many transcripts of each isoform as cell A. If we conduct RNA-Seq experiments in the two cells, we would get samples from essentially the same distribution. We wouldn’t be able to tell the cells apart based on their RNA-Seq.

Here’s a trick: during library preparation, we add a known amount of an artificial RNA or DNA that is not produced by the studied organism (the blue squiggle below), then we can compare all abundances against it. This artificially introduced material is called a spike-in.

If we regard the spike-in as isoform $0$, with the known absolute abundance of $T_0$ transcripts, then the absolute abundance of isoform $i$ can be estimated as

$T_i = T_0\cdot\frac{RPK_i}{RPK_0}=T_0\cdot\frac{n_i/l_i}{n_0/l_0}.$

Dangers of relative expression

Whether we care about absolute or relative expression depends on the biological question in hand. However, looking at relative expression alone can produce unexpected results.

Suppose again that only two isoforms are being expressed, red and yellow. In condition A, the two isoforms are equally expressed. In condition B, the yellow isoform’s expression doubles, while the red isoform’s expression is not affected at all.

Now let’s look at the relative expression:

Based on this chart, we might conclude that the red isoform is also differentially expressed between the two conditions. Technically this is true, as long as we are talking about relative expression, but this is only a consequence of the overexpression of the yellow isoform.

This article is a transcribed part of a talk I gave in Kiev on November 27, 2016. You can look through the slides (in English) and watch the video (in Russian).

Other resources:

1. complementary DNA, meaning the DNA that has been reverse-transcribed from RNA

2. for simplicity, I do not consider paired-end sequencing here