# Explained variance in PCA

Published on

There are quite a few explanations of the principal component analysis (PCA) on the internet, some of them quite insightful. However, one issue that is usually skipped over is the variance explained by principal components, as in “the first 5 PCs explain 86% of variance”. So this is my attempt to explain the explained variance.

## TL;DR

The **total variance** is the sum of variances of all individual principal components.

The fraction of **variance explained** by a principal component is the ratio between the variance of that principal component and the total variance.

For several principal components, add up their variances and divide by the total variance.

## Example & explanation

Let’s define a data set (matrix) in R that consists of 3 variables (columns) and 4 observations (rows), where the third variable is roughly the average of the first two.

```
set.seed(2018)
mx <- matrix(rnorm(4*2), nrow=4, ncol=2)
mx <- cbind(mx, (mx[,1]+mx[,2])/2 + rnorm(4, sd=0.4))
mx
```

```
[,1] [,2] [,3]
[1,] -0.42298398 1.7352837 0.4119150
[2,] -1.54987816 -0.2647112 -0.6524724
[3,] -0.06442932 2.0994707 0.7603068
[4,] 0.27088135 0.8633512 0.1551048
```

And let’s compute the sample variances of each of the 3 variables (columns):

`[1] 0.6261715 1.1068959 0.3612204`

Note that these are different from true population variances, which we know to be equal to 1, 1, and 0.66, respectively.

When talking about PCA, the sum of the sample variances of all individual variables is called the *total variance*. If we divide individual variances by the total variance, we’ll see how much variance each variable explains:

`[1] 0.2989902 0.5285309 0.1724789`

The highest fraction of explained variance among these variables is 53%, and the lowest one is 17%. We can also compute these fractions for subsets of variables. For instance, variables 1 and 2 together explain 83% of the total variance, and variables 1 and 3 explain 47%.

Principal component analysis computes a new set of variables (“principal components”) and expresses the data in terms of these new variables. Considered together, the new variables represent the same amount of information as the original variables, in the sense that we can restore the original data set from the transformed one.

Moreover, the total variance remains the same. However, it is redistributed among the new variables in the most “unequal” way: the first variable not only explains the most variance among the new variables, but the most variance a single variable can possibly explain.

More generally, the first \(k\) principal components (where \(k\) can be 1, 2, 3 etc.) explain the most variance any \(k\) variables can explain, and the last \(k\) variables explain the least variance any \(k\) variables can explain, under some general restrictions. (The restrictions ensure, for example, that we cannot adjust a variable’s explained variance simply by scaling it.)

Let’s try this in practice. This is our data set expressed in the new variables (“principal components”) called PC1, PC2, and PC3:

```
PC1 PC2 PC3
[1,] -0.5876362 -0.3227198 0.055468818
[2,] 1.9378229 -0.1805488 -0.012606347
[3,] -1.1907211 -0.2331953 -0.048668119
[4,] -0.1594657 0.7364640 0.005805649
```

These are the sample variances of the new variables.

```
PC1 PC2 PC3
1.847906652 0.244501735 0.001879334
```

Notice that their sum, the total variance, is the same as for the original variables: 2.09.

And these are the same variances divided by the total variance, i.e. how much of the total variance each new variable explains:

```
PC1 PC2 PC3
0.8823556734 0.1167469648 0.0008973618
```

In the original data set, the highest explained variance by a single variable was 53%; here it’s 88%. And the lowest explained variance dropped from 17% to less than 0.1%. The highest explained variance by two variables also increased, from 83% to 99.9%, and so on. (Because the total variance has not changed, these observation about the fractions of the total variance are equally valid for the variances themselves.)

Caution: PCA, being derived from noisy data, is itself noisy. For instance, it would be a mistake to conclude that there exists a parameter that explains 88% of the variability in the actual quantities we have measured. The true fraction of total variance that can be captured by a single variable in this case is only around 60%, and we would get closer to it if we increased our sample size.

## Mathematical justification

Hopefully the above explanation makes an intuitive sense. But, from a mathematical perspective, it raises several questions:

- What does it mean to introduce new variables?
- Why is the total variance preserved?
- Why can’t we make the variance for individual variables as low or high as we want simply by scaling them?
- Why is it desirable to maximize explained variance?

Let \(X\) be an \(n\times p\) matrix whose columns correspond to \(p\) variables and whose rows correspond to \(n\) samples or measurements.

What does it mean to introduce new variables? In a very general sense, it means coming up with a mapping (mathematical function) \(f:\mathbb{R}^p\to\mathbb{R}^q\) from the old variables to the new ones, such that it has an inverse \(f^{-1}:\mathbb{R}^q\to\mathbb{R}^p\), which restores the original data. The function \(f\) is then applied to each row of \(X\) to get the new data matrix, \(Z\).

This formulation is too general, however. For example, because \(\mathbb{R}^p\) is isomoprhic (as a set) to \(\mathbb{R}\), we can encode everything in a single variable. But the relationship between the old variables and the new one will be very non-trivial. The new variable, even though it encodes precisely the same information, would tell us nothing meaningful about the data.

Therefore, we need to impose some restrictions on the nature of \(f\). Here we require that \(f\) be a linear function; so that \(f(\mathrm{x})=\mathrm{x}A\), where \(\mathrm{x}\) is a row of \(X\) and \(A\) is a \(p\times q\) matrix. The new data matrix can be then computed as \(Z = X A\).

Now it’s no longer possible to squeeze everything into a single variable; for the inverse mapping to exist we need \(q\geq p\). Because we are trying to *reduce* the dimensionality of the data, not *expand* it, we’ll stick with \(q=p\), so \(A\) is an invertible \(p\times p\) square matrix.

But even an invertible linear transformation is too general for PCA because it does not preserve the total variance.

From now on, we shall assume that the sample mean of each column of \(X\) is 0. This can be achieved by subtracting from each column its mean value. Under this assumption, the covariance matrix of \(X\) has a very simple form: \(\mathrm{Cov}(X)=\frac1{n-1} X'X\). Also, if the columns of \(X\) have zero mean, so do the column of \(Z=XA\):

\[ \begin{split} \left(\frac1n\quad \frac1n\quad \ldots\quad \frac1n\right) Z &= \left(\frac1n\quad \frac1n\quad \ldots\quad \frac1n\right) X A\\ &= \left(0\quad 0\quad \ldots\quad 0\right) A\\ &= \left(0\quad 0\quad \ldots\quad 0\right). \end{split} \]

Then the covariance matrix of \(Z\) is \[\mathrm{Cov}(Z)=\frac1{n-1} Z'Z=\frac1{n-1} A'X'XA=A'\mathrm{Cov}(X)A.\]

The total variance of \(X\) is the sum of the diagonal elements in the covariance matrix \(\mathrm{Cov}(X)\), i.e. its *trace*, \(\mathrm{Tr}(\mathrm{Cov}(X))\).

In general, a transformation \(\mathrm{Cov}(X)\mapsto A'\mathrm{Cov}(X)A\) does not preserve the trace of the matrix \(\mathrm{Cov}(X)\). For instance, multiplying all variables by a constant number \(c\) is a linear transformation with matrix \(c\cdot I\). This transformation will inflate the total variance by a factor of \(c^2\).

To ensure that the total variance does not change, we require that \(\mathrm{Cov}(X)\mapsto A'\mathrm{Cov}(X)A\) is a similarity transformation, which always preserves the trace of the matrix it transforms. This is equivalent to saying that \(A^{-1}=A'\), or that \(A\) is orthogonal.

Preserving the total variance is not the only (or even an important) reason to require that the change of variables is an orthogonal transformation. As we saw earlier, the transformation \(c\cdot I\) inflates the total variance by the factor of \(c^2\), but it does so by uniformly inflating the variance of each variable. So when we compute the fraction of the total variance explained by the variables, that common factor cancels out.

The real problem is that we could rescale individual variables. Consider a \(2\times 2\) matrix

\[ A=\pmatrix{a_{11}&0\\0&a_{22}}. \]

Unless \(a_{11}\) and \(a_{22}\) are \(\pm 1\), \(A\) is not orthogonal. So, in general, \(A\) will not preserve the total variance of *every* matrix. However, for *any given* matrix \(X\), there are many possible values of \(a_{11}\) and \(a_{22}\) that will preserve \(X\)’s total variance: they form an ellipse

\[a_{11}^2\mathrm{Var}(X_{\cdot 1})+a_{22}^2\mathrm{Var}(X_{\cdot 2})=\mathrm{Var}(X_{\cdot 1})+\mathrm{Var}(X_{\cdot 2}).\]

Recall that the objective of PCA is make the first variable explain the maximum fraction of the total variance. By choosing \(a_{22}\) close to zero (and inferring \(a_{11}\) from the above equation), we can make the fraction of variance “explained” by the first principal component arbitrarily close to 1 without transforming the data in any meaningful way.

This kind of cheating is made impossible by requiring that \(A\) is orthogonal.

Now let’s talk about *why* it’s desirable to maximize the explained variance. The typical use of PCA is to keep only the first \(k<p\) principal components. Because PCA is an orthogonal transformation, this corresponds to projecting the data from its original \(p\)-dimensional space to a \(k\)-dimensional subspace. The remaining \(p-k\) components are lost in this projection; so it makes sense to minimize the variability of the data in those directions. Because the total variance is constant, minimizing the variance of the last \(p-k\) variables is the same as maximizing the variance of the first \(k\) variables. The choices we make in PCA are motivated precisely by this objective:

- PCA itself is designed to maximize the variance of the first \(k\) components, and minimize the variance of the last \(p-k\) components, compared to all other orthogonal transformations.
- We choose the first \(k\) components, and not just
*some*\(k\) components, because they have the highest variance out of all principal components. - We try to choose \(k\) big enough to make the lost information — the variance of the last \(p-k\) components — sufficiently small.