Asymmetry of costs in (t-)SNE


Both the SNE and t-SNE papers contain the following claim:

Because the Kullback-Leibler divergence is not symmetric, different types of error in the pairwise distances in the low-dimensional map are not weighted equally. In particular, there is a large cost for using widely separated map points to represent nearby datapoints (i.e., for using a small \(q_{j|i}\) to model a large \(p_{j|i}\)), but there is only a small cost for using nearby map points to represent widely separated datapoints.

(t-SNE)

Notice that making \(q_{ij}\) large when \(p_{ij}\) is small wastes some of the probability mass in the \(q\) distribution so there is a cost for modeling a big distance in the high-dimensional space with a small distance in the low-dimensional space, though it is less than the cost of modeling a small distance with a big one.

(SNE)

Here \(p_{ij}\) and \(q_{ij}\) refer to the probabilities in the Kullback–Leibler cost function

\[ KL(P||Q)=\sum_{i,j} p_{ij} \log\frac{p_{ij}}{q_{ij}}, \]

which are also measures of similarity: \(p_{ij}\) is high when the points \(i\) and \(j\) are close to each other in the original, high-dimensional space, and \(q_{ij}\) is high when the points are close in the low-dimensional map space.
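Since the claim hinges on this asymmetry, here is a tiny R sketch of it in isolation, using two arbitrary, made-up discrete distributions (nothing to do with SNE yet): swapping the arguments of the divergence gives two different values.

# Two arbitrary discrete distributions over three outcomes (made up for illustration).
p <- c(0.7, 0.2, 0.1)
q <- c(0.4, 0.4, 0.2)

kl_div <- function(p, q) sum(p * log(p / q))

kl_div(p, q)  # KL(P||Q)
kl_div(q, p)  # KL(Q||P): in general a different value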

This got me thinking:

  1. How do we interpret this claim?
  2. Does it hold universally as a mathematical theorem, or is it an empirical observation that holds most of the time?

There is a simple but weak interpretation of the claim that obviously holds. If \(q_{ij}\) tends to 0 (so the points in the low-dimensional space become arbitrarily far apart), the cost function tends to infinity, whereas if \(p_{ij}\) tends to 0, the cost function remains bounded. This interpretation is weak because it only applies asymptotically, whereas in real (t-)SNE we are concerned with distributions of points within small rectangular plots.
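To make the asymptotic picture numeric (my own quick illustration): the contribution of a single pair to the cost is \(p\log(p/q)\), which grows without bound as \(q\to 0\) with \(p\) fixed, but tends to 0 as \(p\to 0\) with \(q\) fixed.

# Contribution of a single pair to the KL cost: p * log(p / q).
pair_cost <- function(p, q) p * log(p / q)

# q -> 0 with p fixed: the term grows without bound.
pair_cost(0.5, c(1e-2, 1e-4, 1e-8))

# p -> 0 with q fixed: the term stays bounded and tends to 0.
pair_cost(c(1e-2, 1e-4, 1e-8), 0.5)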

Interpreting the claim is not trivial because we need two different datasets:

  1. A dataset where two close points are mapped to two points that are farther away (high \(p\), low \(q\)), and
  2. a dataset where two distant points are mapped to two points that are close (low \(p\), high \(q\)).

Then the claim is that the value of the cost function in (1) is higher than in (2). Clearly, the claim will not hold in general without requiring some sort of relationship between the datasets (1) and (2).

This is a good time to introduce some simplifying assumptions:

  1. The conditional probability distributions will be Gaussian, as in the original SNE.
  2. We shall ignore the perplexity-based adaptive computation of the variance in the data space and assume \(\sigma^2=1/2\) instead, just like in the map space. This makes \(p_{ij}\) symmetric and easy to compute.
  3. Our original (“high-dimensional”) and map (“low-dimensional”) space will both have dimension 2. This allows us to draw them easily and makes it trivial to find the optimal embedding: take \(y_i=x_i\).
  4. Finally, our dataset will contain 3 points. We cannot work with only 2 points, because then \(p_{12}\) and \(q_{12}\) will always be 1. But 3 points are enough to make SNE interesting and test the claim. We will arrange these points to form isosceles triangles, where we fix the base and move the apex towards or away from the base. The length of the base will be 1, and the length of the legs will be denoted by \(d\).
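To make the triangle setup concrete, here is a small sketch (the helper name triangle is mine) that places the base endpoints at \((0,0)\) and \((1,0)\), puts the apex above the midpoint of the base, and confirms that the pairwise distances come out as \(1\), \(d\), \(d\):

# Isosceles triangle with base 1 and legs d: the two base endpoints plus the
# apex above the midpoint of the base (needs d > 0.5 for a real apex height).
triangle <- function(d) {
  h <- sqrt(d^2 - 0.25)
  rbind(c(0, 0), c(1, 0), c(0.5, h))
}

dist(triangle(2))  # pairwise distances: 1 (base), 2 and 2 (legs)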

Given these constraints, is it true that when we move \(x_3\) closer to \(x_1\) and \(x_2\), we suffer more than if we move it away? Again, this has no hope of holding in general (for every combination of displacements towards the base and away from it): because the cost function is continuous and vanishes for zero displacement, for any given displacement away from the base there is a small enough displacement towards it with a lower cost.

What if we displace \(x_3\) towards the base and away from it by the same amount? Then the result would depend on the specifics of the conditional distributions (normal, Cauchy) and their parameters (\(\sigma\)), which we are not modelling faithfully here, and would not be a consequence of the asymmetry of the KL divergence as implied by the statements quoted above.

Motivated by these thoughts, I came up with an interpretation that focuses solely on the asymmetry of the KL divergence, although it still uses the Gaussian distribution to calculate \(p_{ij}\) and \(q_{ij}\).

In this interpretation, we consider two isosceles triangles with the same base but different legs, \(d_1\) and \(d_2\). These triangles define probability distributions \(P(d_1)\) and \(P(d_2)\). The asymmetry of the KL divergence means that \(KL(P(d_1)||P(d_2))\) and \(KL(P(d_2)||P(d_1))\) are, in general, different. These divergences arise when mapping the longer triangle onto the shorter one and vice versa.

So the question becomes: is it true that \(d_1>d_2\) implies \(KL(P(d_2)||P(d_1)) > KL(P(d_1)||P(d_2))\)?

The answer can easily be computed for different values of \(d_1\) and \(d_2\) in R:

library(dplyr)
library(ggplot2)

# Pairwise probabilities P(d) for a triangle with base 1 and legs d:
# proportional to exp(-distance^2), normalised over the three pairs.
probs <- function(d) {
  p <- exp(-c(1, d, d)^2)
  return(p / sum(p))
}

# KL(P(d1) || P(d2))
kl <- Vectorize(function(d1, d2) {
  p <- probs(d1)
  q <- probs(d2)
  return(sum(p * log(p / q)))
})

# Evaluate the claim KL(P(d2)||P(d1)) > KL(P(d1)||P(d2)) on a grid with d1 > d2.
d <- expand.grid(d1 = seq(0.5, 3, by = 0.03), d2 = seq(0.5, 3, by = 0.03)) %>%
  filter(d1 > d2) %>%
  mutate(claim = (kl(d1, d2) < kl(d2, d1)))

# Plot the grid, colouring each point by whether the claim holds there.
ggplot(d) + geom_point(aes(d1, d2, color = claim), size = 1) +
  scale_color_manual(values = c("red", "darkgreen")) +
  theme_bw() + coord_fixed()

As can be seen from the graph, the claim holds when \(d_1\) is high enough, consistent with the asymptotic interpretation. But it does not hold universally.
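For a quick spot check (my own addition; the particular \((d_1, d_2)\) pairs are arbitrary), one can print both directions of the divergence at individual points of the grid:

# Both directions of the divergence at individual (d1, d2) pairs with d1 > d2.
# Per the plot, the claim can fail when d1 is small and holds once d1 is large
# enough (a long, thin triangle).
kl(0.6, 0.5); kl(0.5, 0.6)  # KL(P(d1)||P(d2)) vs KL(P(d2)||P(d1))
kl(3, 1); kl(1, 3)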