Understanding Asymmetric Numeral Systems

Published on ; updated on

Apparently, Google is trying to patent (an application of) Asymmetric Numeral Systems, so I spent some time today learning what it is.

In its essense lies a simple and beautiful idea.

ANS is a lossless compression algorithm. Its input is a list of symbols from some finite set. Its output is a positive integer. Each symbol \(s\) has a fixed known probability \(p_s\) of occurring in the list. The algorithm tries to assign each list a unique integer so that the more probable lists get smaller integers.

If we ignore the compression part (assigning smaller integers to more probable inputs), the encoding could be done as follows: convert each symbol to a number from \(0\) to \(B-1\) (where \(B\) is the number of symbols), add a leading 1 to avoid ambiguities caused by leading zeros, and interpret the list as an integer written in a base-\(B\) positional system.

This encoding process is an iterative/recursive algorithm:

  1. Start with the number 1;
  2. If the current number is \(n\), and the incoming symbol correspond to a number \(s\), update the number to be \(s + n\cdot B\).

The decoding process is a corecursive algorithm:

  1. Start with the number that we are decoding;
  2. Split the current number \(n\) into the quotient and remainder modulo \(B\);
  3. Emit the remainder and continue decoding the quotient;
  4. Stop when the current number reaches 1.

(The decoding is LIFO: the first decoded element will be the last encoded element.)

This encoding scheme relies on the standard isomorphism between the sets \(\{0,\ldots,B-1\}\times \mathbb{Z}_{\geq 1}\) and \(\mathbb{Z}_{\geq B}\), established by the functions

\[f(s,n) = s + n\cdot B;\] \[g(n) = (n \bmod B, [n/B]).\]

(The peculiar domain and codomain of this isomorphism are chosen so that we have \(\forall n,s.\;f(s,n) > n\); this ensures that the decoding process doesn’t get stuck.)

We can represent this in Haskell as

{-# LANGUAGE ScopedTypeVariables, TypeApplications,
             NamedFieldPuns, AllowAmbiguousTypes #-}

import Data.Ord
import Data.List
import Numeric.Natural

data Iso a b = Iso
  { to :: a -> b
  , from :: b -> a

encode :: Iso (s, Natural) Natural -> [s] -> Natural
encode Iso{to} = foldl' (\acc s -> to (s, acc)) 1

decode :: Iso (s, Natural) Natural -> Natural -> [s]
decode Iso{from} = reverse . unfoldr
  (\n ->
    if n == 1
      then Nothing
      else Just $ from n

And the standard isomorphism which we used in the simple encoding process is

std_iso :: forall s . (Bounded s, Enum s) => Iso (s, Natural) Natural
std_iso = Iso (\(s,n) -> s2n s + base @s * n) (\n -> (n2s $ n `mod` base @s, n `div` base @s))

s2n :: forall s . (Bounded s, Enum s) => s -> Natural
s2n s = fromIntegral $
  ((fromIntegral . fromEnum) s             :: Integer) -
  ((fromIntegral . fromEnum) (minBound @s) :: Integer)

n2s :: forall s . (Bounded s, Enum s) => Natural -> s
n2s n = toEnum . fromIntegral $
  (fromIntegral n + (fromIntegral . fromEnum) (minBound @s) :: Integer)

base :: forall s . (Bounded s, Enum s) => Natural
base = s2n (maxBound @s) + 1

(The functions are more complicated than they have to be to support symbol types like Int. Int does not start at 0 and is prone to overflow.)

Let’s now turn to the general form of the isomorphism

\[f \colon \{0,\ldots,B-1\}\times \mathbb{Z}_{\geq 1} \to \mathbb{Z}_{\geq \beta};\] \[g \colon \mathbb{Z}_{\geq \beta} \to \{0,\ldots,B-1\}\times \mathbb{Z}_{\geq 1}.\]

(In general, \(\beta\), the smallest value of \(f\), does not have to equal \(B\), the number of symbols.)

If we know (or postulate) that the second component of \(g\), \(g_2\colon \mathbb{Z}_{\geq \beta} \to \mathbb{Z}_{\geq 1}\), is increasing, then we can recover it from the first component, \(g_1\colon \mathbb{Z}_{\geq \beta} \to \{0,\ldots,B-1\}\).

Indeed, for a given \(s=g_1(n)\), \(g_2\) must be the unique increasing isomorphism from \[A_s = \{f(s,m)\mid m\in\mathbb{Z}_{\geq 1}\} = \{n\mid n\in\mathbb{Z}_{\geq \beta}, g_1(n) = s\}\] to \(\mathbb{Z}_{\geq 1}\). To find \(g_2(n)\), count the number of elements in \(A_s\) that are \(\leq n\).

Similarly, we can recover \(f\) from \(g_1\). To compute \(f(s,n)\), take \(n\)th smallest number in \(A_s\).

In Haskell:

ans_iso :: forall s . Eq s => (Natural, Natural -> s) -> Iso (s, Natural) Natural
ans_iso (b, classify) = Iso{to, from} where
  to :: (s, Natural) -> Natural
  to (s, n) = [ k | k <- [b..], classify k == s ] `genericIndex` (n-1)

  from :: Natural -> (s, Natural)
  from n =
    let s = classify n
        n' = genericLength [ () | k <- [b..n], classify k == s ]
    in (s, n')

For every function \(g_1\colon \mathbb{Z}_{\geq \beta} \to \{0,\ldots,B-1\}\) (named classify in Haskell), we have a pair of encode/decode functions, provided that each of the sets \(A_s\) is infinite. In particular, we can get the standard encode/decode functions (originally defined by std_iso) by setting classify to

classify_mod_base :: forall s . (Bounded s, Enum s) => (Natural, Natural -> s)
classify_mod_base = (base @s, \n -> n2s (n `mod` base @s))

By varying \(g_1\) (and therefore the sets \(A_s\)), we can control which inputs get mapped to smaller integers.

If \(A_s\) is more dense, \(f(s,n)\), defined as \(n\)th smallest number in \(A_s\), will be smaller.

If \(A_s\) is more sparse, \(f(s,n)\) will be larger.

The standard isomorphism makes the sets \[A_s = \{ s+n\cdot B \mid n\in \mathbb Z_{\geq 1} \} \] equally dense for all values of \(s\). This makes sense when all \(s\) are equally probable.

But in general, we should make \(A_s\) denser for those \(s\) that are more frequent. Specifically, we want

\[ \frac{|\{k\in A_s \mid k \leq x\}|}{x} \approx p_s. \]

Substituting \(x=f(s,n)\) then gives \(\log_2 f(s,n) \approx \log_2 n + \log_2 (1/p_s)\). This means that adding a symbol \(s\) costs \(\log_2 (1/p_s)\) bits, which is what we should strive for.

Here’s a simple example of a suitable \(g_1\):

classify_prob :: Show s => (Bounded s, Enum s) => [Double] -> (Natural, Natural -> s)
classify_prob probs =
  let beta = 2 -- arbitrary number > 1
      t = genericLength l
      l = concatMap (\(s, t) -> replicate t s)
        . sortBy (comparing (Down . snd))
        . zip [minBound..maxBound]
        $ map (round . (/ minimum probs)) probs
      g1 n = l `genericIndex` ((n-beta) `mod` t)
  in (beta, g1)

This is a periodic function. It computes the number of times each symbol \(s\) will appear within a single period as \(k_s=\mathrm{round}(p_s/\min \{p_s\})\). The number \(p_s/\min \{p_s\}\) is chosen for its following two properties:

  1. it is proportional to the probability of the symbol, \(p_s\);
  2. it is \(\geq 1\), so that even the least likely symbol occurs among the values of the function.

The function then works by mapping the first \(k_0\) numbers to symbol \(0\), the next \(k_1\) numbers to symbol \(1\), and so on, until it maps \(k_{B-1}\) numbers to symbol \(B-1\) and repeats itself. The period of the function is \(\sum_s k_s\approx 1/\min \{p_s\}\).

classify_prob rearranges the symbols in the order of decreasing probability, which gives further advantage to the more probable symbols. This is probably the best strategy if we want to allocate integers in blocks; a better way would be to interleave the blocks in a fair or random way in order to keep the densities more uniform.

Another downside of this function is that its period may be too small to distinguish between similar probabilities, such as 0.4 and 0.6. The function used in rANS is better in this regard; it uses progressively larger intervals, which provide progressively better approximations.

But classify_prob is enough to demonstate the idea. Let’s encode a list of booleans where True is expected 90% of time.

> iso = ans_iso $ classify_prob [0.1,0.9]
> encode iso (replicate 4 True)
> encode iso (replicate 4 False)

Four Trues compress much better than four Falses. Let’s also compare the number of bits in 11111 with the number of bits that the information theory predicts are needed to encode four events with probability 0.1:

> logBase 2 11111
> 4 * logBase 2 (1/0.1)

Not bad.

The implementation of ANS in this article is terribly inefficient, especially its decoding part, mostly because the isomorphism uses brute force search instead of computation. The intention is to elucidate what the encoding scheme looks like and where it comes from. An efficient implementation of ANS and its different variants is an interesting topic in itself, but I’ll leave it for another day.

The full code (including tests) is available here.

Thanks to /u/sgraf812 for pointing out a mistake in a previous version of classify_prob.