Understanding Asymmetric Numeral Systems
Apparently, Google is trying to patent (an application of) Asymmetric Numeral Systems, so I spent some time today learning what it is.
In its essence lies a simple and beautiful idea.
ANS is a lossless compression algorithm. Its input is a list of symbols from some finite set. Its output is a positive integer. Each symbol \(s\) has a fixed known probability \(p_s\) of occurring in the list. The algorithm tries to assign each list a unique integer so that the more probable lists get smaller integers.
If we ignore the compression part (assigning smaller integers to more probable inputs), the encoding could be done as follows: convert each symbol to a number from \(0\) to \(B-1\) (where \(B\) is the number of symbols), add a leading 1 to avoid ambiguities caused by leading zeros, and interpret the list as an integer written in a base-\(B\) positional system.
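For example, take \(B=2\) with symbols \(\mathsf a = 0\) and \(\mathsf b = 1\). The list \(\mathsf b, \mathsf a, \mathsf b\) becomes \(1101\) in binary (the leading 1 followed by the digits 1, 0, 1), i.e. the number 13.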
This encoding process is an iterative/recursive algorithm:
- Start with the number 1;
- If the current number is \(n\), and the incoming symbol corresponds to a number \(s\), update the number to be \(s + n\cdot B\).
The decoding process is a corecursive algorithm:
- Start with the number that we are decoding;
- Split the current number \(n\) into the quotient and remainder modulo \(B\);
- Emit the remainder and continue decoding the quotient;
- Stop when the current number reaches 1.
(The decoding is LIFO: the first decoded element will be the last encoded element.)
This encoding scheme relies on the standard isomorphism between the sets \(\{0,\ldots,B-1\}\times \mathbb{Z}_{\geq 1}\) and \(\mathbb{Z}_{\geq B}\), established by the functions
\[f(s,n) = s + n\cdot B;\] \[g(n) = (n \bmod B, \lfloor n/B \rfloor).\]
(The peculiar domain and codomain of this isomorphism are chosen so that we have \(\forall n,s.\;f(s,n) > n\); this ensures that the decoding process doesn’t get stuck.)
We can represent this in Haskell as
{-# LANGUAGE ScopedTypeVariables, TypeApplications,
NamedFieldPuns, AllowAmbiguousTypes #-}
import Data.Ord
import Data.List
import Numeric.Natural
data Iso a b = Iso
  { to :: a -> b
  , from :: b -> a
  }
encode :: Iso (s, Natural) Natural -> [s] -> Natural
encode Iso{to} = foldl' (\acc s -> to (s, acc)) 1

decode :: Iso (s, Natural) Natural -> Natural -> [s]
decode Iso{from} = reverse . unfoldr
  (\n ->
    if n == 1
      then Nothing
      else Just $ from n
  )
And the standard isomorphism which we used in the simple encoding process is
std_iso :: forall s . (Bounded s, Enum s) => Iso (s, Natural) Natural
std_iso = Iso
  (\(s, n) -> s2n s + base @s * n)
  (\n -> (n2s $ n `mod` base @s, n `div` base @s))

s2n :: forall s . (Bounded s, Enum s) => s -> Natural
s2n s = fromIntegral $
  ((fromIntegral . fromEnum) s :: Integer) -
  ((fromIntegral . fromEnum) (minBound @s) :: Integer)

n2s :: forall s . (Bounded s, Enum s) => Natural -> s
n2s n = toEnum . fromIntegral $
  (fromIntegral n + (fromIntegral . fromEnum) (minBound @s) :: Integer)

base :: forall s . (Bounded s, Enum s) => Natural
base = s2n (maxBound @s) + 1
(The functions are more complicated than they have to be in order to support symbol types like Int, which does not start at 0 and is prone to overflow.)
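To see the scheme at work, here is what a GHCi session might look like once the definitions above are loaded (for Bool the base is 2, so the result is just the list read as binary digits after the leading 1):

> encode std_iso [True, True, False]
14
> decode std_iso 14
[True,True,False]

Indeed, 14 is 1110 in binary: the leading 1 followed by 1, 1, 0.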
Let’s now turn to the general form of the isomorphism
\[f \colon \{0,\ldots,B-1\}\times \mathbb{Z}_{\geq 1} \to \mathbb{Z}_{\geq \beta};\] \[g \colon \mathbb{Z}_{\geq \beta} \to \{0,\ldots,B-1\}\times \mathbb{Z}_{\geq 1}.\]
(In general, \(\beta\), the smallest value of \(f\), does not have to equal \(B\), the number of symbols.)
If we know (or postulate) that the second component of \(g\), \(g_2\colon \mathbb{Z}_{\geq \beta} \to \mathbb{Z}_{\geq 1}\), is increasing, then we can recover it from the first component, \(g_1\colon \mathbb{Z}_{\geq \beta} \to \{0,\ldots,B-1\}\).
Indeed, for a given \(s=g_1(n)\), \(g_2\) must be the unique increasing isomorphism from \[A_s = \{f(s,m)\mid m\in\mathbb{Z}_{\geq 1}\} = \{n\mid n\in\mathbb{Z}_{\geq \beta}, g_1(n) = s\}\] to \(\mathbb{Z}_{\geq 1}\). To find \(g_2(n)\), count the number of elements in \(A_s\) that are \(\leq n\).
Similarly, we can recover \(f\) from \(g_1\): \(f(s,n)\) is the \(n\)th smallest number in \(A_s\).
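For example, for the standard isomorphism with \(B=\beta=2\) we have \(g_1(n) = n \bmod 2\) and \(A_0=\{2,4,6,\ldots\}\), so \(f(0,3)=6\) (the third smallest element of \(A_0\)) and \(g_2(6)=3\) (three elements of \(A_0\) are \(\leq 6\)).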
In Haskell:
ans_iso :: forall s . Eq s => (Natural, Natural -> s) -> Iso (s, Natural) Natural
ans_iso (b, classify) = Iso{to, from} where
  to :: (s, Natural) -> Natural
  to (s, n) = [ k | k <- [b..], classify k == s ] `genericIndex` (n-1)

  from :: Natural -> (s, Natural)
  from n =
    let s = classify n
        n' = genericLength [ () | k <- [b..n], classify k == s ]
    in (s, n')
For every function \(g_1\colon \mathbb{Z}_{\geq \beta} \to \{0,\ldots,B-1\}\) (named classify in the Haskell code), we get a pair of encode/decode functions, provided that each of the sets \(A_s\) is infinite. In particular, we can recover the standard encode/decode functions (originally defined by std_iso) by setting classify to
classify_mod_base :: forall s . (Bounded s, Enum s) => (Natural, Natural -> s)
classify_mod_base = (base @s, \n -> n2s (n `mod` base @s))
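As a sanity check, the isomorphism generated from classify_mod_base should agree with the direct definition; in a hypothetical GHCi session (with TypeApplications enabled):

> to (ans_iso (classify_mod_base @Bool)) (True, 3)
7
> to (std_iso @Bool) (True, 3)
7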
By varying \(g_1\) (and therefore the sets \(A_s\)), we can control which inputs get mapped to smaller integers.
If \(A_s\) is denser, then \(f(s,n)\), the \(n\)th smallest number in \(A_s\), will be smaller; if \(A_s\) is sparser, \(f(s,n)\) will be larger.
The standard isomorphism makes the sets \[A_s = \{ s+n\cdot B \mid n\in \mathbb Z_{\geq 1} \} \] equally dense for all values of \(s\). This makes sense when all \(s\) are equally probable.
But in general, we should make \(A_s\) denser for those \(s\) that are more frequent. Specifically, we want
\[ \frac{|\{k\in A_s \mid k \leq x\}|}{x} \approx p_s. \]
Substituting \(x=f(s,n)\), and noting that exactly \(n\) elements of \(A_s\) are \(\leq f(s,n)\), gives \(f(s,n) \approx n/p_s\), and therefore \(\log_2 f(s,n) \approx \log_2 n + \log_2 (1/p_s)\). This means that adding a symbol \(s\) costs \(\log_2 (1/p_s)\) bits, which is what we should strive for.
Here’s a simple example of a suitable \(g_1\):
classify_prob :: forall s . (Bounded s, Enum s) => [Double] -> (Natural, Natural -> s)
classify_prob probs =
  let beta = 2 -- arbitrary number > 1
      t = genericLength l
      l = concatMap (\(s, k) -> replicate k s)
        . sortBy (comparing (Down . snd))
        . zip [minBound .. maxBound]
        $ map (round . (/ minimum probs)) probs
      g1 n = l `genericIndex` ((n - beta) `mod` t)
  in (beta, g1)
This is a periodic function. It computes the number of times each symbol \(s\) will appear within a single period as \(k_s=\mathrm{round}(p_s/\min \{p_s\})\). The number \(p_s/\min \{p_s\}\) is chosen for the following two properties:
- it is proportional to the probability of the symbol, \(p_s\);
- it is \(\geq 1\), so that even the least likely symbol occurs among the values of the function.
The function then works by mapping the first \(k_0\) numbers to symbol \(0\), the next \(k_1\) numbers to symbol \(1\), and so on, until it maps \(k_{B-1}\) numbers to symbol \(B-1\) and repeats itself. The period of the function is \(\sum_s k_s\approx 1/\min \{p_s\}\).
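For instance, with Bool symbols and probabilities [0.1, 0.9] we get \(k_{\mathrm{False}}=1\) and \(k_{\mathrm{True}}=9\), hence a period of 10 starting at \(\beta=2\). A quick check in a hypothetical GHCi session:

> g1 = snd (classify_prob @Bool [0.1, 0.9])
> map g1 [2..13]
[True,True,True,True,True,True,True,True,True,False,True,True]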
classify_prob rearranges the symbols in the order of decreasing probability, which gives a further advantage to the more probable symbols. This is probably the best strategy if we want to allocate integers in blocks; a better way would be to interleave the blocks in a fair or random way in order to keep the densities more uniform.
Another downside of this function is that its period may be too small to distinguish between similar probabilities, such as 0.4 and 0.6. The function used in rANS is better in this regard; it uses progressively larger intervals, which provide progressively better approximations.
But classify_prob is enough to demonstrate the idea.
Let’s encode a list of booleans where True is expected 90% of the time.
> iso = ans_iso $ classify_prob @Bool [0.1,0.9]
> encode iso (replicate 4 True)
5
> encode iso (replicate 4 False)
11111
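Decoding, of course, recovers the input:

> decode iso 11111
[False,False,False,False]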
Four Trues compress much better than four Falses. Let’s also compare the number of bits in 11111 with the number of bits that information theory predicts are needed to encode four events of probability 0.1:
> logBase 2 11111
13.439701045971955
> 4 * logBase 2 (1/0.1)
13.28771237954945
Not bad.
The implementation of ANS in this article is terribly inefficient, especially its decoding part, mostly because the isomorphism uses brute-force search instead of direct computation. The intention is to elucidate what the encoding scheme looks like and where it comes from. An efficient implementation of ANS and its different variants is an interesting topic in itself, but I’ll leave it for another day.
The full code (including tests) is available here.
Thanks to /u/sgraf812 for pointing out a mistake in a previous version of classify_prob.