Roman Cheplyaka

5 ways to manage allocated memory in Haskell

Published on August 6, 2017; tags: Haskell

Let’s say we have a foreign function that takes a C data structure. Our task is to allocate the structure in memory, fill in the fields, call the function, and deallocate the memory.

The data structure is not flat but contains pointers to other data structures that also need to be allocated. An example of such a data structure is a linked list:

typedef struct list {
  int car;
  struct list *cdr;
} list;

And an example of a C function is one that computes the sum of all elements in the list:

int sum(list *l) {
  int s = 0;
  while (l) {
    s += l->car;
    l = l->cdr;
  }
  return s;
}

In this article, I will explore different ways to track all the allocated pointers and free them reliably.

The complete code can be downloaded as a git repo:

git clone https://ro-che.info/files/2017-08-06-manage-allocated-memory-haskell.git

The modules below use the hsc2hs preprocessor; it replaces things like #{peek ...}, #{poke ...}, and #{size ...} with Haskell code.
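
For concreteness, here is a minimal sketch of the shared boilerplate the snippets below assume; the module name, the header file name list.h, and the empty List tag type are my assumptions rather than an exact excerpt from the repo:

-- Sketch of the shared .hsc boilerplate assumed by the snippets below;
-- the actual repo may organise this differently.
module List where

#include "list.h"

import Foreign

-- An empty data type used only as a phantom tag, so that the lists
-- built below have the type Ptr List.
data List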

Way 1: traverse the structure

Since all pointers are stored somewhere in the data structure, we could traverse it to recover and free those pointers.

mkList :: [Int] -> IO (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytes #{size list}
      #{poke list, car} struct x
      xs_c <- mkList xs
      #{poke list, cdr} struct xs_c
      return struct

freeList :: Ptr List -> IO ()
freeList l
  | l == nullPtr = return ()
  | otherwise = do
      cdr <- #{peek list, cdr} l
      free l
      freeList cdr

way1 :: Int -> IO Int
way1 n = bracket (mkList [1..n]) freeList csum

This is how we’d probably do it in C, but in Haskell it has several disadvantages compared to the other options we have:

  1. The code to traverse the structure has to be written manually and is prone to errors.
  2. Having to do the extra work of traversing the structure makes it slower than some of the alternatives.
  3. It is not exception-safe: if something goes wrong inside mkList, the already allocated pointers are lost. Note that this code is async-exception-safe (bracket runs mkList with async exceptions masked), so the only exceptions we need to worry about are those thrown by mkList or freeList themselves (e.g. by mallocBytes); one way to plug that hole is sketched right after this list.
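
For illustration, here is one way to plug that hole. This is a sketch, not code from the repo: it frees the node allocated at the current level if anything later at that level throws (onException comes from Control.Exception; the sketch assumes the pokes themselves cannot fail):

-- Sketch only: if an exception escapes from the inner block, free the
-- node we just allocated and re-throw; the recursive call cleans up
-- its own allocations the same way.
mkListSafe :: [Int] -> IO (Ptr List)
mkListSafe l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytes #{size list}
      flip onException (free struct) $ do
        #{poke list, car} struct x
        xs_c <- mkListSafe xs
        #{poke list, cdr} struct xs_c
        return struct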

Way 2: allocaBytes and ContT

The Foreign.Marshal.Alloc module provides the bracket-style allocaBytes function, which allocates the memory and then automatically releases it.

-- |@'allocaBytes' n f@ executes the computation @f@, passing as argument
-- a pointer to a temporarily allocated block of memory of @n@ bytes.
-- The block of memory is sufficiently aligned for any of the basic
-- foreign types that fits into a memory block of the allocated size.
--
-- The memory is freed when @f@ terminates (either normally or via an
-- exception), so the pointer passed to @f@ must /not/ be used after this.
allocaBytes :: Int -> (Ptr a -> IO b) -> IO b

It works great if we need to allocate a small fixed number of structures:

allocaBytes size1 $ \ptr1 ->
  allocaBytes size2 $ \ptr2 -> do
    -- do something with ptr1 and ptr2

But what if the number of allocations is large or even unknown, as in our case?

For that, we have the continuation monad!

mallocBytesC :: Int -> ContT r IO (Ptr a)
mallocBytesC n = ContT $ \k -> allocaBytes n k

mallocBytesC is a monadic function that returns a pointer to the allocated memory, so it is as convenient and flexible as the plain mallocBytes used in Way 1. Unlike with mallocBytes, though, all the allocated memory is safely and automatically released at the point where we run the ContT layer.
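
As a quick stand-alone illustration (a toy example, not taken from the repo), every allocation made inside the same runContT block stays valid until the block returns and is freed at that point:

-- Toy example: both pointers stay valid inside the block and are
-- freed by allocaBytes when runContT returns.
demo :: IO Int
demo = flip runContT return $ do
  p1 <- mallocBytesC (sizeOf (0 :: Int))
  p2 <- mallocBytesC (sizeOf (0 :: Int))
  liftIO $ do
    poke p1 (40 :: Int)
    poke p2 (2 :: Int)
    (+) <$> peek p1 <*> peek p2

The ContT-based mkList is then almost identical to the one from Way 1: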

mkList :: [Int] -> ContT r IO (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytesC #{size list}
      liftIO $ #{poke list, car} struct x
      xs_c <- mkList xs
      liftIO $ #{poke list, cdr} struct xs_c
      return struct

way2 :: Int -> IO Int
way2 n = runContT (liftIO . csum =<< mkList [1..n]) return

This is my favorite way: it is simple, safe, fast, and elegant.

Way 3: ResourceT

If we replace ContT r with ResourceT, we get a similar function with essentially the same semantics: the memory is released when the ResourceT layer is run.

mallocBytesR :: Int -> ResourceT IO (Ptr a)
mallocBytesR n = snd <$> allocate (mallocBytes n) free

The rest of the code barely changes:

mkList :: [Int] -> ResourceT IO (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytesR #{size list}
      liftIO $ #{poke list, car} struct x
      xs_c <- mkList xs
      liftIO $ #{poke list, cdr} struct xs_c
      return struct

way3 :: Int -> IO Int
way3 n = runResourceT (liftIO . csum =<< mkList [1..n])

However, as we shall see, this version is several times slower than the ContT version (about 12 μs versus 2 μs in the benchmark below).

Way 4: single-block allocation via mallocBytes

Even though our structure contains lots of pointers, we don’t have to allocate each one of them separately. Instead, we could allocate a single chunk of memory and calculate the pointers ourselves.

This method may be more involved for complex data structures, but it is the fastest one, because we only need to keep track of and deallocate a single pointer.

writeList :: Ptr List -> [Int] -> IO ()
writeList ptr l =
  case l of
    [] -> error "writeList: empty list"
    [x] -> do
      #{poke list, car} ptr x
      #{poke list, cdr} ptr nullPtr
    x:xs -> do
      #{poke list, car} ptr x
      let ptr' = plusPtr ptr #{size list}
      #{poke list, cdr} ptr ptr'
      writeList ptr' xs

mkList :: [Int] -> IO (Ptr List)
mkList l
  | null l = return nullPtr
  | otherwise = do
      ptr <- mallocBytes (length l * #{size list})
      writeList ptr l
      return ptr

way4 :: Int -> IO Int
way4 n = bracket (mkList [1..n]) free csum

Way 5: single-block allocation via allocaBytes + ContT

This is the same as Way 4, except it uses allocaBytes instead of mallocBytes. The two functions allocate memory differently (allocaBytes grabs pinned memory from the GHC heap, while mallocBytes calls out to C's malloc), so I thought I'd add this version to the benchmark.

mkList :: [Int] -> ContT r IO (Ptr List)
mkList l
  | null l = return nullPtr
  | otherwise = do
      ptr <- mallocBytesC (length l * #{size list})
      liftIO $ writeList ptr l
      return ptr

way5 :: Int -> IO Int
way5 n = runContT (liftIO . csum =<< mkList [1..n]) return

Is sum pure?

We expose the sum function from C as an IO function csum:

foreign import ccall "sum" csum :: Ptr List -> IO Int

The alternative is to expose it as a pure function:

foreign import ccall "sum" csum :: Ptr List -> Int

The Haskell FFI explicitly allows both declarations, and it may seem that a function summing up numbers deserves to be pure.

However, declaring csum as pure would break every single example above (after making the trivial changes to make it type check). Can you see why?

Benchmark results

The benchmark consists of allocating, summing, and deallocating a list of numbers from 1 to 100.
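
The numbers were gathered with criterion; a minimal sketch of such a harness (the bench names and the use of whnfIO are my assumptions about how the repo's benchmark executable is put together) looks like this:

-- Sketch of a criterion harness; way1 .. way5 are the functions
-- defined in the modules above (their imports are omitted here).
import Criterion.Main

main :: IO ()
main = defaultMain
  [ bench "way1" $ whnfIO (way1 100)
  , bench "way2" $ whnfIO (way2 100)
  , bench "way3" $ whnfIO (way3 100)
  , bench "way4" $ whnfIO (way4 100)
  , bench "way5" $ whnfIO (way5 100)
  ]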

benchmarking way1
time                 3.420 μs   (3.385 μs .. 3.461 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 3.439 μs   (3.414 μs .. 3.508 μs)
std dev              127.8 ns   (72.34 ns .. 244.5 ns)
variance introduced by outliers: 48% (moderately inflated)

benchmarking way2
time                 2.150 μs   (2.142 μs .. 2.158 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 2.142 μs   (2.135 μs .. 2.149 μs)
std dev              24.06 ns   (21.18 ns .. 28.03 ns)

benchmarking way3
time                 12.14 μs   (12.10 μs .. 12.21 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 12.22 μs   (12.17 μs .. 12.30 μs)
std dev              203.6 ns   (156.8 ns .. 277.2 ns)
variance introduced by outliers: 14% (moderately inflated)

benchmarking way4
time                 1.499 μs   (1.489 μs .. 1.509 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.488 μs   (1.483 μs .. 1.495 μs)
std dev              19.83 ns   (16.17 ns .. 24.72 ns)
variance introduced by outliers: 12% (moderately inflated)

benchmarking way5
time                 1.423 μs   (1.405 μs .. 1.447 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 1.431 μs   (1.418 μs .. 1.448 μs)
std dev              51.46 ns   (41.00 ns .. 72.46 ns)
variance introduced by outliers: 49% (moderately inflated)