6 ways to manage allocated memory in Haskell

Published on August 6, 2017; updated on November 20, 2017

Let’s say we have a foreign function that takes a C data structure. Our task is to allocate the structure in memory, fill in the fields, call the function, and deallocate the memory.

The data structure is not flat but contains pointers to other data structures that also need to be allocated. An example of such a data structure is a linked list:

typedef struct list {
  int car;
  struct list *cdr;
} list;

And an example of a C function is one that computes the sum of all elements in the list:

int sum(list *l) {
  int s = 0;
  while (l) {
    s += l->car;
    l = l->cdr;
  }
  return s;
}

In this article, I will explore different ways to track all the allocated pointers and free them reliably.

The complete code can be downloaded as a git repo:

git clone https://ro-che.info/files/2017-08-06-manage-allocated-memory-haskell.git

The modules below use the hsc2hs preprocessor; it replaces things like #{peek ...}, #{poke ...}, and #{size ...} with Haskell code.

Way 1: traverse the structure

Since all pointers are stored somewhere in the data structure, we could traverse it to recover and free those pointers.

mkList :: [Int] -> IO (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytes #{size list}
      #{poke list, car} struct x
      xs_c <- mkList xs
      #{poke list, cdr} struct xs_c
      return struct

freeList :: Ptr List -> IO ()
freeList l
  | l == nullPtr = return ()
  | otherwise = do
      cdr <- #{peek list, cdr} l
      free l
      freeList cdr

way1 :: Int -> IO Int
way1 n = bracket (mkList [1..n]) freeList csum

This is how we’d probably do it in C, but in Haskell it has several disadvantages compared to the other options we have:

The code to traverse the structure has to be written manually and is prone to errors.
Having to do the extra work of traversing the structure makes it slower than some of the alternatives.
It is not exception-safe; if something happens inside mkList, the already allocated pointers will be lost. Note that this code is async-exception-safe (bracket masks async exceptions for mkList), so the only exceptions we need to worry about must come from mkList or freeList (e.g. from mallocBytes).

Way 2: allocaBytes and ContT

The Foreign.Marshal.Alloc module provides the bracket-style allocaBytes function, which allocates the memory and then automatically releases it.

-- |@'allocaBytes' n f@ executes the computation @f@, passing as argument
-- a pointer to a temporarily allocated block of memory of @n@ bytes.
-- The block of memory is sufficiently aligned for any of the basic
-- foreign types that fits into a memory block of the allocated size.
--
-- The memory is freed when @f@ terminates (either normally or via an
-- exception), so the pointer passed to @f@ must /not/ be used after this.
allocaBytes :: Int -> (Ptr a -> IO b) -> IO b

It works great if we need to allocate a small fixed number of structures:

allocaBytes size1 $ \ptr1 ->
  allocaBytes size2 $ \ptr2 -> do
    -- do something with ptr1 and ptr2

But what if the number of allocations is large or even unknown, as in our case?

For that, we have the continuation monad!

mallocBytesC :: Int -> ContT r IO (Ptr a)
mallocBytesC n = ContT $ \k -> allocaBytes n k

mallocBytesC is a monadic function that returns a pointer to the allocated memory, so it is as convenient and flexible as the simple mallocBytes used in Way 1. But, unlike mallocBytes, all allocated memory will be safely and automatically released at the point where we run the ContT layer.

mkList :: [Int] -> ContT r IO (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytesC #{size list}
      liftIO $ #{poke list, car} struct x
      xs_c <- mkList xs
      liftIO $ #{poke list, cdr} struct xs_c
      return struct

way2 :: Int -> IO Int
way2 n = runContT (liftIO . csum =<< mkList [1..n]) return

This is my favorite way: it is simple, safe, fast, and elegant.

Way 3: ResourceT

If we replace ContT r with ResourceT, we get a similar function with essentially the same semantics: the memory released when the ResourceT layer is run.

mallocBytesR :: Int -> ResourceT IO (Ptr a)
mallocBytesR n = snd <$> allocate (mallocBytes n) free

The rest of the code barely changes:

mkList :: [Int] -> ResourceT IO (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytesR #{size list}
      liftIO $ #{poke list, car} struct x
      xs_c <- mkList xs
      liftIO $ #{poke list, cdr} struct xs_c
      return struct

way3 :: Int -> IO Int
way3 n = runResourceT (liftIO . csum =<< mkList [1..n])

However, as we shall see, this version is an order of magnitude slower than the ContT version.

Way 4: single-block allocation via mallocBytes

Even though our structure contains lots of pointers, we don’t have to allocate each one of them separately. Instead, we could allocate a single chunk of memory and calculate the pointers ourselves.

This method may be more involved for complex data structures, but it is the fastest one, because we only need to keep track of and deallocate a single pointer.

writeList :: Ptr List -> [Int] -> IO ()
writeList ptr l =
  case l of
    [] -> error "writeList: empty list"
    [x] -> do
      #{poke list, car} ptr x
      #{poke list, cdr} ptr nullPtr
    x:xs -> do
      #{poke list, car} ptr x
      let ptr' = plusPtr ptr #{size list}
      #{poke list, cdr} ptr ptr'
      writeList ptr' xs

mkList :: [Int] -> IO (Ptr List)
mkList l
  | null l = return nullPtr
  | otherwise = do
      ptr <- mallocBytes (length l * #{size list})
      writeList ptr l
      return ptr

way4 :: Int -> IO Int
way4 n = bracket (mkList [1..n]) free csum

Way 5: single-block allocation via allocaBytes + ContT

This is the same as Way 4, except using allocaBytes instead of mallocBytes. The two functions allocate the memory differently, so I thought I’d add this version to the benchmark.

mkList :: [Int] -> ContT r IO (Ptr List)
mkList l
  | null l = return nullPtr
  | otherwise = do
      ptr <- mallocBytesC (length l * #{size list})
      liftIO $ writeList ptr l
      return ptr

way5 :: Int -> IO Int
way5 n = runContT (liftIO . csum =<< mkList [1..n]) return

Way 6: managed

This is the same as Way 2, but using the managed library by Gabriel Gonzalez. (Thanks to reddit user /u/andriusst for bringing it up.)

mallocBytesC :: Int -> Managed (Ptr a)
mallocBytesC n = managed $ \k -> allocaBytes n k

mkList :: [Int] -> Managed (Ptr List)
mkList l =
  case l of
    [] -> return nullPtr
    x:xs -> do
      struct <- mallocBytesC #{size list}
      liftIO $ #{poke list, car} struct x
      xs_c <- mkList xs
      liftIO $ #{poke list, cdr} struct xs_c
      return struct

way6 :: Int -> IO Int
way6 n = with (liftIO . csum =<< mkList [1..n]) return

Managed is just a newtype wrapper around the polymorphic continuation monad (a.k.a. the Codensity monad) applied to IO. Because the underlying implementation is the same, the run time performance is exactly that of Way 2, but you get to work with simpler types: Managed instead of ContT r IO.

Is sum pure?

We expose the sum function from C as an IO function csum:

foreign import ccall "sum" csum :: Ptr List -> IO Int

The alternative is to expose it as a pure function:

foreign import ccall "sum" csum :: Ptr List -> Int

The Haskell FFI explicitly allows both declarations, and it may seem that a function summing up numbers deserves to be pure.

However, declaring csum as pure would break every single example above (after making the trivial changes to make it type check). Can you see why?

The answer is laziness. If we made csum pure and naively changed our code to compile, we might arrive at something like

bracket (mkList [1..n]) freeList (return . csum)

This tells the compiler to:

Call mkList [1..n];
Pass the result to the pure csum function, and promote the value to the IO type;
Call freeList;
Return the value from step (2).

You may think that the value returned from step (2), and therefore from the whole expression, is a number — the length of the list. But because of laziness, it’s not a number yet. It is a thunk, a delayed application of csum to the pointer value. This thunk will be forced at the next step — when we pass it to the print function, for example, or compare it with the expected value. Only then will csum try to do its work. But it will be too late, because by that time freeList will have run, and csum will discover that its pointer argument is no longer a valid pointer.

To work around this problem, we could replace return with evaluate, but this just shows us that our decision to declare csum was a mistake. It matters in what order mkList, csum, and freeList are evaluated, and the principled way to enforce this order in Haskell is the IO monad.

Benchmark results

The benchmark consists of allocating, summing, and deallocating a list of numbers from 1 to 100.

benchmarking way1
time                 3.301 μs   (3.277 μs .. 3.345 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 3.305 μs   (3.288 μs .. 3.341 μs)
std dev              76.34 ns   (44.80 ns .. 142.3 ns)
variance introduced by outliers: 27% (moderately inflated)

benchmarking way2
time                 2.183 μs   (2.147 μs .. 2.227 μs)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 2.161 μs   (2.145 μs .. 2.185 μs)
std dev              65.91 ns   (46.76 ns .. 97.04 ns)
variance introduced by outliers: 40% (moderately inflated)

benchmarking way3
time                 13.21 μs   (12.92 μs .. 13.49 μs)
                     0.997 R²   (0.996 R² .. 0.998 R²)
mean                 12.74 μs   (12.59 μs .. 12.93 μs)
std dev              574.0 ns   (437.5 ns .. 751.6 ns)
variance introduced by outliers: 54% (severely inflated)

benchmarking way4
time                 1.580 μs   (1.560 μs .. 1.605 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 1.582 μs   (1.570 μs .. 1.601 μs)
std dev              51.28 ns   (37.48 ns .. 72.14 ns)
variance introduced by outliers: 44% (moderately inflated)

benchmarking way5
time                 1.434 μs   (1.418 μs .. 1.459 μs)
                     0.998 R²   (0.997 R² .. 0.998 R²)
mean                 1.511 μs   (1.487 μs .. 1.538 μs)
std dev              83.96 ns   (72.68 ns .. 95.63 ns)
variance introduced by outliers: 70% (severely inflated)

benchmarking way6
time                 2.155 μs   (2.135 μs .. 2.178 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 2.152 μs   (2.139 μs .. 2.171 μs)
std dev              52.54 ns   (40.20 ns .. 69.02 ns)
variance introduced by outliers: 30% (moderately inflated)