6 ways to manage allocated memory in Haskell
Let’s say we have a foreign function that takes a C data structure. Our task is to allocate the structure in memory, fill in the fields, call the function, and deallocate the memory.
The data structure is not flat but contains pointers to other data structures that also need to be allocated. An example of such a data structure is a linked list:
typedef struct list {
  int car;
  struct list *cdr;
} list;
And an example of a C function is one that computes the sum of all elements in the list:
int sum(list *l) {
  int s = 0;
  while (l) {
    s += l->car;
    l = l->cdr;
  }
  return s;
}
In this article, I will explore different ways to track all the allocated pointers and free them reliably.
The complete code can be downloaded as a git repo:
git clone https://ro-che.info/files/2017-08-06-manage-allocated-memory-haskell.git
The modules below use the hsc2hs preprocessor; it replaces things like #{peek ...}, #{poke ...}, and #{size ...} with Haskell code.
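To make the snippets easier to follow, here is a rough sketch of the kind of .hsc module header they assume. The module name, the header file name list.h, and the exact imports are my guesses rather than the repo's actual layout; List is just an empty type used to tag pointers to the C struct.

{-# LANGUAGE ForeignFunctionInterface #-}
module ManageMemory where

#include "list.h"

import Foreign.Ptr (Ptr, nullPtr, plusPtr)
import Foreign.Storable (peekByteOff, pokeByteOff)
import Foreign.Marshal.Alloc (mallocBytes, free, allocaBytes)
import Control.Exception (bracket)

-- An empty data type whose only purpose is to tag pointers to the C struct.
data List

foreign import ccall "sum" csum :: Ptr List -> IO Int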
Way 1: traverse the structure
Since all pointers are stored somewhere in the data structure, we could traverse it to recover and free those pointers.
mkList :: [Int] -> IO (Ptr List)
mkList l = case l of
  [] -> return nullPtr
  x:xs -> do
    struct <- mallocBytes #{size list}
    #{poke list, car} struct x
    xs_c <- mkList xs
    #{poke list, cdr} struct xs_c
    return struct
freeList :: Ptr List -> IO ()
freeList l
  | l == nullPtr = return ()
  | otherwise = do
      cdr <- #{peek list, cdr} l
      free l
      freeList cdr
way1 :: Int -> IO Int
way1 n = bracket (mkList [1..n]) freeList csum
This is how we’d probably do it in C, but in Haskell it has several disadvantages compared to the other options we have:
- The code to traverse the structure has to be written manually and is prone to errors.
- Having to do the extra work of traversing the structure makes it slower than some of the alternatives.
- It is not exception-safe: if something happens inside mkList, the already allocated pointers will be lost (a sketch of one way to plug this leak follows this list). Note that this code is async-exception-safe (bracket masks async exceptions for mkList), so the only exceptions we need to worry about must come from mkList or freeList (e.g. from mallocBytes).
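For illustration, here is one way the leak could be plugged. This sketch is not from the article; it uses onException from Control.Exception, and the idea is that each recursive call frees its own cell if anything later in the construction throws.

mkListSafe :: [Int] -> IO (Ptr List)
mkListSafe l = case l of
  [] -> return nullPtr
  x:xs -> do
    struct <- mallocBytes #{size list}
    -- If anything below throws, free this cell before re-throwing;
    -- the recursive call cleans up its own cells the same way.
    (do #{poke list, car} struct x
        xs_c <- mkListSafe xs
        #{poke list, cdr} struct xs_c
        return struct)
      `onException` free struct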
Way 2: allocaBytes and ContT
The Foreign.Marshal.Alloc module provides the bracket-style allocaBytes function, which allocates the memory and then automatically releases it.
-- |@'allocaBytes' n f@ executes the computation @f@, passing as argument
-- a pointer to a temporarily allocated block of memory of @n@ bytes.
-- The block of memory is sufficiently aligned for any of the basic
-- foreign types that fits into a memory block of the allocated size.
--
-- The memory is freed when @f@ terminates (either normally or via an
-- exception), so the pointer passed to @f@ must /not/ be used after this.
allocaBytes :: Int -> (Ptr a -> IO b) -> IO b
It works great if we need to allocate a small fixed number of structures:
allocaBytes size1 $ \ptr1 ->
  allocaBytes size2 $ \ptr2 -> do
    -- do something with ptr1 and ptr2
But what if the number of allocations is large or even unknown, as in our case?
For that, we have the continuation monad!
mallocBytesC :: Int -> ContT r IO (Ptr a)
mallocBytesC n = ContT $ \k -> allocaBytes n k
mallocBytesC is a monadic function that returns a pointer to the allocated memory, so it is as convenient and flexible as the simple mallocBytes used in Way 1. But, unlike mallocBytes, all allocated memory will be safely and automatically released at the point where we run the ContT layer.
mkList :: [Int] -> ContT r IO (Ptr List)
mkList l = case l of
  [] -> return nullPtr
  x:xs -> do
    struct <- mallocBytesC #{size list}
    liftIO $ #{poke list, car} struct x
    xs_c <- mkList xs
    liftIO $ #{poke list, cdr} struct xs_c
    return struct
way2 :: Int -> IO Int
way2 n = runContT (liftIO . csum =<< mkList [1..n]) return
This is my favorite way: it is simple, safe, fast, and elegant.
Way 3: ResourceT
If we replace ContT r with ResourceT, we get a similar function with essentially the same semantics: the memory is released when the ResourceT layer is run.
mallocBytesR :: Int -> ResourceT IO (Ptr a)
mallocBytesR n = snd <$> allocate (mallocBytes n) free
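A side note: allocate actually returns a ReleaseKey along with the resource (which is why the definition above discards it with snd); keeping the key lets us free the memory early with release, before runResourceT finishes. A hypothetical variant that keeps the key:

mallocBytesR' :: Int -> ResourceT IO (ReleaseKey, Ptr a)
mallocBytesR' n = allocate (mallocBytes n) free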
The rest of the code barely changes:
mkList :: [Int] -> ResourceT IO (Ptr List)
mkList l = case l of
  [] -> return nullPtr
  x:xs -> do
    struct <- mallocBytesR #{size list}
    liftIO $ #{poke list, car} struct x
    xs_c <- mkList xs
    liftIO $ #{poke list, cdr} struct xs_c
    return struct
way3 :: Int -> IO Int
way3 n = runResourceT (liftIO . csum =<< mkList [1..n])
However, as we shall see, this version is an order of magnitude
slower than the ContT
version.
Way 4: single-block allocation via mallocBytes
Even though our structure contains lots of pointers, we don’t have to allocate each one of them separately. Instead, we could allocate a single chunk of memory and calculate the pointers ourselves.
This method may be more involved for complex data structures, but it is the fastest one, because we only need to keep track of and deallocate a single pointer.
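The pointer calculation itself is plain base-plus-offset arithmetic; a hypothetical helper (not used in the code below) would look like this:

-- Address of the i-th cell in a contiguous block of list cells.
cellPtr :: Ptr List -> Int -> Ptr List
cellPtr base i = base `plusPtr` (i * #{size list})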
writeList :: Ptr List -> [Int] -> IO ()
writeList ptr l = case l of
  [] -> error "writeList: empty list"
  [x] -> do
    #{poke list, car} ptr x
    #{poke list, cdr} ptr nullPtr
  x:xs -> do
    #{poke list, car} ptr x
    let ptr' = plusPtr ptr #{size list}
    #{poke list, cdr} ptr ptr'
    writeList ptr' xs
mkList :: [Int] -> IO (Ptr List)
mkList l
  | null l = return nullPtr
  | otherwise = do
      ptr <- mallocBytes (length l * #{size list})
      writeList ptr l
      return ptr
way4 :: Int -> IO Int
way4 n = bracket (mkList [1..n]) free csum
Way 5: single-block allocation via allocaBytes + ContT
This is the same as Way 4, except using allocaBytes instead of mallocBytes. The two functions allocate the memory differently (mallocBytes goes through C’s malloc, while allocaBytes allocates pinned memory on the GHC heap), so I thought I’d add this version to the benchmark.
mkList :: [Int] -> ContT r IO (Ptr List)
mkList l
  | null l = return nullPtr
  | otherwise = do
      ptr <- mallocBytesC (length l * #{size list})
      liftIO $ writeList ptr l
      return ptr
way5 :: Int -> IO Int
way5 n = runContT (liftIO . csum =<< mkList [1..n]) return
Way 6: managed
This is the same as Way 2, but using the managed library by Gabriel Gonzalez. (Thanks to reddit user /u/andriusst for bringing it up.)
mallocBytesC :: Int -> Managed (Ptr a)
mallocBytesC n = managed $ \k -> allocaBytes n k
mkList :: [Int] -> Managed (Ptr List)
mkList l = case l of
  [] -> return nullPtr
  x:xs -> do
    struct <- mallocBytesC #{size list}
    liftIO $ #{poke list, car} struct x
    xs_c <- mkList xs
    liftIO $ #{poke list, cdr} struct xs_c
    return struct
way6 :: Int -> IO Int
way6 n = with (liftIO . csum =<< mkList [1..n]) return
Managed is just a newtype wrapper around the polymorphic continuation monad (a.k.a. the Codensity monad) applied to IO. Because the underlying implementation is the same, the run time performance is exactly that of Way 2, but you get to work with simpler types: Managed instead of ContT r IO.
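For reference, the definition in Control.Monad.Managed is roughly the following (paraphrased, with the class instances omitted; the signatures of managed and with are shown specialized to Managed):

newtype Managed a = Managed { (>>-) :: forall r. (a -> IO r) -> IO r }

managed :: (forall r. (a -> IO r) -> IO r) -> Managed a
with    :: Managed a -> (a -> IO r) -> IO r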
Is sum pure?
We expose the sum function from C as an IO function csum:

foreign import ccall "sum" csum :: Ptr List -> IO Int
The alternative is to expose it as a pure function:
foreign import ccall "sum" csum :: Ptr List -> Int
The Haskell FFI explicitly allows both declarations, and it may seem that a function summing up numbers deserves to be pure.
However, declaring csum
as pure would break every single
example above (after making the trivial changes to make it type check).
Can you see why?
The answer is laziness. If we made csum
pure and naively
changed our code to compile, we might arrive at something like
bracket (mkList [1..n]) freeList (return . csum)
This tells the compiler to:
1. Call mkList [1..n];
2. Pass the result to the pure csum function, and promote the value to the IO type;
3. Call freeList;
4. Return the value from step (2).
You may think that the value returned from step (2), and therefore from the whole expression, is a number: the sum of the list elements. But because of laziness, it’s not a number yet. It is a thunk, a delayed application of csum to the pointer value. This thunk will be forced at the next step, when we pass it to the print function, for example, or compare it with the expected value. Only then will csum try to do its work. But it will be too late, because by that time freeList will have run, and csum will discover that its pointer argument is no longer a valid pointer.
To work around this problem, we could replace return with evaluate, but this just shows us that our decision to declare csum as pure was a mistake. It matters in what order mkList, csum, and freeList are evaluated, and the principled way to enforce this order in Haskell is the IO monad.
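For completeness, this is what that workaround would look like for Way 1 with the pure declaration of csum (way1' is just an illustrative name; evaluate comes from Control.Exception and forces the sum before bracket runs freeList):

way1' :: Int -> IO Int
way1' n = bracket (mkList [1..n]) freeList (evaluate . csum)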
Benchmark results
The benchmark consists of allocating, summing, and deallocating a list of numbers from 1 to 100.
benchmarking way1
time 3.301 μs (3.277 μs .. 3.345 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 3.305 μs (3.288 μs .. 3.341 μs)
std dev 76.34 ns (44.80 ns .. 142.3 ns)
variance introduced by outliers: 27% (moderately inflated)
benchmarking way2
time 2.183 μs (2.147 μs .. 2.227 μs)
0.998 R² (0.997 R² .. 1.000 R²)
mean 2.161 μs (2.145 μs .. 2.185 μs)
std dev 65.91 ns (46.76 ns .. 97.04 ns)
variance introduced by outliers: 40% (moderately inflated)
benchmarking way3
time 13.21 μs (12.92 μs .. 13.49 μs)
0.997 R² (0.996 R² .. 0.998 R²)
mean 12.74 μs (12.59 μs .. 12.93 μs)
std dev 574.0 ns (437.5 ns .. 751.6 ns)
variance introduced by outliers: 54% (severely inflated)
benchmarking way4
time 1.580 μs (1.560 μs .. 1.605 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 1.582 μs (1.570 μs .. 1.601 μs)
std dev 51.28 ns (37.48 ns .. 72.14 ns)
variance introduced by outliers: 44% (moderately inflated)
benchmarking way5
time 1.434 μs (1.418 μs .. 1.459 μs)
0.998 R² (0.997 R² .. 0.998 R²)
mean 1.511 μs (1.487 μs .. 1.538 μs)
std dev 83.96 ns (72.68 ns .. 95.63 ns)
variance introduced by outliers: 70% (severely inflated)
benchmarking way6
time 2.155 μs (2.135 μs .. 2.178 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 2.152 μs (2.139 μs .. 2.171 μs)
std dev 52.54 ns (40.20 ns .. 69.02 ns)
variance introduced by outliers: 30% (moderately inflated)