Roman Cheplyaka

MonadFix example: compiling regular expressions

September 2, 2015

{-# LANGUAGE RecursiveDo, BangPatterns #-}
import Control.Applicative
import Data.Function (fix)
import Data.IntMap as IntMap
import Control.Monad.Fix (mfix)
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)
import Text.Read (readMaybe)

MonadFix is an odd beast; many Haskell programmers will never use it in their careers. It is indeed very rarely that one needs MonadFix; and for that reason, non-contrived cases where MonadFix is needed are quite interesting to consider.

In this article, I’ll introduce MonadFix and show how it can be handy for compiling the Kleene closure (also known as star or repetition) of regular expressions.

What is MonadFix?

If you hear about MonadFix for the first time, you might think that it is needed to define recursive monadic actions, just like ordinary fix is used to define recursive functions. That would be a mistake. In fact, fix is just as applicable to monadic actions as it is to functions:

guessNumber m = fix $ \repeat -> do
  putStrLn "Enter a guess"
  n <- readMaybe <$> getLine
  if n == Just m
    then putStrLn "You guessed it!"
    else do
      putStrLn "You guessed wrong; try again"

So, what is mfix for? First, recall that in Haskell, one can create recursive definitions not just for functions (which makes sense in other, non-lazy languages) or monadic actions, but for ordinary data structures as well. This is known as cyclic (or circular, or corecursive) definitions; and the technique itself is sometimes referred to as tying the knot.

The classic example of a cyclic definition is the (lazy, infinite) list of Fibonacci numbers:

fib = 0 : 1 : zipWith (+) fib (tail fib)

Cyclic definitions are themselves rare in day-to-day Haskell programming; but occasionally, the right hand side will be not a pure value, but a monadic computation that needs to be run in order to obtain the value.

Consider this (contrived) example, where we start the sequence with an arbitrary number entered by the user:

fibIO1 = do
  putStrLn "Enter the start number"
  start <- read <$> getLine
  return $ start : 1 : zipWith (+) fibIO1 (tail fibIO1)

This doesn’t typecheck because fibIO is not a list; it’s an IO action that produces a list.

But if we try to run the computation, it doesn’t make much sense either:

fibIO2 = do
  putStrLn "Enter the start number"
  start <- read <$> getLine
  fib <- fibIO2
  return $ start : 1 : zipWith (+) fib (tail fib)

This version of fibIO will ask you to enter the start number ad infinitum and never get to evaluating anything.

Of course, the simplest thing to do would be to move IO out of the recursive equation; that’s why I said the example was contrived. But MonadFix gives another solution:

fibIO3 = mfix $ \fib -> do
  putStrLn "Enter the start number"
  start <- read <$> getLine
  return $ start : 1 : zipWith (+) fib (tail fib)

Or, using the do-rec syntax:

fibIO4 = do
    fib <- do
      putStrLn "Enter the start number"
      start <- read <$> getLine
      return $ start : 1 : zipWith (+) fib (tail fib)
  return fib

Compiling regular expressions

As promised, I am going to show you an example usage of MonadFix that solved a problem other than “how could I use MonadFix?”. This came up in my work on regex-applicative.

For a simplified presentation, let’s consider this type of regular expressions:

data RE
  = Sym Char  -- symbol
  | Seq RE RE -- sequence
  | Alt RE RE -- alternative
  | Rep RE    -- repetition

Our goal is to compile a regular expression into a corresponding NFA. The states will be represented by integer numbers. State 0 corresponds to successful completion; and each Sym inside a regex will have a unique positive state in which we are expecting the corresponding character.

type NFAState = Int

The NFA will be represented by a map

type NFA = IntMap (Char, [NFAState])

where each state is mapped to the characters expected at that state and the list of states where we go in case we get the expected character.

To compile a regular expression, we’ll take as an argument the list of states to proceed to when the regular expression as a whole succeeds (otherwise we’d have to compile each subexpression separately and then glue NFAs together). This is essentially the continuation-passing style; only instead of functions, our continuations are NFA states.

During the compilation, we’ll use a stack of two State monads: one to assign sequential state numbers to Syms; the other to keep track of the currently constructred NFA.

-- Returns the list of start states and the transition table
compile :: RE -> ([NFAState], NFA)
compile re = runState (evalStateT (go re [0]) 0) IntMap.empty

-- go accepts exit states, returns entry states
go :: RE -> [NFAState] -> StateT NFAState (State NFA) [NFAState]
go re exitStates =
  case re of
    Sym c -> do
      !freshState <- gets (+1); put freshState
      lift $ modify' (IntMap.insert freshState (c, exitStates))
      return [freshState]
    Alt r1 r2 -> (++) <$> go r1 exitStates <*> go r2 exitStates
    Seq r1 r2 -> go r1 =<< go r2 exitStates

This was easy so far: alternatives share their exit states and their entry states are combined; and consequtive subexpressions are chained. But how do we compile Rep? The exit states of the repeated subexpression should become its own entry states; but we don’t know the entry states until we compile it!

And this is precisely where MonadFix (or recursive do) comes in:

    Rep r -> do
        let allEntryStates = ownEntryStates ++ exitStates
        ownEntryStates <- go r allEntryStates
      return allEntryStates

Why does this circular definition work? If we unwrap the State types, we’ll see that the go function actually computes a triple of three non-strict fields:

  1. The last used state number
  2. The list of entry states
  3. The NFA map

The elements of the triple may depend on each other as long as there are no actual loops during evaluation. One can check that the fields can be indeed evaluated linearly in the order in which they are listed above:

  1. The used state numbers at each step depend only on the regular expression itself, so it can be computed wihtout knowing the other two fields.
  2. The list of entry states relies only on the state number information; it doesn’t need to know anything about the NFA transitions.
  3. The NFA table needs to know the entry and exit states; but that is fine, we can go ahead and compute that information without creating any reverse data dependencies.

Further reading

An ASM Monad – a similar example from a different domain.

Oliver Charles’s 24 Days of GHC Extensions: Recursive Do.

Levent Erkok’s thesis which contains all you need to know about MonadFix, including several other examples.

Todd Wilson points out that Douglas McIlroy describes a similar regular expression compilation technique in his 2004 JFP Functional Pearl Enumerating the strings of regular languages. Like this article, Douglas’s paper uses a circular definition when compiling the Kleene closure. But the circular definition is not monadic there: instead of using the State monad, Douglas passes the state around by hand.