MonadFix example: compiling regular expressions
Published on
{-# LANGUAGE RecursiveDo, BangPatterns #-}
import Control.Applicative
import Data.Function (fix)
import Data.IntMap as IntMap
import Control.Monad.Fix (mfix)
import Control.Monad.Trans.State
import Control.Monad.Trans.Class (lift)
import Text.Read (readMaybe)
MonadFix is an odd beast; many Haskell programmers will never use it in their careers. It is indeed very rarely that one needs MonadFix; and for that reason, non-contrived cases where MonadFix is needed are quite interesting to consider.
In this article, I’ll introduce MonadFix and show how it can be handy for compiling the Kleene closure (also known as star or repetition) of regular expressions.
What is MonadFix?
If you hear about MonadFix for the first time, you might think that
it is needed to define recursive monadic actions, just like ordinary
fix
is used to define recursive functions. That would be a
mistake. In fact, fix
is just as applicable to monadic
actions as it is to functions:
= fix $ \repeat -> do
guessNumber m putStrLn "Enter a guess"
<- readMaybe <$> getLine
n if n == Just m
then putStrLn "You guessed it!"
else do
putStrLn "You guessed wrong; try again"
repeat
So, what is mfix
for? First, recall that in Haskell, one
can create recursive definitions not just for functions (which makes
sense in other, non-lazy languages) or monadic actions, but for ordinary
data structures as well. This is known as cyclic (or circular, or
corecursive) definitions; and the technique itself is sometimes referred
to as tying the knot.
The classic example of a cyclic definition is the (lazy, infinite) list of Fibonacci numbers:
= 0 : 1 : zipWith (+) fib (tail fib) fib
Cyclic definitions are themselves rare in day-to-day Haskell programming; but occasionally, the right hand side will be not a pure value, but a monadic computation that needs to be run in order to obtain the value.
Consider this (contrived) example, where we start the sequence with an arbitrary number entered by the user:
= do
fibIO1 putStrLn "Enter the start number"
<- read <$> getLine
start return $ start : 1 : zipWith (+) fibIO1 (tail fibIO1)
This doesn’t typecheck because fibIO
is not a list; it’s
an IO action that produces a list.
But if we try to run the computation, it doesn’t make much sense either:
= do
fibIO2 putStrLn "Enter the start number"
<- read <$> getLine
start <- fibIO2
fib return $ start : 1 : zipWith (+) fib (tail fib)
This version of fibIO
will ask you to enter the start
number ad infinitum and never get to evaluating anything.
Of course, the simplest thing to do would be to move IO out of the recursive equation; that’s why I said the example was contrived. But MonadFix gives another solution:
= mfix $ \fib -> do
fibIO3 putStrLn "Enter the start number"
<- read <$> getLine
start return $ start : 1 : zipWith (+) fib (tail fib)
Or, using the do-rec syntax:
= do
fibIO4
rec<- do
fib putStrLn "Enter the start number"
<- read <$> getLine
start return $ start : 1 : zipWith (+) fib (tail fib)
return fib
Compiling regular expressions
As promised, I am going to show you an example usage of MonadFix that solved a problem other than “how could I use MonadFix?”. This came up in my work on regex-applicative.
For a simplified presentation, let’s consider this type of regular expressions:
data RE
= Sym Char -- symbol
| Seq RE RE -- sequence
| Alt RE RE -- alternative
| Rep RE -- repetition
Our goal is to compile a regular expression into a corresponding NFA.
The states will be represented by integer numbers. State 0 corresponds
to successful completion; and each Sym
inside a regex will
have a unique positive state in which we are expecting the corresponding
character.
type NFAState = Int
The NFA will be represented by a map
type NFA = IntMap (Char, [NFAState])
where each state is mapped to the characters expected at that state and the list of states where we go in case we get the expected character.
To compile a regular expression, we’ll take as an argument the list of states to proceed to when the regular expression as a whole succeeds (otherwise we’d have to compile each subexpression separately and then glue NFAs together). This is essentially the continuation-passing style; only instead of functions, our continuations are NFA states.
During the compilation, we’ll use a stack of two State monads: one to
assign sequential state numbers to Sym
s; the other to keep
track of the currently constructred NFA.
-- Returns the list of start states and the transition table
compile :: RE -> ([NFAState], NFA)
= runState (evalStateT (go re [0]) 0) IntMap.empty
compile re
-- go accepts exit states, returns entry states
go :: RE -> [NFAState] -> StateT NFAState (State NFA) [NFAState]
=
go re exitStates case re of
Sym c -> do
!freshState <- gets (+1); put freshState
$ modify' (IntMap.insert freshState (c, exitStates))
lift return [freshState]
Alt r1 r2 -> (++) <$> go r1 exitStates <*> go r2 exitStates
Seq r1 r2 -> go r1 =<< go r2 exitStates
This was easy so far: alternatives share their exit states and their
entry states are combined; and consequtive subexpressions are chained.
But how do we compile Rep
? The exit states of the repeated
subexpression should become its own entry states; but we don’t know the
entry states until we compile it!
And this is precisely where MonadFix (or recursive do) comes in:
Rep r -> do
reclet allEntryStates = ownEntryStates ++ exitStates
<- go r allEntryStates
ownEntryStates return allEntryStates
Why does this circular definition work? If we unwrap the
State
types, we’ll see that the go
function
actually computes a triple of three non-strict fields:
- The last used state number
- The list of entry states
- The NFA map
The elements of the triple may depend on each other as long as there are no actual loops during evaluation. One can check that the fields can be indeed evaluated linearly in the order in which they are listed above:
- The used state numbers at each step depend only on the regular expression itself, so it can be computed wihtout knowing the other two fields.
- The list of entry states relies only on the state number information; it doesn’t need to know anything about the NFA transitions.
- The NFA table needs to know the entry and exit states; but that is fine, we can go ahead and compute that information without creating any reverse data dependencies.
Further reading
An ASM Monad – a similar example from a different domain.
Oliver Charles’s 24 Days of GHC Extensions: Recursive Do.
Levent Erkok’s thesis which contains all you need to know about MonadFix, including several other examples.
Todd Wilson points out that Douglas McIlroy describes a similar regular expression compilation technique in his 2004 JFP Functional Pearl Enumerating the strings of regular languages. Like this article, Douglas’s paper uses a circular definition when compiling the Kleene closure. But the circular definition is not monadic there: instead of using the State monad, Douglas passes the state around by hand.