Lexical analysis with parser combinators
When writing a programming language parser in Haskell, the usual dilemma is whether to use lexer/parser generators (Alex+Happy), or make a single parser (using a parser combinator library like Parsec) without an explicit tokenizer.
In this article, I’ll explain why you may want to use a separate lexer built with applicative parser combinators, and how you might go about writing one.
Many of the ideas described in this article have since been implemented in lexer-applicative.
Alex
Alex, the Haskell lexer generator, is a nice tool, but it has a weakness. With Alex, you cannot observe the internal structure of a token.
The most common instance of this problem is parsing string interpolation; think "Hello, ${name}!". Such a string is a single token, yet it has some internal structure to it. To analyze this with Alex, you’ll probably need to have two different lexers, and run one inside the other.
Another instance is date literals (useful for many DSLs), such as 2015-01-02 (or d"2015-01-02" to avoid confusion with infix subtraction). You can recognize such a literal with the Alex regular expression $d+\-$d+\-$d+. However, you won’t be able to get hold of the individual $d+ pieces, because Alex regexps lack capture groups. So you’ll need a separate date parsing function to parse the same string a second time.
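To give an idea of what that looks like, here is a rough sketch of such a helper (parseDate is my name, not something Alex provides): the Alex action receives the matched String and hands it to this function, redoing work the regexp engine has already done.

import Data.Time (Day, fromGregorian)

-- Hypothetical helper: re-parse the string that already matched
-- $d+\-$d+\-$d+, because the capture-less match only gives back
-- the whole String.
parseDate :: String -> Day
parseDate s =
  let (y, rest)  = span (/= '-') s
      (m, rest') = span (/= '-') (drop 1 rest)
      d          = drop 1 rest'
  in fromGregorian (read y) (read m) (read d)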
Parsec
On the other hand, we have parser combinators that solve this problem nicely. You define parsers for numbers; then you glue them with dashes, and you get a parser for dates.
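For instance, here is a minimal sketch with Parsec (the names number and date are mine, not from any library): two small number parsers glued together with dashes give a date parser.

import Text.Parsec
import Text.Parsec.String (Parser)
import Data.Time (Day, fromGregorian)

-- a run of digits, read as a number
number :: Read a => Parser a
number = read <$> many1 digit

-- numbers glued with dashes; validity checking omitted
date :: Parser Day
date = fromGregorian <$> number <* char '-' <*> number <* char '-' <*> number

For example, parse date "" "2015-01-02" evaluates to Right 2015-01-02.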
However, using Parsec (or similar) without a separate lexer is not the best choice:
- Dealing with whitespace and comments is awkward in the parser; you need to wrap everything in a token combinator, as in the sketch after this list. (If you decide to do that, at least use a free applicative functor to ensure that you don’t forget to consume that whitespace.)
- Separating parsing into lexical and syntax analysis is simply good engineering practice. It reduces the overall complexity through «divide and conquer». The two phases usually have well-defined responsibilities and a simple interface, so why not take advantage of that?
- If a language needs the maximal munch rule, it is hard or even impossible to encode with Parsec or similar libraries.
- Tokens enable the syntax analyzer to report better errors, because it can tell which token it didn’t expect. A Char-based Parsec parser can only tell which character (or an arbitrary run of characters) it didn’t expect, because it doesn’t know what constitutes a token.
- Potential performance considerations: if a parser has to try several syntax tree alternatives, it re-parses the low-level lexical tokens anew every time. From this perspective, the lexical analyzer provides a natural «caching layer».
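To illustrate the first point, here is what that wrapping typically looks like in Parsec (lexeme and symbol are conventional names, not something this article defines): every token-level parser has to remember to consume trailing whitespace, or the handling leaks all over the grammar.

import Text.Parsec
import Text.Parsec.String (Parser)

-- wrap a parser so that it also eats the whitespace that follows it
lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

-- a keyword or operator, with trailing whitespace consumed
symbol :: String -> Parser String
symbol = lexeme . string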
regex-applicative
My regex-applicative library provides applicative parser combinators with regexp semantics. We can write a simple lexer on top of it that would give us the benefits of both Alex and Parsec.
{-# LANGUAGE OverloadedStrings, OverloadedLists #-}
import qualified Data.Text as T
import qualified Data.HashSet as HS
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
import Data.Char
import Data.Time
For example, here’s a parser for dates. (For simplicity, this one doesn’t check dates for validity.)
pDate :: RE Char Day
pDate = fromGregorian <$> decimal <* "-" <*> decimal <* "-" <*> decimal
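A quick check in GHCi, using regex-applicative’s match:

ghci> match pDate "2015-01-02"
Just 2015-01-02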
And here’s a parser for templates — strings that may include interpolated variables and escaping.
pTemplate :: RE Char [Either T.Text T.Text] -- Left = text, Right = variable
pTemplate = "\"" *> many piece <* "\""
  where
    -- piece of text or interpolated variable
    piece =
      (Left . T.pack <$> some ch) <|>
      (Right <$> var)

    -- interpolated variable
    var = "${" *> pVar <* "}"

    -- normal character, plain or escaped
    ch =
      psym (\c -> not $ c `HS.member` escapable) <|>
      ("\\" *> psym (\c -> c `HS.member` escapable))

    -- set of escapable characters
    escapable = ['"', '\\', '$']

    pVar = T.pack <$> some (psym isAlpha)
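Trying it out (a GHCi sketch; the exact chunking of the Left pieces assumes the greedy behaviour of some):

ghci> match pTemplate "\"Hello, ${name}!\""
Just [Left "Hello, ",Right "name",Left "!"]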
Individual parsers are merged into a single pToken
parser:
data Token
  = Template [Either T.Text T.Text]
  | Date Day
  -- | ...

pToken :: RE Char Token
pToken =
  (Template <$> pTemplate) <|>
  (Date <$> pDate)
  -- <|> ...
Finally, a simple tokenizing function might look something like this:
tokens :: String -> [Token]
tokens "" = []
tokens s =
  let re = (Just <$> pToken) <|> (Nothing <$ some (psym isSpace)) in
  case findLongestPrefix re s of
    Just (Just tok, rest) -> tok : tokens rest
    Just (Nothing, rest) -> tokens rest -- whitespace
    Nothing -> error "lexical error"
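A quick test (a GHCi sketch; it assumes Token has been given a Show instance, e.g. by adding deriving Show to the declaration above):

ghci> tokens "\"Hello, ${name}!\" 2015-01-02"
[Template [Left "Hello, ",Right "name",Left "!"],Date 2015-01-02]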
The resulting token list can be further parsed with Parsec or Happy.
This simple function can be extended to do a lot more:
- Track position in the input string, use it for error reporting, include it into tokens (a rough sketch follows this list)
- Consume input incrementally
- Get feedback from the syntax analyzer, and, depending on the syntactic context, perform different lexical analyses. This is useful for more complex features of programming languages, such as the off-side rule in Haskell or arbitrarily-nested command substitution and «here documents» in POSIX Shell. (This is another thing Alex cannot do, by the way.)
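For instance, the first extension might look something like this (a rough sketch under my own naming, not the lexer-applicative implementation): the lexer threads a line/column position through the input and attaches it to every token.

-- hypothetical position type and tracking; not from the article
data Pos = Pos { line :: Int, col :: Int } deriving Show

advance :: Pos -> Char -> Pos
advance (Pos l _) '\n' = Pos (l + 1) 1
advance (Pos l c) _    = Pos l (c + 1)

posTokens :: String -> [(Pos, Token)]
posTokens = go (Pos 1 1)
  where
    go _ "" = []
    go pos s =
      let re = (Just <$> pToken) <|> (Nothing <$ some (psym isSpace)) in
      case findLongestPrefix re s of
        Just (mtok, rest) ->
          -- the consumed prefix is whatever findLongestPrefix chopped off
          let consumed = take (length s - length rest) s
              pos'     = foldl advance pos consumed
          in case mtok of
               Just tok -> (pos, tok) : go pos' rest
               Nothing  -> go pos' rest
        Nothing -> error ("lexical error at " ++ show pos)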