Roman Cheplyaka

Lexical analysis with parser combinators

January 2, 2015

When writing a programming language parser in Haskell, the usual dilemma is whether to use lexer/parser generators (Alex+Happy), or make a single parser (using a parser combinator library like Parsec) without an explicit tokenizer.

In this article, I’ll explain why you may want to use a separate lexer built with applicative parser combinators, and how you might go about writing one.

Many of the ideas described in this article have since been implemented in lexer-applicative.

Alex

Alex, the Haskell lexer generator, is a nice tool, but it has a weakness: you cannot observe the internal structure of a token.

The most common instance of this problem is parsing string interpolation: think "Hello, ${name}!". Such a string is a single token, yet it has some internal structure. To analyze this with Alex, you’ll probably need two different lexers, and run one inside the other.

Another instance is date literals (useful in many DSLs), such as 2015-01-02 (or d"2015-01-02", to avoid confusion with infix subtraction). You can recognize such a literal with the Alex regular expression $d+\-$d+\-$d+. However, you won’t be able to get hold of the individual $d+ pieces, because Alex regexps lack capture groups. So you’ll need a separate date-parsing function to parse the same string a second time.

Parsec

On the other hand, we have parser combinators, which solve this problem nicely: you define parsers for numbers, glue them together with dashes, and you get a parser for dates.
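As a sketch of that idea with Parsec itself (number and date are names introduced here for illustration; like the regex-applicative parser shown later, this doesn’t validate the date):

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)
import Data.Time (Day, fromGregorian)

-- a run of digits, read as a number
number :: Read a => Parser a
number = read <$> many1 digit

-- glue number parsers together with dashes: a parser for dates
date :: Parser Day
date = fromGregorian <$> number <* char '-' <*> number <* char '-' <*> number
```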

However, using Parsec (or similar) without a separate lexer is not the best choice:

  1. Dealing with whitespace and comments is awkward in the parser; you need to wrap everything in a token combinator. (If you decide to do that, at least use a free applicative functor to ensure that you don’t forget to consume that whitespace).

  2. Separating parsing into lexical and syntax analysis is just a good engineering practice. It reduces the overall complexity through «divide and conquer». The two phases usually have well-defined responsibilities and a simple interface — why not take advantage of that?

  3. If a language needs the maximal munch rule (the next token is always the longest possible match), it is hard, if not impossible, to encode with Parsec or similar libraries.

  4. Tokens enable the syntax analyzer to report better errors. This is because you can tell which token you didn’t expect. In a Char-based Parsec parser, you can only tell which character (or an arbitrary number of characters) you didn’t expect, because you don’t know what constitutes a token.

  5. Potential performance gains. If a parser has to try several syntax tree alternatives, it re-parses the low-level lexical tokens anew every time. From this perspective, the lexical analyzer provides a natural «caching layer».
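To illustrate point 3, here is a hand-rolled toy sketch of maximal munch (all names here are made up; findLongestPrefix, used in the lexer further down, implements this rule for real): among all candidate tokens matching a prefix of the input, we keep the longest, so that "ifx" lexes as one identifier rather than the keyword "if" followed by "x".

```haskell
import Data.Char (isAlpha)
import Data.List (maximumBy)
import Data.Ord (comparing)

data Tok = Keyword String | Ident String deriving (Show, Eq)

-- the text a token was lexed from
lexeme :: Tok -> String
lexeme (Keyword w) = w
lexeme (Ident w)   = w

-- each recognizer returns the token for the longest prefix it accepts
recognizers :: [String -> Maybe Tok]
recognizers =
  [ \s -> if take 2 s == "if" then Just (Keyword "if") else Nothing
  , \s -> case span isAlpha s of
            ("", _) -> Nothing
            (w, _)  -> Just (Ident w)
  ]

-- maximal munch: of all recognizers that match, keep the longest token
munch :: String -> Maybe Tok
munch s = case [t | r <- recognizers, Just t <- [r s]] of
  [] -> Nothing
  ts -> Just (maximumBy (comparing (length . lexeme)) ts)
```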

regex-applicative

My regex-applicative library provides applicative parser combinators with regexp semantics. We can write a simple lexer on top of it that gives us the benefits of both Alex and Parsec.

{-# LANGUAGE OverloadedStrings, OverloadedLists #-}

import qualified Data.Text as T
import qualified Data.HashSet as HS
import Text.Regex.Applicative
import Text.Regex.Applicative.Common (decimal)
import Data.Char
import Data.Time

For example, here’s a parser for dates. (For simplicity, this one doesn’t check dates for validity.)

pDate :: RE Char Day
pDate = fromGregorian <$> decimal <* "-" <*> decimal <* "-" <*> decimal
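For example, in GHCi (with the definitions above in scope):

```
ghci> match pDate "2015-01-02"
Just 2015-01-02
ghci> match pDate "2015-01"
Nothing
```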

And here’s a parser for templates — strings that may include interpolated variables and escaping.

pTemplate :: RE Char [Either T.Text T.Text] -- Left = text, Right = variable
pTemplate = "\"" *> many piece <* "\""
  where
    -- piece of text or interpolated variable
    piece = 
      (Left . T.pack <$> some ch) <|>
      (Right <$> var)

    -- interpolated variable
    var = "${" *> pVar <* "}"

    -- normal character, plain or escaped
    ch =
      psym (\c -> not $ c `HS.member` escapable) <|>
      ("\\" *> psym (\c -> c `HS.member` escapable))

    -- set of escapable characters
    escapable = ['"', '\\', '$']

    pVar = T.pack <$> some (psym isAlpha)
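Again in GHCi, this parser splits a template into its pieces:

```
ghci> match pTemplate "\"Hello, ${name}!\""
Just [Left "Hello, ",Right "name",Left "!"]
```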

Individual parsers are merged into a single pToken parser:

data Token
   = Template [Either T.Text T.Text]
   | Date Day
-- | ...

pToken :: RE Char Token
pToken =
   (Template <$> pTemplate) <|>
   (Date <$> pDate)
-- <|> ...

Finally, a simple tokenizing function might look something like this:

tokens :: String -> [Token]
tokens "" = []
tokens s =
  let re = (Just <$> pToken) <|> (Nothing <$ some (psym isSpace)) in
  case findLongestPrefix re s of
    Just (Just tok, rest) -> tok : tokens rest
    Just (Nothing, rest) -> tokens rest -- whitespace
    Nothing -> error "lexical error"
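Assuming Token additionally derives Show, the whole pipeline looks like this in GHCi:

```
ghci> tokens "2015-01-02 \"Hello, ${name}!\""
[Date 2015-01-02,Template [Left "Hello, ",Right "name",Left "!"]]
```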

The resulting token list can be further parsed with Parsec or Happy.
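With Parsec, for instance, custom tokens can be lifted into parsers via tokenPrim. A sketch (satisfyTok and pDateTok are illustrative names; positions are faked here for brevity, so error locations will be wrong):

```haskell
{-# LANGUAGE LambdaCase #-}
import Text.Parsec
import qualified Data.Text as T
import Data.Time (Day, fromGregorian)

-- the Token type from above, with a Show instance added so that
-- Parsec can render unexpected tokens in error messages
data Token
   = Template [Either T.Text T.Text]
   | Date Day
   deriving Show

-- lift a Token predicate into a Parsec parser over a token stream
satisfyTok :: (Token -> Maybe a) -> Parsec [Token] () a
satisfyTok = tokenPrim show (\pos _ _ -> pos)

-- a parser that accepts exactly one date token
pDateTok :: Parsec [Token] () Day
pDateTok = satisfyTok (\case Date d -> Just d; _ -> Nothing)
```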

This simple function can be extended to do a lot more:

  1. Track the position in the input string, use it for error reporting, and include it in the tokens
  2. Consume input incrementally
  3. Get feedback from the syntax analyzer, and, depending on the syntactic context, perform different lexical analyses. This is useful for more complex features of programming languages, such as the off-side rule in Haskell or arbitrarily-nested command substitution and «here documents» in POSIX Shell. (This is another thing Alex cannot do, by the way.)
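As an illustration of the first point, position tracking can be layered on top of the token stream. This is a sketch with made-up names (Pos, step, annotate are not library functions); it assumes each token comes paired with its lexeme, the exact text it was matched from, which findLongestPrefix can be adapted to return. Note that skipped whitespace must advance the position too, so in practice the whitespace lexemes also need to pass through step.

```haskell
-- a source position: 1-based line and column
data Pos = Pos { line :: Int, col :: Int } deriving (Show, Eq)

-- advance a position over one character
step :: Pos -> Char -> Pos
step (Pos l _) '\n' = Pos (l + 1) 1
step (Pos l c) _    = Pos l (c + 1)

-- attach to each token the position at which its lexeme starts
annotate :: Pos -> [(String, tok)] -> [(Pos, tok)]
annotate _ [] = []
annotate p ((lexeme, t) : rest) =
  (p, t) : annotate (foldl step p lexeme) rest
```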