Roman Cheplyaka

JSON validation combinators

April 20, 2014

At Signal Vine we have a JSON-based language for describing text messaging campaigns. We may design a better surface syntax for this language in the future, but the current one gets the job done and there are certain existing systems that depend on it.

Anyway, the problem with this language is that it is too easy to make mistakes — including errors in JSON syntax, structural errors (plugging an array where an object is expected), or name errors (making a typo in a field name).

So the first thing I did was write a validator for our JSON format.

There are several projects of «JSON schemas» around, but there were many reasons against using them.

  1. I don’t know about the quality of the tools that support such schemas (i.e. the quality of error messages they generate), and the expressivity of the schemas themselves (whether they’d let us to express the structure of our JSON structure and the constraints we’d like to enforce). So, though it may seem that using an existing solution is «free», it is not — I’d have to spend time learning and evaluating these existing solutions.

  2. I remember that we went through this in our team at Barclays, and eventually decided to create a custom JSON schema language, although I was not involved in the evaluation process, so can’t share the details.

  3. I was almost certain that no existing «generic JSON schema» solution can provide the power of a custom one. For instance, some of the JSON strings contain expressions in another in-house mini-language. Ideally, I’d like to parse those expressions while I am parsing the enclosing JSON structure, and give locations of possible errors as they appear in the JSON file.

  4. I’d need a parser for the language anyway. Maintaining a schema separately from the parser would mean one more thing to keep in sync and worry about.

I couldn’t use an existing JSON parsing library either. Of course, aeson was out of question, being notorious for its poor error messages (since it’s based on attoparsec and optimized for speed). json, though, is based on parsec, so its error messages are better.

But there is a deeper reason why a JSON parsing library is inadequate for validation. All of the existing JSON libraries first parse into a generic JSON structure, and only then do they try to recognize the specific format and convert to a value of the target Haskell type.

Which means that during parsing, only JSON syntax errors will be detected, but not the other kinds of errors described above. Granted, they all can be detected sooner or later. But what differentiates sooner from later is that once we’re out of the parsing monad, we no longer have access to the position information (unless, of course, our JSON parsing library does extra work to store locations in the parsed JSON structure — which it typically doesn’t). And not having such position information severely impacts our ability to produce good error messages.

To summarize, in order to provide good diagnostics, it is important to parse exactly the language we expect (and not some superset thereof), and to perform all the checks in the parsing monad, where we have access to location information.

JSON parsing combinators

Even though I couldn’t re-use an existing JSON parser or schema, I still wanted my parser to be high-level, and ideally to resemble a JSON schema, just embedded in Haskell.

The rest of this article describes the JSON schema combinators I wrote for this purpose.

Strings

As I mentioned before, the json package uses parsec underneath, so I was able to reuse some basic definitions from there — most notably, p_string, which parses a JSON string. This is fortunate, because handling escape sequences is not straightforward, and I’d rather use a well-tested existing implementation.

string :: Parser String
string = {- copied from Text.JSON.Parsec -}

I introduced one other combinator, theString, which parses a given string:

theString :: String -> Parser ()
theString str = (<?> "\"" ++ str ++ "\"") $ try $ do
  str' <- string
  if str == str'
    then return ()
    else empty

Objects

Objects are an interesting case because we know what set of fields to expect, but not the order in which they come (it may be arbitrary). Such syntax is known as a «permutation phrase», and can be parsed as described in the classical paper Parsing Permutation Phrases by Arthur Baars, Andres Löh and Doaitse Swierstra.

There are surprisingly many implementations of permutation parsing on hackage, including one in parsec itself. Most of them suffer from one or both of the following issues:

  1. they use custom combinators, which, despite being similar to Applicative and Alternative operators, have their quirks and require learning

  2. they don’t support permutation phrases with separators, which is obviously required to parse JSON objects. (The technique to parse permutation phrases with separators was descibed in the original paper, too.)

On the other hand, the action-permutations library by Ross Paterson addresses both of these issues. It provides the familiar Applicative interface to combine permutation elements (or atoms, as it calls them), and includes the function runPermsSep to parse phrases with separators. The interface is also very generic, requiring the underlying functor to be just Alternative.

Below are the combinators for parsing JSON objects. field parses a single object field (or member, as it’s called in the JSON spec), using the supplied parser to parse the field’s value. optField is similar, except it returns Nothing if the field is absent (in which case field would produce an error message). Finally, theField is a shortcut to parse a field with the fixed contents. It is useful when there’s a tag-like field identifying the type/class of the object, for instance

data Item
  = Book
      String -- writer
  | Song
      String -- composer
      String -- singer

item =
  (try . object $
    Book <$
    theField "type" "book" <*>
    field "writer" string)
  <|>
  (try . object $
    Song <$
    theField "type" "song" <*>
    field "composer" string <*>
    field "singer" string)

(Note: examples in this article have nothing to do with the JSON schema we actually use at Signal Vine.)

One thing to pay attention to is how field parsers (field, theField and optField) have a different type from the ordinary parsers. This makes it much easier to reason about what actually gets permuted.

object :: Perms Parser a -> Parser a
object fields = (<?> "JSON object") $
  between (tok (char '{')) (tok (char '}')) $
    runPermsSep (tok (char ',')) fields

-- common function used by field and optField
field' 
  :: String -- key name
  -> Parser a -- value parser
  -> Parser a
field' key value = theString key *> tok (char ':') *> value

field
  :: String -- key name
  -> Parser a -- value parser
  -> Perms Parser a
field key value = atom $ field' key value

theField
  :: String -- key name
  -> String -- expected value
  -> Perms Parser ()
theField key value = () <$ field key (theString value)

optField
  :: String -- key name
  -> Parser a -- value parser
  -> Perms Parser (Maybe a)
optField key value = maybeAtom $ field' key value

Aside: correct separator parsing

There was only one issue I ran into with action-permutations, and it is interesting enough that I decided to describe it here in more detail.

Consider, for example, the expression runPermsSep sep (f <$> atom a <*> atom b <*> atom c)

It would expand to

(flip ($) <$> a <*> (
  (sep *> (flip ($) <$> b <*>
  (sep *> (flip ($) <$> c <*>
    pure (\xc xb xa -> f xc xb xa)))))
  <|>
  (sep *> (flip ($) <$> c <*>
  (sep *> (flip ($) <$> b <*>
    pure (\xc xb xa -> f xb xc xa)))))
))
<|>
(flip ($) <$> b <*> (
  ...
))
<|>
(flip ($) <$> c <*> (
  ...
))

See the problem? Suppose the actual order of the atoms in the input stream is a, c, b. At the beginning the parser is lucky to enter the right branch (the one starting from flip ($) <$> a <*> ...) on the first guess. After that, it has two alternatives: b-then-c, or c-then-b. First it enters the b-then-c branch (i.e. the wrong one) and fails. However, it fails after having consumed some input (namely, the separator) — which in libraries like parsec and trifecta means that the other branch (the right one) won’t be considered.

We cannot even work around this outside of the library by using try, because we can’t insert it in the right place. E.g. wrapping the separator in try won’t work. The right place to insert try would be around the whole alternative

  (sep *> (flip ($) <$> b <*>
  (sep *> (flip ($) <$> c <*>
    pure (\xc xb xa -> f xc xb xa)))))

but this piece is generated by the lirbary and, as a library user, we have no control over it.

The usage of try inside the library itself is unsatisfactory, too. Remember, the interface only assumes the Alternative instance, which has no notion of try. If we had to make it less generic by imposing a Parsing constraint, that would be really unfortunate.

Fortunately, once identified, this problem is not hard to fix properly — and no usage of try is required! All we need is to change runPermsSep so that it expands to the tree where separator parsing is factored out:

(flip ($) <$> a <*> sep *> (
  (flip ($) <$> b <*> sep *>
  (flip ($) <$> c <*>
    pure (\xc xb xa -> f xc xb xa)))))
  <|>
  (flip ($) <$> c <*> sep *>
  (flip ($) <$> b <*>
    pure (\xc xb xa -> f xb xc xa)))
))
<|>
(flip ($) <$> b <*> sep *> (
  ...
))
<|>
(flip ($) <$> c <*> sep *> (
  ...
))

Now, all alternatives start with atoms, so we have full control over whether they consume any input.

Mathematically, this demonstrates that <*> does not distribute over <|> for some backtracking parsers. Note that such distributive property is not required by the Alternative class.

Even for parser monads that allow backtracking by default (attoparsec, polyparse) and for which there’s no semantic difference between the two versions, this change improves efficiency by sharing separator parsing across branches.

My patch fixing the issue has been incorporated into the version 0.0.0.1 of action-permutations.

Arrays

Arrays should be easier to parse than objects in that the order of the elements is fixed. Still, we need to handle separators (commas) between array elements.

If we interpreted arrays as lists, then the schema combinator for arrays might look like

array
  :: Parser a -- parser for a signle element
  -> Parser [a] -- parser for the array

Implementation would be straightforward, too.

However, in our JSON schema we use arrays as tuples rather than lists. That is, we typically expect an array of a fixed number of heterogeneous elements. Thus we’d like to combine these tuple elements into a single parser using the applicative interface.

Let’s say we expect a 2-tuple of a string (a person’s name) and an object (that person’s address).

data Address -- = ...
data Person =
  Person
    String -- name
    Address

Written by hand, the parser may look like

address :: Parser Address
address = _

personAddress = 
  between (tok (char '[')) (tok (char ']')) $
    Person <$> string <* sep <*> address
  where sep = tok $ char ','

It makes sense to move brackets parsing to the array combinator:

array :: Parser a -> Parser a
array p = (<?> "JSON array") $
  between (tok (char '[')) (tok (char ']')) p

But what should we do with the commas? Manually interspersing elements with separators is error-prone and doesn’t correspond to my view of a high-level JSON schema description.

Inserting comma parsers automatically isn’t impossible — after all, it is done in the action-permutations package and we used it to parse object fields, which are comma-separated, too. But it cannot be as easy as adding a separator to every element since there’s one less separators than elements. We have somehow to detect the last element and not to expect a separator after it.

A nice and simple way to achieve this is with a free applicative functor. A free applicative functor will allow us to capture the whole applicative expression and postpone the decision on where to insert separator parsers until we can tell which element is the last one. In this case we’ll use Twan van Laarhoven’s free applicative, as implemented in the free package.

element :: Parser a -> Ap Parser a
element = liftAp

theArray :: Ap Parser a -> Parser a
theArray p = between (tok (char '[')) (tok (char ']')) $ go p
  where
    go :: Ap Parser a -> Parser a
    go (Ap p (Pure f)) = f <$> p
    go (Ap p1 pn) = flip id <$> p1 <* tok (char ',') <*> go pn

Like object fields, array elements have a special type which makes it clear which pieces exactly are comma-separated.

In fact, the applicative functor Perms is essentially the free applicative functor Ap plus branching.

Optional array elements

Now comes the twist. Some of the array elements may be optional — in the same way as positional function arguments in some languages may be optional. Since the elements are positional, if one of them is omitted, all subsequent ones have to be omitted, too — otherwise we won’t be able to tell which one was omitted, exactly.

For that reason, all optional elements should come after all the non-optional ones; if not, then we’ve made a mistake while designing (or describing) our schema. Ideally, our solution should catch such mistakes, too.

So, how can the above solution be adapted to handle optional arguments?

Attempt #1:
optElement :: Parser a -> Ap Parser (Maybe a)
optElement p = element $ optional p

Here optional is a combinator defined in Control.Applicative as

optional v = Just <$> v <|> pure Nothing

This won’t work at all, as it doesn’t give us any information about whether it’s an optional element or just a parser that happens to return a Maybe type.

Attempt #2:

Well, let’s just add a flag indicating whether the element was created using optElement, shall we?

data El a =
  El
    Bool -- is optional?
    (Parser a)

element :: Parser a -> Ap El a
element = liftAp . El False

optElement :: Parser a -> Ap El (Maybe a)
optElement = liftAp . El True . optional

Now we can check that optional arguments come after non-optional ones. If an element’s parse result is Nothing, we also know whether that element is an optional one, and whether we should stop trying to parse the subsequent elements.

Still, there are two related issues preventing this version from working:

  • How do we actually know when a parser returns Nothing? Once we lift a parser into the free applicative, its return type becomes existentially quantified, i.e. we should treat it as polymorphic and cannot assume it has form Maybe a (by pattern-matching on it), even if we can convince ourselves by looking at the Bool flag that it indeed must be of that form.

  • Similarly, once we’ve detected an absent optional element (assuming for a second that it is possible), we have to force all the remaining optional parsers to return Nothing without parsing anything. But again, we cannot convince the compiler that Nothing is an acceptable return value of those parsers.

Attempt #3:

So, we need certain run-time values (the optionality flag) to introduce type-level information (namely, that the parser’s return type has form Maybe a). That’s exactly what GADTs do!

data El a where
  El :: Parser a -> El a
  OptEl :: Parser a -> El (Maybe a)

El’s two constructors are a more powerful version of our old Bool flag. They let us see whether the element is optional, and if so, guarantee that its parser’s return type is Maybeish.

And here’s the code for the parsing functions:

element :: Parser a -> Ap El a
element = liftAp . El

optElement :: Parser a -> Ap El (Maybe a)
optElement = liftAp . OptEl

theArray :: Ap El a -> Parser a
theArray p = between (tok (char '[')) (tok (char ']')) $
  go True False False p
  where
    go :: Bool -> Bool -> Bool -> Ap El a -> Parser a
    go _ _ _ (Pure x) = pure x
    go isFirst optionalOccurred optionalOmitted (Ap el1 eln) =
      let
        eltSequenceError :: a
        eltSequenceError =
          error "theArray: a non-optional element after an optional one"

        !_check =
          case el1 of
            El {} | optionalOccurred -> eltSequenceError
            _ -> ()
      in
        if optionalOmitted
          then
            case el1 of
              El {} -> eltSequenceError
              OptEl {} -> go False True True eln <*> pure Nothing
          else do
            let
              sep = if isFirst then pure () else () <$ tok (char ',')
            case el1 of
              El p1 ->
                flip id <$ sep <*> p1 <*> go False False False eln
              OptEl p1 -> do
                r1 <- optional $ sep *> p1
                go False True (isNothing r1) eln <*> pure r1

theArray is a state machine with three pieces of state: isFirst, optionalOccurred and optionalOmitted. isFirst and optionalOmitted are used to guide actual parsing, while optionalOccurred is needed to check the proper arrangement of optional vs non-optional arguments.

Conclusions

Although the standard approach to JSON parsing is to parse into a generic JSON representation first, the article shows that an alternative approach — parsing the expected structure directly — is also viable and can be employed to improve error reporting.

Of course, the tricks and ideas described here are not specific to JSON. Understanding how they work and how to use them may become handy in a variety of parsing situations.