Better YAML parsing

Published on

If you need to parse a YAML file in Haskell today, you will probably reach for Michael Snoyman’s yaml package.

That parser works in two stages.

During the first stage, it parses YAML into a generic representation, such as an array of dictionaries of strings. For this, the yaml package uses the libyaml C library written by Kirill Simonov.

During the second stage, the generic representation is converted into the application-specific Haskell type. For instance, an abstract dictionary may be mapped to a record type.

This idea of two-stage parsing is borrowed from the aeson package, which parses JSON in a similar way. And because JSON’s and YAML’s data models are similar, the yaml package borrows from Aeson not only the above idea but also the generic representation and the machinery to convert it to Haskell types.

Thanks to this approach, if you have a FromJSON instance for a type, you can deserialize this type not only from JSON but also from the more human readable and writable YAML.

But there is a downside, too. Because Aeson’s primary goal is performance, it doesn’t try to provide good error messages or even validate the input beyond what’s necessary. This is not a problem for JSON because it is typically generated by programs.

But YAML is often written by humans, so it is important to detect possible mistakes and report them clearly.

Example

Consider a Haskell type representing a shopping cart item:

In YAML, an Item may be written as:

In our application, most of the time the quantity will be 1, so we will allow two alternative simplified forms. In the first form, the quantity field is omitted and defaulted to 1:

In the second form, the object will be flattened to a bare string:

Here’s a reasonably idiomatic way to write an Aeson parser for this format:

With this example, I can now demonstrate the two weak spots of Aeson parsing: insufficient input validation and confusing error messages.

Validation

The following YAML parses successfully. But does the resulting Item match your expectations?

Right (Item "Shampoo" 1)

If you look closer, you’ll notice that the word quantity is misspelled. But the parser doesn’t have any problem with that. Such a typo may go unnoticed for a long time and quitely affect how your application works.

Error reporting

Let’s say I am a returning user who vaguely remembers the YAML format for Items. I might have written something like

Left "when expecting a string, encountered Object instead"

“That’s weird. I could swear this app accepted some form of an object where you could specify the quantity. But apparently I’m wrong, it only accepts simple strings.”

How to fix it

Check for unrecognized fields

To address the first problem, we need to know the set of acceptable keys. This set is impossible to extract from a FromJSON parser because it is buried inside an opaque function.

Let’s change parseJSON to have type FieldParser a, where FieldParser is an applicative functor that we’ll define shortly. The values of FieldParser can be constructed with combinators:

The combinators are analogous to the ones I described in JSON validation combinators.

How can we implement the FieldParser type? One (“initial”) way is to use a free applicative functor and later interpret it in two ways: as a FromJSON-like parser and as a set of valid keys.

But there’s another (“final”) way which is to compose the applicative functor from components, one per required semantics. The semantics of FromJSON is given by ReaderT Object (Either ParseError). The semantics of a set of valid keys is given by Constant (HashMap Text ()). We take the product of these semantics to get the implementation of FieldParser:

Here I used HashMap Text () instead of HashSet Text to be able to subtract this set from the object (represented as HashMap Text Value) later.

Another benefit of this approach is that it’s no longer necessary to give a name to the object (often called o) as in the Aeson-based parser. I’ve always found that awkward and unnecessary.

Improve error messages

Aeson’s approach to error messages is straightforward: it tries every alternative in turn and, if none succeeds, it returns the last error message.

There are two approaches to get a more sophisticated error reporting:

  1. Collect errors from all alternatives and somehow merge them. Each error would carry its level of “matching”. An alternative that matched the object but failed at key lookup matches better than the one that expected a string instead of an object. Thus the error from the first alternative would prevail. If there are multiple errors on the same level, we should try to merge them. For instance, if we expect an object or a string but got an array, then the error message should mention both object and string as valid options.

  2. Limited backtracking. This is what Parsec does. In our example, when it was determined that the object was “at least somewhat” matched by the first alternative, the second one would have been abandoned. This approach is rather restrictive: if you have two alternatives each expecting an object, the second one will never fire. The benefit of this approach is its efficiency (sometimes real, sometimes imaginary), since we never explore more than one alternative deeply.

It turns out, when parsing Values, we can remove some of the backtracking without imposing any restrictions. This is because we can “factor out” common parser prefixes. If we have two parsers that expect an object, this is equivalent to having a single parser expecting an object. To see this, let’s represent a parser as a record with a field per JSON “type”:

Writing a function Parser a -> Parser a -> Parser a which merges individual fields is then a simple exercise.

Why is every field wrapped in Maybe? How’s Nothing different from Just $ const $ Left "..."? This is so that we can see which JSON types are valid and give a better error message. If we tried to parse a JSON number as an Item, the error message would say that it expected an object or a string, because only those fields of the parser would be Just values.

The Parser type above can be mechanically derived from the Value datatype itself. In the actual implementation, I use generics-sop with great success to reduce the boilerplate. To give you an idea, here’s the real definition of the Parser type:

We can then apply a Parser to a Value using this function.

Example revisited

Here is the same Item type and a combinator-based YAML parser:

Let’s see now what errors it produces.

Validation

The YAML with a typo in the key name no longer parses:

Unexpected 

quanity: 2

as part of

quanity: 2
title: Shampoo

Error reporting

Since we supplied an object, the parser explains what’s wrong with that object without telling us it’d rather receive a string.

Expected field "title" as part of

quantity: 2
name: Shampoo

Implementation

I originally implemented these combinators as an internal module while working for Signal Vine in 2015. They kindly agreed to release it under the MIT instance, and I finally did so in 2017.

You can find the code packaged under the name yaml-combinators on hackage and github.