Roman Cheplyaka

Better YAML Parsing

July 26, 2015

Michael Snoyman’s yaml package reuses aeson’s interface (the Value type and ToJSON & FromJSON classes) to specify how datatypes should be serialized and deserialized.

It’s not a secret that aeson’s primary goal is raw performance. This goal may be at odds with the goal of YAML: being human readable and writable.

In this article, I’ll explain how a better way of parsing human-written YAML may work. The second direction – serializing to YAML – also needs attention, but I’ll leave it out for now.

Example: Item

To demonstrate where the approach taken by the yaml package is lacking, I’ll use the following running example.

{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON(..), withObject, withText, (.:), (.:?), (.!=))
import Data.Yaml (decodeEither)
import Data.Text (Text)
import Control.Applicative

data Item = Item
  Text -- title
  Int -- quantity
  deriving Show

The fully-specified Item in YAML may look like this:

title: Shampoo
quantity: 100

In our application, most of the time the quantity will be 1, so we’ll allow two alternative simplified forms. In the first one, the quantity field is omitted and defaulted to 1:

title: Shampoo

In the second form, the object will be flattened to a bare string:

Shampoo

Here’s a reasonably idiomatic way to write an aeson parser for this format:

defaultQuantity :: Int
defaultQuantity = 1

instance FromJSON Item where
  parseJSON v = parseObject v <|> parseString v
    where
      parseObject = withObject "object" $ \o ->
        Item <$>
          o .: "title" <*>
          o .:? "quantity" .!= defaultQuantity
        
      parseString = withText "string" $ \t ->
        return $ Item t defaultQuantity

Shortcomings of FromJSON

The main requirement for a format written by humans is error detection and reporting.

Let’s see how the parser we’ve defined copes with human errors.

> decodeEither "{title: Shampoo, quanity: 2}" :: Either String Item
Right (Item "Shampoo" 1)

Unexpected result, isn’t it? If you look closer, you’ll notice that the word quantity is misspelled. But our parser didn’t have any problem with that. Such a typo may go unnoticed for a long time and quietly affect how your application works.

For another example, let’s say I am a returning user who vaguely remembers the YAML format for Items. I might have written something like

> decodeEither "{name: Shampoo, quantity: 2}" :: Either String Item
Left "when expecting a string, encountered Object instead"

“That’s weird. I could swear this app accepted some form of an object where you could specify the quantity. But apparently I’m wrong, it only accepts simple strings.”

How to fix it

Check for unrecognized fields

To address the first problem, we need to know the set of acceptable keys. This set is impossible to extract from a FromJSON parser, because it is buried inside an opaque function.

Let’s change parseJSON to have type FieldParser a, where FieldParser is an applicative functor that we’ll define shortly. The values of FieldParser can be constructed with combinators:

field
  :: Text -- ^ field name
  -> Parser a -- ^ value parser
  -> FieldParser a

optField
  :: Text -- ^ field name
  -> Parser a -- ^ value parser
  -> FieldParser (Maybe a)

The combinators are analogous to the ones I described in JSON validation combinators.

So how do we implement FieldParser? One (“initial”) way is to use a free applicative functor and later interpret it in two ways: as a FromJSON-like parser and as a set of valid keys.

But there’s another (“final”) way, which is to compose the applicative functor from components, one per required semantics. The semantics of FromJSON is given by ReaderT Object (Either ParseError). The semantics of a set of valid keys is given by Constant (HashMap Text ()). We take the product of these semantics to get the implementation of FieldParser:

newtype FieldParser a = FieldParser
  (Product
    (ReaderT Object (Either ParseError))
    (Constant (HashMap Text ())) a)

Notice how I used HashMap Text () instead of HashSet Text? This is a trick to be able to subtract this from the object (represented as HashMap Text Value) later.

Another benefit of this change is that it’s no longer necessary to give a name to the object (often called o), which I’ve always found awkward.
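
To make the construction concrete, here is a minimal sketch of how the pieces might fit together. It is not the actual implementation: to stay self-contained it replaces aeson's Value, Object and Text with simplified stand-ins built on Data.Map and String, and it takes plain functions of type Value -> Either ParseError a where the article's combinators take a Parser a. The names runFieldParser, asString, asNumber and itemFields are hypothetical.

```haskell
import Control.Monad.Trans.Reader (ReaderT (..))
import Data.Functor.Constant (Constant (..))
import Data.Functor.Product (Product (..))
import qualified Data.Map.Strict as M
import Data.Maybe (fromMaybe)

-- Simplified stand-ins (assumptions, not aeson's real types).
data Value = String String | Number Int | Object Object
  deriving (Eq, Show)

type Object = M.Map String Value
type ParseError = String

-- The product of the two semantics: an object reader and a set of known keys.
newtype FieldParser a = FieldParser
  (Product
    (ReaderT Object (Either ParseError))
    (Constant (M.Map String ())) a)

instance Functor FieldParser where
  fmap f (FieldParser p) = FieldParser (fmap f p)

instance Applicative FieldParser where
  pure = FieldParser . pure
  FieldParser f <*> FieldParser x = FieldParser (f <*> x)

-- A required field: fails when the key is absent.
field :: String -> (Value -> Either ParseError a) -> FieldParser a
field name p = FieldParser $ Pair
  (ReaderT $ \o -> case M.lookup name o of
      Nothing -> Left ("missing field: " ++ name)
      Just v  -> p v)
  (Constant (M.singleton name ()))

-- An optional field: yields Nothing when the key is absent.
optField :: String -> (Value -> Either ParseError a) -> FieldParser (Maybe a)
optField name p = FieldParser $ Pair
  (ReaderT $ \o -> traverse p (M.lookup name o))
  (Constant (M.singleton name ()))

-- Subtract the known keys from the object; anything left over is an error.
runFieldParser :: FieldParser a -> Object -> Either ParseError a
runFieldParser (FieldParser (Pair (ReaderT run) (Constant known))) o =
  case M.keys (M.difference o known) of
    []      -> run o
    unknown -> Left ("unrecognized fields: " ++ unwords unknown)

-- The running example, written without naming the object.
data Item = Item String Int
  deriving (Eq, Show)

asString :: Value -> Either ParseError String
asString (String s) = Right s
asString _          = Left "expected a string"

asNumber :: Value -> Either ParseError Int
asNumber (Number n) = Right n
asNumber _          = Left "expected a number"

itemFields :: FieldParser Item
itemFields =
  Item <$> field "title" asString
       <*> (fromMaybe 1 <$> optField "quantity" asNumber)
```

With these definitions, running itemFields on an object with the keys title and quanity yields Left "unrecognized fields: quanity" instead of silently defaulting the quantity. The subtraction in runFieldParser is where the HashMap Text () trick pays off: M.difference accepts maps with different value types.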

Improve error messages

Aeson’s approach to error messages is straightforward: it tries every alternative in turn and, if none succeeds, it returns the last error message.

There are two approaches to more sophisticated error reporting:

  1. Collect errors from all alternatives and somehow merge them. Each error would carry its level of “matching”. An alternative that matched the object but failed at key lookup matches better than the one that expected a string instead of an object. Thus the error from the first alternative would prevail. If there are multiple errors on the same level, we should try to merge them. For instance, if we expect an object or a string but got an array, then the error message should mention both object and string as valid options.

  2. Limited backtracking. This is what Parsec does. In our example, once the object is “at least somewhat” matched by the first alternative, the second one is abandoned. This approach is rather restrictive: if you have two alternatives each expecting an object, the second one will never fire. The benefit of this approach is its efficiency (sometimes real, sometimes imaginary), since we never explore more than one alternative deeply.
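
The first approach could be sketched roughly as follows. This is only an illustration under my own assumptions: the ParseError record, the integer matchLevel, and the names mergeErrors and orElse are all hypothetical, and a real implementation would need a more careful notion of “level”.

```haskell
import Data.List (nub)

-- An error tagged with how far the alternative got before failing
-- (hypothetical representation).
data ParseError = ParseError
  { matchLevel :: Int      -- higher means the input matched "better"
  , messages   :: [String] -- what went wrong at that level
  } deriving (Eq, Show)

-- The better-matching error prevails; errors on the same level are merged.
mergeErrors :: ParseError -> ParseError -> ParseError
mergeErrors e1 e2 = case compare (matchLevel e1) (matchLevel e2) of
  GT -> e1
  LT -> e2
  EQ -> ParseError (matchLevel e1) (nub (messages e1 ++ messages e2))

-- Try both alternatives; if both fail, report the merged error.
orElse :: Either ParseError a -> Either ParseError a -> Either ParseError a
orElse (Right a) _         = Right a
orElse (Left _)  (Right b) = Right b
orElse (Left e1) (Left e2) = Left (mergeErrors e1 e2)

-- Hypothetical errors for the Item example: the object alternative got as far
-- as the key lookup, while the string alternative failed immediately.
objectError, stringError :: ParseError
objectError = ParseError 1 ["key \"title\" not found"]
stringError = ParseError 0 ["expected a string"]
```

Here Left objectError `orElse` Left stringError keeps only objectError, so the returning user from the earlier example would be told about the missing title key rather than about strings.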

It turns out, when parsing Values, we can remove some of the backtracking without imposing any restrictions. This is because we can “factor out” common parser prefixes. If we have two parsers that expect an object, this is equivalent to having a single parser expecting an object. To see this, let’s represent a parser as a record with a field per JSON “type”:

data Parser a = Parser
  { parseString :: Maybe (Text -> Either ParseError a)
  , parseArray  :: Maybe (Vector Value -> Either ParseError a)
  , parseObject :: Maybe (HashMap Text Value -> Either ParseError a)
  ...
  }

Writing a function Parser a -> Parser a -> Parser a which merges individual fields is then a simple exercise.
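
For the record, here is one way that exercise might go. The sketch assumes a cut-down Parser with only three components and simplified stand-ins for Value and Text; mergeComponent, merge, fromString, fromObject and itemParser are hypothetical names. When both parsers handle the same JSON type, it tries the first and falls back to the second, which is just one possible policy.

```haskell
import qualified Data.Map.Strict as M

-- Simplified stand-ins (assumptions, not aeson's real types).
data Value = String String | Array [Value] | Object (M.Map String Value)

type ParseError = String

data Parser a = Parser
  { parseString :: Maybe (String -> Either ParseError a)
  , parseArray  :: Maybe ([Value] -> Either ParseError a)
  , parseObject :: Maybe (M.Map String Value -> Either ParseError a)
  }

-- Merge one component: keep whichever side is defined; if both are,
-- try the first and fall back to the second on failure.
mergeComponent
  :: Maybe (v -> Either ParseError a)
  -> Maybe (v -> Either ParseError a)
  -> Maybe (v -> Either ParseError a)
mergeComponent Nothing  q        = q
mergeComponent p        Nothing  = p
mergeComponent (Just f) (Just g) = Just $ \v -> either (const (g v)) Right (f v)

-- Merge two parsers field by field.
merge :: Parser a -> Parser a -> Parser a
merge p q = Parser
  { parseString = mergeComponent (parseString p) (parseString q)
  , parseArray  = mergeComponent (parseArray p) (parseArray q)
  , parseObject = mergeComponent (parseObject p) (parseObject q)
  }

-- The two alternatives of the Item example as separate parsers.
data Item = Item String Int
  deriving (Eq, Show)

fromString :: Parser Item
fromString = Parser
  { parseString = Just (\t -> Right (Item t 1))
  , parseArray  = Nothing
  , parseObject = Nothing
  }

fromObject :: Parser Item
fromObject = Parser
  { parseString = Nothing
  , parseArray  = Nothing
  , parseObject = Just $ \o -> case M.lookup "title" o of
      Just (String t) -> Right (Item t 1)
      _               -> Left "expected a string under the key \"title\""
  }

itemParser :: Parser Item
itemParser = merge fromObject fromString
```

After merging, itemParser has Just components for strings and objects and Nothing for arrays, so no backtracking between the two alternatives is needed: the shape of the input picks the component directly.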

Why is every field wrapped in Maybe? How’s Nothing different from Just $ const $ Left "..."? This is so that we can see which JSON types are valid and give a better error message. If we tried to parse a JSON number as an Item, the error message would say that it expected an object or a string, because only those fields of the parser would be Just values.
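
A small sketch of how the Maybe wrappers might be used to build such a message. Again, the three-component Parser and the stand-in Value type are simplifications, and expectedTypes, applyComponent and runParser are hypothetical names rather than the actual implementation.

```haskell
import Data.List (intercalate)
import Data.Maybe (isJust)
import qualified Data.Map.Strict as M

-- Simplified stand-ins (assumptions, not aeson's real types).
data Value = String String | Number Double | Object (M.Map String Value)

type ParseError = String

data Parser a = Parser
  { parseString :: Maybe (String -> Either ParseError a)
  , parseNumber :: Maybe (Double -> Either ParseError a)
  , parseObject :: Maybe (M.Map String Value -> Either ParseError a)
  }

-- The JSON types this parser accepts, read off the Just components.
expectedTypes :: Parser a -> [String]
expectedTypes p =
  [ "a string"  | isJust (parseString p) ] ++
  [ "a number"  | isJust (parseNumber p) ] ++
  [ "an object" | isJust (parseObject p) ]

-- Run one component, or report which types would have been acceptable.
applyComponent
  :: Parser a -> String -> Maybe (v -> Either ParseError a) -> v
  -> Either ParseError a
applyComponent _ _      (Just f) v = f v
applyComponent p actual Nothing  _ =
  Left ("expected " ++ intercalate " or " (expectedTypes p)
        ++ ", but got " ++ actual)

-- Dispatch on the shape of the value.
runParser :: Parser a -> Value -> Either ParseError a
runParser p (String s) = applyComponent p "a string" (parseString p) s
runParser p (Number n) = applyComponent p "a number" (parseNumber p) n
runParser p (Object o) = applyComponent p "an object" (parseObject p) o

-- An Item parser that accepts strings and objects, but not numbers.
data Item = Item String Int
  deriving (Eq, Show)

itemParser :: Parser Item
itemParser = Parser
  { parseString = Just (\t -> Right (Item t 1))
  , parseNumber = Nothing
  , parseObject = Just $ \o -> case M.lookup "title" o of
      Just (String t) -> Right (Item t 1)
      _               -> Left "expected a string under the key \"title\""
  }
```

With these definitions, runParser itemParser (Number 5) returns Left "expected a string or an object, but got a number". Had the number component been Just $ const $ Left "...", there would be no way to tell that numbers are never acceptable here.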

Implementation

As you might have noticed, the Parser type above can be mechanically derived from the Value datatype itself. In my actual implementation, I use generics-sop with great success to reduce the boilerplate. To give you an idea, here’s the real definition of the Parser type:

newtype ParserComponent a fs = ParserComponent (Maybe (NP I fs -> Either ParseError a))
newtype Parser a = Parser (NP (ParserComponent a) (Code Value))

We can then apply a Parser to a Value using this function.


I’ve implemented this YAML parsing layer for our needs at Signal Vine. We are happy to share the code, in case someone is interested in maintaining this as an open source project.