Better YAML parsing
If you need to parse a YAML file in Haskell today, you will probably reach for Michael Snoyman’s yaml package.
That parser works in two stages.
During the first stage, it parses YAML into a generic representation, such as an array of dictionaries of strings. For this, the yaml package uses the libyaml C library written by Kirill Simonov.
During the second stage, the generic representation is converted into the application-specific Haskell type. For instance, an abstract dictionary may be mapped to a record type.
This idea of two-stage parsing is borrowed from the aeson package, which parses JSON in a similar way. And because JSON’s and YAML’s data models are similar, the yaml package borrows from Aeson not only the above idea but also the generic representation and the machinery to convert it to Haskell types.
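To make the two stages concrete, here is a minimal sketch. The function names below reflect the yaml and aeson APIs as I recall them (decodeEither, fromJSON); treat the exact signatures as assumptions rather than a definitive reference.

import Data.Aeson (FromJSON, Value, Result, fromJSON)
import Data.ByteString (ByteString)
import qualified Data.Yaml as Yaml

-- Stage 1: parse the YAML text into aeson's generic Value representation.
stage1 :: ByteString -> Either String Value
stage1 = Yaml.decodeEither

-- Stage 2: convert the generic Value into the application-specific type
-- using its FromJSON instance.
stage2 :: FromJSON a => Value -> Result a
stage2 = fromJSON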
Thanks to this approach, if you have a FromJSON instance for a type, you can deserialize this type not only from JSON but also from the more human-readable and writable YAML.
But there is a downside, too. Because Aeson’s primary goal is performance, it doesn’t try to provide good error messages or even validate the input beyond what’s necessary. This is not a problem for JSON because it is typically generated by programs.
But YAML is often written by humans, so it is important to detect possible mistakes and report them clearly.
Example
Consider a Haskell type representing a shopping cart item:
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON(..), withObject, withText, (.:), (.:?), (.!=))
import Data.Yaml (decodeEither)
import Data.Text (Text)
import Control.Applicative
data Item = Item
  Text -- title
  Int  -- quantity
  deriving Show
In YAML, an Item may be written as:
title: Shampoo
quantity: 100
In our application, most of the time the quantity will be 1, so we will allow two alternative simplified forms. In the first form, the quantity field is omitted and defaulted to 1:
title: Shampoo
In the second form, the object will be flattened to a bare string:
Shampoo
Here’s a reasonably idiomatic way to write an Aeson parser for this format:
defaultQuantity :: Int
defaultQuantity = 1

instance FromJSON Item where
  parseJSON v = parseObject v <|> parseString v
    where
      parseObject = withObject "object" $ \o ->
        Item <$>
          o .: "title" <*>
          o .:? "quantity" .!= defaultQuantity

      parseString = withText "string" $ \t ->
        return $ Item t defaultQuantity
With this example, I can now demonstrate the two weak spots of Aeson parsing: insufficient input validation and confusing error messages.
Validation
The following YAML parses successfully. But does the resulting Item match your expectations?

decodeEither "{title: Shampoo, quanity: 2}" :: Either String Item
Right (Item "Shampoo" 1)
If you look closer, you’ll notice that the word quantity is misspelled. But the parser doesn’t have any problem with that. Such a typo may go unnoticed for a long time and quietly affect how your application works.
Error reporting
Let’s say I am a returning user who vaguely remembers the YAML format for Items. I might have written something like
"{name: Shampoo, quantity: 2}" :: Either String Item decodeEither
Left "when expecting a string, encountered Object instead"
“That’s weird. I could swear this app accepted some form of an object where you could specify the quantity. But apparently I’m wrong, it only accepts simple strings.”
How to fix it
Check for unrecognized fields
To address the first problem, we need to know the set of acceptable keys. This set is impossible to extract from a FromJSON parser because it is buried inside an opaque function.
Let’s change parseJSON to have type FieldParser a, where FieldParser is an applicative functor that we’ll define shortly. The values of FieldParser can be constructed with combinators:
field :: Text       -- ^ field name
      -> Parser a   -- ^ value parser
      -> FieldParser a

optField :: Text       -- ^ field name
         -> Parser a   -- ^ value parser
         -> FieldParser (Maybe a)
The combinators are analogous to the ones I described in JSON validation combinators.
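As a usage sketch, the Item type from above could then be assembled applicatively from these combinators. Here string and integer stand in for assumed value parsers of type Parser Text and Parser Int; the real names may differ.

import Data.Maybe (fromMaybe)

-- Hypothetical sketch: build an Item from two fields, defaulting the
-- quantity when it is absent.
itemFields :: FieldParser Item
itemFields = Item
  <$> field "title" string
  <*> (fromMaybe defaultQuantity <$> optField "quantity" integer)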
How can we implement the FieldParser type? One (“initial”) way is to use a free applicative functor and later interpret it in two ways: as a FromJSON-like parser and as a set of valid keys.

But there’s another (“final”) way, which is to compose the applicative functor from components, one per required semantics. The semantics of FromJSON is given by ReaderT Object (Either ParseError). The semantics of a set of valid keys is given by Constant (HashMap Text ()). We take the product of these semantics to get the implementation of FieldParser:
newtype FieldParser a = FieldParser
  (Product
    (ReaderT Object (Either ParseError))
    (Constant (HashMap Text ()))
    a)
Here I used HashMap Text () instead of HashSet Text to be able to subtract this set from the object (represented as HashMap Text Value) later.
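To sketch how a combinator can feed both components of the product at once, here is a hypothetical implementation of field. The names runParser, ParseError, and its MissingField constructor are assumptions standing in for the real definitions.

import Control.Monad.Trans.Reader (ReaderT(..))
import Data.Functor.Constant (Constant(..))
import Data.Functor.Product (Product(..))
import qualified Data.HashMap.Strict as HashMap
import Data.Text (Text)

field :: Text -> Parser a -> FieldParser a
field name valueParser = FieldParser $ Pair
  -- first component: look up the key in the object and run the value parser
  (ReaderT $ \obj -> case HashMap.lookup name obj of
      Nothing -> Left (MissingField name)
      Just v  -> runParser valueParser v)
  -- second component: record the key as recognized
  (Constant (HashMap.singleton name ()))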
Another benefit of this approach is that it’s no longer necessary to give a name to the object (often called o) as in the Aeson-based parser. I’ve always found that awkward and unnecessary.
Improve error messages
Aeson’s approach to error messages is straightforward: it tries every alternative in turn and, if none succeeds, it returns the last error message.
There are two approaches to more sophisticated error reporting:

1. Collect errors from all alternatives and somehow merge them. Each error would carry its level of “matching”. An alternative that matched the object but failed at a key lookup matches better than one that expected a string instead of an object, so the error from the first alternative would prevail. If there are multiple errors at the same level, we should try to merge them. For instance, if we expect an object or a string but got an array, the error message should mention both object and string as valid options.

2. Limited backtracking. This is what Parsec does. In our example, once it was determined that the object was “at least somewhat” matched by the first alternative, the second one would be abandoned. This approach is rather restrictive: if you have two alternatives each expecting an object, the second one will never fire. The benefit of this approach is its efficiency (sometimes real, sometimes imaginary), since we never explore more than one alternative deeply.
It turns out that when parsing Values, we can remove some of the backtracking without imposing any restrictions. This is because we can “factor out” common parser prefixes. If we have two parsers that expect an object, this is equivalent to having a single parser expecting an object. To see this, let’s represent a parser as a record with a field per JSON “type”:
data Parser a = Parser
  { parseString :: Maybe (Text -> Either ParseError a)
  , parseArray  :: Maybe (Vector Value -> Either ParseError a)
  , parseObject :: Maybe (HashMap Text Value -> Either ParseError a)
  , ...
  }
Writing a function Parser a -> Parser a -> Parser a which merges the individual fields is then a simple exercise.
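Here is a minimal sketch of that exercise, assuming for simplicity that the record has only the three components shown above. On overlapping components it tries the first handler and falls back to the second; a real implementation might instead merge the two errors as discussed earlier.

-- Merge two optional handlers for the same JSON type.
mergeComponent
  :: Maybe (v -> Either ParseError a)
  -> Maybe (v -> Either ParseError a)
  -> Maybe (v -> Either ParseError a)
mergeComponent Nothing  g        = g
mergeComponent f        Nothing  = f
mergeComponent (Just f) (Just g) =
  Just $ \v -> either (const (g v)) Right (f v)

-- Merge two parsers field by field.
mergeParsers :: Parser a -> Parser a -> Parser a
mergeParsers p1 p2 = Parser
  { parseString = mergeComponent (parseString p1) (parseString p2)
  , parseArray  = mergeComponent (parseArray  p1) (parseArray  p2)
  , parseObject = mergeComponent (parseObject p1) (parseObject p2)
  }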
Why is every field wrapped in Maybe? How is Nothing different from Just $ const $ Left "..."? This is so that we can see which JSON types are valid and give a better error message. If we tried to parse a JSON number as an Item, the error message would say that it expected an object or a string, because only those fields of the parser would be Just values.
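For example, the set of acceptable JSON types can be read straight off the record. This is a hypothetical helper, not part of the real library:

import Data.Maybe (catMaybes)

-- Which JSON types does this parser accept? Useful for messages such as
-- "expected object or string".
expectedTypes :: Parser a -> [String]
expectedTypes p = catMaybes
  [ "string" <$ parseString p
  , "array"  <$ parseArray  p
  , "object" <$ parseObject p
  ]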
The Parser type above can be mechanically derived from the Value datatype itself. In the actual implementation, I use generics-sop with great success to reduce the boilerplate. To give you an idea, here’s the real definition of the Parser type:
newtype ParserComponent a fs = ParserComponent (Maybe (NP I fs -> Either ParseError a))
newtype Parser a = Parser (NP (ParserComponent a) (Code Value))
We can then apply a Parser to a Value using this function.
Example revisited
Here is the same Item type and a combinator-based YAML parser:
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Data.Maybe
import Data.Monoid
import Data.Yaml.Combinators
data Item = Item
  Text -- title
  Int  -- quantity
  deriving Show

itemParser :: Parser Item
itemParser = (flip Item 1 <$> string)
  <> (object $ Item
    <$> field "title" string
    <*> (fromMaybe 1 <$> optField "quantity" integer))
Let’s see now what errors it produces.
Validation
The YAML with a typo in the key name no longer parses:
either putStrLn print $ parse itemParser "{title: Shampoo, quanity: 2}"
Unexpected
quanity: 2
as part of
quanity: 2
title: Shampoo
Error reporting
Since we supplied an object, the parser explains what’s wrong with that object without telling us it’d rather receive a string.
either putStrLn print $ parse itemParser "{name: Shampoo, quantity: 2}"
Expected field "title" as part of
quantity: 2
name: Shampoo
Implementation
I originally implemented these combinators as an internal module while working for Signal Vine in 2015. They kindly agreed to release it under the MIT license, and I finally did so in 2017.
You can find the code packaged under the name yaml-combinators on Hackage and GitHub.