JSON validation combinators
At Signal Vine we have a JSON-based language for describing text messaging campaigns. We may design a better surface syntax for this language in the future, but the current one gets the job done and there are certain existing systems that depend on it.
Anyway, the problem with this language is that it is too easy to make mistakes — including errors in JSON syntax, structural errors (plugging an array where an object is expected), or name errors (making a typo in a field name).
So the first thing I did was write a validator for our JSON format.
There are several «JSON schema» projects around, but there were many reasons against using them.
I don’t know about the quality of the tools that support such schemas (i.e. the quality of error messages they generate), or the expressivity of the schemas themselves (whether they’d let us express the structure of our JSON documents and the constraints we’d like to enforce). So, though it may seem that using an existing solution is «free», it is not — I’d have to spend time learning and evaluating these existing solutions.
I remember that we went through this in our team at Barclays, and eventually decided to create a custom JSON schema language, although I was not involved in the evaluation process, so can’t share the details.
I was almost certain that no existing «generic JSON schema» solution can provide the power of a custom one. For instance, some of the JSON strings contain expressions in another in-house mini-language. Ideally, I’d like to parse those expressions while I am parsing the enclosing JSON structure, and give locations of possible errors as they appear in the JSON file.
I’d need a parser for the language anyway. Maintaining a schema separately from the parser would mean one more thing to keep in sync and worry about.
I couldn’t use an existing JSON parsing library either. Of course, aeson was out of the question, being notorious for its poor error messages (since it’s based on attoparsec and optimized for speed). json, though, is based on parsec, so its error messages are better.
But there is a deeper reason why a JSON parsing library is inadequate for validation. All of the existing JSON libraries first parse into a generic JSON structure, and only then do they try to recognize the specific format and convert to a value of the target Haskell type.
Which means that during parsing, only JSON syntax errors will be detected, but not the other kinds of errors described above. Granted, they all can be detected sooner or later. But what differentiates sooner from later is that once we’re out of the parsing monad, we no longer have access to the position information (unless, of course, our JSON parsing library does extra work to store locations in the parsed JSON structure — which it typically doesn’t). And not having such position information severely impacts our ability to produce good error messages.
To summarize, in order to provide good diagnostics, it is important to parse exactly the language we expect (and not some superset thereof), and to perform all the checks in the parsing monad, where we have access to location information.
JSON parsing combinators
Even though I couldn’t re-use an existing JSON parser or schema, I still wanted my parser to be high-level, and ideally to resemble a JSON schema, just embedded in Haskell.
The rest of this article describes the JSON schema combinators I wrote for this purpose.
Strings
As I mentioned before, the json package uses parsec underneath, so I was able to reuse some basic definitions from there — most notably, p_string, which parses a JSON string. This is fortunate, because handling escape sequences is not straightforward, and I’d rather use a well-tested existing implementation.
string :: Parser String
string = {- copied from Text.JSON.Parsec -}
I introduced one other combinator, theString, which parses a given string:
theString :: String -> Parser ()
theString str = (<?> "\"" ++ str ++ "\"") $ try $ do
  str' <- string
  if str == str'
    then return ()
    else empty
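For illustration, here is how theString behaves in a hypothetical test (assuming parsec’s parse function; the exact error wording may differ):

parse (theString "book") "" "\"book\""  -- Right ()
parse (theString "book") "" "\"song\""  -- Left … (a parse error expecting "book")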
Objects
Objects are an interesting case because we know what set of fields to expect, but not the order in which they come (it may be arbitrary). Such syntax is known as a «permutation phrase», and can be parsed as described in the classical paper Parsing Permutation Phrases by Arthur Baars, Andres Löh and Doaitse Swierstra.
There are surprisingly many implementations of permutation parsing on hackage, including one in parsec itself. Most of them suffer from one or both of the following issues:

1. they use custom combinators which, despite being similar to the Applicative and Alternative operators, have their quirks and require learning;
2. they don’t support permutation phrases with separators, which is obviously required to parse JSON objects. (The technique to parse permutation phrases with separators was described in the original paper, too.)
On the other hand, the action-permutations library by Ross Paterson addresses both of these issues. It provides the familiar Applicative interface to combine permutation elements (or atoms, as it calls them), and includes the function runPermsSep to parse phrases with separators. The interface is also very generic, requiring the underlying functor to be just Alternative.
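For reference, here are the rough shapes of the action-permutations functions used below, as inferred from their usage in this article (modulo class constraints; the package documentation is authoritative):

atom        :: p a -> Perms p a          -- lift a parser to a permutation atom
maybeAtom   :: p a -> Perms p (Maybe a)  -- an atom that may be absent
runPermsSep :: p sep -> Perms p a -> p a -- run, parsing a separator between atoms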
Below are the combinators for parsing JSON objects.
field parses a single object field (or member, as it’s called in the JSON spec), using the supplied parser to parse the field’s value. optField is similar, except it returns Nothing if the field is absent (in which case field would produce an error message). Finally, theField is a shortcut to parse a field with fixed contents. It is useful when there’s a tag-like field identifying the type/class of the object, for instance
data Item
  = Book
      String -- writer
  | Song
      String -- composer
      String -- singer

item =
  (try . object $
    Book <$
      theField "type" "book" <*>
      field "writer" string)
  <|>
  (try . object $
    Song <$
      theField "type" "song" <*>
      field "composer" string <*>
      field "singer" string)
(Note: examples in this article have nothing to do with the JSON schema we actually use at Signal Vine.)
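To see these combinators in action, here is a hypothetical session (assuming parsec’s parse and a Show instance for Item; the error wording is approximate):

parse item "" "{\"writer\": \"Leo Tolstoy\", \"type\": \"book\"}"
  -- Right (Book "Leo Tolstoy")
parse item "" "{\"type\": \"book\"}"
  -- Left … (a parse error about the missing "writer" field)

Note that the fields are accepted in either order; that is what the permutation machinery buys us.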
One thing to pay attention to is how field parsers (field, theField and optField) have a different type from the ordinary parsers. This makes it much easier to reason about what actually gets permuted.
object :: Perms Parser a -> Parser a
object fields = (<?> "JSON object") $
  between (tok (char '{')) (tok (char '}')) $
    runPermsSep (tok (char ',')) fields

-- common function used by field and optField
field' :: String   -- key name
       -> Parser a -- value parser
       -> Parser a
field' key value = theString key *> tok (char ':') *> value

field :: String   -- key name
      -> Parser a -- value parser
      -> Perms Parser a
field key value = atom $ field' key value

theField :: String -- key name
         -> String -- expected value
         -> Perms Parser ()
theField key value = () <$ field key (theString value)

optField :: String   -- key name
         -> Parser a -- value parser
         -> Perms Parser (Maybe a)
optField key value = maybeAtom $ field' key value
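The code above also relies on a tok helper that is not shown in this article. It is presumably a lexeme-style wrapper in the spirit of the one in Text.JSON.Parsec; a minimal sketch:

-- skip whitespace after a token, so that token parsers compose cleanly
tok :: Parser a -> Parser a
tok p = p <* spaces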
Aside: correct separator parsing
There was only one issue I ran into with action-permutations, and it is interesting enough that I decided to describe it here in more detail.

Consider, for example, the expression

runPermsSep sep (f <$> atom a <*> atom b <*> atom c)
It would expand to
flip ($) <$> a <*> (
  (sep *> (flip ($) <$> b <*>
    (sep *> (flip ($) <$> c <*>
      pure (\xc xb xa -> f xa xb xc)))))
  <|>
  (sep *> (flip ($) <$> c <*>
    (sep *> (flip ($) <$> b <*>
      pure (\xb xc xa -> f xa xb xc)))))
)
<|>
flip ($) <$> b <*> (
  ...
)
<|>
flip ($) <$> c <*> (
  ...
)
See the problem? Suppose the actual order of the atoms in the input stream is a, c, b. At the beginning the parser is lucky to enter the right branch (the one starting from flip ($) <$> a <*> ...) on the first guess. After that, it has two alternatives: b-then-c, or c-then-b. First it enters the b-then-c branch (i.e. the wrong one) and fails. However, it fails after having consumed some input (namely, the separator) — which in libraries like parsec and trifecta means that the other branch (the right one) won’t be considered.
We cannot even work around this outside of the library by using try, because we can’t insert it in the right place. E.g. wrapping the separator in try won’t work. The right place to insert try would be around the whole alternative

sep *> (flip ($) <$> b <*>
  (sep *> (flip ($) <$> c <*>
    pure (\xc xb xa -> f xa xb xc))))

but this piece is generated by the library and, as a library user, we have no control over it.
The usage of try inside the library itself is unsatisfactory, too. Remember, the interface only assumes the Alternative instance, which has no notion of try. If we had to make it less generic by imposing a Parsing constraint, that would be really unfortunate.
Fortunately, once identified, this problem is not hard to fix properly — and no usage of try is required! All we need is to change runPermsSep so that it expands to the tree where separator parsing is factored out:
flip ($) <$> a <*> (sep *> (
  (flip ($) <$> b <*> (sep *>
    (flip ($) <$> c <*>
      pure (\xc xb xa -> f xa xb xc))))
  <|>
  (flip ($) <$> c <*> (sep *>
    (flip ($) <$> b <*>
      pure (\xb xc xa -> f xa xb xc))))
))
<|>
flip ($) <$> b <*> (sep *> (
  ...
))
<|>
flip ($) <$> c <*> (sep *> (
  ...
))
Now, all alternatives start with atoms, so we have full control over whether they consume any input.
Mathematically, this demonstrates that <*> does not distribute over <|> for some backtracking parsers. Note that such a distributive property is not required by the Alternative class.
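A minimal parsec session illustrating the failure of distributivity (a hypothetical example; the error wording is approximate):

parse (char ',' *> (char 'b' <|> char 'c')) "" ",c"
  -- Right 'c'
parse ((char ',' *> char 'b') <|> (char ',' *> char 'c')) "" ",c"
  -- Left … (the first branch consumes ',' and fails, so the second is never tried)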
Even for parser monads that allow backtracking by default (attoparsec, polyparse) and for which there’s no semantic difference between the two versions, this change improves efficiency by sharing separator parsing across branches.
My patch fixing the issue has been incorporated into the version 0.0.0.1 of action-permutations.
Arrays
Arrays should be easier to parse than objects in that the order of the elements is fixed. Still, we need to handle separators (commas) between array elements.
If we interpreted arrays as lists, then the schema combinator for arrays might look like
array :: Parser a   -- parser for a single element
      -> Parser [a] -- parser for the array
Implementation would be straightforward, too.
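For instance, a minimal sketch in terms of parsec’s sepBy, assuming the tok helper from above:

array :: Parser a -> Parser [a]
array p = between (tok (char '[')) (tok (char ']')) $
  p `sepBy` tok (char ',')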
However, in our JSON schema we use arrays as tuples rather than lists. That is, we typically expect an array of a fixed number of heterogeneous elements. Thus we’d like to combine these tuple elements into a single parser using the applicative interface.
Let’s say we expect a 2-tuple of a string (a person’s name) and an object (that person’s address).
data Address -- = ...

data Person =
  Person
    String -- name
    Address
Written by hand, the parser may look like
address :: Parser Address
address = _

personAddress =
  between (tok (char '[')) (tok (char ']')) $
    Person <$> string <* sep <*> address
  where sep = tok $ char ','
It makes sense to move brackets parsing to the array combinator:
array :: Parser a -> Parser a
array p = (<?> "JSON array") $
  between (tok (char '[')) (tok (char ']')) p
But what should we do with the commas? Manually interspersing elements with separators is error-prone and doesn’t correspond to my view of a high-level JSON schema description.
Inserting comma parsers automatically isn’t impossible — after all, it is done in the action-permutations package, and we used it to parse object fields, which are comma-separated, too. But it cannot be as easy as adding a separator to every element, since there is one fewer separator than there are elements. We somehow have to detect the last element and not expect a separator after it.
A nice and simple way to achieve this is with a free applicative functor. A free applicative functor will allow us to capture the whole applicative expression and postpone the decision on where to insert separator parsers until we can tell which element is the last one. In this case we’ll use Twan van Laarhoven’s free applicative, as implemented in the free package.
element :: Parser a -> Ap Parser a
element = liftAp

theArray :: Ap Parser a -> Parser a
theArray p = between (tok (char '[')) (tok (char ']')) $ go p
  where
    go :: Ap Parser a -> Parser a
    go (Ap p (Pure f)) = f <$> p
    go (Ap p1 pn) = flip id <$> p1 <* tok (char ',') <*> go pn
Like object fields, array elements have a special type which makes it clear which pieces exactly are comma-separated.
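For example, the hand-written personAddress parser from above can now be expressed as follows (a sketch, reusing the earlier Person and address definitions):

personAddress :: Parser Person
personAddress = theArray $
  Person <$> element string <*> element address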
In fact, the applicative functor Perms is essentially the free applicative functor Ap plus branching.
Optional array elements
Now comes the twist. Some of the array elements may be optional — in the same way as positional function arguments in some languages may be optional. Since the elements are positional, if one of them is omitted, all subsequent ones have to be omitted, too — otherwise we won’t be able to tell which one was omitted, exactly.
For that reason, all optional elements should come after all the non-optional ones; if not, then we’ve made a mistake while designing (or describing) our schema. Ideally, our solution should catch such mistakes, too.
So, how can the above solution be adapted to handle optional arguments?
Attempt #1:
optElement :: Parser a -> Ap Parser (Maybe a)
optElement p = element $ optional p
Here optional is a combinator defined in Control.Applicative as

optional v = Just <$> v <|> pure Nothing
This won’t work at all, as it doesn’t give us any information about whether it’s an optional element or just a parser that happens to return a Maybe type.
Attempt #2:
Well, let’s just add a flag indicating whether the element was created using optElement, shall we?
data El a =
  El
    Bool -- is optional?
    (Parser a)

element :: Parser a -> Ap El a
element = liftAp . El False

optElement :: Parser a -> Ap El (Maybe a)
optElement = liftAp . El True . optional
Now we can check that optional arguments come after non-optional ones. If an element’s parse result is Nothing, we also know whether that element is an optional one, and whether we should stop trying to parse the subsequent elements.
Still, there are two related issues preventing this version from working:
1. How do we actually know when a parser returns Nothing? Once we lift a parser into the free applicative, its return type becomes existentially quantified, i.e. we should treat it as polymorphic and cannot assume it has the form Maybe a (by pattern-matching on it), even if we can convince ourselves by looking at the Bool flag that it indeed must be of that form.
2. Similarly, once we’ve detected an absent optional element (assuming for a second that it is possible), we have to force all the remaining optional parsers to return Nothing without parsing anything. But again, we cannot convince the compiler that Nothing is an acceptable return value of those parsers.
Attempt #3:
So, we need certain run-time values (the optionality flag) to introduce type-level information (namely, that the parser’s return type has the form Maybe a). That’s exactly what GADTs do!
data El a where
El :: Parser a -> El a
OptEl :: Parser a -> El (Maybe a)
El’s two constructors are a more powerful version of our old Bool flag. They let us see whether the element is optional, and if so, guarantee that its parser’s return type is Maybe-ish.
And here’s the code for the parsing functions:
element :: Parser a -> Ap El a
element = liftAp . El

optElement :: Parser a -> Ap El (Maybe a)
optElement = liftAp . OptEl

theArray :: Ap El a -> Parser a
theArray p = between (tok (char '[')) (tok (char ']')) $
  go True False False p
  where
    go :: Bool -> Bool -> Bool -> Ap El a -> Parser a
    go _ _ _ (Pure x) = pure x
    go isFirst optionalOccurred optionalOmitted (Ap el1 eln) =
      let
        eltSequenceError :: a
        eltSequenceError =
          error "theArray: a non-optional element after an optional one"

        !_check =
          case el1 of
            El {} | optionalOccurred -> eltSequenceError
            _ -> ()
      in
        if optionalOmitted
          then
            case el1 of
              El {} -> eltSequenceError
              OptEl {} -> go False True True eln <*> pure Nothing
          else do
            let
              sep = if isFirst then pure () else () <$ tok (char ',')
            case el1 of
              El p1 ->
                flip id <$ sep <*> p1 <*> go False False False eln
              OptEl p1 -> do
                r1 <- optional $ sep *> p1
                go False True (isNothing r1) eln <*> pure r1
theArray is a state machine with three pieces of state: isFirst, optionalOccurred and optionalOmitted. isFirst and optionalOmitted are used to guide actual parsing, while optionalOccurred is needed to check the proper arrangement of optional vs. non-optional arguments.
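To illustrate, a person record whose address may be omitted could be described like this (a hypothetical example building on the earlier definitions):

data PersonOptAddr =
  PersonOptAddr
    String          -- name
    (Maybe Address) -- address; may be absent

personOptAddr :: Parser PersonOptAddr
personOptAddr = theArray $
  PersonOptAddr <$> element string <*> optElement address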
Conclusions
Although the standard approach to JSON parsing is to parse into a generic JSON representation first, the article shows that an alternative approach — parsing the expected structure directly — is also viable and can be employed to improve error reporting.
Of course, the tricks and ideas described here are not specific to JSON. Understanding how they work and how to use them may come in handy in a variety of parsing situations.