Roman Cheplyaka

Smarter validation

Published on May 2, 2015; tags: Haskell

Today we’ll explore different ways of handling and reporting errors in Haskell. We shall start with the well-known Either monad, proceed to a somewhat less common Validation applicative, and then improve its efficiency and user experience.

The article contains several exercises that will hopefully help you better understand the issues that are being addressed here.

Running example

{-# LANGUAGE GeneralizedNewtypeDeriving, KindSignatures, DataKinds,
             ScopedTypeVariables, RankNTypes, DeriveFunctor #-}
import Text.Printf
import Text.Read
import Control.Monad
import Control.Applicative
import Control.Applicative.Lift (Lift)
import Control.Arrow (left)
import Data.Functor.Constant (Constant)
import Data.Monoid
import Data.Traversable (sequenceA)
import Data.List (intercalate, genericTake, genericLength)
import Data.Proxy
import System.Exit
import System.IO
import GHC.TypeLits

Our running example will consist of reading a list of integer numbers from a file, one number per line, and printing their sum.

Here’s the simplest way to do this in Haskell:

printSum1 :: FilePath -> IO ()
printSum1 path = print . sum . map read . lines =<< readFile path

This code works as expected for a well-formed file; however, if a line in the file can’t be parsed as a number, we’ll get unhelpful

Prelude.read: no parse

Either monad

Let’s rewrite our function to be aware of possible errors.

parseNum
  :: Int -- line number (for error reporting)
  -> String -- line contents
  -> Either String Integer
     -- either parsed number or error message
parseNum ln str =
  case readMaybe str of
    Just num -> Right num
    Nothing -> Left $
      printf "Bad number on line %d: %s" ln str

-- Print a message and exit
die :: String -> IO ()
die msg = do
  hPutStrLn stderr msg
  exitFailure

printSum2 :: FilePath -> IO ()
printSum2 path =
  either die print .
  liftM sum .
  sequence . zipWith parseNum [1..] .
  lines =<< readFile path

Now, upon reading a line that is not a number, we’d see something like

Bad number on line 2: foo

This is a rather standard usage of the Either monad, so I won’t get into details here. I’ll just note that there are two ways in which this version is different from the first one:

  1. We call readMaybe instead of read and, upon detecting an error, construct a helpful error message. For this reason, we keep track of the line number.
  2. Instead of throwing a runtime exception right away (using the error function), we return a pure Either value, and then combine these Eithers together using the Moand Either isntance.

The two changes are independent; there’s no reason why we couldn’t use error and get the same helpful error message. The exceptions emulated by the Either monad have the same semantics here as the runtime exceptions. The benefit of the pure formulation is that the semantics of runtime exceptions is built-in; but the semantics of the pure data is programmable, and we will take advantage of this fact below.

Validation applicative

You get a thousand-line file with numbers from your accountant. He asks you to sum them up because his enterprise software crashes mysteriously when trying to read it.

You accept the challenge, knowing that your Haskell program won’t let you down. The program tells you

Bad number on line 378: 12o0

— I see! Someone put o instead of zero. Let me fix it.

You locate the line 378 in your editor and replace 12o0 with 1200. Then you save the file, exit the editor, and re-run the program.

Bad number on line 380: 11i3

— Come on! There’s another similar mistake just two lines below. Except now 1 got replaced by i. If you told me about both errors from the beginning, I could fix them faster!

Indeed, there’s no reason why our program couldn’t try to parse every line in the file and tell us about all the mistakes at once.

Except now we can’t use the standard Monad and Applicative instances of Either. We need the Validation applicative.

The Validation applicative combines two Either values in such a way that, if they are both Left, their left values are combined with a monoidal operation. (In fact, even a Semigroup would suffice.) This allows us to collect errors from different lines.

newtype Validation e a = Validation { getValidation :: Either e a }
  deriving Functor

instance Monoid e => Applicative (Validation e) where
  pure = Validation . Right
  Validation a <*> Validation b = Validation $
    case a of
      Right va -> fmap va b
      Left ea -> either (Left . mappend ea) (const $ Left ea) b

The following example demonstrates the difference between the standard Applicative instance and the Validation one:

> let e1 = Left "error1"; e2 = Left " error2"
> e1 *> e2
Left "error1"
> getValidation $ Validation e1 *> Validation e2
Left "error1 error2"

A clever implementation of the same applicative functor exists inside the transformers package. Ross Paterson observes that this functor can be constructed as

type Errors e = Lift (Constant e)

(see Control.Applicative.Lift).

Anyway, let’s use this to improve our summing program.

printSum3 :: FilePath -> IO ()
printSum3 path =
  either (die . intercalate "\n") print .
  liftM sum .
  getValidation . sequenceA .
  map (Validation . left (\e -> [e])) .
  zipWith parseNum [1..] .
  lines =<< readFile path

Now a single invocation of the program shows all the errors it can find:

Bad number on line 378: 12o0
Bad number on line 380: 11i3

Exercise. Could we use Writer [String] to collect error messages?

Exercise. When appending lists, there is a danger of incurring quadratic complexity. Does that happen in the above function? Could it happen in a different function that uses the Validation applicative based on the list monoid?

Smarter Validation applicative

Next day your accountant sends you another thousand-line file to sum up. This time your terminal gets flooded by error messages:

Bad number on line 1: 27297.
Bad number on line 2: 11986.
Bad number on line 3: 18938.
Bad number on line 4: 22820.
...

You already see the problem: every number ends with a dot. This is trivial to diagnose and fix, and there is absolutely no need to print a thousand error messages.

In fact, there are two different reasons to limit the number of reported errors:

  1. User experience: it is unlikely that the user will pay attention to more than, say, 10 messages at once. If we try to display too many errors on a web page, it may get slow and ugly.
  2. Efficiency: if we agree it’s only worth printing the first 10 errors, then, once we gather 10 errors, there is no point processing the data further.

Turns out, each of the two goals outlined above will need its own mechanism.

Bounded lists

We first develop a list-like datatype which stores only the first n elements and discards anything else that may get appended. This primarily addresses our first goal, user experience, although it will be handy for achieving the second goal too.

Although for validation purposes we may settle with the limit of 10, it’s nice to make this a generic, reusable type with a flexible limit. So we’ll make the limit a part of the type, taking advantage of the type-level number literals.

Exercise. Think of the alternatives to storing the limit in the type. What are their pros and cons?

On the value level, we will base the new type on difference lists, to avoid the quadratic complexity issue that I allude to above.

data BoundedList (n :: Nat) a =
  BoundedList
    !Integer -- current length of the list
    (Endo [a])

Exercise. Why is it important to cache the current length instead of computing it from the difference list?

Once we’ve figured out the main ideas (encoding the limit in the type, using difference lists, caching the current length), the actual implementation is straightforward.

singleton :: KnownNat n => a -> BoundedList n a
singleton a = fromList [a]

toList :: BoundedList n a -> [a]
toList (BoundedList _ (Endo f)) = f []

fromList :: forall a n . KnownNat n => [a] -> BoundedList n a
fromList lst = BoundedList (min len limit) (Endo (genericTake limit lst ++))
  where
    limit = natVal (Proxy :: Proxy n)
    len = genericLength lst

instance KnownNat n => Monoid (BoundedList n a) where
  mempty = BoundedList 0 mempty
  mappend b1@(BoundedList l1 f1) (BoundedList l2 f2)
    | l1 >= limit = b1
    | l1 + l2 <= limit = BoundedList (l1 + l2) (f1 <> f2)
    | otherwise = BoundedList limit (f1 <> Endo (genericTake (limit - l1)) <> f2)
    where
      limit = natVal (Proxy :: Proxy n)

full :: forall a n . KnownNat n => BoundedList n a -> Bool
full (BoundedList l _) = l >= natVal (Proxy :: Proxy n)

null :: BoundedList n a -> Bool
null (BoundedList l _) = l <= 0

SmartValidation

Now we will build the smart validation applicative which stops doing work when it doesn’t make sense to collect errors further anymore. This is a balance between the Either applicative, which can only store a single error, and Validation, which collects all of them.

Implementing such an applicative functor is not as trivial as it may appear at first. In fact, before reading the code below, I recommend doing the following

Exercise. Try implementing a type and an applicative instance for it which adheres to the above specification.

Did you try it? Did you succeed? This is not a rhetorical question, I am actually interested, so let me know. Is your implementation the same as mine, or is it simpler, or more complicated?

Alright, here’s my implementation.

newtype SmartValidation (n :: Nat) e a = SmartValidation
  { getSmartValidation :: forall r .
      Either (BoundedList n e) (a -> r) -> Either (BoundedList n e) r }
  deriving Functor

instance KnownNat n => Applicative (SmartValidation n e) where
  pure x = SmartValidation $ \k -> k <*> Right x
  SmartValidation a <*> SmartValidation b = SmartValidation $ \k ->
    let k' = fmap (.) k in
    case a k' of
      Left errs | full errs -> Left errs
      r -> b r

And here are some functions to construct and analyze SmartValidation values.

-- Convert SmartValidation to Either
fatal :: SmartValidation n e a -> Either [e] a
fatal = left toList . ($ Right id) . getSmartValidation

-- Convert Either to SmartValidation
nonFatal :: KnownNat n => Either e a -> SmartValidation n e a
nonFatal a = SmartValidation $ (\k -> k <+> left singleton a)

-- like <*>, but mappends the errors
(<+>)
  :: Monoid e
  => Either e (a -> b)
  -> Either e a
  -> Either e b
a <+> b = case (a,b) of
  (Right va, Right vb) -> Right $ va vb
  (Left e,   Right _)  -> Left e
  (Right _,  Left e)   -> Left e
  (Left e1,  Left e2)  -> Left $ e1 <> e2

Exercise. Work out what fmap (.) k does in the definition of <*>.

Exercise. In the definition of <*>, should we check whether k is full before evaluating a k'?

Exercise. We developed two mechanisms — BoundedList and SmartValidation, which seem to do about the same thing on different levels. Would any one of these two mechanisms suffice to achieve both our goals, user experience and efficiency, when there are many errors being reported?

Exercise. If the SmartValidation applicative was based on ordinary lists instead of difference lists, would we be less or more likely to run into the quadratic complexity problem compared to simple Validation?

Conclusion

Although the Validation applicative is known among Haskellers, the need to limit the number of errors it produces is rarely (if ever) discussed. Implementing an applicative functor that limits the number of errors and avoids doing extra work is somewhat tricky. Thus, I am happy to share my solution and curious about how other people have dealt with this problem.