Parsing Recipe Pattern

Recently I figured out a pattern for quickly creating record parsers that utilises generics and HKTs: the Parsing Recipe Pattern. I haven’t seen it described before, and I think it might be useful and interesting to people, so I’m sharing it.

TL;DR: I didn’t like how Cassava/Aeson forced me to proliferate newtype wrappers for different kinds of date formats. The recipe pattern lets me specify a different date-parsing format for each record type, without newtype wrappers, while keeping generic derivation of record parsers.
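To make this concrete before the discussion below, here is a minimal sketch of the idea. The names Row, FieldParser, isoRecipe and cook are just placeholders for this sketch, and cook is written by hand here, whereas the actual pattern derives it generically:

-- The record is parameterized by a functor f; a "recipe" is the
-- record filled with per-field parsers.
import Data.Functor.Identity (Identity (..))
import Data.Time (Day, defaultTimeLocale, parseTimeM)
import Text.Read (readMaybe)

newtype FieldParser a = FieldParser { runFieldParser :: String -> Maybe a }

data Row f = Row
  { rowName   :: f String
  , rowAmount :: f Double
  , rowDate   :: f Day
  }

-- One recipe per source: a source with another date format just needs
-- another recipe, no newtype wrappers.
isoRecipe :: Row FieldParser
isoRecipe = Row
  { rowName   = FieldParser Just
  , rowAmount = FieldParser readMaybe
  , rowDate   = FieldParser (parseTimeM True defaultTimeLocale "%Y-%m-%d")
  }

-- "cook" glues the field parsers into a parser of whole rows.
cook :: Row FieldParser -> [String] -> Maybe (Row Identity)
cook r [n, a, d] =
  Row <$> (Identity <$> runFieldParser (rowName r) n)
      <*> (Identity <$> runFieldParser (rowAmount r) a)
      <*> (Identity <$> runFieldParser (rowDate r) d)
cook _ _ = Nothing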

3 Likes

This Haskell parsing pattern addresses the newtype proliferation problem that arises with typeclass-based generic parsers like Cassava or Aeson.

You are absolutely correct that typeclasses are bad for (de)serialization, because most instances are ambiguous.

Megaparsec is a swiss-army knife. … That’s tedious and boilerplate’y. If our record type’s structure already follows its CSV row structure, then couldn’t we somehow autogenerate a parser?

Okay, let’s zoom out a bit. The two approaches you use as examples are a token parser and automatic deriving. It’s enough to get the point across, but in my opinion it’s myopic.

The reason automatic deriving is a wanted feature is that it aligns the inputs and outputs of the two serialization functions symmetrically. You don’t need to test this, it works out of the box, and that’s cool.

However:

  • For datatypes that only need to be converted in one direction, it doesn’t buy you anything. Handrolling a parser would take just as many lines of code, and RecordWildCards would tell you if you missed anything (see the sketch after this list);

  • There exists a myriad of cases that do not map onto automatic deriving. Fundamentally, the problem is that decoding is a collection of choices, while encoding has only one correct solution. You can try to mush the two together (tomland did), but it can’t cover every single side case (tomland's parser is Applicative, so you can’t make choices based on previously parsed entries);

  • In certain cases, even if you can derive, you shouldn’t. One such example is external production APIs, where slightly adjusting the external format shouldn’t require you to realign your Haskell datatypes.
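As an aside, here is the kind of hand-rolled decoder the first point has in mind; User is a made-up example type:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}
-- About as short as a derived instance; with warnings on, forgetting
-- one of the field bindings makes the {..} construction complain.
import Data.Aeson (FromJSON (..), withObject, (.:))

data User = User
  { userName :: String
  , userAge  :: Int
  }

instance FromJSON User where
  parseJSON = withObject "User" $ \o -> do
    userName <- o .: "name"
    userAge  <- o .: "age"
    pure User {..}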

The answer to all of this is not automatic deriving, it’s ergonomic handrolling. Libraries, for whatever reason, completely forgo any due diligence on this and as a result using them in production sucks.

I spent two months writing a JSON parser just to prove to myself that I’m correct on this and I believe the answer is a resounding “yes”. If you’re ever going to write your own parser for any format, please do better than any of the libraries you mentioned in your post.

3 Likes

Some time ago I considered building a library for configuration file parsing based on this idea. In theory one could generate an .ini file parser as well as a command-line option parser from the same HKD record of field parsers. However, for the particular problem of configuration there is little to be gained here that cannot already be done in Dhall.
I expect that in practice one would end up with multiple cook functions to perform the record Parser -> Parser (record f) traversal, for whatever f suits the specific problem domain. Why? Because, as noted in the post, one always needs some extra “glue” to join the parser fragments.

Formats such as CSV or JSON are straightforward in the sense that once all the field parsers are given, there is pretty much only one way to glue them together. That is why type classes such as FromJSON1 can exist. (But the straightforwardness falls apart on second look, see Cassava and Aeson.) If one could come up with a good and generic glue data type that can be mixed into the recipe, like

cook :: record Parser -> glue Parser -> Parser (record f)

then I’d be interested.

It is an interesting idea to create a record of parsers and “traverse” it to get a parser of records.
I’ve also been parsing bank statements (from different banks), but in that case, instead of trying to parse different formats into one record type, I found it much easier to have one data type per format, use automatic parsing instances, and then convert each format to a final (unified) data type.

So for example, let’s say we have two similar formats, one with a name and date and the other with a title and day. I would have 3 types:

data NameAndDate = NameAndDate
        { name :: Text
        , date :: Day
        }  -- deriving Cassava

data TitleAndDay = TitleAndDay
        { title :: Text
        , day :: Day
        } -- deriving Cassava

and the final type (the one to work with)

data Transaction = Transaction
    { tranName :: Text
    , tranDate :: Day
    }

and some trivial converters

nameAndDateToTransaction x = Transaction (name x) (date x)
titleAndDayToTransaction x = Transaction (title x) (day x)

The problem with this approach is that you can only derive the Cassava or Aeson instance automatically if the field names to parse are valid Haskell identifiers, which is pretty much never the case. In the case of NameAndDate you’ll parse a CSV with columns name and date, but you can’t have Name or transaction name etc …

Ideally we would need an easy way to specify an external field name. This could be done via a type annotation or maybe an external typeclass.
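For simple renamings there is in fact an escape hatch that keeps generic deriving: Aeson’s Options has a fieldLabelModifier (Cassava’s generic machinery has a similar Options type). A sketch, with a made-up name table and simplified field types:

{-# LANGUAGE DeriveGeneric #-}
-- Keep generic deriving, but map Haskell field names to the external
-- key names via fieldLabelModifier.
import Data.Aeson
import Data.Maybe (fromMaybe)
import GHC.Generics (Generic)

data NameAndDate = NameAndDate
  { name :: String
  , date :: String
  } deriving (Generic)

-- Haskell field name -> external name.
externalName :: String -> String
externalName f = fromMaybe f (lookup f [("name", "transaction name")])

instance FromJSON NameAndDate where
  parseJSON = genericParseJSON defaultOptions {fieldLabelModifier = externalName}

It only solves renaming, though, not per-field parsing choices.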

The idea of “traversing” a record is not new and can be generalized to any Applicative.
Some “record” packages like vinyl provide rtraverse; the barbies package provides an equivalent (btraverse).
I rolled my own package (metamorphosis) for that and used it to traverse the result of a parse itself (as in MyRecord Result -> Result (MyRecord Identity)), but not with a parser itself, which seems a good idea.

1 Like

One could easily solve the field-name mismatch by using Compose ((,) String) Parser instead of plain Parser, similar to what the (.:) function in Aeson does. Which goes to show that a custom cook function for combining the field parsers is almost always needed.
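A rough sketch of that, with a toy Parser and a hypothetical two-field record (all names made up):

-- Each field parser carries its external name; the custom cook looks
-- the name up in the row, then runs the field's parser on the cell.
import Data.Functor.Compose (Compose (..))
import Data.Functor.Identity (Identity (..))
import qualified Data.Map as Map
import Text.Read (readMaybe)

newtype Parser a = Parser { runParser :: String -> Maybe a }

type Named = Compose ((,) String) Parser

data Row f = Row { rowName :: f String, rowAge :: f Int }

recipe :: Row Named
recipe = Row
  { rowName = Compose ("transaction name", Parser Just)
  , rowAge  = Compose ("age", Parser readMaybe)
  }

-- Look up the external name, then run the field's parser on the cell.
field :: Named a -> Map.Map String String -> Maybe a
field (Compose (key, p)) row = runParser p =<< Map.lookup key row

cook :: Row Named -> Map.Map String String -> Maybe (Row Identity)
cook r row =
  Row <$> (Identity <$> field (rowName r) row)
      <*> (Identity <$> field (rowAge r) row)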

Good ideas tend to get re-invented over and over.

1 Like

I’ve experimented in a similar direction in my by-other-names library, in particular the ByOtherNamesH and ByOtherNamesH.Aeson modules, although I haven’t used them a lot in practice.

I wanted to achieve two things:

  • Be able to provide aliases to the record’s fields.
  • Start with a “vanilla” record, without having to implement the “fields-wrapped-in-functor” version manually.

Usage looks like this:

data Foo = Foo {aa :: Int, bb :: Bool}
  deriving stock (Read, Show, Eq, Generic)
  deriving (FromJSON, ToJSON) via (JSONRecord "obj" Foo)

instance Aliased JSON Foo where
  aliases =
    aliasListBegin
      . alias @"aa" "aax" (singleSlot fromToJSON)
      . alias @"bb" "bbx" (singleSlot fromToJSON)
      $ aliasListEnd

Where fromToJSON is a function supplied by the user which specifies how to serialize/parse a single field.

1 Like

That is a really cool idea, but doesn’t having to specify singleSlot fromToJSON defeat the object? The ideal would be to specify a list of aliases and use the generic JSON instance, wouldn’t it?

You mean the generic JSON instance of each field? Well, singleSlot fromToJSON doesn’t really defeat the object there, it’s just that it’s unnecessarily verbose if you always want to use the standard JSON instances (modules ByOtherNames and ByOtherNames.Aeson don’t require singleSlot fromToJSON, unlike the -H versions of the modules).

However, being able to manually specify the serializing/parsing function for a field solves an issue mentioned in the OP:

The big drawback to these high-level libraries is that they are based on type classes. We tie a type to how it’s parsed. For record types that’s fine, but it’s not so much for primitives.

We are polluting core business logic with different day types that only differ in how they are parsed. That’s neither ergonomic nor elegant.

I believe one disadvantage of my ByOtherNamesH solution compared to the OP solution is that I don’t allow modifying a “higher-order” field after construction. They must be given all in one go. In the case of the OP solution, we can take a higher-order record and easily set a field, because, at the end of the day, it’s just another record, not a weird Aliases construct. That might make reuse easier.
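For instance, reusing the hypothetical Row/FieldParser recipe sketched earlier in the thread, overriding one field parser is just ordinary record update:

-- The recipe is a plain record, so swapping the date parser for a
-- different format is a one-line update (definitions as in the
-- earlier sketch).
usRecipe :: Row FieldParser
usRecipe = isoRecipe
  { rowDate = FieldParser (parseTimeM True defaultTimeLocale "%m/%d/%Y") }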

1 Like

We are trying to reduce boilerplate here, so verbosity matters. Writing your own JSON or Cassava instances is relatively terse on its own, so if I need non-standard JSON instances I’ll probably just write them manually. The problem is that if you need both JSON and Cassava, you have to write the same code 4 times, which is annoying when the only reason to do so is that the identifiers to parse are not valid Haskell names.

However, I realize now that ByOtherNames provides just simple aliasing, as in

data Foo = Foo {aa :: Int, bb :: Bool, cc :: Char}
  deriving stock (Read, Show, Eq, Generic)
  deriving (FromJSON, ToJSON) via (JSONRecord "obj" Foo)
instance Aliased JSON Foo where
  aliases =
    aliasListBegin
      $ alias @"aa" "aax"
      $ alias @"bb" "bbx"
      $ alias @"cc" "ccx"
      $ aliasListEnd

This is great. Maybe that’s the example to show when advertising the package.

Yesterday I learned that datatypes with a functor parameter are called HKD (higher-kinded data), which come with the handy gtraverse function with this f*ck-off signature:

forall t f g m.
  ( Applicative m
  , Generic (t f)
  , Generic (t g)
  , GFTraversable (Curried (Yoneda m)) f g (Rep (t f)) (Rep (t g))
  )
  => (forall a. f a -> m (g a)) -> t f -> m (t g)

Does anyone know what Yoneda has to do with this?

I believe it’s used to optimize the traversal using an “applicative difference list”: Lysxia - Generic traversals with applicative difference lists
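For comparison, the un-optimized shape of the same operation, as exposed by btraverse in the barbies package (Data.Functor.Barbie):

-- Same traversal, minus the Yoneda/Curried plumbing:
btraverse :: (TraversableB b, Applicative e)
          => (forall a. f a -> e (g a)) -> b f -> e (b g)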

2 Likes

I suck at advertising: construct: Haskell version of the Construct library for easy specification of file formats

3 Likes