I have long been dissatisfied with the quality of parsing libraries in Haskell, so over time I’ve narrowed down what it is exactly that I’m looking for. It turned out to be the following:
-
The formats I want to parse are simple enough to not need lexical analysis from the get-go, so the parser should not perform it;
-
Allocating a continuous block of <insert arbitrary large integer> bytes just to start reading a [potentially malformed] file is extremely lame, the parser should be able to stream input;
-
At times instead of parsing some block of data it would be preferable to copy it verbatim. Such copying should occur with no (or minimal) reallocation, mirroring the reasoning in point (2);
Point (1) excludes anything similar to parsec
. Point (2) excludes any parser that optimized itself into only handling strict input. The only remaining parsers that I know of are attoparsec
and binary
and neither of them handles (3) properly.
I chose binary
as a foundation because unlike attoparsec
it doesn’t do any weird magic under the hood and it suffers from certain design issues that were never resolved.
Writing a parser that fits all three points turned out to be remarkably straightforward. It’s much more powerful yet still fully backwards compatible with the previous binary
API and it’s not even that far off performance-wise (3x worse on the “100k brackets” test case, ~1.1-1.5x worse on every other one).
The only big problem I see with it is that it’s an 11-argument CPS, so every function mentions 20 different arguments while modifying 3 or 4. I wish I knew a way to structure this better.