Could binary be... lazier?

I have long been dissatisfied with the quality of parsing libraries in Haskell, so over time I’ve narrowed down what it is exactly that I’m looking for. It turned out to be the following:

  1. The formats I want to parse are simple enough to not need lexical analysis from the get-go, so the parser should not perform it;

  2. Allocating a contiguous block of <insert arbitrary large integer> bytes just to start reading a [potentially malformed] file is extremely lame; the parser should be able to stream input;

  3. At times, instead of parsing some block of data, it would be preferable to copy it verbatim. Such copying should occur with no (or minimal) reallocation, mirroring the reasoning in point (2).

Point (1) excludes anything similar to parsec. Point (2) excludes any parser that optimized itself into only handling strict input. The only remaining parsers that I know of are attoparsec and binary and neither of them handles (3) properly.
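To make point (2) concrete, here is a minimal base-only sketch (over String rather than ByteString, with illustrative names — not any library's actual API) of the incremental result type such a parser returns: instead of demanding all input up front, it can pause with Partial and ask for the next chunk.

```haskell
-- Minimal sketch of an incremental decode result, in the spirit of
-- binary's incremental interface. String stands in for ByteString;
-- all names here are illustrative.
data Result a
  = Done String a                       -- leftover input, parsed value
  | Partial (Maybe String -> Result a)  -- feed another chunk (Nothing = EOF)
  | Fail String String                  -- leftover input, error message

newtype Get a = Get { runGet :: String -> Result a }

-- A parser that reads one character, pausing when the chunk is exhausted.
anyChar :: Get Char
anyChar = Get go
  where
    go (c : rest) = Done rest c
    go []         = Partial $ \more -> case more of
      Just chunk -> go chunk
      Nothing    -> Fail [] "unexpected end of input"

main :: IO ()
main = case runGet anyChar "" of
  Partial k -> case k (Just "ab") of
    Done rest c -> putStrLn (c : ' ' : rest)
    _           -> putStrLn "parse failed"
  _ -> putStrLn "no Partial requested"
```

Starting the parser on an empty chunk forces it to suspend, and feeding it "ab" afterwards completes the parse with "b" left over.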

I chose binary as a foundation because, unlike attoparsec, it doesn’t do any weird magic under the hood, and because it suffers from certain design issues that were never resolved.

Writing a parser that fits all three points turned out to be remarkably straightforward. It’s much more powerful, yet still fully backwards compatible with the previous binary API, and it’s not even that far off performance-wise (3x worse on the “100k brackets” test case, ~1.1-1.5x worse on every other one).

The only big problem I see with it is that it’s an 11-argument CPS, so every function mentions 20 different arguments while only modifying 3 or 4. I wish I knew a way to structure this better.
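To illustrate the complaint (this is not the PR's actual type — just a three-continuation miniature), a CPS parser accumulates arguments because every capability becomes one more parameter that every combinator must thread through; the real version adds continuations and state for input refills, chunk bookkeeping, and so on:

```haskell
{-# LANGUAGE RankNTypes #-}

-- A deliberately tiny CPS parser: one input argument plus two
-- continuations. Each extra capability (asking for more input, tracking
-- consumed chunks, ...) would become yet another parameter here.
newtype Get a = Get
  { runGet
      :: forall r
       . String                    -- unconsumed input
      -> (String -> String -> r)   -- failure continuation: leftover, message
      -> (String -> a -> r)        -- success continuation: leftover, value
      -> r
  }

anyChar :: Get Char
anyChar = Get $ \inp err ok -> case inp of
  []       -> err [] "end of input"
  c : rest -> ok rest c

main :: IO ()
main = runGet anyChar "xyz"
  (\_ e -> putStrLn e)
  (\rest c -> putStrLn (c : '|' : rest))
```

Even at this size, anyChar has to mention all three extra arguments just to consume one character.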


+1 for your requirements, these all sound like very natural things to wish for.

Would you please elaborate on why e.g. Parsec-style parsers don’t fit your requirement, by giving a (simplified) type signature of such a parser and explaining how that signature contradicts your goals? My understanding is that (neglecting the user state parameter u and the consumed/not consumed distinction) a ParsecT transformer is morally isomorphic to the Scott encoding of

StateT ParserState (Either ParseError) (ParseError,a)

in the Kleisli category of the transformed monad m, where ParserState holds (among other things) the unconsumed input stream. I used to believe that at least streaming could be achieved by a clever choice of m that adds streaming capabilities.
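Unrolling that description into direct style (and eliding the (ParseError, a) success payload, which Parsec uses to carry error information even on successful branches), the type would look roughly like the following base-only sketch — all names are illustrative:

```haskell
-- Direct-style unrolling of "StateT ParserState (Either ParseError) in the
-- Kleisli category of m": state in, state out, possible failure, all
-- under the base monad m. Names here are illustrative, not Parsec's.
type ParseError  = String
newtype ParserState = ParserState { unconsumed :: String }

newtype ParserT m a = ParserT
  { runParserT :: ParserState -> m (Either ParseError (a, ParserState)) }

-- Consume one character, failing on end of input.
item :: Applicative m => ParserT m Char
item = ParserT $ \st -> case unconsumed st of
  []     -> pure (Left "end of input")
  c : cs -> pure (Right (c, ParserState cs))

main :: IO ()
main = do
  r <- runParserT item (ParserState "hi")
  case r of
    Right (c, st) -> putStrLn (c : ' ' : unconsumed st)
    Left e        -> putStrLn e
```

The point of the question above is whether anything in this shape inherently contradicts the three requirements, or whether a suitable m could recover streaming.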

I often deal with data formats where the task is simply to extract some fragments of the byte stream verbatim and assemble them into a new byte stream, e.g. filtering some columns from a CSV file. I would still consider this lexical analysis, though. But can we not always assume that, in order to know which parts to copy verbatim, we must have forced them and held on to the memory until we know where the block ends?
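For concreteness, here is what the CSV case looks like with plain bytestring slices (keepColumns is a made-up helper, not from any library): B.split produces slices that share the original buffer, so nothing is copied until the kept fields are reassembled — which is exactly the situation described, where the whole line stays live until we know where each field ends.

```haskell
-- Illustrative column filtering over one CSV line. B.split returns
-- zero-copy slices into the original buffer; only B.intercalate
-- allocates, when the kept fields are stitched back together.
import qualified Data.ByteString.Char8 as B

keepColumns :: [Int] -> B.ByteString -> B.ByteString
keepColumns wanted line =
  let fields = B.split ',' line  -- slices, no copying yet
  in  B.intercalate (B.pack ",")
        [ f | (i, f) <- zip [0 ..] fields, i `elem` wanted ]

main :: IO ()
main = B.putStrLn (keepColumns [0, 2] (B.pack "a,b,c,d"))
```

No quoting or escaping is handled here — it is only meant to show where the copies do and do not happen.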

My initial thought was that, since they convert input to tokens, they must also lose access to the underlying input stream, but looking at megaparsec that does not seem to be the case (and that’s why it has a proper match function). So, indeed, my first point doesn’t seem to hold.

megaparsec still cannot stream input though, so I think points (2) and (3) suffice.

I don’t know how realistic that is, because streaming input adds an additional possible output to the parser: MoreInput -> Parser a. You can make it unreachable by saying “there will be no more input”, but you can’t remove it.

On top of that, the parser state must be different to support this; that’s why my version has to haul around eight arguments while any other parser needs roughly three.

I think that the Stream class of megaparsec is a quite powerful abstraction. It hides how exactly we split a chunk off the front of an input stream. Splitting off a chunk might be a zero-copy operation such as slicing. The class has an instance for Lazy.ByteString. Hence, would you not consider streaming of input a solved problem? A point of criticism might be that one should also add an m type parameter to the Stream class so that the stream can effectfully fetch more input over a network or from another streaming library’s abstraction:

take1_ :: s -> m (Maybe (Token s, s))

With that in place, one might as well look for a parsing abstraction in a proper streaming library, see e.g. streaming.
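Spelled out, that proposal might look like the following sketch. The take1_ name mirrors megaparsec's real Stream method, but the m parameter for effectful refills is this thread's suggestion, not megaparsec's actual API:

```haskell
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies          #-}

-- Hypothetical effectful Stream class: splitting one token off the front
-- may run an effect in m (e.g. fetching more input over the network).
class Monad m => Stream s m where
  type Token s
  take1_ :: s -> m (Maybe (Token s, s))

-- A pure instance: lists supply their tokens without any effects.
instance Monad m => Stream [a] m where
  type Token [a] = a
  take1_ []       = pure Nothing
  take1_ (x : xs) = pure (Just (x, xs))

main :: IO ()
main = do
  r <- take1_ "abc"
  case r of
    Just (c, rest) -> putStrLn (c : '|' : rest)
    Nothing        -> putStrLn "empty stream"
```

An instance over a network socket or a streaming library's stream type would use the m slot for real effects; the list instance only demonstrates the shape.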

megaparsec’s Stream class has nothing to do with streaming input; it’s their way of overloading possible input types. The library doesn’t do any clever chunk tracking underneath: it simply executes operations and points at new parts of the input, and the garbage collector then sees that the old parts are no longer referenced and reclaims them.

I indeed could use a Stream ByteArray m a instead of the LazyByteString in my implementation, but it wouldn’t fundamentally change anything. The Effect constructor would serve [almost] the same role as Partial does right now, and the runner would morph into something atrocious like

parse :: Get a -> Stream ByteArray m a -> m (Either Error a, Stream ByteArray m a)

While I haven’t understood your implementation (and probably lack the experience to appreciate the design choices you made), I thought you started this thread on a quest to find good abstractions to properly structure your approach. Hence I think we should consider:

  • whether the general StateT LeftoverInput (EitherT ParseError m) abstraction could in principle fit your needs,
  • whether any of the established streaming libraries provide abstractions useful for parsing,
  • whether the problem of consuming or producing a stream can be de-coupled at all from the internal state of the parser and its branch choices, especially in the presence of backtracking.

Perhaps you can use conduit to control the data streaming capabilities separately from how you execute the parser on the resulting data?

It’s backwards from that: I already have a working abstraction in a binary PR. I’m just looking for disagreements on how it should be structured (in which case I can make it better), or suggestions of libraries that already provide all the features I’m looking for (in which case I can use those instead).

I already addressed the question of using streams in a comment above: binary already supports that with the Partial constructor, and I don’t think moving to streams meaningfully changes anything (other than adding a streaming library dependency).