Could binary be... lazier?

I have long been dissatisfied with the quality of parsing libraries in Haskell, so over time I’ve narrowed down what it is exactly that I’m looking for. It turned out to be the following:

  1. The formats I want to parse are simple enough to not need lexical analysis from the get-go, so the parser should not perform it;

  2. Allocating a contiguous block of <insert arbitrary large integer> bytes just to start reading a [potentially malformed] file is extremely lame; the parser should be able to stream input;

  3. At times, instead of parsing some block of data, it would be preferable to copy it verbatim. Such copying should occur with no (or minimal) reallocation, mirroring the reasoning in point (2).

Point (1) excludes anything similar to parsec. Point (2) excludes any parser that optimized itself into only handling strict input. The only remaining parsers that I know of are attoparsec and binary and neither of them handles (3) properly.
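To make point (2) concrete, here is a minimal base-only sketch (over String rather than ByteString, with illustrative names — not any library's actual API) of the incremental result type such a parser returns: instead of demanding all input up front, it can pause with Partial and ask for the next chunk.

```haskell
-- Minimal sketch of an incremental decode result, in the spirit of
-- binary's incremental interface. String stands in for ByteString;
-- all names here are illustrative.
data Result a
  = Done String a                       -- leftover input, parsed value
  | Partial (Maybe String -> Result a)  -- feed another chunk (Nothing = EOF)
  | Fail String String                  -- leftover input, error message

newtype Get a = Get { runGet :: String -> Result a }

-- A parser that reads one character, pausing when the chunk is exhausted.
anyChar :: Get Char
anyChar = Get go
  where
    go (c : rest) = Done rest c
    go []         = Partial $ \more -> case more of
      Just chunk -> go chunk
      Nothing    -> Fail [] "unexpected end of input"

main :: IO ()
main = case runGet anyChar "" of
  Partial k -> case k (Just "ab") of
    Done rest c -> putStrLn (c : ' ' : rest)
    _           -> putStrLn "parse failed"
  _ -> putStrLn "no Partial requested"
```

Starting the parser on an empty chunk forces it to suspend, and feeding it "ab" afterwards completes the parse with "b" left over.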

I chose binary as a foundation because, unlike attoparsec, it doesn’t do any weird magic under the hood, and because it suffers from certain design issues that were never resolved.

Writing a parser that fits all three points turned out to be remarkably straightforward. It’s much more powerful, yet still fully backwards compatible with the previous binary API, and it’s not even that far off performance-wise (3x worse on the “100k brackets” test case, ~1.1-1.5x worse on every other one).

The only big problem I see with it is that it’s an 11-argument CPS, so every function mentions 20 different arguments while only modifying 3 or 4. I wish I knew a way to structure this better.
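To illustrate the complaint (this is not the PR's actual type — just a three-continuation miniature), a CPS parser accumulates arguments because every capability becomes one more parameter that every combinator must thread through; the real version adds continuations and state for input refills, chunk bookkeeping, and so on:

```haskell
{-# LANGUAGE RankNTypes #-}

-- A deliberately tiny CPS parser: one input argument plus two
-- continuations. Each extra capability (asking for more input, tracking
-- consumed chunks, ...) would become yet another parameter here.
newtype Get a = Get
  { runGet
      :: forall r
       . String                    -- unconsumed input
      -> (String -> String -> r)   -- failure continuation: leftover, message
      -> (String -> a -> r)        -- success continuation: leftover, value
      -> r
  }

anyChar :: Get Char
anyChar = Get $ \inp err ok -> case inp of
  []       -> err [] "end of input"
  c : rest -> ok rest c

main :: IO ()
main = runGet anyChar "xyz"
  (\_ e -> putStrLn e)
  (\rest c -> putStrLn (c : '|' : rest))
```

Even at this size, anyChar has to mention all three extra arguments just to consume one character.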


+1 for your requirements, these all sound like very natural things to wish for.

Would you please elaborate on why e.g. Parsec-style parsers don’t fit your requirement, by giving a (simplified) type signature of such a parser and explaining how that signature contradicts your goals? My understanding is that (neglecting the user state parameter u and the consumed/not consumed distinction) a ParsecT transformer is morally isomorphic to the Scott encoding of

StateT ParserState (Either ParseError) (ParseError,a)

in the Kleisli category of the transformed monad m, where ParserState holds (among other things) the unconsumed input stream. I used to believe that at least streaming could be achieved by a clever choice of m that adds streaming capabilities.
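Unrolling that description into direct style (and eliding the (ParseError, a) success payload, which Parsec uses to carry error information even on successful branches), the type would look roughly like the following base-only sketch — all names are illustrative:

```haskell
-- Direct-style unrolling of "StateT ParserState (Either ParseError) in the
-- Kleisli category of m": state in, state out, possible failure, all
-- under the base monad m. Names here are illustrative, not Parsec's.
type ParseError  = String
newtype ParserState = ParserState { unconsumed :: String }

newtype ParserT m a = ParserT
  { runParserT :: ParserState -> m (Either ParseError (a, ParserState)) }

-- Consume one character, failing on end of input.
item :: Applicative m => ParserT m Char
item = ParserT $ \st -> case unconsumed st of
  []     -> pure (Left "end of input")
  c : cs -> pure (Right (c, ParserState cs))

main :: IO ()
main = do
  r <- runParserT item (ParserState "hi")
  case r of
    Right (c, st) -> putStrLn (c : ' ' : unconsumed st)
    Left e        -> putStrLn e
```

The point of the question above is whether anything in this shape inherently contradicts the three requirements, or whether a suitable m could recover streaming.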

I often deal with data formats where the task is simply to extract some fragments of the byte stream verbatim and assemble them into a new byte stream, e.g. filtering some columns from a CSV file. I would still consider this lexical analysis, though. But can we not always assume that, in order to know which parts to copy verbatim, we must have forced them and held on to the memory until we know where the block ends?
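For concreteness, here is what the CSV case looks like with plain bytestring slices (keepColumns is a made-up helper, not from any library): B.split produces slices that share the original buffer, so nothing is copied until the kept fields are reassembled — which is exactly the situation described, where the whole line stays live until we know where each field ends.

```haskell
-- Illustrative column filtering over one CSV line. B.split returns
-- zero-copy slices into the original buffer; only B.intercalate
-- allocates, when the kept fields are stitched back together.
import qualified Data.ByteString.Char8 as B

keepColumns :: [Int] -> B.ByteString -> B.ByteString
keepColumns wanted line =
  let fields = B.split ',' line  -- slices, no copying yet
  in  B.intercalate (B.pack ",")
        [ f | (i, f) <- zip [0 ..] fields, i `elem` wanted ]

main :: IO ()
main = B.putStrLn (keepColumns [0, 2] (B.pack "a,b,c,d"))
```

No quoting or escaping is handled here — it is only meant to show where the copies do and do not happen.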

My initial thought was that, since they convert input to tokens, they must also lose access to the underlying input stream, but looking at megaparsec that does not seem to be the case (and that’s why it has a proper match function). So, indeed, my first point doesn’t seem to hold.

megaparsec still cannot stream input though, so I think points (2) and (3) suffice.

I don’t know how realistic that is, because streaming input adds an additional possible output to the parser: MoreInput -> Parser a. You can make it unreachable by saying “there will be no more input”, but you can’t remove it.

On top of that, the parser state must be different to support this; that’s why my version has to haul around eight arguments while any other parser needs roughly three.

I think that the Stream class of megaparsec is a quite powerful abstraction. It hides how exactly we split a chunk off the front of an input stream. Splitting off a chunk might be a zero-copy operation such as slicing. The class has an instance for Lazy.ByteString. Hence, would you not consider streaming of input a solved problem? A point of criticism might be that one should also add an m type parameter to the Stream class so that the stream can effectfully fetch more input over a network or from another streaming library’s abstraction:

take1_ :: s -> m (Maybe (Token s, s))

With that in place, one might as well look for a parsing abstraction in a proper streaming library, see e.g. streaming.
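Spelled out, that proposal might look like the following sketch. The take1_ name mirrors megaparsec's real Stream method, but the m parameter for effectful refills is this thread's suggestion, not megaparsec's actual API:

```haskell
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE TypeFamilies          #-}

-- Hypothetical effectful Stream class: splitting one token off the front
-- may run an effect in m (e.g. fetching more input over the network).
class Monad m => Stream s m where
  type Token s
  take1_ :: s -> m (Maybe (Token s, s))

-- A pure instance: lists supply their tokens without any effects.
instance Monad m => Stream [a] m where
  type Token [a] = a
  take1_ []       = pure Nothing
  take1_ (x : xs) = pure (Just (x, xs))

main :: IO ()
main = do
  r <- take1_ "abc"
  case r of
    Just (c, rest) -> putStrLn (c : '|' : rest)
    Nothing        -> putStrLn "empty stream"
```

An instance over a network socket or a streaming library's stream type would use the m slot for real effects; the list instance only demonstrates the shape.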

megaparsec’s Stream class has nothing to do with streaming input; it’s their way of overloading possible input types. The library doesn’t do any clever chunk tracking underneath: it simply executes operations and points at new parts of the input, and the garbage collector then sees that the old parts are no longer referenced and reclaims them.

I indeed could use a Stream ByteArray m a instead of the LazyByteString in my implementation, but it wouldn’t fundamentally change anything. The Effect constructor would serve [almost] the same role as Partial does right now, and the runner would morph into something atrocious like

parse :: Get a -> Stream ByteArray m a -> m (Either Error a, Stream ByteArray m a)

While I haven’t understood your implementation (and probably lack the experience to appreciate the design choices you made), I thought you started this thread on a quest to find good abstractions to properly structure your approach. Hence I think we should consider:

  • whether the general StateT LeftoverInput (EitherT ParseError m) abstraction could in principle fit your needs,
  • whether any of the established streaming libraries provide abstractions useful for parsing,
  • whether the problem of consuming or producing a stream can be de-coupled at all from the internal state of the parser and its branch choices, especially in the presence of backtracking.

Perhaps you can use conduit to control the data streaming capabilities separately from how you execute the parser on the resulting data?

It’s backwards from that: I already have a working abstraction in a binary PR. I’m just looking for disagreements on how it should be structured (in which case I can make it better), or suggestions of libraries that already provide all the features I’m looking for (in which case I can use those instead).

I already addressed the question of using streams in a comment above: binary already supports that with the Partial constructor, and I don’t think moving to streams meaningfully changes anything (other than adding a streaming library dependency).