On the other hand, when I rewrote some code to read the input manually instead of going through binary, it gave an 8x speed-up:
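To illustrate the kind of rewrite meant here (a hypothetical sketch, not the actual benchmarked code): decoding a block of little-endian `Word32`s via `Data.Binary.Get` versus indexing straight into a strict `ByteString` by hand. The generic route allocates as it walks the input; the manual route is just arithmetic on offsets.

```haskell
{-# LANGUAGE BangPatterns #-}
import           Control.Monad        (replicateM)
import           Data.Binary.Get      (runGet, getWord32le)
import           Data.Bits            (shiftL, (.|.))
import qualified Data.ByteString      as B
import qualified Data.ByteString.Lazy as BL
import           Data.Word            (Word32)

-- Generic route: a Get computation, allocating as it consumes input.
viaBinary :: Int -> BL.ByteString -> [Word32]
viaBinary n = runGet (replicateM n getWord32le)

-- Manual route: read each little-endian word by indexing the buffer.
manual :: Int -> B.ByteString -> [Word32]
manual n bs = [ word32At (i * 4) | i <- [0 .. n - 1] ]
  where
    word32At o =
          fromIntegral (B.index bs o)
      .|. (fromIntegral (B.index bs (o + 1)) `shiftL` 8)
      .|. (fromIntegral (B.index bs (o + 2)) `shiftL` 16)
      .|. (fromIntegral (B.index bs (o + 3)) `shiftL` 24)

main :: IO ()
main = do
  let input = B.pack [1, 0, 0, 0, 2, 0, 0, 0]
  print (manual 2 input)                               -- [1,2]
  print (manual 2 input == viaBinary 2 (BL.fromStrict input))
```

Both routes decode the same words; the difference shows up only once you benchmark on large inputs.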
I think it should be possible to write a more efficient package (maybe it already exists). flatparse is a gold standard for parsing, but it is not really meant for binary input.
That's an incredibly specific (and short) case; for an apples-to-apples comparison you need to remove unnecessary laziness from the original. Also, I have no idea why the data is stored in cons lists.
And I think there shouldn't be just one package: there should be a solid default choice that balances power and efficiency, and a bunch of faster libraries that sacrifice power. (Pretty much what we already have, except binary could do way more.)
In my experience, flatparse is great for binary parsing. It includes some rather esoteric combinators such as isolateToNextNull. I have a library full of combinatory binary flatparse parsers at binrep.
I use cereal for saving and restoring state. I switched from binary many years ago for a reason I've since forgotten; I vaguely recall it was more strict in some important way. It turns out that at some point I also stopped using its Serialize class and wrote my own, which I think was due to Float serialization mangling -0.0, so it's actually somewhat related to the topic. But writing your own is also necessary to retain control over backward compatibility, since, as is being discussed, the libraries are free to change the format if they're not implementing a standard.
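For reference, a bit-exact Float serializer avoids the -0.0 problem by reinterpreting the IEEE-754 bits as a `Word32` rather than going through `decodeFloat`, which normalises the sign of zero. A minimal sketch using binary's primitives (cereal's `putWord32le`/`getWord32le` work the same way):

```haskell
import           Data.Binary.Get      (runGet, getWord32le)
import           Data.Binary.Put      (runPut, putWord32le)
import qualified Data.ByteString.Lazy as BL
import           GHC.Float            (castFloatToWord32, castWord32ToFloat)

-- Serialise the raw IEEE-754 bit pattern, preserving -0.0, NaN
-- payloads, etc.; decodeFloat-based instances lose the sign of zero.
putFloatBits :: Float -> BL.ByteString
putFloatBits = runPut . putWord32le . castFloatToWord32

getFloatBits :: BL.ByteString -> Float
getFloatBits = castWord32ToFloat . runGet getWord32le

main :: IO ()
main = print (isNegativeZero (getFloatBits (putFloatBits (-0.0))))  -- True
```

`castFloatToWord32` and `castWord32ToFloat` are in `GHC.Float` (base >= 4.10), so no extra dependency is needed for the bit-level casts.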
So in my case, I think it's fine to fix Binary's Float instance; it would have saved me a surprise. Though I would also document that if you need a stable format, you shouldn't use the library's typeclass but write your own. Also, it's fast enough for me; I don't really have a need for high performance, I'm just curious about what is efficient and what is not. I also recall that one rationale for wanting better performance from binary was that GHC uses it, and of course anything that can speed up compiles is welcome.
Is there a relevant distinction between deserialization and parsing? I think of what I'm doing as the former: there aren't any branches except some version checks, whereas what I think of as a parser has a lot of "or" operations and hence might need things like backtracking. Maybe the distinction is too fuzzy to be useful, though.
Yeah, if your language does not have to be human-readable then you can trivially make it LL(1) (by adding tags to all branches). That means you can use a very simple and efficient O(n) parsing algorithm, for example the one developed by Swierstra and Duponcheel in 1996:
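Concretely, tagging every branch means one byte of lookahead always decides which constructor follows, so decoding never backtracks. A toy sketch (the `Msg` type and tag values are made up for illustration):

```haskell
import           Data.Binary.Get      (Get, runGet, getWord8, getWord32le)
import qualified Data.ByteString.Lazy as BL
import           Data.Word            (Word32)

-- A sum type with one tag byte per constructor: the tag alone
-- determines the branch, so the grammar is LL(1) by construction.
data Msg = Ping | Payload Word32 deriving (Eq, Show)

getMsg :: Get Msg
getMsg = do
  tag <- getWord8
  case tag of
    0 -> pure Ping
    1 -> Payload <$> getWord32le
    _ -> fail "unknown tag"

main :: IO ()
main = do
  print (runGet getMsg (BL.pack [0]))              -- Ping
  print (runGet getMsg (BL.pack [1, 42, 0, 0, 0])) -- Payload 42
</```>
```

Since each case is committed to as soon as its tag is read, the decoder runs in a single O(n) pass over the input.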
Check out the resources linked regarding the store package and the haskell-perf one.
Serialisation libraries in Haskell have pretty straightforward characteristics that make them slower or faster. The binary package has a combination of a few inefficiencies: it has to constantly allocate as it walks the input, it has to support streaming input (meaning dereferencing where we are in the input is laden with indirection), and its default instances for types and derived instances are quite a bit bigger than more competitive libraries (3x by our recent evaluation at work). In sum, it just does a lot more work than other packages. If you donât care about speed, transfer time or space, use that. If you do, youâll soon hit its limits and become acquainted with the other options.
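The size overhead of binary's stock instances is easy to see directly. A small sketch (byte counts reflect the stock instances at the time of writing — a big-endian `Word32` and an 8-byte `Int64` length prefix on lists):

```haskell
import           Data.Binary          (Binary, encode)
import qualified Data.ByteString.Lazy as BL
import           Data.Int             (Int64)
import           Data.Word            (Word32)

-- Encoded size in bytes under binary's stock instances.
encodedSize :: Binary a => a -> Int64
encodedSize = BL.length . encode

main :: IO ()
main = do
  print (encodedSize (0 :: Word32))          -- 4: big-endian word
  print (encodedSize ([] :: [Word32]))       -- 8: length prefix alone
  print (encodedSize ([0, 1, 2] :: [Word32]))-- 20: 8 + 3 * 4
```

Length prefixes and tag bytes like these are where the "3x bigger" gap against more compact formats tends to come from on small, nested values.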
I would say the difference between a serialisation package and a parser package is that both are parsers, but a "serialisation" library offers a particular format out of the box, so you can take some type from your app and serialise it somewhere for retrieval later. flat, store, persist, serialise, binary, cereal: these come with a format and derivable instances. megaparsec, attoparsec, and flatparse are parsers without an accompanying derivable format.
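The "format out of the box" point is visible in how little code a derived instance takes. A minimal sketch with binary's Generic-based defaults (the `Person` type is made up for illustration):

```haskell
{-# LANGUAGE DeriveGeneric #-}
import           Data.Binary  (Binary, decode, encode)
import           GHC.Generics (Generic)

-- Derive Generic, declare an empty Binary instance, and the library
-- supplies the wire format: no hand-written parser required.
data Person = Person { name :: String, age :: Int }
  deriving (Eq, Show, Generic)

instance Binary Person

main :: IO ()
main = print (decode (encode (Person "Ada" 36)) == Person "Ada" 36)  -- True
```

A parser library like megaparsec or flatparse gives you the combinators but leaves the format entirely up to you, which is exactly the trade-off described above.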