Changing the `Binary` instance for `Double` and `Float`

On the other hand, when I rewrote some code to manually read the input instead of using binary it gave an 8x speed up:

I think it should be possible to write a more efficient package (maybe it already exists). flatparse is a gold standard for parsing, but it is not really meant for binary input.

That’s an incredibly specific (and short) case. For an apples-to-apples comparison you need to remove the unnecessary laziness from the original. Also, I have no idea why the data is stored in cons-lists.

And I don’t think there should be just one package. There should be a solid default choice that balances power and efficiency, plus a bunch of faster libraries that sacrifice power (pretty much what we already have, except binary could do way more).

I did, it only reduced the running time by 20%. binary uses CPS which fundamentally limits performance (it builds a tree of thunks at run-time unless everything inlines).

In my experience, flatparse is great for binary parsing. It includes some rather esoteric combinators such as isolateToNextNull. I have a library full of combinatory binary flatparse parsers at binrep.

I think ghc only uses the binary package for the package db at this point. Most internal uses use GHC’s own Binary class:

class Binary a where
  put_   :: WriteBinHandle -> a -> IO ()
  put    :: WriteBinHandle -> a -> IO (Bin a)
  get    :: ReadBinHandle -> IO a

But while that works for GHC I wouldn’t recommend this interface for a library.

I use cereal for saving and restoring state. I switched from binary many years ago for a reason I no longer remember; I vaguely recall it was stricter in some important way. It turns out that at some point I also stopped using its Serialize class and wrote my own, which I think was due to Float serialization messing up -0, so it’s actually somewhat related to the topic. Having my own class is also necessary to retain control over backward compatibility since, as is being discussed, the libraries are free to change the format if they’re not implementing a standard.

So in my case, I think it’s fine to fix Binary Float, it would have saved me from a surprise, though I would also document that if you need a stable format, don’t use the library’s typeclass but write your own. Also, it’s fast enough for me, I don’t really have a need for high performance. I’m just curious about what is efficient and what is not. Also I recall a rationale behind wanting better performance for binary was that ghc used it, and of course anything that can speed up compiles is welcome.
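A minimal sketch of the "write your own" approach, assuming a big-endian IEEE 754 layout for Float. The `Stable` class and its methods are hypothetical names made up for this illustration, not part of binary or cereal, and a list of bytes stands in for a real buffer. Storing the raw bit pattern is one way to keep -0 intact:

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word8, Word32)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Hypothetical class: the format is whatever these instances say,
-- so a library upgrade cannot change it behind your back.
class Stable a where
  encode :: a -> [Word8]
  decode :: [Word8] -> Maybe (a, [Word8])

-- Float stored as its IEEE 754 bit pattern: 4 bytes, big-endian.
instance Stable Float where
  encode x =
    let w = castFloatToWord32 x
    in [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]
  decode (a : b : c : d : rest) =
    let step acc byte = (acc `shiftL` 8) .|. fromIntegral byte
        w = foldl step (0 :: Word32) [a, b, c, d]
    in Just (castWord32ToFloat w, rest)
  decode _ = Nothing  -- fewer than 4 bytes left

main :: IO ()
main = do
  let bytes = encode (-0.0 :: Float)
  print bytes
  -- The sign bit survives the round trip.
  print (fmap (isNegativeZero . fst) (decode bytes :: Maybe (Float, [Word8])))
```

Because the byte layout is spelled out in the instance, it only changes when you change it.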

Is there a relevant distinction between deserialization and parsing? I think of what I’m doing as the former, there aren’t any branches except some version checks, whereas what I think of a parser has a lot of “or” operations and hence might have things like backtracking. Maybe the distinction is too fuzzy to be useful though.

Yeah, if your language does not have to be human-readable then you can trivially make it LL(1) (by adding tags to all branches). That means you can use a very simple and efficient O(n) parsing algorithm, for example the one developed by Swierstra and Duponcheel in 1996.
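To make the tagging idea concrete, here is a hand-written sketch. It is not the Swierstra and Duponcheel construction, only an illustration, and the `Shape` type and its one-byte tags are invented for the example. Each constructor is written with a leading tag byte, so the decoder picks its branch from that single byte and never backtracks:

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Hypothetical message type, used only for illustration.
data Shape
  = Circle Word8       -- tag 0, then one byte: radius
  | Rect Word8 Word8   -- tag 1, then two bytes: width, height
  deriving (Eq, Show)

encodeShape :: Shape -> BS.ByteString
encodeShape (Circle r) = BS.pack [0, r]
encodeShape (Rect w h) = BS.pack [1, w, h]

-- Read one byte, returning it together with the remaining input.
byte :: BS.ByteString -> Maybe (Word8, BS.ByteString)
byte = BS.uncons

-- The first byte alone selects the branch; no alternative is retried.
decodeShape :: BS.ByteString -> Maybe (Shape, BS.ByteString)
decodeShape input = do
  (tag, rest) <- byte input
  case tag of
    0 -> do
      (r, rest') <- byte rest
      pure (Circle r, rest')
    1 -> do
      (w, rest') <- byte rest
      (h, rest'') <- byte rest'
      pure (Rect w h, rest'')
    _ -> Nothing  -- unknown tag

main :: IO ()
main = do
  print (decodeShape (encodeShape (Rect 3 4)))
  print (decodeShape (BS.pack [7]))
```

Every input byte is inspected once, which is where the linear running time comes from.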

Check out the resources linked regarding the store package and the haskell-perf one.

Serialisation libraries in Haskell have pretty straightforward characteristics that make them slower or faster. The binary package has a combination of a few inefficiencies: it has to constantly allocate as it walks the input, it has to support streaming input (meaning dereferencing where we are in the input is laden with indirection), and its default instances for types and derived instances are quite a bit bigger than more competitive libraries (3x by our recent evaluation at work). In sum, it just does a lot more work than other packages. If you don’t care about speed, transfer time or space, use that. If you do, you’ll soon hit its limits and become acquainted with the other options.

I would say the difference between a serialisation package and a parser package is that both are parsers, but a “serialisation” library offers a particular format out of the box, so you can take some type from your app and serialise it somewhere for retrieval later. flat, store, persist, serialise, binary and cereal come with a format and derivable instances. megaparsec, attoparsec and flatparse are parsers without an accompanying derivable format.
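As a small illustration of "a format and derivable instances", this sketch uses binary's Generic-based default instance. The `Settings` type is a hypothetical application type; `encode`, `decode` and the empty `instance Binary` are the library's documented interface:

```haskell
{-# LANGUAGE DeriveGeneric #-}

import Data.Binary (Binary, decode, encode)
import GHC.Generics (Generic)

-- Hypothetical application type.
data Settings = Settings
  { volume :: Int
  , name   :: String
  } deriving (Eq, Show, Generic)

-- Empty instance: put and get are filled in from the Generic defaults.
instance Binary Settings

main :: IO ()
main = do
  let original = Settings { volume = 7, name = "default" }
      bytes    = encode original            -- lazy ByteString in binary's own format
      restored = decode bytes :: Settings
  print (restored == original)
```

Note that `decode` throws on malformed input; `decodeOrFail` from the same module returns an `Either` instead. With a parser library such as flatparse you would write both the format and the parser yourself.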
