Changing the `Binary` instance for `Double` and `Float`

I am the new maintainer of the binary library. As such, I am thinking about changing the Binary Double and Binary Float instances. They currently use decodeFloat and encodeFloat, which IMHO is just completely insane. A better way would be to serialise the raw bit representation instead, which would be both faster and consume less memory. Another problem with the current approach is that it doesn’t round-trip (decode (encode NaN) /= NaN · Issue #64 · haskell/binary · GitHub).
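For concreteness, here is a sketch of what a raw-bits encoding could look like, using the bit-cast primitives from GHC.Float (the `RawDouble` newtype is purely illustrative; the actual change would be to the `Double` instance itself):

```haskell
-- Illustrative only: serialise the raw IEEE-754 bit pattern instead of
-- the (Integer, Int) pair produced by decodeFloat.
import Data.Binary (Binary (..))
import Data.Binary.Get (getWord64be)
import Data.Binary.Put (putWord64be)
import GHC.Float (castDoubleToWord64, castWord64ToDouble)

newtype RawDouble = RawDouble Double

instance Binary RawDouble where
  put (RawDouble d) = putWord64be (castDoubleToWord64 d)
  get = RawDouble . castWord64ToDouble <$> getWord64be
```

This round-trips every bit pattern, including NaN payloads, and avoids the Integer allocation that decodeFloat incurs.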

The problem is that this would be a breaking change. In particular, things would break when deserialising data that was serialised using an older version of binary. A new major release should be fine for that, but since binary is a GHC boot library, I’m being a bit more careful. @TeofilC suggested making a super-major version bump (i.e. releasing binary-1.0.0.0) while still shipping a binary-0.* version for the next few GHC releases. The instances could also get a deprecation warning to warn about the change in behaviour.

Now, to get to the point of this post: I want to ask you all how you would feel about this change, whether it would affect you and how you suggest dealing with it to avoid/minimise breakage. There has also been some previous discussion at Instance for Double/Float is absolutely batty · Issue #69 · haskell/binary · GitHub.

18 Likes

I think you’re between a rock and a hard place on this one.

Yes, these instances are inefficient and don’t round-trip properly. I would however argue that this doesn’t matter: the only promise the Binary type class provides is bidirectionality, and this change would break it. I don’t know if binary-1.0 versus binary-0.9 makes a difference here; I imagine half the people depending on it didn’t bother setting an upper version bound.

Furthermore, binary has been abandoned for an entire decade, and it wasn’t a polished library before that either. Bumping it to 1.0 to introduce a silent breaking change, instead of properly overhauling it, would look like an epitaph for an ecosystem unwilling to challenge any of its old problems.

Oh, also, I spent some of my own time rewriting the parser part of binary into a package I called lathe and I think it has some awesome features, so feel free to borrow from that if you ever feel like adding to the library.

3 Likes

That’s not really true, is it? The current instances for Double and Float don’t satisfy bidirectionality, while the new instance would. The only thing the change would break is decoding data that was encoded using an older version.

You’re probably right, but I think that’s kinda on them. You can’t not set upper bounds and then expect to never get any breaking changes.

What are some of these old problems? I’d love to solve them, as far as is possible without making a completely new package.


As some prior art, cereal, which started as a fork of binary, changed the instances for Double and Float in version 0.5.0.0, 6 years after its initial release.

1 Like

I’m a user of Binary via Cloud Haskell.

I would argue that if you wanted portability across time, then you should have a defined schema in a hand-rolled Binary instance. I think a breaking change to 1.0, introducing this change, is perfectly acceptable.

You could also extend the current version (0.9?) to include newtypes over Double and Float that provide a pathway to migrate to 1.0.
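To illustrate that idea (`LegacyDouble` is a made-up name, and the encoding shown may not match binary’s current output byte-for-byte): a legacy newtype could pin down today’s decodeFloat-based format behind an explicit name, while the plain Double instance moves to the new format in 1.0.

```haskell
-- Hypothetical migration shim: keeps (roughly) the current wire format,
-- i.e. the (Integer, Int) pair from decodeFloat, behind an explicit name.
import Data.Binary (Binary (..))

newtype LegacyDouble = LegacyDouble Double

instance Binary LegacyDouble where
  put (LegacyDouble d) = put (decodeFloat d)
  get = LegacyDouble . uncurry encodeFloat <$> get
```

Code that must keep reading old data can then switch to the newtype before upgrading, making the dependency on the old format visible in the types.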

9 Likes

Yes, providing compatibility shims until enough people have switched over to using them seems like a good way to defer a major version change until the new format reaches critical mass.

1 Like

I think you should start by asking yourself what your goal is with binary.

Is the goal to just keep binary working as a simple “just works” library with few dependencies and frills? Just keep it.

Is the goal to work on it to make it competitive with faster serialization libraries when it comes to performance even at the cost of breaking changes in API or otherwise? Change it. It will break users but if that is the goal the breakage is necessary.

I accidentally broke stack once when I changed an internal-only, supposed-to-be-opaque binary serialization in GHC, but it turned out stack was dependent on it. So breakage will 100% happen to someone if you change it. People just depend on the wildest things out there. :)

Personally, I used binary when I wanted a convenient interface more than I wanted speed. While the current implementation for Float/Double is pretty bad even by those standards, if I’m honest it was good enough for those cases anyway. And if I wanted something really fast, I’m not sure binary would have been my first choice. So I don’t think changing the Float/Double instances by itself would change what I use binary for much, or at all, unless you plan other changes on top of that.

3 Likes

For what it’s worth, as a user of binary and similar Haskell packages, my expectation is that unless the package implements a particular standard (like JSON or CBOR), I shouldn’t rely on it maintaining any sort of backwards compatibility in the serialised format.

10 Likes

That’s the most common use for serialization: you encode some data, store it, and decode it at some later date. Though, looking at other people’s responses, it would appear I expect too much from library authors. Alas.

Some basic decency ones would be:

  • runGet throws instead of returning Either;
  • runGetOrFail could instead return (ByteString, ByteOffset, Either String a);
  • Deprecated functions for a 0.6-era interface are still in the library;
  • pushChunk is a weird third way of ingesting input that duplicates runGet/Partial functionality;
  • Host-endian encoders/decoders use accursedUnutterablePerformIO, and I’d argue these functions are completely unnecessary in the library.
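The first two points can be seen side by side: runGetOrFail hands failure back as a value, while runGet throws from pure code.

```haskell
import Data.Binary.Get (getWord32be, runGetOrFail)
import qualified Data.ByteString.Lazy as L

main :: IO ()
main = do
  let short = L.pack [0x01, 0x02]  -- two bytes, but the parser wants four
  -- Recoverable failure as a value:
  case runGetOrFail getWord32be short of
    Left (_rest, offset, msg) -> putStrLn ("failed at " ++ show offset ++ ": " ++ msg)
    Right (_rest, _off, w)    -> print w
  -- By contrast, `runGet getWord32be short` would throw an imprecise
  -- exception from pure code instead of returning a failure value.
```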

The parser could be way more powerful as I already described.

At an ecosystem level I dream of there being a way to talk about byte arrays with encodings attached (e.g. type-level Builder @UTF8), but with how wildly uninterested everyone is in this topic I don’t expect to see it in my lifetime.

This. The old release is not going anywhere; users can pin it in their build plans. If someone really needs backward compatibility, they can fork it as old-binary and keep using that indefinitely. (Or you could publish such a fork yourself, if you want to be extra nice.)

Not being able to roundtrip NaN is a cardinal sin for instance Binary Double, it must be rectified.

7 Likes

I am very interested in a pattern I believe to be related i.e. describing encoding in types. I have a library binrep that supports doing this in a composable manner e.g. a NullTerminated ByteString will encode like a C-string, and that is explicit in the type. I’d be very keen to support any efforts like this.

Note that as of GHC 9.10.2, you can put {-# DEPRECATED #-} annotations on instance declarations, so you could deprecate those instances for a couple of releases before removing them or moving them to a newtype for people who still want them.
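Sketched with a stand-in class rather than the real Binary instance (the class, newtype, and message below are all made up for illustration), an instance-level deprecation looks like this:

```haskell
module Sketch where

-- Stand-in class; in binary this would be the real `Binary Double` instance.
class Encode a where
  enc :: a -> String

newtype Old = Old Double

-- GHC >= 9.10: the pragma makes every use site of this instance warn,
-- pointing users at the upcoming change before it lands.
instance {-# DEPRECATED "this encoding will change in binary-1.0" #-} Encode Old where
  enc (Old d) = show d
```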

5 Likes

As much as I would love to see “encode/decode this composite type using this type-level template” properly implemented in Haskell, the type system is woefully insufficient for it in my view. Even Generics, with their near-universal support, destroy compilation times and lead to this very topic of “how do I replace a type class instance”.

Ultimately I just want a simple, clear hierarchy of libraries that allows me to say “offset me by 10 characters in this UTF-8 array that I know is well-formed and, unless the end is reached, return the byte offset”, instead of having to figure out which of Text/ShortText/String will result in the smallest pointless allocation overhead in my [quite hot] rendering function.

The Double/Float instances aren’t the only buggy ones, either; Int/Word have their own problems:

instance Binary Int where
    put     = putInt64be . fromIntegral
    get     = liftM fromIntegral getInt64be

On 32-bit platforms this will silently truncate large numbers written on 64-bit platforms. In a sense it’s even worse than the NaN problem. I haven’t read binary’s code closely, so there could be other problems as well.
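The effect can be simulated on any platform with fixed-width types; this just mimics what the fromIntegral in get does on a machine whose native Int is 32 bits:

```haskell
import Data.Int (Int32, Int64)

main :: IO ()
main = do
  let written  = 2 ^ 40 :: Int64                -- value encoded on a 64-bit platform
      readBack = fromIntegral written :: Int32  -- decoded into a 32-bit native Int
  print readBack  -- prints 0: the high bits are silently dropped
```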

I think safest option is to declare Binary type class broken and unfixable and simply introduce Binary2, deprecate Binary and in the end ship it off into old-binary package for those who need it.

2 Likes

There’s no such thing as a cross-platform platform-sized integer, so the current instance minimizes breakage (32bit → 64bit conversion is still safe).

You could have a separate type class for types that are only safe to convert on compatible platforms, but serialization type classes are borderline lawless, so I don’t think it’d help much.

Yes, and a serialization library has to handle this fact somehow. Currently 64→32-bit is handled in the worst possible way: silent corruption. And I’m not sure what the right way to handle it is.

Without fully digging in: would this proposed new behavior be affected by endianness, or are we just not caring about that anymore?

It would use the big-endian versions, just like all the other types.

I’ve heard various references over the years to binary (and also cereal?) being slow, specifically when the cbor library was under development and was supposed to replace binary for ghc’s internal use. While the latter seems to have never happened, was the stuff about binary being slow actually true? What are some examples of serialization libraries that are fast?

I ended up creating primitive-serial for this exact issue. All it does is pull representations out of memory using peek, then swap bytes as needed depending on CPU endianness.
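The idea looks roughly like this (`peekWord64BE` is a hypothetical helper name, not primitive-serial’s actual API):

```haskell
import Data.Word (Word64, byteSwap64)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peek)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)

-- Read 8 bytes at the pointer and interpret them as a big-endian Word64,
-- swapping only when the host is little-endian.
peekWord64BE :: Ptr a -> IO Word64
peekWord64BE p = do
  w <- peek (castPtr p :: Ptr Word64)
  pure $ case targetByteOrder of
    BigEndian    -> w
    LittleEndian -> byteSwap64 w
```

Since targetByteOrder is a compile-time constant, the branch disappears entirely, which is why this style tends to beat assembling the value byte by byte.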

For context, a quick search yields benchmarks from a decade ago (fpco, haskell-perf), though I don’t expect much has changed since.

“slow” doesn’t mean much for a well-written parser; they all come with trade-offs:

  • Anything small will always be faster in an input-strict parser, since those work over a single contiguous chunk of memory. Conversely, anything huge may be handled more efficiently in an input-lazy parser;

  • A parser that supports streaming output can be used for incremental processing, and you’re kinda screwed otherwise if you need that;

  • Big data blobs (such as videogame resource files) can be parsed way more efficiently by memory-mapping the file and parsing each section in parallel (though memory-mapping is a platform-specific feature, so it’s a pain in the butt to set up).

In my experience binary parses a 35MB GameMaker resource file in a fraction of a second (~65ms as I check right now), so it’s more than fast enough for general use.