xxHash: extremely fast non-cryptographic hash functions

I’ve prepared a new release of xxhash-ffi: Bindings to the C implementation the xxHash algorithm, which is Haskell bindings and high-level helpers for xxHash family of extremely fast non-cryptographic hash functions.

Changelog:

  • Update bundled xxhash C library to 0.8.2.
  • Add FFI bindings for XXH3 family of hash functions.
  • Add newtype XXH3 and instance Hashable for it.
  • Deprecate class XXHash, please use XXH3-based API instead.

Benchmarks against hashable-1.4.3.0 on x86_64:

10B
  Data.Digest.XXHash.FFI.XXH3:
    20.2 ns ± 692 ps
  Data.Hashable:
    20.3 ns ± 722 ps
1kB
  Data.Digest.XXHash.FFI.XXH3:
    80.8 ns ± 2.0 ns
  Data.Hashable:
    1.27 μs ±  16 ns
4kB
  Data.Digest.XXHash.FFI.XXH3:
    236  ns ± 7.9 ns
  Data.Hashable:
    5.09 μs ± 135 ns
1MB
  Data.Digest.XXHash.FFI.XXH3:
    138  μs ± 3.4 μs
  Data.Hashable:
    3.19 ms ±  60 μs

Benchmarks against hashable-1.4.3.0 on aarch64:

10B
  Data.Digest.XXHash.FFI.XXH3:
    10.2 ns ± 208 ps
  Data.Hashable:
    12.1 ns ± 404 ps
1kB
  Data.Digest.XXHash.FFI.XXH3:
    71.7 ns ± 1.8 ns
  Data.Hashable:
    1.14 μs ±  27 ns
4kB
  Data.Digest.XXHash.FFI.XXH3:
    289  ns ± 7.0 ns
  Data.Hashable:
    4.71 μs ± 187 ns
1MB
  Data.Digest.XXHash.FFI.XXH3:
    175  μs ± 6.9 μs
  Data.Hashable:
    2.87 ms ± 108 μs
8 Likes

Wait, what’s the difference between the two?

You can only use the XXH3 using the Hashable instance, right? So which is which?
And if the Data.Hashable benchmark is not XXH3, what is it?

Or is this all ByteString, but one is wrapped in XXH3?

There are raw FFI bindings in Data.Digest.XXHash.FFI.C.

Correct, sorry for confusion. The first one is instance Hashable (XXH3 ByteString), the second is instance Hashable ByteString.

1 Like

Minor nitpicks

If a C library uses a synonym for a type, the bindings should ideally also port that synonym instead of referencing the underlying type directly. I wouldn’t want to run into an “actually, the types on this machine are slightly different” compilation error just because I’m pedantic with my type signatures.

Also it would be great if packages could stop shoving every single module into the Data directory, this library could have named its modules Crypto.Hash.XXHash and Foreign.Lib.XXHash.


That being said, it’s nice that there’s a maintained library for this. I was quite disappointed to find out there are four different abandoned half-baked MurmurHash2/3 libraries, and this is a solid alternative.

Also this library imports GHC.Exts, so the base < 5 constraint seems incorrect. I wonder if primitive just wasn’t a common solution five years ago when that PR was made. back when this library was originally written.

I compiled the xxhash library and used XXH32 as a hash function via FFI for my 1BRC solution (Fortran) instead of the FNV1-a hash function. It ran significantly slower, to my surprise since some of the xxhash benchmarks are against FNV64. It would be interesting to see if @AndrasKovacs @ReleaseCandidate solutions runs faster using this library: its a practical cross-check I think.

@BurningWitness yeah, module names are a bit long, but I decided not to break things for aesthetic reasons. This is a fairly old package, from 2017 originally.

I prefer to stick to boot libraries only, whenever possible. I had to depend on hashable though…

@emiruz hashing benchmarks usually boast about throughput, not latency, it’s difficult to beat FNV on tiny inputs. You can see from the numbers above that XXH3 is maybe just a bit faster than FNV on 10 bytes.

I never looked into Core generated by xxhash-ffi. I suspect that the overhead of Haskell bindings can be reduced if someone does (accursedUnutterablePerformIO?..).

Since this package seems to provide non-cryptographic hash functions (for e.g. use in hashtables etc), picking Crypto seems misleading and threfore much worse module prefix than Data… [/bikeshedding]

1 Like

True, I wonder if the proper name space would then just be Hash.* for both kinds of hashes.

@emiruz

I’ve tried (assembler SIMD) [xxhash64, not xxHash3] (xxhash package - github.com/cespare/xxhash - Go Packages) with Go, it has been faster than FNV when used as a function on the whole string, but hashing each byte “inline” using FNV has been faster. So I did not try it with Haskell sources.

So the main question is the FFI overhead from Haskell and whether the binary is compiled with (and with which) SIMD extensions.

1 Like

It’s advertised that the latest variant, XXH3, offers improved performance across the board, especially on small data, so it should be more suitable than XXH32. Indeed my interest to revive xxhash-ffi was spurred by :one::honeybee::racing_car:.

I wonder if the FFI would be faster if you used unboxed arguments like this:

foreign import capi unsafe "xxhash.h XXH32_update" c_xxh32_update ::
    XXH32State     -- ^ The state to update
 -> Addr#          -- ^ 'Ptr' to the input buffer
 -> Word64#        -- ^ Buffer length
 -> IO ()

Or can GHC do this unboxing by itself?

Edit: I just checked with a simpler function and GHC should indeed do the unboxing by itself. So never mind this suggestion.

I’m still confused that the FFI would have significant overhead, because this FFI benchmark shows that there is almost no difference between a Haskell FFI call and a normal C function call.