xxHash: extremely fast non-cryptographic hash functions

Bodigrim · March 28, 2024, 9:59pm

I’ve prepared a new release of xxhash-ffi: Bindings to the C implementation the xxHash algorithm, which is Haskell bindings and high-level helpers for xxHash family of extremely fast non-cryptographic hash functions.

Changelog:

Update bundled xxhash C library to 0.8.2.
Add FFI bindings for XXH3 family of hash functions.
Add newtype XXH3 and instance Hashable for it.
Deprecate class XXHash, please use XXH3-based API instead.

Benchmarks against hashable-1.4.3.0 on x86_64:

10B
  Data.Digest.XXHash.FFI.XXH3:
    20.2 ns ± 692 ps
  Data.Hashable:
    20.3 ns ± 722 ps
1kB
  Data.Digest.XXHash.FFI.XXH3:
    80.8 ns ± 2.0 ns
  Data.Hashable:
    1.27 μs ±  16 ns
4kB
  Data.Digest.XXHash.FFI.XXH3:
    236  ns ± 7.9 ns
  Data.Hashable:
    5.09 μs ± 135 ns
1MB
  Data.Digest.XXHash.FFI.XXH3:
    138  μs ± 3.4 μs
  Data.Hashable:
    3.19 ms ±  60 μs

Benchmarks against hashable-1.4.3.0 on aarch64:

10B
  Data.Digest.XXHash.FFI.XXH3:
    10.2 ns ± 208 ps
  Data.Hashable:
    12.1 ns ± 404 ps
1kB
  Data.Digest.XXHash.FFI.XXH3:
    71.7 ns ± 1.8 ns
  Data.Hashable:
    1.14 μs ±  27 ns
4kB
  Data.Digest.XXHash.FFI.XXH3:
    289  ns ± 7.0 ns
  Data.Hashable:
    4.71 μs ± 187 ns
1MB
  Data.Digest.XXHash.FFI.XXH3:
    175  μs ± 6.9 μs
  Data.Hashable:
    2.87 ms ± 108 μs

Vlix · March 28, 2024, 10:11pm

Wait, what’s the difference between the two?

You can only use the XXH3 using the Hashable instance, right? So which is which?
And if the Data.Hashable benchmark is not XXH3, what is it?

Or is this all ByteString, but one is wrapped in XXH3?

Bodigrim · March 28, 2024, 10:15pm

There are raw FFI bindings in Data.Digest.XXHash.FFI.C.

Correct, sorry for confusion. The first one is instance Hashable (XXH3 ByteString), the second is instance Hashable ByteString.

BurningWitness · March 29, 2024, 6:31am

Minor nitpicks

If a C library uses a synonym for a type, the bindings should ideally also port that synonym instead of referencing the underlying type directly. I wouldn’t want to run into an “actually, the types on this machine are slightly different” compilation error just because I’m pedantic with my type signatures.

Also it would be great if packages could stop shoving every single module into the Data directory, this library could have named its modules Crypto.Hash.XXHash and Foreign.Lib.XXHash.

That being said, it’s nice that there’s a maintained library for this. I was quite disappointed to find out there are four different abandoned half-baked MurmurHash2/3 libraries, and this is a solid alternative.

BurningWitness · March 29, 2024, 6:51am

Also this library imports GHC.Exts, so the base < 5 constraint seems incorrect. I wonder if primitive just wasn’t a common solution ~~five years ago when that PR was made.~~ back when this library was originally written.

emiruz · March 29, 2024, 8:00am

I compiled the xxhash library and used XXH32 as a hash function via FFI for my 1BRC solution (Fortran) instead of the FNV1-a hash function. It ran significantly slower, to my surprise since some of the xxhash benchmarks are against FNV64. It would be interesting to see if @AndrasKovacs @ReleaseCandidate solutions runs faster using this library: its a practical cross-check I think.

Bodigrim · March 29, 2024, 9:20am

@BurningWitness yeah, module names are a bit long, but I decided not to break things for aesthetic reasons. This is a fairly old package, from 2017 originally.

I prefer to stick to boot libraries only, whenever possible. I had to depend on hashable though…

@emiruz hashing benchmarks usually boast about throughput, not latency, it’s difficult to beat FNV on tiny inputs. You can see from the numbers above that XXH3 is maybe just a bit faster than FNV on 10 bytes.

I never looked into Core generated by xxhash-ffi. I suspect that the overhead of Haskell bindings can be reduced if someone does (accursedUnutterablePerformIO?..).

noinia · March 29, 2024, 9:45am

Since this package seems to provide non-cryptographic hash functions (for e.g. use in hashtables etc), picking Crypto seems misleading and threfore much worse module prefix than Data… [/bikeshedding]

BurningWitness · March 29, 2024, 9:50am

True, I wonder if the proper name space would then just be Hash.* for both kinds of hashes.

ReleaseCandidate · March 29, 2024, 2:41pm

@emiruz

I’ve tried (assembler SIMD) [xxhash64, not xxHash3] (xxhash package - github.com/cespare/xxhash - Go Packages) with Go, it has been faster than FNV when used as a function on the whole string, but hashing each byte “inline” using FNV has been faster. So I did not try it with Haskell sources.

So the main question is the FFI overhead from Haskell and whether the binary is compiled with (and with which) SIMD extensions.

Bodigrim · March 29, 2024, 4:41pm

It’s advertised that the latest variant, XXH3, offers improved performance across the board, especially on small data, so it should be more suitable than XXH32. Indeed my interest to revive xxhash-ffi was spurred by .

jaror · April 4, 2024, 12:51pm

I wonder if the FFI would be faster if you used unboxed arguments like this:

foreign import capi unsafe "xxhash.h XXH32_update" c_xxh32_update ::
    XXH32State     -- ^ The state to update
 -> Addr#          -- ^ 'Ptr' to the input buffer
 -> Word64#        -- ^ Buffer length
 -> IO ()

Or can GHC do this unboxing by itself?

Edit: I just checked with a simpler function and GHC should indeed do the unboxing by itself. So never mind this suggestion.

I’m still confused that the FFI would have significant overhead, because this FFI benchmark shows that there is almost no difference between a Haskell FFI call and a normal C function call.

Bodigrim · July 4, 2024, 7:56pm

xxHash has been recently integrated into hashable-1.4.5.0+, becoming a default way of hashing of bytearrays (ByteString, Text, etc.) in Haskell ecosystem. At this point xxhash-ffi package is basically obsolete, unless for some reason you want to use low-level bindings.

Kleidukos · July 4, 2024, 9:06pm

Maybe it would safeguard against misuse to remind people that Hashable makes no guarantee when it comes to the stability of the hash used. Which means that for the purpose of computing hashes for the purpose of creating cache keys, or to identify things in a stable manner over a period of several years, hashable is not for you.

Topic		Replies	Views
[ANN] vector-hashtables-0.1.0.1 Announcements	4	532	September 10, 2021
[ANN] unicode-data-0.3.0: APIs to efficiently access the Unicode character database	13	1258	April 11, 2022
Implementing incremental hashing of a data type Learn	6	447	June 19, 2023
Test your knowledge of Haskell's array primitives - Haskell - kbin.social Links	6	578	August 19, 2023
Serverless Haskell with GHC WASM JSFFI + Cloudflare Workers Show and Tell	5	2043	July 2, 2024

xxHash: extremely fast non-cryptographic hash functions

Minor nitpicks

Related topics