One Billion Row Challenge in Haskell

-XStrictData might be useful as a quick check, but I think it’s better to have explicit bangs in record fields.

If you grep a large codebase, you will see the field declaration, not the LANGUAGE StrictData pragma somewhere at the top of the module. So you may mistakenly think that the field is lazy; an explicit bang disambiguates this.
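
For illustration, a minimal sketch (type and field names are made up):

    {-# LANGUAGE StrictData #-}

    -- With StrictData in force at the top of the module, this field is
    -- strict even though the declaration itself does not say so:
    data Implicit = Implicit { implicitCount :: Int }

    -- An explicit bang keeps that fact next to the field, where a grep
    -- will find it (redundant under StrictData, but visible):
    data Explicit = Explicit { explicitCount :: !Int }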

BTW, a related performance pitfall is to declare Record { field :: {-# UNPACK #-} !Int } and then pass the field to a non-inlined function that expects a boxed value. This causes allocation (re-boxing) and may actually reduce performance.
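
A hedged sketch of the pitfall (names are hypothetical): the unpacked field lives inside the constructor as a raw machine word, so handing it to a non-inlined function that takes a boxed Int forces GHC to allocate a fresh box at every call.

    data Record = Record { field :: {-# UNPACK #-} !Int }

    -- Stands in for any function whose unfolding is unavailable
    -- (defined in another module, too large to inline, ...):
    {-# NOINLINE expensive #-}
    expensive :: Int -> Int
    expensive n = n * 2 + 1

    -- Each call must re-box the unpacked word into an Int constructor
    -- before passing it, allocating on every call:
    useRecord :: Record -> Int
    useRecord r = expensive (field r)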

There’s no royal road to performance. Everything-lazy is usually faster, but you need to take care with accumulators, and it may not be faster in raw number crunching. Making everything strict kills a lot of expressiveness and, counter-intuitively, usually makes things slower.
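
The classic accumulator case, as a minimal sketch (a plain list fold; names are hypothetical):

    {-# LANGUAGE BangPatterns #-}

    sumLoop :: [Int] -> Int
    sumLoop = go 0
      where
        -- With the bang, acc is forced at each step and the loop runs
        -- in constant space; without it, acc grows into a huge chain
        -- of (+) thunks before anything is evaluated.
        go !acc (x:xs) = go (acc + x) xs
        go acc  []     = acc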

BTW #2: For the “Haskell must be strict” folk, see a take from Edward Kmett on how laziness is necessary for performance in a pure language (basically a short description of the chapter “Amortization and persistence via lazy evaluation” from Okasaki’s book).

Also for the “Haskell must be strict” folk:

…they may also be interested in:

FWIW, it is unlikely this was the case with GHCs of at least the 9.0 series, possibly earlier. The Simplifier would drop the bang in go !acc [] = acc as a redundant eval. A recent patch stopped doing so, because dropping it can sometimes make a measurable difference when evaluating the list down to [] takes a long time and acc retains a huge closure (a space leak).

Interesting to know. It was the case with the GHC 7.x series. But I oversimplified my example (to the point of making it wrong, as you pointed out). It should be:

    go acc  []     = f acc
    go !acc (x:xs) = ...
    -- is frequently faster than
    go !acc []     = f acc
    go !acc (x:xs) = ...

Sometimes f might not even need acc, so without the bang it is ignored (never evaluated) entirely.
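
Spelled out as a compilable sketch (f and the fold body are hypothetical stand-ins): when f never forces its argument, the unbanged [] equation lets the final thunk be thrown away, whereas a bang there would evaluate it for nothing.

    {-# LANGUAGE BangPatterns #-}

    -- A finaliser that never looks at the accumulator:
    f :: Int -> Int
    f _ = 42

    run :: [Int] -> Int
    run = go 0
      where
        go acc  []     = f acc            -- no bang: the final thunk stays unevaluated
        go !acc (x:xs) = go (acc + x) xs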

If somebody wants a reference point: my (less optimised) parallel Go version runs in 3.2s, and this Go program ported to C is at 2.5s. Interestingly, both need about the same number of lines of code, roughly 350.

What kind of machine are you running your benchmarks on?

On mine (Hetzner EX101, which uses an i9-13900), it shows more or less the same results as András’ version:

    -- num_threads = 8
    User time (seconds): 14.14
    System time (seconds): 0.62
    Percent of CPU this job got: 712%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.07
    -- num_threads = 24
    User time (seconds): 20.32
    System time (seconds): 1.00
    Percent of CPU this job got: 1739%  -- not 2400%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.22 -- mmap limit?

It cannot run faster than 1.2s on 24 threads due to mmap. And it runs in about 2.1s on 8 threads, much like mine and András’ versions (2.1s was a lucky case; it’s usually 2.4-2.7s, probably because there are just 8 chunks).

Do you have Intel, or ARM (where GHC codegen might not be as optimized)?

ARM64 (M1 Max). Thanks for your tests.

Fun perf golfing in this thread; I’ve just scrolled through a month’s worth of iterations.

One minor comment: in the past I got helpful data on problems like this (parsing large files) by benchmarking the bare minimum and incrementally doing more. Benchmark a program that does nothing. Then one that opens and closes the file. One that fseeks across it to the end. One that copies buffers in a loop. All before doing the actual work of parsing or building data structures. If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”. I’ve found this is sometimes easier than writing something high-level and then trying to wring speed out of it.
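
For example, the “copies buffers in a loop” step might look like the sketch below (file name and chunk size are arbitrary choices, not anything from the thread):

    import System.IO
    import Foreign.Marshal.Alloc (allocaBytes)

    -- Read the whole file through one reused 1 MiB buffer and do
    -- nothing with the bytes: the bare-minimum I/O baseline.
    main :: IO ()
    main = withBinaryFile "measurements.txt" ReadMode $ \h ->
      allocaBytes chunkSize $ \buf ->
        let loop = do
              n <- hGetBuf h buf chunkSize
              if n > 0 then loop else pure ()
        in loop
      where
        chunkSize = 1024 * 1024

Compiled with -O2 and run with +RTS -s, this should report next to no allocation; if it doesn’t, you have found where the program leaves C parity before any parsing has even started.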

If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”.

But for a great many people, Haskell will always “leave C parity too early” for their liking, no matter how “close to C parity” Haskell can be brought… maybe they’ll just be happier with Rust, and the imminent arrival of its second working compiler :-D

This thread isn’t for those people, so I don’t think it matters too much what they think. :smiley: They’ve already got C! And Rust!

Plus, these really specialized (dare I say “not real-world”) algorithms that read data and do the simplest mathematical operations on it are definitely a place where Haskell should be able to compete. True, the Haskell solution may end up hyper-specialized, but a lot of generally useful know-how, data structures, and algorithms will come out of the exercise.

This whole thread really warms my heart.

vector-hashtables-0.1.2.0 has just been released, offering a bunch of performance improvements.

In the meantime, hashable-1.4.5.0 has switched to a faster hashing algorithm.

I wonder if we can re-evaluate “naive” Haskell solutions based on hashable / vector-hashtables: are they any better now?
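
As a strawman for such a re-evaluation, a deliberately naive single-threaded version might look roughly like this sketch, using strict HashMaps from unordered-containers (which sit on top of hashable). The read-based parse and the whole-file readFile are naive on purpose; a real run over a billion rows would want streaming, and something like vector-hashtables instead:

    {-# LANGUAGE BangPatterns #-}

    import qualified Data.ByteString.Char8 as BS
    import qualified Data.HashMap.Strict as HM
    import Data.List (foldl')

    -- Running min / max / sum / count per station.
    data Stats = Stats !Double !Double !Double !Int deriving Show

    step :: HM.HashMap BS.ByteString Stats -> BS.ByteString -> HM.HashMap BS.ByteString Stats
    step !m line =
      let (name, rest) = BS.break (== ';') line
          v = read (BS.unpack (BS.drop 1 rest)) :: Double  -- deliberately naive parse
          upd _ (Stats lo hi s n) = Stats (min lo v) (max hi v) (s + v) (n + 1)
      in HM.insertWith upd name (Stats v v v 1) m

    main :: IO ()
    main = do
      rows <- BS.lines <$> BS.readFile "measurements.txt"
      mapM_ print (HM.toList (foldl' step HM.empty rows))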

If anyone does this: do me a favor and include brokkr-hashtables in the benchmark. Only on GitHub for now, but my benchmarks show it to be a bit faster than the other Haskell alternatives.

It would be great if someone could summarize the results of this thread, e.g. benchmarks of the naive implementation vs. highly optimized variants, and ideally vs. the fastest and naive Java and C implementations, to get a feeling for the relative speeds benchmarked on the same computer.
My hope is that we can place these results somewhere visible like haskell.org, to have another prominent example of how fast Haskell can become when you need performance.

The C version (the fastest one in this table) is nowhere near the fastest possible (not comparable with the work András did on his Haskell version). The fastest C (well, SIMD) solution does not run on my ARM64.
