-XStrictData might be useful as a quick check, but I think it’s better to have explicit bangs in record fields.
If you grep a large codebase, you will see the field declaration, not the LANGUAGE StrictData pragma somewhere at the top of the module, so you may mistakenly think that the field is lazy. An explicit bang helps to disambiguate this.
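As a minimal illustration (the Counter types are made up for this example), both declarations below give a strict field when StrictData is on, but only the second one says so at the declaration you actually find when grepping:

{-# LANGUAGE StrictData #-}

-- With -XStrictData in force, count is strict even though the declaration
-- shows no bang, which is easy to miss when grepping for the field alone.
data Counter = Counter { count :: Int }

-- The explicit bang says the same thing right at the declaration site,
-- with or without StrictData.
data Counter' = Counter' { count' :: !Int }

main :: IO ()
main = print (count (Counter 1), count' (Counter' 2))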
BTW, a related performance pitfall is to use Record { field :: {-# UNPACK #-} !Int } and then send the field to a non-inlined function that expects a boxed value. This will cause allocation (boxing) and may actually reduce performance.
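A rough sketch of that pitfall (Record and consume are hypothetical): the unpacked field has to be re-boxed every time it is handed to a function that did not inline and that takes a lifted Int.

data Record = Record { field :: {-# UNPACK #-} !Int }

-- NOINLINE stands in for "some function in another module that did not
-- inline". It takes a boxed Int, so reading the unpacked field and passing
-- it here forces a fresh I# box to be allocated on each call.
consume :: Int -> Int
consume x = x + 1
{-# NOINLINE consume #-}

main :: IO ()
main = print (sum (map (consume . field) [Record i | i <- [1 .. 1000]]))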
There’s no royal road to performance. Everything-is-lazy is usually faster, but you need to care about accumulators, and it may not be faster in raw number crunching. Making everything strict kills a lot of expressiveness and, counter-intuitively, usually makes things slower.
BTW #2: For the “Haskell must be strict” folk, see a take from Edward Kmett on how laziness is necessary for performance in a pure language (basically a short description of the chapter “Amortization and persistence via lazy evaluation” from Okasaki’s book).
FWIW, it is unlikely this was the case with GHCs from at least the 9.0 series, possibly earlier. The Simplifier would drop the bang in go !acc [] = acc as a redundant eval. A recent patch stopped doing so, because keeping the bang can sometimes make a measurable difference when evaluating the list to [] takes a long time and acc has built up a huge closure (a space leak).
Interesting to know. It was the case with the GHC 7.x series. But I oversimplified my example (to the point of making it wrong, as you pointed out). It should be:
go acc [] = f acc
go !acc (x:xs) ...
-- is frequently faster than
go !acc [] = f acc
go !acc (x:xs) ...
Sometimes, f might not even need acc, so it will be ignored entirely without a bang.
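A tiny made-up example of that last point (foldCount is not from the thread): the accumulator stays strict while the list is consumed, but the final clause leaves it to f to decide whether the last thunk is ever forced.

{-# LANGUAGE BangPatterns #-}

foldCount :: (Int -> r) -> [Int] -> r
foldCount f = go 0
  where
    go acc  []     = f acc               -- no bang: f may ignore acc
    go !acc (x:xs) = go (acc + x) xs     -- strict while consuming the list

main :: IO ()
main = do
  print (foldCount id [1 .. 10])                    -- acc is needed: 55
  putStrLn (foldCount (const "ignored") [1 .. 10])  -- final thunk never forced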
If somebody wants a reference: my (less optimised) parallel Go version runs in 3.2s, and that Go program ported to C runs in 2.5s. Interestingly, both need about the same number of lines of code, roughly 350.
What kind of machine are you running your benchmarks on?
On mine (a Hetzner EX101 with an i9-13900), it shows more or less the same results as András’ version:
-- num_threads = 8
User time (seconds): 14.14
System time (seconds): 0.62
Percent of CPU this job got: 712%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.07
-- num_threads = 24
User time (seconds): 20.32
System time (seconds): 1.00
Percent of CPU this job got: 1739% -- not 2400%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.22 -- mmap limit?
It cannot run faster than 1.2s on 24 threads due to mmap. And it runs in about 2.1s on 8 threads, much like mine and András’ versions (2.1s was a lucky case; it’s usually 2.4-2.7s, probably because there are just 8 chunks).
Do you have Intel or ARM (where GHC codegen might not be as optimized)?
Fun perf. golfing in the thread, I’ve just scrolled through a month worth of iterations.
One minor comment: in the past I got helpful data on problems like this (parsing large files) by benchmarking doing the bare minimum and incrementally doing more. Benchmark a program that does nothing. Then a program that opens and closes the file. One that fseeks across it to the end. One that copies buffers in a loop. Before doing the actual work parsing or build data structures. If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”. I’ve found this is sometimes easier than writing something high level and then trying to wring speed out of it.
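As a concrete starting point, a “copy buffers in a loop” baseline might look like the sketch below (file name and chunk size are placeholders, not from the thread); if even this allocates noticeably or trails the equivalent C loop, the parser is not the problem yet.

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as BS
import System.IO

-- Baseline: stream the file in 1 MiB chunks and count the bytes, no parsing.
main :: IO ()
main = do
  h <- openBinaryFile "measurements.txt" ReadMode
  let loop !total = do
        chunk <- BS.hGetSome h (1024 * 1024)
        if BS.null chunk
          then pure total
          else loop (total + BS.length chunk)
  total <- loop (0 :: Int)
  hClose h
  print total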
If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”.
But for a great many people, Haskell will always "leave C parity too early” for their liking no matter how “close to C parity” Haskell can be brought…maybe they’ll just be happier with Rust, and the imminent arrival of its second working compiler :-D
This thread isn’t for those people, so I don’t think it matters too much what they think. They’ve already got C! And Rust!
Plus, these really specialized (dare I say “not real-world”) algorithms that read data and do the simplest possible mathematical operations on them are definitely a place where Haskell should be able to compete. True, the Haskell solution may end up hyper-specialized, but a lot of generally useful know-how, data structures, and algorithms will come out of the exercise.
If anyone does this: do me a favor and include brokkr-hashtables in the benchmark. It’s only on GitHub for now, but my benchmarks show it a bit faster than the other Haskell alternatives.
It would be great if someone could summarize the results of this thread, e.g. benchmarks of the naive implementation vs highly optimized variants, and ideally vs the fastest and naive Java and C implementations, to get a feeling for the relative speeds benchmarked on the same computer.
My hope is that we can place these results in a visible place like haskell.org to have another prominent example on how fast Haskell can become when you need performance.
The C version (the fastest one in this table) is nowhere near the fastest possible (not comparable with the work András did in his Haskell version). The fastest C (well, SIMD) solution does not run on my ARM64.
Even though it’s not the point and arguably not the right tool (the original challenge required you to use the standard library or standard-library-looking primitives), I thought it’d be fun to baseline what performance looks like on a larger-than-memory dataset with dataframe (of course). I have an experimental branch which is still a major WIP with some kinks (e.g. the mean logic is off but easily fixable).
main :: IO ()
main = do
  let df = LD.scanCsv "../1brc/measurements.txt"
  -- Column reference for the temperature readings.
  let temp = F.col @Double "temperature"
  df' <-
    df
      |> LD.withSpill 512_000
      |> LD.groupByAggregate
           ["names"]
           [ "minimum" .= F.minimum temp
           , "mean"    .= F.mean temp
           , "maximum" .= F.maximum temp
           ]
      |> LD.runDataFrame
  print df'
On a 2018 Dell Latitude 5490 with 8GB of RAM, using the full 1-billion-row dataset (13GB), this finishes in ~30 minutes. The approach entails sharding the CSV first, then accumulating and aggregating the dataframes. Overall I’m glad the computer could handle it, as it usually taps out when programs consume ~3GB.
A few areas for improvement:
Sharding into CSV (and reading back) is a big bottleneck - it would be better to use a columnar format like Parquet. I have the reader ready but haven’t had the patience to start on the writer. That should shave off a lot of the time spent on I/O.
There is a lot of corner cutting to make this work. Allowing groupbys in arbitrary positions (instead of at the end) with complex aggregations might be complicated. This work is a little more exciting (to me at least).
Specifying the schema up front would really make things faster and we could probably fail earlier (I think).
An aside: withFile is silly. If there is an IO exception on an unrelated temp file inside the withFile, it bubbles up as though it belongs to the “main” file.
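If I’m reading that right, a hypothetical repro would be something like the sketch below (file names are made up): the open that fails is for an unrelated temp path, yet the exception escaping withFile can end up reported against the outer file.

import System.IO

-- Made-up repro of the aside above: the failing open is for an unrelated
-- temp path, but on base versions where withFile annotates escaping
-- exceptions, the error message points at "main.txt" instead.
main :: IO ()
main =
  withFile "main.txt" WriteMode $ \_h -> do
    _tmp <- openFile "/nonexistent/dir/scratch.tmp" ReadMode
    pure ()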