One Billion Row Challenge in Haskell

This is almost -25% on my computer (from 4.5s down to 3.5s).

1 Like

This is already incredibly fast! And it does not seem to allocate much at all.

I just ran perf record --call-graph on the binary (I passed -g3 to have debug symbols and compiled without LLVM). Just some data:

  • 10% of time is spent in findSemi; 30% in parse; 60% in updateTable.
  • perf stat reports that 46% of branches were mispredicted.
  • Within $wupdateTable, the isEmptySLB check appears to be responsible for these branch mispredictions. At least that’s the impression when I “Annotate” updateTable: there is a red zero test (test %r14, %r14) just before what appears to be an inlined call to readEntry. I think this implies that there are a lot of collisions, because if there weren’t, isEmptySLB would very often be True. Alas, the code generator lays out the if so that the collision case comes first! I’m not sure how to convince it not to do that.
2 Likes

isEmptySLB should be almost always False, regardless of collisions, because we only have 8800 distinct keys for 1 billion insertions. With the current table configuration, we have 64k slots and barely any collisions. Concretely, for 500M rows on my machine I get a maximum scan length of 3 and an average scan length of less than 0.1. So I dunno why isEmptySLB gets that many mispredictions.

Btw I have never used perf for Haskell, is there a reference on how to do it?

The perf measurements are also kinda at odds with our manual measurements, where just parsing the input without updating the tables was still 70% of the total runtime (on a single thread).

1 Like

As in, on average <10% of entries involve a scan. Yeah, concretely on mine for 1B rows there were 75M hops. So those numbers tally: under those conditions I got a significant improvement by re-hashing instead of scanning, and an even more significant improvement by reversing query and key arrays before comparing keys.
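To make the two probe strategies concrete, here is a minimal sketch (not the actual code from the thread; the names and the way the step is derived from the hash are made up):

import Data.Bits ((.&.), (.|.), shiftR)

-- Linear scanning: on a collision, simply try the adjacent slot.
nextSlotScan :: Int -> Int -> Int
nextSlotScan capacity i = (i + 1) .&. (capacity - 1)

-- Re-hashing (double hashing): derive the probe step from the hash itself,
-- keeping it odd so it stays coprime with the power-of-two capacity.
nextSlotRehash :: Int -> Int -> Int -> Int
nextSlotRehash capacity h i = (i + step) .&. (capacity - 1)
  where step = (h `shiftR` 16) .|. 1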

This version is now the fastest multi-threaded Haskell version so far:

Using 11 threads (which yields the fastest times), the time is 3.5s:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.596 s ±  0.032 s    [User: 28.265 s, System: 1.727 s]
  Range (min … max):    3.567 s …  3.646 s    5 runs
2 Likes

FTR, I convinced GHC to lay out the then branch of the isEmptySLB check before the else branch by defining

isEmptySLB (SLB# a _ _) = xor (-13) a + 13 <= 0

but it had no effect. Now the instruction mov %rax,0x60(%rsp), which moves the length onto the stack, is red… No idea how to interpret that; other memory writes don’t seem to be that expensive according to perf report. The branch mispredictions likely occur because there are as many updates as there are initial insertions.

I tried reading just the length word, but of course we often need the whole entry anyway, for the key == oldkey case.

Btw I have never used perf for Haskell, is there a reference on how to do it?

I don’t know of any. I simply ran it (both perf stat at first, but now I’m using perf record --call-graph and perf report), then reran again after passing -g3 to GHC to generate debug symbols. That seems to be enough to get a function-level profile, but apparently the instruction-level annotations (which you get by pressing Enter on a function name and then choose “Annotate …”) are not really usable. Not sure what goes on here.
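Concretely, the whole workflow was only a few commands, roughly like this (exact flags are from memory; --call-graph usually wants a mode such as dwarf):

$ ghc -O2 -g3 Main.hs                    # -g3 emits DWARF debug info
$ perf record --call-graph=dwarf ./Main
$ perf report                            # Enter on a function, then choose “Annotate”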

2 Likes

Sure. But I usually tried -XStrict on a program that already had bangs in the necessary places (accumulators, fields). Adding more bangs usually makes things worse.

The opposite can also be true: -XStrict on things like head . sort can turn linear time/space into log-linear.
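A toy way to watch the lazy half of that happen (counting comparisons via trace is purely for illustration):

import Data.List (sortBy)
import Debug.Trace (trace)

-- Each comparison announces itself on stderr.
noisyCompare :: Int -> Int -> Ordering
noisyCompare a b = trace "cmp" (compare a b)

main :: IO ()
main = print (head (sortBy noisyCompare [5,3,8,1,9,2,7]))
-- Lazily, only about O(n) "cmp" traces fire to produce the minimum;
-- forcing the entire sorted list would cost O(n log n) comparisons.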

-XStrictData can likewise keep much more data live and overload the GC. In general, -XStrict is a rather dangerous sledgehammer that doesn’t always work (it never did in my case).

Sorry, perhaps I used the wrong word (although a pun on instruction reordering was intended :slight_smile: ). I meant the ability to do evaluation where it’s needed, not where the bang is:

import Debug.Trace (trace)
import System.Random (randomRIO)

main :: IO ()
main = do
  x <- randomRIO (0,9) :: IO Int
  let a = trace "a" x
      b = trace "b" x
  if x < 5 then
    print a
  else
    print b

Both a and b will always be evaluated if you add -XStrict.

Superfluous bangs tend to slow things down for the same reason (they force evaluation of things that aren’t necessary):

go acc [] = acc
go !acc (x:xs) ...
-- is frequently faster than
go !acc [] = acc
go !acc (x:xs) ...
--------------
go !acc rarelyUsedArg (x:xs) ...
-- is frequently faster than
go !acc !rarelyUsedArg (x:xs) ...

You can look at the source of Text.HTML.TagSoup.Fast. The code is more than a decade old and ugly (I would write it with pattern synonyms now), but you can notice the lack of bangs in the non-recursive cases, and it helped the code perform better, at least back then.

I have a feeling (I cannot prove it) that bangs defeat the strictness analyzer and lead to suboptimal code generation. GHC knows better when to evaluate if it’s sure that evaluation is necessary; bangs force GHC to do it in the wrong places (and maybe multiple times).

Bangs should be used sparingly, to give GHC a hint that you don’t want to return a lazy accumulator, and let the strictness analyzer do the rest.

2 Likes

Oh, I beg your pardon. I misread -XStrict as -XStrictData. I agree that -XStrict is a dangerous sledgehammer (I recommend -XStrictData without much reservation though).

2 Likes

…but is -XStrictData really all that much better?

CONS Should Not Evaluate its Arguments (1976)

1 Like

-XStrictData might be useful as a quick check, but I think it’s better to have explicit bangs in record fields.

If you grep a large codebase, you will see the field declaration, not the LANGUAGE StrictData pragma somewhere at the top of the module. So you may mistakenly think that the field is lazy. An explicit bang helps to disambiguate this.
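A contrived illustration:

{-# LANGUAGE StrictData #-}
-- Easy to miss when grep drops you on line 500 of the module:

data Config  = Config  { retries  :: Int }  -- strict here, but looks lazy
data Config' = Config' { retries' :: !Int } -- the bang is visible where you read it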

BTW, a related performance pitfall is to use Record { field :: {-# UNPACK #-} !Int } and then send the field to a non-inlined function that expects a boxed value. This will cause allocation (boxing) and may actually reduce performance.
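A minimal sketch of that pitfall (names are made up; NOINLINE stands in for any call that doesn’t get inlined):

data Record = Record { field :: {-# UNPACK #-} !Int }

consume :: Int -> Int        -- expects a boxed Int
consume x = x + 1
{-# NOINLINE consume #-}

use :: Record -> Int
use r = consume (field r)    -- the unboxed Int# stored in the record
                             -- must be re-boxed here, which allocates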

There’s no royal road to performance. “Everything is lazy” is usually faster, but you need to take care with accumulators, and it may not be faster at raw number crunching. Making everything strict kills a lot of expressiveness and, counter-intuitively, usually makes things slower.

BTW #2: For the “Haskell must be strict” folk, see a take from Edward Kmett on how laziness is necessary for performance in a pure language (basically a short description of the chapter “Amortization and persistence via lazy evaluation” from Okasaki’s book).

3 Likes

Also for the “Haskell must be strict” folk:

…they may also be interested in:

FWIW, it is unlikely this was still the case with GHCs of at least the 9.0 series, possibly before. The Simplifier would drop the bang in go !acc [] = acc as a redundant eval. A recent patch stopped doing so, because that can sometimes make a measurable difference when evaluating the list to [] takes a long time and acc has a huge closure (a space leak).

3 Likes

Interesting to know. It was the case with the GHC 7.x series. But I oversimplified my example (to the point of making it wrong, as you pointed out). It should be

go acc [] = f acc
go !acc (x:xs) ...
-- is frequently faster than
go !acc [] = f acc
go !acc (x:xs) ...

Sometimes, f might not even need acc, so it will be ignored entirely without a bang.

If somebody wants a reference: my (less optimised) parallel Go version runs in 3.2s, and this Go program ported to C is at 2.5s. Interestingly, both need about the same number of lines of code, about 350.

What kind of machine are you running your benchmarks on?

On mine (Hetzner EX101, which uses an i9-13900), it shows more or less the same results as András’ version:

    -- num_threads = 8
    User time (seconds): 14.14
    System time (seconds): 0.62
    Percent of CPU this job got: 712%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.07
    -- num_threads = 24
    User time (seconds): 20.32                                                
    System time (seconds): 1.00                                               
    Percent of CPU this job got: 1739%  -- not 2400%                              
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.22 -- mmap limit?       

It cannot run faster than 1.2s on 24 threads due to mmap. And it runs in about 2.1s on 8 threads, much like mine and András’ versions (2.1s was a lucky case; it’s usually 2.4-2.7s, probably because there are just 8 chunks).

Do you have Intel, or ARM (where GHC codegen might not be as optimized)?

2 Likes

ARM64. (M1 Max). Thanks for your tests.

Fun perf golfing in this thread; I’ve just scrolled through a month’s worth of iterations.

One minor comment: in the past I got helpful data on problems like this (parsing large files) by benchmarking the bare minimum and incrementally doing more. Benchmark a program that does nothing. Then a program that opens and closes the file. One that fseeks across it to the end. One that copies buffers in a loop. All before doing the actual work of parsing or building data structures. If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”. I’ve found this is sometimes easier than writing something high level and then trying to wring speed out of it.
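For instance, the “copies buffers in a loop” baseline can be as small as this (a sketch; the file name and chunk size are arbitrary):

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString as BS
import System.IO

main :: IO ()
main = do
  h <- openBinaryFile "measurements.txt" ReadMode
  let loop !n = do
        chunk <- BS.hGetSome h (64 * 1024)
        if BS.null chunk then pure n else loop (n + BS.length chunk)
  print =<< loop (0 :: Int)
-- If even this allocates heavily or lags C, something is leaving
-- C parity before any parsing happens.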

16 Likes

If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”.

But for a great many people, Haskell will always “leave C parity too early” for their liking no matter how “close to C parity” Haskell can be brought… maybe they’ll just be happier with Rust, and the imminent arrival of its second working compiler :-D

This thread isn’t for those people, so I don’t think it matters too much what they think. :smiley: They’ve already got C! And Rust!

Plus, these really specialized (dare I say “not real-world”) workloads that read data and do the literal simplest mathematical operations on it are definitely a place where Haskell should be able to compete. True, the Haskell solution may end up hyper-specialized, but a lot of generally useful know-how, data structures, and algorithms will come out of the exercise.

This whole thread really warms my heart.

10 Likes

vector-hashtables-0.1.2.0 has just been released, offering a bunch of performance improvements.

8 Likes