Is there a measurable difference on modern CPUs between `x * 6` and ``x `shiftL` 2 + x `shiftL` 1``?
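For concreteness, the two candidates as Haskell definitions (a minimal pair one could feed to a micro-benchmark; note that optimizing compilers frequently do this strength reduction on their own, so any difference has to be measured):

```haskell
import Data.Bits (shiftL)

-- 6*x == 4*x + 2*x, so the shift form decomposes the multiply.
timesSix :: Int -> Int
timesSix x = x * 6

timesSixShift :: Int -> Int
timesSixShift x = (x `shiftL` 2) + (x `shiftL` 1)
```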
Using the power-of-2 table shaved off a few seconds, but the rest didn't really change things.
It now takes only 33.088s.
I swapped out hashing and comparison. Hashing doesn't make a difference; comparison has a small benefit. I don't have enough RAM for 1 billion rows, so I'm testing with 100 million instead. It's 3.28s for me now.
Things we could try:
- Store the unique strings in a separate array. Currently our table keys are pointers into the huge input string, and I imagine they are randomly strewn across lots of different pages.
- Try to minimize table size. With 10k keys we should be able to fit into L2. Pack entries more tightly, use a higher load factor, but maybe also store some of the hash bits and compare hashes during scanning (which makes scanning faster, so we can afford longer scans); see the sketch below.
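A sketch of the hash-caching part of that second idea, assuming an open-addressing table whose entries we control (names are illustrative, not the actual code):

```haskell
import Data.ByteString (ByteString)

-- Running statistics per station: min, max, sum, count.
data Acc = Acc !Int !Int !Int !Int

-- Hypothetical entry layout: cache the full hash next to the key so a
-- probe can reject mismatches without touching the (cold) key bytes.
data Entry = Entry
  { entryHash :: !Int        -- cached hash of the key
  , entryKey  :: !ByteString -- pointer into the input (or a copy)
  , entryAcc  :: !Acc
  }

-- Compare cached hashes first; only do the byte-wise comparison
-- when they agree, which keeps longer scans affordable.
matches :: Int -> ByteString -> Entry -> Bool
matches h k e = entryHash e == h && entryKey e == k
```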
With `bytestring-mmap` I'm able to get the time down to 25.147s (I did have to fix a breaking change in that package to make it work with GHC 9.8).
Edit: I just found out that `mmap` is a more up-to-date package that should probably be used instead.
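For reference, mapping the whole file with the `mmap` package is essentially a one-liner; a minimal sketch (passing `Nothing` maps the entire file, and the resulting `ByteString` aliases the mapping, so pages are faulted in on demand rather than copied up front):

```haskell
import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

main :: IO ()
main = do
  input <- mmapFileByteString "measurements.txt" Nothing
  print (BS.length input)
```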
Edit: I've updated my gist with the 25 second version.
Edit: With your hash and equality it becomes 22.596s.
I've updated my gist to include multi-threading. I still see that it's about 2-3 times slower than the 7.c version. This time I ran it on 1B lines on a 32-core machine:
```
~/OneBRC $ `which time` --verbose `cabal exec which OneBRC` +RTS -N32 -qg >out.txt
	User time (seconds): 54.29
	System time (seconds): 4.99
	Percent of CPU this job got: 2159%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.74
~/OneBRC $ `which time` --verbose ./7c >7c-out.txt
	User time (seconds): 18.05
	System time (seconds): 0.72
	Percent of CPU this job got: 1187%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.58
```
There's another source file, `analyse.c`, which runs in about 0.8s (3.5 times faster), but it doesn't produce correct results.
The single-threaded version runs in 28s (but it's quite a fast machine):
```
~/OneBRC $ `which time` --verbose `cabal exec which OneBRC` +RTS -N1 >out.txt
	User time (seconds): 26.33
	System time (seconds): 1.75
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:28.08
```
So more or less idiomatic Haskell is 2-3 times slower than C (it's possible to drop the custom hash table, use `unordered-containers`, and still be in the 2-4 times range; the only important optimization is using unboxed arrays for counters).
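For illustration, the unboxed counters might look like the following; a sketch with `Data.Array.IO`, using one flat `IOUArray` per statistic so updates mutate raw `Int` cells instead of allocating boxed values (names and layout are my own, not the benchmarked code):

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)

-- One flat unboxed array per statistic, indexed by station slot.
data Counters = Counters
  { cMin   :: !(IOUArray Int Int)
  , cMax   :: !(IOUArray Int Int)
  , cSum   :: !(IOUArray Int Int)
  , cCount :: !(IOUArray Int Int)
  }

newCounters :: Int -> IO Counters
newCounters n =
  Counters <$> newArray (0, n - 1) maxBound  -- min starts high
           <*> newArray (0, n - 1) minBound  -- max starts low
           <*> newArray (0, n - 1) 0
           <*> newArray (0, n - 1) 0

-- Fold one parsed temperature (in tenths of a degree) into slot i.
record :: Counters -> Int -> Int -> IO ()
record cs i t = do
  readArray (cMin cs) i >>= writeArray (cMin cs) i . min t
  readArray (cMax cs) i >>= writeArray (cMax cs) i . max t
  readArray (cSum cs) i >>= writeArray (cSum cs) i . (+ t)
  readArray (cCount cs) i >>= writeArray (cCount cs) i . (+ 1)
```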
Just found that the 1.53s number to beat was achieved on 8 cores:
```
~/OneBRC $ `which time` --verbose `cabal exec which OneBRC` +RTS -N8 >out.txt
	User time (seconds): 33.90
	System time (seconds): 2.50
	Percent of CPU this job got: 597%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.09
```
That's about a 4× difference, give or take (the machines are not the same). It cannot use the full 800% CPU, so it's probably I/O or memory bound (or maybe GHC-runtime bound).
I'm trying to get all versions running for comparison, and will post the results here. But that will take some time, as I also have to do "real work", which interferes too much with the benchmark results; running them alongside my work would not be a fair comparison.
Yes, I've seen the same thing with Bodigrim's [sorry, confused you] multithreaded version; it used only 6 (of 10) cores too.
For comparison, the fastest multithreaded Go version (that I have run), by Alexander Yastrebov, takes 4 seconds on my computer.
I ran all benchmarks 3 times; every benchmark consists of one warmup (not timed) and 5 timed runs. The fastest time of these 3 is what I've posted here, rounded to the nearest 1/2 second (in absolute terms). As a last step I diffed the output against the reference output.
Everything that doesn't need 9.8 runs on GHC 9.6.4 (the ghcup version); 9.8.2 for the ones that do.
I've set `-threaded -O2 -with-rtsopts=-N` and every other option needed by your executables.
Yours is the first one, because it doesn't need any additional fancy command line arguments to GHC.
It used 7 out of 10 cores and 333MB RAM at most. Time: 11.5s. The output is correct except for negative zeroes (`-0.0`) and a missing newline at the end.
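In case it helps, the negative zeroes are easy to squash before formatting; a minimal sketch relying on IEEE `-0.0 == 0.0`:

```haskell
-- (-0.0) == 0.0 holds for Double, so this maps -0.0 to 0.0 and
-- leaves every other value unchanged.
fixZero :: Double -> Double
fixZero x = if x == 0 then 0 else x
```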
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     11.321 s ±  0.062 s    [User: 85.505 s, System: 2.347 s]
  Range (min … max):   11.233 s … 11.385 s    5 runs
```
András Kovács' version: GHC 9.8.2 with the option `-fllvm` added to the default ones.
Does not work with `-fllvm` on my computer; the error is:

```
opt: Unknown command line argument '-enable-new-pm=0'. Try: 'opt --help'
opt: Did you mean '--enable-newgvn=0'?
```
This is with clang 17.0.6. Using Apple's default LLVM, it doesn't work either:

```
<no location info>: error:
    Warning: Couldn't figure out LLVM version!
             Make sure you have installed LLVM between [11 and 16)
```
While I can't say I appreciate that the macOS version of GHC doesn't run with the default LLVM, I do appreciate the use of a paren for the non-inclusive interval end.
So, installing LLVM 15 and using that (some minor tweaking later …), the executable throws a runtime error:

```
exe-exe: app/Main.hs:153:10-17: No instance nor default method for class operation setByteArray#
```
Jaror's version: GHC 9.8.2 with the option `-fllvm` added to the default ones.
Does not compile because of the missing `keepAliveUnlifted`.
I've removed that and updated my gist. It was kind of silly to use it anyway.
Thanks, it compiles now, but throws the same runtime error as Ondráš'.
Ah, primitive-0.9.0.0 changed that into a default method. You should be able to fix that error by changing the bounds to:

```
build-depends: base >= 4.19, bytestring, primitive >=0.9, mmap
```
Now both of your executables compile, and here are the first benchmarks (the "real", faster ones I'm going to post in the evening; right now there is a VM running, but there is still 16GB RAM free).
Your program is acting strangely: it needs a bit less RAM (55MB) than the last one I ran, but it runs for more than 10 minutes (600s); I stopped it after 10 minutes of runtime.
Kovács András' version: needs 14.9GB (the data file) of RAM, a single thread, and about 39s. This is the fastest single-threaded version by far, 2.4 times as fast as my version. The results are not sorted, so I can't (well, I don't want to) compare them with the reference, but the number of stations is correct and the values do look correct at first glance. There are also negative zeros (`-0.0`) in the results.
I see I reduced the hashtable size to 0x1000 while testing. That may have been too small and caused an infinite loop when trying to insert a new entry. I've expanded it again in my gist. Can you try again?
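For anyone following along, here is why a too-small open-addressing table hangs rather than crashing (an illustrative sketch, not the gist's actual code): with linear probing and no resize, inserting into a table whose slots are all taken by other keys scans forever for an empty slot.

```haskell
import Data.Bits ((.&.))

-- Linear probing over a power-of-two capacity. If all `cap` slots
-- are occupied by other keys, `go` cycles through them forever:
-- there is no empty slot to terminate the scan.
findSlot
  :: (Int -> IO Bool)  -- is this slot empty?
  -> (Int -> IO Bool)  -- does this slot hold the key we want?
  -> Int               -- capacity (a power of two)
  -> Int               -- hash of the key
  -> IO Int
findSlot isEmpty holdsKey cap h = go (h .&. (cap - 1))
  where
    go i = do
      empty <- isEmpty i
      if empty
        then pure i
        else do
          hit <- holdsKey i
          if hit then pure i else go ((i + 1) .&. (cap - 1))
```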
Of course. Way better now! 36s, and your version is now the fastest one (3s faster than the one by András)!
As I've said, the "official" timing happens sometime in the evening (CET) and should yield better values.
@jaror's single-threaded version, the fastest one so far: 36s
```
% hyperfine -w 1 -r 5 './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     35.787 s ±  0.046 s    [User: 32.847 s, System: 2.822 s]
  Range (min … max):   35.736 s … 35.836 s    5 runs
```
@AndrasKovacs with the second fastest single-threaded version, by a small margin: 37s
```
% hyperfine -w 1 -r 5 './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     36.849 s ±  0.102 s    [User: 32.488 s, System: 1.886 s]
  Range (min … max):   36.704 s … 36.967 s    5 runs
```
Nice, I don't run out of memory with the mmap-ed version. I added a "short-string optimization" on top of it and got a very nice speedup from that. Now it's 23 seconds on my machine.
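For readers unfamiliar with the trick, a sketch of what a short-string optimization for table keys can look like (my own illustration, assuming station names contain no NUL bytes; not the actual code): keys up to 16 bytes are packed inline into two `Word64`s, so equality is two register compares instead of a pointer chase into the mmapped input.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import           Data.Word (Word64)

-- Keys of at most 16 bytes live inline in the entry; longer keys
-- fall back to a ByteString reference.
data Key
  = Short {-# UNPACK #-} !Word64 {-# UNPACK #-} !Word64
  | Long  !ByteString
  deriving Eq

toKey :: ByteString -> Key
toKey bs
  | BS.length bs <= 16 = Short (pack 0) (pack 8)
  | otherwise          = Long bs
  where
    -- Little-endian pack of up to 8 bytes starting at offset o;
    -- missing bytes become zero, which is unambiguous as long as
    -- the keys themselves contain no NUL bytes.
    pack o = BS.foldr (\b acc -> acc * 256 + fromIntegral b) 0
                      (BS.take 8 (BS.drop o bs))
```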
Here is a naive implementation of my left-child, right-sibling binary tree idea from above, in Fortran. On a 2.8GHz laptop, it completes 1B rows single-threaded in 2m8s: less than 4× slower than the best version reported by @ReleaseCandidate. Other than a custom float parser, it is currently naive. When I replace the current linear scan for siblings with an indexed array, I expect a 5-10× improvement. Once the concept is proven, I'll give it a shot in Haskell, but I think it's a good baseline for a structure I haven't seen exploited in any other solution.
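For reference, the left-child, right-sibling shape translates directly to Haskell; a naive sketch of the structure and the sibling scan (the scan is the part an indexed array would replace):

```haskell
import Data.Word (Word8)

-- Left-child/right-sibling encoding of a trie over key bytes:
-- `child` descends one byte deeper into the key, `sibling` links
-- the alternative bytes at the same depth.
data LCRS = Node
  { nodeByte :: !Word8
  , child    :: !(Maybe LCRS)
  , sibling  :: !(Maybe LCRS)
  }

-- Naive linear scan along the sibling chain for a byte.
findSibling :: Word8 -> Maybe LCRS -> Maybe LCRS
findSibling _ Nothing = Nothing
findSibling b (Just n)
  | nodeByte n == b = Just n
  | otherwise       = findSibling b (sibling n)
```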
Running this twice results in the second run only taking 15.407s on my machine. The first run took 18.160s.
This version now runs in 29s on my computer, 8s less than the previous version. And RAM usage went down to 60MB (-15GB).
```
% hyperfine -w 1 -r 5 './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     28.535 s ±  0.023 s    [User: 25.609 s, System: 2.818 s]
  Range (min … max):   28.505 s … 28.556 s    5 runs
```
I made some modest changes, namely adding a more "vectorized" newline search and doubling the table size. Now it's 22.5 secs for me, so I gained just 1.5 secs. I also experimented with other random changes, but apparently we're getting into territory where it's hard to predict the performance impact even by looking at the Core.
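The post doesn't show the search itself, but a common way to "vectorize" a newline scan without SIMD intrinsics is SWAR over 8 bytes at a time; a sketch:

```haskell
import Data.Bits (complement, xor, (.&.))
import Data.Word (Word64)

-- Does this 8-byte word contain a '\n' (0x0A) byte anywhere?
-- XOR turns matching bytes into zeros, then the classic zero-byte
-- test ((x - 0x01..01) .&. complement x .&. 0x80..80) flags them.
hasNewline :: Word64 -> Bool
hasNewline w = ((x - lows) .&. complement x .&. highs) /= 0
  where
    x     = w `xor` 0x0A0A0A0A0A0A0A0A
    lows  = 0x0101010101010101
    highs = 0x8080808080808080
```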
I find it very suspicious that we apparently get no benefit from fitting the hashtable into L2, and that instead the best choice is to make it large while still fitting in L3. Normally this would indicate that scanning during table insertion is expensive, but I see nothing especially expensive in the Core.