Is there a measurable difference on modern CPUs between `x * 6` and ``x `shiftL` 2 + x `shiftL` 1``?
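For concreteness, the two candidates as Haskell definitions (a minimal pair one could feed to a micro-benchmark; note that optimizing compilers frequently do this strength reduction on their own, so any difference has to be measured):

```haskell
import Data.Bits (shiftL)

-- 6*x == 4*x + 2*x, so the shift form decomposes the multiply.
timesSix :: Int -> Int
timesSix x = x * 6

timesSixShift :: Int -> Int
timesSixShift x = (x `shiftL` 2) + (x `shiftL` 1)
```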
Using the power-of-2 table shaved off a few seconds, but the rest didn't really change things.
It now takes only 33.088s.
I swapped out hashing and comparison. Hashing doesn't make a difference; comparison has a small benefit. I don't have enough RAM for 1 billion rows, so I'm testing with 100 million instead. It's 3.28s for me now.
Things we could try:
- Store the unique strings in a separate array. Currently our table keys are pointers into the huge input string, and I imagine they are randomly strewn across lots of different pages.
- Try to minimize table size. With 10k keys we should be able to fit into L2. Pack entries more tightly, use a higher load factor, but maybe also store some of the hash bits and compare hashes during scanning (which makes scanning faster, so we can afford longer scans); see the sketch below.
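A sketch of the hash-caching part of that second idea, assuming an open-addressing table whose entries we control (names are illustrative, not the actual code):

```haskell
import Data.ByteString (ByteString)

-- Running statistics per station: min, max, sum, count.
data Acc = Acc !Int !Int !Int !Int

-- Hypothetical entry layout: cache the full hash next to the key so a
-- probe can reject mismatches without touching the (cold) key bytes.
data Entry = Entry
  { entryHash :: !Int        -- cached hash of the key
  , entryKey  :: !ByteString -- pointer into the input (or a copy)
  , entryAcc  :: !Acc
  }

-- Compare cached hashes first; only do the byte-wise comparison
-- when they agree, which keeps longer scans affordable.
matches :: Int -> ByteString -> Entry -> Bool
matches h k e = entryHash e == h && entryKey e == k
```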
With `bytestring-mmap` I'm able to get the time down to 25.147s (I did have to fix a breaking change in that package to make it work with GHC 9.8).
Edit: I just found out that `mmap` is a more up-to-date package that should probably be used instead.
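For reference, mapping the whole file with the `mmap` package is essentially a one-liner; a minimal sketch (passing `Nothing` maps the entire file, and the resulting `ByteString` aliases the mapping, so pages are faulted in on demand rather than copied up front):

```haskell
import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

main :: IO ()
main = do
  input <- mmapFileByteString "measurements.txt" Nothing
  print (BS.length input)
```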
Edit: I've updated my gist with the 25 second version.
Edit: With your hash and equality it becomes 22.596s.
I've updated my gist to include multi-threading. I still see that it's about 2-3 times slower than the 7.c version. This time I ran it on 1B lines on a 32-core machine:
```
~/OneBRC $ `which time` --verbose `cabal exec which OneBRC` +RTS -N32 -qg >out.txt
	User time (seconds): 54.29
	System time (seconds): 4.99
	Percent of CPU this job got: 2159%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.74
~/OneBRC $ `which time` --verbose ./7c >7c-out.txt
	User time (seconds): 18.05
	System time (seconds): 0.72
	Percent of CPU this job got: 1187%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.58
```
There's another source file, `analyse.c`, which runs in about 0.8s (3.5 times faster), but it doesn't produce correct results.
The single-threaded version runs in 28s (but it's quite a fast machine):
```
~/OneBRC $ `which time` --verbose `cabal exec which OneBRC` +RTS -N1 >out.txt
	User time (seconds): 26.33
	System time (seconds): 1.75
	Percent of CPU this job got: 100%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:28.08
```
So more or less idiomatic Haskell is 2-3 times slower than C (it's possible to drop the custom hash table, use `unordered-containers`, and still be in the 2-4 times range; the only important optimization is using unboxed arrays for counters).
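For illustration, the unboxed counters might look like the following; a sketch with `Data.Array.IO`, using one flat `IOUArray` per statistic so updates mutate raw `Int` cells instead of allocating boxed values (names and layout are my own, not the benchmarked code):

```haskell
import Data.Array.IO (IOUArray, newArray, readArray, writeArray)

-- One flat unboxed array per statistic, indexed by station slot.
data Counters = Counters
  { cMin   :: !(IOUArray Int Int)
  , cMax   :: !(IOUArray Int Int)
  , cSum   :: !(IOUArray Int Int)
  , cCount :: !(IOUArray Int Int)
  }

newCounters :: Int -> IO Counters
newCounters n =
  Counters <$> newArray (0, n - 1) maxBound  -- min starts high
           <*> newArray (0, n - 1) minBound  -- max starts low
           <*> newArray (0, n - 1) 0
           <*> newArray (0, n - 1) 0

-- Fold one parsed temperature (in tenths of a degree) into slot i.
record :: Counters -> Int -> Int -> IO ()
record cs i t = do
  readArray (cMin cs) i >>= writeArray (cMin cs) i . min t
  readArray (cMax cs) i >>= writeArray (cMax cs) i . max t
  readArray (cSum cs) i >>= writeArray (cSum cs) i . (+ t)
  readArray (cCount cs) i >>= writeArray (cCount cs) i . (+ 1)
```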
Just found that the 1.53s number to beat was achieved on 8 cores:
```
~/OneBRC $ `which time` --verbose `cabal exec which OneBRC` +RTS -N8 >out.txt
	User time (seconds): 33.90
	System time (seconds): 2.50
	Percent of CPU this job got: 597%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.09
```
That's about a 4× difference, give or take (the machines are not the same). It cannot use the full 800% CPU, so it's probably I/O or memory bound (or maybe GHC-runtime bound).
I'm trying to get all versions running for comparison, and will post the results here. But that will take some time, as I also have to do "real work", which interferes too much with the benchmark results; running them alongside my work would not be a fair comparison.
Yes, I've seen the same thing with Bodigrim's [sorry, confused you] multithreaded version; it used only 6 (of 10) cores too.
For comparison, the fastest multithreaded Go version (that I have run), by Alexander Yastrebov, takes 4 seconds on my computer.
I ran all benchmarks 3 times; every benchmark consists of one warmup (not timed) and 5 timed runs. The fastest time of these 3 is what I've posted here, rounded to the nearest 1/2 second (in absolute terms). As a last step I diffed the output against the reference output.
Everything that doesn't need 9.8 runs on GHC 9.6.4 (the ghcup version); 9.8.2 for the ones that do.
I've set `-threaded -O2 -with-rtsopts=-N` and every other option needed by your executables.
Yours is the first one, because it doesn't need any additional fancy command line arguments to GHC.
It used 7 out of 10 cores and 333MB RAM at most. Time: 11.5s. The output is correct except for negative zeroes (`-0.0`) and a missing newline at the end.
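In case it helps, the negative zeroes are easy to squash before formatting; a minimal sketch relying on IEEE `-0.0 == 0.0`:

```haskell
-- (-0.0) == 0.0 holds for Double, so this maps -0.0 to 0.0 and
-- leaves every other value unchanged.
fixZero :: Double -> Double
fixZero x = if x == 0 then 0 else x
```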
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     11.321 s ±  0.062 s    [User: 85.505 s, System: 2.347 s]
  Range (min … max):   11.233 s … 11.385 s    5 runs
```
András Kovács' version: GHC 9.8.2 with the option `-fllvm` added to the default ones.
Does not work with `-fllvm` on my computer; the error is:

```
opt: Unknown command line argument '-enable-new-pm=0'. Try: 'opt --help'
opt: Did you mean '--enable-newgvn=0'?
```
This is with clang 17.0.6. Using Apple's default LLVM, it doesn't work either:

```
<no location info>: error:
    Warning: Couldn't figure out LLVM version!
             Make sure you have installed LLVM between [11 and 16)
```
While I can't say I appreciate that the macOS version of GHC doesn't run with the default LLVM, I do appreciate the use of a paren for the non-inclusive interval end.
So, installing LLVM 15 and using that (some minor tweaking later …), the executable throws a runtime error:

```
exe-exe: app/Main.hs:153:10-17: No instance nor default method for class operation setByteArray#
```
Jaror's version: GHC 9.8.2 with the option `-fllvm` added to the default ones.
Does not compile because of the missing `keepAliveUnlifted`.
I've removed that and updated my gist. It was kind of silly to use it anyway.
Thanks, it compiles now, but throws the same runtime error as Ondráš'.
Ah, primitive-0.9.0.0 changed that into a default method. You should be able to fix that error by changing the bounds to:

```
build-depends: base >= 4.19, bytestring, primitive >=0.9, mmap
```
Now both of your executables compile, and here are the first benchmarks (the "real", faster ones I'm going to post in the evening; right now there is a VM running, but there is still 16GB RAM free).
Your program is acting strangely: it needs a bit less RAM (55MB) than the last one I ran, but it runs for more than 10 minutes (600s); I stopped it after 10 minutes of runtime.
Kovács András' version: needs 14.9GB (the data file) of RAM, a single thread, and about 39s. This is the fastest single-threaded version by far, 2.4 times as fast as my version. The results are not sorted, so I can't (well, I don't want to) compare them with the reference, but the number of stations is correct and the values do look correct at first glance. There are also negative zeros (`-0.0`) in the results.
I see I reduced the hashtable size to 0x1000 while testing. That may have been too small and caused an infinite loop when trying to insert a new entry. I've expanded it again in my gist. Can you try again?
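For anyone following along, here is why a too-small open-addressing table hangs rather than crashing (an illustrative sketch, not the gist's actual code): with linear probing and no resize, inserting into a table whose slots are all taken by other keys scans forever for an empty slot.

```haskell
import Data.Bits ((.&.))

-- Linear probing over a power-of-two capacity. If all `cap` slots
-- are occupied by other keys, `go` cycles through them forever:
-- there is no empty slot to terminate the scan.
findSlot
  :: (Int -> IO Bool)  -- is this slot empty?
  -> (Int -> IO Bool)  -- does this slot hold the key we want?
  -> Int               -- capacity (a power of two)
  -> Int               -- hash of the key
  -> IO Int
findSlot isEmpty holdsKey cap h = go (h .&. (cap - 1))
  where
    go i = do
      empty <- isEmpty i
      if empty
        then pure i
        else do
          hit <- holdsKey i
          if hit then pure i else go ((i + 1) .&. (cap - 1))
```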
Of course. Way better now! 36s, and your version is now the fastest one (3s faster than the one by András)!
As I've said, the "official" timing happens sometime in the evening (CET) and should yield better values.
@jaror's single-threaded version, the fastest one so far: 36s
```
% hyperfine -w 1 -r 5 './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     35.787 s ±  0.046 s    [User: 32.847 s, System: 2.822 s]
  Range (min … max):   35.736 s … 35.836 s    5 runs
```
@AndrasKovacs with the second fastest single-threaded version, by a small margin: 37s
```
% hyperfine -w 1 -r 5 './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     36.849 s ±  0.102 s    [User: 32.488 s, System: 1.886 s]
  Range (min … max):   36.704 s … 36.967 s    5 runs
```
Nice, I don't run out of memory with the mmap-ed version. I added a "short-string optimization" on top of it and got a very nice speedup from that. Now it's 23 seconds on my machine.
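For readers unfamiliar with the trick, a sketch of what a short-string optimization for table keys can look like (my own illustration, assuming station names contain no NUL bytes; not the actual code): keys up to 16 bytes are packed inline into two `Word64`s, so equality is two register compares instead of a pointer chase into the mmapped input.

```haskell
import           Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import           Data.Word (Word64)

-- Keys of at most 16 bytes live inline in the entry; longer keys
-- fall back to a ByteString reference.
data Key
  = Short {-# UNPACK #-} !Word64 {-# UNPACK #-} !Word64
  | Long  !ByteString
  deriving Eq

toKey :: ByteString -> Key
toKey bs
  | BS.length bs <= 16 = Short (pack 0) (pack 8)
  | otherwise          = Long bs
  where
    -- Little-endian pack of up to 8 bytes starting at offset o;
    -- missing bytes become zero, which is unambiguous as long as
    -- the keys themselves contain no NUL bytes.
    pack o = BS.foldr (\b acc -> acc * 256 + fromIntegral b) 0
                      (BS.take 8 (BS.drop o bs))
```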
Here is a naive implementation of my left-child, right-sibling binary tree idea from above, in Fortran. On a 2.8GHz laptop, it completes 1B rows single-threaded in 2m8s: less than 4× slower than the best version reported by @ReleaseCandidate. Other than a custom float parser, it is currently naive. When I replace the current linear scan for siblings with an indexed array, I expect a 5-10× improvement. Once the concept is proven, I'll give it a shot in Haskell, but I think it's a good baseline for a structure I haven't seen exploited in any other solution.
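For reference, the left-child, right-sibling shape translates directly to Haskell; a naive sketch of the structure and the sibling scan (the scan is the part an indexed array would replace):

```haskell
import Data.Word (Word8)

-- Left-child/right-sibling encoding of a trie over key bytes:
-- `child` descends one byte deeper into the key, `sibling` links
-- the alternative bytes at the same depth.
data LCRS = Node
  { nodeByte :: !Word8
  , child    :: !(Maybe LCRS)
  , sibling  :: !(Maybe LCRS)
  }

-- Naive linear scan along the sibling chain for a byte.
findSibling :: Word8 -> Maybe LCRS -> Maybe LCRS
findSibling _ Nothing = Nothing
findSibling b (Just n)
  | nodeByte n == b = Just n
  | otherwise       = findSibling b (sibling n)
```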
Running this twice results in the second run only taking 15.407s on my machine. The first run took 18.160s.
This version now runs in 29s on my computer, 8s less than the previous version. And RAM usage went down to 60MB (-15GB).
```
% hyperfine -w 1 -r 5 './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     28.535 s ±  0.023 s    [User: 25.609 s, System: 2.818 s]
  Range (min … max):   28.505 s … 28.556 s    5 runs
```
I made some modest changes, namely adding a more "vectorized" newline search and doubling the table size. Now it's 22.5 secs for me, so I gained just 1.5 secs. I also experimented with other random changes, but apparently we're getting into territory where it's hard to predict the performance impact even by looking at the Core.
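The post doesn't show the search itself, but a common way to "vectorize" a newline scan without SIMD intrinsics is SWAR over 8 bytes at a time; a sketch:

```haskell
import Data.Bits (complement, xor, (.&.))
import Data.Word (Word64)

-- Does this 8-byte word contain a '\n' (0x0A) byte anywhere?
-- XOR turns matching bytes into zeros, then the classic zero-byte
-- test ((x - 0x01..01) .&. complement x .&. 0x80..80) flags them.
hasNewline :: Word64 -> Bool
hasNewline w = ((x - lows) .&. complement x .&. highs) /= 0
  where
    x     = w `xor` 0x0A0A0A0A0A0A0A0A
    lows  = 0x0101010101010101
    highs = 0x8080808080808080
```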
I find it very suspicious that we apparently get no benefit from fitting the hashtable into L2, and that instead the best choice is to make it large while still fitting in L3. Normally this would indicate that scanning during table insertion is expensive, but I see nothing especially expensive in the Core.