You are parsing the line twice. The most important thing is to never touch a byte twice, even if it is in the CPU’s cache. But tomorrow (well, almost today) I’ll adapt your version. When I tried it, unsafeIndex was surprisingly not faster than uncons and pattern matching the result. I’m curious whether your solution will help.
re: Not hitting Go - is it a GC thing?
Any large boxed dataset is gonna crater throughput with the default GC, unless it’s put in a compact region. (I didn’t parse the recent code to see if this is the case.)
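For reference, a minimal sketch of what moving the final boxed results into a compact region could look like, using the ghc-compact package; the map type here is just a stand-in for whatever the program actually accumulates:

import GHC.Compact (compact, getCompact)
import qualified Data.Map.Strict as M

-- Move a boxed structure into a compact region; the GC then treats the whole
-- region as one object and no longer traces its interior pointers.
freezeResults :: M.Map String (Double, Double) -> IO (M.Map String (Double, Double))
freezeResults m = getCompact <$> compact m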
Last night, before falling asleep, I thought I should add this calculation (ignoring 1000 vs. 1024 differences) instead of always repeating “don’t touch the data in the input more than once”:
Let’s say your CPU runs at 3 GHz; then a single cycle takes 1/3 of a nanosecond. Access to the level 1 cache of this CPU takes about 3 cycles (AFAIK this is about correct), so one nanosecond. So if you access one byte of the input in every row, you need 1ns * 1,000,000,000 = 1 second. My data file has an average line length of 14.9 bytes (the size is 14.9GB), so if I accessed every byte of the file once, that would take 15s. If you look at the results, you can see that accessing every byte in the input only once instead of twice (although I didn’t parse everything twice) really gained me 13s.
Of course the same holds for any other operation in every row, so even a single integer add of 1 cycle adds 333ms to the runtime. That’s what I like about this challenge: it takes the processing of the input data out of the nanosecond realm and makes it “visible”.
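Restating that as a tiny back-of-envelope calculation (the names are purely illustrative, the numbers are the ones above: 3 GHz, ~3 cycles per L1 access, 10^9 rows, 14.9 bytes per line on average):

cycleNs, l1AccessNs :: Double
cycleNs    = 1 / 3          -- one cycle at 3 GHz, in nanoseconds
l1AccessNs = 3 * cycleNs    -- roughly 1 ns per L1 cache access

rows, avgLineBytes :: Double
rows         = 1e9
avgLineBytes = 14.9

oneAccessPerRow, oneAccessPerByte :: Double  -- in seconds
oneAccessPerRow  = rows * l1AccessNs / 1e9                 -- ~1 s
oneAccessPerByte = rows * avgLineBytes * l1AccessNs / 1e9  -- ~15 s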
The data is in one unboxed array of bytes, 4 unboxed 64-bit integer arrays and one unboxed 64-bit integer in a hashmap (same as in Haskell). And none of this changes with the architecture the executable is built for. So this is another “proof” that this whole exercise is just about memory access (of the data and in the hashmap).
Your executable uses 6 cores (of 10) and a maximum of 26GB of RAM (most of the time 7GB; the 26GB is during the summing of the results, I guess). It takes 5s (5%) more than my (single-threaded) version, at 105s. It is about 2.4 times faster than the last (single-threaded) version posted in this thread by Jaror, at 240s.
Your output contains negative zeroes (-0.0) and is missing a newline at the end of the file. Apart from that it is the same as the reference output.
stack install --local-bin-path ./ && hyperfine -w 1 -r 5 './exe-exe ../measurements_big.txt > solution_big.txt'
Benchmark 1: ./exe-exe ../measurements_big.txt > solution_big.txt
Time (mean ± σ): 104.907 s ± 1.791 s [User: 613.344 s, System: 19.411 s]
Range (min … max): 101.754 s … 106.213 s 5 runs
The config:
ghcup GHC 9.8.2
dependencies:
- base >= 4.19 && < 5
ghc-options:
- -Wall
- -Wcompat
- -Widentities
- -Wincomplete-record-updates
- -Wincomplete-uni-patterns
- -Wmissing-export-lists
- -Wmissing-home-modules
- -Wpartial-fields
- -Wredundant-constraints
- -XGHC2021
- -O2
- -threaded
- -rtsopts
- -with-rtsopts=-N
executables:
  exe-exe:
    main: Main.hs
    source-dirs: app
    dependencies:
      - flatparse
      - text
      - bytestring
      - containers
      - unordered-containers
      - mmap
      - array
      - linear
      - vector
      - vector-hashtables
resolver: nightly-2024-03-12
system-ghc: true
There was a similar optimisation thread here which ended up birthing two new hashmap packages
Interestingly, even when adapting your solution to my program like below, unsafeIndex is still slower (by about 2s or 2%) than uncons, like in the second version:
unsafeIndex
let x1 = BSU.unsafeIndex rest2 0
let x2 = BSU.unsafeIndex rest2 1
let x3 = BSU.unsafeIndex rest2 2
let x4 = BSU.unsafeIndex rest2 3
let x5 = BSU.unsafeIndex rest2 4
let (temp, rest) = case x1 of
-- ord '-' == 45 => -N.N or -NN.N
W8# 45#Word8 -> case x3 of
-- ord '.' == 46 => -N.N
W8# 46#Word8 -> (528 - fromIntegral x2 * 10 - fromIntegral x4, BS.drop 4 rest2)
-- -NN.N
_ -> (5328 - fromIntegral x2 * 100 - fromIntegral x3 * 10 - fromIntegral x5, BS.drop 5 rest2)
-- N.N or NN.N
_ -> case x2 of
-- ord '.' == 46 => N.N
W8# 46#Word8 -> (fromIntegral x1 * 10 + fromIntegral x3 - 528, BS.drop 3 rest2)
-- NN.N
_ -> (fromIntegral x1 * 100 + fromIntegral x2 * 10 + fromIntegral x4 - 5328, BS.drop 4 rest2)
hyperfine -w 1 -r 5 './exe-exe ../measurements.txt > solution.txt'
Benchmark 1: ./exe-exe ../measurements.txt > solution.txt
Time (mean ± σ): 99.749 s ± 0.512 s [User: 90.609 s, System: 20.358 s]
Range (min … max): 99.099 s … 100.309 s 5 runs
uncons
{-# INLINE parseNegTemp #-}
parseNegTemp :: BS.ByteString -> (Int, BS.ByteString)
parseNegTemp rest = case BS.uncons rest of
Nothing -> (0, rest)
Just (ch1, rest3) -> case BS.uncons rest3 of
Nothing -> (0, rest3)
Just ('.', rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
Just (ch2, rest5) -> (528 - 10 * ord ch1 - ord ch2, rest5)
Just (ch2, rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
-- Must be '.'
Just (_, rest5) -> case BS.uncons rest5 of
Nothing -> (0, rest5)
Just (ch3, rest6) -> (5328 - 100 * ord ch1 - 10 * ord ch2 - ord ch3, rest6)
{-# INLINE parseTemp #-}
parseTemp :: Char -> BS.ByteString -> (Int, BS.ByteString)
parseTemp ch1 rest = case BS.uncons rest of
Nothing -> (0, rest)
Just ('.', rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
Just (ch2, rest5) -> (10 * ord ch1 + ord ch2 - 528, rest5)
Just (ch2, rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
-- Must be '.'
Just (_, rest5) -> case BS.uncons rest5 of
Nothing -> (0, rest5)
Just (ch3, rest6) -> (100 * ord ch1 + 10 * ord ch2 + ord ch3 - 5328, rest6)
hyperfine -w 1 -r 5 './exe-exe ../measurements.txt > solution.txt'
Benchmark 1: ./exe-exe ../measurements.txt > solution.txt
Time (mean ± σ): 97.509 s ± 0.695 s [User: 88.850 s, System: 17.293 s]
Range (min … max): 96.692 s … 98.301 s 5 runs
Interesting, thanks, will take a look!
I can’t resist adding my attempt.
It’s about 2 times slower than the single-threaded C version from One Billion Rows Challenge in C - Danny van Kooten. Surprisingly, a no-brainer unordered-containers HashMap version was only 2.5-3 times slower.
The main idea is to keep the counters in unboxed arrays to remove allocations, and to use pattern synonyms for uncons to get list-like pattern matching that runs on bytestrings with almost no allocations (GHC is smart enough not to create intermediate bytestrings).
I have used a similar idea before in https://hackage.haskell.org/package/css-syntax-0.1.0.1/docs/src/Data.CSS.Syntax.Tokens.html and think it looks quite nice while being performant (it still requires careful inlining and occasional bang patterns though).
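For readers who haven’t seen the trick: a minimal sketch (not the code from the linked attempt) of what such pattern synonyms over uncons might look like:

{-# LANGUAGE PatternSynonyms, ViewPatterns #-}
import qualified Data.ByteString.Char8 as B

-- Unidirectional pattern synonyms so a ByteString can be matched like a list.
pattern (:.) :: Char -> B.ByteString -> B.ByteString
pattern c :. rest <- (B.uncons -> Just (c, rest))

pattern Empty :: B.ByteString
pattern Empty <- (B.null -> True)

{-# COMPLETE (:.), Empty #-}

-- Example use: drop everything up to and including the next ';'.
dropToSemi :: B.ByteString -> B.ByteString
dropToSemi (';' :. rest) = rest
dropToSemi (_   :. rest) = dropToSemi rest
dropToSemi Empty         = B.empty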
Say we wrote a B.uncons byte-by-byte fold which builds a trie (prefix tree) and does the calculation at the leaves. Would I be wrong in stating that it could combine parsing, grouping and calculating in a single fold? I.e. nothing would need to be done “twice” and the whole thing could remain efficiently immutable?
I’d be very happy if something like that could at least perform on par with the flatparse solution.
The problem alphabet is 54 characters, so a simple time-efficient but space-inefficient version could use a vector of length 54 to key the leaves directly in O(1). insertWith could then be done in O(k), where k is the key length, which I’d suspect would be close to an O(1) hash table given that hashing is O(k).
What do you think?
EDIT: Argh, but the vectors at the nodes will have to be mutated to modify the trie, so it won’t be immutable. Maybe that’s OK, but another route would be to use bits instead of bytes as node keys; then each node is binary, but I don’t exactly know what that would mean for insertWith time complexity: naively O(8k)? I assume there are smarter bit-level schemes.
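To make the immutability question concrete, here is a minimal sketch of the purely functional route: a byte-keyed trie with a path-copying insertWith, where toIx (the mapping of the 54-character alphabet onto 0..53) is only a placeholder:

import qualified Data.ByteString as B
import qualified Data.Vector as V
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

data Trie a = Node !(Maybe a) !(V.Vector (Maybe (Trie a)))

emptyTrie :: Trie a
emptyTrie = Node Nothing (V.replicate 54 Nothing)

-- Path copying: only the nodes along the key are rebuilt, so the structure
-- stays immutable; each step copies one 54-slot vector.
insertWith :: (a -> a -> a) -> B.ByteString -> a -> Trie a -> Trie a
insertWith f key v = go 0
  where
    go i (Node mv kids)
      | i >= B.length key = Node (Just (maybe v (f v) mv)) kids
      | otherwise =
          let ix    = toIx (B.index key i)
              child = fromMaybe emptyTrie (kids V.! ix)
          in Node mv (kids V.// [(ix, Just (go (i + 1) child))])

-- Placeholder for the real alphabet mapping; not a correct implementation.
toIx :: Word8 -> Int
toIx w = fromIntegral w `mod` 54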
This version needs 14.9GB and runs in 103s on my computer, which is a bit slower (5s) than mine, as fast as Bodigrim’s multi-threaded version, and 2.4 times as fast as Jaror’s last version. It is about twice the time of a fast single-threaded Go version (mine), but should not be twice the time of a performant C version. Sadly I cannot compare to the C version, as it uses a Linux-specific header.
The output is correct.
I’ve just read the comments
-- 13s->10s (for 100M rows) from !n, less allocs from !xs'
-- 10s->1.7s when no HashMap is involved
-- 10s->7s HashMap -> linear scan table (unboxed-containers are quite fast)
As this benchmark is mostly about memory access, the input data size (whether and what fits in your CPU’s caches) is significant for the comparison of results.
It’s that 2^16 allocation for the hash table, presumably so the rest of the hash-table code could be simpler? As noted above, maybe a trie could result in similar performance whilst keeping everything immutable?
The 14.9GB are the file in memory, which is fine. The data itself is only 10,000 stations (actually 8900 in my data file), which is updated a billion times.
Lol, yes quite right, 2^16 is just 65536. I suspect @jaror’s version will run 20-40% faster as a strict bytestring version. Mine did in early experiments. It’s all still at least 2 orders of magnitude from the faster C version, so I’m not sure if the fiddling is worthwhile at this point – at least not for performance reasons.
What really worries me is that using mmap didn’t yield any speedups when I tried it.
Yeah, I tried that too – the bytestring author has an mmap package which makes it super easy – but it made no difference.
Yes, that’s the one I tried too.
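For context, this is roughly all there is to it with the mmap package from the dependency list above (a sketch, not the exact code from either program):

import qualified Data.ByteString as BS
import System.IO.MMap (mmapFileByteString)

-- Map the whole file into memory and expose it as a strict ByteString;
-- Nothing means no offset/length restriction, i.e. the entire file.
readInput :: FilePath -> IO BS.ByteString
readInput path = mmapFileByteString path Nothing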
Btw, I’ve added the Haskell version and data to my GitHub project.
Yeah, vector-hashtables is rather poorly optimized. We’d want a custom open addressing hash table here that’s implemented as a single mutable bytearray, where keys and values are all unboxed.
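As a rough sketch of that idea (assuming a fixed power-of-two capacity, Int hashes and Int values, and 0 as the “empty” marker, which are all simplifications):

import Control.Monad.ST (ST)
import Data.Bits ((.&.))
import Data.Primitive.PrimArray

tableSize :: Int
tableSize = 65536  -- power of two, so the slot index is a bit mask, not `rem`

-- One flat unboxed array; slot i uses index 2*i for the key hash and 2*i+1
-- for the value, so both live unboxed in a single MutablePrimArray.
newTable :: ST s (MutablePrimArray s Int)
newTable = do
  arr <- newPrimArray (2 * tableSize)
  setPrimArray arr 0 (2 * tableSize) 0
  pure arr

-- Insert-or-update with linear probing; `combine` merges old and new values.
upsert :: MutablePrimArray s Int -> Int -> Int -> (Int -> Int -> Int) -> ST s ()
upsert arr h v combine = go (h .&. (tableSize - 1))
  where
    go i = do
      k <- readPrimArray arr (2 * i)
      if k == 0
        then do                         -- free slot: claim it
          writePrimArray arr (2 * i) h
          writePrimArray arr (2 * i + 1) v
        else if k == h
          then do                       -- existing entry: merge values
            old <- readPrimArray arr (2 * i + 1)
            writePrimArray arr (2 * i + 1) (combine old v)
          else go ((i + 1) .&. (tableSize - 1))   -- probe the next slot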
but vector-hashtables is getting better by the minute: More bangs by Bodigrim · Pull Request #24 · klapaucius/vector-hashtables · GitHub
It’s a good reminder that threads like these often have tangible carry-over to other things.
I’ve written a solution with an open addressed unboxed hash table based on @Bodigrim’s gist: Based on @bodigrim's solution, but with open addressed primitive hash table · GitHub
It gets a single-threaded performance of 44.721s on my machine. But oddly enough it is still allocating 3,780,154,668 bytes per MUT second, so there must still be something to improve.
Edit: found the culprit: C8.lines doesn’t fuse, so it was constructing bytestrings and cons cells for every row. Now it only takes 38.734s, with an allocation rate of 362,770,267 bytes per MUT second.
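For comparison, a sketch of the kind of allocation-light line traversal one could use instead of C8.lines (this is not the code from the gist, just the general shape):

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as C8
import qualified Data.ByteString.Unsafe as BSU

-- Fold over '\n'-separated rows without building a list: find the next
-- newline, hand the row to `f`, and continue on an unsafeDrop of the rest.
foldLines :: (acc -> C8.ByteString -> acc) -> acc -> C8.ByteString -> acc
foldLines f = go
  where
    go !acc bs
      | C8.null bs = acc
      | otherwise  = case C8.elemIndex '\n' bs of
          Nothing -> f acc bs
          Just i  -> go (f acc (BSU.unsafeTake i bs)) (BSU.unsafeDrop (i + 1) bs)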
My profiling shows 74.3% of the remaining allocations are due to hashing the pointers. That seems suboptimal too.
Here are the full profiling results:
Wed Mar 13 22:19 2024 Time and Allocation Profiling Report (Final)
cabal-script-1brc.hs +RTS -p -RTS
total time = 47.73 secs (47728 ticks @ 1000 us, 1 processor)
total alloc = 53,806,642,352 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
$wmain Main <no location info> 63.6 25.6
hashPtr Data.Hashable.Class src/Data/Hashable/Class.hs:776:1-49 19.0 44.6
hashPtrWithSalt Data.Hashable.LowLevel src/Data/Hashable/LowLevel.hs:(113,1)-(115,23) 13.4 29.7
$wcompareBytes Main /tmp/1brc/1brc.hs:42:1-12 3.9 0.0
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 233 0 0.0 0.0 100.0 100.0
main Main /tmp/1brc/1brc.hs:246:1-4 466 1 0.0 0.0 100.0 100.0
main2 Main <no location info> 467 1 0.0 0.0 100.0 100.0
main1 Main <no location info> 468 1 0.0 0.0 100.0 100.0
$wmain Main <no location info> 469 1 63.6 25.6 100.0 100.0
hashPtr Data.Hashable.Class src/Data/Hashable/Class.hs:776:1-49 476 1000000000 19.0 44.6 32.5 74.3
hashPtrWithSalt Data.Hashable.LowLevel src/Data/Hashable/LowLevel.hs:(113,1)-(115,23) 477 1000000000 13.4 29.7 13.4 29.7
$wcompareBytes Main /tmp/1brc/1brc.hs:42:1-12 478 999999587 3.9 0.0 3.9 0.0
lvl8 Main <no location info> 479 413 0.0 0.0 0.0 0.0
lvl2 Main <no location info> 481 1239 0.0 0.0 0.0 0.0
$j Main <no location info> 482 413 0.0 0.0 0.0 0.0
$fPrimRow_$csetByteArray# Main /tmp/1brc/1brc.hs:123:10-17 471 1 0.0 0.0 0.0 0.0
defaultSetByteArray# Data.Primitive.Types Data/Primitive/Types.hs:(252,1)-(257,11) 472 1 0.0 0.0 0.0 0.0
$fPrimRow_$cwriteByteArray# Main /tmp/1brc/1brc.hs:142:3-17 474 10000 0.0 0.0 0.0 0.0
CAF Main <entire-module> 465 0 0.0 0.0 0.0 0.0
$fPrimRow Main /tmp/1brc/1brc.hs:123:10-17 473 1 0.0 0.0 0.0 0.0
lvl1 Main <no location info> 480 1 0.0 0.0 0.0 0.0
lvl11 Main <no location info> 470 1 0.0 0.0 0.0 0.0
lvl9 Main <no location info> 475 1 0.0 0.0 0.0 0.0
CAF Data.ByteString.Char8 <entire-module> 421 0 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 361 0 0.0 0.0 0.0 0.0
CAF GHC.Float <entire-module> 353 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 344 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 342 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 334 0 0.0 0.0 0.0 0.0
CAF GHC.Weak.Finalize <entire-module> 302 0 0.0 0.0 0.0 0.0
CAF Text.Printf <entire-module> 294 0 0.0 0.0 0.0 0.0
CAF GHC.IO.FD <entire-module> 266 0 0.0 0.0 0.0 0.0
C8.lines returns a list, so it allocates a lot. It would be better to parse the lines yourself. The rem function is also quite expensive; you could try a power-of-2 table instead. Could try -fllvm as well. I didn’t run the benchmark, but looking at the assembly there’s a lot of *6, which GHC compiles to an actual multiplication, but LLVM does not.
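On the rem point, the usual trick (assuming the table size is a power of two) looks like this:

import Data.Bits ((.&.))

-- Both compute a slot index from a non-negative hash, but the mask version is
-- a single AND instruction; it is only valid when `size` is a power of two.
slotRem, slotMask :: Int -> Int -> Int
slotRem  size h = h `rem` size
slotMask size h = h .&. (size - 1)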