One Billion Row challenge in Hs

Parsing is not the bottleneck. This only improves my implementation to 59.2 seconds for 1 billion rows. Although it might make it easier to spot improvement opportunities in the Core.

You are parsing the line twice. Most important is to never touch a byte twice, even if it is in the CPUā€™s cache. But tomorrow (well, almost today) Iā€™ll adapt your version. When I had been trying, unsafeIndex was surprisingly not faster than uncons and pattern matching the result. Ivm curiois, whether your solution will help.

1 Like

re: Not hitting Go - is it a GC thing?

Any large boxed dataset is gonna crater throughput with the default GC. Unless put in a compact region.

(I didnā€™t parse the recent code to see if this is the case.)

Yesterday night, before falling asleep I thought I should add this calculation (ignoring 1000 vs. 1024 differences) instead of always repeating ā€œdonā€™t touch the data in the input more than onceā€:
Letā€™s say your CPU runs at 3GHZ, then a single cycle takes 1/3 nanosecond. Access to the level 1 cache of this CPU is about 3 cycles (AFAIK this is about correct), so one nanosecond. So if you access a byte of the input in every row, you need 1ns * 1,000,000,000 = 1 second. My data file has an average line length of 14.9 bytes (the size is 14.9GB), so if I would access every byte of the file once that would takĆ© 15s. If you look at the results, you can see that really accessing every byte in the input only once instead of twice (although I didnā€™t parse everything twice) gained mĆ© 13s.

Of course the samĆ© holds for any other op in every row, so even a single integer add of 1 cycle adds 333ms to the runtime. Thatā€™s what I like about that challenge, it un-nanos the processing of the input data and makes the processing ā€œvisibleā€.

The data is in one unboxed array of bytes, 4 unboxed 64 bit integer arrays and one unboxed 64 bit integer in a hashmap (same as in Haskell). And nothing of this changes with the architectĆŗre the executable is built for. So this is another ā€œproofā€ that this whole exercise is just about memory access (of the data and in the hashmap).

1 Like

Your executable uses 6 Cores (of 10) and a max. of 26GB of RAM (most of the time 7GB, the 26GB is the summing of the results, I guess). It takes 5s (5%) more than my (single threaded) version at 105s. It is about 2.4 times faster than the last (single thread) version posted in this thread of Jaror, with 240s.

Your output contains negative zeroes -0.0 and is missing a newline at the end of the file. Apart from that it is the same as the reference output.

stack install --local-bin-path ./ && hyperfine -w 1 -r 5  './exe-exe >  solution.txt'
Benchmark 1: ./exe-exe ../measurements_big.txt >  solution_big.txt
  Time (mean Ā± Ļƒ):     104.907 s Ā±  1.791 s    [User: 613.344 s, System: 19.411 s]
  Range (min ā€¦ max):   101.754 s ā€¦ 106.213 s    5 runs

The config:

ghcup GHC 9.8.2

dependencies:
  - base >= 4.19 && < 5

ghc-options:
  - -Wall
  - -Wcompat
  - -Widentities
  - -Wincomplete-record-updates
  - -Wincomplete-uni-patterns
  - -Wmissing-export-lists
  - -Wmissing-home-modules
  - -Wpartial-fields
  - -Wredundant-constraints
  - -XGHC2021
  - -O2
  - -threaded
  - -rtsopts
  - -with-rtsopts=-N

executables:
  exe-exe:
    main: Main.hs
    source-dirs: app
    dependencies:
      - flatparse
      - text
      - bytestring
      - containers
      - unordered-containers
      - mmap
      - array
      - linear
      - vector
      - vector-hashtables
resolver: nightly-2024-03-12
system-ghc: true

There was a similar optimisation thread here which ended up birthing two new hashmap packages :slight_smile:

3 Likes

Interestingly, even when adapting your solution to my program like below, unsafeIndex still is slower (by about 2s or 2%) than uncons like in the second version:

unsafeIndex

  let x1 = BSU.unsafeIndex rest2 0
  let x2 = BSU.unsafeIndex rest2 1
  let x3 = BSU.unsafeIndex rest2 2
  let x4 = BSU.unsafeIndex rest2 3
  let x5 = BSU.unsafeIndex rest2 4
  let (temp, rest) = case x1 of
        -- ord '-' == 45 => -N.N or -NN.N
        W8# 45#Word8 -> case x3 of
          -- ord '.' == 46 => -N.N
          W8# 46#Word8 -> (528 - fromIntegral x2 * 10 - fromIntegral x4, BS.drop 4 rest2)
          -- -NN.N
          _ -> (5328 - fromIntegral x2 * 100 - fromIntegral x3 * 10 - fromIntegral x5, BS.drop 5 rest2)
        -- N.N or NN.N
        _ -> case x2 of
          -- ord '.' == 46 => N.N
          W8# 46#Word8 -> (fromIntegral x1 * 10 + fromIntegral x3 - 528, BS.drop 3 rest2)
          -- NN.N
          _ -> (fromIntegral x1 * 100 + fromIntegral x2 * 10 + fromIntegral x4 - 5328, BS.drop 4 rest2)
hyperfine -w 1 -r 5  './exe-exe ../measurements.txt > solution.txt' 
Benchmark 1: ./exe-exe ../measurements.txt > solution.txt
  Time (mean Ā± Ļƒ):     99.749 s Ā±  0.512 s    [User: 90.609 s, System: 20.358 s]
  Range (min ā€¦ max):   99.099 s ā€¦ 100.309 s    5 runs

uncons

{-# INLINE parseNegTemp #-}
parseNegTemp :: BS.ByteString -> (Int, BS.ByteString)
parseNegTemp rest = case BS.uncons rest of
  Nothing -> (0, rest)
  Just (ch1, rest3) -> case BS.uncons rest3 of
    Nothing -> (0, rest3)
    Just ('.', rest4) -> case BS.uncons rest4 of
      Nothing -> (0, rest4)
      Just (ch2, rest5) -> (528 - 10 * ord ch1 - ord ch2, rest5)
    Just (ch2, rest4) -> case BS.uncons rest4 of
      Nothing -> (0, rest4)
      -- Must be '.'
      Just (_, rest5) -> case BS.uncons rest5 of
        Nothing -> (0, rest5)
        Just (ch3, rest6) -> (5328 - 100 * ord ch1 - 10 * ord ch2 - ord ch3, rest6)

{-# INLINE parseTemp #-}
parseTemp :: Char -> BS.ByteString -> (Int, BS.ByteString)
parseTemp ch1 rest = case BS.uncons rest of
  Nothing -> (0, rest)
  Just ('.', rest4) -> case BS.uncons rest4 of
    Nothing -> (0, rest4)
    Just (ch2, rest5) -> (10 * ord ch1 + ord ch2 - 528, rest5)
  Just (ch2, rest4) -> case BS.uncons rest4 of
    Nothing -> (0, rest4)
    -- Must be '.'
    Just (_, rest5) -> case BS.uncons rest5 of
      Nothing -> (0, rest5)
      Just (ch3, rest6) -> (100 * ord ch1 + 10 * ord ch2 + ord ch3 - 5328, rest6)
hyperfine -w 1 -r 5  './exe-exe ../measurements.txt > solution.txt' 
Benchmark 1: ./exe-exe ../measurements.txt > solution.txt
  Time (mean Ā± Ļƒ):     97.509 s Ā±  0.695 s    [User: 88.850 s, System: 17.293 s]
  Range (min ā€¦ max):   96.692 s ā€¦ 98.301 s    5 runs

Interesting, thanks, will take a look!

1 Like

Cannot resist to add my attempt

Itā€™s about 2 times slower than the single threaded C version from One Billion Rows Challenge in C - Danny van Kooten. Surprisingly, a no-brainer unordered-containers HashMap version was only 2.5-3 times slower.

The main idea is to have counters in unboxed arrays to remove allocations, and to use pattern synonyms for uncons to have a list-like pattern matching that runs on bytestrings with almost no allocations (GHC is smart enough not to create intermediate bytestrings).

I have used a similar idea before in https://hackage.haskell.org/package/css-syntax-0.1.0.1/docs/src/Data.CSS.Syntax.Tokens.html and think it looks quite nice while being performant (it still requires careful inlining and occasional bang patterns though).

11 Likes

Say we wrote a B.uncons byte by byte based fold which builds a trie (prefix tree) and does the calculation at the leafs. Would I be wrong in stating that it could combine parsing, grouping and calculating in a single fold? I.e. nothing would need to be done ā€œtwiceā€ and the whole thing could remain efficiently immutable?

Iā€™d be very happy if something like that could at least perform on par with the flatparse solution.

The problem alphabet is 54 characters, so a simple time efficient but space inefficient version could use a vector of length 54 to key leafs directly in O(1). insertWith could then be done in O(k) where k is the key length, which Iā€™d suspect would be close to an O(1) hash table given that hashing is O(k).

What do you think?

EDIT: Argh, but the vectors at the nodes will have to be mutated to modify the Trie, so it wonā€™t be immutable :frowning: Maybe thatā€™s ok, but another route would be to use bits instead of bytes as keys as nodes, then each node is binary but I donā€™t exactly know what that would mean for insertWith time complexity: naively O(8k)? I assume there are smarter bit level schemes.

EDIT2: Left-child right-sibling binary tree - Wikipedia

This version needs 14.9GB, and runs in 103s on my computer, which is a bit slower (5s) than mine and as fast as Bodigrims multi-threaded version and 2.4times as fast as Jarorā€™s last version. It is about twice the time of a fast single threaded Go version (mine), but should not be twice the time of a performant C version. Sadly I cannot compare to the C version, as it uses a Linux-specifix header.

The output is correct.

Iā€™ve just read the comments

-- 13s->10s (for 100M rows) from !n, less allocs from !xs'
-- 10s->1.7s when no HashMap is involved
-- 10s->7s   HashMap -> linear scan table (unboxed-containers are quite fast)

As this benchmark is mostly about memory access, the input data size (whether and what fits in your CPUā€™s Caches) is significant for the comparison of results.

1 Like

Its that 2^16 allocation for the hash-table presumably so the rest of hash table code could be simpler? As noted above, maybe a Trie could result in similar performance whilst keeping everything immutable?

The 14.9GB are the file in memory, which is fine. The data itself is only 10,000 stations (actually 8900 in my data file), which is updated a billion times.

1 Like

Lol, yes quite right, 2^16 is just 65536. I suspect @jaror version will run 20-40% faster as a strict bytestring version. Mine did in early experiments. Its all still at least 2 orders of magnitude from the faster C version, so Iā€™m not sure if the fiddling is worthwhile at this point ā€“ at least not for performance reasons.

What really worries me is that using mmap didnā€™t yield any speedups when I tried it.

yeah i tried that too ā€“ the bytestring author has a mmap package which makes it super easy ā€“ but it made no difference.

Yes, thatā€™s the one I tried too.

Btw. Iā€™ve added the Haskell version and data to my Github project.

Yeah, vector-hashtables is rather poorly optimized. Weā€™d want a custom open addressing hash table here thatā€™s implemented as a single mutable bytearray, where keys and values are all unboxed.

1 Like

but getting better by the minute More bangs by Bodigrim Ā· Pull Request #24 Ā· klapaucius/vector-hashtables Ā· GitHub :smile:

3 Likes

Its a good reminder that threads like these often have tangible carry to other things.

Iā€™ve written a solution with an open addressed unboxed hash table based on @Bodigrimā€™s gist: Based on @bodigrim's solution, but with open addressed primitive hash table Ā· GitHub

It gets a single threaded performance of 44.721s on my machine. But oddly enough it is still allocating 3,780,154,668 bytes per MUT second, so there must still be something to improve.

Edit: found the culprit: C8.lines doesnā€™t fuse, so it was constructing bytestrings and cons cells for every row. Now it only takes 38.734s and an allocation rate of 362,770,267 bytes per MUT second.

My profiling shows 74.3% of the remaining allocations are due to hashing the pointers. That seems suboptimal too.

Here are the full profiling results:
	Wed Mar 13 22:19 2024 Time and Allocation Profiling Report  (Final)

	   cabal-script-1brc.hs +RTS -p -RTS

	total time  =       47.73 secs   (47728 ticks @ 1000 us, 1 processor)
	total alloc = 53,806,642,352 bytes  (excludes profiling overheads)

COST CENTRE     MODULE                 SRC                                             %time %alloc

$wmain          Main                   <no location info>                               63.6   25.6
hashPtr         Data.Hashable.Class    src/Data/Hashable/Class.hs:776:1-49              19.0   44.6
hashPtrWithSalt Data.Hashable.LowLevel src/Data/Hashable/LowLevel.hs:(113,1)-(115,23)   13.4   29.7
$wcompareBytes  Main                   /tmp/1brc/1brc.hs:42:1-12                         3.9    0.0


                                                                                                                            individual      inherited
COST CENTRE                        MODULE                 SRC                                            no.     entries  %time %alloc   %time %alloc

MAIN                               MAIN                   <built-in>                                     233           0    0.0    0.0   100.0  100.0
 main                              Main                   /tmp/1brc/1brc.hs:246:1-4                      466           1    0.0    0.0   100.0  100.0
  main2                            Main                   <no location info>                             467           1    0.0    0.0   100.0  100.0
   main1                           Main                   <no location info>                             468           1    0.0    0.0   100.0  100.0
    $wmain                         Main                   <no location info>                             469           1   63.6   25.6   100.0  100.0
     hashPtr                       Data.Hashable.Class    src/Data/Hashable/Class.hs:776:1-49            476  1000000000   19.0   44.6    32.5   74.3
      hashPtrWithSalt              Data.Hashable.LowLevel src/Data/Hashable/LowLevel.hs:(113,1)-(115,23) 477  1000000000   13.4   29.7    13.4   29.7
     $wcompareBytes                Main                   /tmp/1brc/1brc.hs:42:1-12                      478   999999587    3.9    0.0     3.9    0.0
     lvl8                          Main                   <no location info>                             479         413    0.0    0.0     0.0    0.0
      lvl2                         Main                   <no location info>                             481        1239    0.0    0.0     0.0    0.0
      $j                           Main                   <no location info>                             482         413    0.0    0.0     0.0    0.0
     $fPrimRow_$csetByteArray#     Main                   /tmp/1brc/1brc.hs:123:10-17                    471           1    0.0    0.0     0.0    0.0
      defaultSetByteArray#         Data.Primitive.Types   Data/Primitive/Types.hs:(252,1)-(257,11)       472           1    0.0    0.0     0.0    0.0
       $fPrimRow_$cwriteByteArray# Main                   /tmp/1brc/1brc.hs:142:3-17                     474       10000    0.0    0.0     0.0    0.0
 CAF                               Main                   <entire-module>                                465           0    0.0    0.0     0.0    0.0
  $fPrimRow                        Main                   /tmp/1brc/1brc.hs:123:10-17                    473           1    0.0    0.0     0.0    0.0
  lvl1                             Main                   <no location info>                             480           1    0.0    0.0     0.0    0.0
  lvl11                            Main                   <no location info>                             470           1    0.0    0.0     0.0    0.0
  lvl9                             Main                   <no location info>                             475           1    0.0    0.0     0.0    0.0
 CAF                               Data.ByteString.Char8  <entire-module>                                421           0    0.0    0.0     0.0    0.0
 CAF                               GHC.Conc.Signal        <entire-module>                                361           0    0.0    0.0     0.0    0.0
 CAF                               GHC.Float              <entire-module>                                353           0    0.0    0.0     0.0    0.0
 CAF                               GHC.IO.Encoding        <entire-module>                                344           0    0.0    0.0     0.0    0.0
 CAF                               GHC.IO.Encoding.Iconv  <entire-module>                                342           0    0.0    0.0     0.0    0.0
 CAF                               GHC.IO.Handle.FD       <entire-module>                                334           0    0.0    0.0     0.0    0.0
 CAF                               GHC.Weak.Finalize      <entire-module>                                302           0    0.0    0.0     0.0    0.0
 CAF                               Text.Printf            <entire-module>                                294           0    0.0    0.0     0.0    0.0
 CAF                               GHC.IO.FD              <entire-module>                                266           0    0.0    0.0     0.0    0.0
1 Like