One Billion Row challenge in Hs

emiruz · March 6, 2024, 12:38pm

Trying running yours with mawk if you haven’t already – it may make a difference. I like your Go write up by the way: very nice

Indeed, but I don’t mind. It won’t affect the run time much.

jaror · March 6, 2024, 1:01pm

Maybe you can speed it up with bytestrings:

import qualified Data.Map.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
  str <- B.getContents --readFile "measurements.txt"
  let ls = map ((\(a,b)-> (a,read (B.unpack (B.tail b)) :: Float)) . B.break (== ';')) (B.lines str)
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      collect = foldl (\ m (k,v) -> M.insertWith f (B.toStrict k) (v,v,v,1) m) M.empty ls
  mapM_ (putStrLn . show) $ M.mapWithKey (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),a,b,c/d)) collect

emiruz · March 6, 2024, 1:08pm

Wow! Down from 25s to 15s (for 10M rows) using HashMap and ByteString at 10 lines of code:

import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
  str <- B.getContents --readFile "measurements.txt"
  let ls = map ((\(a,b)-> (a,read (B.unpack (B.tail b)) :: Float)) . B.break (== ';')) (B.lines str)
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      collect = foldl (\ m (k,v) -> M.insertWith f (B.toStrict k) (v,v,v,1) m) M.empty ls
  mapM_ (putStrLn . show) $ M.mapWithKey (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),a,b,c/d)) collect

emiruz · March 6, 2024, 1:23pm

I think the above is probably enough to lay down the gauntlet Can you beat the above code, in <30 LOC, single threaded?

ReleaseCandidate · March 6, 2024, 2:07pm

I guess using 4 arrays in a record to store the values and just storing the index (for the arrays) in the map helps shaving off some seconds.

Thanks, now I’m already <10s, let’s see if I can get to 5 without changing the hash map (or doing anything non-portable, like using mmap).

Just in case somebody would want to compare the results (they aren’t correctly rounded, but at least “diffable”).

vaibhavsagar · March 8, 2024, 12:48am

The best I’ve been able to do is this concurrent implementation, which takes less than 1m10s on my machine for the full 1B rows.

ReleaseCandidate · March 8, 2024, 8:42am

Your version needs about 2m20s on my computer, but it is a great starting point to optimisation. The first step should be to not parse the whole text twice.

Btw. you did not round your results correctly, there are negative zeroes -0.0 in your output, just in case somebody wants to compare your results with the reference.

emiruz · March 8, 2024, 9:45am

I did a quick experiment. Since HashMap lookups are O(log N) I suspect that creating non overlapping hashmaps from the stream and then combining them together would improve performance.

The row per second processing rate is 667556 row/sec for the vanilla 10 LOC version above. There are 8875 unique entries in the hash table, so O(log N) lookups are expensive. However, if I filter for cities starting with ‘A’ I end up with a hash table with 600 unique entries, and a processing rate of 848446 row/sec: roughly 27% improvement with roughly 15x fewer unique entries.

There are 54 unique 1 letter city prefixes, with differing row numbers, and 603 unique 2 letter city prefixes. I suspect thus that reducing the burden of O(logN) lookups with a simple prefix check could improve performance of the simple 10 LOC version by another 20-50%. Would you agree or am I failing to take something into account?

I’ll give it a go when I have time.

PS: An assumption is that picking the right hash table can be hard-coded and therefore O(1). For example, in the single character prefix version, the character can be used to index a Vector of HashMaps directly which should be O(1).

ReleaseCandidate · March 8, 2024, 10:48am

I should’ve made clear, that this is (as long as the data size has been chosen accordingly and everything fits in RAM) nothing else than memory access, of the byte (or whatever) array holding the data and mainly hash-map access (reading and writing).
So the main job of the optimisation is not doing unnecessary memory access, accessing memory as sequentially as possible and keeping the data in memory as “together” as possible (to have as much as possible of it in the cores caches).
The hash-map has two main problems:

the hash itself
it is big and access is random

So your idea of a trie-like structure helps keeping data in smaller locations, which may offset the added indirection (as storing an index into a record of arrays instead of storing the record in the hash map, is worth a try.

jaror · March 8, 2024, 11:18am

I don’t know if that would actually help by itself, because consecutive updates would still often end up in different hashmaps. Perhaps you could do better if you buffer the updates and only perform them when you’ve collected a certain number of names that start with the same letter.

emiruz · March 8, 2024, 1:14pm

GPT helped me quickly try a version which held the hashmaps in a mutable vector: epically slow.

I also quickly tried to see how much faster it would be by running on data I pre-split and put into data/:

time parallel -j1 'cat {} | ./1brc' ::: data/*.txt | wc -l

-j1 ensures that only one process is run: it takes 12s to finish, versus vanilla version which is 15s. That’s about 20% faster. Running it fully parallel takes 4s, so < 7 min for 1B rows.

I think if there was a way to do this kind of segmenting very efficiently, some of the hashmap speed problems could be mitigated.

jaror · March 9, 2024, 11:27am

Through a bit of “wolf-fence” benchmarking I found out that we are spending by far most of our time on parsing the input floating point numbers. With read from base, a 10 million line input takes about 12 seconds on my machine. With a custom parseSignedFloat function it takes only 1.6 seconds. Here’s my implementation of that parseSignedFloat:

parseSignedFloat :: BS.ByteString -> Float
parseSignedFloat bs = case BS.uncons bs of 
  Just ('-',bs') -> negate (parseUnsignedFloat bs')
  _ -> parseUnsignedFloat bs

parseUnsignedFloat :: BS.ByteString -> Float
parseUnsignedFloat = whole 0 where
  whole !f !bs =
    case BS.uncons bs of
      Just ('.',bs') -> frac f 0.1 bs'
      Just (c,bs') -> 
        let d = realToFrac (digitToInt c) in
        whole (10 * f + d) bs'
      Nothing -> f

  frac !f !m !bs =
    case BS.uncons bs of
      Just (c,bs') ->
        let d = realToFrac (digitToInt c) in
        frac (f + m * d) (m * 0.1) bs'
      _ -> f

That is a bit long for my taste and it is also quite unstable numerically. Maybe we can improve that?

jaror · March 9, 2024, 12:19pm

I shrunk it down a bit, now the full program only takes 21 lines:

import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.ByteString.Char8 as BS
import Data.Char

parseSignedFloat :: BS.ByteString -> Float
parseSignedFloat = result . BS.foldl' step (0 :: Int, 0, 0.1) where
  step (0, !n, !m) '-' = (1, n, m)
  step (s, !n, !m) c | s <= 1 = (s + 2, 10 * n + tof c, m)
  step (s, !n, !m) '.' | s <= 3 = (s + 2, n, m)
  step (s, !n, !m) c | s <= 3 = (s, 10 * n + tof c, m)
  step (s, !n, !m) c = (s, n + m * tof c, m * 0.1)
  
  tof c = realToFrac (ord c - ord '0')
  result (s, n, _) = if odd s then negate n else n

main :: IO ()
main = do
  str <- B.readFile "measurements.txt"
  let ls = map ((\(a,b)-> (a, parseSignedFloat (BS.tail (B.toStrict b)))) . B.break (== ';')) (B.lines str)
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      collect = foldl (\ m (k,v) -> M.insertWith f (B.toStrict k) (v,v,v,1) m) M.empty ls
  mapM_ (putStrLn . show) (M.mapWithKey (\k (a,b,c,d)-> (k,a,b,c/d)) collect)

emiruz · March 9, 2024, 12:32pm

Incredible, and totally surprising… Here is a slightly tighter version using the fact that the data is 1 decimal place, and the readInt function from ByteString. Its 14 LOC but the run time is down from 15s to 4.2s for 10M rows: it is now roughly on par with AWK version.

import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe(fromJust)

main :: IO ()
main = do
  str <- B.getContents
  let parseInt b = i where (i,_) = fromJust $ B.readInt $ B.filter (/='.') $ B.tail b
      ls = map ((\(a,b)-> (a, parseInt b)) . B.break (== ';')) (B.lines str)
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      collect = foldl (\m (k,v) -> M.insertWith f (B.toStrict k) (v,v,v,1) m) M.empty ls
  mapM_ (putStrLn . show) $ M.mapWithKey
    (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),
                     fromIntegral a/10,fromIntegral b/10,(fromIntegral c)/(10*d))) collect

EDIT:

A narrowly faster version using splitWith:

import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe(fromJust)

main :: IO ()
main = do
  str <- B.getContents
  let parseInt = fst . fromJust . B.readInt . (B.filter (/='.'))
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)  
      acc (k:x:xs) m = let v=parseInt x in acc xs (M.insertWith f (B.toStrict k) (v,v,v,1) m)
      acc _ m = m
      collect = acc (B.splitWith (\x->x==';' || x=='\n') str) M.empty
  mapM_ (putStrLn . show) $ M.mapWithKey
    (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),
                     fromIntegral a/10,fromIntegral b/10,(fromIntegral c)/(10*d))) collect

jaror · March 9, 2024, 12:39pm

It can be either 1 or 2 decimals

It might be slightly faster, but the lazy version will occupy much less memory.

ReleaseCandidate · March 9, 2024, 1:05pm

You all are still parsing everything twice. Although Haskell’s laziness may help insofar, as it doesn’t split the whole data file at once (and the end of the file is cached when parsing the start).

The relevant part of the rules:

Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit

So one or two digits, a decimal point and exactly one digit.

emiruz · March 9, 2024, 1:14pm

Are you sure? Unless its in the specification, if I checking my actual file with cat measurements.txt | grep -E "\\.[0-9]{2}", it turns up nothing.

You’re right. The memory usage is very low for 10M rows but explodes for 1B. I’ll edit it above.

emiruz · March 9, 2024, 1:15pm

I’m struggling to understand this. In what sense is everything parsed twice?

EDIT: Ok, I guess you mean lines yields strings which are then parsed again, hence parsing everything twice.

EDIT2: I can’t find a way to merge break, lines and parseInt into a single pass short of using attoparse / megaparsec. Can you see a more pedestrian route?

ReleaseCandidate · March 9, 2024, 3:05pm

You have to change the algorithm into a single “loop” (well, recursive function) and do “simple” parsing (a sketch):

parse ::  ByteString -> ....
parse content acc ... = 
   let 
      stationName, rest' = B.break (== ';') content
      rest'' = drop 1 rest'
      (a, b, c, d), rest = parseTemps rest''
      acc' = add (a, b, c, d) acc 
   in
     parse rest acc'

If it should be really fast, never use regex/parser combinators/parser generators/…

jaror · March 9, 2024, 3:27pm

No, apologies. I thought it was always 3 significant figures, but it seems that was wrong.

I don’t think that is necessarily true. For example I think flatparse would do well for this problem.