I think the above is probably enough to lay down the gauntlet. Can you beat the above code in <30 LOC, single-threaded?
I guess using four arrays in a record to store the values, and just storing the index (into those arrays) in the map, helps shave off some seconds.
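For illustration, a minimal sketch of that layout; the names, the use of unboxed mutable vectors, and the update helper are my assumptions, not the code referred to above:

-- Sketch: values live in four flat mutable arrays, the hash map only stores
-- an index into them. All names here (Stats, update, ...) are made up.
import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as M
import qualified Data.Vector.Unboxed.Mutable as VM

data Stats = Stats
  { statMin   :: VM.IOVector Double
  , statMax   :: VM.IOVector Double
  , statSum   :: VM.IOVector Double
  , statCount :: VM.IOVector Int
  }

-- The map from station name to its slot in the four arrays.
type Index = M.HashMap BS.ByteString Int

-- Updating a station touches only the four array slots at its index.
update :: Stats -> Int -> Double -> IO ()
update s i v = do
  VM.modify (statMin s)   (min v) i
  VM.modify (statMax s)   (max v) i
  VM.modify (statSum s)   (+ v)   i
  VM.modify (statCount s) (+ 1)   i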
Thanks, now I'm already <10s; let's see if I can get to 5s without changing the hash map (or doing anything non-portable, like using mmap).
Just in case somebody wants to compare the results (they aren't correctly rounded, but at least "diffable").
The best I've been able to do is this concurrent implementation, which takes less than 1m10s on my machine for the full 1B rows.
Your version needs about 2m20s on my computer, but it is a great starting point for optimisation. The first step should be to not parse the whole text twice.
Btw., you did not round your results correctly: there are negative zeroes (-0.0) in your output. Just in case somebody wants to compare your results with the reference.
I did a quick experiment. Since HashMap lookups are O(log N), I suspect that creating non-overlapping hashmaps from the stream and then combining them together would improve performance.
The processing rate is 667556 rows/sec for the vanilla 10 LOC version above. There are 8875 unique entries in the hash table, so O(log N) lookups are expensive. However, if I filter for cities starting with 'A', I end up with a hash table with 600 unique entries and a processing rate of 848446 rows/sec: roughly a 27% improvement with roughly 15x fewer unique entries.
There are 54 unique 1-letter city prefixes (with differing row counts) and 603 unique 2-letter city prefixes. I thus suspect that reducing the burden of O(log N) lookups with a simple prefix check could improve performance of the simple 10 LOC version by another 20-50%. Would you agree, or am I failing to take something into account?
I'll give it a go when I have time.
PS: An assumption is that picking the right hash table can be hard-coded and is therefore O(1). For example, in the single-character prefix version, the character can be used to index a Vector of HashMaps directly, which should be O(1).
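Purely as an illustration of that dispatch (the names and the single-pass V.accum fill are my own assumptions, not tested against the versions above):

import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as M
import qualified Data.Vector as V
import Data.Char (ord)

type Acc     = (Double, Double, Double, Double)          -- (min, max, sum, count)
type Buckets = V.Vector (M.HashMap BS.ByteString Acc)    -- one map per first byte

merge :: Acc -> Acc -> Acc
merge (a,b,c,d) (a',b',c',d') = (min a a', max b b', c + c', d + d')

-- Build all 256 buckets in one pass; the first byte of the name picks the
-- bucket in O(1), and each bucket's map stays much smaller than one big map.
-- (Strictness of the per-bucket maps would need attention for 1B rows.)
fillBuckets :: [(BS.ByteString, Double)] -> Buckets
fillBuckets rows =
  V.accum (\m (name, v) -> M.insertWith merge name (v, v, v, 1) m)
          (V.replicate 256 M.empty)
          [ (ord (BS.head name), (name, v)) | (name, v) <- rows ]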
I should've made clear that this is (as long as the data size has been chosen accordingly and everything fits in RAM) nothing other than memory access: access to the byte (or whatever) array holding the data, and mainly hash-map access (reading and writing).
So the main job of the optimisation is not doing unnecessary memory accesses, accessing memory as sequentially as possible, and keeping the data in memory as "together" as possible (to have as much of it as possible in the cores' caches).
The hash-map has two main problems:
- the hash itself
- it is big and access is random
So your idea of a trie-like structure helps keep data in smaller locations, which may offset the added indirection (as would storing an index into a record of arrays instead of storing the record in the hash map), so it is worth a try.
I don't know if that would actually help by itself, because consecutive updates would still often end up in different hashmaps. Perhaps you could do better if you buffer the updates and only perform them when you've collected a certain number of names that start with the same letter.
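A rough sketch of that buffering idea (the names, the batch size, and the use of Data.Map for the per-letter state are all my assumptions):

import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as M
import qualified Data.Map.Strict as Map
import Data.List (foldl')

type Acc     = (Double, Double, Double, Double)
type Station = M.HashMap BS.ByteString Acc

merge :: Acc -> Acc -> Acc
merge (a,b,c,d) (a',b',c',d') = (min a a', max b b', c + c', d + d')

batchSize :: Int
batchSize = 1024  -- arbitrary; would need tuning

-- Per first letter: (number of buffered rows, buffered rows, that letter's map).
type State = Map.Map Char (Int, [(BS.ByteString, Double)], Station)

-- Buffer each update under its first letter; flush into the letter's map
-- only once enough updates for that letter have accumulated.
step :: State -> (BS.ByteString, Double) -> State
step st (name, v) = Map.alter upd (BS.head name) st
  where
    upd Nothing            = Just (1, [(name, v)], M.empty)
    upd (Just (n, buf, m))
      | n + 1 >= batchSize = Just (0, [], flush ((name, v) : buf) m)
      | otherwise          = Just (n + 1, (name, v) : buf, m)

flush :: [(BS.ByteString, Double)] -> Station -> Station
flush buf m = foldl' (\acc (k, v) -> M.insertWith merge k (v, v, v, 1) acc) m buf

-- Any leftovers still need a final flush after the main fold:
finish :: State -> Map.Map Char Station
finish = Map.map (\(_, buf, m) -> flush buf m)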
GPT helped me quickly try a version which held the hashmaps in a mutable vector: epically slow.
I also quickly tried to see how much faster it would be by running on data I pre-split and put into data/:
time parallel -j1 'cat {} | ./1brc' ::: data/*.txt | wc -l
-j1 ensures that only one process is run: it takes 12s to finish, versus the vanilla version which takes 15s. That's about 20% faster. Running it fully parallel takes 4s, so < 7 min for 1B rows.
I think if there was a way to do this kind of segmenting very efficiently, some of the hashmap speed problems could be mitigated.
Through a bit of "wolf-fence" benchmarking I found out that we are spending by far most of our time on parsing the input floating-point numbers. With read from base, a 10-million-line input takes about 12 seconds on my machine. With a custom parseSignedFloat function it takes only 1.6 seconds. Here's my implementation of that parseSignedFloat:
parseSignedFloat :: BS.ByteString -> Float
parseSignedFloat bs = case BS.uncons bs of
  Just ('-',bs') -> negate (parseUnsignedFloat bs')
  _ -> parseUnsignedFloat bs

parseUnsignedFloat :: BS.ByteString -> Float
parseUnsignedFloat = whole 0 where
  whole !f !bs =
    case BS.uncons bs of
      Just ('.',bs') -> frac f 0.1 bs'
      Just (c,bs') ->
        let d = realToFrac (digitToInt c) in
        whole (10 * f + d) bs'
      Nothing -> f
  frac !f !m !bs =
    case BS.uncons bs of
      Just (c,bs') ->
        let d = realToFrac (digitToInt c) in
        frac (f + m * d) (m * 0.1) bs'
      _ -> f
That is a bit long for my taste and it is also quite unstable numerically. Maybe we can improve that?
I shrunk it down a bit; now the full program only takes 21 lines:
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.ByteString.Char8 as BS
import Data.Char

parseSignedFloat :: BS.ByteString -> Float
parseSignedFloat = result . BS.foldl' step (0 :: Int, 0, 0.1) where
  step (0, !n, !m) '-' = (1, n, m)
  step (s, !n, !m) c | s <= 1 = (s + 2, 10 * n + tof c, m)
  step (s, !n, !m) '.' | s <= 3 = (s + 2, n, m)
  step (s, !n, !m) c | s <= 3 = (s, 10 * n + tof c, m)
  step (s, !n, !m) c = (s, n + m * tof c, m * 0.1)
  tof c = realToFrac (ord c - ord '0')
  result (s, n, _) = if odd s then negate n else n

main :: IO ()
main = do
  str <- B.readFile "measurements.txt"
  let ls = map ((\(a,b)-> (a, parseSignedFloat (BS.tail (B.toStrict b)))) . B.break (== ';')) (B.lines str)
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      collect = foldl (\ m (k,v) -> M.insertWith f (B.toStrict k) (v,v,v,1) m) M.empty ls
  mapM_ (putStrLn . show) (M.mapWithKey (\k (a,b,c,d)-> (k,a,b,c/d)) collect)
Incredible, and totally surprising... Here is a slightly tighter version using the fact that the data has 1 decimal place, and the readInt function from ByteString. It's 14 LOC, but the run time is down from 15s to 4.2s for 10M rows: it is now roughly on par with the AWK version.
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe(fromJust)

main :: IO ()
main = do
  str <- B.getContents
  let parseInt b = i where (i,_) = fromJust $ B.readInt $ B.filter (/='.') $ B.tail b
      ls = map ((\(a,b)-> (a, parseInt b)) . B.break (== ';')) (B.lines str)
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      collect = foldl (\m (k,v) -> M.insertWith f (B.toStrict k) (v,v,v,1) m) M.empty ls
  mapM_ (putStrLn . show) $ M.mapWithKey
    (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),
      fromIntegral a/10,fromIntegral b/10,(fromIntegral c)/(10*d))) collect
EDIT:
A narrowly faster version using splitWith:
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe(fromJust)

main :: IO ()
main = do
  str <- B.getContents
  let parseInt = fst . fromJust . B.readInt . (B.filter (/='.'))
      f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
        i `seq` j `seq` k `seq` l `seq` (i,j,k,l)
      acc (k:x:xs) m = let v=parseInt x in acc xs (M.insertWith f (B.toStrict k) (v,v,v,1) m)
      acc _ m = m
      collect = acc (B.splitWith (\x->x==';' || x=='\n') str) M.empty
  mapM_ (putStrLn . show) $ M.mapWithKey
    (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),
      fromIntegral a/10,fromIntegral b/10,(fromIntegral c)/(10*d))) collect
It can be either 1 or 2 decimals
It might be slightly faster, but the lazy version will occupy much less memory.
You all are still parsing everything twice. Although Haskell's laziness may help insofar as it doesn't split the whole data file at once (and the end of the file is cached when parsing the start).
The relevant part of the rules:
Temperature value: non null double between -99.9 (inclusive) and 99.9 (inclusive), always with one fractional digit
So one or two digits, a decimal point and exactly one digit.
Are you sure? Unless it's in the specification, checking my actual file with cat measurements.txt | grep -E "\\.[0-9]{2}" turns up nothing.
You're right. The memory usage is very low for 10M rows but explodes for 1B. I'll edit it above.
I'm struggling to understand this. In what sense is everything parsed twice?
EDIT: Ok, I guess you mean lines yields strings which are then parsed again, hence parsing everything twice.
EDIT2: I can't find a way to merge break, lines and parseInt into a single pass short of using attoparsec / megaparsec. Can you see a more pedestrian route?
You have to change the algorithm into a single "loop" (well, recursive function) and do "simple" parsing (a sketch):
parse :: ByteString -> ....
parse content acc ... =
  let
    (stationName, rest') = B.break (== ';') content
    rest''               = B.drop 1 rest'
    ((a, b, c, d), rest) = parseTemps rest''
    acc'                 = add (a, b, c, d) acc
  in
    parse rest acc'
If it should be really fast, never use regex/parser combinators/parser generators/...
No, apologies. I thought it was always 3 significant figures, but it seems that was wrong.
I don't think that is necessarily true. For example I think flatparse would do well for this problem.
Here's my go at what you said. What I thought I was doing is progressively consuming the stream and avoiding multiple passes. It works for small inputs and doesn't leak memory, but something is totally wrong: lines take longer and longer to parse as it goes on. No idea why at the moment; some technicality, no doubt.
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Lazy.Char8 as B
import Data.Maybe(fromJust)

parse bs m
  | B.null bs = m
  | otherwise = parse (B.tail bs'') $! M.insertWith f (B.toStrict k) (v,v,v,1) m
  where (k, bs') = B.break (== ';') bs
        (v, bs'') = fromJust $ B.readInt $ B.filter (/='.') (B.tail bs')
        f (a,b,c,d) (a',b',c',d') = let (i,j,k,l) = (min a a', max b b', c+c', d+d') in
          i `seq` j `seq` k `seq` l `seq` (i,j,k,l)

main :: IO ()
main = do
  bs <- B.getContents
  mapM_ (putStrLn . show) $ M.mapWithKey
    (\k (a,b,c,d)-> (B.unpack (B.fromStrict k),
      fromIntegral a/10,fromIntegral b/10,(fromIntegral c)/(10*d)))
    (parse bs M.empty)
bytestring-lexing ("Efficiently parse and produce common integral and fractional numbers") comes to mind.
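For what it's worth, a minimal sketch of how that package's readSigned/readDecimal might slot in here (parseTemp is my name, untested against the versions above):

import qualified Data.ByteString.Char8 as BS
import Data.ByteString.Lex.Integral (readSigned)
import Data.ByteString.Lex.Fractional (readDecimal)

-- Parse e.g. "-12.3" from the front of a strict ByteString, returning the
-- value and the unconsumed rest, so it could slot into a single-pass loop.
parseTemp :: BS.ByteString -> Maybe (Double, BS.ByteString)
parseTemp = readSigned readDecimal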
Yes, here's an implementation that uses flatparse:
import qualified Data.HashMap.Strict as M
import qualified Data.ByteString.Char8 as BS
import FlatParse.Basic hiding (char, isDigit)
import Data.Char

data T = T !Float !Float !Float !Float
data S = S !BS.ByteString !Float

anyFloat :: Parser () Float
anyFloat = withOption (satisfyAscii (== '-')) (\_ -> negate <$> anyPositiveFloat) anyPositiveFloat where
  anyPositiveFloat = do
    x <- anyAsciiDecimalInt <* skipSatisfyAscii (== '.')
    c <- satisfyAscii isDigit
    pure $! fromIntegral x + fromIntegral (ord c - ord '0') * 0.1

parseLine :: Parser () S
parseLine = do
  name <- byteStringOf (skipSome (satisfyAscii (/= ';'))) <* skipAnyAsciiChar
  f <- anyFloat <* skipSatisfyAscii (== '\n')
  return $! S name f

parseMeasurements :: Parser () (M.HashMap BS.ByteString T)
parseMeasurements = go M.empty where
  go !m = withOption parseLine (\(S k v) -> go (M.insertWith f k (T v v v 1) m)) (pure m)
  f (T a b c d) (T a' b' c' d') = T (min a a') (max b b') (c+c') (d+d')

main = do
  str <- BS.readFile "measurements.txt"
  OK x _ <- pure $ runParser parseMeasurements str
  M.foldrWithKey (\k (T a b c d) go -> print (k, a, b, c/d) *> go) (pure ()) x
That's the fastest solution I've seen on my machine up to now, clocking in at about 1 second for 10 million rows.