Of course “write your code as similar to the fast imperative solution as possible” isn’t what we (well, at least I) wanted to experience (although it isn’t unexpected).
Interestingly, neither the usage of vector-hashtables
instead of Data.HashMap.Strict
nor (and this is a bit worrying) the usage of (unsafe) indices when parsing the temperature instead of pattern matching the result of ByteString.uncons
did yield any gains, on the contrary, both where slightly slower than the initial version. Also using mmap made the program about 5s (5%) slower.
So, now the (still) fastest single thread version, with almost correct output - sorted and rounded. “Almost” means that there is an extra comma and space ,
at the end of the output, which I couldn’t be arsed to remove. I’ve added it to the reference file to compare against instead ;).
Time is still 100s on my computer, twice as slow as the equivalent Go version. The next steps are reading the file in chunks and afterwards doing this in parallel. If Haskell/GHC doesn’t do anything surprising, the end time should be at about 15s (on my computer), still about twice the time of the parallelised Go version.
Code
{-# LANGUAGE StrictData #-}
{-# LANGUAGE ViewPatterns #-}
module Main (main) where
import Control.Monad (when)
import Data.Array.IO qualified as A
import Data.ByteString.Char8 qualified as BS
import Data.Char (Char, ord)
import Data.Foldable (traverse_)
import Data.HashMap.Strict qualified as HM
import Data.List (sort)
import Data.Text.Encoding qualified as TE
import Data.Word (Word64)
import GHC.Float (roundDouble)
import System.Environment (getArgs)
import Text.Printf qualified as T
data TemperatureData
= TemperatureData
(A.IOUArray Word64 Word64)
(A.IOUArray Word64 Int)
(A.IOUArray Word64 Int)
(A.IOUArray Word64 Int)
parse ::
Word64 ->
BS.ByteString ->
(HM.HashMap BS.ByteString Word64, TemperatureData) ->
IO (HM.HashMap BS.ByteString Word64, TemperatureData)
parse _ (BS.null -> True) acc = pure acc
parse idx content (accMap, TemperatureData countsT sumT minT maxT) = do
let (stationName, rest') = BS.break (== ';') content
let rest2 = BS.drop 1 rest'
let (temp, rest) = case BS.uncons rest2 of
Nothing -> (0, rest2)
Just ('-', rest2') -> parseNegTemp rest2'
Just (ch1, rest3) -> parseTemp ch1 rest3
(newAcc, newIdx) <- case HM.lookup stationName accMap of
Nothing -> do
A.writeArray countsT idx 1
A.writeArray sumT idx temp
A.writeArray minT idx temp
A.writeArray maxT idx temp
pure ((HM.insert stationName idx accMap, TemperatureData countsT sumT minT maxT), idx + 1)
Just idxM -> do
A.modifyArray' countsT idxM (+ 1)
A.modifyArray' sumT idxM (+ temp)
A.modifyArray' minT idxM (min temp)
A.modifyArray' maxT idxM (max temp)
pure ((accMap, TemperatureData countsT sumT minT maxT), idx)
parse newIdx (BS.drop 1 rest) newAcc
{-# INLINE parseNegTemp #-}
parseNegTemp :: BS.ByteString -> (Int, BS.ByteString)
parseNegTemp rest = case BS.uncons rest of
Nothing -> (0, rest)
Just (ch1, rest3) -> case BS.uncons rest3 of
Nothing -> (0, rest3)
Just ('.', rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
Just (ch2, rest5) -> ((-1) * (10 * ord ch1 + ord ch2 - 11 * ord '0'), rest5)
Just (ch2, rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
-- Must be '.'
Just (_, rest5) -> case BS.uncons rest5 of
Nothing -> (0, rest5)
Just (ch3, rest6) -> ((-1) * (100 * ord ch1 + 10 * ord ch2 + ord ch3 - 111 * ord '0'), rest6)
{-# INLINE parseTemp #-}
parseTemp :: Char -> BS.ByteString -> (Int, BS.ByteString)
parseTemp ch1 rest = case BS.uncons rest of
Nothing -> (0, rest)
Just ('.', rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
Just (ch2, rest5) -> (10 * ord ch1 + ord ch2 - 11 * ord '0', rest5)
Just (ch2, rest4) -> case BS.uncons rest4 of
Nothing -> (0, rest4)
-- Must be '.'
Just (_, rest5) -> case BS.uncons rest5 of
Nothing -> (0, rest5)
Just (ch3, rest6) -> (100 * ord ch1 + 10 * ord ch2 + ord ch3 - 111 * ord '0', rest6)
main :: IO ()
main = do
args <- getArgs
when (null args) $ T.perror "Error: no data file to read given! Exiting."
content <- BS.readFile $ head args
c <- A.newArray_ (0 :: Word64, 10000 :: Word64)
s <- A.newArray_ (0 :: Word64, 10000 :: Word64)
m <- A.newArray_ (0 :: Word64, 10000 :: Word64)
n <- A.newArray_ (0 :: Word64, 10000 :: Word64)
(ma, TemperatureData rC rS rM rN) <- parse 0 content (HM.empty, TemperatureData c s m n)
let keys = sort (HM.keys ma)
T.printf "{"
traverse_
( \k ->
do
let i = HM.findWithDefault 0 k ma
let name = TE.decodeUtf8 k
count <- A.readArray rC i
sum <- A.readArray rS i
tMin <- A.readArray rM i
tMax <- A.readArray rN i
let mean = fromIntegral sum / fromIntegral count
T.printf "%s=%.1f/%.1f/%.1f, " name (roundJava $ fromIntegral tMin) (roundJava mean) (roundJava $ fromIntegral tMax)
)
keys
T.printf "}"
{-# INLINE roundJava #-}
roundJava :: Double -> Double
roundJava y =
let
r = fromIntegral (roundDouble y :: Int)
in
case y of
x
| x < 0.0 && r - x == 0.5 -> r / 10.0
| abs (x - r) >= 0.5 -> (r + signum y) / 10.0
| otherwise -> r / 10.0
Yeah, for me as a noob, seeing the progression from my naive slow 14LOC solution through 4 optimisations to a 14LOC solution which runs in 4s (10M rows) was very useful. It gave me a good sense of how to approach the next such problem and where the low hanging fruit is. The introduction of a 25LOC flatparse
version and with it a ~1s solution is just within my bit wrangling threshold, but further than and I’d personally write it in C and use FFI. The optimal C version goes down to 0.15s – as in 15/100 seconds – for 1B rows, so wrangling has more purchase if pure performance is what is needed.
Nothing useful to add, I definitely expected Haskell performance to be very close to Go, I’m quite surprised.
I was educated more than surprised There is a whole bunch of other languages and their benchmarks here: Python, Dart, C#, Java, Javascript, C, C++, Rust, Zig, Scala, Erlang and Crystal, all have faster solutions than ours.
I don’t think its a big deal … The closer to the metal you can get, the faster it will go, but that isn’t Haskell’s selling point. I’m personally here for equational reasoning, an elegant type system and beautiful concise code.
The problem is the non-mutable hashmap.
Actually, as soon as the optimisation is “good enough”, this is a comparison of hashmap implementations - any “fast” version uses one tailored to the problem, as far as I’ve seen.
Now you are selling Haskell under it’s value. I seriously doubt there is a Python, Dart or JS Version that is comparable. And BEAM isn’t known to be performant (it has low latency!) either. For example the fastest Elixir version (1 billion rows, 10K stations) needs 28GB RAM, that’s an overhead of 13GB.
Just click the link and see for yourself Here is a 45s Python version, here is a 17s DART version, and so on.
Here is a write up of how to get a 480s C version to 1.6s: it makes it easy to see how far a hashmap will get you.
You seem to be saying that the fastest Haskell version is ~1s for 10M rows, so ~100s for 1B rows, whereas there is a C version that is 0.15s on 1B rows, i.e. almost 1000x faster. That is very suprising to me, if it’s a like for like comparison (e.g. same level of parallelism), and suggests there is a significantly faster Haskell version waiting to be found.
The best code so far is @jaror and it isn’t parallel yet, but it completes the 1B rows in 80s by his report. 8 cores is the prescription for the challenge on specific hardware, if making it parallel was completely free, it would finish in 8s: so just two orders of magnitude slower than C, not three
No, mine is still 2.4 times faster. And you can’t compare runtime at all. E.g. Jaror’s 80s are 240s on my computer, which is about half as “fast” as the official one (the one the Java and one Go version have been benchmarked).
Certainly not intended to doubt you, and only to say that yours and his builds may not be comparable, but has anyone else corroborated the relative timings? I can’t seem to run your code above – ghci says Data.Array.IO does not export modifyArray
, and I note that for some reason his single threaded code takes 3 times longer to run on your machine than his: it isn’t likely that his processor is 3 times faster, and it is single threaded code.
This is a simple problem, where comparable solutions produce comparable times.
That’s why it’s of utmost importance that everybody uses the same input data (or at least the same input data generator) and compares the output with a reference solution. The least thing to do is run a time wc -l
on the input data, control if there are 1,000,000,000 rows and post the time.
I’m going to add the Haskell part to the Go code when I have time, for now (this includes unnecessary packages):
ghcup’s GHC 9.6.4
dependencies:
- base >= 4.7 && < 5
ghc-options:
- -Wall
- -Wcompat
- -Widentities
- -Wincomplete-record-updates
- -Wincomplete-uni-patterns
- -Wmissing-export-lists
- -Wmissing-home-modules
- -Wpartial-fields
- -Wredundant-constraints
- -XGHC2021
- -O2
- -threaded
executables:
exe-exe:
main: Main.hs
source-dirs: app
ghc-options:
- -threaded
- -rtsopts
- -with-rtsopts=-N
dependencies:
- flatparse
- text
- bytestring
- unordered-containers
- mmap
- array
- linear
- vector
- vector-hashtables
and
resolver:
url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/22/11.yaml
system-ghc: true
I had a brief look at the Core/STG and did a bit of TickyTicky profiling. The problem is that alter
expression:
liftIO $ alter m (\x -> Just $ case x of Nothing -> (V4 v v v 1); Just (V4 a b c d) -> V4 (min a v) (max b v) (c+v) (d+1)) k
I realised that this lambda accounts for ~20% of all allocations and got rid of it by doing Just $! case x of ...
,
liftIO $ alter m (\x -> Just $! case x of Nothing -> (V4 v v v 1); Just (V4 a b c d) -> V4 (min a v) (max b v) (c+v) (d+1)) k
Now it properly inlines.
Alas, that’s not enough: I see that $wonFound
causes an enormous amount of allocations, because it allocates a closure and also because it is lazy in V4 Float
and thus will allocate 4 boxed floats as well as a 4 word vector V4
. It comes from the implementation of alter
, which is unfortunately too lazy in the updated value:
alter
:: ( MVector ks k, MVector vs v, DeleteEntry ks, DeleteEntry vs
, PrimMonad m, Hashable k, Eq k
)
=> Dictionary (PrimState m) ks k vs v -> (Maybe v -> Maybe v) -> k -> m ()
alter ht f k = do
d@Dictionary{..} <- readMutVar . getDRef $ ht
let
hashCode' = hash k .&. mask
targetBucket = hashCode' `rem` A.length buckets
onFound' value' dict i = insertWithIndex targetBucket hashCode' k value' (getDRef ht) dict i
-- ^ BAD BAD BAD not strict in value'
onNothing' dict i = deleteWithIndex targetBucket hashCode' d k (-1) i
onFound dict i = do
d'@Dictionary{..} <- readMutVar . getDRef $ dict
v <- value !~ i
case f (Just v) of
Nothing -> onNothing' d' i
Just v' -> onFound' v' d' i
onNothing dict = do
d' <- readMutVar . getDRef $ dict
case f Nothing of
Nothing -> return ()
Just v' -> onFound' v' d' (-1)
void $ atWithOrElse ht k onFound onNothing
{-# INLINE alter #-}
Rather than trying to fix alter
and having to duct tape our way around this absurd amount of laziness, I would go ahead and implement insertWith
, which would be something like this:
insertWith
:: ( MVector ks k, MVector vs v, DeleteEntry ks, DeleteEntry vs
, PrimMonad m, Hashable k, Eq k
)
=> Dictionary (PrimState m) ks k vs v -> (v -> v -> v) -> k -> v -> m ()
insertWith ht f k v = do
d@Dictionary{..} <- readMutVar . getDRef $ ht
let
hashCode' = hash k .&. mask
targetBucket = hashCode' `rem` A.length buckets
onFound dict i = do
d'@Dictionary{..} <- readMutVar . getDRef $ dict
v' <- value !~ i
insertWithIndex targetBucket hashCode' k (f v' v) (getDRef ht) d' i
onNothing dict = do
d' <- readMutVar . getDRef $ dict
insertWithIndex targetBucket hashCode' k v (getDRef ht) d' (-1)
void $ atWithOrElse ht k onFound onNothing
{-# INLINE insertWith #-}
I went out of time before type-checking.
This sadly doesn’t change much. It seems to help a bit if I make that f v' v
call strict. That together with some other changes (like making a custom V4 type such that the floats can be unpacked) got the time down to 0.58 seconds for 10m and 60.7 seconds for 1b rows. Here’s my current implementation: GitHub - noughtmare/1brc-hs
One thing I do notice when looking at the core if I disable inlining for that insertWith
function is that it is duplicated 11 times in the parseLine
function. Also the parseLine
function is over 60000 lines of STG. That’s quite a lot!
1000000000 measurements-1b.txt
real 0m5,467s
user 0m0,186s
sys 0m4,316s
This is 3 times as fast too, so everything is consistent, wc -l
is 17s (15.32s user 1.44s system 96% cpu 17.379 total
) here.
May I ask you, what CPU do you use for the benchmarks?
I’m using an AMD 5600X. And perhaps also important is that I’m using a 7400 MB/s NVMe SSD (that is the advertised maximum read speed).
AMD 5600X
Base Clock 3.7GHz
Max. Boost Clock Up to 4.6GHz
L3 Cache 32MB
7400 MB/s NVMe SSD
I have:
Apple M1 Max Pro
Max. Clock 3206 MHz (it did run at that frequency, I checked that)
L3 Cache 48MB
SSD has a measured speed of 4629.7MB/s write 5180.3 MB/s read
There is no reason for your CPU to be 3times as fast, I’ve got to complain to Apple
I’ve just done a test (because cross compiling is easy and fast with Go): the AMD64 1BRC binary runs about 100ms (7.0s vs 6.9s) faster in Rosetta than the native ARM64 binary. I guess GHC’s ARM code generation isn’t significantly better …
This is already great progress; total allocations went down from 5GB to 1.5GB.
I had another look at the ticky profile. The bulk of the remaining allocation is in $wparseLine
, and there I see that it allocates a fresh ByteString
on every entry. This appears to be due to flatparse
's heavy use of unboxed sums and GHC’s current inability to unbox through those. But actually it appears that we don’t get around allocating that (4 word) ByteString
, because we need to store it in the Hashtable as well, for some reason.
I also noticed that there are a lot of lets relating to updating Dictionary
s. That made me realise that all the fields of Dictionary
are lazy and thus not unboxed: https://hackage.haskell.org/package/vector-hashtables-0.1.1.4/docs/src/Data.Vector.Hashtables.Internal.html#Dictionary_. It is also a bit problematic that we can’t unbox Dictionary_
itself through the MutVar#
; for a tight inner loop it would be preferable to get a hold of that inner Dictionary_
and update that instead of accessing it repeatedly through a ref cell.
So there is quite a lot that can be improved in vector-hashtables
in order to optimise for constants. I’m pretty sure it would be worth doing so; the inner loop should not make any allocations, except perhaps for new ByteString
slices.
The other high hitter in the profile is anyFloat
. There we mainly allocate Float
s, due to the same unboxed sum reasons we allocate ByteString
s in parseLine
. We also seem to allocate an Int
that is never used. Gah, we should equip Demand Analysis to reason through unboxed sums. flatparse
provides ample motivation to do so. (TODO create issue.)
Could someone try my parser implementation from 1brc.hs · GitHub? I don’t see how flatparse
or anything generic could go better.
data Entry = Entry
{ _station :: !Station
, _temperature :: !Int
} deriving (Show)
-- Bayawan;-21.1
-- Andranomenatsa;-1.2
-- Benton Harbor;36.2
-- Taulahā;0.6
parseLine :: ByteString -> Entry
parseLine xs = case x4 of
W8# 59#Word8 -- ord ';'
-> Entry (Station $ B.unsafeTake (l - 4) xs) (x3' * 10 + x1' - 528)
W8# 45#Word8 -- ord '-'
-> Entry (Station $ B.unsafeTake (l - 5) xs) (528 - x3' * 10 - x1')
_ -> case x5 of
W8# 59#Word8 -- ord ';'
-> Entry (Station $ B.unsafeTake (l - 5) xs) (x4' * 100 + x3' * 10 + x1' - 5328)
_ -- ord '-'
-> Entry (Station $ B.unsafeTake (l - 6) xs) (5328 - x4' * 100 - x3' * 10 - x1')
where
l = B.length xs
x1 = B.unsafeIndex xs (l - 1) -- last digit
x3 = B.unsafeIndex xs (l - 3) -- another digit
x4 = B.unsafeIndex xs (l - 4) -- digit or sign or semicolon
x5 = B.unsafeIndex xs (l - 5) -- sign or semicolon
x1' = fromIntegral x1
x3' = fromIntegral x3
x4' = fromIntegral x4
Parsing is not the bottleneck. This only improves my implementation to 59.2 seconds for 1 billion rows. Although it might make it easier to spot improvement opportunities in the Core.