One Billion Row challenge in Hs

Of course “write your code as similar to the fast imperative solution as possible” isn’t what we (well, at least I) wanted to experience :wink: (although it isn’t unexpected).

Interestingly, neither the usage of vector-hashtables instead of Data.HashMap.Strict nor (and this is a bit worrying) the usage of (unsafe) indices when parsing the temperature instead of pattern matching the result of ByteString.uncons did yield any gains, on the contrary, both where slightly slower than the initial version. Also using mmap made the program about 5s (5%) slower.

So, now the (still) fastest single thread version, with almost correct output - sorted and rounded. “Almost” means that there is an extra comma and space , at the end of the output, which I couldn’t be arsed to remove. I’ve added it to the reference file to compare against instead ;).

Time is still 100s on my computer, twice as slow as the equivalent Go version. The next steps are reading the file in chunks and afterwards doing this in parallel. If Haskell/GHC doesn’t do anything surprising, the end time should be at about 15s (on my computer), still about twice the time of the parallelised Go version.

Code
{-# LANGUAGE StrictData #-}
{-# LANGUAGE ViewPatterns #-}

module Main (main) where

import Control.Monad (when)
import Data.Array.IO qualified as A
import Data.ByteString.Char8 qualified as BS
import Data.Char (Char, ord)
import Data.Foldable (traverse_)
import Data.HashMap.Strict qualified as HM
import Data.List (sort)
import Data.Text.Encoding qualified as TE
import Data.Word (Word64)
import GHC.Float (roundDouble)
import System.Environment (getArgs)
import Text.Printf qualified as T

data TemperatureData
  = TemperatureData
      (A.IOUArray Word64 Word64)
      (A.IOUArray Word64 Int)
      (A.IOUArray Word64 Int)
      (A.IOUArray Word64 Int)

parse ::
  Word64 ->
  BS.ByteString ->
  (HM.HashMap BS.ByteString Word64, TemperatureData) ->
  IO (HM.HashMap BS.ByteString Word64, TemperatureData)
parse _ (BS.null -> True) acc = pure acc
parse idx content (accMap, TemperatureData countsT sumT minT maxT) = do
  let (stationName, rest') = BS.break (== ';') content
  let rest2 = BS.drop 1 rest'
  let (temp, rest) = case BS.uncons rest2 of
        Nothing -> (0, rest2)
        Just ('-', rest2') -> parseNegTemp rest2'
        Just (ch1, rest3) -> parseTemp ch1 rest3
  (newAcc, newIdx) <- case HM.lookup stationName accMap of
    Nothing -> do
      A.writeArray countsT idx 1
      A.writeArray sumT idx temp
      A.writeArray minT idx temp
      A.writeArray maxT idx temp
      pure ((HM.insert stationName idx accMap, TemperatureData countsT sumT minT maxT), idx + 1)
    Just idxM -> do
      A.modifyArray' countsT idxM (+ 1)
      A.modifyArray' sumT idxM (+ temp)
      A.modifyArray' minT idxM (min temp)
      A.modifyArray' maxT idxM (max temp)
      pure ((accMap, TemperatureData countsT sumT minT maxT), idx)
  parse newIdx (BS.drop 1 rest) newAcc

{-# INLINE parseNegTemp #-}
parseNegTemp :: BS.ByteString -> (Int, BS.ByteString)
parseNegTemp rest = case BS.uncons rest of
  Nothing -> (0, rest)
  Just (ch1, rest3) -> case BS.uncons rest3 of
    Nothing -> (0, rest3)
    Just ('.', rest4) -> case BS.uncons rest4 of
      Nothing -> (0, rest4)
      Just (ch2, rest5) -> ((-1) * (10 * ord ch1 + ord ch2 - 11 * ord '0'), rest5)
    Just (ch2, rest4) -> case BS.uncons rest4 of
      Nothing -> (0, rest4)
      -- Must be '.'
      Just (_, rest5) -> case BS.uncons rest5 of
        Nothing -> (0, rest5)
        Just (ch3, rest6) -> ((-1) * (100 * ord ch1 + 10 * ord ch2 + ord ch3 - 111 * ord '0'), rest6)

{-# INLINE parseTemp #-}
parseTemp :: Char -> BS.ByteString -> (Int, BS.ByteString)
parseTemp ch1 rest = case BS.uncons rest of
  Nothing -> (0, rest)
  Just ('.', rest4) -> case BS.uncons rest4 of
    Nothing -> (0, rest4)
    Just (ch2, rest5) -> (10 * ord ch1 + ord ch2 - 11 * ord '0', rest5)
  Just (ch2, rest4) -> case BS.uncons rest4 of
    Nothing -> (0, rest4)
    -- Must be '.'
    Just (_, rest5) -> case BS.uncons rest5 of
      Nothing -> (0, rest5)
      Just (ch3, rest6) -> (100 * ord ch1 + 10 * ord ch2 + ord ch3 - 111 * ord '0', rest6)

main :: IO ()
main = do
  args <- getArgs
  when (null args) $ T.perror "Error: no data file to read given! Exiting."
  content <- BS.readFile $ head args
  c <- A.newArray_ (0 :: Word64, 10000 :: Word64)
  s <- A.newArray_ (0 :: Word64, 10000 :: Word64)
  m <- A.newArray_ (0 :: Word64, 10000 :: Word64)
  n <- A.newArray_ (0 :: Word64, 10000 :: Word64)
  (ma, TemperatureData rC rS rM rN) <- parse 0 content (HM.empty, TemperatureData c s m n)
  let keys = sort (HM.keys ma)
  T.printf "{"
  traverse_
    ( \k ->
        do
          let i = HM.findWithDefault 0 k ma
          let name = TE.decodeUtf8 k
          count <- A.readArray rC i
          sum <- A.readArray rS i
          tMin <- A.readArray rM i
          tMax <- A.readArray rN i
          let mean = fromIntegral sum / fromIntegral count
          T.printf "%s=%.1f/%.1f/%.1f, " name (roundJava $ fromIntegral tMin) (roundJava mean) (roundJava $ fromIntegral tMax)
    )
    keys
  T.printf "}"

{-# INLINE roundJava #-}
roundJava :: Double -> Double
roundJava y =
  let
    r = fromIntegral (roundDouble y :: Int)
   in
    case y of
      x
        | x < 0.0 && r - x == 0.5 -> r / 10.0
        | abs (x - r) >= 0.5 -> (r + signum y) / 10.0
        | otherwise -> r / 10.0

Yeah, for me as a noob, seeing the progression from my naive slow 14LOC solution through 4 optimisations to a 14LOC solution which runs in 4s (10M rows) was very useful. It gave me a good sense of how to approach the next such problem and where the low hanging fruit is. The introduction of a 25LOC flatparse version and with it a ~1s solution is just within my bit wrangling threshold, but further than and I’d personally write it in C and use FFI. The optimal C version goes down to 0.15s – as in 15/100 seconds – for 1B rows, so wrangling has more purchase if pure performance is what is needed.

1 Like

Nothing useful to add, I definitely expected Haskell performance to be very close to Go, I’m quite surprised.

I was educated more than surprised :slight_smile: There is a whole bunch of other languages and their benchmarks here: Python, Dart, C#, Java, Javascript, C, C++, Rust, Zig, Scala, Erlang and Crystal, all have faster solutions than ours.

I don’t think its a big deal … The closer to the metal you can get, the faster it will go, but that isn’t Haskell’s selling point. I’m personally here for equational reasoning, an elegant type system and beautiful concise code.

The problem is the non-mutable hashmap.
Actually, as soon as the optimisation is “good enough”, this is a comparison of hashmap implementations - any “fast” version uses one tailored to the problem, as far as I’ve seen.

Now you are selling Haskell under it’s value. I seriously doubt there is a Python, Dart or JS Version that is comparable. And BEAM isn’t known to be performant (it has low latency!) either. For example the fastest Elixir version (1 billion rows, 10K stations) needs 28GB RAM, that’s an overhead of 13GB.

2 Likes

Just click the link and see for yourself :slight_smile: Here is a 45s Python version, here is a 17s DART version, and so on.

Here is a write up of how to get a 480s C version to 1.6s: it makes it easy to see how far a hashmap will get you.

You seem to be saying that the fastest Haskell version is ~1s for 10M rows, so ~100s for 1B rows, whereas there is a C version that is 0.15s on 1B rows, i.e. almost 1000x faster. That is very suprising to me, if it’s a like for like comparison (e.g. same level of parallelism), and suggests there is a significantly faster Haskell version waiting to be found.

The best code so far is @jaror and it isn’t parallel yet, but it completes the 1B rows in 80s by his report. 8 cores is the prescription for the challenge on specific hardware, if making it parallel was completely free, it would finish in 8s: so just two orders of magnitude slower than C, not three :slight_smile:

No, mine is still 2.4 times faster. And you can’t compare runtime at all. E.g. Jaror’s 80s are 240s on my computer, which is about half as “fast” as the official one (the one the Java and one Go version have been benchmarked).

Certainly not intended to doubt you, and only to say that yours and his builds may not be comparable, but has anyone else corroborated the relative timings? I can’t seem to run your code above – ghci says Data.Array.IO does not export modifyArray, and I note that for some reason his single threaded code takes 3 times longer to run on your machine than his: it isn’t likely that his processor is 3 times faster, and it is single threaded code.

This is a simple problem, where comparable solutions produce comparable times.

That’s why it’s of utmost importance that everybody uses the same input data (or at least the same input data generator) and compares the output with a reference solution. The least thing to do is run a time wc -l on the input data, control if there are 1,000,000,000 rows and post the time.

I’m going to add the Haskell part to the Go code when I have time, for now (this includes unnecessary packages):

ghcup’s GHC 9.6.4

dependencies:
  - base >= 4.7 && < 5

ghc-options:
  - -Wall
  - -Wcompat
  - -Widentities
  - -Wincomplete-record-updates
  - -Wincomplete-uni-patterns
  - -Wmissing-export-lists
  - -Wmissing-home-modules
  - -Wpartial-fields
  - -Wredundant-constraints
  - -XGHC2021
  - -O2
  - -threaded

executables:
  exe-exe:
    main: Main.hs
    source-dirs: app
    ghc-options:
      - -threaded
      - -rtsopts
      - -with-rtsopts=-N
    dependencies:
      - flatparse
      - text
      - bytestring
      - unordered-containers
      - mmap
      - array
      - linear
      - vector
      - vector-hashtables

and

resolver:
  url: https://raw.githubusercontent.com/commercialhaskell/stackage-snapshots/master/lts/22/11.yaml

system-ghc: true
1 Like

I had a brief look at the Core/STG and did a bit of TickyTicky profiling. The problem is that alter expression:

      liftIO $ alter m (\x -> Just $ case x of Nothing -> (V4 v v v 1); Just (V4 a b c d) -> V4 (min a v) (max b v) (c+v) (d+1)) k

I realised that this lambda accounts for ~20% of all allocations and got rid of it by doing Just $! case x of ...,

      liftIO $ alter m (\x -> Just $! case x of Nothing -> (V4 v v v 1); Just (V4 a b c d) -> V4 (min a v) (max b v) (c+v) (d+1)) k

Now it properly inlines.

Alas, that’s not enough: I see that $wonFound causes an enormous amount of allocations, because it allocates a closure and also because it is lazy in V4 Float and thus will allocate 4 boxed floats as well as a 4 word vector V4. It comes from the implementation of alter, which is unfortunately too lazy in the updated value:

alter
  :: ( MVector ks k, MVector vs v, DeleteEntry ks, DeleteEntry vs
     , PrimMonad m, Hashable k, Eq k
     )
  => Dictionary (PrimState m) ks k vs v -> (Maybe v -> Maybe v) -> k -> m ()
alter ht f k = do
  d@Dictionary{..} <- readMutVar . getDRef $ ht
  let
      hashCode' = hash k .&. mask
      targetBucket = hashCode' `rem` A.length buckets

      onFound' value' dict i = insertWithIndex targetBucket hashCode' k value' (getDRef ht) dict i
      -- ^ BAD BAD BAD not strict in value'
      onNothing' dict i = deleteWithIndex targetBucket hashCode' d k (-1) i

      onFound dict i = do
        d'@Dictionary{..} <- readMutVar . getDRef $ dict
        v <- value !~ i
        case f (Just v) of
          Nothing -> onNothing' d' i
          Just v' ->  onFound' v' d' i

      onNothing dict = do
        d' <- readMutVar . getDRef $ dict
        case f Nothing of
          Nothing -> return ()
          Just v' -> onFound' v' d' (-1)

  void $ atWithOrElse ht k onFound onNothing

{-# INLINE alter #-}

Rather than trying to fix alter and having to duct tape our way around this absurd amount of laziness, I would go ahead and implement insertWith, which would be something like this:

insertWith
  :: ( MVector ks k, MVector vs v, DeleteEntry ks, DeleteEntry vs
     , PrimMonad m, Hashable k, Eq k
     )
  => Dictionary (PrimState m) ks k vs v -> (v -> v -> v) -> k -> v -> m ()
insertWith ht f k v = do
  d@Dictionary{..} <- readMutVar . getDRef $ ht
  let
      hashCode' = hash k .&. mask
      targetBucket = hashCode' `rem` A.length buckets

      onFound dict i = do
        d'@Dictionary{..} <- readMutVar . getDRef $ dict
        v' <- value !~ i
        insertWithIndex targetBucket hashCode' k (f v' v) (getDRef ht) d' i

      onNothing dict = do
        d' <- readMutVar . getDRef $ dict
        insertWithIndex targetBucket hashCode' k v (getDRef ht) d' (-1)

  void $ atWithOrElse ht k onFound onNothing
{-# INLINE insertWith #-}

I went out of time before type-checking.

4 Likes

This sadly doesn’t change much. It seems to help a bit if I make that f v' v call strict. That together with some other changes (like making a custom V4 type such that the floats can be unpacked) got the time down to 0.58 seconds for 10m and 60.7 seconds for 1b rows. Here’s my current implementation: GitHub - noughtmare/1brc-hs

One thing I do notice when looking at the core if I disable inlining for that insertWith function is that it is duplicated 11 times in the parseLine function. Also the parseLine function is over 60000 lines of STG. That’s quite a lot!

1000000000 measurements-1b.txt

real    0m5,467s
user    0m0,186s
sys     0m4,316s

This is 3 times as fast too, so everything is consistent, wc -l is 17s (15.32s user 1.44s system 96% cpu 17.379 total) here.

May I ask you, what CPU do you use for the benchmarks?

I’m using an AMD 5600X. And perhaps also important is that I’m using a 7400 MB/s NVMe SSD (that is the advertised maximum read speed).

AMD 5600X
Base Clock 3.7GHz
Max. Boost Clock Up to 4.6GHz
L3 Cache 32MB
7400 MB/s NVMe SSD

I have:

Apple M1 Max Pro
Max. Clock 3206 MHz (it did run at that frequency, I checked that)
L3 Cache 48MB
SSD has a measured speed of 4629.7MB/s write 5180.3 MB/s read

There is no reason for your CPU to be 3times as fast, I’ve got to complain to Apple :smiley:

I’ve just done a test (because cross compiling is easy and fast with Go): the AMD64 1BRC binary runs about 100ms (7.0s vs 6.9s) faster in Rosetta than the native ARM64 binary. I guess GHC’s ARM code generation isn’t significantly better …

This is already great progress; total allocations went down from 5GB to 1.5GB.

I had another look at the ticky profile. The bulk of the remaining allocation is in $wparseLine, and there I see that it allocates a fresh ByteString on every entry. This appears to be due to flatparse's heavy use of unboxed sums and GHC’s current inability to unbox through those. But actually it appears that we don’t get around allocating that (4 word) ByteString, because we need to store it in the Hashtable as well, for some reason.

I also noticed that there are a lot of lets relating to updating Dictionarys. That made me realise that all the fields of Dictionary are lazy and thus not unboxed: https://hackage.haskell.org/package/vector-hashtables-0.1.1.4/docs/src/Data.Vector.Hashtables.Internal.html#Dictionary_. It is also a bit problematic that we can’t unbox Dictionary_ itself through the MutVar#; for a tight inner loop it would be preferable to get a hold of that inner Dictionary_ and update that instead of accessing it repeatedly through a ref cell.

So there is quite a lot that can be improved in vector-hashtables in order to optimise for constants. I’m pretty sure it would be worth doing so; the inner loop should not make any allocations, except perhaps for new ByteString slices.

The other high hitter in the profile is anyFloat. There we mainly allocate Floats, due to the same unboxed sum reasons we allocate ByteStrings in parseLine. We also seem to allocate an Int that is never used. Gah, we should equip Demand Analysis to reason through unboxed sums. flatparse provides ample motivation to do so. (TODO create issue.)

3 Likes

Could someone try my parser implementation from 1brc.hs · GitHub? I don’t see how flatparse or anything generic could go better.

data Entry = Entry
  { _station :: !Station
  , _temperature :: !Int
  } deriving (Show)

-- Bayawan;-21.1
-- Andranomenatsa;-1.2
-- Benton Harbor;36.2
-- Taulahā;0.6
parseLine :: ByteString -> Entry
parseLine xs = case x4 of
  W8# 59#Word8 -- ord ';'
    -> Entry (Station $ B.unsafeTake (l - 4) xs) (x3' * 10 + x1' - 528)
  W8# 45#Word8 -- ord '-'
    -> Entry (Station $ B.unsafeTake (l - 5) xs) (528 - x3' * 10 - x1')
  _ -> case x5 of
    W8# 59#Word8 -- ord ';'
      -> Entry (Station $ B.unsafeTake (l - 5) xs) (x4' * 100 + x3' * 10 + x1' - 5328)
    _ -- ord '-'
      -> Entry (Station $ B.unsafeTake (l - 6) xs) (5328 - x4' * 100 - x3' * 10 - x1')
  where
    l = B.length xs
    x1 = B.unsafeIndex xs (l - 1) -- last digit
    x3 = B.unsafeIndex xs (l - 3) -- another digit
    x4 = B.unsafeIndex xs (l - 4) -- digit or sign or semicolon
    x5 = B.unsafeIndex xs (l - 5) -- sign or semicolon

    x1' = fromIntegral x1
    x3' = fromIntegral x3
    x4' = fromIntegral x4

Parsing is not the bottleneck. This only improves my implementation to 59.2 seconds for 1 billion rows. Although it might make it easier to spot improvement opportunities in the Core.