One Billion Row challenge in Hs

I cannot get the C Macros to work with MacOS’ clang, CPP does not substitute (all?) occurrences of BYTE_INDEX.
The ouput of ghc -E looks as follows:

Summary
{-# LINE 1 "app/Main.hs" #-}
# 1 "app/Main.hs"
# 1 "<built-in>" 1
# 15 "<built-in>"
# 1 "/Users/roland/.ghcup/ghc/9.6.4/lib/ghc-9.6.4/lib/../lib/aarch64-osx-ghc-9.6.4/rts-1.0.2/include/ghcversion.h" 1


















# 16 "<built-in>" 2
# 1 "/var/folders/07/f6y7bm0x7n72vhtwd596514h0000gn/T/ghc84125_0/ghc_2.h" 1

























































































































































































































# 17 "<built-in>" 2
# 1 "app/Main.hs" 2
{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE Strict #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE UnboxedTuples #-}
{-# LANGUAGE ViewPatterns #-}
{-# OPTIONS_GHC -Wall
  -Wno-missing-signatures
  -Wno-name-shadowing #-}

{- cabal:
  build-depends: base >= 4.19, bytestring, mmap, async
  default-language: GHC2021
  ghc-options: -Wall -O2 -fllvm -rtsopts -threaded -split-sections
-}

-- ghc -O2 -fllvm -rtsopts -threaded -split-sections -ddump-simpl -dsuppress-all
-- -dno-suppress-type-signatures -ddump-to-file -fforce-recomp Main7.hs

-- CONFIGURATION
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

-- Run executable with "+RTS -NX" to set number of threads to X.

-- display output


-- should be power of 2, minimum 16384
-- #define TABLE_SIZE 131072

-- #define 65536 32768
-- #define 65536 16384

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

import Control.Concurrent
import Control.Concurrent.Async
import Control.Monad
import Data.Bits
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy.Char8 as LC8
import Foreign.Marshal.Alloc
import GHC.Exts
import GHC.IO
import GHC.Word
import System.IO.MMap


import Data.List
import Text.Printf
import System.IO hiding (withFile)


-- Random common functions
--------------------------------------------------------------------------------

fi :: (Integral a, Num b) => a -> b
fi = fromIntegral
{-# INLINE fi #-}

sl :: (Bits a) => a -> Int -> a
sl = unsafeShiftL

sr :: (Bits a) => a -> Int -> a
sr = unsafeShiftR

isrl :: Int -> Int -> Int
isrl (I# x) (I# y) = I# (uncheckedIShiftRL# x y)

max' :: Int -> Int -> Int
max' a b = let diff = a - b in a - (diff .&. sr diff 63)

min' :: Int -> Int -> Int
min' a b = let diff = a - b in b + (diff .&. sr diff 63)

plusAddr :: Addr# -> Int -> Addr#
plusAddr p (I# x) = plusAddr# p x

int2Addr :: Int -> Addr#
int2Addr (I# x) = int2Addr# x

addr2Int :: Addr# -> Int
addr2Int p = I# (addr2Int# p)

eqI :: Int -> Int -> Int
eqI (I# x) (I# y) = I# (x ==# y)

readI :: Addr# -> IO Int
readI p = IO \s -> case readIntOffAddr# p 0# s of (# s, x #) -> (# s, I# x #)

writeI :: Addr# -> Int -> IO ()
writeI p (I# x) = IO \s -> case writeIntOffAddr# p 0# x s of s -> (# s, () #)

-- Generic buffers
--------------------------------------------------------------------------------

data Buffer = Buffer {_ptr :: Addr#, len :: Int}

plus :: Buffer -> Int -> Buffer
plus (Buffer p l) (I# x) = Buffer (plusAddr# p x) (l - I# x)

memset :: Buffer -> Word8 -> IO ()
memset (Buffer p (I# l)) (W8# x) = IO \s ->
  case setAddrRange# p l (word2Int# (word8ToWord# x)) s of s -> (# s, () #)

withFile :: FilePath -> (Buffer -> IO a) -> IO a
withFile path k = mmapWithFilePtr path ReadOnly Nothing \(Ptr p, l) -> k (Buffer p l)
{-# INLINE withFile #-}

indexW8 :: Buffer -> Int -> Word8
indexW8 (Buffer p _) (I# x) = W8# (indexWord8OffAddr# p x)

indexW32 :: Buffer -> Int -> Word32
indexW32 (Buffer p _) (I# x) = W32# (indexWord32OffAddr# p x)

indexW :: Buffer -> Int -> Word
indexW (Buffer p _) (I# x) = W# (indexWordOffAddr# p x)

indexI :: Buffer -> Int -> Int
indexI (Buffer p _) (I# x) = I# (indexIntOffAddr# p x)

getW8 = (`indexW8` 0)
getW32 = (`indexW32` 0)
getW = (`indexW` 0)

instance Eq Buffer where
  Buffer p l == Buffer p' l' = l == l' && go p p' l
   where
    buf p = Buffer p l
    go p p' l
      | l >= 8 =
          getW (buf p) == getW (buf p') && go (plusAddr# p 8#) (plusAddr# p' 8#) (l - 8)
      | l >= 4 =
          getW32 (buf p) == getW32 (buf p') && go (plusAddr# p 4#) (plusAddr# p' 4#) (l - 4)
      | l == 0 = True
      | True =
          getW8 (buf p) == getW8 (buf p') && go (plusAddr# p 1#) (plusAddr# p' 1#) (l - 1)
  {-# INLINE (==) #-}

foldedMul :: Word -> Word -> Word
foldedMul (W# x) (W# y) = case timesWord2# x y of (# hi, lo #) -> W# (xor# hi lo)

salt :: Word
salt = 3032525626373534813

combine :: Word -> Word -> Word
combine x y = foldedMul (xor x y) 11400714819323198549

hashBuffer :: Buffer -> Word
hashBuffer p = go p salt
 where
  go p acc
    | len p >= 8 = go (plus p 8) (combine (getW p) acc)
    | len p >= 4 = go (plus p 4) (combine (fromIntegral (getW32 p)) acc)
    | len p == 0 = acc
    | otherwise = go (plus p 1) (combine (fromIntegral (getW8 p)) acc)

buildBuffer :: Buffer -> BB.Builder
buildBuffer b | len b == 0 = mempty
buildBuffer b = BB.word8 (getW8 b) <> buildBuffer (plus b 1)

-- printBuffer :: Buffer -> IO ()
-- printBuffer = BB.hPutBuilder stdout . buildBuffer

instance Show Buffer where
  show x =
    LC8.unpack $ BB.toLazyByteString $ buildBuffer x

instance Ord Buffer where
  compare x x' = compare (show x) (show x')

-- Short buffer
--------------------------------------------------------------------------------

-- Unboxed buffer containing at most 23 bytes. The first field is the length,
-- the rest is the payload. The 24-th byte in the payload is always zeroed out.
data ShortBuffer = ShortBuffer# Int Int Int Int

instance Eq ShortBuffer where
  ShortBuffer# _ a b c == ShortBuffer# _ a' b' c' =
    (eqI a a' .&. eqI b b' .&. eqI c c') == 1
  {-# INLINE (==) #-}

hashShortBuffer :: ShortBuffer -> Word
hashShortBuffer (ShortBuffer# _ a b c) =
  (salt `combine` fi a) `combine` (fi b `combine` fi c)

buildShortBuffer :: ShortBuffer -> BB.Builder
buildShortBuffer (ShortBuffer# l a b c) =
  BB.lazyByteString $
    LC8.take (fi l) $
      BB.toLazyByteString $
        BB.int64LE (fi c) <> BB.int64LE (fi b) <> BB.int64LE (fi a)

instance Show ShortBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildShortBuffer

instance Ord ShortBuffer where
  compare (ShortBuffer# _ a b c) (ShortBuffer# _ a' b' c') =
    let sw (I# x) = W# (byteSwap# (int2Word# x))
     in compare (sw c) (sw c') <> compare (sw b) (sw b') <> compare (sw a) (sw a')

-- Unboxed sum of short and standard buffers.
--------------------------------------------------------------------------------

data SLBuffer = SLB# Int Int Int

isEmptySLB :: SLBuffer -> Bool
isEmptySLB (SLB# a _ _) = a == 0

unpackSLB# :: SLBuffer -> (# ShortBuffer | Buffer #)
unpackSLB# (SLB# a b c) =
  let l = a .&. 255
   in if l <= 23
        then (# ShortBuffer# l (isrl a 8) b c | #)
        else (# | Buffer (int2Addr b) a #)

pattern ShortBuffer :: ShortBuffer -> SLBuffer
pattern ShortBuffer buf <- (unpackSLB# -> (# buf | #))
  where
    ShortBuffer (ShortBuffer# len a b c) = SLB# (sl a 8 .|. len) b c

pattern LongBuffer :: Buffer -> SLBuffer
pattern LongBuffer buf <- (unpackSLB# -> (# | buf #))
  where
    LongBuffer (Buffer p l) = SLB# l (addr2Int p) 0
{-# COMPLETE ShortBuffer, LongBuffer #-}

instance Eq SLBuffer where
  ShortBuffer b == ShortBuffer b' = b == b'
  LongBuffer b == LongBuffer b' = b == b'
  _ == _ = False
  {-# INLINE (==) #-}

-- Try to pack a Buffer into a short one.
packBuffer :: Buffer -> SLBuffer
packBuffer b =
  let
    l = len b
    ix = indexI b
    mask l = isrl (-1) (64 - sl l 3)
   in
    if l <= 8
      then ShortBuffer (ShortBuffer# l 0 0 (ix 0 .&. mask l))
      else
        if l <= 16
          then ShortBuffer (ShortBuffer# l 0 (ix 1 .&. mask (l - 8)) (ix 0))
          else
            if l <= 23
              then ShortBuffer (ShortBuffer# l (ix 2 .&. mask (l - 16)) (ix 1) (ix 0))
              else LongBuffer b

hashSLB :: SLBuffer -> Word
hashSLB (ShortBuffer b) = hashShortBuffer b
hashSLB (LongBuffer b) = hashBuffer b

buildSLB :: SLBuffer -> BB.Builder
buildSLB (ShortBuffer b) = buildShortBuffer b
buildSLB (LongBuffer b) = buildBuffer b

instance Show SLBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildSLB

instance Ord SLBuffer where
  compare (ShortBuffer b) (ShortBuffer b') = compare b b'
  compare b b' = compare (show b) (show b')

-- Branchless scanning for bytes in words.
--------------------------------------------------------------------------------



-- Given a hexadecimal byte, generate the (Word -> Int) function which returns the
-- index of the rightmost occurrence of the byte, or returns 8 if the byte does not
-- occur.




-- Hash table of measurements
--------------------------------------------------------------------------------

data Val = Val
  { _min :: Int
  , _max :: Int
  , _cnt :: Int
  , _total :: Int
  }

data Entry = Entry
  { _key :: {-# UNPACK #-} SLBuffer
  , _val :: {-# UNPACK #-} Val
  }

-- size of entry in bytes (includes padding to 64 bytes!)
entrySize :: Int
entrySize = 8 * 8

tableMask :: Int
tableMask = 65536 - 1

tableBytes :: Int
tableBytes = 65536 * entrySize

type Table = Addr#

initTables :: [Buffer] -> ([(Buffer, Ptr Word8)] -> IO a) -> IO a
initTables bs f = do
  let l = 65536 * entrySize
  let go [] acc = f acc
      go (b : bs) acc = allocaBytesAligned l entrySize \p@(Ptr p') -> do
        memset (Buffer p' l) 0
        go bs ((b, p) : acc)
  go bs []

-- read entry from a *byte* offset
readEntry :: Table -> Int -> IO Entry
readEntry p i = case plusAddr p i of
  p -> do
    a <- readI p
    b <- readI (plusAddr p 8)
    c <- readI (plusAddr p 16)
    d <- readI (plusAddr p 24)
    e <- readI (plusAddr p 32)
    f <- readI (plusAddr p 40)
    g <- readI (plusAddr p 48)
    pure $ Entry (SLB# a b c) (Val d e f g)

-- write entry to a *byte* offset
writeEntry :: Table -> Int -> Entry -> IO ()
writeEntry p i (Entry (SLB# a b c) (Val d e f g)) = case plusAddr p i of
  p -> do
    writeI p a
    writeI (plusAddr p 8) b
    writeI (plusAddr p 16) c
    writeI (plusAddr p 24) d
    writeI (plusAddr p 32) e
    writeI (plusAddr p 40) f
    writeI (plusAddr p 48) g

newVal :: Int -> Val
newVal temp = Val temp temp 1 temp

updateEntry :: Entry -> Val -> Entry
updateEntry (Entry k (Val mi ma cn to)) (Val mi' ma' cn' to') =
  Entry k (Val (min' mi mi') (max' ma ma') (cn + cn') (to + to'))

forTable :: Table -> (Entry -> IO ()) -> IO ()
forTable t f = do
  let go ix | ix == tableBytes = pure ()
      go ix = do
        e@(Entry k _) <- readEntry t ix
        if isEmptySLB k
          then do
            go (ix + entrySize)
          else do
            f e
            go (ix + entrySize)
  go 0
{-# INLINE forTable #-}

updateTable :: Table -> Entry -> IO ()
updateTable tbl e@(Entry key val) = do
  let go ix | ix == tableBytes = go 0
      go ix = do
        olde@(Entry oldkey _) <- readEntry tbl ix
        if isEmptySLB oldkey
          then do
            writeEntry tbl ix e
          else
            if key == oldkey
              then do
                writeEntry tbl ix (updateEntry olde val)
              else do
                go (ix + entrySize)
  go ((fi (hashSLB key) .&. tableMask) * entrySize)

parse :: Table -> Buffer -> IO ()
parse _ b | len b == 0 = do
  pure ()
parse tbl b = do
  -- scan for semicolon
  let
    findSemi :: Int -> Buffer -> Int
    findSemi i b = case BYTE_INDEX (3 B) (getW b) of
      8 -> findSemi (i + 8) (plus b 8)
      i' -> i + i'
  let keylen = findSemi 0 b

  let key = packBuffer $ b{len = keylen}
  b <- pure $ plus b (keylen + 1)

  let
    digit :: Word8 -> Int
    digit x = fi x - 48

  let
    join :: Buffer -> Int -> IO ()
    join b temp = do
      updateTable tbl (Entry key (newVal temp))
      parse tbl b

  case getW8 b of
    -- '-'
    45 -> do
      let d1 = getW8 (plus b 1)
      case getW8 (plus b 2) of
        -- '.' so the next must be digit
        46 -> do
          let d2 = getW8 (plus b 3)
          join (plus b 5) ((-10) * (digit d1) - digit d2)

        -- digit, so the next must be '.' and then digit
        d2 -> do
          let d3 = getW8 (plus b 4)
          join (plus b 6) ((-100) * (digit d1) - 10 * (digit d2) - digit d3)

    -- a digit
    d1 -> case getW8 (plus b 1) of
      -- '.', so the next must be digit
      46 -> do
        let d2 = getW8 (plus b 2)
        join (plus b 4) (10 * digit d1 + digit d2)

      -- another digit, so the next must be '.', and then digit
      d2 -> do
        let d3 = getW8 (plus b 3)
        join (plus b 5) (100 * digit d1 + 10 * digit d2 + digit d3)

-- Split file to THREAD_NUM buffers
--------------------------------------------------------------------------------

splitBuffer :: Int -> Buffer -> [Buffer]
splitBuffer num_threads b =
  let
    chunkSize = div (len b) num_threads

    go b
      | len b <= chunkSize =
          [b]
    go b =
      let
        findNewl i b = case BYTE_INDEX (0 A) (getW b) of
          8 -> findNewl (i + 8) (plus b 8)
          i' -> i + i'
        keylen = findNewl 0 (plus b chunkSize)
        chunkSize' = chunkSize + keylen + 1
        rest = go (plus b chunkSize')
       in
        Buffer (_ptr b) chunkSize' : rest
   in
    go b


tableToList :: Table -> IO [Entry]
tableToList tbl = do
  let go ix | ix == tableBytes = pure []
      go ix = do
        e@(Entry k _) <- readEntry tbl ix
        if isEmptySLB k then do
          go (ix + entrySize)
        else do
          es <- go (ix + entrySize)
          pure (e:es)
  go 0

displayEntries :: [Entry] -> BB.Builder
displayEntries es = BB.char8 '{' <> go es <> BB.char8 '}' where

  f $$! x = f x; infixl 8 $$!

  goEntry (Entry key (Val mi ma cn to)) =
    buildSLB key <>
    BB.string8
      (printf "=%.1f/%.1f/%.1f" $$!
          (fi mi / 10 :: Double) $$!
          (fi to / (fi cn * 10) :: Double) $$!
          (fi ma / 10 :: Double))

  go []     = mempty
  go [e]    = goEntry e
  go (e:es) = goEntry e <> BB.string8 ", " <> go es


main :: IO ()
main =
  withFile "data/measurements.txt" \b -> do
    num_threads <- getNumCapabilities
    initTables (splitBuffer num_threads b) \bts -> do
      Ptr tbl : ts <- mapConcurrently (\(b, Ptr t) -> Ptr t <$ parse t b) bts
      forM_ ts \(Ptr tbl') ->
        forTable tbl' \e ->
          updateTable tbl e

      es <- sortBy (\e e' -> compare (_key e) (_key e')) <$> tableToList tbl
      BB.hPutBuilder stdout (displayEntries es)
      putChar '\n'

WTF? This is with GHC 9.6.4, but with 9.8.2 (which is used when actually compiling the program) it is correctly preprocessed?

ghc-9.8 -E:

Summary
{-# LINE 1 "app/Main.hs" #-}
# 1 "app/Main.hs"
# 1 "<built-in>" 1
# 16 "<built-in>"
# 1 "/Users/roland/.ghcup/ghc/9.8.2/lib/ghc-9.8.2/lib/../lib/aarch64-osx-ghc-9.8.2/rts-1.0.2/include/ghcversion.h" 1


















# 17 "<built-in>" 2
# 1 "/var/folders/07/f6y7bm0x7n72vhtwd596514h0000gn/T/ghc90157_0/ghc_2.h" 1

























































































































































































































# 18 "<built-in>" 2
# 1 "app/Main.hs" 2

{-# language
    BlockArguments
  , CPP
  , LambdaCase
  , MagicHash
  , PatternSynonyms
  , Strict
  , TypeApplications
  , UnboxedTuples
  , ViewPatterns
  #-}

{-# options_ghc
  -Wall
  -Wno-missing-signatures
  -Wno-name-shadowing
  #-}

{- cabal:
  build-depends: base >= 4.19, bytestring, mmap, async
  default-language: GHC2021
  ghc-options: -Wall -O2 -fllvm -rtsopts -threaded -split-sections
-}

-- ghc -O2 -fllvm -rtsopts -threaded -split-sections -ddump-simpl -dsuppress-all
-- -dno-suppress-type-signatures -ddump-to-file -fforce-recomp Main7.hs

-- CONFIGURATION
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

-- Run executable with "+RTS -NX" to set number of threads to X.

-- display output


-- should be power of 2, minimum 16384
-- #define TABLE_SIZE 131072

-- #define 65536 32768
-- #define 65536 16384

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

import Control.Concurrent
import Control.Concurrent.Async
import Control.Monad
import Data.Bits
import Foreign.Marshal.Alloc
import GHC.Exts
import GHC.IO
import GHC.Word
import System.IO.MMap
import qualified Data.ByteString.Builder    as BB
import qualified Data.ByteString.Lazy.Char8 as LC8


import Data.List
import Text.Printf
import System.IO hiding (withFile)



-- Random common functions
--------------------------------------------------------------------------------

fi :: (Integral a, Num b) => a -> b
fi = fromIntegral; {-# inline fi #-}

sl :: Bits a => a -> Int -> a
sl = unsafeShiftL

sr :: Bits a => a -> Int -> a
sr = unsafeShiftR

isrl :: Int -> Int -> Int
isrl (I# x) (I# y) = I# (uncheckedIShiftRL# x y)

max' :: Int -> Int -> Int
max' a b = let diff = a - b in a - (diff .&. sr diff 63)

min' :: Int -> Int -> Int
min' a b = let diff = a - b in b + (diff .&. sr diff 63)

plusAddr :: Addr# -> Int -> Addr#
plusAddr p (I# x) = plusAddr# p x

int2Addr :: Int -> Addr#
int2Addr (I# x) = int2Addr# x

addr2Int :: Addr# -> Int
addr2Int p = I# (addr2Int# p)

eqI :: Int -> Int -> Int
eqI (I# x) (I# y) = I# (x ==# y)

readI :: Addr# -> IO Int
readI p = IO \s -> case readIntOffAddr# p 0# s of (# s, x #) -> (# s, I# x #)

writeI :: Addr# -> Int -> IO ()
writeI p (I# x) = IO \s -> case writeIntOffAddr# p 0# x s of s -> (# s, () #)

-- Generic buffers
--------------------------------------------------------------------------------

data Buffer = Buffer {_ptr :: Addr#, len :: Int}

plus :: Buffer -> Int -> Buffer
plus (Buffer p l) (I# x) = Buffer (plusAddr# p x) (l - I# x)

memset :: Buffer -> Word8 -> IO ()
memset (Buffer p (I# l)) (W8# x) = IO \s ->
  case setAddrRange# p l (word2Int# (word8ToWord# x)) s of s -> (# s, () #)

withFile :: FilePath -> (Buffer -> IO a) -> IO a
withFile path k = mmapWithFilePtr path ReadOnly Nothing \(Ptr p, l) -> k (Buffer p l)
{-# inline withFile #-}

indexW8 :: Buffer -> Int -> Word8
indexW8 (Buffer p _) (I# x) = W8# (indexWord8OffAddr# p x)

indexW32 :: Buffer -> Int -> Word32
indexW32 (Buffer p _) (I# x) = W32# (indexWord32OffAddr# p x)

indexW :: Buffer -> Int -> Word
indexW (Buffer p _) (I# x) = W# (indexWordOffAddr# p x)

indexI :: Buffer -> Int -> Int
indexI (Buffer p _) (I# x) = I# (indexIntOffAddr# p x)

getW8  = (`indexW8` 0)
getW32 = (`indexW32` 0)
getW   = (`indexW` 0)

instance Eq Buffer where
  Buffer p l == Buffer p' l' = l == l' && go p p' l where
    buf p = Buffer p l
    go p p' l
      | l >= 8 = getW   (buf p) == getW   (buf p') && go (plusAddr# p 8#) (plusAddr# p' 8#) (l - 8)
      | l >= 4 = getW32 (buf p) == getW32 (buf p') && go (plusAddr# p 4#) (plusAddr# p' 4#) (l - 4)
      | l == 0 = True
      | True   = getW8  (buf p) == getW8  (buf p') && go (plusAddr# p 1#) (plusAddr# p' 1#) (l - 1)
  {-# inline (==) #-}

foldedMul :: Word -> Word -> Word
foldedMul (W# x) (W# y) = case timesWord2# x y of (# hi, lo #) -> W# (xor# hi lo)

salt :: Word
salt = 3032525626373534813

combine :: Word -> Word -> Word
combine x y = foldedMul (xor x y) 11400714819323198549

hashBuffer :: Buffer -> Word
hashBuffer p = go p salt where
  go p acc
    | len p >= 8 = go (plus p 8) (combine (getW p) acc)
    | len p >= 4 = go (plus p 4) (combine (fromIntegral (getW32 p)) acc)
    | len p == 0 = acc
    | otherwise  = go (plus p 1) (combine (fromIntegral (getW8 p)) acc)

buildBuffer :: Buffer -> BB.Builder
buildBuffer b | len b == 0 = mempty
buildBuffer b = BB.word8 (getW8 b) <> buildBuffer (plus b 1)

-- printBuffer :: Buffer -> IO ()
-- printBuffer = BB.hPutBuilder stdout . buildBuffer

instance Show Buffer where
  show x =
    LC8.unpack $ BB.toLazyByteString $ buildBuffer x

instance Ord Buffer where
  compare x x' = compare (show x) (show x')

-- Short buffer
--------------------------------------------------------------------------------

-- Unboxed buffer containing at most 23 bytes. The first field is the length,
-- the rest is the payload. The 24-th byte in the payload is always zeroed out.
data ShortBuffer = ShortBuffer# Int Int Int Int

instance Eq ShortBuffer where
  ShortBuffer# _ a b c == ShortBuffer# _ a' b' c' =
    (eqI a a' .&. eqI b b' .&. eqI c c') == 1
  {-# inline (==) #-}

hashShortBuffer :: ShortBuffer -> Word
hashShortBuffer (ShortBuffer# _ a b c) =
  (salt `combine` fi a) `combine` (fi b `combine` fi c)

buildShortBuffer :: ShortBuffer -> BB.Builder
buildShortBuffer (ShortBuffer# l a b c) =
  BB.lazyByteString $ LC8.take (fi l) $ BB.toLazyByteString $
  BB.int64LE (fi c) <> BB.int64LE (fi b) <> BB.int64LE (fi a)

instance Show ShortBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildShortBuffer

instance Ord ShortBuffer where
  compare (ShortBuffer# _ a b c) (ShortBuffer# _ a' b' c') =
    let sw (I# x) = W# (byteSwap# (int2Word# x))
    in compare (sw c) (sw c') <> compare (sw b) (sw b') <> compare (sw a) (sw a')

-- Unboxed sum of short and standard buffers.
--------------------------------------------------------------------------------

data SLBuffer = SLB# Int Int Int

isEmptySLB :: SLBuffer -> Bool
isEmptySLB (SLB# a _ _) = a == 0

unpackSLB# :: SLBuffer -> (# ShortBuffer | Buffer #)
unpackSLB# (SLB# a b c) =
  let l = a .&. 255 in
  if l <= 23 then (# ShortBuffer# l (isrl a 8) b c | #)
             else (# | Buffer (int2Addr b) a #)

pattern ShortBuffer :: ShortBuffer -> SLBuffer
pattern ShortBuffer buf <- (unpackSLB# -> (# buf | #)) where
  ShortBuffer (ShortBuffer# len a b c) = SLB# (sl a 8 .|. len) b c

pattern LongBuffer :: Buffer -> SLBuffer
pattern LongBuffer buf <- (unpackSLB# -> (# | buf #)) where
  LongBuffer (Buffer p l) = SLB# l (addr2Int p) 0
{-# complete ShortBuffer, LongBuffer #-}

instance Eq SLBuffer where
  ShortBuffer b == ShortBuffer b' = b == b'
  LongBuffer b  == LongBuffer b'  = b == b'
  _             == _              = False
  {-# inline (==) #-}

-- Try to pack a Buffer into a short one.
packBuffer :: Buffer -> SLBuffer
packBuffer b =
  let l      = len b
      ix     = indexI b
      mask l = isrl (-1) (64 - sl l 3) in
  if      l <= 8  then ShortBuffer (ShortBuffer# l 0 0 (ix 0 .&. mask l))
  else if l <= 16 then ShortBuffer (ShortBuffer# l 0 (ix 1 .&. mask (l - 8)) (ix 0))
  else if l <= 23 then ShortBuffer (ShortBuffer# l (ix 2 .&. mask (l - 16)) (ix 1) (ix 0))
  else LongBuffer b

hashSLB :: SLBuffer -> Word
hashSLB (ShortBuffer b) = hashShortBuffer b
hashSLB (LongBuffer b)  = hashBuffer b

buildSLB :: SLBuffer -> BB.Builder
buildSLB (ShortBuffer b) = buildShortBuffer b
buildSLB (LongBuffer b)  = buildBuffer b

instance Show SLBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildSLB

instance Ord SLBuffer where
  compare (ShortBuffer b) (ShortBuffer b') = compare b b'
  compare b               b'               = compare (show b) (show b')

-- Branchless scanning for bytes in words.
--------------------------------------------------------------------------------



-- Given a hexadecimal byte, generate the (Word -> Int) function which returns the
-- index of the rightmost occurrence of the byte, or returns 8 if the byte does not
-- occur.





-- Hash table of measurements
--------------------------------------------------------------------------------

data Val = Val {
    _min     :: Int
  , _max     :: Int
  , _cnt     :: Int
  , _total   :: Int
  }

data Entry = Entry {
    _key     :: {-# unpack #-} SLBuffer
  , _val     :: {-# unpack #-} Val
  }

-- size of entry in bytes (includes padding to 64 bytes!)
entrySize :: Int
entrySize = 8 * 8

tableMask :: Int
tableMask = 65536 - 1

tableBytes :: Int
tableBytes = 65536 * entrySize

type Table = Addr#

initTables :: [Buffer] -> ([(Buffer, Ptr Word8)] -> IO a) -> IO a
initTables bs f = do
  let l = 65536 * entrySize
  let go []     acc = f acc
      go (b:bs) acc = allocaBytesAligned l entrySize \p@(Ptr p') -> do
        memset (Buffer p' l) 0
        go bs ((b, p):acc)
  go bs []

-- read entry from a *byte* offset
readEntry :: Table -> Int -> IO Entry
readEntry p i = case plusAddr p i of
  p -> do
    a <- readI p
    b <- readI (plusAddr p 8)
    c <- readI (plusAddr p 16)
    d <- readI (plusAddr p 24)
    e <- readI (plusAddr p 32)
    f <- readI (plusAddr p 40)
    g <- readI (plusAddr p 48)
    pure $ Entry (SLB# a b c) (Val d e f g)

-- write entry to a *byte* offset
writeEntry :: Table -> Int -> Entry -> IO ()
writeEntry p i (Entry (SLB# a b c) (Val d e f g)) = case plusAddr p i of
  p -> do
    writeI p               a
    writeI (plusAddr p 8)  b
    writeI (plusAddr p 16) c
    writeI (plusAddr p 24) d
    writeI (plusAddr p 32) e
    writeI (plusAddr p 40) f
    writeI (plusAddr p 48) g

newVal :: Int -> Val
newVal temp = Val temp temp 1 temp

updateEntry :: Entry -> Val -> Entry
updateEntry (Entry k (Val mi ma cn to)) (Val mi' ma' cn' to')
  = Entry k (Val (min' mi mi') (max' ma ma') (cn + cn') (to + to'))

forTable :: Table -> (Entry -> IO ()) -> IO ()
forTable t f = do
  let go ix | ix == tableBytes = pure ()
      go ix = do
        e@(Entry k _) <- readEntry t ix
        if isEmptySLB k then do
          go (ix + entrySize)
        else do
          f e
          go (ix + entrySize)
  go 0
{-# inline forTable #-}

updateTable :: Table -> Entry -> IO ()
updateTable tbl e@(Entry key val) = do
  let go ix | ix == tableBytes = go 0
      go ix = do
        olde@(Entry oldkey _) <- readEntry tbl ix
        if isEmptySLB oldkey then do
          writeEntry tbl ix e
        else if key == oldkey then do
          writeEntry tbl ix (updateEntry olde val)
        else do
          go (ix + entrySize)
  go ((fi (hashSLB key) .&. tableMask) * entrySize)

parse :: Table -> Buffer -> IO ()
parse _   b | len b == 0 = do
  pure ()
parse tbl b = do

-- scan for semicolon
  let findSemi :: Int -> Buffer -> Int
      findSemi i b = case (\(x :: Word) -> case xor x 0x 3B 3B 3B 3B 3B 3B 3B 3B of x -> case (x - 0x0101010101010101) .&. complement x .&. 0x8080808080808080 of x -> countTrailingZeros x `sr` 3) (getW b) of
        8  -> findSemi (i + 8) (plus b 8)
        i' -> i + i'
  let keylen = findSemi 0 b

  let key = packBuffer $ b {len = keylen}
  b <- pure $ plus b (keylen + 1)

  let digit :: Word8 -> Int
      digit x = fi x - 48

  let join :: Buffer -> Int -> IO ()
      join b temp = do
        updateTable tbl (Entry key (newVal temp))
        parse tbl b

  case getW8 b of
    -- '-'
    45 -> do
      let d1 = getW8 (plus b 1)
      case getW8 (plus b 2) of
        -- '.' so the next must be digit
        46 -> do
          let d2 = getW8 (plus b 3)
          join (plus b 5) ((-10)*(digit d1) - digit d2)

        -- digit, so the next must be '.' and then digit
        d2 -> do
          let d3 = getW8 (plus b 4)
          join (plus b 6) ((-100)*(digit d1) - 10*(digit d2) - digit d3)

    -- a digit
    d1 -> case getW8 (plus b 1) of
      -- '.', so the next must be digit
      46 -> do
        let d2 = getW8 (plus b 2)
        join (plus b 4) (10*digit d1 + digit d2)

      -- another digit, so the next must be '.', and then digit
      d2 -> do
        let d3 = getW8 (plus b 3)
        join (plus b 5) (100*digit d1 + 10*digit d2 + digit d3)

-- Split file to THREAD_NUM buffers
--------------------------------------------------------------------------------

splitBuffer :: Int -> Buffer -> [Buffer]
splitBuffer num_threads b = let
  chunkSize = div (len b) num_threads

  go b | len b <= chunkSize =
    [b]
  go b = let
    findNewl i b = case (\(x :: Word) -> case xor x 0x 0A 0A 0A 0A 0A 0A 0A 0A of x -> case (x - 0x0101010101010101) .&. complement x .&. 0x8080808080808080 of x -> countTrailingZeros x `sr` 3) (getW b) of
      8  -> findNewl (i + 8) (plus b 8)
      i' -> i + i'
    keylen     = findNewl 0 (plus b chunkSize)
    chunkSize' = chunkSize + keylen + 1
    rest       = go (plus b chunkSize')
    in
    Buffer (_ptr b) chunkSize' : rest
  in go b


tableToList :: Table -> IO [Entry]
tableToList tbl = do
  let go ix | ix == tableBytes = pure []
      go ix = do
        e@(Entry k _) <- readEntry tbl ix
        if isEmptySLB k then do
          go (ix + entrySize)
        else do
          es <- go (ix + entrySize)
          pure (e:es)
  go 0

displayEntries :: [Entry] -> BB.Builder
displayEntries es = BB.char8 '{' <> go es <> BB.char8 '}' where

  f $$! x = f x; infixl 8 $$!

  goEntry (Entry key (Val mi ma cn to)) =
    buildSLB key <>
    BB.string8
      (printf "=%.1f/%.1f/%.1f" $$!
          (fi mi / 10 :: Double) $$!
          (fi to / (fi cn * 10) :: Double) $$!
          (fi ma / 10 :: Double))

  go []     = mempty
  go [e]    = goEntry e
  go (e:es) = goEntry e <> BB.string8 ", " <> go es


main :: IO ()
main =
  withFile "data/measurements.txt" \b -> do
    num_threads <- getNumCapabilities
    initTables (splitBuffer num_threads b) \bts -> do
      Ptr tbl:ts <- mapConcurrently (\(b, Ptr t) -> Ptr t <$ parse t b) bts
      forM_ ts \(Ptr tbl') ->
        forTable tbl' \e ->
          updateTable tbl e

      es <- sortBy (\e e' -> compare (_key e) (_key e')) <$> tableToList tbl
      BB.hPutBuilder stdout (displayEntries es)
      putChar '\n'


Threadscope shows me it runs the two expensive functions on the same capability:

I guess that could also cause your

Inserting yields does not help, not even if I insert it every 1000 iterations in the parse function.

What does seem to help is to define my own mapConcurrently like this:

mapConcurrently :: (a -> IO b) -> [a] -> IO [b]
mapConcurrently f xs = do
  vs <- for xs \x -> do
    v <- newEmptyMVar
    v <$ forkIO do
      yield
      y <- f x
      putMVar v y
  for vs takeMVar

Then the results are more sensible:

$ for i in {1..12}; do echo "----------------"; echo "-N$i"; time cabal run --ghc-options="-threaded -fllvm -rtsopts" -O2 1brc.hs -- +RTS -N$i -RTS >/dev/null; done
----------------
-N1

real	0m18,624s
user	0m17,886s
sys	0m0,750s
----------------
-N2

real	0m9,484s
user	0m17,847s
sys	0m0,748s
----------------
-N3

real	0m6,474s
user	0m17,908s
sys	0m0,755s
----------------
-N4

real	0m4,954s
user	0m17,770s
sys	0m0,781s
----------------
-N5

real	0m4,014s
user	0m17,676s
sys	0m0,793s
----------------
-N6

real	0m3,484s
user	0m17,935s
sys	0m0,802s
----------------
-N7

real	0m3,904s
user	0m19,959s
sys	0m0,842s
----------------
-N8

real	0m3,474s
user	0m21,041s
sys	0m0,855s
----------------
-N9

real	0m3,154s
user	0m22,307s
sys	0m0,826s
----------------
-N10

real	0m3,084s
user	0m24,076s
sys	0m0,862s
----------------
-N11

real	0m2,934s
user	0m25,998s
sys	0m0,918s
----------------
-N12

real	0m2,774s
user	0m27,884s
sys	0m0,932s

One thing to notice is that the user time starts increasing from -N7 onwards, that must be because of hyperthreading.

Additionally, threadscope shows hyperthreading causes more variations in task duration, so perhaps having exactly the same number of tasks as (hyper)threads is not optimal. A solution would be to split the tasks up into smaller parts, but unfortunately the pre and post processing of the tasks is not very well optimized so the overhead is bigger than the potential gains at the moment.

However, even if there was no overhead before and after the main parsing tasks, I doubt we could get below 2 seconds on my machine. I think the last mile in comparison to that fastest C solution is the SIMD instructions which could in theory give another 4x speed up by moving from 64 bits at a time to 256 bits at a time (and that would actually bring this solution very close to the time of that C solution).

3 Likes

On my machine I need forkOn to get rid of the performance variation:

mapConcurrently :: (a -> IO b) -> [a] -> IO [b]
mapConcurrently f xs = do
  caps <- getNumCapabilities
  unless (caps == length xs) $ error "wrong number of capabilities"
  vs <- forM (zip [0..] xs) \(i, x) -> do
    v <- newEmptyMVar
    v <$ forkOn i do
      yield
      y <- f x
      putMVar v y
  forM vs takeMVar

Updated my gist with this. I played around with table sizes and entry sizes (aligning on 8 words vs. minimizing entry size to 5 words), but there’s not a lot of difference. Sticking to pure Haskell, we could still compress the hashtable significantly, but I expect that SIMD would be needed for the big gains. And the SIMD support in GHC.Prim looks insufficient, so it looks like we can’t do much better in pure Haskell.

Thank you. Interesting to know that 64bit hash doesn’t help much on your machine. I just checked it on my old iMac (2.8 GHz Intel Core i7) and it’s actually 10-20% slower than a previous byte-by-byte version.

I also suspect that a combination of LLVM version and CPU model matters a lot.

I’m not sure why you’d want to test it with more than 10 threads if you have 10 cores, but it’s good to know that Haskell RTS doesn’t struggle with 200 threads (unfortunately, it does struggle when processing and reading are done in separate threads).

I’ve tried it on a beefier machine and it has more or less the same performance as mine version (with your hash :slight_smile: ):

    -- single threaded 
    User time (seconds): 13.26
    System time (seconds): 0.67
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.93
    -- -N8
    User time (seconds): 15.52
    System time (seconds): 0.66
    Percent of CPU this job got: 719%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.25
    -- -N24
    User time (seconds): 19.23
    System time (seconds): 0.86
    Percent of CPU this job got: 1636% -- not 2400%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.22 -- the fastest measurement, can be 1.5-1.6s  

On a larger number of threads (-N24) mine is slightly faster due to read working faster than mmap (and maybe RTS issues). My numbers for comparison:

    -- single threaded
    User time (seconds): 12.43
    System time (seconds): 1.33 -- higher due to 'read' instead of 'mmap'
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:13.76
    -- -N8
    User time (seconds): 15.66
    System time (seconds): 1.59
    Percent of CPU this job got: 773%
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.23
    -- -N24
    User time (seconds): 20.02
    System time (seconds): 2.20
    Percent of CPU this job got: 2216%  -- not 2400% too
    Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.00 -- fastest, can be 1.2-1.5s

I found the vectorization to be very flaky. When I tried to change some Ints to Words in your hash it could become a bit slower. Probably that’s what happens with your code (13.2s vs 12.4s, I’ve seen similar slowdowns).

I don’t see much change after adding -XStrict. perf says that instructions count goes from 186B to 192B so it does something unnecessary.

<strictness rant>

In my experience, I have never seen -XStrict lead to performance improvements (perhaps because I tried it on the already optimised code). It’s the same with too many bang patterns – they prevent the GHC optimiser from reordering the code in the most performant way and usually slow things down. You may notice that there are not many bangs in my code (although there is a -fspec-constr-count=100), in fact you can remove them all and it will work fine.

Haskell laziness is the key to performance. Especially in large programs. Yes, there are cases where some lazy thunks can accumulate, but there are many cases where unnecessary things just aren’t evaluated at all.

When comparing bulk performance (you have to parse 1B lines and do these computations), C might win, but once the algorithm becomes less trivial, the ability to automatically skip unnecessary computations helps a lot (e.g. I’ve seen cases where a Haskell program was unexpectedly 10-100 faster than Java just because debugging information wasn’t evaluated, while Java kept “logging”).

My general experience with Haskell is that it has the best performance/effort ratio. You can get reasonable performance (say 2 times slower than optimized C) in a fraction of the time. By the time a basic solution is implemented in C/Java, the Haskell version will already be optimised.

Yes, if you need low latency and/or extremely high performance throughout the system, you may need to look elsewhere. But these are niche tasks. Most users won’t notice a GC pause that takes less than an average network round trip (especially if the program is generally fast enough), and many performance problems can be solved simply by using more cores. Or with a better algorithm – and Haskell shines here thanks to its expressiveness.

7 Likes

Removing -XStrict can change a program with constant space usage into one with linear space usage! That has a knock-on effect on run time too, because the GC has more to traverse.

Do you have an example you can share that shows the benefit of avoiding -XStrict due to freedom to reorder? I’ve never seen one.

(At Groq we have at least one example where we make a data field lazy so that we don’t waste evaluating it if it is not used. That gives a speed boost at the cost of space usage.)

1 Like

@AndrasKovacs @vshabanov: if you’re comparing slots to determine whether keys match in your hash table routine, I’ve noticed roughly a 10% improvement in my solution by reversing both the query and slot keys prior to comparison since suffixes are less likely match than prefixes are.

I’ve also noted roughly a 10% improvement by just repeatedly hashing values to determine new slots (i.e. hash the hash) rather than linear probing since the former leads to less sequential allocations and therefore fewer hops overall.

That may help get us up to par with C.

1 Like

I chose linear probing because it should cause less cache misses. Hashing the hash means you’ll probably get a cache miss for every collision, while linear probing should only usually give you a single miss even if you get a few collisions.

But it would be good to test that hypothesis for this program.

FNV1-a hash causes about 75M collisions in 1B records when using linear probing. DJB2 is similar. Give re-hashing a shot. As a worse solution but better than just linear probing, you could also take mod 2 of the first key character to decide whether to step -1/+1. That will also reduce scanning a little bit.

This is almost -25% on my Computer (from 4.5s down to 3.5s).

1 Like

This is already incredibly fast! And it does not seem to allocate much at all.

I just ran perf record --call-graph on the binary (I passed -g3 to have debug symbols and compiled without LLVM). Just some data:

  • 10% of time is spent in findSemi; 30% in parse, 60% in updateTable.
  • perf stat reports that 46% of branches were mispredicted
  • Within $wupdateTable, there is the isEmptySLB check that appears to be responsible for these branch mispredictions. At least that’s the impression when I “Annotate” updateTable; there is a red 0-test test %r14, %r14 just before what appears to be an inlined call to readEntry. I think this implies that there are a lot of collisions, because if there weren’t, we would very often have that isEmptySLB is True. Alas, the code generator lays out the if in a way that the collision case comes first! I’m not sure how we convince it not to do that.
2 Likes

isEmptySLB should be almost always false, regardless of collisions, because we only have 8800 distinct keys for 1 billion insertions. With the current table configurations, we have 64k slots and barely any collisions. Concretely, for 500M rows on my machine I get maximum scan length of 3 and average scan length of less than 0.1. So I dunno why isEmptySLB gets that many mispredictions.

Btw I have never used perf for Haskell, is there a reference on how to do it?

The perf measurements are also kinda at odds with our manual measurements, where just parsing the input without updating the tables was still 70% of the total runtime (on single thread).

1 Like

As in on average <10% of entries involve a scan. Yeah, concretely on mine for 1B rows there were 75M hops. So those numbers tally: under those conditions I got a significant improvement by re-hashing versus scanning, and an even more significant improvement by reversing query and key arrays before comparing keys.

This version is now the fastest multi-threaded Haskell version so far:

Using 11 threads (which yields the fastest times), the time is 3.5s:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.596 s ±  0.032 s    [User: 28.265 s, System: 1.727 s]
  Range (min … max):    3.567 s …  3.646 s    5 runs
2 Likes

FTR, I convinced GHC to lay out the then branch of the isEmptySLB before the else branch by defining

isEmptySLB (SLB# a _ _) = xor (-13) a + 13 <= 0

but it had no effect. Now the instruction mov %rax,0x60(%rsp) moving the length onto the stack is red… No idea how to interpret that; other memory writes don’t seem to be that expensive according to perf report. Branch mispredictions are likely caused because there are as many updates as there are initial insertions.

I tried reading just the length word, but of course we often need the whole entry anyway, for the key == oldkey case.

Btw I have never used perf for Haskell, is there a reference on how to do it?

I don’t know of any. I simply ran it (both perf stat at first, but now I’m using perf record --call-graph and perf report), then reran again after passing -g3 to GHC to generate debug symbols. That seems to be enough to get a function-level profile, but apparently the instruction-level annotations (which you get by pressing Enter on a function name and then choose “Annotate …”) are not really usable. Not sure what goes on here.

2 Likes

Sure. But I usually tried -XStrict on a program that already had bangs in the necessary places (accumulators, fields). Adding more bangs usually makes things worse.

The opposite can also be true: -XStrict on things like head . sort can convert linear time/space into a log-linear.

-XStrictData may as well require much more data and overload GC. In general, -XStrict is a rather dangerous sledgehammer that doesn’t always work (never in my case).

Sorry, perhaps I used the wrong word (although a pun on instruction reordering was intended :slight_smile: ). I meant the ability to do evaluation where it’s needed, not where the bang is:

main :: IO ()
main = do
  x <- randomRIO (0,9) :: IO Int
  let a = trace "a" x
      b = trace "b" x
  if x < 5 then
    print a
  else
    print b

Both a and b will always be evaluated if you add -XStrict.

Superfluous bangs tend to slow things down for the same reason (forces to evaluate the unnecessary):

go acc [] = acc
go !acc (x:xs) ...
-- is frequently faster than
go !acc [] = acc
go !acc (x:xs) ...
--------------
go !acc rarelyUsedArg (x:xs) ...
-- is frequently faster than
go !acc !rarelyUsedArg (x:xs) ...

You can see the source of Text.HTML.TagSoup.Fast. The code is more than a decade old ugliness, I would write it with pattern synonyms now; but you can notice a lack of bangs in non-recursive cases, and it helped the code to perform better, at least back then.

I have a feeling (cannot prove it) that bangs break the strictness analyzer and lead to a suboptimal code generation. GHC knows better when to evaluate if it’s sure that evaluation is necessary, bangs force GHC to do it in the wrong places (and maybe multiple times).

Bangs should be used sparingly, to give GHC a hint that you don’t want to return a lazy accumulator, and let the strictness analyzer do the rest.

2 Likes

Oh, I beg your pardon. I incorrectly misinterpreted -XStrict as -XStrictData. I agree that -XStrict is a dangerous sledgehammer (I recommend -XStrictData without much reservation though).

2 Likes

…but is -XStrictData really all that much better?

CONS Should Not Evaluate its Arguments (1976)

1 Like