One Billion Row challenge in Hs

Oh I say, that’s a bit strong. deepseq is a tool with uses, and while I agree that in that specific problem, the strict vector was ultimately the right solution, in a more complex scenario in which either you want some laziness sometimes in your data structures or you don’t yet know which parts of your program are eating all of your memory with thunks and refactoring everything to be strict would be time-consuming, deepseq is an important tool to have at your disposal, even if just to try force-ing different things until you figure out which things to refactor. (But sometimes invalid laziness has to be representable; Haskell’s type system is very good at letting you express invariants through types but no type system is perfect at this.)

Certainly when you’re attempting to squeeze your thunks out by forcing everything through a show, as OP was, force is something you ought to be told about.

I updated my gist with @AndrasKovacs hash. Now it runs at about the same speed as yours, but uses less intrinsics and doesn’t have short string optimisation (only “short string” hash).

For such a bulk task (parsing CSV at 1GB/sec), not only every memory access counts, but also every instruction. Using a hash that works on 64-bit words instead of bytes helps to have fewer instructions.

2 Likes

deepseq is a time leak machine! and many other languages have it built into their spec :skull:

Correct.

Consider let x = head (return x) in 'a'

  • “outside in” evaluation: the result is 'a' as it ought to be, since x isn’t needed here to obtain that result.

  • “inside out” evaluation: the result will always be ⊥ no matter how much memory or how many threads are available.

To alleviate this fundamental problem with “inside out” evaluation, Eager Haskell would periodically and completely restart (reduction of) the program. This “re-scorched earth” style had two competing requirements:

  • avoiding too many restarts, so that the program can actually run at all;

  • avoiding too few restarts, so the program won’t do too much unnecessary work (like trying to evaluate x = head (return x) or other similarly-cyclic definitions).

…a “delicate balancing act”.

We’ve gone off topic, but I invite anyone who sneers at deepseq to take a look at Fix space leak between modules during compilation by MonoidMusician · Pull Request #4517 · purescript/purescript · GitHub and recommend a better way to address that problem. We weren’t happy with the deep infiltration of NFData into our types and would love to hear from someone who can clearly see a less invasive approach.

(ahem) …speaking for myself here: I wasn’t so much “sneering” at (the use of) entities like deepseq, but a prevailing impression that they (unfortunately) seem to accentuate:

  • that laziness/non-strict semantics are now more of an impediment than being of ongoing benefit to Haskell,

  • and little if anything would be lost if Haskell were strict by default.

(Again, this is merely how I’m perceiving the current situation).

So it is to that impression that I say:

  • did someone solve the halting problem?

  • or is Haskell going to eventually be total as well as dependent?

Because without either of those, being strict by default generally means more programs won’t produce a useful result at all. Therefore definitions like deepseq (or compel, as I mentioned here) along with other “strictness modifiers” should ideally be considered “features of last resort” because of their potential for non-termination.

So how can we get to that ideal? This looks promising:

…if the multi-threaded RTS does have to be overhauled to solve the “N > 4” problem observed by @chreekat and others, can any of Robert Ennal’s research be reused as well, so that comments like:

don’t end up being treated as basic advice when using Haskell?

1 Like

The result first: 3.9s!

11 threads (the number of cores is 10, although 2 of them are not “performance cores”):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

On my computer, -fllvm is still not faster - but for your version at least not slower - than not using it.

Without LLVM:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

With LLVM (about the same).

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.346 s ±  0.005 s    [User: 32.559 s, System: 1.227 s]
  Range (min … max):    4.340 s …  4.353 s    5 runs

But now to the number of threads:

11 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

12 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.951 s ±  0.015 s    [User: 36.132 s, System: 1.523 s]
  Range (min … max):    3.936 s …  3.970 s    5 runs

10 threads (the number of cores, although 2 of them are not “performance cores”):

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.968 s ±  0.013 s    [User: 35.878 s, System: 1.449 s]
  Range (min … max):    3.954 s …  3.982 s    5 runs

14 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.991 s ±  0.012 s    [User: 36.293 s, System: 1.555 s]
  Range (min … max):    3.979 s …  4.004 s    5 runs

16 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.029 s ±  0.012 s    [User: 36.438 s, System: 1.589 s]
  Range (min … max):    4.020 s …  4.051 s    5 runs

20 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.100 s ±  0.037 s    [User: 36.761 s, System: 1.631 s]
  Range (min … max):    4.054 s …  4.142 s    5 runs

8 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

200 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      6.006 s ±  0.109 s    [User: 41.527 s, System: 3.630 s]
  Range (min … max):    5.921 s …  6.156 s    5 runs
1 Like

I think embarrassingly parallel (EP) workloads are meant to use data parallel (DPH) features like what Repa does, forkIO spawns a green thread that has to support all sorts of things that we usually don’t consider as EP, for example a GPU at it’s core is usually just a 32 or 64 instruction wide SIMD processor [1] , which means that misuse of if statements can lead to worse performance than just computing both sides of the if, all instructions are executed in sync, and a memory access for one “thread” will stall all 64 threads, these are things we don’t have to worry about with green threads.

The N>4 issue might be something else, the stm-containers package mentions it can scale to 8 threads with ease, I also have a hunch that modern computers can sort of brute force up to about 4 CPUs (why Intel was king with 4-cores for a decade), L2 cache is only as fast as about 1/2 a DDR4 channel, meaning you can saturate a dual-channel system with 4 cores.

I’ve tried to specific stick to Haskell style code and that it being orders of magnitude slower than C sucks as per @emiruz comments, and what’s even worse is that when you look at other functional languages, most notably Lisp, the C-like performance versions are idiomatic, functional style. [2]

@AndrasKovacs how does your single threaded version work in-comparison to mine? Not sure I understand why mine is so slow compared to many of the solutions here, strict map and fold shouldn’t be that much slower.

I don’t think I’ve seen anyone implement a pure parallel version here using par, maybe if I find some more time I’ll have a crack at a par, stm-containers, and something using DPT, and also fix up some of the code to use bounded threads to prevent cache misses etc.

Something I’ve been wondering about is can we use the type system validate we have no overlap when delegating reads from the file. a correct implementation is be fairly easy to write, but that’s not why we use Haskell eh?

[1]: and end up with absurd marketing claims like 3000 “stream” processors
[2]: Lisp SBCL #2 n-body - Which programs are fastest? (Benchmarks Game)

3 Likes

…and strict by default. So you’re the one who’s dealing with (more of) the tedious and boring jobs, such as determining the correct order of evaluation:

We’ve got it already, you like Haskell’s laziness.

But:

Adding Strictness makes the version (minimally, but consistently) faster on my computer, at least with the load that it has right now (a VM and many other things running), that’s why the times are longer than before:

Original version:

Benchmark 1: ./exe-exe  > solution.txt
  Time (mean ± σ):      4.307 s ±  0.021 s    [User: 37.089 s, System: 2.015 s]
  Range (min … max):    4.276 s …  4.328 s    5 runs

With {-# LANGUAGE Strict #-}:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.269 s ±  0.023 s    [User: 36.972 s, System: 1.955 s]
  Range (min … max):    4.247 s …  4.305 s    5 runs

…but I wonder if I’m now in the minority here :-/


I could be mistaken…but I think you’re the first person on this thread to make use of -XStrict. It would be interesting to see if it makes a difference for the other/older solutions given here: if it works, then perhaps it could help to solve sgraf’s third problem:

That’s not my program, I’m just running them (almost?) all for benchmarks. I’ve actually tried it on my own some time ago and it had been slower. Although that might have changed and the last version would benefit.

That is incorrect, Lisp SBCL #2 code is clean doesn’t look like it cares too much about order, meanwhile I’m having to splatter strictness annotations and really think about why my haskell program is so slow, and on top of that, we have to remember that modern C compilers are extremely good at re-ordering operations, and lets not forget that modern CPUs performance comes from real time re-ordering of instructions as well, half of a modern processor is dedicated to this dark-art of running instructions that take 4 cycles per instruction, at 4 instructions / cycle.

At the end of the day all the code generator has too do is build a dependency graph and spit out an ordering that the CPU (or uarch) will run fast in most cases. This is how software like fex can run badly optimized legacy x86_32 code on x86_64 processor faster than the processor can itself.

There’s definitely something else going on here, that’s not strict v laziness, The fact my pure functional implementation with no use of forkIO is faster with -qm passed to the RTS would hint there’s work needed in the scheduler.

That SBCL code seems rife with destructive updates (…or maybe I just can’t read Common Lisp). What makes it more idiomatic than the faster Haskell GHC #2 solution: n-body Haskell GHC #2 program (Benchmarks Game)

TBH I’m not hugely familiar with CL to know exactly what is going on with the code, Nor am I saying that the the Lisp version is without mutation or anything like that.

What I’m saying is that the Lisp version is closer to the ideal of what I would want to write in Haskell. Meanwhile the fast Haskell version is basically Ugly C, There’s huges amounts of direct use of ptrs, we have a storable instance and a heavy splattering of peek/poke, that’s worse than having a few mutable vars in the Lisp code.

We still have to think about ordering in the Haskell version, especially when it comes to performance code (as every example here proves), So the notion that writing Haskell code you do not have to think about ordering is wrong, we have libraries like Pipes/Conduit that are designed to help, and we also write slower functional style code in other languages.

1 Like

I thought the reason those compilers could do that was largely due to the input language being a “semantics-free zone” (sometimes known as a “dumpster fire” ), which is another reason why I’m annoyed by performance comparisons with that ol’ warhorse…anyone for a Haskell-Clean comparison?


…but this clearly suggests otherwise:

To obtain the correct results, those mutable variables must be used in the correct order: this is why the encapsulated-state type ST s a is monadic.


For Haskell code based on visible effects, that is correct - as you subsequently noted:

we have libraries like Pipes/Conduit that are designed to help […]

But ordinary (effect-free) Haskell code ought to be capable of being evaluated in any order, just as long as the data dependencies are preserved. This form of implicit parallelism works for SQL, so it should also work for Haskell. So it not (quite) working like that at the moment just means more work needs to be directed at the implementation; something which you’ve also noted:

I parallelized and refactored the previous version: 1brc · GitHub

Parallel performance is very erratic on my laptop, there’s 2-3x difference between runs. I’m testing 500M inputs because I don’t have enough RAM to hold the 1B file in memory, and while I don’t run out of memory with mmap, apparently I get some IO bottleneck. The best I was able to do was 1.9s for 500M. The number of threads can be set with +RTS -N.

2 Likes

I’ve seen similar erratic behaviour on my Fortran version on machines with SATA SSDs with a sequential read speed of about 560MB/s. I suspected it was because the mmap buffer is being consumed quicker than that which causes strange IO signalling effects leading to arbitrary delays. I suspect those with SSDs on a PCIe interface won’t see this problem since the bandwidth is about 10x SATA. Does this correspond to your experience?

For me the results are relatively consistent, but a bit odd:

$ for i in {1..12}; do echo "---------------------"; echo "-N$i"; time cabal run --ghc-options="-threaded -fllvm" -O2 1brc.hs -- +RTS -N$i -RTS >/dev/null; done
---------------------
-N1

real	0m18,785s
user	0m18,035s
sys	0m0,755s
---------------------
-N2

real	0m18,784s
user	0m18,073s
sys	0m0,778s
---------------------
-N3

real	0m12,654s
user	0m18,013s
sys	0m0,810s
---------------------
-N4

real	0m9,574s
user	0m17,987s
sys	0m0,814s
---------------------
-N5

real	0m4,054s
user	0m17,944s
sys	0m0,838s
---------------------
-N6

real	0m6,524s
user	0m18,083s
sys	0m0,783s
---------------------
-N7

real	0m5,724s
user	0m18,379s
sys	0m0,822s
---------------------
-N8

real	0m5,694s
user	0m19,745s
sys	0m0,851s
---------------------
-N9

real	0m6,604s
user	0m18,359s
sys	0m0,799s
---------------------
-N10

real	0m4,694s
user	0m21,872s
sys	0m0,881s
---------------------
-N11

real	0m4,434s
user	0m23,389s
sys	0m0,890s
---------------------
-N12

real	0m2,664s
user	0m26,655s
sys	0m0,980s

So it takes only 2.664 seconds for the full 1 billion rows with -N12 which is the number of hyperthreads that my CPU has. If I make the -N larger then there are no more gains as expected, although it also doesn’t seem to get slower.

The strange thing is that -N2 gives pretty much the same time as -N1 (and the same for some other numbers).

I cannot get the C Macros to work with MacOS’ clang, CPP does not substitute (all?) occurrences of BYTE_INDEX.
The ouput of ghc -E looks as follows:

Summary
{-# LINE 1 "app/Main.hs" #-}
# 1 "app/Main.hs"
# 1 "<built-in>" 1
# 15 "<built-in>"
# 1 "/Users/roland/.ghcup/ghc/9.6.4/lib/ghc-9.6.4/lib/../lib/aarch64-osx-ghc-9.6.4/rts-1.0.2/include/ghcversion.h" 1


















# 16 "<built-in>" 2
# 1 "/var/folders/07/f6y7bm0x7n72vhtwd596514h0000gn/T/ghc84125_0/ghc_2.h" 1

























































































































































































































# 17 "<built-in>" 2
# 1 "app/Main.hs" 2
{-# LANGUAGE BlockArguments #-}
{-# LANGUAGE CPP #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE PatternSynonyms #-}
{-# LANGUAGE Strict #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE UnboxedTuples #-}
{-# LANGUAGE ViewPatterns #-}
{-# OPTIONS_GHC -Wall
  -Wno-missing-signatures
  -Wno-name-shadowing #-}

{- cabal:
  build-depends: base >= 4.19, bytestring, mmap, async
  default-language: GHC2021
  ghc-options: -Wall -O2 -fllvm -rtsopts -threaded -split-sections
-}

-- ghc -O2 -fllvm -rtsopts -threaded -split-sections -ddump-simpl -dsuppress-all
-- -dno-suppress-type-signatures -ddump-to-file -fforce-recomp Main7.hs

-- CONFIGURATION
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

-- Run executable with "+RTS -NX" to set number of threads to X.

-- display output


-- should be power of 2, minimum 16384
-- #define TABLE_SIZE 131072

-- #define 65536 32768
-- #define 65536 16384

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

import Control.Concurrent
import Control.Concurrent.Async
import Control.Monad
import Data.Bits
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy.Char8 as LC8
import Foreign.Marshal.Alloc
import GHC.Exts
import GHC.IO
import GHC.Word
import System.IO.MMap


import Data.List
import Text.Printf
import System.IO hiding (withFile)


-- Random common functions
--------------------------------------------------------------------------------

fi :: (Integral a, Num b) => a -> b
fi = fromIntegral
{-# INLINE fi #-}

sl :: (Bits a) => a -> Int -> a
sl = unsafeShiftL

sr :: (Bits a) => a -> Int -> a
sr = unsafeShiftR

isrl :: Int -> Int -> Int
isrl (I# x) (I# y) = I# (uncheckedIShiftRL# x y)

max' :: Int -> Int -> Int
max' a b = let diff = a - b in a - (diff .&. sr diff 63)

min' :: Int -> Int -> Int
min' a b = let diff = a - b in b + (diff .&. sr diff 63)

plusAddr :: Addr# -> Int -> Addr#
plusAddr p (I# x) = plusAddr# p x

int2Addr :: Int -> Addr#
int2Addr (I# x) = int2Addr# x

addr2Int :: Addr# -> Int
addr2Int p = I# (addr2Int# p)

eqI :: Int -> Int -> Int
eqI (I# x) (I# y) = I# (x ==# y)

readI :: Addr# -> IO Int
readI p = IO \s -> case readIntOffAddr# p 0# s of (# s, x #) -> (# s, I# x #)

writeI :: Addr# -> Int -> IO ()
writeI p (I# x) = IO \s -> case writeIntOffAddr# p 0# x s of s -> (# s, () #)

-- Generic buffers
--------------------------------------------------------------------------------

data Buffer = Buffer {_ptr :: Addr#, len :: Int}

plus :: Buffer -> Int -> Buffer
plus (Buffer p l) (I# x) = Buffer (plusAddr# p x) (l - I# x)

memset :: Buffer -> Word8 -> IO ()
memset (Buffer p (I# l)) (W8# x) = IO \s ->
  case setAddrRange# p l (word2Int# (word8ToWord# x)) s of s -> (# s, () #)

withFile :: FilePath -> (Buffer -> IO a) -> IO a
withFile path k = mmapWithFilePtr path ReadOnly Nothing \(Ptr p, l) -> k (Buffer p l)
{-# INLINE withFile #-}

indexW8 :: Buffer -> Int -> Word8
indexW8 (Buffer p _) (I# x) = W8# (indexWord8OffAddr# p x)

indexW32 :: Buffer -> Int -> Word32
indexW32 (Buffer p _) (I# x) = W32# (indexWord32OffAddr# p x)

indexW :: Buffer -> Int -> Word
indexW (Buffer p _) (I# x) = W# (indexWordOffAddr# p x)

indexI :: Buffer -> Int -> Int
indexI (Buffer p _) (I# x) = I# (indexIntOffAddr# p x)

getW8 = (`indexW8` 0)
getW32 = (`indexW32` 0)
getW = (`indexW` 0)

instance Eq Buffer where
  Buffer p l == Buffer p' l' = l == l' && go p p' l
   where
    buf p = Buffer p l
    go p p' l
      | l >= 8 =
          getW (buf p) == getW (buf p') && go (plusAddr# p 8#) (plusAddr# p' 8#) (l - 8)
      | l >= 4 =
          getW32 (buf p) == getW32 (buf p') && go (plusAddr# p 4#) (plusAddr# p' 4#) (l - 4)
      | l == 0 = True
      | True =
          getW8 (buf p) == getW8 (buf p') && go (plusAddr# p 1#) (plusAddr# p' 1#) (l - 1)
  {-# INLINE (==) #-}

foldedMul :: Word -> Word -> Word
foldedMul (W# x) (W# y) = case timesWord2# x y of (# hi, lo #) -> W# (xor# hi lo)

salt :: Word
salt = 3032525626373534813

combine :: Word -> Word -> Word
combine x y = foldedMul (xor x y) 11400714819323198549

hashBuffer :: Buffer -> Word
hashBuffer p = go p salt
 where
  go p acc
    | len p >= 8 = go (plus p 8) (combine (getW p) acc)
    | len p >= 4 = go (plus p 4) (combine (fromIntegral (getW32 p)) acc)
    | len p == 0 = acc
    | otherwise = go (plus p 1) (combine (fromIntegral (getW8 p)) acc)

buildBuffer :: Buffer -> BB.Builder
buildBuffer b | len b == 0 = mempty
buildBuffer b = BB.word8 (getW8 b) <> buildBuffer (plus b 1)

-- printBuffer :: Buffer -> IO ()
-- printBuffer = BB.hPutBuilder stdout . buildBuffer

instance Show Buffer where
  show x =
    LC8.unpack $ BB.toLazyByteString $ buildBuffer x

instance Ord Buffer where
  compare x x' = compare (show x) (show x')

-- Short buffer
--------------------------------------------------------------------------------

-- Unboxed buffer containing at most 23 bytes. The first field is the length,
-- the rest is the payload. The 24-th byte in the payload is always zeroed out.
data ShortBuffer = ShortBuffer# Int Int Int Int

instance Eq ShortBuffer where
  ShortBuffer# _ a b c == ShortBuffer# _ a' b' c' =
    (eqI a a' .&. eqI b b' .&. eqI c c') == 1
  {-# INLINE (==) #-}

hashShortBuffer :: ShortBuffer -> Word
hashShortBuffer (ShortBuffer# _ a b c) =
  (salt `combine` fi a) `combine` (fi b `combine` fi c)

buildShortBuffer :: ShortBuffer -> BB.Builder
buildShortBuffer (ShortBuffer# l a b c) =
  BB.lazyByteString $
    LC8.take (fi l) $
      BB.toLazyByteString $
        BB.int64LE (fi c) <> BB.int64LE (fi b) <> BB.int64LE (fi a)

instance Show ShortBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildShortBuffer

instance Ord ShortBuffer where
  compare (ShortBuffer# _ a b c) (ShortBuffer# _ a' b' c') =
    let sw (I# x) = W# (byteSwap# (int2Word# x))
     in compare (sw c) (sw c') <> compare (sw b) (sw b') <> compare (sw a) (sw a')

-- Unboxed sum of short and standard buffers.
--------------------------------------------------------------------------------

data SLBuffer = SLB# Int Int Int

isEmptySLB :: SLBuffer -> Bool
isEmptySLB (SLB# a _ _) = a == 0

unpackSLB# :: SLBuffer -> (# ShortBuffer | Buffer #)
unpackSLB# (SLB# a b c) =
  let l = a .&. 255
   in if l <= 23
        then (# ShortBuffer# l (isrl a 8) b c | #)
        else (# | Buffer (int2Addr b) a #)

pattern ShortBuffer :: ShortBuffer -> SLBuffer
pattern ShortBuffer buf <- (unpackSLB# -> (# buf | #))
  where
    ShortBuffer (ShortBuffer# len a b c) = SLB# (sl a 8 .|. len) b c

pattern LongBuffer :: Buffer -> SLBuffer
pattern LongBuffer buf <- (unpackSLB# -> (# | buf #))
  where
    LongBuffer (Buffer p l) = SLB# l (addr2Int p) 0
{-# COMPLETE ShortBuffer, LongBuffer #-}

instance Eq SLBuffer where
  ShortBuffer b == ShortBuffer b' = b == b'
  LongBuffer b == LongBuffer b' = b == b'
  _ == _ = False
  {-# INLINE (==) #-}

-- Try to pack a Buffer into a short one.
packBuffer :: Buffer -> SLBuffer
packBuffer b =
  let
    l = len b
    ix = indexI b
    mask l = isrl (-1) (64 - sl l 3)
   in
    if l <= 8
      then ShortBuffer (ShortBuffer# l 0 0 (ix 0 .&. mask l))
      else
        if l <= 16
          then ShortBuffer (ShortBuffer# l 0 (ix 1 .&. mask (l - 8)) (ix 0))
          else
            if l <= 23
              then ShortBuffer (ShortBuffer# l (ix 2 .&. mask (l - 16)) (ix 1) (ix 0))
              else LongBuffer b

hashSLB :: SLBuffer -> Word
hashSLB (ShortBuffer b) = hashShortBuffer b
hashSLB (LongBuffer b) = hashBuffer b

buildSLB :: SLBuffer -> BB.Builder
buildSLB (ShortBuffer b) = buildShortBuffer b
buildSLB (LongBuffer b) = buildBuffer b

instance Show SLBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildSLB

instance Ord SLBuffer where
  compare (ShortBuffer b) (ShortBuffer b') = compare b b'
  compare b b' = compare (show b) (show b')

-- Branchless scanning for bytes in words.
--------------------------------------------------------------------------------



-- Given a hexadecimal byte, generate the (Word -> Int) function which returns the
-- index of the rightmost occurrence of the byte, or returns 8 if the byte does not
-- occur.




-- Hash table of measurements
--------------------------------------------------------------------------------

data Val = Val
  { _min :: Int
  , _max :: Int
  , _cnt :: Int
  , _total :: Int
  }

data Entry = Entry
  { _key :: {-# UNPACK #-} SLBuffer
  , _val :: {-# UNPACK #-} Val
  }

-- size of entry in bytes (includes padding to 64 bytes!)
entrySize :: Int
entrySize = 8 * 8

tableMask :: Int
tableMask = 65536 - 1

tableBytes :: Int
tableBytes = 65536 * entrySize

type Table = Addr#

initTables :: [Buffer] -> ([(Buffer, Ptr Word8)] -> IO a) -> IO a
initTables bs f = do
  let l = 65536 * entrySize
  let go [] acc = f acc
      go (b : bs) acc = allocaBytesAligned l entrySize \p@(Ptr p') -> do
        memset (Buffer p' l) 0
        go bs ((b, p) : acc)
  go bs []

-- read entry from a *byte* offset
readEntry :: Table -> Int -> IO Entry
readEntry p i = case plusAddr p i of
  p -> do
    a <- readI p
    b <- readI (plusAddr p 8)
    c <- readI (plusAddr p 16)
    d <- readI (plusAddr p 24)
    e <- readI (plusAddr p 32)
    f <- readI (plusAddr p 40)
    g <- readI (plusAddr p 48)
    pure $ Entry (SLB# a b c) (Val d e f g)

-- write entry to a *byte* offset
writeEntry :: Table -> Int -> Entry -> IO ()
writeEntry p i (Entry (SLB# a b c) (Val d e f g)) = case plusAddr p i of
  p -> do
    writeI p a
    writeI (plusAddr p 8) b
    writeI (plusAddr p 16) c
    writeI (plusAddr p 24) d
    writeI (plusAddr p 32) e
    writeI (plusAddr p 40) f
    writeI (plusAddr p 48) g

newVal :: Int -> Val
newVal temp = Val temp temp 1 temp

updateEntry :: Entry -> Val -> Entry
updateEntry (Entry k (Val mi ma cn to)) (Val mi' ma' cn' to') =
  Entry k (Val (min' mi mi') (max' ma ma') (cn + cn') (to + to'))

forTable :: Table -> (Entry -> IO ()) -> IO ()
forTable t f = do
  let go ix | ix == tableBytes = pure ()
      go ix = do
        e@(Entry k _) <- readEntry t ix
        if isEmptySLB k
          then do
            go (ix + entrySize)
          else do
            f e
            go (ix + entrySize)
  go 0
{-# INLINE forTable #-}

updateTable :: Table -> Entry -> IO ()
updateTable tbl e@(Entry key val) = do
  let go ix | ix == tableBytes = go 0
      go ix = do
        olde@(Entry oldkey _) <- readEntry tbl ix
        if isEmptySLB oldkey
          then do
            writeEntry tbl ix e
          else
            if key == oldkey
              then do
                writeEntry tbl ix (updateEntry olde val)
              else do
                go (ix + entrySize)
  go ((fi (hashSLB key) .&. tableMask) * entrySize)

parse :: Table -> Buffer -> IO ()
parse _ b | len b == 0 = do
  pure ()
parse tbl b = do
  -- scan for semicolon
  let
    findSemi :: Int -> Buffer -> Int
    findSemi i b = case BYTE_INDEX (3 B) (getW b) of
      8 -> findSemi (i + 8) (plus b 8)
      i' -> i + i'
  let keylen = findSemi 0 b

  let key = packBuffer $ b{len = keylen}
  b <- pure $ plus b (keylen + 1)

  let
    digit :: Word8 -> Int
    digit x = fi x - 48

  let
    join :: Buffer -> Int -> IO ()
    join b temp = do
      updateTable tbl (Entry key (newVal temp))
      parse tbl b

  case getW8 b of
    -- '-'
    45 -> do
      let d1 = getW8 (plus b 1)
      case getW8 (plus b 2) of
        -- '.' so the next must be digit
        46 -> do
          let d2 = getW8 (plus b 3)
          join (plus b 5) ((-10) * (digit d1) - digit d2)

        -- digit, so the next must be '.' and then digit
        d2 -> do
          let d3 = getW8 (plus b 4)
          join (plus b 6) ((-100) * (digit d1) - 10 * (digit d2) - digit d3)

    -- a digit
    d1 -> case getW8 (plus b 1) of
      -- '.', so the next must be digit
      46 -> do
        let d2 = getW8 (plus b 2)
        join (plus b 4) (10 * digit d1 + digit d2)

      -- another digit, so the next must be '.', and then digit
      d2 -> do
        let d3 = getW8 (plus b 3)
        join (plus b 5) (100 * digit d1 + 10 * digit d2 + digit d3)

-- Split file to THREAD_NUM buffers
--------------------------------------------------------------------------------

splitBuffer :: Int -> Buffer -> [Buffer]
splitBuffer num_threads b =
  let
    chunkSize = div (len b) num_threads

    go b
      | len b <= chunkSize =
          [b]
    go b =
      let
        findNewl i b = case BYTE_INDEX (0 A) (getW b) of
          8 -> findNewl (i + 8) (plus b 8)
          i' -> i + i'
        keylen = findNewl 0 (plus b chunkSize)
        chunkSize' = chunkSize + keylen + 1
        rest = go (plus b chunkSize')
       in
        Buffer (_ptr b) chunkSize' : rest
   in
    go b


tableToList :: Table -> IO [Entry]
tableToList tbl = do
  let go ix | ix == tableBytes = pure []
      go ix = do
        e@(Entry k _) <- readEntry tbl ix
        if isEmptySLB k then do
          go (ix + entrySize)
        else do
          es <- go (ix + entrySize)
          pure (e:es)
  go 0

displayEntries :: [Entry] -> BB.Builder
displayEntries es = BB.char8 '{' <> go es <> BB.char8 '}' where

  f $$! x = f x; infixl 8 $$!

  goEntry (Entry key (Val mi ma cn to)) =
    buildSLB key <>
    BB.string8
      (printf "=%.1f/%.1f/%.1f" $$!
          (fi mi / 10 :: Double) $$!
          (fi to / (fi cn * 10) :: Double) $$!
          (fi ma / 10 :: Double))

  go []     = mempty
  go [e]    = goEntry e
  go (e:es) = goEntry e <> BB.string8 ", " <> go es


main :: IO ()
main =
  withFile "data/measurements.txt" \b -> do
    num_threads <- getNumCapabilities
    initTables (splitBuffer num_threads b) \bts -> do
      Ptr tbl : ts <- mapConcurrently (\(b, Ptr t) -> Ptr t <$ parse t b) bts
      forM_ ts \(Ptr tbl') ->
        forTable tbl' \e ->
          updateTable tbl e

      es <- sortBy (\e e' -> compare (_key e) (_key e')) <$> tableToList tbl
      BB.hPutBuilder stdout (displayEntries es)
      putChar '\n'

WTF? This is with GHC 9.6.4, but with 9.8.2 (which is used when actually compiling the program) it is correctly preprocessed?

ghc-9.8 -E:

Summary
{-# LINE 1 "app/Main.hs" #-}
# 1 "app/Main.hs"
# 1 "<built-in>" 1
# 16 "<built-in>"
# 1 "/Users/roland/.ghcup/ghc/9.8.2/lib/ghc-9.8.2/lib/../lib/aarch64-osx-ghc-9.8.2/rts-1.0.2/include/ghcversion.h" 1


















# 17 "<built-in>" 2
# 1 "/var/folders/07/f6y7bm0x7n72vhtwd596514h0000gn/T/ghc90157_0/ghc_2.h" 1

























































































































































































































# 18 "<built-in>" 2
# 1 "app/Main.hs" 2

{-# language
    BlockArguments
  , CPP
  , LambdaCase
  , MagicHash
  , PatternSynonyms
  , Strict
  , TypeApplications
  , UnboxedTuples
  , ViewPatterns
  #-}

{-# options_ghc
  -Wall
  -Wno-missing-signatures
  -Wno-name-shadowing
  #-}

{- cabal:
  build-depends: base >= 4.19, bytestring, mmap, async
  default-language: GHC2021
  ghc-options: -Wall -O2 -fllvm -rtsopts -threaded -split-sections
-}

-- ghc -O2 -fllvm -rtsopts -threaded -split-sections -ddump-simpl -dsuppress-all
-- -dno-suppress-type-signatures -ddump-to-file -fforce-recomp Main7.hs

-- CONFIGURATION
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

-- Run executable with "+RTS -NX" to set number of threads to X.

-- display output


-- should be power of 2, minimum 16384
-- #define TABLE_SIZE 131072

-- #define 65536 32768
-- #define 65536 16384

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

import Control.Concurrent
import Control.Concurrent.Async
import Control.Monad
import Data.Bits
import Foreign.Marshal.Alloc
import GHC.Exts
import GHC.IO
import GHC.Word
import System.IO.MMap
import qualified Data.ByteString.Builder    as BB
import qualified Data.ByteString.Lazy.Char8 as LC8


import Data.List
import Text.Printf
import System.IO hiding (withFile)



-- Random common functions
--------------------------------------------------------------------------------

fi :: (Integral a, Num b) => a -> b
fi = fromIntegral; {-# inline fi #-}

sl :: Bits a => a -> Int -> a
sl = unsafeShiftL

sr :: Bits a => a -> Int -> a
sr = unsafeShiftR

isrl :: Int -> Int -> Int
isrl (I# x) (I# y) = I# (uncheckedIShiftRL# x y)

max' :: Int -> Int -> Int
max' a b = let diff = a - b in a - (diff .&. sr diff 63)

min' :: Int -> Int -> Int
min' a b = let diff = a - b in b + (diff .&. sr diff 63)

plusAddr :: Addr# -> Int -> Addr#
plusAddr p (I# x) = plusAddr# p x

int2Addr :: Int -> Addr#
int2Addr (I# x) = int2Addr# x

addr2Int :: Addr# -> Int
addr2Int p = I# (addr2Int# p)

eqI :: Int -> Int -> Int
eqI (I# x) (I# y) = I# (x ==# y)

readI :: Addr# -> IO Int
readI p = IO \s -> case readIntOffAddr# p 0# s of (# s, x #) -> (# s, I# x #)

writeI :: Addr# -> Int -> IO ()
writeI p (I# x) = IO \s -> case writeIntOffAddr# p 0# x s of s -> (# s, () #)

-- Generic buffers
--------------------------------------------------------------------------------

data Buffer = Buffer {_ptr :: Addr#, len :: Int}

plus :: Buffer -> Int -> Buffer
plus (Buffer p l) (I# x) = Buffer (plusAddr# p x) (l - I# x)

memset :: Buffer -> Word8 -> IO ()
memset (Buffer p (I# l)) (W8# x) = IO \s ->
  case setAddrRange# p l (word2Int# (word8ToWord# x)) s of s -> (# s, () #)

withFile :: FilePath -> (Buffer -> IO a) -> IO a
withFile path k = mmapWithFilePtr path ReadOnly Nothing \(Ptr p, l) -> k (Buffer p l)
{-# inline withFile #-}

indexW8 :: Buffer -> Int -> Word8
indexW8 (Buffer p _) (I# x) = W8# (indexWord8OffAddr# p x)

indexW32 :: Buffer -> Int -> Word32
indexW32 (Buffer p _) (I# x) = W32# (indexWord32OffAddr# p x)

indexW :: Buffer -> Int -> Word
indexW (Buffer p _) (I# x) = W# (indexWordOffAddr# p x)

indexI :: Buffer -> Int -> Int
indexI (Buffer p _) (I# x) = I# (indexIntOffAddr# p x)

getW8  = (`indexW8` 0)
getW32 = (`indexW32` 0)
getW   = (`indexW` 0)

instance Eq Buffer where
  Buffer p l == Buffer p' l' = l == l' && go p p' l where
    buf p = Buffer p l
    go p p' l
      | l >= 8 = getW   (buf p) == getW   (buf p') && go (plusAddr# p 8#) (plusAddr# p' 8#) (l - 8)
      | l >= 4 = getW32 (buf p) == getW32 (buf p') && go (plusAddr# p 4#) (plusAddr# p' 4#) (l - 4)
      | l == 0 = True
      | True   = getW8  (buf p) == getW8  (buf p') && go (plusAddr# p 1#) (plusAddr# p' 1#) (l - 1)
  {-# inline (==) #-}

foldedMul :: Word -> Word -> Word
foldedMul (W# x) (W# y) = case timesWord2# x y of (# hi, lo #) -> W# (xor# hi lo)

salt :: Word
salt = 3032525626373534813

combine :: Word -> Word -> Word
combine x y = foldedMul (xor x y) 11400714819323198549

hashBuffer :: Buffer -> Word
hashBuffer p = go p salt where
  go p acc
    | len p >= 8 = go (plus p 8) (combine (getW p) acc)
    | len p >= 4 = go (plus p 4) (combine (fromIntegral (getW32 p)) acc)
    | len p == 0 = acc
    | otherwise  = go (plus p 1) (combine (fromIntegral (getW8 p)) acc)

buildBuffer :: Buffer -> BB.Builder
buildBuffer b | len b == 0 = mempty
buildBuffer b = BB.word8 (getW8 b) <> buildBuffer (plus b 1)

-- printBuffer :: Buffer -> IO ()
-- printBuffer = BB.hPutBuilder stdout . buildBuffer

instance Show Buffer where
  show x =
    LC8.unpack $ BB.toLazyByteString $ buildBuffer x

instance Ord Buffer where
  compare x x' = compare (show x) (show x')

-- Short buffer
--------------------------------------------------------------------------------

-- Unboxed buffer containing at most 23 bytes. The first field is the length,
-- the rest is the payload. The 24-th byte in the payload is always zeroed out.
data ShortBuffer = ShortBuffer# Int Int Int Int

instance Eq ShortBuffer where
  ShortBuffer# _ a b c == ShortBuffer# _ a' b' c' =
    (eqI a a' .&. eqI b b' .&. eqI c c') == 1
  {-# inline (==) #-}

hashShortBuffer :: ShortBuffer -> Word
hashShortBuffer (ShortBuffer# _ a b c) =
  (salt `combine` fi a) `combine` (fi b `combine` fi c)

buildShortBuffer :: ShortBuffer -> BB.Builder
buildShortBuffer (ShortBuffer# l a b c) =
  BB.lazyByteString $ LC8.take (fi l) $ BB.toLazyByteString $
  BB.int64LE (fi c) <> BB.int64LE (fi b) <> BB.int64LE (fi a)

instance Show ShortBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildShortBuffer

instance Ord ShortBuffer where
  compare (ShortBuffer# _ a b c) (ShortBuffer# _ a' b' c') =
    let sw (I# x) = W# (byteSwap# (int2Word# x))
    in compare (sw c) (sw c') <> compare (sw b) (sw b') <> compare (sw a) (sw a')

-- Unboxed sum of short and standard buffers.
--------------------------------------------------------------------------------

data SLBuffer = SLB# Int Int Int

isEmptySLB :: SLBuffer -> Bool
isEmptySLB (SLB# a _ _) = a == 0

unpackSLB# :: SLBuffer -> (# ShortBuffer | Buffer #)
unpackSLB# (SLB# a b c) =
  let l = a .&. 255 in
  if l <= 23 then (# ShortBuffer# l (isrl a 8) b c | #)
             else (# | Buffer (int2Addr b) a #)

pattern ShortBuffer :: ShortBuffer -> SLBuffer
pattern ShortBuffer buf <- (unpackSLB# -> (# buf | #)) where
  ShortBuffer (ShortBuffer# len a b c) = SLB# (sl a 8 .|. len) b c

pattern LongBuffer :: Buffer -> SLBuffer
pattern LongBuffer buf <- (unpackSLB# -> (# | buf #)) where
  LongBuffer (Buffer p l) = SLB# l (addr2Int p) 0
{-# complete ShortBuffer, LongBuffer #-}

instance Eq SLBuffer where
  ShortBuffer b == ShortBuffer b' = b == b'
  LongBuffer b  == LongBuffer b'  = b == b'
  _             == _              = False
  {-# inline (==) #-}

-- Try to pack a Buffer into a short one.
packBuffer :: Buffer -> SLBuffer
packBuffer b =
  let l      = len b
      ix     = indexI b
      mask l = isrl (-1) (64 - sl l 3) in
  if      l <= 8  then ShortBuffer (ShortBuffer# l 0 0 (ix 0 .&. mask l))
  else if l <= 16 then ShortBuffer (ShortBuffer# l 0 (ix 1 .&. mask (l - 8)) (ix 0))
  else if l <= 23 then ShortBuffer (ShortBuffer# l (ix 2 .&. mask (l - 16)) (ix 1) (ix 0))
  else LongBuffer b

hashSLB :: SLBuffer -> Word
hashSLB (ShortBuffer b) = hashShortBuffer b
hashSLB (LongBuffer b)  = hashBuffer b

buildSLB :: SLBuffer -> BB.Builder
buildSLB (ShortBuffer b) = buildShortBuffer b
buildSLB (LongBuffer b)  = buildBuffer b

instance Show SLBuffer where
  show = LC8.unpack . BB.toLazyByteString . buildSLB

instance Ord SLBuffer where
  compare (ShortBuffer b) (ShortBuffer b') = compare b b'
  compare b               b'               = compare (show b) (show b')

-- Branchless scanning for bytes in words.
--------------------------------------------------------------------------------



-- Given a hexadecimal byte, generate the (Word -> Int) function which returns the
-- index of the rightmost occurrence of the byte, or returns 8 if the byte does not
-- occur.





-- Hash table of measurements
--------------------------------------------------------------------------------

data Val = Val {
    _min     :: Int
  , _max     :: Int
  , _cnt     :: Int
  , _total   :: Int
  }

data Entry = Entry {
    _key     :: {-# unpack #-} SLBuffer
  , _val     :: {-# unpack #-} Val
  }

-- size of entry in bytes (includes padding to 64 bytes!)
entrySize :: Int
entrySize = 8 * 8

tableMask :: Int
tableMask = 65536 - 1

tableBytes :: Int
tableBytes = 65536 * entrySize

type Table = Addr#

initTables :: [Buffer] -> ([(Buffer, Ptr Word8)] -> IO a) -> IO a
initTables bs f = do
  let l = 65536 * entrySize
  let go []     acc = f acc
      go (b:bs) acc = allocaBytesAligned l entrySize \p@(Ptr p') -> do
        memset (Buffer p' l) 0
        go bs ((b, p):acc)
  go bs []

-- read entry from a *byte* offset
readEntry :: Table -> Int -> IO Entry
readEntry p i = case plusAddr p i of
  p -> do
    a <- readI p
    b <- readI (plusAddr p 8)
    c <- readI (plusAddr p 16)
    d <- readI (plusAddr p 24)
    e <- readI (plusAddr p 32)
    f <- readI (plusAddr p 40)
    g <- readI (plusAddr p 48)
    pure $ Entry (SLB# a b c) (Val d e f g)

-- write entry to a *byte* offset
writeEntry :: Table -> Int -> Entry -> IO ()
writeEntry p i (Entry (SLB# a b c) (Val d e f g)) = case plusAddr p i of
  p -> do
    writeI p               a
    writeI (plusAddr p 8)  b
    writeI (plusAddr p 16) c
    writeI (plusAddr p 24) d
    writeI (plusAddr p 32) e
    writeI (plusAddr p 40) f
    writeI (plusAddr p 48) g

newVal :: Int -> Val
newVal temp = Val temp temp 1 temp

updateEntry :: Entry -> Val -> Entry
updateEntry (Entry k (Val mi ma cn to)) (Val mi' ma' cn' to')
  = Entry k (Val (min' mi mi') (max' ma ma') (cn + cn') (to + to'))

forTable :: Table -> (Entry -> IO ()) -> IO ()
forTable t f = do
  let go ix | ix == tableBytes = pure ()
      go ix = do
        e@(Entry k _) <- readEntry t ix
        if isEmptySLB k then do
          go (ix + entrySize)
        else do
          f e
          go (ix + entrySize)
  go 0
{-# inline forTable #-}

updateTable :: Table -> Entry -> IO ()
updateTable tbl e@(Entry key val) = do
  let go ix | ix == tableBytes = go 0
      go ix = do
        olde@(Entry oldkey _) <- readEntry tbl ix
        if isEmptySLB oldkey then do
          writeEntry tbl ix e
        else if key == oldkey then do
          writeEntry tbl ix (updateEntry olde val)
        else do
          go (ix + entrySize)
  go ((fi (hashSLB key) .&. tableMask) * entrySize)

parse :: Table -> Buffer -> IO ()
parse _   b | len b == 0 = do
  pure ()
parse tbl b = do

-- scan for semicolon
  let findSemi :: Int -> Buffer -> Int
      findSemi i b = case (\(x :: Word) -> case xor x 0x 3B 3B 3B 3B 3B 3B 3B 3B of x -> case (x - 0x0101010101010101) .&. complement x .&. 0x8080808080808080 of x -> countTrailingZeros x `sr` 3) (getW b) of
        8  -> findSemi (i + 8) (plus b 8)
        i' -> i + i'
  let keylen = findSemi 0 b

  let key = packBuffer $ b {len = keylen}
  b <- pure $ plus b (keylen + 1)

  let digit :: Word8 -> Int
      digit x = fi x - 48

  let join :: Buffer -> Int -> IO ()
      join b temp = do
        updateTable tbl (Entry key (newVal temp))
        parse tbl b

  case getW8 b of
    -- '-'
    45 -> do
      let d1 = getW8 (plus b 1)
      case getW8 (plus b 2) of
        -- '.' so the next must be digit
        46 -> do
          let d2 = getW8 (plus b 3)
          join (plus b 5) ((-10)*(digit d1) - digit d2)

        -- digit, so the next must be '.' and then digit
        d2 -> do
          let d3 = getW8 (plus b 4)
          join (plus b 6) ((-100)*(digit d1) - 10*(digit d2) - digit d3)

    -- a digit
    d1 -> case getW8 (plus b 1) of
      -- '.', so the next must be digit
      46 -> do
        let d2 = getW8 (plus b 2)
        join (plus b 4) (10*digit d1 + digit d2)

      -- another digit, so the next must be '.', and then digit
      d2 -> do
        let d3 = getW8 (plus b 3)
        join (plus b 5) (100*digit d1 + 10*digit d2 + digit d3)

-- Split file to THREAD_NUM buffers
--------------------------------------------------------------------------------

splitBuffer :: Int -> Buffer -> [Buffer]
splitBuffer num_threads b = let
  chunkSize = div (len b) num_threads

  go b | len b <= chunkSize =
    [b]
  go b = let
    findNewl i b = case (\(x :: Word) -> case xor x 0x 0A 0A 0A 0A 0A 0A 0A 0A of x -> case (x - 0x0101010101010101) .&. complement x .&. 0x8080808080808080 of x -> countTrailingZeros x `sr` 3) (getW b) of
      8  -> findNewl (i + 8) (plus b 8)
      i' -> i + i'
    keylen     = findNewl 0 (plus b chunkSize)
    chunkSize' = chunkSize + keylen + 1
    rest       = go (plus b chunkSize')
    in
    Buffer (_ptr b) chunkSize' : rest
  in go b


tableToList :: Table -> IO [Entry]
tableToList tbl = do
  let go ix | ix == tableBytes = pure []
      go ix = do
        e@(Entry k _) <- readEntry tbl ix
        if isEmptySLB k then do
          go (ix + entrySize)
        else do
          es <- go (ix + entrySize)
          pure (e:es)
  go 0

displayEntries :: [Entry] -> BB.Builder
displayEntries es = BB.char8 '{' <> go es <> BB.char8 '}' where

  f $$! x = f x; infixl 8 $$!

  goEntry (Entry key (Val mi ma cn to)) =
    buildSLB key <>
    BB.string8
      (printf "=%.1f/%.1f/%.1f" $$!
          (fi mi / 10 :: Double) $$!
          (fi to / (fi cn * 10) :: Double) $$!
          (fi ma / 10 :: Double))

  go []     = mempty
  go [e]    = goEntry e
  go (e:es) = goEntry e <> BB.string8 ", " <> go es


main :: IO ()
main =
  withFile "data/measurements.txt" \b -> do
    num_threads <- getNumCapabilities
    initTables (splitBuffer num_threads b) \bts -> do
      Ptr tbl:ts <- mapConcurrently (\(b, Ptr t) -> Ptr t <$ parse t b) bts
      forM_ ts \(Ptr tbl') ->
        forTable tbl' \e ->
          updateTable tbl e

      es <- sortBy (\e e' -> compare (_key e) (_key e')) <$> tableToList tbl
      BB.hPutBuilder stdout (displayEntries es)
      putChar '\n'