One Billion Row challenge in Hs

Hrm…to choose between “library bloat” (having several similar libraries which differ only in their strictness properties) or ye ol’ code bloat (having several similar groups of definitions which differ only in certain other properties)? As noted in this presentation:

https://www.youtube.com/watch?v=2PxsyWqZ5dI

…just imagine if we had to choose between several similar memory management libraries which only differed in their specific properties: thankfully, we don’t! Instead, it’s a part of the implementation.


Suppose multi-threading by default:

https://github.com/ghc-proposals/ghc-proposals/pull/240

had already been implemented - would there really still be as much interest in “wringing every last bit” of performance out of the old single-threaded RTS? Or would we instead be:

  • advising people on how to migrate their old “unthreaded” code to the new default;

  • and discussing how to get the best multi-threaded performance out of the RTS?


It seems to me that you would merely be exchanging many short simplifier runs for a few much-longer simplifier runs, because of that extra complexity.

Oh I say, that’s a bit strong. deepseq is a tool with its uses, and while I agree that in that specific problem the strict vector was ultimately the right solution, consider a more complex scenario: one where you sometimes want some laziness in your data structures, or where you don’t yet know which parts of your program are eating all of your memory with thunks and refactoring everything to be strict would be time-consuming. There, deepseq is an important tool to have at your disposal, even if just to try force-ing different things until you figure out what to refactor. (But sometimes invalid laziness has to be representable; Haskell’s type system is very good at letting you express invariants through types, but no type system is perfect at this.)

Certainly when you’re attempting to squeeze your thunks out by forcing everything through a show, as OP was, force is something you ought to be told about.
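As a rough illustration of that diagnostic use (a minimal sketch with a made-up Stats type, not code from any solution in this thread): forcing the accumulator on every step keeps it free of thunks even though its fields are lazy, which is often enough to confirm where a leak is before committing to a big refactor.

import Control.DeepSeq (NFData (..), force)
import Data.List (foldl')

-- Hypothetical running summary with (deliberately) lazy fields.
data Stats = Stats { total :: Int, count :: Int }

instance NFData Stats where
  rnf (Stats t c) = t `seq` c `seq` ()

-- Forcing the accumulator on each step stops thunks piling up behind the
-- lazy fields; a blunt diagnostic, not a final design.
summarise :: [Int] -> Stats
summarise = foldl' step (Stats 0 0)
  where
    step acc x = force (Stats (total acc + x) (count acc + 1))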

I updated my gist with @AndrasKovacs’s hash. Now it runs at about the same speed as yours, but uses fewer intrinsics and doesn’t have the short-string optimisation (only the “short string” hash).

For such a bulk task (parsing CSV at 1GB/sec), not only does every memory access count, but every instruction does too. Using a hash that works on 64-bit words instead of bytes helps keep the instruction count down.
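For illustration only (an FNV-1a-style sketch over 64-bit chunks; this is not the hash from the gist above, and splitting the input into Word64s is left out), the point is one xor and one multiply per 8 input bytes instead of per byte:

import Data.Bits (xor)
import Data.List (foldl')
import Data.Word (Word64)

-- Fold an FNV-style mix over whole 64-bit words; roughly 8x fewer
-- hash instructions than a byte-at-a-time loop.
hashWords :: [Word64] -> Word64
hashWords = foldl' step 0xcbf29ce484222325
  where
    step h w = (h `xor` w) * 0x100000001b3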


deepseq is a time leak machine! and many other languages have it built into their spec :skull:

Correct.

Consider let x = head (return x) in 'a'

  • “outside in” evaluation: the result is 'a' as it ought to be, since x isn’t needed here to obtain that result.

  • “inside out” evaluation: the result will always be ⊥ no matter how much memory or how many threads are available.
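A minimal sketch of the difference, with the forced variant left commented out because it has no useful value:

-- Lazy ("outside in") evaluation never demands x, so this is just 'a':
lazily :: Char
lazily = let x = head (return x) in 'a'

-- Demanding x first (a stand-in for "inside out" evaluation) is ⊥;
-- GHC typically reports <<loop>>:
--
-- eagerly :: Char
-- eagerly = let x = head (return x) in x `seq` 'a'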

To alleviate this fundamental problem with “inside out” evaluation, Eager Haskell would periodically and completely restart (reduction of) the program. This “re-scorched earth” style had two competing requirements:

  • avoiding too many restarts, so that the program can actually run at all;

  • avoiding too few restarts, so the program won’t do too much unnecessary work (like trying to evaluate x = head (return x) or other similarly-cyclic definitions).

…a “delicate balancing act”.

We’ve gone off topic, but I invite anyone who sneers at deepseq to take a look at Fix space leak between modules during compilation by MonoidMusician · Pull Request #4517 · purescript/purescript · GitHub and recommend a better way to address that problem. We weren’t happy with the deep infiltration of NFData into our types and would love to hear from someone who can clearly see a less invasive approach.

(ahem) …speaking for myself here: I wasn’t so much “sneering” at (the use of) entities like deepseq, but a prevailing impression that they (unfortunately) seem to accentuate:

  • that laziness/non-strict semantics are now more of an impediment than being of ongoing benefit to Haskell,

  • and little if anything would be lost if Haskell were strict by default.

(Again, this is merely how I’m perceiving the current situation).

So it is to that impression that I say:

  • did someone solve the halting problem?

  • or is Haskell going to eventually be total as well as dependent?

Because without either of those, being strict by default generally means more programs won’t produce a useful result at all. Therefore definitions like deepseq (or compel, as I mentioned here) along with other “strictness modifiers” should ideally be considered “features of last resort” because of their potential for non-termination.

So how can we get to that ideal? This looks promising:

…if the multi-threaded RTS does have to be overhauled to solve the “N > 4” problem observed by @chreekat and others, can any of Robert Ennals’s research be reused as well, so that comments like:

don’t end up being treated as basic advice when using Haskell?


The result first: 3.9s!

11 threads (the number of cores is 10, although 2 of them are not “performance cores”):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

On my computer, -fllvm is still not faster than not using it - but for your version it is at least not slower.

Without LLVM:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

With LLVM (about the same):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.346 s ±  0.005 s    [User: 32.559 s, System: 1.227 s]
  Range (min … max):    4.340 s …  4.353 s    5 runs

But now to the number of threads:

11 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

12 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.951 s ±  0.015 s    [User: 36.132 s, System: 1.523 s]
  Range (min … max):    3.936 s …  3.970 s    5 runs

10 threads (the number of cores, although 2 of them are not “performance cores”):

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.968 s ±  0.013 s    [User: 35.878 s, System: 1.449 s]
  Range (min … max):    3.954 s …  3.982 s    5 runs

14 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.991 s ±  0.012 s    [User: 36.293 s, System: 1.555 s]
  Range (min … max):    3.979 s …  4.004 s    5 runs

16 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.029 s ±  0.012 s    [User: 36.438 s, System: 1.589 s]
  Range (min … max):    4.020 s …  4.051 s    5 runs

20 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.100 s ±  0.037 s    [User: 36.761 s, System: 1.631 s]
  Range (min … max):    4.054 s …  4.142 s    5 runs

8 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

200 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      6.006 s ±  0.109 s    [User: 41.527 s, System: 3.630 s]
  Range (min … max):    5.921 s …  6.156 s    5 runs

I think embarrassingly parallel (EP) workloads are meant to use data-parallel (DPH-style) features like what Repa does; forkIO spawns a green thread that has to support all sorts of things we usually don’t consider part of EP work. For example, a GPU at its core is usually just a 32- or 64-lane SIMD processor [1], which means that misuse of if statements can lead to worse performance than just computing both sides of the if, all instructions execute in lockstep, and a memory access for one “thread” will stall all 64 of them - things we don’t have to worry about with green threads.

The N>4 issue might be something else: the stm-containers package mentions it can scale to 8 threads with ease. I also have a hunch that modern computers can sort of brute-force up to about 4 CPUs (which is why Intel was king with 4 cores for a decade) - L2 cache is only about as fast as half a DDR4 channel, meaning you can saturate a dual-channel system with 4 cores.

I’ve tried to stick specifically to Haskell-style code, and the fact that it’s orders of magnitude slower than C sucks, as per @emiruz’s comments. What’s even worse is that when you look at other functional languages, most notably Lisp, the versions with C-like performance are idiomatic, functional-style code. [2]

@AndrasKovacs how does your single-threaded version perform in comparison to mine? I’m not sure I understand why mine is so slow compared to many of the solutions here; a strict map and fold shouldn’t be that much slower.

I don’t think I’ve seen anyone implement a pure parallel version here using par. Maybe if I find some more time I’ll have a crack at par, stm-containers, and something using DPH, and also fix up some of the code to use bound threads to prevent cache misses etc.
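For what a par-based pure version might look like, here is a minimal sketch using the Strategies wrapper around par, with a made-up Summary type rather than anything from this thread:

import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.Map.Strict as M

-- Hypothetical per-station result: (min, max, sum, count).
type Summary = M.Map String (Int, Int, Int, Int)

-- Each chunk is summarised as an independent spark, then the pure results
-- are merged; no forkIO and no shared mutable state.
summariseChunks :: (String -> Summary) -> [String] -> Summary
summariseChunks summarise chunks =
  foldr (M.unionWith merge) M.empty (parMap rdeepseq summarise chunks)
  where
    merge (lo, hi, s, n) (lo', hi', s', n') =
      (min lo lo', max hi hi', s + s', n + n')

Using rdeepseq as the strategy also sidesteps the usual trap where the sparks only evaluate to WHNF and leave the real work as thunks.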

Something I’ve been wondering about is whether we can use the type system to validate that we have no overlap when delegating reads from the file. A correct implementation would be fairly easy to write, but that’s not why we use Haskell, eh?

[1]: and end up with absurd marketing claims like 3000 “stream” processors
[2]: Lisp SBCL #2 n-body - Which programs are fastest? (Benchmarks Game)


…and strict by default. So you’re the one who’s dealing with (more of) the tedious and boring jobs, such as determining the correct order of evaluation:

We get it already, you like Haskell’s laziness.

But:

Adding strictness makes the version (minimally, but consistently) faster on my computer - at least with the load it has right now (a VM and many other things running), which is why the times are longer than before:

Original version:

Benchmark 1: ./exe-exe  > solution.txt
  Time (mean ± σ):      4.307 s ±  0.021 s    [User: 37.089 s, System: 2.015 s]
  Range (min … max):    4.276 s …  4.328 s    5 runs

With {-# LANGUAGE Strict #-}:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.269 s ±  0.023 s    [User: 36.972 s, System: 1.955 s]
  Range (min … max):    4.247 s …  4.305 s    5 runs

…but I wonder if I’m now in the minority here :-/
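For anyone who hasn’t used it: -XStrict makes every binding in the module strict unless you opt back into laziness with ~. A tiny sketch (not taken from any solution here):

{-# LANGUAGE Strict #-}
module Sum where

-- Under -XStrict the accumulator is forced on every call, so this left
-- fold never builds a chain of (+) thunks; without the extension you would
-- need an explicit bang pattern, as in go !acc.
sumLoop :: [Int] -> Int
sumLoop = go 0
  where
    go acc []       = acc
    go acc (x : xs) = go (acc + x) xs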


I could be mistaken…but I think you’re the first person on this thread to make use of -XStrict. It would be interesting to see if it makes a difference for the other/older solutions given here: if it works, then perhaps it could help to solve sgraf’s third problem:

That’s not my program; I’m just running (almost?) all of them for benchmarks. I actually tried it on my own code some time ago and it was slower back then, although that might have changed, and the latest version might benefit.

That is incorrect. The Lisp SBCL #2 code is clean and doesn’t look like it cares too much about order, meanwhile I’m having to splatter strictness annotations around and really think about why my Haskell program is so slow. On top of that, we have to remember that modern C compilers are extremely good at re-ordering operations, and let’s not forget that modern CPUs’ performance comes from real-time re-ordering of instructions as well - half of a modern processor is dedicated to the dark art of running instructions that take 4 cycles each at 4 instructions per cycle.

At the end of the day, all the code generator has to do is build a dependency graph and spit out an ordering that the CPU (or µarch) will run fast in most cases. This is how software like FEX can run badly optimised legacy x86_32 code on an x86_64 processor faster than the processor can itself.

There’s definitely something else going on here that isn’t strictness vs. laziness. The fact that my pure functional implementation, with no use of forkIO, is faster with -qm passed to the RTS would hint there’s work needed in the scheduler.

That SBCL code seems rife with destructive updates (…or maybe I just can’t read Common Lisp). What makes it more idiomatic than the faster Haskell GHC #2 solution? n-body Haskell GHC #2 program (Benchmarks Game)

TBH I’m not familiar enough with CL to know exactly what’s going on in that code, nor am I saying that the Lisp version is without mutation or anything like that.

What I’m saying is that the Lisp version is closer to the ideal of what I would want to write in Haskell. Meanwhile the fast Haskell version is basically ugly C: there are huge amounts of direct pointer use, a Storable instance, and a heavy splattering of peek/poke - that’s worse than having a few mutable vars in the Lisp code.

We still have to think about ordering in the Haskell version, especially when it comes to performance code (as every example here proves). So the notion that you don’t have to think about ordering when writing Haskell is wrong - we have libraries like Pipes/Conduit that are designed to help, and we also write slower functional-style code in other languages.
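For a sense of what that pointer-level style looks like (purely illustrative; not taken from any solution in this thread):

import Data.Word (Word8)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff)

-- Scan forward from a raw pointer until a ';' byte (0x3B) is found and
-- return its offset: the kind of low-level loop the fast versions lean on.
findSemicolon :: Ptr Word8 -> Int -> IO Int
findSemicolon ptr = go
  where
    go i = do
      b <- peekByteOff ptr i :: IO Word8
      if b == 0x3B then pure i else go (i + 1)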


I thought the reason those compilers could do that was largely due to the input language being a “semantics-free zone” (sometimes known as a “dumpster fire”), which is another reason why I’m annoyed by performance comparisons with that ol’ warhorse…anyone for a Haskell-Clean comparison?


…but this clearly suggests otherwise:

To obtain the correct results, those mutable variables must be used in the correct order: this is why the encapsulated-state type ST s a is monadic.
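A small sketch of that point (not from any solution here): the monadic bind is what pins down the order of the two updates below, and swapping them would change the answer.

import Control.Monad.ST (runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef, writeSTRef)

-- The result is 10 only because the write runs before the multiply;
-- ST's bind is what guarantees that ordering.
orderMatters :: Int
orderMatters = runST $ do
  r <- newSTRef 0
  writeSTRef r 1
  modifySTRef' r (* 10)
  readSTRef r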


For Haskell code based on visible effects, that is correct - as you subsequently noted:

we have libraries like Pipes/Conduit that are designed to help […]

But ordinary (effect-free) Haskell code ought to be capable of being evaluated in any order, just as long as the data dependencies are preserved. This form of implicit parallelism works for SQL, so it should also work for Haskell. So it not (quite) working like that at the moment just means more work needs to be directed at the implementation; something which you’ve also noted:

I parallelized and refactored the previous version: 1brc · GitHub

Parallel performance is very erratic on my laptop; there’s a 2-3x difference between runs. I’m testing 500M inputs because I don’t have enough RAM to hold the 1B file in memory, and while I don’t run out of memory with mmap, apparently I get some IO bottleneck. The best I was able to do was 1.9s for 500M. The number of threads can be set with +RTS -N.


I’ve seen similar erratic behaviour with my Fortran version on machines with SATA SSDs, which have a sequential read speed of about 560MB/s. I suspected it was because the mmap buffer is being consumed quicker than that, which causes strange IO signalling effects leading to arbitrary delays. I suspect those with SSDs on a PCIe interface won’t see this problem, since the bandwidth is about 10x SATA. Does this correspond to your experience?

For me the results are relatively consistent, but a bit odd:

$ for i in {1..12}; do echo "---------------------"; echo "-N$i"; time cabal run --ghc-options="-threaded -fllvm" -O2 1brc.hs -- +RTS -N$i -RTS >/dev/null; done
---------------------
-N1

real	0m18,785s
user	0m18,035s
sys	0m0,755s
---------------------
-N2

real	0m18,784s
user	0m18,073s
sys	0m0,778s
---------------------
-N3

real	0m12,654s
user	0m18,013s
sys	0m0,810s
---------------------
-N4

real	0m9,574s
user	0m17,987s
sys	0m0,814s
---------------------
-N5

real	0m4,054s
user	0m17,944s
sys	0m0,838s
---------------------
-N6

real	0m6,524s
user	0m18,083s
sys	0m0,783s
---------------------
-N7

real	0m5,724s
user	0m18,379s
sys	0m0,822s
---------------------
-N8

real	0m5,694s
user	0m19,745s
sys	0m0,851s
---------------------
-N9

real	0m6,604s
user	0m18,359s
sys	0m0,799s
---------------------
-N10

real	0m4,694s
user	0m21,872s
sys	0m0,881s
---------------------
-N11

real	0m4,434s
user	0m23,389s
sys	0m0,890s
---------------------
-N12

real	0m2,664s
user	0m26,655s
sys	0m0,980s

So it takes only 2.664 seconds for the full 1 billion rows with -N12, which is the number of hyperthreads my CPU has. If I make -N larger there are no more gains, as expected, although it also doesn’t seem to get slower.

The strange thing is that -N2 gives pretty much the same time as -N1 (and the same for some other numbers).