One Billion Row Challenge in Haskell

Yes, the "fastest" C solution above is mostly about 256-bit SIMD.

But I would prefer not to have to rely on (T)TH for performance; GHC is unbearably slow already.

1 Like

I think (T)TH being slow is not inherent to the approach; it's just that its specification/implementation needs love. I would even say that good support for staging should yield a much faster compiler, because we could dial down the number of Simplifier runs and their complexity.
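For instance, the classic staged power function does all of its unrolling at splice time, so the generated code needs no Simplifier heroics at all (a minimal sketch using typed quotes; assumes GHC >= 9.0 / template-haskell 2.17 for Code):

{-# LANGUAGE TemplateHaskell #-}
module Power where

import Language.Haskell.TH.Syntax (Code, Q)

-- The recursion runs while compiling the splice site; the residual
-- program is just  x * x * ... * 1, with no loop left to optimise.
power :: Int -> Code Q Int -> Code Q Int
power 0 _ = [|| 1 ||]
power n x = [|| $$x * $$(power (n - 1) x) ||]

-- Used from another module:  cube y = $$(power 3 [|| y ||])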

4 Likes

Hrm…to choose between "library bloat" (having several similar libraries which only differ in their strictness properties), or ye ol' code bloat (having several similar groups of definitions which only differ in certain other properties)? As noted in this presentation:

https://www.youtube.com/watch?v=2PxsyWqZ5dI

…just imagine if we had to choose between several similar memory-management libraries which only differed in their specific properties: thankfully, we don't! Instead, it's a part of the implementation.


Suppose multi-threading by default:

https://github.com/ghc-proposals/ghc-proposals/pull/240

had already been implemented - would there really still be as much interest in "wringing every last bit" of performance out of the old single-threaded RTS? Or would we be:

  • advising people on how to migrate their old "unthreaded" code to the new default;

  • and discussing how to get the best multi-threaded performance out of the RTS?


It seems to me that you would merely be exchanging many short Simplifier runs for a few much longer ones, because of that extra complexity.

Oh I say, that's a bit strong. deepseq is a tool with uses, and I agree that in that specific problem the strict vector was ultimately the right solution. But consider a more complex scenario, where you sometimes want laziness in your data structures, or where you don't yet know which parts of your program are eating all of your memory with thunks and refactoring everything to be strict would be time-consuming. There, deepseq is an important tool to have at your disposal, even if just to try force-ing different things until you figure out which parts to refactor. (And sometimes invalid laziness has to be representable; Haskell's type system is very good at letting you express invariants through types, but no type system is perfect at this.)

Certainly when you're attempting to squeeze your thunks out by forcing everything through a show, as the OP was, force is something you ought to be told about.
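For example (a contrived sketch, not the OP's code): lazy map values are a classic thunk farm, and force at a checkpoint is a cheap way to test whether they are the culprit before committing to a strictness refactor:

import Control.DeepSeq (force)
import qualified Data.Map.Lazy as M

main :: IO ()
main = do
  -- Each insertWith (+) leaves an unevaluated (+)-chain per key.
  let m = foldl (\acc k -> M.insertWith (+) (k `mod` 10) (1 :: Int) acc)
                M.empty
                [1 .. 1000000 :: Int]
  -- Deep-evaluating here collapses those chains; if the heap usage
  -- drops, these were the thunks worth refactoring away.
  print (M.toList (force m))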

I updated my gist with @AndrasKovacs' hash. Now it runs at about the same speed as yours, but uses fewer intrinsics and doesn't have the short-string optimisation (only a "short string" hash).

For such a bulk task (parsing CSV at 1 GB/s), not only does every memory access count, but every instruction does too. Using a hash that works on 64-bit words instead of bytes means fewer instructions.
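To illustrate the word-at-a-time idea (a hedged sketch of an FNV-1a-style variant, not the actual hash from the gist - a real implementation would peek unaligned Word64s directly instead of repacking bytes):

{-# LANGUAGE BangPatterns #-}
import Data.Bits (shiftL, xor, (.|.))
import Data.Word (Word64)
import qualified Data.ByteString as B

-- One multiply/xor step per 8 input bytes, so a typical short
-- station name costs 1-3 steps instead of 8-24.
hash64 :: B.ByteString -> Word64
hash64 = go 0xcbf29ce484222325
  where
    go !h bs
      | B.null bs = h
      | otherwise =
          let (chunk, rest) = B.splitAt 8 bs
          in go ((h `xor` word64LE chunk) * 0x100000001b3) rest

-- Pack up to 8 bytes into a little-endian Word64.
word64LE :: B.ByteString -> Word64
word64LE = B.foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0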

2 Likes

deepseq is a time leak machine! and many other languages have it built into their spec :skull:

Correct.

Consider let x = head (return x) in 'a':

  • "outside in" evaluation: the result is 'a', as it ought to be, since x isn't needed here to obtain that result.

  • "inside out" evaluation: the result will always be ⊥, no matter how much memory or how many threads are available.
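You can check the first bullet directly - this compiles and prints 'a' under GHC:

-- x is cyclic (x = head [x]) but never demanded, so "outside in"
-- evaluation never touches it:
main :: IO ()
main = print (let x = head (return x) in 'a')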

To alleviate this fundamental problem with "inside out" evaluation, Eager Haskell would periodically and completely restart (reduction of) the program. This "re-scorched earth" style had two competing requirements:

  • avoiding too many restarts, so that the program can actually run at all;

  • avoiding too few restarts, so the program won't do too much unnecessary work (like trying to evaluate x = head (return x) or other similarly cyclic definitions).

…a "delicate balancing act".

We've gone off topic, but I invite anyone who sneers at deepseq to take a look at Fix space leak between modules during compilation by MonoidMusician · Pull Request #4517 · purescript/purescript · GitHub and recommend a better way to address that problem. We weren't happy with the deep infiltration of NFData into our types and would love to hear from someone who can clearly see a less invasive approach.

(ahem) …speaking for myself here: I wasn't so much "sneering" at (the use of) entities like deepseq as at a prevailing impression that they (unfortunately) seem to accentuate:

  • that laziness/non-strict semantics are now more of an impediment than an ongoing benefit to Haskell,

  • and little if anything would be lost if Haskell were strict by default.

(Again, this is merely how I'm perceiving the current situation.)

So it is to that impression that I say:

  • did someone solve the halting problem?

  • or is Haskell going to eventually be total as well as dependent?

Because without either of those, being strict by default generally means more programs won't produce a useful result at all. Therefore definitions like deepseq (or compel, as I mentioned here), along with other "strictness modifiers", should ideally be considered "features of last resort" because of their potential for non-termination.

So how can we get to that ideal? This looks promising:

…if the multi-threaded RTS does have to be overhauled to solve the "N > 4" problem observed by @chreekat and others, can any of Robert Ennals' research be reused as well, so that comments like:

…don't end up being treated as basic advice when using Haskell?

1 Like

The result first: 3.9s!

11 threads (the number of cores is 10, although 2 of them are not "performance cores"):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

On my computer, -fllvm is still not faster than not using it - though for your version it is at least not slower.

Without LLVM:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

With LLVM (about the same):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.346 s ±  0.005 s    [User: 32.559 s, System: 1.227 s]
  Range (min … max):    4.340 s …  4.353 s    5 runs

But now to the number of threads:
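(Presumably each run below sets the capability count via the usual RTS flag - assuming the binary was built with -rtsopts - e.g.:)

hyperfine -w 1 -r 5  './exe-exe +RTS -N11 -RTS > solution.txt'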

11 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

12 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.951 s ±  0.015 s    [User: 36.132 s, System: 1.523 s]
  Range (min … max):    3.936 s …  3.970 s    5 runs

10 threads (the number of cores, although 2 of them are not "performance cores"):

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.968 s ±  0.013 s    [User: 35.878 s, System: 1.449 s]
  Range (min … max):    3.954 s …  3.982 s    5 runs

14 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.991 s ±  0.012 s    [User: 36.293 s, System: 1.555 s]
  Range (min … max):    3.979 s …  4.004 s    5 runs

16 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.029 s ±  0.012 s    [User: 36.438 s, System: 1.589 s]
  Range (min … max):    4.020 s …  4.051 s    5 runs

20 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.100 s ±  0.037 s    [User: 36.761 s, System: 1.631 s]
  Range (min … max):    4.054 s …  4.142 s    5 runs

8 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

200 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      6.006 s ±  0.109 s    [User: 41.527 s, System: 3.630 s]
  Range (min … max):    5.921 s …  6.156 s    5 runs
1 Like

I think embarrassingly parallel (EP) workloads are meant to use data-parallel features (DPH, or what Repa does); forkIO spawns a green thread that has to support all sorts of things we usually don't need for EP work. For example, a GPU at its core is usually just a 32- or 64-lane-wide SIMD processor [1], which means that misuse of if statements can lead to worse performance than just computing both sides of the if: all instructions execute in lock-step, and a memory access for one "thread" stalls all 64 of them. These are things we don't have to worry about with green threads.

The N>4 issue might be something else; the stm-containers package mentions it can scale to 8 threads with ease. I also have a hunch that modern computers can sort of brute-force up to about 4 CPUs (which is why Intel was king with 4 cores for a decade): L2 cache is only about as fast as half a DDR4 channel, meaning you can saturate a dual-channel system with 4 cores.

I've tried to stick specifically to Haskell-style code, and it being orders of magnitude slower than C sucks, as per @emiruz's comments. What's even worse is that when you look at other functional languages, most notably Lisp, the versions with C-like performance are idiomatic, functional style. [2]

@AndrasKovacs how does your single-threaded version perform in comparison to mine? I'm not sure I understand why mine is so slow compared to many of the solutions here; a strict map and fold shouldn't be that much slower.

I don't think I've seen anyone here implement a pure parallel version using par. Maybe if I find some more time I'll have a crack at par, stm-containers, and something using DPH, and also fix up some of the code to use bound threads to prevent cache misses etc.
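For reference, a pure par-based pipeline could be as small as this (a sketch on top of parMap from the parallel package; processChunk is a hypothetical stand-in for the per-chunk parser):

import Control.Parallel.Strategies (parMap, rdeepseq)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as M

-- Per-station running summary: (min, max, sum, count).
type Stats = M.Map BC.ByteString (Double, Double, Double, Int)

-- parMap sparks one evaluation per chunk (it is par under the hood);
-- the partial maps are then merged - no forkIO involved.
summarise :: (BC.ByteString -> Stats) -> [BC.ByteString] -> Stats
summarise processChunk chunks =
  foldr (M.unionWith merge) M.empty (parMap rdeepseq processChunk chunks)
  where
    merge (mn1, mx1, s1, n1) (mn2, mx2, s2, n2) =
      (min mn1 mn2, max mx1 mx2, s1 + s2, n1 + n2)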

Something I've been wondering about: can we use the type system to validate that we have no overlap when delegating reads from the file? A correct implementation would be fairly easy to write, but that's not why we use Haskell, eh?
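One modest version of that idea (a sketch: not a type-level proof, just making overlapping chunks unrepresentable by hiding the constructor behind a total splitter):

module Chunks (Chunk, offset, size, splitEvenly) where

-- The constructor is not exported, so the only way to obtain Chunks
-- is splitEvenly, which yields contiguous, pairwise-disjoint pieces
-- covering [0, total) by construction.
data Chunk = Chunk { offset :: Int, size :: Int }

splitEvenly :: Int -> Int -> [Chunk]
splitEvenly nWorkers total =
  [ Chunk (i * q + min i r) (q + if i < r then 1 else 0)
  | i <- [0 .. nWorkers - 1] ]
  where
    (q, r) = total `divMod` nWorkers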

[1]: and end up with absurd marketing claims like 3000 "stream" processors
[2]: Lisp SBCL #2 n-body - Which programs are fastest? (Benchmarks Game)

3 Likes

…and strict by default. So you're the one who's dealing with (more of) the tedious and boring jobs, such as determining the correct order of evaluation:

We get it already: you like Haskell's laziness.

But:

Adding strictness makes the version (minimally, but consistently) faster on my computer, at least with the load it has right now (a VM and many other things running; that's why the times are longer than before):

Original version:

Benchmark 1: ./exe-exe  > solution.txt
  Time (mean ± σ):      4.307 s ±  0.021 s    [User: 37.089 s, System: 2.015 s]
  Range (min … max):    4.276 s …  4.328 s    5 runs

With {-# LANGUAGE Strict #-}:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.269 s ±  0.023 s    [User: 36.972 s, System: 1.955 s]
  Range (min … max):    4.247 s …  4.305 s    5 runs
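(For context, -XStrict simply makes every binding in the module strict by default, as if banged - a minimal illustration:)

{-# LANGUAGE Strict #-}
module StrictDemo where

-- With Strict enabled, both where-bindings behave as  !s  and  !n,
-- so no thunks accumulate for them in a hot loop.
mean :: [Double] -> Double
mean xs = s / n
  where
    s = sum xs
    n = fromIntegral (length xs)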

…but I wonder if I'm now in the minority here :-/


I could be mistaken…but I think you're the first person on this thread to make use of -XStrict. It would be interesting to see if it makes a difference for the other/older solutions given here: if it works, then perhaps it could help to solve sgraf's third problem:

That's not my program; I'm just running them (almost?) all for benchmarks. I actually tried it myself some time ago and it was slower, although that might have changed - the latest version might benefit.

That is incorrect. The Lisp SBCL #2 code is clean and doesn't look like it cares much about order, while I'm having to scatter strictness annotations and think hard about why my Haskell program is so slow. On top of that, we have to remember that modern C compilers are extremely good at reordering operations, and modern CPUs get much of their performance from reordering instructions in real time as well: half of a modern processor is dedicated to the dark art of running instructions that each take 4 cycles, at 4 instructions per cycle.

At the end of the day, all the code generator has to do is build a dependency graph and spit out an ordering that the CPU (or microarchitecture) will run fast in most cases. This is how software like FEX can run badly optimised legacy x86_32 code on an x86_64 processor faster than the processor itself can.

There's definitely something else going on here that isn't strict vs. lazy. The fact that my pure functional implementation, with no use of forkIO, is faster with -qm passed to the RTS hints that there's work needed in the scheduler.

That SBCL code seems rife with destructive updates (…or maybe I just can't read Common Lisp). What makes it more idiomatic than the faster Haskell GHC #2 solution: n-body Haskell GHC #2 program (Benchmarks Game)

TBH I'm not familiar enough with CL to know exactly what's going on in that code, nor am I saying that the Lisp version is without mutation or anything like that.

What I'm saying is that the Lisp version is closer to the ideal of what I would want to write in Haskell. Meanwhile, the fast Haskell version is basically ugly C: there's a huge amount of direct pointer use, a Storable instance, and a heavy splattering of peek/poke. That's worse than having a few mutable vars in the Lisp code.

We still have to think about ordering in the Haskell version, especially when it comes to performance code (as every example here proves). So the notion that you don't have to think about ordering when writing Haskell is wrong: we have libraries like Pipes/Conduit that are designed to help, and we also write slower, functional-style code in other languages.

1 Like

I thought the reason those compilers could do that was largely due to the input language being a "semantics-free zone" (sometimes known as a "dumpster fire"), which is another reason why I'm annoyed by performance comparisons with that ol' warhorse…anyone for a Haskell-Clean comparison?


…but this clearly suggests otherwise:

To obtain the correct results, those mutable variables must be used in the correct order: this is why the encapsulated-state type ST s a is monadic.
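A two-line demonstration - swap the two writes and the result changes, which is exactly the ordering the monad pins down:

import Control.Monad.ST (runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

counterDemo :: Int   -- evaluates to 2
counterDemo = runST $ do
  r <- newSTRef (0 :: Int)
  modifySTRef' r (+ 1)   -- swapping this line with the next
  modifySTRef' r (* 2)   -- would yield 1 instead of 2
  readSTRef r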


For Haskell code based on visible effects, that is correct - as you subsequently noted:

we have libraries like Pipes/Conduit that are designed to help […]

But ordinary (effect-free) Haskell code ought to be capable of being evaluated in any order, just as long as the data dependencies are preserved. This form of implicit parallelism works for SQL, so it should also work for Haskell. That it doesn't (quite) work like that at the moment just means more work needs to be directed at the implementation; something which you've also noted:

I parallelized and refactored the previous version: 1brc · GitHub

Parallel performance is very erratic on my laptop; there's a 2-3x difference between runs. I'm testing 500M inputs because I don't have enough RAM to hold the 1B-row file in memory, and while I don't run out of memory with mmap, I apparently hit some IO bottleneck. The best I was able to do was 1.9s for 500M. The number of threads can be set with +RTS -N.

2 Likes