Our single-threaded version earlier in this thread had the same speed as straightforward C solutions.
Ah good. At more than 170 comments it’s hard to keep track of everything! (And Discourse doesn’t even load the whole page at once – it uses “progressive” (or “regressive”?) loading.)
Yes, the “fastest” C solution above is mostly about 256-bit SIMD.
But I would prefer not to have to rely on (T)TH for performance; GHC is unbearably slow already.
I think (T)TH being slow is not inherent to the approach; it’s just that its specification/implementation needs love. I would even say that good support for staging should yield a much faster compiler, because we could dial down the number of Simplifier runs and their complexity.
Hrm…to choose between “library bloat” (having several similar libraries which only differ in their strictness properties) or ye ol’ code bloat (having several similar groups of definitions which only differ in certain other properties)? As noted in this presentation:
https://www.youtube.com/watch?v=2PxsyWqZ5dI
…just imagine if we had to choose between several similar memory management libraries which only differed in their specific properties: thankfully, we don’t! Instead, it’s a part of the implementation.
Suppose multi-threading by default:
https://github.com/ghc-proposals/ghc-proposals/pull/240
had already been implemented - would there really still be as much interest in “wringing every last bit” of performance out of the old single-threaded RTS? Or would we instead be:
- advising people on how to migrate their old “unthreaded” code to the new default;
- and discussing how to get the best multi-threaded performance out of the RTS?
It seems to me that you would merely be exchanging many short simplifier runs for a few much-longer simplifier runs, because of that extra complexity.
Oh I say, that’s a bit strong. deepseq is a tool with uses. I agree that in that specific problem the strict vector was ultimately the right solution. But in a more complex scenario, in which either you want some laziness somewhere in your data structures, or you don’t yet know which parts of your program are eating all of your memory with thunks and refactoring everything to be strict would be time-consuming, deepseq is an important tool to have at your disposal, even if just to try force-ing different things until you figure out what to refactor. (But sometimes invalid laziness has to be representable; Haskell’s type system is very good at letting you express invariants through types, but no type system is perfect at this.)
Certainly when you’re attempting to squeeze your thunks out by forcing everything through a show, as OP was, force is something you ought to be told about.
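For anyone following along, here is a minimal sketch of the kind of exploratory force-ing I mean (the names and the Map accumulator are made up for illustration): deeply evaluate a suspect structure at one point, re-run the heap profile, and see whether the leak moves.

import Control.DeepSeq (force)
import Data.List (foldl')
import qualified Data.Map.Lazy as M

-- Hypothetical accumulator: a lazy Map whose values can pile up as thunks,
-- because insertWith suspends the (+) until the value is demanded.
type Stats = M.Map String Int

step :: Stats -> (String, Int) -> Stats
step m (k, v) = M.insertWith (+) k v m

-- Exploratory fix: deeply evaluate the structure after every step and watch
-- the heap profile; if the leak disappears, we know what to make strict.
stepForced :: Stats -> (String, Int) -> Stats
stepForced m kv = force (step m kv)

main :: IO ()
main = print (foldl' stepForced M.empty [(show (i `mod` 10), i) | i <- [1 .. 100000 :: Int]])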
I updated my gist with @AndrasKovacs’s hash. Now it runs at about the same speed as yours, but uses fewer intrinsics and doesn’t have the short-string optimisation (only a “short string” hash).
For such a bulk task (parsing CSV at 1 GB/sec), not only does every memory access count, but every instruction does too. Using a hash that works on 64-bit words instead of bytes helps keep the instruction count down.
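To make the “64-bit words instead of bytes” point concrete, here is an illustrative FNV-1a-style loop (this is not the hash from the gist): folding in one Word64 at a time does one multiply and one xor per 8 input bytes instead of per byte.

{-# LANGUAGE BangPatterns #-}
import Data.Bits (xor)
import Data.Word (Word64)

-- Illustrative word-at-a-time hash: the standard 64-bit FNV-1a constants,
-- but consuming pre-chunked Word64s rather than individual bytes.
hashWords :: [Word64] -> Word64
hashWords = go 0xcbf29ce484222325
  where
    go !h []       = h
    go !h (w : ws) = go ((h `xor` w) * 0x100000001b3) ws

main :: IO ()
main = print (hashWords [0x6d656c626f75726e, 0x653b31322e330000])

A real parser would also need to mask the trailing partial word, which is where the fiddly part lives.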
deepseq is a time leak machine! and many other languages have it built into their spec
Correct.
Consider let x = head (return x) in 'a':

- “outside in” evaluation: the result is 'a', as it ought to be, since x isn’t needed here to obtain that result.
- “inside out” evaluation: the result will always be ⊥, no matter how much memory or how many threads are available.
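A small GHC-checkable version of the same point; the seq line simulates what a strict-by-default (“inside out”) language would effectively have to do:

-- Lazily, x is never demanded, so this terminates with 'a'.
lazyResult :: Char
lazyResult = let x = head (return x) in 'a'

-- Demanding x first never produces a value, because x = head [x] = x is a
-- cycle with no result (GHC’s runtime typically reports <<loop>>).
strictResult :: Char
strictResult = let x = head (return x) in x `seq` 'a'

main :: IO ()
main = print lazyResult   -- prints 'a'; printing strictResult would not return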
To alleviate this fundamental problem with “inside out” evaluation, Eager Haskell would periodically and completely restart (reduction of) the program. This “re-scorched earth” style had two competing requirements:
- avoiding too many restarts, so that the program can actually run at all;
- avoiding too few restarts, so that the program won’t do too much unnecessary work (like trying to evaluate x = head (return x) or other similarly-cyclic definitions).
…a “delicate balancing act”.
We’ve gone off topic, but I invite anyone who sneers at deepseq to take a look at Fix space leak between modules during compilation by MonoidMusician · Pull Request #4517 · purescript/purescript · GitHub and recommend a better way to address that problem. We weren’t happy with the deep infiltration of NFData into our types and would love to hear from someone who can clearly see a less invasive approach.
(ahem) …speaking for myself here: I wasn’t so much “sneering” at (the use of) entities like deepseq, but at a prevailing impression that they (unfortunately) seem to accentuate:
- that laziness/non-strict semantics are now more of an impediment than an ongoing benefit to Haskell,
- and that little if anything would be lost if Haskell were strict by default.
(Again, this is merely how I’m perceiving the current situation).
So it is to that impression that I say:
- did someone solve the halting problem?
- or is Haskell going to eventually be total as well as dependent?
Because without either of those, being strict by default generally means more programs won’t produce a useful result at all. Therefore definitions like deepseq (or compel, as I mentioned here), along with other “strictness modifiers”, should ideally be considered “features of last resort” because of their potential for non-termination.
So how can we get to that ideal? This looks promising:
…if the multi-threaded RTS does have to be overhauled to solve the “N > 4” problem observed by @chreekat and others, can any of Robert Ennals’s research be reused as well, so that comments like:
…don’t end up being treated as basic advice when using Haskell?
The result first: 3.9s!
11 threads (the number of cores is 10, although 2 of them are not “performance cores”):
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 3.934 s ± 0.007 s [User: 36.062 s, System: 1.502 s]
Range (min … max): 3.925 s … 3.943 s 5 runs
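(For anyone reproducing this: the thread count isn’t visible in the hyperfine output above. Assuming the executable was built with -threaded, I take it the count is being set with the usual GHC RTS flag, i.e. something along the lines of:)

hyperfine -w 1 -r 5 './exe-exe +RTS -N11 -RTS > solution.txt'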
On my computer, -fllvm is still not faster - but for your version at least not slower - than not using it.
Without LLVM:
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.344 s ± 0.009 s [User: 32.554 s, System: 1.222 s]
Range (min … max): 4.333 s … 4.355 s 5 runs
With LLVM (about the same):
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.346 s ± 0.005 s [User: 32.559 s, System: 1.227 s]
Range (min … max): 4.340 s … 4.353 s 5 runs
But now to the number of threads:
11 threads:
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 3.934 s ± 0.007 s [User: 36.062 s, System: 1.502 s]
Range (min … max): 3.925 s … 3.943 s 5 runs
12 threads:
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 3.951 s ± 0.015 s [User: 36.132 s, System: 1.523 s]
Range (min … max): 3.936 s … 3.970 s 5 runs
10 threads (the number of cores, although 2 of them are not “performance cores”):
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 3.968 s ± 0.013 s [User: 35.878 s, System: 1.449 s]
Range (min … max): 3.954 s … 3.982 s 5 runs
14 threads:
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 3.991 s ± 0.012 s [User: 36.293 s, System: 1.555 s]
Range (min … max): 3.979 s … 4.004 s 5 runs
16 threads:
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.029 s ± 0.012 s [User: 36.438 s, System: 1.589 s]
Range (min … max): 4.020 s … 4.051 s 5 runs
20 threads:
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.100 s ± 0.037 s [User: 36.761 s, System: 1.631 s]
Range (min … max): 4.054 s … 4.142 s 5 runs
8 threads:
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.344 s ± 0.009 s [User: 32.554 s, System: 1.222 s]
Range (min … max): 4.333 s … 4.355 s 5 runs
200 threads:
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 6.006 s ± 0.109 s [User: 41.527 s, System: 3.630 s]
Range (min … max): 5.921 s … 6.156 s 5 runs
I think embarrassingly parallel (EP) workloads are meant to use data-parallel features (DPH-style, like what Repa does). forkIO spawns a green thread that has to support all sorts of things we usually don’t associate with EP work. For example, a GPU at its core is usually just a 32- or 64-lane-wide SIMD processor [1], which means that misuse of if statements can lead to worse performance than just computing both sides of the if; all instructions are executed in lock-step, and a memory access for one “thread” will stall all 64 of them. These are things we don’t have to worry about with green threads.
The N > 4 issue might be something else. The stm-containers package mentions it can scale to 8 threads with ease. I also have a hunch that modern computers can sort of brute-force up to about 4 CPUs (which is why Intel was king with 4 cores for a decade): L2 cache is only about as fast as half a DDR4 channel, meaning you can saturate a dual-channel system with 4 cores.
I’ve specifically tried to stick to Haskell-style code, and it being orders of magnitude slower than C sucks, as per @emiruz’s comments. What’s even worse is that when you look at other functional languages, most notably Lisp, the versions with C-like performance are idiomatic, functional-style code. [2]
@AndrasKovacs how does your single-threaded version perform in comparison to mine? I’m not sure I understand why mine is so slow compared to many of the solutions here; a strict map and fold shouldn’t be that much slower.
I don’t think I’ve seen anyone here implement a pure parallel version using par. Maybe if I find some more time I’ll have a crack at par, stm-containers, and something using DPH, and also fix up some of the code to use bound threads to prevent cache misses etc.
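To make the par idea concrete, here is a minimal sketch using Control.Parallel.Strategies (the chunking and the per-chunk summary are placeholders, not tuned for this problem): each chunk is reduced in its own spark, then the partial results are merged sequentially.

import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Placeholder per-chunk work: summarise one chunk of already-split records.
summarise :: [(String, Int)] -> M.Map String Int
summarise = foldl' (\m (k, v) -> M.insertWith (+) k v m) M.empty

-- Pure parallelism: parMap sparks the reduction of each chunk, and rdeepseq
-- makes sure each partial map is fully evaluated inside its spark.
summariseParallel :: [[(String, Int)]] -> M.Map String Int
summariseParallel chunks =
  foldl' (M.unionWith (+)) M.empty (parMap rdeepseq summarise chunks)

main :: IO ()
main = print (summariseParallel [[("a", 1), ("b", 2)], [("a", 3)]])

(Built with -threaded and run with +RTS -N, of course; whether the sparks actually pay for themselves on this workload is exactly what I’d want to measure.)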
Something I’ve been wondering about: can we use the type system to validate that we have no overlap when delegating reads from the file? A correct implementation would be fairly easy to write, but that’s not why we use Haskell, eh?
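As a very rough sketch of what I mean (entirely hypothetical, and only via a smart constructor rather than anything fancier): if chunks can only be obtained from a splitter that produces disjoint, contiguous ranges, then no consumer can ever hold overlapping reads.

module Chunks (Chunk, chunkOffset, chunkLength, splitFile) where

-- A byte range into the file. The constructor is not exported, so the only
-- way to obtain a Chunk is splitFile, which by construction yields disjoint,
-- contiguous ranges covering [0, total).
data Chunk = Chunk { chunkOffset :: Int, chunkLength :: Int }
  deriving Show

splitFile :: Int   -- total file size in bytes
          -> Int   -- number of chunks
          -> [Chunk]
splitFile total n =
  [ Chunk off (end - off)
  | i <- [0 .. n - 1]
  , let off = i * total `div` n
        end = (i + 1) * total `div` n
  , end > off
  ]

Real code would additionally have to snap each boundary to the next newline, which is where the interesting part of the invariant lives.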
[1]: and end up with absurd marketing claims like 3000 “stream” processors
[2]: Lisp SBCL #2 n-body - Which programs are fastest? (Benchmarks Game)
…and strict by default. So you’re the one who’s dealing with (more of) the tedious and boring jobs, such as determining the correct order of evaluation:
We get it already: you like Haskell’s laziness.
But:
Adding Strictness makes the version (minimally, but consistently) faster on my computer, at least with the load it has right now (a VM and many other things are running, which is why the times are longer than before):
Original version:
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.307 s ± 0.021 s [User: 37.089 s, System: 2.015 s]
Range (min … max): 4.276 s … 4.328 s 5 runs
With {-# LANGUAGE Strict #-}:
Benchmark 1: ./exe-exe > solution.txt
Time (mean ± σ): 4.269 s ± 0.023 s [User: 36.972 s, System: 1.955 s]
Range (min … max): 4.247 s … 4.305 s 5 runs
…but I wonder if I’m now in the minority here :-/
I could be mistaken…but I think you’re the first person on this thread to make use of -XStrict. It would be interesting to see if it makes a difference for the other/older solutions given here: if it works, then perhaps it could help to solve sgraf’s third problem:
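For anyone who hasn’t tried it: Strict only affects bindings defined in the module it is switched on in, roughly as if every pattern and let binding had a bang on it, so my (hedged) guess is that it mostly matters for accumulator bindings in the hot loop. A tiny illustration:

{-# LANGUAGE Strict #-}
-- Under -XStrict the let-bound acc' behaves as if written !acc', so the
-- running sum is evaluated on every iteration instead of building a chain
-- of thunks.
sumLoop :: [Int] -> Int
sumLoop = go 0
  where
    go acc []       = acc
    go acc (x : xs) =
      let acc' = acc + x
      in go acc' xs

main :: IO ()
main = print (sumLoop [1 .. 1000000])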
That’s not my program; I’m just running them (almost?) all for the benchmarks. I actually tried it on my own code some time ago and it was slower, although that might have changed, and the latest version might benefit.
That is incorrect. The Lisp SBCL #2 code is clean and doesn’t look like it cares much about order, while I’m having to splatter strictness annotations around and really think about why my Haskell program is so slow. On top of that, we have to remember that modern C compilers are extremely good at reordering operations, and let’s not forget that modern CPUs’ performance comes from real-time reordering of instructions as well: half of a modern processor is dedicated to the dark art of running instructions that take 4 cycles each at 4 instructions per cycle.
At the end of the day, all the code generator has to do is build a dependency graph and spit out an ordering that the CPU (or uarch) will run fast in most cases. This is how software like FEX can run badly optimised legacy x86_32 code on an x86_64 processor faster than the processor itself can.
There’s definitely something else going on here that isn’t strict vs. lazy. The fact that my pure functional implementation, with no use of forkIO, is faster with -qm passed to the RTS would hint that there’s work needed in the scheduler.
That SBCL code seems rife with destructive updates (…or maybe I just can’t read Common Lisp). What makes it more idiomatic than the faster Haskell GHC #2 solution? n-body Haskell GHC #2 program (Benchmarks Game)
TBH I’m not familiar enough with CL to know exactly what is going on in the code, nor am I saying that the Lisp version is without mutation or anything like that.
What I’m saying is that the Lisp version is closer to the ideal of what I would want to write in Haskell. Meanwhile the fast Haskell version is basically ugly C: there is a huge amount of direct pointer use, a Storable instance, and a heavy splattering of peek/poke, which is worse than having a few mutable vars in the Lisp code.
We still have to think about ordering in the Haskell version, especially when it comes to performance code (as every example here proves). So the notion that when writing Haskell you do not have to think about ordering is wrong; we have libraries like Pipes/Conduit that are designed to help, and we also end up writing slower, functional-style code in other languages.