Yes, the "fastest" C solution above is mostly about 256-bit SIMD.
But I would prefer not to have to rely on (T)TH for performance; GHC is unbearably slow already.
I think (T)TH being slow is not inherent to the approach; it's just that its specification/implementation needs love. I would even say that good support for staging should yield a much faster compiler, because we could dial down the number of Simplifier runs and its complexity.
Hrm…to choose between "library bloat" (having several similar libraries which only differ in their strictness properties), or ye ol' code bloat (having several similar groups of definitions which only differ in certain other properties)? As noted in this presentation:
https://www.youtube.com/watch?v=2PxsyWqZ5dI
…just imagine if we had to choose between several similar memory-management libraries which only differed in their specific properties: thankfully, we don't! Instead, it's a part of the implementation.
Suppose multi-threading by default:
https://github.com/ghc-proposals/ghc-proposals/pull/240
had already been implemented - would there really still be as much interest in "wringing every last bit" of performance out of the old single-threaded RTS? Or would we be:
advising people on how to migrate their old "unthreaded" code to the new default;
and discussing how to get the best multi-threaded performance out of the RTS?
It seems to me that you would merely be exchanging many short simplifier runs for a few much-longer simplifier runs, because of that extra complexity.
Oh I say, that's a bit strong. `deepseq` is a tool with uses, and while I agree that in that specific problem the strict vector was ultimately the right solution, in a more complex scenario - in which either you want some laziness sometimes in your data structures, or you don't yet know which parts of your program are eating all of your memory with thunks and refactoring everything to be strict would be time-consuming - `deepseq` is an important tool to have at your disposal, even if just to try `force`-ing different things until you figure out which things to refactor. (But sometimes invalid laziness has to be representable; Haskell's type system is very good at letting you express invariants through types, but no type system is perfect at this.)
Certainly when you're attempting to squeeze your thunks out by forcing everything through a `show`, as OP was, `force` is something you ought to be told about.
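For example, a minimal sketch of that try-`force`-and-see workflow (the `Stats` type and fold here are hypothetical stand-ins, not OP's actual code):

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Control.DeepSeq (NFData, force)
import Data.List (foldl')
import GHC.Generics (Generic)

-- Hypothetical accumulator: even foldl' only forces it to WHNF,
-- so thunks still pile up inside the fields.
data Stats = Stats { total :: Double, count :: Int }
  deriving (Show, Generic)

instance NFData Stats

step :: Stats -> Double -> Stats
step (Stats t c) x = Stats (t + x) (c + 1)

main :: IO ()
main =
  -- force fully evaluates the accumulator, collapsing the thunks;
  -- moving it around helps locate which spot needs real strictness
  print (force (foldl' step (Stats 0 0) [1 .. 1000000]))
```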
I updated my gist with @AndrasKovacs' hash. Now it runs at about the same speed as yours, but uses fewer intrinsics and doesn't have short string optimisation (only a "short string" hash).
For such a bulk task (parsing CSV at 1GB/sec), not only every memory access counts, but also every instruction. Using a hash that works on 64-bit words instead of bytes helps to have fewer instructions.
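To illustrate the shape of the word-at-a-time idea (just a sketch with FNV-style constants, not the actual hash from the gist):

```haskell
import Data.Bits (rotateL, xor)
import Data.List (foldl')
import Data.Word (Word64)

-- One multiply+xor step per 8 input bytes, versus eight such
-- steps for a byte-at-a-time hash over the same data.
hashWords :: [Word64] -> Word64
hashWords = foldl' step 0xcbf29ce484222325  -- FNV offset basis
  where
    step h w = ((h `xor` w) * 0x100000001b3) `rotateL` 31
```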
`deepseq` is a time leak machine! And many other languages have it built into their spec.
Correct.
Consider `let x = head (return x) in 'a'`:
"outside in" evaluation: the result is `'a'`, as it ought to be, since `x` isn't needed here to obtain that result.
"inside out" evaluation: the result will always be ⊥, no matter how much memory or how many threads are available.
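This is easy to check:

```haskell
-- Under GHC's lazy ("outside in") evaluation, x is never demanded,
-- so this prints 'a'; a strict ("inside out") evaluator would loop
-- forever trying to evaluate the cyclic x first.
main :: IO ()
main = print (let x = head (return x) in 'a')
```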
To alleviate this fundamental problem with "inside out" evaluation, Eager Haskell would periodically and completely restart (reduction of) the program. This "re-scorched earth" style had two competing requirements:
avoiding too many restarts, so that the program can actually run at all;
avoiding too few restarts, so the program won't do too much unnecessary work (like trying to evaluate `x = head (return x)` or other similarly-cyclic definitions).
…a "delicate balancing act".
We've gone off topic, but I invite anyone who sneers at `deepseq` to take a look at Fix space leak between modules during compilation by MonoidMusician · Pull Request #4517 · purescript/purescript · GitHub and recommend a better way to address that problem. We weren't happy with the deep infiltration of `NFData` into our types and would love to hear from someone who can clearly see a less invasive approach.
(ahem) …speaking for myself here: I wasn't so much "sneering" at (the use of) entities like `deepseq`, but a prevailing impression that they (unfortunately) seem to accentuate:
that laziness/non-strict semantics are now more of an impediment than being of ongoing benefit to Haskell,
and little if anything would be lost if Haskell were strict by default.
(Again, this is merely how I'm perceiving the current situation.)
So it is to that impression that I say:
did someone solve the halting problem?
or is Haskell going to eventually be total as well as dependent?
Because without either of those, being strict by default generally means more programs won't produce a useful result at all. Therefore definitions like `deepseq` (or `compel`, as I mentioned here) along with other "strictness modifiers" should ideally be considered "features of last resort" because of their potential for non-termination.
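A tiny example of that potential for non-termination:

```haskell
import Control.DeepSeq (deepseq)

-- A perfectly usable lazy value...
ones :: [Int]
ones = 1 : ones

main :: IO ()
main = do
  print (take 3 ones)                      -- fine: prints [1,1,1]
  ones `deepseq` putStrLn "never printed"  -- deepseq forces the
                                           -- whole list: diverges
```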
So how can we get to that ideal? This looks promising:
…if the multi-threaded RTS does have to be overhauled to solve the "N > 4" problem observed by @chreekat and others, can any of Robert Ennals' research be reused as well, so that comments like:
…don't end up being treated as basic advice when using Haskell?
The result first: 3.9s!
11 threads (the number of cores is 10, although 2 of them are not "performance cores"):
```
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):   3.925 s …  3.943 s    5 runs
```
On my computer, `-fllvm` is still not faster - but for your version at least not slower - than not using it.
Without LLVM:
```
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):   4.333 s …  4.355 s    5 runs
```
With LLVM (about the same):
```
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.346 s ±  0.005 s    [User: 32.559 s, System: 1.227 s]
  Range (min … max):   4.340 s …  4.353 s    5 runs
```
But now to the number of threads:
11 threads:
```
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):   3.925 s …  3.943 s    5 runs
```
12 threads:
```
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     3.951 s ±  0.015 s    [User: 36.132 s, System: 1.523 s]
  Range (min … max):   3.936 s …  3.970 s    5 runs
```
10 threads (the number of cores, although 2 of them are not "performance cores"):
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     3.968 s ±  0.013 s    [User: 35.878 s, System: 1.449 s]
  Range (min … max):   3.954 s …  3.982 s    5 runs
```
14 threads:
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     3.991 s ±  0.012 s    [User: 36.293 s, System: 1.555 s]
  Range (min … max):   3.979 s …  4.004 s    5 runs
```
16 threads:
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.029 s ±  0.012 s    [User: 36.438 s, System: 1.589 s]
  Range (min … max):   4.020 s …  4.051 s    5 runs
```
20 threads:
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.100 s ±  0.037 s    [User: 36.761 s, System: 1.631 s]
  Range (min … max):   4.054 s …  4.142 s    5 runs
```
8 threads:
```
hyperfine -w 1 -r 5 './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):   4.333 s …  4.355 s    5 runs
```
200 threads:
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     6.006 s ±  0.109 s    [User: 41.527 s, System: 3.630 s]
  Range (min … max):   5.921 s …  6.156 s    5 runs
```
I think embarrassingly parallel (EP) workloads are meant to use data-parallel (DPH) features like what Repa does; forkIO spawns a green thread that has to support all sorts of things that we usually don't consider as EP. For example, a GPU at its core is usually just a 32- or 64-instruction-wide SIMD processor [1], which means that misuse of if statements can lead to worse performance than just computing both sides of the if; all instructions are executed in sync, and a memory access for one "thread" will stall all 64 threads. These are things we don't have to worry about with green threads.
The N>4 issue might be something else: the stm-containers package mentions it can scale to 8 threads with ease. I also have a hunch that modern computers can sort of brute-force up to about 4 CPUs (which is why Intel was king with 4 cores for a decade): L2 cache is only about as fast as 1/2 a DDR4 channel, meaning you can saturate a dual-channel system with 4 cores.
I've tried to specifically stick to Haskell-style code, and it being orders of magnitude slower than C sucks, as per @emiruz's comments; what's even worse is that when you look at other functional languages, most notably Lisp, the versions with C-like performance are idiomatic, functional style. [2]
@AndrasKovacs how does your single-threaded version perform in comparison to mine? Not sure I understand why mine is so slow compared to many of the solutions here; strict map and fold shouldn't be that much slower.
I don't think I've seen anyone here implement a pure parallel version using `par`; maybe if I find some more time I'll have a crack at a `par` version, stm-containers, and something using DPH, and also fix up some of the code to use bound threads to prevent cache misses etc.
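For reference, here's the rough shape such a pure `par`-based version might take (a toy sketch using Control.Parallel.Strategies from the parallel package, nothing tuned):

```haskell
import Control.Parallel.Strategies (parList, rdeepseq, using)
import Data.List (foldl')

-- Sum each chunk in parallel, then combine the partial results;
-- compile with -threaded and run with +RTS -N.
main :: IO ()
main = print total
  where
    chunks  = [[i * 1000 + 1 .. (i + 1) * 1000] | i <- [0 .. 999 :: Int]]
    partial = map (foldl' (+) 0) chunks `using` parList rdeepseq
    total   = foldl' (+) 0 partial
```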
Something I've been wondering about is whether we can use the type system to validate that we have no overlap when delegating reads from the file. A correct implementation would be fairly easy to write, but that's not why we use Haskell, eh?
[1]: and end up with absurd marketing claims like 3000 "stream" processors
[2]: Lisp SBCL #2 n-body - Which programs are fastest? (Benchmarks Game)
…and strict by default. So you're the one who's dealing with (more of) the tedious and boring jobs, such as determining the correct order of evaluation:
We've got it already: you like Haskell's laziness.
But:
Adding `Strict`ness makes the version (minimally, but consistently) faster on my computer, at least with the load that it has right now (a VM and many other things running); that's why the times are longer than before:
Original version:
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.307 s ±  0.021 s    [User: 37.089 s, System: 2.015 s]
  Range (min … max):   4.276 s …  4.328 s    5 runs
```
With `{-# LANGUAGE Strict #-}`:
```
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     4.269 s ±  0.023 s    [User: 36.972 s, System: 1.955 s]
  Range (min … max):   4.247 s …  4.305 s    5 runs
```
…but I wonder if I'm now in the minority here :-/
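For anyone who hasn't used it, a minimal illustration of what `-XStrict` does (my own toy example, not the benchmarked program):

```haskell
{-# LANGUAGE Strict #-}
-- Under -XStrict, bindings in this module get implicit bang
-- patterns (and constructor fields become strict via the implied
-- StrictData), so t, c and x below are evaluated eagerly:
module Stats where

data Stats = Stats Double Int  -- fields as if marked !Double, !Int

step :: Stats -> Double -> Stats
step (Stats t c) x = Stats (t + x) (c + 1)
```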
I could be mistaken…but I think you're the first person on this thread to make use of `-XStrict`. It would be interesting to see if it makes a difference for the other/older solutions given here: if it works, then perhaps it could help to solve sgraf's third problem:
That's not my program; I'm just running them (almost?) all for benchmarks. I actually tried it on my own some time ago and it was slower, although that might have changed, and the last version would benefit.
That is incorrect: the Lisp SBCL #2 code is clean and doesn't look like it cares too much about order, meanwhile I'm having to splatter strictness annotations and really think about why my Haskell program is so slow. On top of that, we have to remember that modern C compilers are extremely good at re-ordering operations, and let's not forget that modern CPUs' performance comes from real-time re-ordering of instructions as well; half of a modern processor is dedicated to this dark art of running instructions that take 4 cycles per instruction at 4 instructions/cycle.
At the end of the day, all the code generator has to do is build a dependency graph and spit out an ordering that the CPU (or µarch) will run fast in most cases. This is how software like FEX can run badly-optimized legacy x86_32 code on an x86_64 processor faster than the processor itself can.
There's definitely something else going on here that's not strict vs. lazy. The fact that my pure functional implementation with no use of `forkIO` is faster with `-qm` passed to the RTS would hint that there's work needed in the scheduler.
That SBCL code seems rife with destructive updates (…or maybe I just can't read Common Lisp). What makes it more idiomatic than the faster Haskell GHC #2 solution: n-body Haskell GHC #2 program (Benchmarks Game)
TBH I'm not familiar enough with CL to know exactly what is going on with the code, nor am I saying that the Lisp version is without mutation or anything like that.
What I'm saying is that the Lisp version is closer to the ideal of what I would want to write in Haskell. Meanwhile the fast Haskell version is basically ugly C: there are huge amounts of direct use of pointers, we have a `Storable` instance and a heavy splattering of `peek`/`poke`; that's worse than having a few mutable vars in the Lisp code.
We still have to think about ordering in the Haskell version, especially when it comes to performance code (as every example here proves), so the notion that when writing Haskell code you do not have to think about ordering is wrong; we have libraries like Pipes/Conduit that are designed to help, and we also write slower functional-style code in other languages.
I thought the reason those compilers could do that was largely due to the input language being a "semantics-free zone" (sometimes known as a "dumpster fire"), which is another reason why I'm annoyed by performance comparisons with that ol' warhorse…anyone for a Haskell-Clean comparison?
…but this clearly suggests otherwise:
To obtain the correct results, those mutable variables must be used in the correct order: this is why the encapsulated-state type `ST s a` is monadic.
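A small self-contained example of that sequencing (my own sketch, not code from this thread):

```haskell
import Control.Monad.ST (runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)

-- The ST monad pins down the order of the destructive updates,
-- so the pure result is still deterministic.
sumST :: [Int] -> Int
sumST xs = runST $ do
  ref <- newSTRef 0
  mapM_ (\x -> modifySTRef' ref (+ x)) xs
  readSTRef ref
```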
For Haskell code based on visible effects, that is correct - as you subsequently noted:
we have libraries like Pipes/Conduit that are designed to help […]
But ordinary (effect-free) Haskell code ought to be capable of being evaluated in any order, just as long as the data dependencies are preserved. This form of implicit parallelism works for SQL, so it should also work for Haskell. That it doesn't (quite) work like that at the moment just means more work needs to be directed at the implementation; something which you've also noted:
I parallelized and refactored the previous version: 1brc · GitHub
Parallel performance is very erratic on my laptop; there's a 2-3x difference between runs. I'm testing 500M inputs because I don't have enough RAM to hold the 1B file in memory, and while I don't run out of memory with mmap, apparently I get some IO bottleneck. The best I was able to do was 1.9s for 500M. The number of threads can be set with `+RTS -N`.