-XStrictData might be useful as a quick check, but I think it’s better to have explicit bangs in record fields.
If you grep a large codebase, you will see the field declaration, not the LANGUAGE StrictData pragma somewhere at the top of the module, so you may mistakenly think that the field is lazy. An explicit bang helps to disambiguate this.
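As a minimal illustration (the Counter types are made up for this example), both declarations below give a strict field when StrictData is on, but only the second one says so at the declaration you actually find when grepping:

{-# LANGUAGE StrictData #-}

-- With -XStrictData in force, count is strict even though the declaration
-- shows no bang, which is easy to miss when grepping for the field alone.
data Counter = Counter { count :: Int }

-- The explicit bang says the same thing right at the declaration site,
-- with or without StrictData.
data Counter' = Counter' { count' :: !Int }

main :: IO ()
main = print (count (Counter 1), count' (Counter' 2))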
BTW, a related performance pitfall is to use Record { field :: {-# UNPACK #-} !Int } and then send the field to a non-inlined function that expects a boxed value. This will cause allocation (boxing) and may actually reduce performance.
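A rough sketch of that pitfall (Record and consume are hypothetical): the unpacked field has to be re-boxed every time it is handed to a function that did not inline and that takes a lifted Int.

data Record = Record { field :: {-# UNPACK #-} !Int }

-- NOINLINE stands in for "some function in another module that did not
-- inline". It takes a boxed Int, so reading the unpacked field and passing
-- it here forces a fresh I# box to be allocated on each call.
consume :: Int -> Int
consume x = x + 1
{-# NOINLINE consume #-}

main :: IO ()
main = print (sum (map (consume . field) [Record i | i <- [1 .. 1000]]))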
There’s no royal road to performance. Everything-is-lazy is usually faster, but you need to care about accumulators, and it may not be faster in raw number crunching. Making everything strict kills a lot of expressiveness and, counter-intuitively, usually makes things slower.
BTW #2: For the “Haskell must be strict” folk, see a take from Edward Kmett on how laziness is necessary for performance in a pure language (basically a short description of the chapter “Amortization and persistence via lazy evaluation” from Okasaki’s book).
FWIW, it is unlikely this was the case with GHCs from at least the 9.0 series, possibly earlier. The Simplifier would drop the bang in go !acc [] = acc as a redundant eval. A recent patch stopped doing so, because keeping the bang can sometimes make a measurable difference when evaluating the list to [] takes a long time and acc has built up a huge closure (a space leak).
Interesting to know. It was the case with the GHC 7.x series. But I oversimplified my example (to the point of making it wrong, as you pointed out). It should be:
go acc [] = f acc
go !acc (x:xs) ...
-- is frequently faster than
go !acc [] = f acc
go !acc (x:xs) ...
Sometimes, f might not even need acc, so it will be ignored entirely without a bang.
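A tiny made-up example of that last point (foldCount is not from the thread): the accumulator stays strict while the list is consumed, but the final clause leaves it to f to decide whether the last thunk is ever forced.

{-# LANGUAGE BangPatterns #-}

foldCount :: (Int -> r) -> [Int] -> r
foldCount f = go 0
  where
    go acc  []     = f acc               -- no bang: f may ignore acc
    go !acc (x:xs) = go (acc + x) xs     -- strict while consuming the list

main :: IO ()
main = do
  print (foldCount id [1 .. 10])                    -- acc is needed: 55
  putStrLn (foldCount (const "ignored") [1 .. 10])  -- final thunk never forced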
If somebody wants a reference: my (less optimised) parallel Go version runs in 3.2s, and that Go program ported to C runs in 2.5s. Interestingly, both need about the same number of lines of code, roughly 350.
What kind of machine are you running your benchmarks on?
On mine (a Hetzner EX101 with an i9-13900), it shows more or less the same results as András’ version:
-- num_threads = 8
User time (seconds): 14.14
System time (seconds): 0.62
Percent of CPU this job got: 712%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.07
-- num_threads = 24
User time (seconds): 20.32
System time (seconds): 1.00
Percent of CPU this job got: 1739% -- not 2400%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.22 -- mmap limit?
It cannot run faster than 1.2s on 24 threads due to mmap. And it runs in about 2.1s on 8 threads, much like mine and András’ versions (2.1s was a lucky case; it’s usually 2.4-2.7s, probably because there are just 8 chunks).
Do you have Intel or ARM (where GHC codegen might not be as optimized)?
Fun perf. golfing in the thread, I’ve just scrolled through a month worth of iterations.
One minor comment: in the past I got helpful data on problems like this (parsing large files) by benchmarking doing the bare minimum and incrementally doing more. Benchmark a program that does nothing. Then a program that opens and closes the file. One that fseeks across it to the end. One that copies buffers in a loop. Before doing the actual work parsing or build data structures. If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”. I’ve found this is sometimes easier than writing something high level and then trying to wring speed out of it.
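As a concrete starting point, a “copy buffers in a loop” baseline might look like the sketch below (file name and chunk size are placeholders, not from the thread); if even this allocates noticeably or trails the equivalent C loop, the parser is not the problem yet.

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as BS
import System.IO

-- Baseline: stream the file in 1 MiB chunks and count the bytes, no parsing.
main :: IO ()
main = do
  h <- openBinaryFile "measurements.txt" ReadMode
  let loop !total = do
        chunk <- BS.hGetSome h (1024 * 1024)
        if BS.null chunk
          then pure total
          else loop (total + BS.length chunk)
  total <- loop (0 :: Int)
  hClose h
  print total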
If at any point you start seeing allocations, GCs, or slower speeds than C, you can point at “here is where something is leaving C parity too early”.
But for a great many people, Haskell will always "leave C parity too early” for their liking no matter how “close to C parity” Haskell can be brought…maybe they’ll just be happier with Rust, and the imminent arrival of its second working compiler :-D
This thread isn’t for those people, so I don’t think it matters too much what they think. They’ve already got C! And Rust!
Plus, these really specialized (dare I say “not real-world”) algorithms that read data and do the simplest possible mathematical operations on them are definitely a place where Haskell should be able to compete. True, the Haskell solution may end up hyper-specialized, but a lot of generally useful know-how, data structures, and algorithms will come out of the exercise.
If anyone does this: do me a favor and include brokkr-hashtables in the benchmark. It’s only on GitHub for now, but my benchmarks show it a bit faster than the other Haskell alternatives.
It would be great if someone could summarize the results of this thread, e.g. benchmarks of the naive implementation vs highly optimized variants, and ideally vs the fastest and naive Java and C implementations, to get a feeling for the relative speeds benchmarked on the same computer.
My hope is that we can place these results in a visible place like haskell.org to have another prominent example on how fast Haskell can become when you need performance.
The C version (the fastest one in this table) is nowhere near the fastest possible (not comparable with the work András did in his Haskell version). The fastest C (well, SIMD) solution does not run on my ARM64.
Even though it’s not the point and arguably not the right tool (the original challenge required you to use the standard library or standard-library-looking primitives), I thought it’d be fun to baseline what performance looks like on a larger-than-memory dataset with dataframe (of course). I have an experimental branch which is still a major WIP with some kinks (e.g. the mean logic is off but easily fixable).
main :: IO ()
main = do
  let df = LD.scanCsv "../1brc/measurements.txt"
  -- Column reference for the temperature readings.
  let temp = F.col @Double "temperature"
  df' <-
    df
      |> LD.withSpill 512_000
      |> LD.groupByAggregate
           ["names"]
           [ "minimum" .= F.minimum temp
           , "mean"    .= F.mean temp
           , "maximum" .= F.maximum temp
           ]
      |> LD.runDataFrame
  print df'
On a 2018 Dell Latitude 5490 with 8GB of RAM, using the full 1-billion-row dataset (13GB), this finishes in ~30 minutes. The approach entails sharding the CSV first, then accumulating and aggregating the dataframes. Overall I’m glad the computer could handle it, as it usually taps out when programs consume ~3GB.
A few areas for improvement:
Sharding into CSV (and reading back) is a big bottleneck - it would be better to use a columnar format like Parquet. I have the reader ready but haven’t had the patience to start on the writer. That should shave off a lot of the time spent on I/O.
There is a lot of corner cutting to make this work. Allowing groupbys in arbitrary positions (instead of at the end) with complex aggregations might be complicated. This work is a little more exciting (to me at least).
Specifying the schema up front would really make things faster and we could probably fail earlier (I think).
An aside: withFile is silly. If there is an IO exception on an unrelated temp file inside the withFile, it bubbles up as though it belongs to the “main” file.
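If I’m reading that right, a hypothetical repro would be something like the sketch below (file names are made up): the open that fails is for an unrelated temp path, yet the exception escaping withFile can end up reported against the outer file.

import System.IO

-- Made-up repro of the aside above: the failing open is for an unrelated
-- temp path, but on base versions where withFile annotates escaping
-- exceptions, the error message points at "main.txt" instead.
main :: IO ()
main =
  withFile "main.txt" WriteMode $ \_h -> do
    _tmp <- openFile "/nonexistent/dir/scratch.tmp" ReadMode
    pure ()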