One Billion Row Challenge in Haskell

Oh yes, this is better :smiley:

The format of the result is different, but the number of stations is correct and the values look correct too.

In multi-threaded mode it needs 21MB of RAM and 8 threads, but uses only 120% of CPU (instead of 800% for 8 fully used cores). The time is therefore longer than in the single-threaded version: 320 s (5 min 20 s).

% hyperfine -w 1 -r 5 './1brc m > solution.txt'
Benchmark 1: ./1brc m > solution.txt
  Time (mean ± σ):     318.603 s ± 10.467 s    [User: 365.078 s, System: 9.115 s]
  Range (min … max):   305.822 s … 332.892 s    5 runs

In single-threaded mode: 16MB RAM, 100% CPU and a runtime of 290 s (4 min 50 s).

% hyperfine -w 1 -r 5 './1brc s > solution.txt'
Benchmark 1: ./1brc s > solution.txt
  Time (mean ± σ):     290.503 s ±  4.688 s    [User: 285.638 s, System: 2.397 s]
  Range (min … max):   283.823 s … 294.683 s    5 runs

The multithreaded one needs +RTS -N<n>; 120% CPU and only 20MB of RAM would hint it’s set to 1 (the default). Mine seems to bottleneck around 4–5 threads.

Oh, yes, that’s true.

Using 10 threads, it needs 125MB of RAM and uses about 470% CPU. Runtime is 245 s (4 min 5 s).

% hyperfine -w 1 -r 5 './1brc m > solution.txt'
Benchmark 1: ./1brc m > solution.txt
  Time (mean ± σ):     245.405 s ± 14.430 s    [User: 1059.317 s, System: 37.166 s]
  Range (min … max):   235.293 s … 269.609 s    5 runs

One clear takeaway already is that somebody needs to look into GHC’s handling of N>4 threads. I have heard in the context of GHC’s own CI that anything more than N=4 creates a performance degradation. For all the hype about how functional programming makes it easier to write parallel code, we definitely need a platform that makes embarrassingly parallel algorithms not so embarrassing at runtime!

5 Likes

…along with the fact that approval has been officially given for the multi-threaded RTS to eventually be the default in GHC.

Often the key to getting parallel code to perform better is to implement some sort of chunking/batching. Any point where threads potentially have to synchronise is a point that can lead to contention.

Handling one of these items is very quick, so any queue you have will quickly become a point of contention, whereas if you have each thread handle, say, 100 or 1,000 items at a time, you are much less likely to have contention.
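
As a purely illustrative sketch of that batching idea (not code from any solution in this thread), the workers below claim a batch of indices with a single atomic bump of a shared counter, so the shared state is touched once per batch rather than once per item. The names (workInBatches, doItem) and the batch size are made up for the example; compile with -threaded and run with +RTS -N<n> so the forked workers actually run in parallel.

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.IORef (newIORef, atomicModifyIORef')

-- Each worker claims a batch of item indices with one atomic counter bump,
-- then processes that batch without touching any shared state.
workInBatches :: Int -> Int -> Int -> (Int -> IO ()) -> IO ()
workInBatches nWorkers batchSize total doItem = do
  next  <- newIORef 0
  dones <- mapM (const newEmptyMVar) [1 .. nWorkers]
  let loop = do
        start <- atomicModifyIORef' next (\n -> (n + batchSize, n))
        if start >= total
          then pure ()
          else do
            mapM_ doItem [start .. min total (start + batchSize) - 1]
            loop
  mapM_ (\done -> forkIO (loop >> putMVar done ())) dones
  mapM_ takeMVar dones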

ThreadScope can also be very helpful for debugging why your threads aren’t as active as you might expect.

I’ve also found in the past that sometimes these bottlenecks are caused by memory bottlenecks. You can test whether this is the case by seeing if the instructions per cycle (IPC) stat reported by perf drops below 1 (which indicates that your CPU is waiting on memory).

5 Likes

…yes, but the ultimate goal ought to be implicit thread management, in the same way we currently benefit from (mostly) implicit memory management:

…because processor thread/core counts are only increasing.

3 Likes

In general yes, but the 1BRC is easy to parallelise, as there is no synchronisation needed at all and we can assign each thread its own chunk of the input to process.
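
To sketch that shape (this is not the code from this thread, and counting lines per station stands in for the real min/mean/max aggregation): each thread below folds its own chunk into its own strict Map, and the only coordination is merging the per-chunk maps at the end. The chunks are assumed to have been split on line boundaries already, and mapConcurrently comes from the async package; compile with -threaded and run with +RTS -N<n>.

import Control.Concurrent.Async (mapConcurrently)
import Control.Exception (evaluate)
import Data.List (foldl')
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as M

-- One chunk per thread, each folded into a private Map; the partial results
-- are merged only once all workers are done.
countStations :: [BC.ByteString] -> IO (M.Map BC.ByteString Int)
countStations chunks = do
  partials <- mapConcurrently (evaluate . countChunk) chunks
  pure (M.unionsWith (+) partials)
  where
    countChunk = foldl' bump M.empty . BC.lines
    bump m line = M.insertWith (+) (BC.takeWhile (/= ';') line) 1 m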

1 Like

I’d add that – and forgive me if I’m wrong – the performant solutions are mostly imperative + mutable, which – as a pleb – strikes me as indicating that it is currently not possible to write a performant declarative (functional, immutable) solution.

If this problem were to be a motivational case for improvement – speaking as a consumer, as opposed to a language researcher – I’d personally prioritise changing whatever makes a fast immutable, declarative, single-threaded solution possible, over consideration of the parallel aspects.

Here is 1B rows in 25s in 7 LoC of obvious SQL: to the pragmatist, it speaks for itself.

4 Likes

I’m not sure what you mean here. Certainly, the SQL interface to DuckDB is declarative; I’m confident that the implementation is not! It is commonplace in Haskell to take an imperative implementation (written in Haskell or otherwise) and wrap it in a declarative API, so would you be satisfied with such a solution? Would a Haskell implementation in terms of an Opaleye- or Rel8-like typed, declarative API, that uses DuckDB on the backend be satisfactory?

If not, would a declarative Haskell implementation in terms of imperative, mutable Haskell (the text, bytestring etc. approach) be satisfactory? Or are you really asking for an implementation that is “declarative all the way down”?

I agree with @tomjaguarpaw here. :slight_smile: At a low enough level, all Haskell code is mutable and procedural, since that’s the physical model used by all supported hardware systems GHC can run on!

The only thing stopping an idiomatic Haskell implementation from being as fast as the fastest C implementation is the appropriate library support. The C FFI exists just for this reason.

To put it another way: right now, it’s clear that no Haskell implementation can match the fastest implementations in other (languages|computation engines). The pragmatic challenge here is to identify where Haskell’s best-practice libraries have not yet been optimized for this use case, and optimize them appropriately.

…but lol, I guess that’s exactly what you were saying!

That sounds very close to what Tom and I are saying, so we must be on the same page. :slight_smile:

I guess I just focused on the deficiencies in parallel execution because that has always been a topic I’ve found very interesting.

1 Like

There is no such thing as “declarative all the way down”: at some point something has to tell the computer what to do step by step. However, if a language claims to be “declarative” then I expect to solve problems in it (mostly) declaratively. That is the case in relational paradigms, e.g. MiniZinc, SQL, Datalog and Prolog. Every language has an “and then we break the paradigm and do X instead” cut-off; I suppose I’m saying it didn’t take long to hit it in this case, for a very simple problem.

Yes, exactly :slight_smile:. Additionally though, note that almost all solutions, except for mine and perhaps one other, are by experts. Meanwhile, the naive solutions literally don’t work because of the way thunks accumulate in lazy evaluation, wherein “only evaluate what you need to” ends up causing no evaluation at all and, eventually, memory exhaustion :slight_smile: When they eventually do work they are exceptionally slow. I’ve coded Haskell for 4–5 weeks and I’ve run into this same problem in 3 different data-processing use cases (all on this forum).

For example, check out the “Optimiser performance problems” thread, where I both had a memory leak and terrible performance. To resolve the memory leak I was advised by the experts to either use force from deepseq or adopt a specialised strict container structure (that many were unaware of), and told that the ad package is basically doomed to be slow because of the way it deals with some aspect of thunk management (I had noticed that it was 2x slower than an analytically calculated gradient for a 2-parameter problem!). Meanwhile, we are talking about <15 LoC of actual Haskell, and a noob like me obviously wonders what that means at scale or with more complex problems :slight_smile:
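
For anyone who has not hit it, the shape of the leak being described looks roughly like this (a sketch, not code from the cited thread): with a lazy fold over a lazy Map, every insertWith stores an unevaluated (+) thunk and the fold itself defers the map updates, so nothing is summed until the result is demanded and memory grows with the input.

import Data.List (foldl')
import qualified Data.Map.Lazy   as ML
import qualified Data.Map.Strict as MS

-- Leaky: lazy accumulator and lazy values, so each key ends up holding a long
-- chain of (+) thunks that only collapses when the final result is forced.
leaky :: [(String, Double)] -> ML.Map String Double
leaky = foldl (\m (k, v) -> ML.insertWith (+) k v m) ML.empty

-- Flat memory: a strict fold over a strict Map evaluates each running sum as
-- it is inserted.
ok :: [(String, Double)] -> MS.Map String Double
ok = foldl' (\m (k, v) -> MS.insertWith (+) k v m) MS.empty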

2 Likes

Hmm:

…are there any extant proposals to add laziness to the SQL standard? If so, then we could use SQL as a target language (at least it would be portable :-).

1 Like

I’m inclined to agree, but I still don’t understand what you’re asking for. Is an Opaleye-like solution satisfactory? Or a bytestring-like solution?

I suspect that’s because it’s a simple problem. Once the problem becomes sufficiently complicated, the challenge becomes managing complexity in the implementation rather than the speed of bit twiddling. I suspect Haskell will eclipse C in complex scenarios.

the naive solutions literally don’t work because of the way thunks accumulate in lazy evaluation

I don’t think we’ve seen any problem in this thread that was related to thunks accumulating, have we? There are indeed some strictness annotations, but they are about specifying whether individual thunks are allocated or not, not about whether long chains of thunks are allocated.

I remember one case (resolved at Optimiser performance problems - #30 by tomjaguarpaw). Could you share the two others?

The suggestion to use force from deepseq is absolutely wrong, and adopting the strict container structure (I wouldn’t call it “specialised”, I’d just call it “correct”) is absolutely the right solution (i.e. make invalid laziness unrepresentable). I am willing to die on this hill! In brief: if you don’t want thunks to appear in your data type then choose a data type that doesn’t allow thunks to appear.
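
In code, “make invalid laziness unrepresentable” can be as small as putting strict fields on the accumulator. The type below is only a sketch (assuming the usual per-station min/max/sum/count summary for 1BRC): the bangs mean a thunk can never hide inside the record, so no force or deepseq is needed at the use sites, and storing it in Data.Map.Strict keeps the whole table thunk-free.

-- Strict fields: evaluating a Stats value forces all four components, so an
-- accumulator of this type cannot silently pile up thunks.
data Stats = Stats
  { sMin   :: !Double
  , sMax   :: !Double
  , sSum   :: !Double
  , sCount :: !Int
  }

addObs :: Double -> Stats -> Stats
addObs x (Stats lo hi s n) = Stats (min lo x) (max hi x) (s + x) (n + 1)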

No, that’s not right. I explained that the approach to AD that the ad package takes is doomed to be slow on vector problems. That has nothing to do with thunks.

It’s a curious thing. There’s one large group of people (mostly outside the Haskell world) that thinks that “Haskell is hard to use because laziness and thunks make it hard to reason about performance and space usage” and another group of people (mostly Haskell insiders) who think “Haskell is great, laziness has all sorts of benefits, and no problems in practice”. They’re both wrong! Laziness has severe implications when used incorrectly, but the solution is simple, not difficult: make invalid laziness unrepresentable. Unfortunately this message has not really percolated very far in the Haskell world. I will keep on trying.

To reiterate, Haskell will be better for more complex problems, not worse! After all, the simplest problems will always be fastest with hand-crafted assembly.

5 Likes

I think what we want is a language L that makes it easy to encapsulate performant code in reusable libraries, such that users can use performant library functions instead of rolling their own implementation. This way, only the library authors need to know how to write performant L code, not the users of that library.

Of course, in order to write these libraries, we

  1. have to start from a concrete motivation such as the 1brc problem
  2. come up with a fast solution (e.g., using hacky W8# in parseLine, perhaps unboxed byte strings; see the sketch after this list)
  3. generalise this solution into a library.
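
To make (2) concrete, the flavour of such code is roughly the following (a sketch, not the thread’s actual parseLine, and one level above the W8# tricks): read a temperature like "-12.3" as an Int number of tenths, with no intermediate String and no Double parsing.

import qualified Data.ByteString.Char8 as BC

-- Parse "-12.3" to -123 (tenths of a degree). The fast solutions go one level
-- lower still and index raw Word8s (W8#), but the idea is the same.
parseTemp :: BC.ByteString -> Int
parseTemp bs = case BC.uncons bs of
  Just ('-', rest) -> negate (digits rest)
  _                -> digits bs
  where
    digits = BC.foldl' step 0
    step acc '.' = acc
    step acc c   = acc * 10 + (fromEnum c - fromEnum '0')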

We are currently doing (2) for L=Haskell. If we can’t do (2), i.e., the solution is not fast enough, then we should think about how our language/its compilation can be improved in order to make it fast enough. (I like thinking about this.)
Only then can we think about (3), which is another hard problem. Evidence is the abundance of INLINE pragmas in high-perf libraries that result in code size explosion.
I’m increasingly convinced that a reliable solution to (3) can only be achieved by libraries using staged compilation/Typed Template Haskell, but in a way that users of that library can stay mostly oblivious to staging (so no syntactic noise through splicing/quoting). @AndrasKovacs knows much about this.
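
For readers unfamiliar with the technique, the textbook staging example looks like this (assuming a reasonably recent GHC, where typed quotations have type Code Q a): the exponent is known at compile time, so the spliced code is a straight chain of multiplications that needs no INLINE pragmas or Simplifier heroics to be fast.

{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH (Code, Q)

-- Staged power: the recursion runs at compile time; the generated code does not recurse.
power :: Int -> Code Q (Int -> Int)
power 0 = [|| \_ -> 1 ||]
power n = [|| \x -> x * $$(power (n - 1)) x ||]

-- In another module:  cube = $$(power 3)
-- which expands to nested multiplications, roughly \x -> x * (x * (x * 1)).

The open problem alluded to above is hiding the $$(...) splice from the library user, so that a staged library reads like an ordinary one.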

C (C++, Fortran) is a language that makes (2) easy, but (3) hard. Hence, performant C code needs an expert C programmer. L should do better, enabling performant L code written by novice L programmers using the right libraries.

When using SQL to solve the problem, SQL is like the interface of a C library (DuckDB); its implementation is certainly neither pretty nor easily understood by its users. However, DuckDB is but one library, and it is not entirely trivial to interface with custom data types/your problem domain in L. I’d rather have the means in L to define a competitive DBMS as a library, first class (within a factor of 2 of C, at most). But that needs us to solve (2) and (3).


So what’s causing problems with (2)? Why isn’t it “fast”, single-threaded? That’s what we need to figure out. (Parallelism is another problem.)

Edit: Andras claims below that the fastest single-threaded solution already performs like straightforward C solutions, so we might be “fast enough”. Not sure what the fastest C solutions do better; perhaps vectorisation etc., which is kind of a pain point for Haskell (but it might not be for a staged solution).

3 Likes

If we’re really unable to resolve (2), that’s very surprising to me. My expectation is that it’s possible to write low-level Haskell code that is as fast as the “same approach implemented in C”, and that it will typically be possible to wrap that code in a functional/declarative API that provides a reasonable level of reuse.

1 Like

Our single-threaded version earlier in this thread had the same speed as straightforward C solutions.

3 Likes

Ah good. At more than 170 comments it’s hard to keep track of everything! (And Discourse doesn’t even load the whole page at once – it uses “progressive” (or “regressive”?) loading.)

Yes, the “fastest” C solution above is mostly about 256-bit SIMD.

But I would prefer not to have to rely on (T)TH for performance; GHC is unbearably slow already.

1 Like

I think (T)TH being slow is not inherent to the approach; it’s just that its specification/implementation needs love. I would even say that good support for staging should yield a much faster compiler, because we can dial down the number of Simplifier runs and its complexity.

4 Likes