One Billion Row Challenge in Haskell

Oh yes, this is better :smiley:

The format of the result is different, but the number of stations is correct and the values look correct too.

In multi-threaded mode it needs 21MB of RAM and 8 threads, but uses only 120% of CPU (instead of 800% for 8 fully used cores). The time is therefore longer than in the single-threaded version: 320 s (5 min 20 s).

% hyperfine -w 1 -r 5 './1brc m > solution.txt'
Benchmark 1: ./1brc m > solution.txt
  Time (mean ± σ):     318.603 s ± 10.467 s    [User: 365.078 s, System: 9.115 s]
  Range (min … max):   305.822 s … 332.892 s    5 runs

In single-threaded mode: 16MB RAM, 100% CPU and a runtime of 290 s (4 min 50 s).

% hyperfine -w 1 -r 5 './1brc s > solution.txt'
Benchmark 1: ./1brc s > solution.txt
  Time (mean ± σ):     290.503 s ±  4.688 s    [User: 285.638 s, System: 2.397 s]
  Range (min … max):   283.823 s … 294.683 s    5 runs

The multithreaded one needs +RTS -N<n>; 120% CPU and only 20MB of RAM would hint it’s set to 1 (the default). Mine seems to bottleneck around 4–5 threads.

Oh, yes, that’s true.

Using 10 threads, it needs 125MB of RAM and uses about 470% CPU. Runtime is 245 s (4 min 5 s).

% hyperfine -w 1 -r 5 './1brc m > solution.txt'
Benchmark 1: ./1brc m > solution.txt
  Time (mean ± σ):     245.405 s ± 14.430 s    [User: 1059.317 s, System: 37.166 s]
  Range (min … max):   235.293 s … 269.609 s    5 runs

One clear takeaway already is that somebody needs to look into GHC’s handling of N>4 threads. I have heard in the context of GHC’s own CI that anything more than N=4 creates a performance degradation. For all the hype about how functional programming makes it easier to write parallel code, we definitely need a platform that makes embarrassingly parallel algorithms not so embarrassing at runtime!

5 Likes

…along with the fact that approval has been officially given for the multi-threaded RTS to eventually be the default in GHC.

Often the key to getting parallel code to perform better is to implement some sort of chunking/batching. Any point where threads potentially have to synchronise is a point that can lead to contention.

Handling one of these items is very quick, so any queue you have will quickly become a point of contention, whereas if you have each thread handle, say, 100 or 1,000 items at a time, you are much less likely to have contention.
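
As a purely illustrative sketch of that batching idea (not code from any solution in this thread), the workers below claim a batch of indices with a single atomic bump of a shared counter, so the shared state is touched once per batch rather than once per item. The names (workInBatches, doItem) and the batch size are made up for the example; compile with -threaded and run with +RTS -N<n> so the forked workers actually run in parallel.

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Data.IORef (newIORef, atomicModifyIORef')

-- Each worker claims a batch of item indices with one atomic counter bump,
-- then processes that batch without touching any shared state.
workInBatches :: Int -> Int -> Int -> (Int -> IO ()) -> IO ()
workInBatches nWorkers batchSize total doItem = do
  next  <- newIORef 0
  dones <- mapM (const newEmptyMVar) [1 .. nWorkers]
  let loop = do
        start <- atomicModifyIORef' next (\n -> (n + batchSize, n))
        if start >= total
          then pure ()
          else do
            mapM_ doItem [start .. min total (start + batchSize) - 1]
            loop
  mapM_ (\done -> forkIO (loop >> putMVar done ())) dones
  mapM_ takeMVar dones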

ThreadScope can also be very helpful for debugging why your threads aren’t as active as you might expect.

I’ve also found in the past that sometimes these bottlenecks are caused by memory bottlenecks. You can test whether this is the case by seeing if the instructions per cycle (IPC) stat reported by perf drops below 1 (which indicates that your CPU is waiting on memory).

5 Likes

…yes, but the ultimate goal ought to be implicit thread management, in the same way we currently benefit from (mostly) implicit memory management:

…because processor thread/core counts are only increasing.

3 Likes

In general yes, but the 1BRC is easy to parallelise, as there is no synchronisation needed at all and we can assign each thread its own chunk of the input to process.
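
To sketch that shape (this is not the code from this thread, and counting lines per station stands in for the real min/mean/max aggregation): each thread below folds its own chunk into its own strict Map, and the only coordination is merging the per-chunk maps at the end. The chunks are assumed to have been split on line boundaries already, and mapConcurrently comes from the async package; compile with -threaded and run with +RTS -N<n>.

import Control.Concurrent.Async (mapConcurrently)
import Control.Exception (evaluate)
import Data.List (foldl')
import qualified Data.ByteString.Char8 as BC
import qualified Data.Map.Strict as M

-- One chunk per thread, each folded into a private Map; the partial results
-- are merged only once all workers are done.
countStations :: [BC.ByteString] -> IO (M.Map BC.ByteString Int)
countStations chunks = do
  partials <- mapConcurrently (evaluate . countChunk) chunks
  pure (M.unionsWith (+) partials)
  where
    countChunk = foldl' bump M.empty . BC.lines
    bump m line = M.insertWith (+) (BC.takeWhile (/= ';') line) 1 m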

1 Like

I’d add that – and forgive me if I’m wrong – the performant solutions are mostly imperative + mutable, which – as a pleb – strikes me as indicating that it is currently not possible to write a performant declarative (functional, immutable) solution.

If this problem were to be a motivational case for improvement – speaking as a consumer, as opposed to a language researcher – I’d personally prioritise changing whatever makes a fast immutable, declarative, single-threaded solution possible, over consideration of the parallel aspects.

Here is 1B rows in 25s in 7 LoC of obvious SQL: to the pragmatist, it speaks for itself.

4 Likes

I’m not sure what you mean here. Certainly, the SQL interface to DuckDB is declarative; I’m confident that the implementation is not! It is commonplace in Haskell to take an imperative implementation (written in Haskell or otherwise) and wrap it in a declarative API, so would you be satisfied with such a solution? Would a Haskell implementation in terms of an Opaleye- or Rel8-like typed, declarative API, that uses DuckDB on the backend be satisfactory?

If not, would a declarative Haskell implementation in terms of imperative, mutable Haskell (the text, bytestring etc. approach) be satisfactory? Or are you really asking for an implementation that is “declarative all the way down”?

I agree with @tomjaguarpaw here. :slight_smile: At a low enough level, all Haskell code is mutable and procedural, since that’s the physical model used by all supported hardware systems GHC can run on!

The only thing stopping an idiomatic Haskell implementation from being as fast as the fastest C implementation is the appropriate library support. The C FFI exists just for this reason.

To put it another way: right now, it’s clear that no Haskell implementation can match the fastest implementations in other (languages|computation engines). The pragmatic challenge here is to identify where Haskell’s best-practice libraries have not yet been optimized for this use case, and optimize them appropriately.

…but lol, I guess that’s exactly what you were saying!

That sounds very close to what Tom and I are saying, so we must be on the same page. :slight_smile:

I guess I just focused on the deficiencies in parallel execution because that has always been a topic I’ve found very interesting.

1 Like

There is no such thing as “declarative all the way down”: at some point something has to tell the computer what to do step by step. However, if a language claims to be “declarative” then I expect to solve problems in it (mostly) declaratively. That is the case in relational paradigms, e.g. MiniZinc, SQL, Datalog and Prolog. Every language has an “and then we break the paradigm and do X instead” cut-off; I suppose I’m saying it didn’t take long to hit it in this case, for a very simple problem.

Yes, exactly :slight_smile:. Additionally though, note that almost all solutions, except for mine and perhaps one other, are by experts. Meanwhile, the naive solutions literally don’t work because of the way thunks accumulate in lazy evaluation, wherein “only evaluate what you need to” ends up causing no evaluation at all and, eventually, memory exhaustion :slight_smile: When they eventually do work they are exceptionally slow. I’ve coded Haskell for 4–5 weeks and I’ve run into this same problem in 3 different data-processing use cases (all on this forum).

For example, check out the “Optimiser performance problems” thread, where I both had a memory leak and terrible performance. To resolve the memory leak I was advised by the experts to either use force from deepseq or adopt a specialised strict container structure (that many were unaware of), and told that the ad package is basically doomed to be slow because of the way it deals with some aspect of thunk management (I had noticed that it was 2x slower than an analytically calculated gradient for a 2-parameter problem!). Meanwhile, we are talking about <15 LoC of actual Haskell, and a noob like me obviously wonders what that means at scale or with more complex problems :slight_smile:
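
For anyone who has not hit it, the shape of the leak being described looks roughly like this (a sketch, not code from the cited thread): with a lazy fold over a lazy Map, every insertWith stores an unevaluated (+) thunk and the fold itself defers the map updates, so nothing is summed until the result is demanded and memory grows with the input.

import Data.List (foldl')
import qualified Data.Map.Lazy   as ML
import qualified Data.Map.Strict as MS

-- Leaky: lazy accumulator and lazy values, so each key ends up holding a long
-- chain of (+) thunks that only collapses when the final result is forced.
leaky :: [(String, Double)] -> ML.Map String Double
leaky = foldl (\m (k, v) -> ML.insertWith (+) k v m) ML.empty

-- Flat memory: a strict fold over a strict Map evaluates each running sum as
-- it is inserted.
ok :: [(String, Double)] -> MS.Map String Double
ok = foldl' (\m (k, v) -> MS.insertWith (+) k v m) MS.empty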

2 Likes

Hmm:

…are there any extant proposals to add laziness to the SQL standard? If so, then we could use SQL as a target language (at least it would be portable :-).

1 Like

I’m inclined to agree, but I still don’t understand what you’re asking for. Is an Opaleye-like solution satisfactory? Or a bytestring-like solution?

I suspect that’s because it’s a simple problem. Once the problem becomes sufficiently complicated, the challenge becomes managing complexity in the implementation rather than the speed of bit twiddling. I suspect Haskell will eclipse C in complex scenarios.

the naive solutions literally don’t work because of the way thunks accumulate in lazy evaluation

I don’t think we’ve seen any problem in this thread that was related to thunks accumulating, have we? There are indeed some strictness annotations, but they are about specifying whether individual thunks are allocated or not, not about whether long chains of thunks are allocated.

I remember one case (resolved at Optimiser performance problems - #30 by tomjaguarpaw). Could you share the two others?

The suggestion to use force from deepseq is absolutely wrong, and adopting the strict container structure (I wouldn’t call it “specialised”, I’d just call it “correct”) is absolutely the right solution (i.e. make invalid laziness unrepresentable). I am willing to die on this hill! In brief: if you don’t want thunks to appear in your data type then choose a data type that doesn’t allow thunks to appear.
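
In code, “make invalid laziness unrepresentable” can be as small as putting strict fields on the accumulator. The type below is only a sketch (assuming the usual per-station min/max/sum/count summary for 1BRC): the bangs mean a thunk can never hide inside the record, so no force or deepseq is needed at the use sites, and storing it in Data.Map.Strict keeps the whole table thunk-free.

-- Strict fields: evaluating a Stats value forces all four components, so an
-- accumulator of this type cannot silently pile up thunks.
data Stats = Stats
  { sMin   :: !Double
  , sMax   :: !Double
  , sSum   :: !Double
  , sCount :: !Int
  }

addObs :: Double -> Stats -> Stats
addObs x (Stats lo hi s n) = Stats (min lo x) (max hi x) (s + x) (n + 1)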

No, that’s not right. I explained that the approach to AD that the ad package takes is doomed to be slow on vector problems. That has nothing to do with thunks.

It’s a curious thing. There’s one large group of people (mostly outside the Haskell world) that thinks that “Haskell is hard to use because laziness and thunks make it hard to reason about performance and space usage” and another group of people (mostly Haskell insiders) who think “Haskell is great, laziness has all sorts of benefits, and no problems in practice”. They’re both wrong! Laziness has severe implications when used incorrectly, but the solution is simple, not difficult: make invalid laziness unrepresentable. Unfortunately this message has not really percolated very far in the Haskell world. I will keep on trying.

To reiterate, Haskell will be better for more complex problems, not worse! After all, the simplest problems will always be fastest with hand-crafted assembly.

5 Likes

I think what we want is a language L that makes it easy to encapsulate performant code in reusable libraries, such that users can use performant library functions instead of rolling their own implementation. This way, only the library authors need to know how to write performant L code, not the users of that library.

Of course, in order to write these libraries, we

  1. have to start from a concrete motivation such as the 1brc problem
  2. come up with a fast solution (e.g., using hacky W8# in parseLine, perhaps unboxed byte strings; see the sketch after this list)
  3. generalise this solution into a library.
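
To make (2) concrete, the flavour of such code is roughly the following (a sketch, not the thread’s actual parseLine, and one level above the W8# tricks): read a temperature like "-12.3" as an Int number of tenths, with no intermediate String and no Double parsing.

import qualified Data.ByteString.Char8 as BC

-- Parse "-12.3" to -123 (tenths of a degree). The fast solutions go one level
-- lower still and index raw Word8s (W8#), but the idea is the same.
parseTemp :: BC.ByteString -> Int
parseTemp bs = case BC.uncons bs of
  Just ('-', rest) -> negate (digits rest)
  _                -> digits bs
  where
    digits = BC.foldl' step 0
    step acc '.' = acc
    step acc c   = acc * 10 + (fromEnum c - fromEnum '0')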

We are currently doing (2) for L=Haskell. If we can’t do (2), i.e., the solution is not fast enough, then we should think about how our language/its compilation can be improved in order to make it fast enough. (I like thinking about this.)
Only then can we think about (3), which is another hard problem. Evidence is the abundance of INLINE pragmas in high-perf libraries that result in code size explosion.
I’m increasingly convinced that a reliable solution to (3) can only be achieved by libraries using staged compilation/Typed Template Haskell, but in a way that users of that library can stay mostly oblivious to staging (so no syntactic noise through splicing/quoting). @AndrasKovacs knows much about this.
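
For readers unfamiliar with the technique, the textbook staging example looks like this (assuming a reasonably recent GHC, where typed quotations have type Code Q a): the exponent is known at compile time, so the spliced code is a straight chain of multiplications that needs no INLINE pragmas or Simplifier heroics to be fast.

{-# LANGUAGE TemplateHaskell #-}
import Language.Haskell.TH (Code, Q)

-- Staged power: the recursion runs at compile time; the generated code does not recurse.
power :: Int -> Code Q (Int -> Int)
power 0 = [|| \_ -> 1 ||]
power n = [|| \x -> x * $$(power (n - 1)) x ||]

-- In another module:  cube = $$(power 3)
-- which expands to nested multiplications, roughly \x -> x * (x * (x * 1)).

The open problem alluded to above is hiding the $$(...) splice from the library user, so that a staged library reads like an ordinary one.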

C (C++, Fortran) is a language that makes (2) easy, but (3) hard. Hence, performant C code needs an expert C programmer. L should do better, enabling performant L code written by novice L programmers using the right libraries.

When using SQL to solve the problem, SQL is like the interface of a C library (DuckDB); its implementation is certainly neither pretty nor easily understood by its users. However, DuckDB is but one library, and it is not entirely trivial to interface with custom data types/your problem domain in L. I’d rather have the means in L to define a competitive DBMS as a library, first class (within a factor of 2 of C, at most). But that needs us to solve (2) and (3).


So what’s causing problems with (2)? Why isn’t it “fast”, single-threaded? That’s what we need to figure out. (Parallelism is another problem.)

Edit: Andras claims below that the fastest single-threaded solution already performs like straightforward C solutions, so we might be “fast enough”. Not sure what the fastest C solutions do better; perhaps vectorisation etc., which is kind of a pain point for Haskell (but it might not be for a staged solution).

3 Likes

If we’re really unable to resolve (2), that’s very surprising to me. My expectation is that it’s possible to write low-level Haskell code that is as fast as the “same approach implemented in C”, and that it will typically be possible to wrap that code in a functional/declarative API that provides a reasonable level of reuse.

1 Like

Our single-threaded version earlier in this thread had the same speed as straightforward C solutions.

3 Likes

Ah good. At more than 170 comments it’s hard to keep track of everything! (And Discourse doesn’t even load the whole page at once – it uses “progressive” (or “regressive”?) loading.)

Yes, the “fastest” C solution above is mostly about 256-bit SIMD.

But I would prefer not to have to rely on (T)TH for performance; GHC is unbearably slow already.

1 Like

I think (T)TH being slow is not inherent to the approach; it’s just that its specification/implementation needs love. I would even say that good support for staging should yield a much faster compiler, because we can dial down the number of Simplifier runs and its complexity.

4 Likes