One Billion Row Challenge in Haskell

I’d add that – and forgive me if I’m wrong – the performant solutions are mostly imperative + mutable, which – as a pleb – strikes me as indicating that it is currently not possible to write a performant declarative (functional, immutable) solution.

If this problem were to be a motivational case for improvement – speaking as a consumer, as opposed to a language researcher – I’d personally prioritise whatever changes would make a fast, immutable, declarative, single-threaded solution possible, over consideration of the parallel aspects.

Here is 1B rows in 25s in 7 LoC of obvious SQL: to the pragmatist, it speaks for itself.

4 Likes

I’m not sure what you mean here. Certainly, the SQL interface to DuckDB is declarative; I’m confident that the implementation is not! It is commonplace in Haskell to take an imperative implementation (written in Haskell or otherwise) and wrap it in a declarative API, so would you be satisfied with such a solution? Would a Haskell implementation in terms of an Opaleye- or Rel8-like typed, declarative API that uses DuckDB on the backend be satisfactory?

If not, would a declarative Haskell implementation in terms of imperative, mutable Haskell (the text, bytestring etc. approach) be satisfactory? Or are you really asking for an implementation that is “declarative all the way down”?

I agree with @tomjaguarpaw here. :slight_smile: At a low enough level, all Haskell code is mutable and procedural, since that’s the physical model used by all supported hardware systems GHC can run on!

The only thing stopping an idiomatic Haskell implementation from being as fast as the fastest C implementation is the appropriate library support. The C FFI exists just for this reason.
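
(As a concrete instance of that pattern, here is a raw binding of C’s memchr, essentially the binding bytestring uses internally for delimiter search; the hot loop runs in C, and a pure, declarative Haskell API is layered on top.)

    {-# LANGUAGE ForeignFunctionInterface #-}

    import Data.Word (Word8)
    import Foreign.C.Types (CInt (..), CSize (..))
    import Foreign.Ptr (Ptr)

    -- void *memchr(const void *s, int c, size_t n): find the first
    -- occurrence of byte c in the n bytes starting at s.
    foreign import ccall unsafe "string.h memchr"
      c_memchr :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)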

To put it another way: right now, it’s clear that no Haskell implementation can match the fastest implementations in other languages (or other computation engines). The pragmatic challenge here is to identify where Haskell’s best-practice libraries have not yet been optimized for this use case, and optimize them appropriately.

…but lol, I guess that’s exactly what you were saying!

That sounds very close to what Tom and I are saying, so we must be on the same page. :slight_smile:

I guess I just focused on the deficiencies in parallel execution because that has always been a topic I’ve found very interesting.

1 Like

There is no such thing as “declarative all the way down”: at some point something has to tell the computer what to do step by step. However, if a language claims to be “declarative” then I expect to solve problems in it (mostly) declaratively. That is the case in relational paradigms, e.g. MiniZinc, SQL, Datalog and Prolog. Every language has an “and then we break the paradigm and do X instead” cut-off; I suppose I’m saying that here, for a very simple problem, that cut-off didn’t take long to reach.

Yes, exactly :slight_smile:. Additionally, though, note that almost all solutions, except for mine and perhaps one other, are by experts. Meanwhile, the naive solutions literally don’t work, because of the way thunks accumulate under lazy evaluation: “only evaluate what you need to” ends up causing no evaluation at all, since the program overflows memory first :slight_smile: When they eventually do work, they are exceptionally slow. I’ve coded Haskell for 4-5 weeks and I’ve run into this same problem in 3 different data-processing use cases (all on this forum).
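
(For anyone unfamiliar with the pitfall: here is a minimal sketch, mine rather than any solution from the thread. A naive lazy fold over a large input builds a chain of unevaluated thunks instead of a running total; the strict variant runs in constant space.)

    import Data.List (foldl')

    -- Naive: foldl builds the thunk (((0 + x1) + x2) + ...) across the
    -- whole input and only collapses it at the end, so memory usage
    -- grows linearly and large inputs overflow.
    leaky :: [Double] -> Double
    leaky = foldl (+) 0

    -- Strict: foldl' forces the accumulator at every step, so the sum
    -- runs in constant space.
    fine :: [Double] -> Double
    fine = foldl' (+) 0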

For example, check out the “Optimiser performance problems” thread, where I had both a memory leak and terrible performance. To resolve the memory leak I was advised by the experts to either use force from deepseq or adopt a specialised strict container structure (which many were unaware of), and I was told that the ad package is basically doomed to be slow because of the way it deals with some aspect of thunk management (I had noticed that it was 2x slower than an analytically calculated gradient for a 2-parameter problem!). Meanwhile, we are talking about <15 LoC of actual Haskell, and a noob like me obviously wonders what that means at scale, or with more complex problems :slight_smile:

2 Likes

Hmm:

…are there any extant proposals to add laziness to the SQL standard? If so, then we could use SQL as a target language (at least it would be portable :-).

1 Like

I’m inclined to agree, but I still don’t understand what you’re asking for. Is an Opaleye-like solution satisfactory? Or a bytestring-like solution?

I suspect that’s because it’s a simple problem. Once the problem becomes sufficiently complicated, the challenge becomes managing complexity in the implementation rather than the speed of bit twiddling. I expect Haskell to eclipse C in complex scenarios.

the naive solutions literally don’t work because of the way thunks accumulate in lazy evaluation

I don’t think we’ve seen any problem in this thread that was related to thunks accumulating, have we? There are indeed some strictness annotations, but they are about specifying whether individual thunks are allocated or not, not about whether long chains of thunks are allocated.

I remember one case (resolved at Optimiser performance problems - #30 by tomjaguarpaw). Could you share the two others?

The suggestion to use force from deepseq is absolutely wrong, and adopting the strict container structure (I wouldn’t call it “specialised”, I’d just call it “correct”) is absolutely the right solution (i.e. make invalid laziness unrepresentable). I am willing to die on this hill! In brief: if you don’t want thunks to appear in your data type then choose a data type that doesn’t allow thunks to appear.
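
(To make the slogan concrete, a minimal sketch of my own: with strict fields, a thunk can never be stored in the structure in the first place, so no force is ever needed.)

    -- Lazy fields: an update like Stats (n + 1) (total + x) stores two
    -- thunks, which pile up over a billion rows.
    data StatsLazy = StatsLazy Int Double

    -- Strict fields: both fields are forced whenever a value is
    -- constructed, so invalid laziness is unrepresentable.
    data Stats = Stats !Int !Double

    observe :: Stats -> Double -> Stats
    observe (Stats n total) x = Stats (n + 1) (total + x)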

No, that’s not right. I explained that the approach to AD that the ad package takes is doomed to be slow on vector problems. That has nothing to do with thunks.

It’s a curious thing. There’s one large group of people (mostly outside the Haskell world) that thinks that “Haskell is hard to use because laziness and thunks make it hard to reason about performance and space usage” and another group of people (mostly Haskell insiders) who think “Haskell is great, laziness has all sorts of benefits, and no problems in practice”. They’re both wrong! Laziness has severe implications when used incorrectly, but the solution is simple, not difficult: make invalid laziness unrepresentable. Unfortunately this message has not really percolated very far in the Haskell world. I will keep on trying.

To reiterate, Haskell will be better for more complex problems, not worse! After all, the simplest problems will always be fastest in hand-crafted assembly.

5 Likes

I think what we want is a language L that makes it easy to encapsulate performant code in reusable libraries, such that users can use performant library functions instead of rolling their own implementation. This way, only the library authors need to know how to write performant L code, not the users of that library.

Of course, in order to write these libraries, we

  1. have to start from a concrete motivation such as the 1brc problem
  2. come up with a fast solution (e.g., using hacky W8# in parseLine, perhaps unboxed byte strings; see the sketch after this list)
  3. generalise this solution into a library.
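
(As an illustration of the kind of low-level code step 2 involves, here is a sketch of mine, not the thread’s actual parseLine: splitting a "station;measurement" line by indexing raw bytes, with no intermediate lists or copies.)

    {-# LANGUAGE BangPatterns #-}

    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Unsafe as BU

    -- Split "StationName;-12.3" at the first ';' (byte 0x3B) by scanning
    -- raw Word8s; unsafeTake/unsafeDrop only adjust offsets, so nothing
    -- is copied.
    splitStation :: BS.ByteString -> (BS.ByteString, BS.ByteString)
    splitStation bs = go 0
      where
        !n = BS.length bs
        go !i
          | i >= n                      = (bs, BS.empty)
          | BU.unsafeIndex bs i == 0x3B = (BU.unsafeTake i bs, BU.unsafeDrop (i + 1) bs)
          | otherwise                   = go (i + 1)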

We are currently doing (2) for L = Haskell. If we can’t do (2), i.e. the solution is not fast enough, then we should think about how our language or its compilation can be improved in order to make it fast enough. (I like thinking about this.)
Only then can we think about (3), which is another hard problem; evidence for that is the abundance of INLINE pragmas in high-performance libraries, which results in code-size explosion.
I’m increasingly convinced that a reliable solution to (3) can only be achieved by libraries using staged compilation/Typed Template Haskell, but in a way that users of that library can stay mostly oblivious to staging (so no syntactic noise through splicing/quoting). @AndrasKovacs knows much about this.
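
(For readers who haven’t seen staging before, here is the classic toy example, a sketch of mine rather than anything from this thread: the recursion runs at compile time, so the residual program is a flat chain of multiplications with no loop or call overhead.)

    {-# LANGUAGE TemplateHaskell #-}

    import Language.Haskell.TH.Syntax (Code, Q)

    -- power n is an ordinary compile-time Int; only x survives into the
    -- generated code.
    power :: Int -> Code Q Int -> Code Q Int
    power 0 _ = [|| 1 ||]
    power n x = [|| $$x * $$(power (n - 1) x) ||]

    -- At a use site (necessarily in another module):
    --   \y -> $$(power 3 [|| y ||])   ===>   \y -> y * (y * (y * 1))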

C (likewise C++ and Fortran) is a language that makes (2) easy, but (3) hard. Hence, performant C code needs an expert C programmer. L should do better, enabling performant L code to be written by novice L programmers using the right libraries.

When using SQL to solve the problem, SQL is like the interface of a C library (DuckDB); its implementation is certainly not pretty, nor easily understood by its users. However, DuckDB is but one library, and it is not entirely trivial to interface it with custom data types from your problem domain in L. I’d rather have the means in L to define a competitive DBMS as a library, first class (within a factor of 2 of C, at most). But that needs us to solve (2) and (3).


So what’s causing problems with (2)? Why isn’t it “fast”, single-threaded? That’s what we need to figure out. (Parallelism is another problem.)

Edit: Andras claims below that the fastest single-threaded solution already performs like straightforward C solutions, so we might be “fast enough”. I’m not sure what the fastest C solutions do better; perhaps vectorisation etc., which is something of a pain point for Haskell (but it might not be for a staged solution).

3 Likes

If we’re really unable to resolve (2), that would be very surprising to me. My expectation is that it’s possible to write low-level Haskell code that is as fast as the “same approach implemented in C”, and that it will typically be possible to wrap that code in a functional/declarative API that provides a reasonable level of reuse.

1 Like

Our single-threaded version earlier in this thread had the same speed as straightforward C solutions.

3 Likes

Ah good. At more than 170 comments it’s hard to keep track of everything! (And Discourse doesn’t even load the whole page at once – it uses “progressive” (or “regressive”?) loading.)

Yes, the “fastest” C solution above is mostly about 256-bit SIMD.

But I would prefer not to have to rely on (T)TH for performance; GHC is unbearably slow already.

1 Like

I think (T)TH being slow is not inherent to the approach; it’s just that its specification/implementation needs love. I would even say that good support for staging should yield a much faster compiler, because we could dial down the number of Simplifier runs and their complexity.

4 Likes

Hrm… to choose between “library bloat” (having several similar libraries which differ only in their strictness properties), or ye ol’ code bloat (having several similar groups of definitions which differ only in certain other properties)? As noted in this presentation:

https://www.youtube.com/watch?v=2PxsyWqZ5dI

…just imagine if we had to choose between several similar memory management libraries which only differed in their specific properties: thankfully, we don’t! Instead, it’s a part of the implementation.


Suppose multi-threading by default:

https://github.com/ghc-proposals/ghc-proposals/pull/240

had already been implemented - would there really still be as much interest in “wringing every last bit” of performance out of the old single-threaded RTS? Or would we be:

  • advising people on how to migrate their old “unthreaded” code to the new default;

  • and discussing how to get the best multi-threaded performance out of the RTS?


It seems to me that you would merely be exchanging many short simplifier runs for a few much-longer simplifier runs, because of that extra complexity.

Oh I say, that’s a bit strong. deepseq is a tool with uses. While I agree that in that specific problem the strict vector was ultimately the right solution, consider a more complex scenario in which you sometimes want laziness in your data structures, or you don’t yet know which parts of your program are eating all of your memory with thunks and refactoring everything to be strict would be time-consuming. There, deepseq is an important tool to have at your disposal, even if just to try force-ing different things until you figure out what to refactor. (And sometimes invalid laziness has to be representable; Haskell’s type system is very good at letting you express invariants through types, but no type system is perfect at this.)

Certainly when you’re attempting to squeeze your thunks out by forcing everything through a show, as the OP was, force is something you ought to be told about.
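
(Here is a sketch of that diagnostic use, with a toy example of my own: force a suspect accumulator on every step, and if the space leak disappears you have found where the thunks pile up, and therefore which type to make strict.)

    import Control.DeepSeq (force)
    import Data.List (foldl')

    mean :: [Double] -> Double
    mean xs = s / fromIntegral n
      where
        -- foldl' forces the pair constructor, but its fields stay lazy,
        -- so without force both components accumulate thunks. Wrapping
        -- the step in force confirms that the leak lives here.
        (s, n) = foldl' (\acc x -> force (fst acc + x, snd acc + 1)) (0, 0 :: Int) xs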

I updated my gist with @AndrasKovacs’ hash. Now it runs at about the same speed as yours, but uses fewer intrinsics and doesn’t have short string optimisation (only a “short string” hash).

For such a bulk task (parsing CSV at 1 GB/s), not only does every memory access count, but also every instruction. Using a hash that works on 64-bit words instead of bytes means fewer instructions per byte of input.
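
(To illustrate the idea, here is an FNV-1a-style sketch of mine, not the gist’s actual hash: hashing 64-bit words performs one xor and one multiply per 8 bytes of input, instead of per byte.)

    import Data.Bits (xor)
    import Data.List (foldl')
    import Data.Word (Word64)

    -- FNV-1a over 64-bit words: one xor and one multiply per 8 input
    -- bytes, versus one pair of them per byte in the byte-at-a-time
    -- version.
    hashWords :: [Word64] -> Word64
    hashWords = foldl' step 0xcbf29ce484222325   -- FNV-1a offset basis
      where
        step h w = (h `xor` w) * 0x100000001b3   -- FNV-1a 64-bit prime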

2 Likes

deepseq is a time-leak machine! And many other languages have it built into their spec :skull:
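
(In case the “time leak” part is unclear, a toy sketch of mine: force re-traverses the entire structure on every call, even when it is already fully evaluated, so forcing inside a loop turns linear work into quadratic work.)

    import Control.DeepSeq (force)

    -- Each iteration re-traverses the whole accumulated list, so
    -- building n elements costs O(n^2) instead of O(n): a time leak.
    build :: Int -> [Int]
    build n = go n []
      where
        go 0 acc = acc
        go k acc = go (k - 1) (force (k : acc))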

Correct.

Consider let x = head (return x) in 'a':

  • “outside in” evaluation: the result is 'a' as it ought to be, since x isn’t needed here to obtain that result.

  • “inside out” evaluation: the result will always be ⊥ no matter how much memory or how many threads are available.
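
(In runnable form, my transcription of the example:)

    -- "Outside in": the binding is never demanded, so the result is 'a'.
    outsideIn :: Char
    outsideIn = let x = head (return x) :: Char in 'a'

    -- "Inside out": evaluate the binding first. x = head [x] = x, so
    -- this never terminates (GHC's RTS reports it as <<loop>>).
    insideOut :: Char
    insideOut = let x = head (return x) :: Char in x `seq` 'a'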

To alleviate this fundamental problem with “inside out” evaluation, Eager Haskell would periodically and completely restart (reduction of) the program. This “re-scorched earth” style had two competing requirements:

  • avoiding too many restarts, so that the program can actually run at all;

  • avoiding too few restarts, so the program won’t do too much unnecessary work (like trying to evaluate x = head (return x) or other similarly-cyclic definitions).

…a “delicate balancing act”.

We’ve gone off topic, but I invite anyone who sneers at deepseq to take a look at Fix space leak between modules during compilation by MonoidMusician · Pull Request #4517 · purescript/purescript · GitHub and recommend a better way to address that problem. We weren’t happy with the deep infiltration of NFData into our types and would love to hear from someone who can clearly see a less invasive approach.

(ahem) …speaking for myself here: I wasn’t so much “sneering” at (the use of) entities like deepseq, but at a prevailing impression that they (unfortunately) seem to accentuate:

  • that laziness/non-strict semantics are now more of an impediment than an ongoing benefit to Haskell,

  • and little if anything would be lost if Haskell were strict by default.

(Again, this is merely how I’m perceiving the current situation).

So it is to that impression that I say:

  • did someone solve the halting problem?

  • or is Haskell going to eventually be total as well as dependent?

Because without either of those, being strict by default generally means more programs won’t produce a useful result at all. Therefore definitions like deepseq (or compel, as I mentioned here), along with other “strictness modifiers”, should ideally be considered “features of last resort” because of their potential for non-termination.

So how can we get to that ideal? This looks promising:

…if the multi-threaded RTS does have to be overhauled to solve the “N > 4” problem observed by @chreekat and others, can any of Robert Ennals’ research be reused as well, so that comments like:

don’t end up being treated as basic advice when using Haskell?

1 Like

The result first: 3.9s!

11 threads (the number of cores is 10, although 2 of them are not “performance cores”):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

On my computer, -fllvm is still not faster than not using it (though for your version it is at least not slower).

Without LLVM:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

With LLVM (about the same):

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.346 s ±  0.005 s    [User: 32.559 s, System: 1.227 s]
  Range (min … max):    4.340 s …  4.353 s    5 runs

But now to the number of threads:

11 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.934 s ±  0.007 s    [User: 36.062 s, System: 1.502 s]
  Range (min … max):    3.925 s …  3.943 s    5 runs

12 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.951 s ±  0.015 s    [User: 36.132 s, System: 1.523 s]
  Range (min … max):    3.936 s …  3.970 s    5 runs

10 threads (the number of cores, although 2 of them are not “performance cores”):

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.968 s ±  0.013 s    [User: 35.878 s, System: 1.449 s]
  Range (min … max):    3.954 s …  3.982 s    5 runs

14 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      3.991 s ±  0.012 s    [User: 36.293 s, System: 1.555 s]
  Range (min … max):    3.979 s …  4.004 s    5 runs

16 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.029 s ±  0.012 s    [User: 36.438 s, System: 1.589 s]
  Range (min … max):    4.020 s …  4.051 s    5 runs

20 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.100 s ±  0.037 s    [User: 36.761 s, System: 1.631 s]
  Range (min … max):    4.054 s …  4.142 s    5 runs

8 threads:

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      4.344 s ±  0.009 s    [User: 32.554 s, System: 1.222 s]
  Range (min … max):    4.333 s …  4.355 s    5 runs

200 threads:

Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):      6.006 s ±  0.109 s    [User: 41.527 s, System: 3.630 s]
  Range (min … max):    5.921 s …  6.156 s    5 runs

1 Like