One Billion Row Challenge in Haskell

The 1BRC is a fun data-processing challenge that was originally restricted to JVM languages. Essentially, you have to compute the minimum, mean, and maximum temperature per weather station from a text file of one billion recordings.

The benchmark to beat is 1.53 seconds.

@vaibhavsagar's current solution in Haskell: https://github.com/vaibhavsagar/experiments/tree/d3302a76c641d532db2bf1cb3f8793484dfdbfc0/1brc

Links:

Original 1BRC: https://github.com/gunnarmorling/1brc (The One Billion Row Challenge, a fun exploration of how quickly 1B rows from a text file can be aggregated with Java)


Could someone possibly rewrite the test data generation in Haskell? I can't be bothered to install a JDK just for that.


There's also a Python version if you have an interpreter for that lying around: https://github.com/gunnarmorling/1brc/blob/main/src/main/python/create_measurements.py


Here is the gist:

Results are determined by running the program on a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM). Programs are run from a RAM disk (i.e. the I/O overhead for loading the file from disk is not relevant), using 8 cores of the machine.

The naive approach would be to (1) keep the per-station stats in a hash table and rely on O(1) lookup/update, and (2) map/reduce over chunks of the input to utilise all 8 cores.

I suspect mawk + gnu parallel would be about 10 lines of code and yet be very difficult to beat.
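For what it's worth, here is a minimal Haskell sketch of the (1) hash-table-the-stats and (2) merge-per-chunk-results shape from above, assuming the input arrives as strict ByteString lines and using unordered-containers. The names (Stats, aggregate, merge), the single-chunk main, and the slow read-based number parsing are placeholders for illustration, not anyone's actual solution:

import qualified Data.ByteString.Char8 as BC
import qualified Data.HashMap.Strict as HM
import Data.List (foldl')

-- Running per-station statistics: min, max, sum and count.
data Stats = Stats !Double !Double !Double !Int

instance Semigroup Stats where
  Stats lo1 hi1 s1 n1 <> Stats lo2 hi2 s2 n2 =
    Stats (min lo1 lo2) (max hi1 hi2) (s1 + s2) (n1 + n2)

-- (1) hash table the stats: one O(1)-ish upsert per line of the form "Station;12.3".
step :: HM.HashMap BC.ByteString Stats -> BC.ByteString -> HM.HashMap BC.ByteString Stats
step m line =
  let (station, rest) = BC.break (== ';') line
      t = read (BC.unpack (BC.drop 1 rest)) :: Double   -- placeholder parser, far too slow for real use
  in  HM.insertWith (<>) station (Stats t t t 1) m

-- One chunk of lines folded into a map, and (2) the per-chunk maps merged afterwards.
aggregate :: [BC.ByteString] -> HM.HashMap BC.ByteString Stats
aggregate = foldl' step HM.empty

merge :: [HM.HashMap BC.ByteString Stats] -> HM.HashMap BC.ByteString Stats
merge = foldl' (HM.unionWith (<>)) HM.empty

main :: IO ()
main = do
  ls <- BC.lines <$> BC.readFile "measurements.txt"
  let stats = merge [aggregate ls]   -- a single "chunk" here; real code would split and fold chunks in parallel
  mapM_ (\(st, Stats lo hi s n) ->
           putStrLn (BC.unpack st ++ ": " ++ show lo ++ "/"
                     ++ show (s / fromIntegral n) ++ "/" ++ show hi))
        (HM.toList stats)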


As a point of reference, cat measurements.txt > /dev/null takes about 12 seconds on my machine. I'd expect that to be well below 1.53s on the machines where 1.53s is achievable. I also do not have 8 physical cores 🙂 So, for all intents and purposes, I couldn't benchmark against 1.53s.

It isn't obvious to me how a machine-agnostic benchmark might be established that takes into account both relative processing speed and number of cores and reports an adjusted elapsed time, but perhaps others know.

Benchmarks measure the performance of the whole setup, so in this case both the program and the execution environment.

Maybe just run the reference solution on your machine and compare against that.


I think if the goal is that others in the thread can compare their solutions against each other, that may not work very well because of the multi-core issue. That is, parallelism will matter more for some solutions than for others. It'd be easy if it were just 1 core, because then we could use time cat measurements.txt > /dev/null as the numeraire and be more or less done with it.

I guess RAM would be an issue too. The measurements.txt file is about 15 GB on Linux, and I can't fit it in RAM, so I couldn't mount it on a RAM disk, for example, to get those sub-second read speeds. If it's not in RAM, then for most of us SSD read speed will be the bottleneck.


That file seems to present an extra difficulty for establishing a uniform benchmark: from briefly looking at the sources, measurements.txt is dynamically generated content, derived from this CSV file of only 44693 entries. So each benchmarking result would be specific to the particular version of measurements.txt being used.

Perhaps this was intentional; perhaps the original authors thought that having a billion entries would ameliorate this issue. Either way, it's something which needs to be considered if the benchmarking results are going to be scrutinised seriously. One option would be to generate e.g. 1000 measurements_NNN.txt files of 1 million entries each, run the benchmark on each file, then use a prearranged calculation on the 1000 small-file results to obtain the "official" one. Otherwise…anyone for hosting one common 15 GB measurements.txt file to serve as a benchmarking reference?
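If anyone goes that route, the prearranged calculation is simple as long as each small-file summary carries (sum, count) rather than a mean. A sketch, with a purely illustrative Summary record:

-- Per-station summary of one measurements_NNN.txt file.
data Summary = Summary
  { sMin   :: !Double
  , sMax   :: !Double
  , sSum   :: !Double
  , sCount :: !Int
  }

-- Min and max compose directly; the mean has to be carried as (sum, count),
-- since averaging the 1000 per-file means would only be correct if every
-- file happened to contain the same number of rows for that station.
combine :: Summary -> Summary -> Summary
combine a b = Summary (min (sMin a) (sMin b))
                      (max (sMax a) (sMax b))
                      (sSum a + sSum b)
                      (sCount a + sCount b)

mean :: Summary -> Double
mean s = sSum s / fromIntegral (sCount s)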

I think the solution to benchmarking on randomly generated data like this is to fix the seed of the random number generator.

…maybe if all PRNGs were equal. But across different languages (and their implementations)? That seems quite optimistic.

But if the benchmark uses its own PRNG (again, only looked briefly) that would definitely be the simplest option (…for those with 16 GB of RAM to spare ;-).


It’s easy to specify that a simple linear congruential generator should be used like the fasta benchmark from the benchmarks game does:

use this naïve linear congruential generator to calculate a random number each time a nucleotide needs to be selected (don’t cache the random number sequence)

IM = 139968
IA = 3877
IC = 29573
Seed = 42
 	
Random (Max)
   Seed = (Seed * IA + IC) modulo IM
   = Max * Seed / IM

Usually something like that is random enough.

Indeed. So, say we wrote the data generator in Haskell so that it outputs an infinite list using the stated PRNG and seed. Then take 1000000000, and bolt the parser + stats calculations directly onto it, avoiding any need to mess about with storage at all. That takes RAM out of the equation.

…which translates approximately to:

-- with im, ia, ic defined as the constants above
random :: Double -> Int -> (Double, Int)
random maxVal seed =
    let seed' = (seed * ia + ic) `mod` im
    in  (maxVal * fromIntegral seed' / fromIntegral im,
         seed')

…?
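A self-contained toy version of the whole "generate lazily, fold the stats straight off the stream" idea could look like the following. The five-station list, the temperature range, and the simplistic Data.Map accumulator are made up for illustration; this is neither the official generator nor anyone's posted solution:

import qualified Data.Map.Strict as M
import Data.List (foldl')

im, ia, ic :: Int
im = 139968
ia = 3877
ic = 29573

-- One step of the fasta LCG: a value in [0, maxVal) plus the next seed.
lcg :: Double -> Int -> (Double, Int)
lcg maxVal seed =
  let seed' = (seed * ia + ic) `mod` im
  in  (maxVal * fromIntegral seed' / fromIntegral im, seed')

-- Made-up station names; the real generator draws from a CSV of ~44k entries.
stations :: [String]
stations = ["Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"]

-- Infinite lazy stream of (station, temperature) pairs.
measurements :: Int -> [(String, Double)]
measurements seed0 =
  let (i, s1) = lcg (fromIntegral (length stations)) seed0
      (t, s2) = lcg 199.8 s1
  in  (stations !! floor i, t - 99.9) : measurements s2   -- temperatures in [-99.9, 99.9)

-- Fold the first n measurements straight into (min, max, sum, count) per station;
-- nothing is ever written to or read from disk.
summarise :: Int -> M.Map String (Double, Double, Double, Int)
summarise n = foldl' step M.empty (take n (measurements 42))
  where
    step m (st, t) = M.insertWith comb st (t, t, t, 1) m
    comb (lo1, hi1, s1, c1) (lo2, hi2, s2, c2) =
      (min lo1 lo2, max hi1 hi2, s1 + s2, c1 + c2)

main :: IO ()
main = mapM_ print (M.toList (summarise 1000000))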


Why should we (try to) take RAM (or rather, reading from disk) out of the equation? One of the aspects that makes the problem somewhat interesting IMO is that it is large enough that it would not fit in main memory, or would at least have some trouble fitting into memory. Otherwise one might as well repeat a benchmark of 10K entries or so sufficiently often.

That one cannot directly compare the running time you get on your system against the running time of a solution on anyone else's system is a given. That applies to any benchmark. You'll always need to compare against some sort of baseline on your machine anyway.

I've only used the Python generator script and not the (official) Java one, but the values it generates are way too evenly distributed. With the billion-row version the results are always -99.9 for the minimum, something in [-0.1, 0.1] for the mean, and 99.9 for the maximum.

I would suggest using the CSV with the station names

and adding (randomly generated) minimum and maximum values per station, which the random temperatures should then stay within, to get a bit more interesting values.

And being able to have the data file and the intermediate data in RAM (the OS caches it anyway when doing more than one benchmark run) is a prerequisite of the challenge, so you should adapt the size accordingly, e.g. 500,000,000 rows with 16 GB of RAM (the full billion-row file is about 15 GB of text, so half of that still leaves headroom), …

To be able to compare your result with the one produced by the Java reference implementation (the rounding, and maybe the sorting of the station strings), you need to run it on the same data file.
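A toy sketch of that "per-station min/max band" idea, using System.Random. The band ranges, station names, and uniform draws are all made up here; the official Java generator may well shape the values differently:

import System.Random (StdGen, mkStdGen, randomR)

-- A hypothetical climate band per station: (coldest, warmest) plausible reading.
type Band = (Double, Double)

-- Give each station a random band up front; a stand-in for deriving it from the CSV.
assignBands :: [String] -> StdGen -> ([(String, Band)], StdGen)
assignBands []     g = ([], g)
assignBands (s:ss) g =
  let (lo, g1)    = randomR (-40, 15) g    -- lower end of the band
      (width, g2) = randomR (20, 45) g1    -- spread between min and max
      (rest, g3)  = assignBands ss g2
  in  ((s, (lo, lo + width)) : rest, g3)

-- Draw one temperature uniformly inside the station's band.
drawTemp :: Band -> StdGen -> (Double, StdGen)
drawTemp = randomR

main :: IO ()
main = do
  let (bands, g) = assignBands ["Hamburg", "Bulawayo", "Palembang"] (mkStdGen 42)
      (t, _)     = drawTemp (snd (head bands)) g
  print bands
  print t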


I'm not sure about that. Memory isn't really operative here computationally, because all fast solutions will be streaming solutions. The big difference that memory makes is that in the actual benchmark the file is mounted on a RAM disk, which makes it lightning fast to read. Yet to do that yourself you'd need at least >15 GB of RAM. So I guess my point is, if we wanted to play about with this in Haskell, why not just skip the RAM requirement altogether and lazily evaluate the data directly into the solution.

I would expect that some of the fastest solutions would leverage parallelism and start iterating from different places in the file.
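A sketch of that chunk-splitting part, assuming the whole file is read as one strict ByteString (so this particular version still needs it to fit in RAM) and using parMap from the parallel package. The chunk count of 8 and the per-chunk "work" (a plain line count) are placeholders for the real aggregation:

import qualified Data.ByteString.Char8 as BC
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Cut the buffer into roughly n pieces, moving each cut forward to the next '\n'
-- so no record is torn in half and every worker starts on a record boundary.
chunks :: Int -> BC.ByteString -> [BC.ByteString]
chunks n bs = go bs
  where
    target = max 1 (BC.length bs `div` n)
    go b
      | BC.null b             = []
      | BC.length b <= target = [b]
      | otherwise =
          case BC.elemIndex '\n' (BC.drop target b) of
            Nothing -> [b]
            Just i  -> let (piece, rest) = BC.splitAt (target + i + 1) b
                       in  piece : go rest

main :: IO ()
main = do
  contents <- BC.readFile "measurements.txt"
  let counts = parMap rdeepseq (length . BC.lines) (chunks 8 contents)
  print (sum counts)   -- total number of rows; compile with -threaded to actually use the cores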

Well, you asked if it could be machine-agnostic. I don’t think it can be, but machines can be fixed, and that is good enough for a start. Same with the dataset, I don’t think you need an exact number of rows to compare solutions. 1/4 of a billion is still plenty.

The number of cores is probably the hardest part; I bet the best solutions assume that number. But still, if we don't have a solution that is competitive in non-ideal circumstances, then I doubt we have one that is competitive on the required specification.

Because the original challenge does take RAM and reading from disk out of the equation, by using a RAM disk and a machine with 128 GB of RAM.


I wrote a naive (= no mutable data structures) implementation, taking 106 seconds on my machine: 1brc.hs · GitHub

$ ghc-9.8 -threaded -rtsopts 1brc.hs && time ./1brc +RTS -s -N8 -A128M
 943,546,402,696 bytes allocated in the heap
   8,316,733,712 bytes copied during GC
      28,490,928 bytes maximum residency (6 sample(s))
      15,721,296 bytes maximum slop
           16199 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       864 colls,   864 par   13.831s   2.104s     0.0024s    0.0222s
  Gen  1         6 colls,     5 par    0.064s   0.610s     0.1017s    0.5977s

  Parallel GC work balance: 75.73% (serial 0%, perfect 100%)

  TASKS: 18 (1 bound, 17 peak workers (17 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.081s  (  0.081s elapsed)
  MUT     time  664.972s  (103.921s elapsed)
  GC      time   13.895s  (  2.714s elapsed)
  EXIT    time    0.003s  (  0.002s elapsed)
  Total   time  678.950s  (106.719s elapsed)

  Alloc rate    1,418,927,204 bytes per MUT second

  Productivity  97.9% of total user, 97.4% of total elapsed

./1brc +RTS -s -N8 -A128M  662.51s user 16.47s system 634% cpu 1:47.01 total

Profile suggests that Map becomes a bottleneck here.
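For what it's worth, one common next step here is to swap the Map for a mutable hash table keyed on the raw station bytes. A rough single-threaded sketch using the hashtables package; the simplistic read-based parsing and the file name are placeholders, and this is not the code from the gist above:

import qualified Data.ByteString.Char8 as BC
import qualified Data.HashTable.IO as H

-- Strict fields so no thunks accumulate behind the table: min, max, sum, count.
data Stats = Stats !Double !Double !Double !Int

type Table = H.BasicHashTable BC.ByteString Stats

upsert :: Table -> BC.ByteString -> Double -> IO ()
upsert tbl station t = do
  old <- H.lookup tbl station
  case old of
    Nothing                -> H.insert tbl station (Stats t t t 1)
    Just (Stats lo hi s n) -> H.insert tbl station
                                (Stats (min lo t) (max hi t) (s + t) (n + 1))

main :: IO ()
main = do
  tbl <- H.new :: IO Table
  ls  <- BC.lines <$> BC.readFile "measurements.txt"
  mapM_ (\line ->
           let (station, rest) = BC.break (== ';') line
               t = read (BC.unpack (BC.drop 1 rest)) :: Double   -- placeholder parser
           in  upsert tbl station t)
        ls
  results <- H.toList tbl
  mapM_ (\(st, Stats lo hi s n) ->
           putStrLn (BC.unpack st ++ ": " ++ show lo ++ "/"
                     ++ show (s / fromIntegral n) ++ "/" ++ show hi))
        results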
