The 1BRC (One Billion Row Challenge) is a fun challenge in data processing that was originally restricted to JVM languages. Essentially, you have to compute the average and the extreme values of a billion temperature recordings, grouped by weather station.
Results are determined by running the program on a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM).
Programs are run from a RAM disk (i.e. the IO overhead for loading the file from disk is not relevant), using 8 cores of the machine.
(1) The naive approach would be to keep the per-station stats in a hash table and rely on O(1) lookup/update, and (2) use map/reduce to utilise all 8 cores.
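For concreteness, here is a minimal sketch of both parts in Haskell, assuming the input has already been split into chunks of lines; the names (Stats, step, summarise) and the use of unordered-containers are my own choices, not from any posted solution:

```haskell
{-# LANGUAGE DeriveGeneric, DeriveAnyClass #-}
import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as HM    -- unordered-containers
import Control.DeepSeq (NFData)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Data.List (foldl')
import GHC.Generics (Generic)

data Stats = Stats !Double !Double !Double !Int   -- min, max, sum, count
  deriving (Generic, NFData)

merge :: Stats -> Stats -> Stats
merge (Stats lo hi s n) (Stats lo' hi' s' n') =
  Stats (min lo lo') (max hi hi') (s + s') (n + n')

-- (1): hash-table style amortised O(1) lookup/update for one line.
step :: HM.HashMap BS.ByteString Stats -> BS.ByteString
     -> HM.HashMap BS.ByteString Stats
step m line =
  let (station, rest) = BS.break (== ';') line
      t = read (BS.unpack (BS.drop 1 rest)) :: Double
  in HM.insertWith merge station (Stats t t t 1) m

-- (2): map/reduce: fold one table per chunk in parallel, then merge.
summarise :: [[BS.ByteString]] -> HM.HashMap BS.ByteString Stats
summarise = foldl' (HM.unionWith merge) HM.empty
          . parMap rdeepseq (foldl' step HM.empty)
```

The fiddly part this glosses over is producing the chunks: you have to split the file on line boundaries before handing the pieces to parMap.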
I suspect mawk + gnu parallel would be about 10 lines of code and yet be very difficult to beat.
As a point of reference, cat measurements.txt > /dev/null takes about 12 seconds on my machine. I'd expect this to be significantly below 1.53s on machines where 1.53s is possible. I also do not have 8 physical cores, so for all intents and purposes I couldn't benchmark against 1.53s.
It isn't obvious to me how a machine-agnostic benchmark might be established that takes into account both relative processing speed and number of cores, and reports an adjusted elapsed time, but perhaps others know.
I think if the goal is for others in the thread to compare their solutions against each other, that may not work very well because of the multi-core issue. That is, parallelism will be more operative for some solutions than for others. It'd be easy if it were just 1 core, because then we could use time cat measurements.txt > /dev/null as the numeraire and be more or less done with it.
I guess RAM would be an issue too. The measurements.txt file is about 15 GB on Linux, and I can't fit it in RAM, so I couldn't mount it to a RAM disk, for example, to get those sub-second read speeds. If it's not in RAM, then for most of us SSD read speed will be a bottleneck.
That file seems to present an extra difficulty for establishing a uniform benchmark: from briefly looking at the sources, measurements.txt is dynamically generated content, derived from this CSV file of only 44693 entries. So each benchmarking result would be specific to the particular version of measurements.txt being used.
Perhaps this was intentional; perhaps the original authors thought that having a billion entries would ameliorate this issue. Either way, it's something which needs to be considered if the benchmarking results are going to be scrutinised seriously. One option would be to generate e.g. 1000 measurements_NNN.txt files of 1 million entries each, run the benchmark on each file, then use a prearranged calculation on the 1000 small-file results to obtain the "official" one. Otherwise…anyone for hosting one common 15 GB measurements.txt file to serve as a benchmarking reference?
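For what it's worth, the "prearranged calculation" is straightforward as long as each per-file result also records its row count: extremes combine directly, and means combine as a weighted average. A sketch in Haskell, applied per station, with field names made up for illustration:

```haskell
-- Per-station result from one hypothetical measurements_NNN.txt run.
data FileResult = FileResult
  { frMin   :: Double
  , frMax   :: Double
  , frMean  :: Double
  , frCount :: Int
  }

-- Combine the 1000 small-file results into the "official" one.
combine :: [FileResult] -> FileResult
combine rs = FileResult
  { frMin   = minimum (map frMin rs)
  , frMax   = maximum (map frMax rs)
  , frMean  = sum [fromIntegral (frCount r) * frMean r | r <- rs]
                / fromIntegral total
  , frCount = total
  }
  where
    total = sum (map frCount rs)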
…maybe if all PRNGs were equal. But across different languages (and their implementations)? That seems quite optimistic.
But if the benchmark uses its own PRNG (again, I only looked briefly) that would definitely be the simplest option (…for those with 16 GB of RAM to spare ;-).
use this naïve linear congruential generator to calculate a random number each time a nucleotide needs to be selected (don’t cache the random number sequence)
IM = 139968
IA = 3877
IC = 29573

Seed = 42

Random (Max)
    Seed = (Seed * IA + IC) modulo IM
    = Max * Seed / IM
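Transcribed directly into Haskell (assuming an Int seed and a Double result, as in the benchmarks-game spec):

```haskell
-- The LCG above, one step at a time: returns a value in [0, max)
-- together with the new seed.
lcgIM, lcgIA, lcgIC :: Int
lcgIM = 139968
lcgIA = 3877
lcgIC = 29573

lcgRandom :: Double -> Int -> (Double, Int)
lcgRandom mx seed =
  let seed' = (seed * lcgIA + lcgIC) `mod` lcgIM
  in (mx * fromIntegral seed' / fromIntegral lcgIM, seed')

-- An infinite stream of randoms starting from Seed = 42.
lcgStream :: Double -> [Double]
lcgStream mx = go 42
  where
    go s = let (x, s') = lcgRandom mx s in x : go s'
```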
Indeed. So, say we wrote the data generator in Haskell so that it outputs an infinite list using the known PRNG + seed as stated. Then take 1000000000, and bolt the parser + stats calculations directly onto it, thus avoiding any need to mess about with storage at all. That takes RAM out of the equation.
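A self-contained sketch of that idea, never materialising measurements.txt at all; the station names and the temperature scaling below are stand-ins, not what the real generator does:

```haskell
import qualified Data.ByteString.Char8 as BS

-- Same LCG as above, inlined here for self-containment.
lcg :: Int -> Int
lcg s = (s * 3877 + 29573) `mod` 139968

-- An infinite, lazily produced stream of (station, temperature) rows.
rows :: [(BS.ByteString, Double)]
rows = go 42
  where
    stations = map BS.pack ["Abha", "Accra", "Adelaide"]  -- stand-in list
    go s =
      let s1 = lcg s
          s2 = lcg s1
          station = stations !! (s1 `mod` length stations)
          temp    = fromIntegral s2 / 139968 * 199.8 - 99.9
      in (station, temp) : go s2

main :: IO ()
main = mapM_ print (take 5 rows)
  -- the real thing would instead fold the stats over take 1000000000 rows
```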
Why should we (try to) take RAM (or rather, reading from disk) out of the equation? One of the aspects that makes the problem somewhat interesting IMO is that it is sufficiently large that it would not fit in main memory, or would at least have some sort of trouble fitting into memory. Otherwise one might as well repeat a benchmark of 10K entries or so sufficiently often.
That one cannot directly compare the running time you get on your system against the running time of a solution on anyone else's system is a given. That applies to any benchmark. You'll always need to compare against some sort of baseline on your machine anyway.
I've only used the Python generator script and not the (official) Java one, but the values it generates are way too evenly distributed. The min/mean/max results of the billion-row version are always -99.9, something in [-0.1, 0.1], and 99.9.
I would suggest using the CSV with the station names and adding (randomly generated) minimum and maximum values to each station, to be used as bounds for the random temperatures, to get a bit more interesting values.
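A sketch of how those per-station bounds might be used, assuming uniform draws u in [0,1); every constant here is illustrative:

```haskell
type Band = (Double, Double)   -- a station's (min, max) temperature band

-- Assign a station a band once, inside the overall [-99.9, 99.9] range.
mkBand :: Double -> Double -> Band
mkBand u1 u2 =
  let lo = -99.9 + u1 * 120.0        -- station minimum in [-99.9, 20.1)
      hi = lo + 10.0 + u2 * 60.0     -- maximum 10 to 70 degrees above it
  in (lo, min hi 99.9)

-- Every temperature for that station then stays inside its band.
tempIn :: Band -> Double -> Double
tempIn (lo, hi) u = lo + u * (hi - lo)
```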
And being able to have the data file and the intermediate data in RAM (the OS caches it anyway when doing more than one benchmark run) is a prerequisite of the challenge, so you should adapt the size accordingly: 500,000,000 rows with 16 GB RAM, …
To be able to compare your result with the one generated by the Java reference implementation (rounding and maybe sorting of the station strings) you need to run the file
I'm not sure about that. Memory isn't really operative here computationally, because all fast solutions will be streaming solutions. The big difference that memory makes is that in the actual benchmark the file is mounted on a RAM disk, which makes it lightning fast to read. Yet to do that yourself you'd need more than 15 GB of RAM. So I guess my point is: if we wanted to play about with this in Haskell, why not just skip the RAM requirement altogether and lazily evaluate the data directly into the solution.
Well, you asked if it could be machine-agnostic. I don't think it can be, but machines can be fixed, and that is good enough for a start. Same with the dataset: I don't think you need an exact number of rows to compare solutions; 1/4 of a billion is still plenty.
Number of cores is probably the hardest part; I bet the best solution assumes that number. But still, if we don't have a solution that is competitive in not-ideal circumstances, then I doubt we have a solution that is competitive on the required specification.
I wrote a naive (= no mutable data structures) implementation, taking 106 seconds on my machine: 1brc.hs · GitHub
$ ghc-9.8 -threaded -rtsopts 1brc.hs && time ./1brc +RTS -s -N8 -A128M
  943,546,402,696 bytes allocated in the heap
    8,316,733,712 bytes copied during GC
       28,490,928 bytes maximum residency (6 sample(s))
       15,721,296 bytes maximum slop
            16199 MiB total memory in use (0 MiB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0       864 colls,   864 par   13.831s   2.104s     0.0024s    0.0222s
  Gen  1         6 colls,     5 par    0.064s   0.610s     0.1017s    0.5977s

  Parallel GC work balance: 75.73% (serial 0%, perfect 100%)

  TASKS: 18 (1 bound, 17 peak workers (17 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.081s  (  0.081s elapsed)
  MUT     time  664.972s  (103.921s elapsed)
  GC      time   13.895s  (  2.714s elapsed)
  EXIT    time    0.003s  (  0.002s elapsed)
  Total   time  678.950s  (106.719s elapsed)

  Alloc rate    1,418,927,204 bytes per MUT second

  Productivity  97.9% of total user, 97.4% of total elapsed

./1brc +RTS -s -N8 -A128M  662.51s user 16.47s system 634% cpu 1:47.01 total
The profile suggests that Map becomes the bottleneck here.
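For what it's worth, one common next step when a persistent Map dominates the profile is a mutable hash table, although that gives up the "no mutable data structures" constraint above. A hypothetical sketch using the hashtables package (the tuple layout is my own choice):

```haskell
import qualified Data.ByteString.Char8 as BS
import qualified Data.HashTable.IO as H   -- from the hashtables package
import Data.IORef (IORef, newIORef, modifyIORef')

-- (min, max, sum, count) per station, updated in place.
type Table =
  H.BasicHashTable BS.ByteString (IORef (Double, Double, Double, Int))

upsert :: Table -> BS.ByteString -> Double -> IO ()
upsert tbl station t = do
  found <- H.lookup tbl station
  case found of
    Just ref -> modifyIORef' ref $ \(lo, hi, s, n) ->
                  (min lo t, max hi t, s + t, n + 1)
    Nothing  -> H.insert tbl station =<< newIORef (t, t, t, 1)
```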