One Billion Row Challenge in Haskell

The 1BRC is a fun data-processing challenge that was originally restricted to JVM languages. Essentially, you have to compute the minimum, mean, and maximum temperature per weather station from a text file of one billion recordings.

The benchmark to beat is 1.53 seconds.

@vaibhavsagar's current solution in Haskell: https://github.com/vaibhavsagar/experiments/tree/d3302a76c641d532db2bf1cb3f8793484dfdbfc0/1brc

Links:

Original 1BRC: https://github.com/gunnarmorling/1brc (The One Billion Row Challenge, a fun exploration of how quickly 1B rows from a text file can be aggregated with Java)


Could someone possibly rewrite the test data generation in Haskell? I can't be bothered to install a JDK just for that.


There's also a Python version if you have an interpreter for that lying around: https://github.com/gunnarmorling/1brc/blob/main/src/main/python/create_measurements.py


Here is the gist:

Results are determined by running the program on a Hetzner AX161 dedicated server (32 core AMD EPYC™ 7502P (Zen2), 128 GB RAM). Programs are run from a RAM disk (i.e. the I/O overhead for loading the file from disk is not relevant), using 8 cores of the machine.

The naive approach would be to (1) keep the per-station stats in a hash table and rely on O(1) lookup/update, and (2) map/reduce over chunks of the input to utilise all 8 cores.

I suspect mawk + gnu parallel would be about 10 lines of code and yet be very difficult to beat.
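For what it's worth, here is a minimal Haskell sketch of the (1) hash-table-the-stats and (2) merge-per-chunk-results shape from above, assuming the input arrives as strict ByteString lines and using unordered-containers. The names (Stats, aggregate, merge), the single-chunk main, and the slow read-based number parsing are placeholders for illustration, not anyone's actual solution:

import qualified Data.ByteString.Char8 as BC
import qualified Data.HashMap.Strict as HM
import Data.List (foldl')

-- Running per-station statistics: min, max, sum and count.
data Stats = Stats !Double !Double !Double !Int

instance Semigroup Stats where
  Stats lo1 hi1 s1 n1 <> Stats lo2 hi2 s2 n2 =
    Stats (min lo1 lo2) (max hi1 hi2) (s1 + s2) (n1 + n2)

-- (1) hash table the stats: one O(1)-ish upsert per line of the form "Station;12.3".
step :: HM.HashMap BC.ByteString Stats -> BC.ByteString -> HM.HashMap BC.ByteString Stats
step m line =
  let (station, rest) = BC.break (== ';') line
      t = read (BC.unpack (BC.drop 1 rest)) :: Double   -- placeholder parser, far too slow for real use
  in  HM.insertWith (<>) station (Stats t t t 1) m

-- One chunk of lines folded into a map, and (2) the per-chunk maps merged afterwards.
aggregate :: [BC.ByteString] -> HM.HashMap BC.ByteString Stats
aggregate = foldl' step HM.empty

merge :: [HM.HashMap BC.ByteString Stats] -> HM.HashMap BC.ByteString Stats
merge = foldl' (HM.unionWith (<>)) HM.empty

main :: IO ()
main = do
  ls <- BC.lines <$> BC.readFile "measurements.txt"
  let stats = merge [aggregate ls]   -- a single "chunk" here; real code would split and fold chunks in parallel
  mapM_ (\(st, Stats lo hi s n) ->
           putStrLn (BC.unpack st ++ ": " ++ show lo ++ "/"
                     ++ show (s / fromIntegral n) ++ "/" ++ show hi))
        (HM.toList stats)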


As a point of reference, cat measurements.txt > /dev/null takes about 12 seconds on my machine. I'd expect that to be well below 1.53s on the machines where 1.53s is achievable. I also do not have 8 physical cores 🙂 So, for all intents and purposes, I couldn't benchmark against 1.53s.

It isn't obvious to me how a machine-agnostic benchmark might be established that takes into account both relative processing speed and number of cores and reports an adjusted elapsed time, but perhaps others know.

Benchmarks measure the performance of the whole setup, so in this case both the program and the execution environment.

Maybe just run the reference solution on your machine and compare against that.


I think if the goal is that others in the thread can compare their solutions against each other, that may not work very well because of the multi-core issue. That is, parallelism will matter more for some solutions than for others. It'd be easy if it were just 1 core, because then we could use time cat measurements.txt > /dev/null as the numeraire and be more or less done with it.

I guess RAM would be an issue too. The measurements.txt file is about 15 GB on Linux, and I can't fit it in RAM, so I couldn't mount it on a RAM disk, for example, to get those sub-second read speeds. If it's not in RAM, then for most of us SSD read speed will be the bottleneck.


That file seems to present an extra difficulty for establishing a uniform benchmark: from briefly looking at the sources, measurements.txt is dynamically generated content, derived from this CSV file of only 44693 entries. So each benchmarking result would be specific to the particular version of measurements.txt being used.

Perhaps this was intentional; perhaps the original authors thought that having a billion entries would ameliorate this issue. Either way, it's something which needs to be considered if the benchmarking results are going to be scrutinised seriously. One option would be to generate e.g. 1000 measurements_NNN.txt files of 1 million entries each, run the benchmark on each file, then use a prearranged calculation on the 1000 small-file results to obtain the "official" one. Otherwise…anyone for hosting one common 15 GB measurements.txt file to serve as a benchmarking reference?
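If anyone goes that route, the prearranged calculation is simple as long as each small-file summary carries (sum, count) rather than a mean. A sketch, with a purely illustrative Summary record:

-- Per-station summary of one measurements_NNN.txt file.
data Summary = Summary
  { sMin   :: !Double
  , sMax   :: !Double
  , sSum   :: !Double
  , sCount :: !Int
  }

-- Min and max compose directly; the mean has to be carried as (sum, count),
-- since averaging the 1000 per-file means would only be correct if every
-- file happened to contain the same number of rows for that station.
combine :: Summary -> Summary -> Summary
combine a b = Summary (min (sMin a) (sMin b))
                      (max (sMax a) (sMax b))
                      (sSum a + sSum b)
                      (sCount a + sCount b)

mean :: Summary -> Double
mean s = sSum s / fromIntegral (sCount s)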

I think the solution to benchmarking on randomly generated data like this is to fix the seed of the random number generator.

…maybe if all PRNGs were equal. But across different languages (and their implementations)? That seems quite optimistic.

But if the benchmark uses its own PRNG (again, only looked briefly) that would definitely be the simplest option (…for those with 16 GB of RAM to spare ;-).


It’s easy to specify that a simple linear congruential generator should be used like the fasta benchmark from the benchmarks game does:

use this naïve linear congruential generator to calculate a random number each time a nucleotide needs to be selected (don’t cache the random number sequence)

IM = 139968
IA = 3877
IC = 29573
Seed = 42
 	
Random (Max)
   Seed = (Seed * IA + IC) modulo IM
   = Max * Seed / IM

Usually something like that is random enough.

Indeed. So, say we wrote the data generator in Haskell so that it outputs an infinite list using the stated PRNG and seed. Then take 1000000000, and bolt the parser + stats calculations directly onto it, avoiding any need to mess about with storage at all. That takes RAM out of the equation.

…which translates approximately to:

-- with im, ia, ic defined as the constants above
random :: Double -> Int -> (Double, Int)
random maxVal seed =
    let seed' = (seed * ia + ic) `mod` im
    in  (maxVal * fromIntegral seed' / fromIntegral im,
         seed')

…?
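A self-contained toy version of the whole "generate lazily, fold the stats straight off the stream" idea could look like the following. The five-station list, the temperature range, and the simplistic Data.Map accumulator are made up for illustration; this is neither the official generator nor anyone's posted solution:

import qualified Data.Map.Strict as M
import Data.List (foldl')

im, ia, ic :: Int
im = 139968
ia = 3877
ic = 29573

-- One step of the fasta LCG: a value in [0, maxVal) plus the next seed.
lcg :: Double -> Int -> (Double, Int)
lcg maxVal seed =
  let seed' = (seed * ia + ic) `mod` im
  in  (maxVal * fromIntegral seed' / fromIntegral im, seed')

-- Made-up station names; the real generator draws from a CSV of ~44k entries.
stations :: [String]
stations = ["Hamburg", "Bulawayo", "Palembang", "St. John's", "Cracow"]

-- Infinite lazy stream of (station, temperature) pairs.
measurements :: Int -> [(String, Double)]
measurements seed0 =
  let (i, s1) = lcg (fromIntegral (length stations)) seed0
      (t, s2) = lcg 199.8 s1
  in  (stations !! floor i, t - 99.9) : measurements s2   -- temperatures in [-99.9, 99.9)

-- Fold the first n measurements straight into (min, max, sum, count) per station;
-- nothing is ever written to or read from disk.
summarise :: Int -> M.Map String (Double, Double, Double, Int)
summarise n = foldl' step M.empty (take n (measurements 42))
  where
    step m (st, t) = M.insertWith comb st (t, t, t, 1) m
    comb (lo1, hi1, s1, c1) (lo2, hi2, s2, c2) =
      (min lo1 lo2, max hi1 hi2, s1 + s2, c1 + c2)

main :: IO ()
main = mapM_ print (M.toList (summarise 1000000))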


Why should we (try to) take RAM (or rather, reading from disk) out of the equation? One of the aspects that makes the problem somewhat interesting IMO is that it is large enough that it would not fit in main memory, or would at least have some trouble fitting into memory. Otherwise one might as well repeat a benchmark of 10K entries or so sufficiently often.

That one cannot directly compare the running time you get on your system against the running time of a solution on anyone else's system is a given. That applies to any benchmark. You'll always need to compare against some sort of baseline on your machine anyway.

I've only used the Python generator script and not the (official) Java one, but the values it generates are way too evenly distributed. With the billion-row version the results are always -99.9 for the minimum, something in [-0.1, 0.1] for the mean, and 99.9 for the maximum.

I would suggest using the CSV with the station names

and adding (randomly generated) minimum and maximum values per station, which the random temperatures should then stay within, to get a bit more interesting values.

And being able to have the data file and the intermediate data in RAM (the OS caches it anyway when doing more than one benchmark run) is a prerequisite of the challenge, so you should adapt the size accordingly, e.g. 500,000,000 rows with 16 GB of RAM (the full billion-row file is about 15 GB of text, so half of that still leaves headroom), …

To be able to compare your result with the one produced by the Java reference implementation (the rounding, and maybe the sorting of the station strings), you need to run it on the same data file.
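A toy sketch of that "per-station min/max band" idea, using System.Random. The band ranges, station names, and uniform draws are all made up here; the official Java generator may well shape the values differently:

import System.Random (StdGen, mkStdGen, randomR)

-- A hypothetical climate band per station: (coldest, warmest) plausible reading.
type Band = (Double, Double)

-- Give each station a random band up front; a stand-in for deriving it from the CSV.
assignBands :: [String] -> StdGen -> ([(String, Band)], StdGen)
assignBands []     g = ([], g)
assignBands (s:ss) g =
  let (lo, g1)    = randomR (-40, 15) g    -- lower end of the band
      (width, g2) = randomR (20, 45) g1    -- spread between min and max
      (rest, g3)  = assignBands ss g2
  in  ((s, (lo, lo + width)) : rest, g3)

-- Draw one temperature uniformly inside the station's band.
drawTemp :: Band -> StdGen -> (Double, StdGen)
drawTemp = randomR

main :: IO ()
main = do
  let (bands, g) = assignBands ["Hamburg", "Bulawayo", "Palembang"] (mkStdGen 42)
      (t, _)     = drawTemp (snd (head bands)) g
  print bands
  print t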


I'm not sure about that. Memory isn't really operative here computationally, because all fast solutions will be streaming solutions. The big difference that memory makes is that in the actual benchmark the file is mounted on a RAM disk, which makes it lightning fast to read. Yet to do that yourself you'd need at least >15 GB of RAM. So I guess my point is, if we wanted to play about with this in Haskell, why not just skip the RAM requirement altogether and lazily evaluate the data directly into the solution.

I would expect that some of the fastest solutions would leverage parallelism and start iterating from different places in the file.
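A sketch of that chunk-splitting part, assuming the whole file is read as one strict ByteString (so this particular version still needs it to fit in RAM) and using parMap from the parallel package. The chunk count of 8 and the per-chunk "work" (a plain line count) are placeholders for the real aggregation:

import qualified Data.ByteString.Char8 as BC
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Cut the buffer into roughly n pieces, moving each cut forward to the next '\n'
-- so no record is torn in half and every worker starts on a record boundary.
chunks :: Int -> BC.ByteString -> [BC.ByteString]
chunks n bs = go bs
  where
    target = max 1 (BC.length bs `div` n)
    go b
      | BC.null b             = []
      | BC.length b <= target = [b]
      | otherwise =
          case BC.elemIndex '\n' (BC.drop target b) of
            Nothing -> [b]
            Just i  -> let (piece, rest) = BC.splitAt (target + i + 1) b
                       in  piece : go rest

main :: IO ()
main = do
  contents <- BC.readFile "measurements.txt"
  let counts = parMap rdeepseq (length . BC.lines) (chunks 8 contents)
  print (sum counts)   -- total number of rows; compile with -threaded to actually use the cores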

Well, you asked if it could be machine-agnostic. I don’t think it can be, but machines can be fixed, and that is good enough for a start. Same with the dataset, I don’t think you need an exact number of rows to compare solutions. 1/4 of a billion is still plenty.

The number of cores is probably the hardest part; I bet the best solutions assume that number. But still, if we don't have a solution that is competitive in non-ideal circumstances, then I doubt we have one that is competitive on the required specification.

Because the original challenge does take RAM and reading from disk out of the equation, by using a RAM disk and a machine with 128 GB of RAM.


I wrote a naive (= no mutable data structures) implementation, taking 106 seconds on my machine: 1brc.hs · GitHub

$ ghc-9.8 -threaded -rtsopts 1brc.hs && time ./1brc +RTS -s -N8 -A128M
 943,546,402,696 bytes allocated in the heap
   8,316,733,712 bytes copied during GC
      28,490,928 bytes maximum residency (6 sample(s))
      15,721,296 bytes maximum slop
           16199 MiB total memory in use (0 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       864 colls,   864 par   13.831s   2.104s     0.0024s    0.0222s
  Gen  1         6 colls,     5 par    0.064s   0.610s     0.1017s    0.5977s

  Parallel GC work balance: 75.73% (serial 0%, perfect 100%)

  TASKS: 18 (1 bound, 17 peak workers (17 total), using -N8)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.081s  (  0.081s elapsed)
  MUT     time  664.972s  (103.921s elapsed)
  GC      time   13.895s  (  2.714s elapsed)
  EXIT    time    0.003s  (  0.002s elapsed)
  Total   time  678.950s  (106.719s elapsed)

  Alloc rate    1,418,927,204 bytes per MUT second

  Productivity  97.9% of total user, 97.4% of total elapsed

./1brc +RTS -s -N8 -A128M  662.51s user 16.47s system 634% cpu 1:47.01 total

Profile suggests that Map becomes a bottleneck here.
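For what it's worth, one common next step here is to swap the Map for a mutable hash table keyed on the raw station bytes. A rough single-threaded sketch using the hashtables package; the simplistic read-based parsing and the file name are placeholders, and this is not the code from the gist above:

import qualified Data.ByteString.Char8 as BC
import qualified Data.HashTable.IO as H

-- Strict fields so no thunks accumulate behind the table: min, max, sum, count.
data Stats = Stats !Double !Double !Double !Int

type Table = H.BasicHashTable BC.ByteString Stats

upsert :: Table -> BC.ByteString -> Double -> IO ()
upsert tbl station t = do
  old <- H.lookup tbl station
  case old of
    Nothing                -> H.insert tbl station (Stats t t t 1)
    Just (Stats lo hi s n) -> H.insert tbl station
                                (Stats (min lo t) (max hi t) (s + t) (n + 1))

main :: IO ()
main = do
  tbl <- H.new :: IO Table
  ls  <- BC.lines <$> BC.readFile "measurements.txt"
  mapM_ (\line ->
           let (station, rest) = BC.break (== ';') line
               t = read (BC.unpack (BC.drop 1 rest)) :: Double   -- placeholder parser
           in  upsert tbl station t)
        ls
  results <- H.toList tbl
  mapM_ (\(st, Stats lo hi s n) ->
           putStrLn (BC.unpack st ++ ": " ++ show lo ++ "/"
                     ++ show (s / fromIntegral n) ++ "/" ++ show hi))
        results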
