One Billion Row Challenge in Haskell

Of course. Way better now! 36s, and your version is now the fastest one (about 1s faster than the one by András)!

As I’ve said, the “official” timing happens sometime in the evening (CET) and should yield better values.

@jaror single threaded version, the fastest one so far: 36s

% hyperfine -w 1 -r 5  './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     35.787 s ±  0.046 s    [User: 32.847 s, System: 2.822 s]
  Range (min … max):   35.736 s … 35.836 s    5 runs

@AndrasKovacs with the second fastest single threaded version by a small margin: 37s

% hyperfine -w 1 -r 5  './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     36.849 s ±  0.102 s    [User: 32.488 s, System: 1.886 s]
  Range (min … max):   36.704 s … 36.967 s    5 runs

Nice, I don’t run out of memory with the mmap-ed version. I added a “short-string optimization” on top of it and get a very nice speedup from that. Now it’s 23 seconds on my machine.
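
For readers following along: “short-string optimization” here means keeping short station names inline as machine words instead of as heap-allocated strings, so a key comparison becomes one or two word compares rather than a byte-by-byte memcmp. A minimal sketch of the idea; the 16-byte limit and all names are assumptions, not the actual code:

import Data.Bits (shiftL, (.|.))
import Data.Word (Word64, Word8)

-- A key of at most 16 bytes packed into two words; comparing two keys is
-- then just two word comparisons instead of a byte-by-byte memcmp.
data ShortKey = ShortKey !Word64 !Word64
  deriving (Eq, Show)

-- Pack up to 16 bytes into a ShortKey (zero-padded at the top).
packKey :: [Word8] -> ShortKey
packKey bs = ShortKey (pack lo) (pack hi)
  where
    (lo, hi) = splitAt 8 bs
    pack = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

A real implementation would read the two words straight out of the mmapped buffer and fall back to a slow path for longer names.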


Here is a naive implementation of my left-child, right-sibling binary tree idea from above, in Fortran. On a 2.8GHz laptop, it completes 1B rows single threaded in 2m8s: less than 4x slower than the best version reported by @ReleaseCandidate. Other than a custom float parser, it is currently naive. When I replace the current linear scan for siblings with an indexed array, I expect a 5-10x improvement. Once the concept is proven, I’ll give it a shot in Haskell, but I think it’s a good baseline for a structure I haven’t seen exploited in any other solution.
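
For anyone who missed the earlier description: left-child, right-sibling turns a tree with arbitrary branching into a binary tree by giving each node exactly two links. A schematic Haskell rendering of the structure (my illustration, not a translation of the Fortran):

-- Each node stores its first child and its next sibling, so any branching
-- factor costs exactly two references per node.
data LCRS k v
  = Leaf
  | Node k v (LCRS k v) -- first child
             (LCRS k v) -- next sibling

-- The naive linear scan over one level's siblings that the indexed-array
-- variant is meant to replace.
findSibling :: Eq k => k -> LCRS k v -> Maybe v
findSibling _ Leaf = Nothing
findSibling k (Node k' v _ sib)
  | k == k'   = Just v
  | otherwise = findSibling k sib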

Running this twice results in the second run only taking 15.407s on my machine. The first run took 18.160s.


This version now runs in 29s on my computer, 8s less than the previous version. And RAM usage went down to 60MB (-15GB).

% hyperfine -w 1 -r 5  './exe-exe > ../solution.txt'
Benchmark 1: ./exe-exe > ../solution.txt
  Time (mean ± σ):     28.535 s ±  0.023 s    [User: 25.609 s, System: 2.818 s]
  Range (min … max):   28.505 s … 28.556 s    5 runs

I made some modest changes, namely added a more “vectorized” newline search, and doubled the table size. Now it’s 22.5 secs for me, so I gained just 1.5 secs. I also experimented with other random changes, but apparently we’re getting into the territory where it’s hard to predict performance impact even by looking at the Core.
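
For the curious: a common way to “vectorize” a newline search without real SIMD instructions is the SWAR zero-byte trick on a Word64. I can’t say this exact variant is what was used, so treat it as an illustration:

import Data.Bits (complement, xor, (.&.))
import Data.Word (Word64)

-- After XORing with eight copies of '\n' (0x0A), any byte that was a
-- newline becomes zero, and the classic zero-byte test
-- (x - 0x01..01) .&. complement x .&. 0x80..80 flags it.
hasNewline :: Word64 -> Bool
hasNewline w =
  ((x - 0x0101010101010101) .&. complement x .&. 0x8080808080808080) /= 0
  where
    x = w `xor` 0x0A0A0A0A0A0A0A0A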

I find it very suspicious that we apparently get no benefit from fitting the hashtable into L2, and instead the best choice is to make it large, while still fitting in L3. Normally this would indicate that scanning in table insertion is expensive, but I see nothing especially expensive in the core.


It’d be wonderful if some kind soul would write this up in a blog post after the fact, for posterity. The code so far has had a lot of thought go into it, and its gems may be worth distilling and preserving.


Yes, even knowing in theory that the hash map is the most important thing isn’t the same as experiencing a 2.6x speedup (from 98s down to 38s) just by replacing it.

Oh, I’d really like to do something again in Fortran; I still have a (physical!) copy of Fortran 95/2003 Explained by Metcalf and Reid somewhere. Any language should have such a detailed and helpful explanation of its current standard (even if most languages don’t have an “official” standard).


I would probably need to profile whether parsing is actually the bottleneck, but have you tried replacing the 5 unsafeIndexes with one 4-byte aligned read of a Word64 and then shifting stuff around? Perhaps the LLVM backend will take care of this for you; on the other hand, that’s a bit unlikely through all those withForeignPtr abstractions. Otherwise, 5 separate reads seems a bit expensive.
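
To make that concrete, the shape of the suggestion is roughly this (names are made up; it assumes a little-endian machine, that reading 8 bytes past the parse position is safe, and it glosses over the fact that peek formally wants an aligned pointer):

import Data.Bits (shiftR, (.&.))
import Data.Word (Word64, Word8)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (peek)

-- One 8-byte read covering the whole "12.3\n" span...
readWord :: Ptr Word8 -> IO Word64
readWord p = peek (castPtr p)

-- ...then the individual bytes fall out with shifts and masks.
byteAt :: Word64 -> Int -> Word8
byteAt w i = fromIntegral ((w `shiftR` (8 * i)) .&. 0xFF)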

Also perhaps there is a large number of clashes in your hash table, resulting in a non-negligible number of go iterations? Unlikely, but perhaps worth checking.

If parsing is a bottleneck now, then we might want to explore using rank-select bit strings: Haskell Works Blog - Introduction to the rank-select bit-string

There’s a whole series of blog posts including CSV parsing examples which are similar to what we’re parsing: Haskell Works Blog - Archives

That slowed it down from 15.375s to 17.676s for me. And doesn’t lines already use a fast SIMD memchr?

Also, if I remove the hash table insertion from the program (but still make sure parseLine runs) the running time is 9.830s for me, so it seems more than 60% of the time is spent on parsing. If I also remove parseLine then it takes only about 5 seconds.

It’s good to know about the parsing time!
For me the custom newline search was a small but reliable improvement. Memchr is fast indeed, but our searches are very short and I thought that the out-of-line C call could eat up the bonus.

A couple of benchmarking notes for those comparing to the Danny van Kooten version:

  1. His processor – an AMD Ryzen 4800U with 16 GB of RAM – can run single threaded at 4.2GHz (mine does 2.8GHz). Further, it’s an 8-core processor: 16 threads with hyperthreading.

  2. Assuming the effect of mmap is a constant (it’s just read speed, so it should be), his best single threaded version is about 20 - (6 - 1.6) = 15.6s.

  3. His input file is 12GB. Mine is 15.5GB, about 30% bigger.

@ReleaseCandidate – what are the file size, processor speed, and RAM of the computer you’re using? Also, is it running off an encrypted disk or a vanilla one? We might already be beating the above, given like-for-like conditions.

I’ve already looked at Danny van Kooten’s version and I would say it isn’t optimised much at all (except for the parallel reading and a trick with forking and un-mapping the data file). I could not get it to run successfully even without the forking and piping; I always got a segfault in one of the worker threads.

I quickly made a single threaded version out of it to get some baseline comparison (and fixed the segfault).

C code
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_DISTINCT_GROUPS    10000
#define MAX_GROUPBY_KEY_LENGTH 100

// Capacity of our hashmap
// Needs to be a power of 2
// so we can bit-and instead of modulo
#define HASHMAP_CAPACITY 16384
#define HASHMAP_INDEX(h) (h & (HASHMAP_CAPACITY - 1))

struct Group
{
    unsigned int count;
    int min;
    int max;
    long sum;
    char key[MAX_GROUPBY_KEY_LENGTH];
};

struct Result
{
    unsigned int n;
    unsigned int map[HASHMAP_CAPACITY];
    struct Group groups[MAX_DISTINCT_GROUPS];
};

// parses a floating point number as an integer
// this is only possible because we know our data file has only a single decimal
static inline const char *parse_number(int *dest, const char *s)
{
    // parse sign
    int mod = 1;
    if (*s == '-')
    {
        mod = -1;
        s++;
    }

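    // one-digit case "X.Y": s[0] and s[2] are ASCII digits, and
    // s[0]*10 + s[2] - '0'*11 == 10*(s[0]-'0') + (s[2]-'0')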
    if (s[1] == '.')
    {
        *dest = ((s[0] * 10) + s[2] - ('0' * 11)) * mod;
        return s + 4;
    }

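    // two-digit case "XY.Z": s[0]*100 + s[1]*10 + s[3] - '0'*111
    // == 100*(s[0]-'0') + 10*(s[1]-'0') + (s[3]-'0')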
    *dest = (s[0] * 100 + s[1] * 10 + s[3] - ('0' * 111)) * mod;
    return s + 5;
}

static inline int min(int a, int b)
{
    return a < b ? a : b;
}
static inline int max(int a, int b)
{
    return a > b ? a : b;
}

// qsort callback
static inline int cmp(const void *ptr_a, const void *ptr_b)
{
    return strcmp(((struct Group *)ptr_a)->key, ((struct Group *)ptr_b)->key);
}

static struct Result *process_chunk(char *data, size_t data_size)
{
    // initialize result
    struct Result *result = malloc(sizeof(*result));
    if (!result)
    {
        perror("malloc error");
        exit(EXIT_FAILURE);
    }
    result->n = 0;

    // we could do this in a single call to memset
    // since the two are contiguous in memory
    // but this code is only called once here
    // so it is not really worth it
    memset(result->map, 0, HASHMAP_CAPACITY * sizeof(*result->map));
    memset(result->groups, 0, MAX_DISTINCT_GROUPS * sizeof(*result->groups));

    const char *s            = data;
    const char *end          = &data[data_size - 1];
    const char *last_but_one = end - 1;

    // flaming hot loop
    while (s < last_but_one)
    {
        const char *linestart = s;

        // find position of ;
        // while simultaneously hashing everything up to that point
        unsigned int len = 0;
        unsigned int h   = 0;
        while (s[len] != ';')
        {
            h = (h * 31) + (unsigned char)s[len];
            len += 1;
        }

        // parse decimal number as int
        int temperature;
        s = parse_number(&temperature, linestart + len + 1);

        // probe map until free spot or match
        // (the map stores group index + 1 so that 0 can mean "empty")
        unsigned int *c = &result->map[HASHMAP_INDEX(h)];
        while (*c > 0
               && memcmp(result->groups[*c - 1].key, linestart, len) != 0)
        {
            h += 1;
            c = &result->map[HASHMAP_INDEX(h)];
        }

        // new hashmap entry
        if (*c == 0)
        {
            *c = result->n + 1;
            memcpy(result->groups[result->n].key, linestart, len);
            // seed min/max from the first observed value, not from 0
            result->groups[result->n].min = temperature;
            result->groups[result->n].max = temperature;
            result->n++;
        }

        // update the entry
        struct Group *g = &result->groups[*c - 1];
        g->count += 1;
        g->min = min(g->min, temperature);
        g->max = max(g->max, temperature);
        g->sum += temperature;
    }

    return result;
}

static void result_to_str(char *dest, const struct Result *result)
{
    char buf[128];
    *dest++ = '{';
    for (unsigned int i = 0; i < result->n; i++)
    {
        size_t n = (size_t)snprintf(
            buf, 128, "%s=%.1f/%.1f/%.1f%s", result->groups[i].key,
            (float)result->groups[i].min / 10.0,
            ((float)result->groups[i].sum / (float)result->groups[i].count)
                / 10.0,
            (float)result->groups[i].max / 10.0,
            i < (result->n - 1) ? ", " : "");

        memcpy(dest, buf, n);
        dest += n;
    }
    *dest++ = '}';
    *dest   = 0x0;
}

int main(int argc, char **argv)
{
    char *file = "../measurements_big.txt";
    if (argc > 1)
    {
        file = argv[1];
    }

    int fd = open(file, O_RDONLY);
    if (fd == -1) // open() returns -1 on error; 0 is a valid descriptor
    {
        perror("error opening file");
        exit(EXIT_FAILURE);
    }

    struct stat sb;
    if (fstat(fd, &sb) == -1)
    {
        perror("error getting file size");
        exit(EXIT_FAILURE);
    }

    // mmap entire file into memory
    size_t sz  = (size_t)sb.st_size;
    char *data = mmap(NULL, sz, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED)
    {
        perror("error mmapping file");
        exit(EXIT_FAILURE);
    }

    struct Result *result = process_chunk(data, sz);

    // sort results alphabetically
    qsort(result->groups, (size_t)result->n, sizeof(*result->groups), cmp);

    char *buf = malloc(240551);
    result_to_str(buf, result);
    puts(buf);

    // clean-up
    munmap((void *)data, sz);
    close(fd);

    return EXIT_SUCCESS;
}

I compiled it using his flags:

clang -std=c11 -O3 -march=native -mtune=native -flto -Wall -Wextra -Wpedantic -Wformat=2 -Wconversion -Wundef -Winline -Wimplicit-fallthrough test.c -o testc 

The result contains negative zeroes (-0.0, from slightly negative averages that %.1f rounds to zero) and a stray , at the end, but is otherwise correct.
The benchmark: runtime is 27.5s, needs 8MB of RAM.

% hyperfine -r 5 -w 1 './testc > solution.txt'
Benchmark 1: ./testc > solution.txt
  Time (mean ± σ):     27.489 s ±  0.027 s    [User: 25.152 s, System: 2.271 s]
  Range (min … max):   27.452 s … 27.521 s    5 runs

My data

Apple M1 Max
10 cores: 8 performance, 2 efficiency
Max. clock 3206 MHz (it did run at that frequency, I checked)
L3 cache 48MB
SSD with a measured speed of 4629.7MB/s write, 5180.3MB/s read

My data file is 14.9GB.

Btw., looking for a Fortran version I found GitHub - sshestov/1brc-fortran, which (because of too many hash collisions) returns only 8350 instead of 8939 stations and runs in 3.5s using 16 threads. That’s why I always check the results :wink:

If somebody has an AVX2 CPU on Linux: this should be the fastest C solution (if the title is true).


That takes 0.7 seconds on my machine with 6 workers, very impressive.

Yeah, sshestov and I seem to be the only people with Fortran solutions at the moment :slight_smile: My latest hash + mmap version is clocking in at 38s (single threaded). I’ve tried to hand-code some SIMD stuff, but it’s difficult to beat the intrinsic functions. I’ll make it parallel soon and put it on a decent AWS machine for a fair go.

I’ve played around with it a bit. Looks like there’s a GHC RTS limitation. When I run -N8 with 8 threads (no allocations except at the end), it somehow blocks I/O in such a way that the program cannot use more than 400-600% CPU. If I add yield it can help with one or two read threads, but not more. If I use -N9 (ideally nProcessThreads + nReadThreads) then it works well, but indeed uses more than 800% CPU, making it an incorrect benchmark.

I don’t know yet what is the source of the RTS behavior, so I just replaced parallel read threads with a simple reading in the same thread.

I also found that mmap works worse in a parallel scenario than parallel hGetBuf (perhaps too many conflicting page table mappings), while it’s slightly better in the sequential case.
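
For context, the hGetBuf route amounts to streaming the file through one reused buffer; a minimal sketch (the chunk size and names are my own, and it ignores the partial line left at each chunk boundary, which real code must carry over to the next chunk):

import Control.Monad (unless)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Ptr (Ptr)
import System.IO (IOMode (ReadMode), hGetBuf, withBinaryFile)

chunkSize :: Int
chunkSize = 1024 * 1024 -- 1 MB, an arbitrary choice

-- Read the file chunk by chunk into one buffer and hand each filled
-- prefix to a processing callback.
withChunks :: FilePath -> (Ptr Word8 -> Int -> IO ()) -> IO ()
withChunks path process =
  withBinaryFile path ReadMode $ \h ->
    allocaBytes chunkSize $ \buf ->
      let go = do
            n <- hGetBuf h buf chunkSize
            unless (n == 0) (process buf n >> go)
      in go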

The challenge seems to be to make as few memory accesses as possible, so I replaced the array-based hash table with a flat C-like blob (the same way others did, but using plain Foreign.Ptr.peek/poke without any intrinsics).
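
To illustrate the flat-blob idea (a schematic with a made-up slot layout, not the gist’s actual one): every slot lives in one zero-initialised allocation, and each field is addressed as slot * slotSize + fieldOffset with peek/poke:

import Data.Word (Word64)
import Foreign.Marshal.Alloc (callocBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

slotSize, numSlots, countOff :: Int
slotSize = 32      -- hypothetical layout: key hash, count, sum, min/max
numSlots = 65536
countOff = 8       -- byte offset of the count field within a slot

-- One flat, zero-initialised allocation for the whole table.
newTable :: IO (Ptr ())
newTable = callocBytes (slotSize * numSlots)

-- Increment the count field of one slot with a raw read-modify-write.
bumpCount :: Ptr () -> Int -> IO ()
bumpCount tbl slot = do
  let off = slot * slotSize + countOff
  c <- peekByteOff tbl off :: IO Word64
  pokeByteOff tbl off (c + 1)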

I have updated my gist. It runs faster than @jaror’s version, but slower than yours (and probably @AndrasKovacs’) because it doesn’t have any short-string optimisations (BTW, your version speeds up well with a -fllvm flag).

In the multi-threaded variant it’s slightly slower than the 7.c version (but faster with 24 threads). So:

  • a safe no-brainer version (an unboxed array for the counters; see the sketch after this list) runs 2-4 times slower than C.
  • an unsafe C-like version built on Foreign.Ptr runs about 1-1.25 times slower than C.
  • Data.Primitive looks safer, but requires digging into intrinsics and might not always give better performance than Ptr.peek/poke.
  • -fllvm can help.
  • minimizing memory accesses and any computational overhead matters more than any smart algorithm in this case.
  • mmap might not always be the best.
  • there are some issues in the GHC RTS with tight non-allocating loops + I/O.
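
And the “safe no-brainer” end of that spectrum, for contrast (again schematic; the slot count is arbitrary): bounds-checked unboxed counters with no raw pointers anywhere:

import Data.Array.IO (IOUArray, newArray, readArray, writeArray)
import Data.Word (Word64)

-- A bounds-checked unboxed array of per-slot counters.
newCounters :: IO (IOUArray Int Word64)
newCounters = newArray (0, 65535) 0

bump :: IOUArray Int Word64 -> Int -> IO ()
bump arr slot = readArray arr slot >>= writeArray arr slot . (+ 1)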

AVX is cool, but I don’t think it’s doable in Haskell (that’s why we have the FFI).


That’s interesting, because on my computer I had to remove it to gain about 300ms. By adding some bangs I shaved off another 1.5s, but nothing significant or interesting.

Talking about SIMD and auto-vectorisation of unrolled loops: does GHC even have the concept of loop unrolling?

Your version needs 103MB of RAM and uses 7 CPUs: 23s.

hyperfine -w 1 -r 5  './exe-exe > solution.txt'
Benchmark 1: ./exe-exe > solution.txt
  Time (mean ± σ):     22.919 s ±  0.027 s    [User: 153.921 s, System: 7.697 s]
  Range (min … max):   22.897 s … 22.962 s    5 runs

Parallel version
Using 10 threads (the actual number of physical cores) yielded the best result: 6s, using 25MB of RAM.

gfortran -fopenmp -march=native -ffast-math -O2 -o testf test.f90
hyperfine -w 1 -r 5  './testf > solution.txt'
Benchmark 1: ./testf > solution.txt
  Time (mean ± σ):      5.911 s ±  0.023 s    [User: 54.546 s, System: 1.320 s]
  Range (min … max):    5.881 s …  5.938 s    5 runs

22s total / 154s user times don’t look right. Have you compiled with -O2? (I’ve added it to the gist to avoid such issues in the future.)

On my machine it was about a 15s -> 13s difference in total time. A very nice boost.