Counting Words, but can we go faster?

I agree that the current implementation of the discrimination library is not fast enough. But I believe the fundamental idea holds: you can “serialize” any data structure as an array of integers (or a bitstring), and that array can be used for both ordering and grouping items relatively quickly.

My idea for an algorithm (a rough sketch follows the list):

  1. Convert each element to an unboxed array of integers.
  2. Keep a hash map (or a Patricia tree, when sorting) for each array length encountered.
    • These arrays can be stored unboxed and compared for equality quickly in the hash map.
  3. Optionally convert back, or store the original values in the hash map.
  4. Profit.
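
Roughly, steps 1–3 might look something like this (just a sketch with made-up names, skipping the per-length buckets and the unboxed key representation):

    import qualified Data.HashMap.Strict as HM

    -- Sketch only, not the library code. 'serialize' is the assumed step-1
    -- function; the integer array becomes the hash-map key, and the stored
    -- lists keep the original values around (step 3).
    groupBySerialized :: (a -> [Int]) -> [a] -> [[a]]
    groupBySerialized serialize =
      HM.elems . HM.fromListWith (++) . map (\x -> (serialize x, [x]))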

Okay, there might be something to it. Here are the current benchmarks:

| Name                         | Mean [ms] | Stddev [ms] |
|------------------------------|----------:|------------:|
| numbers/lengthBaseline       |      2.67 |        0.07 |
| kjvbible/lengthBaseline      |      3.96 |        0.08 |
| numbers/viaIntCounter        |      5.05 |        0.27 |
| numbers/viaUnboxedVectorHash |     22.03 |        1.19 |
| numbers/viaVectorHashMap     |     28.81 |        1.16 |
| numbers/viaFinite            |     45.27 |        2.03 |
| kjvbible/viaFinite           |     61.59 |        2.84 |
| kjvbible/viaVectorHashMap    |     68.81 |        2.8  |
| numbers/viaDiscrimination    |     85.74 |        4.55 |
| kjvbible/viaStrictHashMap    |     95.94 |        5.23 |
| numbers/viaStrictHashMap     |     97.46 |        6.79 |
| kjvbible/viaStrictMap        |    302.66 |        8.05 |
| numbers/viaIntMap            |    401.75 |       30.22 |
| kjvbible/viaDiscrimination   |    502.48 |       24.61 |
| numbers/viaStrictMap         |    545.15 |       15.31 |

Right now my wild viaFinite algorithm, which encodes the data into integers and writes them directly into the hash array, is the fastest of the ones I have been able to benchmark. (Sadly, I cannot get the general Counter from @jaror's implementation up and running.) I did get the IntCounter example up and running, and it is fast! My current version borrows a lot from it, but I could use some help optimizing it.
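
For the curious, the "write encoded integers straight into a hash array" idea has roughly this shape (a simplified illustration, not the actual Finite.hs code; names are made up, keys must be non-zero, and there is no resizing):

    import qualified Data.Vector.Unboxed.Mutable as MV
    import Data.Bits ((.&.))

    tableSize :: Int
    tableSize = 65536   -- a power of two, so (.&. (tableSize - 1)) acts as a modulus

    -- Linear probing: find the slot for 'key' and bump its count.
    -- 0 marks an empty slot; with no resizing, this loops if the table fills up.
    bump :: MV.IOVector Int -> MV.IOVector Int -> Int -> IO ()
    bump keys counts key = go (key .&. (tableSize - 1))
      where
        go i = do
          k <- MV.read keys i
          if k == key
            then MV.modify counts (+ 1) i
            else if k == 0
              then MV.write keys i key >> MV.write counts i 1
              else go ((i + 1) .&. (tableSize - 1))

    countInts :: [Int] -> IO (MV.IOVector Int, MV.IOVector Int)
    countInts xs = do
      keys   <- MV.replicate tableSize 0
      counts <- MV.replicate tableSize 0
      mapM_ (bump keys counts) xs
      pure (keys, counts)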


You can’t depend on clutter?

I see your repo uses ghc922; is it possible to get Finite.hs to compile on 8.10? I added some pragmas, but then I get

   • Couldn't match type ‘PrimState m’ with ‘PrimState m0’
      Expected type: MutableByteArray (PrimState m0)
        Actual type: MutableByteArray (PrimState m)
      NB: ‘PrimState’ is a non-injective type family
      The type variable ‘m0’ is ambiguous

(Very bleeding edge. Had to upgrade nix and add experimental flags =P)

You probably need ScopedTypeVariables, but I'll downgrade to 8.10.7 so that it is easier to compare.
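
Something along these lines is usually enough (a made-up example, not the actual Finite.hs code): with an explicit forall, the m from the signature scopes over local annotations, so GHC no longer invents an ambiguous m0.

    {-# LANGUAGE ScopedTypeVariables #-}
    module Sketch where

    import Control.Monad.Primitive (PrimMonad, PrimState)
    import Data.Primitive.ByteArray (MutableByteArray, newByteArray, writeByteArray)

    -- The explicit 'forall m' makes the annotation on 'arr' refer to the same
    -- 'm' as the signature, so 'PrimState m' is no longer ambiguous.
    fill :: forall m. PrimMonad m => Int -> m (MutableByteArray (PrimState m))
    fill n = do
      arr <- newByteArray (n * 8) :: m (MutableByteArray (PrimState m))
      writeByteArray arr 0 (0 :: Int)
      pure arr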

Okay, the newest version is online and runs on 8.10.7, and the viaFinite solution is 40% faster than the vector hashmap on my benchmarks :).


Cool! I tried adding Finite.hs to Comparing hs...hs-finite · unhammer/countwords · GitHub, but it seems to go into some loop (maybe in countOn group) when I test with $ stack build && stack exec BufwiseFiniteBS <../kjvbible_x10.txt – am I holding it wrong?

Yeah, like the clutter implementations, I have not implemented resizing. Instead, if I run out of space, I loop forever. Consider choosing 120k as the new count instead of 64k; it worked for me :slight_smile:

hehe ok, merged into https://github.com/unhammer/countwords/tree/hs, results updated at countwords/haskell at hs · unhammer/countwords · GitHub


Ah, I see that my approach is still 100 ms slower; maybe that can be won back by also using the IntCounter on small words? Impressive that Clutter still runs on text. Anyway, great work, everybody!

What helped for me was to index 64 bits at a time. So don't fold over the bytestring, but use lower-level indexing like this:

> x <- (\(B.BS ptr i) -> withForeignPtr ptr (\p -> peekElemOff (castPtr p :: Ptr Int) 0)) "hello world!"
> showHex x ""
"6f77206f6c6c6568"

But make sure that you apply a mask to the last element so that you don’t read more than the length of the word.
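
The masking could look roughly like this (an assumed helper, not code from any of the repos; it presumes little-endian reads):

    import Data.Bits (shiftL, (.&.))
    import Data.Word (Word64)

    -- Keep only the first 'validBytes' bytes of an 8-byte chunk. On a
    -- little-endian machine the first bytes are the low bits, so masking the
    -- low 8 * validBytes bits zeroes out whatever was read past the end of
    -- the word.
    maskLast :: Int -> Word64 -> Word64
    maskLast validBytes w
      | validBytes >= 8 = w
      | otherwise       = w .&. ((1 `shiftL` (8 * validBytes)) - 1)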


Awesome trick! I'm sure I'll add something like that eventually; right now I'm exploiting the fact that ASCII only uses the lower 7 bits to gain more compression.
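
As a rough illustration of that kind of packing (this is not the actual encoding in finite, just a sketch): nine 7-bit characters fit in 63 bits of a Word64, where whole bytes would only give you eight.

    import Data.Bits (shiftL, (.|.))
    import Data.Char (ord)
    import Data.Word (Word64)

    -- Illustration only: pack up to nine 7-bit ASCII codes into one Word64.
    pack7 :: String -> Word64
    pack7 = foldl step 0 . take 9
      where
        step acc c = (acc `shiftL` 7) .|. fromIntegral (ord c)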

Btw, I have refactored the Finite approach into its own library at https://github.com/kalhauge/finite. It is still very much a work in progress, but it helps with integrating with @unhammer's benchmarks.
