I agree that the current implementation of the discrimination library is not fast enough. But I believe the fundamental idea holds: you can “serialize” any data structure as an array of integers (or a bitstring), and that array can then be used for both ordering and grouping of items relatively quickly.
My idea for an algorithm:

- For each element, convert it to an unboxed array of integers.
- Keep a hashmap (a Patricia tree if you want sorting) for each array length encountered.
- These arrays can be stored unboxed and checked for equality quickly in the hashmap.
- Optionally convert back, or store the original values in the hashmap.
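The steps above can be sketched in a few lines. This is a minimal illustration, not the real implementation: it uses `Data.Map` keyed on a `[Int]` encoding as a stand-in for per-length unboxed hash tables, and `ToInts`/`groupVia` are names I made up for the sketch:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical encoder: anything we can serialize to a list of Ints.
class ToInts a where
  toInts :: a -> [Int]

instance ToInts Int where
  toInts n = [n]

instance ToInts Char where
  toInts c = [fromEnum c]

instance ToInts a => ToInts [a] where
  toInts = concatMap toInts

-- Group equal elements by their integer encoding.  A fast version
-- would use unboxed Int arrays and a hash table per encoding length;
-- a Map over [Int] only demonstrates the idea.
groupVia :: ToInts a => [a] -> [[a]]
groupVia xs =
  M.elems (M.fromListWith (flip (++)) [ (toInts x, [x]) | x <- xs ])
```

Because `Data.Map` orders by key, the groups also come out sorted by their encodings, which is the ordering property mentioned above.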
Okay, there might be something to it. Here are the current benchmarks:
| Name | Mean [ms] | Stddev [ms] |
|---|---:|---:|
| numbers/lengthBaseline | 2.67 | 0.07 |
| kjvbible/lengthBaseline | 3.96 | 0.08 |
| numbers/viaIntCounter | 5.05 | 0.27 |
| numbers/viaUnboxedVectorHash | 22.03 | 1.19 |
| numbers/viaVectorHashMap | 28.81 | 1.16 |
| numbers/viaFinite | 45.27 | 2.03 |
| kjvbible/viaFinite | 61.59 | 2.84 |
| kjvbible/viaVectorHashMap | 68.81 | 2.8 |
| numbers/viaDiscrimination | 85.74 | 4.55 |
| kjvbible/viaStrictHashMap | 95.94 | 5.23 |
| numbers/viaStrictHashMap | 97.46 | 6.79 |
| kjvbible/viaStrictMap | 302.66 | 8.05 |
| numbers/viaIntMap | 401.75 | 30.22 |
| kjvbible/viaDiscrimination | 502.48 | 24.61 |
| numbers/viaStrictMap | 545.15 | 15.31 |
Right now my wild `viaFinite` algorithm, which encodes the data into integers and writes those directly into the hash array, is the fastest of the ones I have been able to benchmark. (Sadly, I cannot get the general Counter implementation from @jaror up and running.) I did get the IntCounter example up and running, and it is fast! My current version borrows a lot from it, but I could use some help optimizing it.
I see your repo uses `ghc922`; is it possible to get `Finite.hs` to compile on 8.10? I added some pragmas, but then I get:
```
• Couldn't match type ‘PrimState m’ with ‘PrimState m0’
  Expected type: MutableByteArray (PrimState m0)
    Actual type: MutableByteArray (PrimState m)
  NB: ‘PrimState’ is a non-injective type family
  The type variable ‘m0’ is ambiguous
```
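For what it's worth, this kind of error usually goes away once the monad variable is brought into scope with `ScopedTypeVariables` and an explicit `forall`, so the intermediate value can be annotated with the outer `m` instead of GHC inventing a fresh, ambiguous `m0`. A toy illustration of the pattern, using a non-injective type family from `base` (`Item` from `GHC.Exts`) rather than the actual `Finite.hs` code:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}

import GHC.Exts (IsList (..))

-- 'Item' is a non-injective type family, like 'PrimState': knowing
-- 'Item l' does not determine 'l'.  The explicit 'forall l' brings
-- 'l' into scope so the annotation in the body can mention it;
-- without ScopedTypeVariables the annotation would introduce a
-- fresh ambiguous variable instead.
roundTrip :: forall l. IsList l => l -> l
roundTrip x = fromList (toList x :: [Item l])
```

Presumably the same pattern in `Finite.hs` would mean writing an explicit `forall m. PrimMonad m => ...` and annotating the buffer as `MutableByteArray (PrimState m)`.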
(Very bleeding edge. Had to upgrade nix and add experimental flags =P)
Cool! I tried adding `Finite.hs` to Comparing hs...hs-finite · unhammer/countwords · GitHub, but it seems to go into some loop (maybe in `countOn group`) when I test it like `$ stack build && stack exec BufwiseFiniteBS <../kjvbible_x10.txt`. Am I holding it wrong?
Yeah, like the Clutter implementations, I have not implemented resizing; instead, if I run out of space, I loop forever. Consider choosing 120k as the count instead of 64k; that worked for me.
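To make that failure mode concrete, here is a toy fixed-size linear-probing counter (my own sketch, not the `Finite.hs` code): when every slot is taken and a new key arrives, the probe loop never finds a free slot and spins forever. Raising the capacity (from 64k to 120k) just keeps that point beyond the input's number of distinct keys.

```haskell
import Data.Array.IO (IOArray, newArray, readArray, writeArray)

capacity :: Int
capacity = 8  -- fixed size: no resizing, ever

-- Linear-probing insert into a fixed table of (key, count) slots.
-- Key 0 marks an empty slot, so real keys must be nonzero.
-- If the table is full and the key is absent, 'go' cycles through
-- occupied slots forever -- exactly the loop described above.
insertCount :: IOArray Int (Int, Int) -> Int -> IO ()
insertCount table key = go (key `mod` capacity)
  where
    go i = do
      (k, c) <- readArray table i
      if k == 0 || k == key
        then writeArray table i (key, c + 1)
        else go ((i + 1) `mod` capacity)
```

A fresh table would be created with `newArray (0, capacity - 1) (0, 0)`.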
Ah, I see that my approach is still 100 ms slower; maybe that gap can be closed by also using the IntCounter on small words? Impressive that Clutter still runs on text. Anyway, great work, everybody!
Awesome trick! I’m sure I’ll add something like that eventually; right now I’m exploiting the fact that ASCII only uses the low 7 bits to gain more compression:
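(For context, a sketch of the kind of 7-bit packing being described; `packAscii`/`unpackAscii` are hypothetical names I chose, not the code from the post.)

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word64)

-- Pack up to 9 ASCII characters into one Word64: 7 bits per
-- character, 9 * 7 = 63 bits, leaving the top bit free as a tag.
packAscii :: String -> Word64
packAscii =
  foldl (\acc c -> (acc `shiftL` 7) .|. fromIntegral (ord c .&. 0x7f)) 0

-- Recover the original n characters, most significant group first.
unpackAscii :: Int -> Word64 -> String
unpackAscii n w =
  [ chr (fromIntegral ((w `shiftR` (7 * i)) .&. 0x7f))
  | i <- [n - 1, n - 2 .. 0] ]
```

Packing words of up to 9 characters into a single machine word means short keys can be hashed and compared as plain `Word64`s, with no array indirection at all.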