I have created a mini benchmark for counting; so far vector-hashtables is still the winner. But maybe some of @jaror's ideas with mixed arrays or some better algorithm can improve on this.
I got some suggestions on the mailing list to use compact regions, so I did and it turns out quite a bit faster together with my idea of storing small words in a separate table. I’ve pushed my hash table implementation to GitHub here: https://github.com/noughtmare/clutter (it is still a bit crude).
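For reference, the compact regions API I mean is the one in GHC's ghc-compact package. A minimal usage sketch (not the actual clutter code, just to show the idea):

```haskell
import GHC.Compact (compact, compactAdd, getCompact)

-- Values copied into a compact region are fully evaluated and placed in
-- a block the garbage collector never traverses again, which helps when
-- a table holds many long-lived keys.
demo :: IO ()
demo = do
  region <- compact ()                            -- create an empty region
  key    <- compactAdd region ("hello" :: String) -- copy a value into it
  putStrLn (getCompact key)                       -- read the copied value back
```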
Using that library and the following code (and the faster implementation of toLower for text):
```haskell
{-# LANGUAGE OverloadedStrings #-}
module CountWords where

import Control.Monad (unless)
import Data.Foldable
import Data.List (sortOn)
import Data.Ord (Down (..))
import qualified Data.Text as T
import qualified Data.Text.IO as T
import System.IO (isEOF)

import qualified TextCounter as C

unlessM :: Monad m => m Bool -> m () -> m ()
unlessM m1 m2 = m1 >>= \b -> unless b m2

runCountWords :: IO ()
runCountWords = do
  t <- C.new 60000
  let go = unlessM isEOF $ do
        wrds <- T.words . T.toLower <$> T.getLine
        traverse_ (C.count t) wrds
        go
  go
  xs <- sortOn (Down . snd) <$> C.toList t
  traverse_ (\(w, i) -> T.putStrLn (w `T.append` " " `T.append` T.pack (show i))) xs
```
I get these results:
Language | Simple | Optimized | Notes |
---|---|---|---|
Haskell | | 0.89 | by Jaro Reinders |
Python | 1.80 | 1.08 | |
At this point only about 30% of the time is spent on inserting things in the hash table, so I guess I should try to optimize the other stuff next.
By the way @kalhauge, did you see that @unhammer already made a benchmark repo with a bunch of extra implementations?
Nice @jaror! How much of the performance do you think is due to the compact regions approach, and how much is due to the small words table?
Yes, I did see @unhammer's repository and I think it should be the primary benchmarking repo. I hope my repo can serve as a supplement of microbenchmarks so that we can make progress on each subpart: which counting method is fastest vs. which data-reading approach is fastest.
The mixed boxed/unboxed table reduces the total time by about 25% compared to the vector-hashtables, but my hash table has a different design so that’s not really apples-to-apples. Then the small words table reduces the total time by another 18%. But those numbers include a baseline time for reading the input and doing the text operations, so the actual impact on insertion speed is probably even larger, if I have my mental maths right.
(I should also say that I’m running the benchmarks on a laptop that is not charging now, but I’m comparing the speeds relative to the Python implementation, so I hope that won’t matter so much.)
Ah, I didn’t see that you had microbenchmarks for just the counting.
We have a new winner in the combination of @kalhauge 's hGet+span loop and @jaror 's clutter library: countwords/haskell at hs · unhammer/countwords · GitHub
I’m testing on ghc ~~8.10.4~~ 8.10.7 (stackage lts 18.18), I’m guessing you used a newer ghc/base @jaror ?
Pretty cool that my `TextCounter` is outperforming a solution using `ByteString`, as that should have much lower I/O, `toLower`, and `words` overhead. I guess I’ll have to make a `ByteStringCounter` too.
As for the GHC version, I’m using mainly 8.10.7.
Thanks @kalhauge, this is awesome. I am Adrien Glauser, the guy who offered the initial implementation with the inefficiencies you mention, and I’d be happy to land a PR improving it with your code and crediting all contributors (even if the repository owner does not accept it, at least we’ll have tried!)
Also, on a personal note: given that the repository’s stated aim is to display idiomatic code, I much prefer the first version mentioned above. The unboxed version will not make much sense to someone comparing between languages.
Great!! Now let’s go even faster!
Thanks @Nycticorax as well, you started this journey and essentially helped @jaror find an inefficiency in the Text library! And maybe we can come up with an idiomatic way of counting words in Haskell!
I have looked at your code, and what you are essentially doing is grouping (discriminating) the text into a list of `Word64`s (see an example for bytestrings here: https://github.com/kalhauge/counton/blob/main/benchmark/Main.hs#L32), and then doing something different if there is only one integer in the list of integers.
Maybe that can serve as an interface to the counting? Currently the original library is not super fast on strings in my benchmarking, but maybe we can improve that?
I don’t really agree. The reason I’m converting to an int is that it can then be unboxed. That means that we can just store the int in memory instead of a pointer to the int. Lists are even worse, lists of ints (or Word64s) are implemented as pointers to pointers to the number (adding a layer of indirection).
For this particular problem it is very important to reduce these layers of indirection. I don’t think there is really any way to do that with the discrimination library.
That’s also why the `IORef` or `MutVar` approaches are performing worse: these mutable structures introduce another layer of indirection.
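Just to illustrate the layout difference (a schematic example, not code from any of the benchmarks):

```haskell
module LayoutDemo where

import qualified Data.Vector as V          -- boxed vectors
import qualified Data.Vector.Unboxed as VU -- unboxed vectors

-- Each element is a pointer to a separately allocated Int closure, so
-- reading a count means chasing an extra pointer.
boxedCounts :: V.Vector Int
boxedCounts = V.fromList [1, 2, 3]

-- The Ints are stored contiguously as raw machine words, so a read is a
-- single indexed memory access.
unboxedCounts :: VU.Vector Int
unboxedCounts = VU.fromList [1, 2, 3]
```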
I agree that the current implementation of the discrimination library is not fast enough. But I believe the fundamental idea is that you can “serialize” any data structure as an array of integers (or a bitstring), and this array can be used for both ordering and grouping of items relatively quickly.
My idea for an algorithm (a rough sketch follows the list):
- For each element convert to an unboxed array of integers.
- Keep a hashmap (patricia tree for sorting) for each size of list encountered.
- These arrays can be stored unboxed and checked for equality fast in the hashmap.
- Optionally convert back or have original values stored in hashmap.
- profit
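Very roughly, the sketch I have in mind looks something like this (hypothetical names, untested, and not the code in counton; a real version would use a mutable hash table with unboxed keys instead of `Data.Map`):

```haskell
module FiniteSketch where

import Data.Bits (shiftL, (.|.))
import qualified Data.ByteString as B
import qualified Data.IntMap.Strict as IM
import qualified Data.Map.Strict as M
import qualified Data.Vector.Unboxed as VU
import Data.Word (Word64, Word8)

-- Pack a ByteString into little-endian Word64 chunks (8 bytes per
-- chunk, zero padded at the end).
encode :: B.ByteString -> VU.Vector Word64
encode = VU.fromList . go . B.unpack
  where
    go [] = []
    go xs = let (chunk, rest) = splitAt 8 xs in pack chunk : go rest
    pack :: [Word8] -> Word64
    pack = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- One table per encoded length (the outer IntMap plays the role of the
-- patricia tree), so keys within a table always have the same width.
type Tables = IM.IntMap (M.Map (VU.Vector Word64) Int)

-- Count one word: pick the table for this key's length and bump it.
count :: B.ByteString -> Tables -> Tables
count w = IM.insertWith (M.unionWith (+)) (VU.length key) (M.singleton key 1)
  where
    key = encode w
```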
Okay, there might be something to it. Here are the current benchmarks:
Name | Mean [ms] | Stddev [ms] |
---|---|---|
numbers/lengthBaseline | 2.67 | 0.07 |
kjvbible/lengthBaseline | 3.96 | 0.08 |
numbers/viaIntCounter | 5.05 | 0.27 |
numbers/viaUnboxedVectorHash | 22.03 | 1.19 |
numbers/viaVectorHashMap | 28.81 | 1.16 |
numbers/viaFinite | 45.27 | 2.03 |
kjvbible/viaFinite | 61.59 | 2.84 |
kjvbible/viaVectorHashMap | 68.81 | 2.8 |
numbers/viaDiscrimination | 85.74 | 4.55 |
kjvbible/viaStrictHashMap | 95.94 | 5.23 |
numbers/viaStrictHashMap | 97.46 | 6.79 |
kjvbible/viaStrictMap | 302.66 | 8.05 |
numbers/viaIntMap | 401.75 | 30.22 |
kjvbible/viaDiscrimination | 502.48 | 24.61 |
numbers/viaStrictMap | 545.15 | 15.31 |
Right now my wild viaFinite algorithm, which encodes the data into integers and writes those directly into the hash array, is the fastest of the ones I have been able to benchmark. (Sadly I cannot get the general Counter from @jaror's implementation up and running.) I did get the IntCounter example up and running and it is fast! My current version borrows a lot from it, but I could use some help optimizing it.
You can’t depend on clutter?
I see your repo uses ghc922, is it possible to get Finite.hs to compile on 8.10? I added some pragmas, but then I get
```
• Couldn't match type ‘PrimState m’ with ‘PrimState m0’
  Expected type: MutableByteArray (PrimState m0)
    Actual type: MutableByteArray (PrimState m)
  NB: ‘PrimState’ is a non-injective type family
  The type variable ‘m0’ is ambiguous
```
(Very bleeding edge. Had to upgrade nix and add experimental flags =P)
You probably need ScopedTypeVariables, but I’ll downgrade a version to 8.10.7 so that it is easier to compare.
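Something like this is what I mean (a made-up example, not the actual Finite.hs code): with ScopedTypeVariables and an explicit forall, an inner type signature can refer to the same `m` as the outer one; without it GHC invents a fresh `m0` and then can’t match `PrimState m0` with `PrimState m`.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module ScopedDemo where

import Control.Monad.Primitive (PrimMonad, PrimState)
import Data.Primitive.ByteArray (MutableByteArray, newByteArray, writeByteArray)

-- The explicit forall brings `m` into scope, so the signature on `set`
-- refers to the same monad and the PrimState applications line up.
zeroed :: forall m. PrimMonad m => Int -> m (MutableByteArray (PrimState m))
zeroed n = do
  arr <- newByteArray (n * 8)       -- room for n Ints (8 bytes each)
  let set :: Int -> m ()
      set i = writeByteArray arr i (0 :: Int)
  mapM_ set [0 .. n - 1]
  pure arr
```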
Okay, the newest version is online and runs on 8.10.7, and the viaFinite solution is 40% faster than the vector hashmap on my benchmarks :).
Cool! I tried adding Finite.hs to Comparing hs...hs-finite · unhammer/countwords · GitHub but it seems to go into some loop (maybe in `countOn group`) when I test like `$ stack build && stack exec BufwiseFiniteBS <../kjvbible_x10.txt` – am I holding it wrong?
Yeah, like the clutter implementations I have not implemented resizing. Instead, if I run out of space I loop forever. Try choosing 120k as the count instead of 64k; that worked for me.
hehe ok, merged into https://github.com/unhammer/countwords/tree/hs , results updated at countwords/haskell at hs · unhammer/countwords · GitHub
Ah, I see that my approach is still 100 ms slower; maybe that can be won back by also using the IntCounter on small words? Impressive that clutter is still this fast running on Text. Anyway, great work everybody!
What helped for me was to index 64 bits at a time. So don’t fold over the bytestring, but use lower level indexing like this:
```
> x <- (\(B.BS ptr i) -> withForeignPtr ptr (\p -> peekElemOff (castPtr p :: Ptr Int) 0)) "hello world!"
> showHex x ""
"6f77206f6c6c6568"
```
But make sure that you apply a mask to the last element so that you don’t read more than the length of the word.
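For example, something along these lines (a hypothetical helper, assuming the little-endian packing shown above):

```haskell
import Data.Bits (shiftL, (.&.))
import Data.Word (Word64)

-- Zero out the bytes beyond the word's length in its final Word64
-- chunk, so the padding never influences hashing or equality.
maskLast :: Int -> Word64 -> Word64
maskLast len w
  | r == 0    = w                                 -- chunk fully used
  | otherwise = w .&. ((1 `shiftL` (8 * r)) - 1)  -- keep the low r bytes
  where
    r = len `mod` 8
```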
Awesome trick! I’m sure I’ll add something like that eventually. Right now I’m exploiting that ASCII only uses the low 7 bits of each byte to gain more compression.
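Roughly, the idea looks like this (just an illustration, not the actual code from my repo):

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word64, Word8)

-- Pack up to 9 ASCII bytes into one Word64 by using only 7 bits per
-- character (9 * 7 = 63 bits), instead of 8 characters at 8 bits each.
pack7 :: [Word8] -> Word64
pack7 = foldr (\c acc -> (acc `shiftL` 7) .|. fromIntegral (c .&. 0x7f)) 0
```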
Btw, I have refactored the Finite approach into its own library at https://github.com/kalhauge/finite. It is still very much a work in progress, but it helps with integrating with @unhammer's benchmarks.