Counting Words, but can we go faster?

Hi all,

Like others, I have been obsessed with the Count Words article Performance comparison: counting words in Python, Go, C++, C, AWK, Forth, and Rust. In this article, the author counts the occurrences of words in a text file in different languages. It turns out that the Haskell implementation is 6 times slower than the Python implementation!!

The original implementation (https://github.com/benhoyt/countwords/blob/master/haskell/src/CountWords.hs), by Adrien Glauser, is great, but there turned out to be some inefficiencies in the Text.toLower function (Why is Data.Text.toLower so slow in Haskell?), and it uses the containers Map instead of a hashmap.
The author of the count-words article no longer accepts new versions of programs, but I think it could be fun to try to optimize it.

First, on my machine the simple Python version benchmarks at:

$ hyperfine --warmup 2 'python3 simple.py < kjvbible_x10.txt'
Benchmark 1: python3 simple.py < kjvbible_x10.txt
  Time (mean ± σ):      3.349 s ±  0.053 s    [User: 3.295 s, System: 0.040 s]
  Range (min … max):    3.297 s …  3.480 s    10 runs

And the optimized C version is:

  Time (mean ± σ):     215.4 ms ±   2.4 ms    [User: 202.7 ms, System: 9.7 ms]
  Range (min … max):   211.3 ms … 218.9 ms    13 runs

A simple optimization is to use bytestrings, since the article explicitly assumes the input is ASCII, and to use the hashmap from unordered-containers:

module CountWords where

-- base
import Data.List (sortOn)
import Data.Ord (Down (..))
import Data.Foldable (forM_)

-- bytestring
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- unordered-containers
import qualified Data.HashMap.Strict as M

countWords :: B.ByteString -> [(B.ByteString, Int)]
countWords =
  sortOn (Down . snd)
    . count
    . C.words
    . B.map toLower
 where
  count =
    M.toList . M.fromListWith (+) . map (\x -> (x, 1))
  -- ASCII-only lower-casing on raw bytes: 65..90 is 'A'..'Z', +32 maps to 'a'..'z'
  toLower a
    | a >= 65 && a <= 90 = a + 32
    | otherwise = a

runCountWords :: IO ()
runCountWords = do
  contents <- B.getContents
  forM_ (countWords contents) $ \(w, i) -> do
    B.putStr w >> putStr " " >> print i

This instantly gets us into Python territory, with a still very readable solution:

  Time (mean ± σ):      2.214 s ±  0.018 s    [User: 2.120 s, System: 0.195 s]
  Range (min … max):    2.182 s …  2.255 s    10 runs

But can we go faster? Building on the thread Mutable Data structures? Why so hard to find - #54 by jaror, I have created the fastest version I could come up with:

module CountWords where

-- base
import Control.Monad (forM_)
import Data.Function (fix)
import Data.List (sortOn)
import Data.Ord (Down (..))
import System.IO (stdin)

-- bytestring
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- vector-hashtable
import qualified Data.Vector.Hashtables as H

-- vector
import qualified Data.Vector.Mutable as VM
import qualified Data.Vector.Unboxed.Mutable as UM

type HashTable k v =
  H.Dictionary (H.PrimState IO) VM.MVector k UM.MVector v

runCountWords :: IO ()
runCountWords = do
  t <- H.initialize 10000 :: IO (HashTable B.ByteString Int)
  let incr = H.alter t (Just . maybe 1 (+ 1))
  -- Read stdin in 64 KiB chunks; `b` carries the trailing partial word
  -- from the previous chunk so words split across chunk boundaries are
  -- counted correctly.
  flip fix B.empty $ \rec b -> do
    bs <- B.hGet stdin (64 * 1024)
    if B.null bs
      then do
        -- End of input: whatever is left in the carry buffer is a complete word.
        mapM_ incr (C.words b)
      else do
        -- Lower-case the new chunk, prepend the carry, and split off the
        -- trailing run of non-space bytes to carry into the next iteration.
        let (initial, last) =
              B.spanEnd (not . isSpace)
                . mappend b
                . B.map toLower
                $ bs
        mapM_ incr (C.words initial)
        rec last
  xs <- H.toList t
  forM_ (sortOn (Down . snd) xs) $ \(w, i) -> do
    B.putStr w >> putStr " " >> print i
 where
  toLower a
    | a >= 65 && a <= 90 = a + 32
    | otherwise = a
  isSpace = (== 32)

This is fast, but still about 6 times slower than C:

  Time (mean ± σ):      1.350 s ±  0.010 s    [User: 1.278 s, System: 0.194 s]
  Range (min … max):    1.337 s …  1.367 s    10 runs

So here is my question: can we go faster? I hope this challenge lets some of the experienced Haskellers out there show off their amazing skills :).

5 Likes

I reject the “Stdlib: only use the language’s standard library functions.” constraint, as this unfairly punishes well-designed languages with small stdlibs (like Haskell).

I have not optimized the following program beyond writing fastLower, which saves about 0.2 seconds. It runs in 0.55 seconds on the KJV sample on my machine (compiled with -O2, M1 mac). How does it fare on your machine?

No mutability, no weird stuff, just concise, idiomatic Haskell.

module Lib
    ( someFunc
    ) where
import qualified Streaming.ByteString.Char8 as BS8
import qualified Streaming.Prelude as P
import qualified Streaming as S
import Streaming (Stream, Of)
import Data.HashMap.Strict as Map (HashMap)
import qualified Data.HashMap.Strict as Map (alter, empty, toList)
import Data.ByteString (ByteString)
import Data.ByteString.Char8 (unpack)
import Data.List (sortOn)
import Data.Ord (Down(..))

fastLower :: Char -> Char
fastLower x | x >= 'A' && x <= 'Z' = toEnum (fromEnum x - (fromEnum 'A' - fromEnum 'a'))
            | otherwise = x

someFunc :: IO ()
someFunc = do
    let words :: Stream (Of ByteString) IO ()
        -- Read stdin as a byte stream, lower-case it, split on whitespace,
        -- and materialize each word as a strict ByteString.
        words = S.mapped BS8.toStrict $ BS8.words $ BS8.map fastLower BS8.stdin
        update :: HashMap ByteString Word -> ByteString -> HashMap ByteString Word
        update m w = Map.alter (Just . maybe 1 (+1)) w m
    counts <- P.fold_ update Map.empty id words
    let sorted = sortOn (Down . snd) $ Map.toList counts
    let display (word,count) = putStrLn $ unpack word ++ " " ++ show count
    mapM_ display sorted
2 Likes

This is absolutely right. Benchmarks are already meaningless enough without adding constraints driven by (mainly) the C and C++ mentality of dependency-avoidance at all cost (aka: NIH syndrome).

EDIT: To be fair: They did/do sometimes have good reasons for this, but those reasons are not relevant to most other languages… and are increasingly less relevant in C++ land at least these days.

1 Like

Haskell has a small stdlib? :thinking:

And… Haskell has a well-designed stdlib? :exploding_head:

7 Likes

Added those to countwords/haskell at hs · unhammer/countwords · GitHub (all changed to use a faster toLower, which seems fair since it’ll appear in the next version of Text).

@kalhauge your low-level bytestring loop is the current winner, though not that far ahead of the simple lazy IO (Text) version, which should also lower non-ASCII correctly. Perhaps Profiling using 'late' cost centres (after optimization) could shed some light?

3 Likes

Correct, Haskell has a small stdlib (if we’re talking about base) compared to e.g. Python or C++. This is why you have to explicitly put containers or unordered-containers in your dependencies. The referent of “well-designed” was “language”, although base is also not bad, partially by virtue of being so small.

1 Like

@unhammer, Amazing work, you totally beat me to the punch! Also great to see @wyager solution in action. Profiling is hard and sharing profile results is often meaningless, so it’s good to have everything in one repository :slight_smile:.

Note: I don’t know if it makes a difference, but maybe all the examples with VH should use the same initial size?

One improvement I can still think of is to use a single array to store the whole hash table, which could potentially reduce the number of cache misses. However, I don’t think it is currently possible to implement that in Haskell, even with low-level primitives in GHC.Exts. I’ve sent a mail to the ghc-devs mailing list about it: Mixed boxed/unboxed arrays?

Another idea for speeding it up is to use two different hash tables: one for the most frequent words and one for less frequent words. The idea is that the table with frequent words might be small enough to stay completely in cache for several words in a row.

Edit: an alternative idea is to store the top 16-64 words (benchmark!) in a linear lookup array and query that first before querying the hash table. The most common words are also probably short, so long words can skip the linear lookup. If you only do this for words of up to 8 characters, you can store the whole word in one Word64, which should be quite fast.
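
For what it’s worth, here is a rough standalone sketch of that packing plus linear probe (module and function names are mine, not from any existing implementation):

module HotWords where

import Data.Bits (shiftL, (.|.))
import Data.Word (Word64)
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as U

-- Pack a word of up to 8 bytes into one Word64; Nothing means "too long,
-- skip the hot table and go straight to the hash table".
packWord64 :: B.ByteString -> Maybe Word64
packWord64 bs
  | B.length bs > 8 = Nothing
  | otherwise =
      Just (B.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0 bs)

-- Linear scan over the packed hot words; with 16-64 entries this stays in
-- cache, so it can be checked before touching the big hash table.
hotIndex :: U.Vector Word64 -> Word64 -> Maybe Int
hotIndex hot k = U.findIndex (== k) hot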

2 Likes

I guess that’s why I spent 9 months fixing type FilePath = String, which will probably take another 5 years to reach the ecosystem.

Then you have lazy IO, which is an abomination in its own right, and I’m sure we’ll never get proper streaming into base.

Reading file contents into Strings is also a joke.

The Read class with its ReadS, ReadP, ReadsPrec and whatever gives me cancer every time I get even minimal exposure to it.

Show is broken, but we can’t fix it easily without causing unnecessary churn.

I’m not sure anything is particularly designed in base.

And this is a polite rant.

4 Likes

@hasufell and @wyager I feel like the discussion about base might be a little off topic. I was hoping this could be a cool discussion about getting extreme performance out of Haskell :slight_smile:. Maybe create a new topic to discuss it in?

5 Likes

This is not germane to the original thread, so this is the last I’ll say on this topic, but the fact that you can feel so passionate about such arcane details of base, which would fall so far beneath the discursive noise floor in most any other language, is proof that Haskell (and even base) are relatively quite well designed.

Imagine going to a C(++) or python developer and objecting that fopen (or regional equivalent) takes a string argument. They would wonder how such a thought even occurred to you, as there are a million more pressing problems.

This is not to discount any efforts to improve base further; e.g. replacing stringly typed values is an admirable goal. It is not, however, evidence that base or other staples of the haskell ecosystem, as they exist, are poorly designed.

2 Likes

I have created a mini benchmark for counting; so far vector-hashtables is still the winner. But maybe some of @jaror’s ideas with mixed arrays, or a better algorithm, can improve on this :slight_smile:

1 Like

I got some suggestions on the mailing list to use compact regions, so I did, and combined with my idea of storing small words in a separate table it turns out quite a bit faster. I’ve pushed my hash table implementation to GitHub here: https://github.com/noughtmare/clutter (it is still a bit crude).
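
For anyone who hasn’t used compact regions: here is a minimal sketch of the basic GHC.Compact API from the ghc-compact package; it only shows the calls involved, not how clutter actually uses regions internally:

import GHC.Compact (compact, compactAdd, getCompact)
import qualified Data.Text as T

main :: IO ()
main = do
  -- Create a region holding an initial immutable value; everything inside
  -- lives in one contiguous block that the GC never has to trace.
  region  <- compact (T.pack "the")
  -- Further values can be appended into the same region.
  region' <- compactAdd region (T.pack "and")
  print (getCompact region')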

Using that and the following code (and the faster implementation of toLower for text):

{-# LANGUAGE OverloadedStrings #-}

module CountWords where

import Control.Monad (unless)
import Data.Foldable
import Data.List (sortOn)
import Data.Ord (Down (..))
import qualified Data.Text as T
import qualified Data.Text.IO as T
import System.IO (isEOF)
import qualified TextCounter as C

unlessM :: Monad m => m Bool -> m () -> m ()
unlessM m1 m2 = m1 >>= \b -> unless b m2

runCountWords :: IO ()
runCountWords = do
  t <- C.new 60000
  let go = unlessM isEOF $ do
        wrds <- T.words . T.toLower <$> T.getLine
        traverse_ (C.count t) wrds
        go
  go
  xs <- sortOn (Down . snd) <$> C.toList t
  traverse_ (\(w, i) -> T.putStrLn (w `T.append` " " `T.append` T.pack (show i))) xs

I get these results:

Language   Simple   Optimized   Notes
Haskell              0.89       by Jaro Reinders
Python      1.80     1.08

At this point only about 30% of the time is spent on inserting things in the hash table, so I guess I should try to optimize the other stuff next.

By the way @kalhauge, did you see that @unhammer already made a benchmark repo with a bunch of extra implementations?

5 Likes

Nice @jaror! How much of the performance do you think is due to the compact regions approach, and how much is due to the small-words table?

Yes, I did see @unhammer’s repository and I think it should be the primary benchmarking repo. I hope my repo can serve as a supplement of microbenchmarks so that we can make progress on each subpart: which counting method is fastest vs. which data-reading approach is fastest.

The mixed boxed/unboxed table reduces the total time by about 25% compared to the vector-hashtables, but my hash table has a different design so that’s not really apples-to-apples. Then the small words table reduces the total time by another 18%. But those include a baseline time for reading the input and doing the text operations, so the actual impact on insertion speed is probably more even, if I have my mental maths right.
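
As a back-of-the-envelope illustration (numbers entirely made up): if the baseline stays fixed and the whole total-time saving comes out of insertion, the insertion-only improvement is the total reduction divided by insertion’s share of the runtime:

-- Hypothetical helper, just spelling out the arithmetic in the paragraph above.
insertionReduction :: Double -> Double -> Double
insertionReduction totalReduction insertionShare = totalReduction / insertionShare

-- e.g. a 25% total-time reduction with insertion at 50% of the runtime
-- corresponds to insertion itself getting ~50% faster:
-- >>> insertionReduction 0.25 0.5
-- 0.5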

(I should also say that I’m running the benchmarks on a laptop that is not charging now, but I’m comparing the speeds relative to the Python implementation, so I hope that won’t matter so much.)

Ah, I didn’t see that you had microbenchmarks for just the counting.

1 Like

We have a new winner in the combination of @kalhauge’s hGet+span loop and @jaror’s clutter library: countwords/haskell at hs · unhammer/countwords · GitHub
I’m testing on GHC 8.10.7 (Stackage LTS 18.18); I’m guessing you used a newer ghc/base, @jaror?

1 Like

Pretty cool that my TextCounter is outperforming a solution using ByteString, as that should have much lower I/O, toLower, and words overhead. I guess I’ll have to make a ByteStringCounter too.

As for the GHC version, I’m using mainly 8.10.7.

1 Like

Thanks @kalhauge, this is awesome. I am Adrien Glauser, the guy who offered the initial implementation with the inefficiencies you mention, and I’d be happy to land a PR improving it with your code and crediting all contributors :slight_smile: (even if the repository owner does not accept it, at least we’ll have tried!)

Also, on a personal note: given that the repository’s stated aim is to display idiomatic code, I much prefer the first version mentioned above. The unboxed version will not make much sense to someone comparing languages.

2 Likes

Great!! :slight_smile: Now let’s go even faster!

Thanks @Nycticorax as well; you started this journey and essentially helped @jaror find an inefficiency in the Text library! And maybe we can come up with an idiomatic way of counting words in Haskell!

I have looked at your code, and what you are essentially doing is grouping (discriminating) the text into a list of Word64s (see an example for bytestrings here: https://github.com/kalhauge/counton/blob/main/benchmark/Main.hs#L32), and then doing something different if there is only one integer in the list. Maybe that can serve as an interface to the counting? Currently the original library is not super fast on strings in my benchmarking, but maybe we can improve that?

I don’t really agree. The reason I’m converting to an int is that it can then be unboxed. That means that we can just store the int in memory instead of a pointer to the int. Lists are even worse: lists of ints (or Word64s) are implemented as pointers to pointers to the number (adding a layer of indirection).

For this particular problem it is very important to reduce these layers of indirection. I don’t think there is really any way to do that with the discrimination library.

That’s also why the IORef and MutVar approaches perform worse: these mutable references introduce another layer of indirection.
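
To make the indirection point concrete, here is a tiny standalone example (names mine, nothing to do with clutter) contrasting an unboxed mutable vector, which stores the machine integers in one flat buffer, with a boxed one, which stores a pointer per element:

import qualified Data.Vector.Mutable as VM
import qualified Data.Vector.Unboxed.Mutable as UM

main :: IO ()
main = do
  -- Unboxed: ten machine Ints in one flat buffer, no per-element pointers.
  flat <- UM.replicate 10 (0 :: Int)
  UM.modify flat (+ 1) 3

  -- Boxed: ten slots, each a pointer to a heap-allocated Int closure,
  -- so every access costs an extra dereference (and possible cache miss).
  boxed <- VM.replicate 10 (0 :: Int)
  VM.modify boxed (+ 1) 3

  x <- UM.read flat 3
  y <- VM.read boxed 3
  print (x, y) -- (1,1): same result, different memory layout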

1 Like