I agree that the current implementation of the discrimination library is not fast enough. But I believe the fundamental idea holds: you can “serialize” any data structure as an array of integers (or a bitstring), and that array can then be used for both ordering and grouping of items relatively quickly.
My idea for an algorithm:

- For each element, convert it to an unboxed array of integers.
- Keep a hashmap (a Patricia tree if you want sorting) for each array length encountered.
- These arrays can be stored unboxed and checked for equality quickly in the hashmap.
- Optionally convert back, or store the original values in the hashmap.
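The steps above can be sketched in a few lines. This is a minimal illustration, not the real implementation: it uses `Data.Map` keyed on a `[Int]` encoding as a stand-in for per-length unboxed hash tables, and `ToInts`/`groupVia` are names I made up for the sketch:

```haskell
import qualified Data.Map.Strict as M

-- Hypothetical encoder: anything we can serialize to a list of Ints.
class ToInts a where
  toInts :: a -> [Int]

instance ToInts Int where
  toInts n = [n]

instance ToInts Char where
  toInts c = [fromEnum c]

instance ToInts a => ToInts [a] where
  toInts = concatMap toInts

-- Group equal elements by their integer encoding.  A fast version
-- would use unboxed Int arrays and a hash table per encoding length;
-- a Map over [Int] only demonstrates the idea.
groupVia :: ToInts a => [a] -> [[a]]
groupVia xs =
  M.elems (M.fromListWith (flip (++)) [ (toInts x, [x]) | x <- xs ])
```

Because `Data.Map` orders by key, the groups also come out sorted by their encodings, which is the ordering property mentioned above.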
Okay, there might be something to it. Here are the current benchmarks:
| Name | Mean [ms] | Stddev [ms] |
|---|---:|---:|
| numbers/lengthBaseline | 2.67 | 0.07 |
| kjvbible/lengthBaseline | 3.96 | 0.08 |
| numbers/viaIntCounter | 5.05 | 0.27 |
| numbers/viaUnboxedVectorHash | 22.03 | 1.19 |
| numbers/viaVectorHashMap | 28.81 | 1.16 |
| numbers/viaFinite | 45.27 | 2.03 |
| kjvbible/viaFinite | 61.59 | 2.84 |
| kjvbible/viaVectorHashMap | 68.81 | 2.8 |
| numbers/viaDiscrimination | 85.74 | 4.55 |
| kjvbible/viaStrictHashMap | 95.94 | 5.23 |
| numbers/viaStrictHashMap | 97.46 | 6.79 |
| kjvbible/viaStrictMap | 302.66 | 8.05 |
| numbers/viaIntMap | 401.75 | 30.22 |
| kjvbible/viaDiscrimination | 502.48 | 24.61 |
| numbers/viaStrictMap | 545.15 | 15.31 |
Right now my wild `viaFinite` algorithm, which encodes the data into integers and writes those directly into the hash array, is the fastest of the ones I have been able to benchmark. (Sadly, I cannot get the general Counter implementation from @jaror up and running.) I did get the IntCounter example up and running, and it is fast! My current version borrows a lot from it, but I could use some help optimizing it.
I see your repo uses `ghc922`; is it possible to get `Finite.hs` to compile on 8.10? I added some pragmas, but then I get:
```
• Couldn't match type ‘PrimState m’ with ‘PrimState m0’
  Expected type: MutableByteArray (PrimState m0)
    Actual type: MutableByteArray (PrimState m)
  NB: ‘PrimState’ is a non-injective type family
  The type variable ‘m0’ is ambiguous
```
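For what it's worth, this kind of error usually goes away once the monad variable is brought into scope with `ScopedTypeVariables` and an explicit `forall`, so the intermediate value can be annotated with the outer `m` instead of GHC inventing a fresh, ambiguous `m0`. A toy illustration of the pattern, using a non-injective type family from `base` (`Item` from `GHC.Exts`) rather than the actual `Finite.hs` code:

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE TypeFamilies #-}

import GHC.Exts (IsList (..))

-- 'Item' is a non-injective type family, like 'PrimState': knowing
-- 'Item l' does not determine 'l'.  The explicit 'forall l' brings
-- 'l' into scope so the annotation in the body can mention it;
-- without ScopedTypeVariables the annotation would introduce a
-- fresh ambiguous variable instead.
roundTrip :: forall l. IsList l => l -> l
roundTrip x = fromList (toList x :: [Item l])
```

Presumably the same pattern in `Finite.hs` would mean writing an explicit `forall m. PrimMonad m => ...` and annotating the buffer as `MutableByteArray (PrimState m)`.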
(Very bleeding edge. Had to upgrade nix and add experimental flags =P)
Cool! I tried adding `Finite.hs` to Comparing hs...hs-finite · unhammer/countwords · GitHub, but it seems to go into some loop (maybe in `countOn group`) when I test it like `$ stack build && stack exec BufwiseFiniteBS <../kjvbible_x10.txt`. Am I holding it wrong?
Yeah, like the Clutter implementations, I have not implemented resizing; instead, if I run out of space, I loop forever. Consider choosing 120k as the count instead of 64k; that worked for me.
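To make that failure mode concrete, here is a toy fixed-size linear-probing counter (my own sketch, not the `Finite.hs` code): when every slot is taken and a new key arrives, the probe loop never finds a free slot and spins forever. Raising the capacity (from 64k to 120k) just keeps that point beyond the input's number of distinct keys.

```haskell
import Data.Array.IO (IOArray, newArray, readArray, writeArray)

capacity :: Int
capacity = 8  -- fixed size: no resizing, ever

-- Linear-probing insert into a fixed table of (key, count) slots.
-- Key 0 marks an empty slot, so real keys must be nonzero.
-- If the table is full and the key is absent, 'go' cycles through
-- occupied slots forever -- exactly the loop described above.
insertCount :: IOArray Int (Int, Int) -> Int -> IO ()
insertCount table key = go (key `mod` capacity)
  where
    go i = do
      (k, c) <- readArray table i
      if k == 0 || k == key
        then writeArray table i (key, c + 1)
        else go ((i + 1) `mod` capacity)
```

A fresh table would be created with `newArray (0, capacity - 1) (0, 0)`.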
Ah, I see that my approach is still 100 ms slower; maybe that gap can be closed by also using the IntCounter on small words? Impressive that Clutter still runs on text. Anyway, great work, everybody!
Awesome trick! I’m sure I’ll add something like that eventually; right now I’m exploiting the fact that ASCII only uses the low 7 bits to gain more compression:
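(For context, a sketch of the kind of 7-bit packing being described; `packAscii`/`unpackAscii` are hypothetical names I chose, not the code from the post.)

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word64)

-- Pack up to 9 ASCII characters into one Word64: 7 bits per
-- character, 9 * 7 = 63 bits, leaving the top bit free as a tag.
packAscii :: String -> Word64
packAscii =
  foldl (\acc c -> (acc `shiftL` 7) .|. fromIntegral (ord c .&. 0x7f)) 0

-- Recover the original n characters, most significant group first.
unpackAscii :: Int -> Word64 -> String
unpackAscii n w =
  [ chr (fromIntegral ((w `shiftR` (7 * i)) .&. 0x7f))
  | i <- [n - 1, n - 2 .. 0] ]
```

Packing words of up to 9 characters into a single machine word means short keys can be hashed and compared as plain `Word64`s, with no array indirection at all.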