Mutable data structures? Why so hard to find?

Just want to add one data point to this discussion. Using the fiwiki dump, it takes 8 seconds for awk to finish, 20 seconds for ordered-containers (Data.Map) and 19 seconds for unordered-containers (Data.HashMap). This is with absolutely no effort spent on optimizing.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Map.Strict as M
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as LazyIO
import qualified Data.Text as Strict
import Data.Map.Strict (Map)
import Data.Foldable (foldl', for_)
import qualified Data.Text.IO as StrictIO
import Data.List (sortOn)
import Data.Ord (Down(Down))
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM


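-- Histogram of input lines using the ordered Map from containers.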
orderedHisto :: Lazy.Text -> [(Strict.Text, Int)]
orderedHisto = M.toList . histo
  where
    histo :: Lazy.Text -> Map Strict.Text Int
    histo = foldl' (\acc x -> M.insertWith (+) (Lazy.toStrict x) 1 acc) M.empty . Lazy.lines

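-- The same histogram using a HashMap from unordered-containers.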
unorderedHisto :: Lazy.Text -> [(Strict.Text, Int)]
unorderedHisto = HM.toList . histo
  where
    histo :: Lazy.Text -> HashMap Strict.Text Int
    histo = foldl' (\acc x -> HM.insertWith (+) (Lazy.toStrict x) 1 acc) HM.empty . Lazy.lines

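-- Read stdin lazily, count with the HashMap version, print the top ten.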
main :: IO ()
main = do
  contents <- LazyIO.getContents
  for_ (take 10 $ sortOn (Down . snd) $ unorderedHisto contents) $ \(key, val) ->
    StrictIO.putStrLn $ key <> " " <> Strict.pack (show val)
$ bzcat ~/temp/realytemp/fiwiki-20220520-pages-articles-multistream.xml.bz2 | head -1000000| tr ' ' '\n' | time ./test > /dev/null
./test > /dev/null  17.39s user 1.18s system 93% cpu 19.908 total
$ bzcat ~/temp/realytemp/fiwiki-20220520-pages-articles-multistream.xml.bz2 | head -1000000| tr ' ' '\n' | time awk '{s[$0]++} END{ PROCINFO["sorted_in"]="@val_num_desc"; i=1;for(x in s){if (i++>10)break;print x,s[x]}}' > /dev/null
awk  > /dev/null  7.34s user 0.25s system 87% cpu 8.629 total

Section 1.2.1 of @simonmar’s proposal:

module Data.Mutable.IO where
  type Mutable a = Ref# RealWorld a
  readRef :: Mutable a -> IO a
  writeRef :: Mutable a -> a -> IO ()

module Data.Mutable.ST where
  type Mutable s a = Ref# s a
  readRef :: Mutable s a -> ST s a
  writeRef :: Mutable s a -> a -> ST s ()

…has me wondering about Ref#'s state parameter s: maybe it should instead be the monad providing the context, which presumably would have a MonadPrim instance:

data Ref# m a  -- unless it needs to be a primitive type
newRef# :: MonadPrim m => a -> m (Ref# m a)
readRef# :: MonadPrim m => Ref# m a -> m a
writeRef# :: MonadPrim m => Ref# m a -> a -> m ()

which then allows e.g:

type STRef# s a = Ref# (ST s) a
type IORef# a   = Ref# IO a
type TVar# a    = Ref# STM a

As for the original example…

data MutPair ... where
  MutPair :: Ref m1 a -> Ref m2 b -> MutPair (m1 a) (m2 b)

(…perhaps - completing that declaration is left as an exercise :-)

I think the spirit of PrimMonad would be to use the PrimState type family, like so:

data Ref# s a  -- unless it needs to be a primitive type
newRef# :: PrimMonad m => a -> m (Ref# (PrimState m) a)
readRef# :: PrimMonad m => Ref# (PrimState m) a -> m a
writeRef# :: PrimMonad m => Ref# (PrimState m) a -> a -> m ()

type STRef# s a = Ref# s a
type IORef# a   = Ref# RealWorld a
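
For comparison, the boxed MutVar API in the primitive package already follows exactly this pattern, so Ref# would essentially be its unboxed-reference cousin:

-- Real signatures from Data.Primitive.MutVar:
newMutVar   :: PrimMonad m => a -> m (MutVar (PrimState m) a)
readMutVar  :: PrimMonad m => MutVar (PrimState m) a -> m a
writeMutVar :: PrimMonad m => MutVar (PrimState m) a -> a -> m ()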

I don’t think STM can be made to fit this Ref# type. STM is not an instance of PrimMonad either.
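For reference, this is the class from Control.Monad.Primitive that the above leans on:

class Monad m => PrimMonad m where
  type PrimState m
  primitive :: (State# (PrimState m) -> (# State# (PrimState m), a #)) -> m a

An STM instance would have to let arbitrary state-token operations run inside a transaction, which presumably can’t be reconciled with STM’s rollback-and-retry semantics.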

As for:

data MutPair ... where
  MutPair :: Ref m1 a -> Ref m2 b -> MutPair (m1 a) (m2 b)

Constructing the MutPair must be an effectful action inside some monad, so it must be of the form:

data MutPair ... where
  MutPair :: Ref m1 a -> Ref m2 b -> m' (MutPair (m1 a) (m2 b))

But what would m' be there?

…I intended the use of the type class to restrict the use of Ref# variables to “appropriate” monadic types, so an alternative would be:

class Monad m => MonadCaptive m where ...

instance MonadCaptive (ST s) where ...
instance MonadCaptive IO where ...
instance MonadCaptive STM where ...

data Ref# m a  -- unless it needs to be a primitive type
newRef# :: MonadCaptive m => a -> m (Ref# m a)
readRef# :: MonadCaptive m => Ref# m a -> m a
writeRef# :: MonadCaptive m => Ref# m a -> a -> m ()

…how to prevent instances for e.g. Identity or Maybe from being defined is the obvious question there!
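One conventional trick is to give the class a method that only the defining module can name, and then export the class without its methods; a minimal sketch (captive is a hypothetical private method):

module Data.Mutable.Captive
  ( MonadCaptive  -- class exported without its method
  ) where

import Control.Monad.ST (ST)
import GHC.Conc (STM)

class Monad m => MonadCaptive m where
  -- Private method: outside modules can’t name it, so they can’t
  -- write a complete instance.
  captive :: m ()

instance MonadCaptive IO     where captive = pure ()
instance MonadCaptive (ST s) where captive = pure ()
instance MonadCaptive STM    where captive = pure ()

(A determined user can still write an empty instance, since GHC only warns about missing methods, so this closes the class by convention rather than by force.)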


(…hence the embedding of m1 and m2 within the two arguments of MutPair.)

As for m', if m1 == m2 then the declaration can be simpler:

data MutPair ... where
  MutPair :: Ref m' a -> Ref m' b -> m' (MutPair a b)

For m1 /= m2 , some imaginative thinking could be required…
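
For the m1 == m2 case, the idea can be written down today with STRef standing in for Ref:

import Control.Monad.ST (ST)
import Data.STRef (STRef, newSTRef)

-- Both references live in the same state thread s, and construction
-- is an effectful ST action, as argued above.
data MutPair s a b = MutPair (STRef s a) (STRef s b)

newMutPair :: a -> b -> ST s (MutPair s a b)
newMutPair x y = MutPair <$> newSTRef x <*> newSTRef y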

Did you know about stm-containers? (Announcing the "stm-containers" library – Functional programming debugs you)

How are you running your script? Because afaict it expects one word per line. If I tr ' ' '\n' before inputting into your script, my computer runs out of memory :slight_smile:

This also looks relevant for @simonmar’s proposal:

Interior mutability is a design pattern in Rust that allows you to mutate data even when there are immutable references to that data; normally, this action is disallowed by the borrowing rules. To mutate data, the pattern uses unsafe code inside a data structure to bend Rust’s usual rules that govern mutation and borrowing.

… and near the end of that article:

[…] By using RefCell<T>, we have an outwardly immutable List value. But we can use the methods on RefCell<T> that provide access to its interior mutability so we can modify our data when we need to. The runtime checks of the borrowing rules protect us from data races, and it’s sometimes worth trading a bit of speed for this flexibility in our data structures.

Hmm: “runtime checks”? “data races”? Here’s a proposal that has already been accepted:

So the question is:

  • will runtime checks also be needed in GHC for mutable fields in data constructors when used by multi-threaded programs?

If they are, it would seem to erase the main advantage of having mutable fields, namely improved performance (unless atomicity checks are now much cheaper).

I came across a similar performance comparison at Performance comparison: counting words in Python, Go, C++, C, AWK, Forth, and Rust. On my machine, their unoptimised Python one is 1.9s and their unoptimised (but reasonable-looking) Haskell is 5.7s (while if I use a HashMap IORef I get 2.5s). Why is it that pretty much all the languages in that list run so much faster than Haskell in their unoptimised versions? I mean, I’m sure there’s a way to optimise the Haskell to at least beat unoptimised Python (SIMD, buffering, hand-rolled hash functions, etc.), but why is the simple, straightforward way to do it so slow? (This goes for both the original Rosetta version and the one contributed to that blog post, presumably written by different people.)
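
(A sketch of what I mean by “a HashMap IORef”, by the way: the HashMap itself stays pure, but each counter is an IORef, and HashMap.alterF does the lookup-or-insert in one pass:)

import qualified Data.HashMap.Strict as HM
import qualified Data.Text as T
import Data.IORef

-- Bump the counter for one word; existing keys are mutated in place,
-- new keys get a fresh IORef.
countWord :: IORef (HM.HashMap T.Text (IORef Int)) -> T.Text -> IO ()
countWord mapRef w = do
  m  <- readIORef mapRef
  m' <- HM.alterF bump w m
  writeIORef mapRef m'
  where
    bump (Just r) = Just r <$ modifyIORef' r (+ 1)
    bump Nothing  = Just <$> newIORef 1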

2 Likes

Some opportunities for performance optimization:

  • Use a streaming algorithm (e.g. using streamly) instead of reading the whole file and doing multiple passes.
  • Use a hashmap (as you already tried), but we really need faster mutable hash tables. The type IORef Int is just asking for bad performance. The Int should really be unboxed inside the mutable hash table for the best performance.
  • Store intermediate results in a vector and sort that instead of sorting linked lists (but I think the sorting is not the bottleneck here); see the sketch below.
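
A minimal sketch of that last bullet, using the introsort from vector-algorithms on a mutable copy of the result list:

import qualified Data.Text as T
import qualified Data.Vector as V
import qualified Data.Vector.Algorithms.Intro as Intro
import Data.Ord (Down (..), comparing)

-- Copy the (word, count) pairs into a vector and sort it in place,
-- rather than running a list sort over a linked list.
sortCounts :: [(T.Text, Int)] -> V.Vector (T.Text, Int)
sortCounts = V.modify (Intro.sortBy (comparing (Down . snd))) . V.fromList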

The streamly version isn’t much faster (the lazy IO version also “streams” in the sense that it doesn’t read the whole file into memory but works on chunks; it just doesn’t guarantee that it won’t start leaking when you refactor). And streamly is quite a heavy extra dep (it even wants a GHC plugin) for something that should be simple: read a file chunk by chunk. (Profiling the HashMap IORef version shows that at least 2/3 of the time is spent in HashMap.alterF.)

As I said, I’m sure it’s possible to optimise this in Haskell, but it’s disappointing that it’s so non-trivial to even match the performance of unoptimised code in other languages, requiring investigation of third-party libraries and alternative data structures (or maybe the bottleneck is something completely different; until it’s found I’m not going to believe any suggestion is the right one). And it’s not a microbenchmark situation: it’s a task that is both common and useful (I make freqlists many times a day in my jobs), typical of data-processing situations where performance matters, and one that has been around since McIlroy and Knuth were young, while not being very difficult to quickly code up a suboptimal solution to.

1 Like

Where is the streamly version? Because I don’t believe you. You need some knowledge of the streamly API. Naive versions aren’t always fast.

Oh, I couldn’t find a streamly or lazy text version in the repo linked from the blog post. I thought that would have a bigger effect on performance.

Is that with the new -fprof-late? Otherwise it probably doesn’t tell you much about the optimized performance.

The “unoptimized” Python version is quite optimized. It is just not the writer of the program who has done the optimization; it’s the writers of Python’s standard library. Haskell hasn’t seen much optimization of that kind yet for this particular problem domain (mutable hash tables).

I think it is not that hard now to write a library with linear types which implements an efficient counting hash table, i.e. one that maps some boxed element type to unboxed integers. But linear types are also quite recent, so that explains part of the lack of good mutable hash tables. I’ve still only seen @Bodigrim’s linear-builder library use linear types to improve performance.

1 Like

The first thing I would try to improve here is to not build a (potentially) enormous Text in memory before printing it. I would also try to separate the performance of the counting from the performance of the console I/O.

Even if I replace M.toList . countwords by a magic function magic :: [T.Text] -> [(T.Text, Int)] that just evaluates its argument deeply and then returns a precomputed answer then the Haskell solution is still slower than the unoptimized Python solution on my machine. Python (unoptimized): 1.78 s, Haskell: 2.85 s.
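
(Concretely, the stand-in looks something like this, with answer a placeholder for the precomputed counts:)

import Control.DeepSeq (deepseq)
import qualified Data.Text as T

-- Force the whole token list, then return a canned result, so only
-- the tokenisation cost remains to be measured.
magic :: [T.Text] -> [(T.Text, Int)]
magic ws = ws `deepseq` answer
  where
    answer = [(T.pack "example", 1)] -- dummy stand-in for the precomputed counts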

If I change magic to return an empty list the whole program still takes 2.77 s, so the problem is definitely the mapping, splitting, and filtering.

If I subtract that overhead (2.77~2.85 s) from the fastest I got using vector-hashtables, 3.71 s, then we get 0.86~0.94 s for the counting itself, which is respectable. Maybe that is achievable without too much work?

Update: for some reason T.toLower and T.words do not improve the speed of the preprocessing much.

3 Likes

The version I mentioned earlier in the thread, streamly wc, 2x faster than getcontents→list→hashmap→list, 2x slower than awk · GitHub, is basically the one from the streaming examples repo (changed to compile with streamly-0.8.1.1). On the test set from the benhoyt post, it takes 2.2s (vs 2.5s) on my machine, so better, but still slower than simple.py (1.9s), and with more complicated code and a dependence on specific library/compiler versions, not to mention library knowledge.

I assume the authors of streamly know their API. Yeah, this is a good example of how naive Haskell code isn’t always fast. And you don’t just need to know the streamly API, but:

  1. you need to know to use streamly at all (would conduit/io-streams/pipes/machines/streaming be fast enough? who has the time to check)
  2. you need to read the streaming docs on compiling to know that you can’t go above ghc 8.10 at the moment (so no optimised-profiling support since that requires 9.4)
  3. you need to read the streaming docs to know that you need 5 compiler flags and a compiler plugin or you may suffer a 5-10x slowdown
  4. you need to know to use HashMap IORef from unordered-containers and not the standard Map (from containers)
1 Like

This one (compile with -O2, nearly as fast as streamly · GitHub) is about 2x faster than that one (when piping to /dev/null, so no console-scrolling slowdown at least), and prints the top n without that whole unlines thing. I should probably put these in a repo by now so they’re easier to refer to :slight_smile:

EDIT: See Comparing benhoyt:master...unhammer:hs · benhoyt/countwords · GitHub

The combination of streamly + vector-hashtables is the current winner, at almost the same speed as simple.py.

1 Like

Yes, naive Haskell code is not fast.

Streamly is likely by far the best streaming implementation if you’re willing to put some effort into optimizing it and understanding the API.

If you want stable and ok performance, conduit is a better bet.

With streamly, you can get large regressions when you refactor or bump your GHC version, because it inlines so heavily to try to optimize everything into one hot loop, which then allows GHC to emit very efficient code.

I feel like it’s worth restating that the vast majority of runtime problems in a program lie not in linear factors but in unoptimized algorithms.

Whilst it’s good to think about improving linear factors, it really is a minor point for general purpose programming languages, and when necessary it is certainly addressable in Haskell.

95% of your code will not care about a 2x performance hit in some types of streaming folds. The 4% that does care can handle it if needed in Haskell. The rest is probably in a problem domain that is not what Haskell is designed for (a typed lambda calculus), and you’re better off FFIing into C or starting a Rust project. The fact that Haskell is naively worse at such problems is a feature.

FWIW, one huge potential improvement to the problem described above would be a linearly-typed Map data structure, such as in linear-base.
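
To make that concrete, here is a sketch against a hypothetical minimal interface (these are not the real linear-base signatures; see Data.HashMap.Mutable.Linear for those). The point is that linear consumption of the map lets every update mutate in place behind a pure-looking type:

{-# LANGUAGE LinearTypes #-}

import qualified Data.Text as T

-- Hypothetical abstract linear mutable map, standing in for a real one.
data LinMap k v

-- Assumed primitive: because the map is consumed linearly, the update
-- can be done in place.
alter :: (Maybe v -> Maybe v) -> k -> LinMap k v %1 -> LinMap k v
alter = error "stand-in for the real library primitive"

-- Pure-looking counting loop; the map threads through linearly.
countLinear :: LinMap T.Text Int %1 -> [T.Text] -> LinMap T.Text Int
countLinear m []       = m
countLinear m (w : ws) = countLinear (alter (Just . maybe 1 (+ 1)) w m) ws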

I can beat Python with this:

{-# LANGUAGE OverloadedStrings #-}

module CountWords where

import           Data.Foldable
import           Data.List       (sortBy)
import           Data.Ord        (Down (..), comparing)
import qualified Data.Text       as T
import qualified Data.Text.IO    as T

import qualified Data.Vector.Mutable as VM
import qualified Data.Vector.Unboxed.Mutable as UM
import qualified Data.Vector.Hashtables as H

import Control.Monad
import Control.Monad.Primitive
import System.IO

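-- Keys live in a boxed mutable vector, counts in an unboxed one: the
-- Int counters sit unboxed inside the table, with no IORef boxing.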
type HashTable k v = H.Dictionary (PrimState IO) VM.MVector k UM.MVector v

unlessM :: Monad m => m Bool -> m () -> m ()
unlessM m1 m2 = m1 >>= \b -> unless b m2

runCountWords :: IO ()
runCountWords = do
  t <- H.initialize 10000 :: IO (HashTable T.Text Int)
  let 
    go = unlessM isEOF $ do
      words <- T.words . T.toLower <$> T.getLine
      traverse_ (H.alter t (Just . maybe 1 (+ 1))) words
      go
  go
  xs <- sortBy (comparing (Down . snd)) <$> H.toList t
  traverse_ (\(w, i) -> T.putStrLn (w `T.append` " " `T.append` T.pack (show i))) xs
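
(To try it standalone, a one-line wrapper like main = runCountWords suffices, with the input file piped in on stdin.)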

And @Bodigrim’s quick work improving toLower: https://github.com/haskell/text/pull/460

That gives me the following results:

Language      | Simple | Optimized | Notes
------------- | ------ | --------- | -----
Haskell       |   1.40 |           | by Adrien Glauser
Python        |   1.78 |      1.07 | 
4 Likes

Woah, that’s interesting, I get

Language      | Simple | Optimized | Notes
------------- | ------ | --------- | -----
Haskell       |   3.43 |           | jaror
Python        |   2.01 |      1.38 | 

(while Streamly + VH gives me 2.15; I added your version as LinewiseVH in https://github.com/unhammer/countwords/tree/hs). What GHC are you using?

EDIT: Now I see this is after including the new toLower patch. That’s amazing: you pinpointed an unnecessary slowdown that affects everybody, made a nice reproducible issue, and less than a week later it’s fixed :heart:

5 Likes