Mutable Data structures? Why so hard to find

Haskell isn’t necessarily slower at tasks involving mutability, but for some tasks (such as word counts!) it’s worth adding a little mutability for the speedup; and mutable hash tables in Haskell could perhaps be better, in particular the hashtables package, though there is now vector-hashtables which seems like it could improve things.

Also, as the streamly example showed, it’s not hard to put mutable values in immutable HashMap. I tried it without streamly now, and it’s about as fast as with streamly as long as I use IORefs in the values instead of pure updates.

1 Like

Indeed. I said it in wrong way, but what I meant was that many mutable data structures in haskell are slow likely because of lack of love.
(Mutable vector is going to be an exception)

I was answering this question:

…I’ll let you provide Rosetta Code with a better-performing alternative.


…then start a new thread so it [any claims of “denialism” on this topic, along with supporting evidence] can be disputed discussed there.

Well that was the entire point of this thread - mutable structures not getting love.
Please why did you omit the slow (at mutability-involving tasks) in front of that line, I was specifically talking about the cases which is optimized for mutability. Once you omitted that part, it became completely different sentence.

Why we don’t have an ST-based API for e.g. Map is indeed a good question. I think we need good transient data structures (those which admit mutable and persistent APIs with seemless switching between them) and our ecosystem is sorely lacking in that regard. Plain and simple.

These mutable data structures would then profit immensely from the mutable constructor fields proposal (on phone. Sorry, Google it), but no one has the time to push it over the finish line.

4 Likes

I like your sentiment. I think it’s true that mutability is a second class citizen is most programming, especially programming done in Haskell, but that doesn’t necessarily mean that we should neglect mutability complete. We need to have some better story here for Haskell.

1 Like
4 Likes

This is a proposal to add mutable constructor fields to GHC.

Hmm…

[…]
This reminds me of a funny event at the Haskell workshop 2006. One participant stood up and sincerely proposed that Haskell’ standard would find a way to automatically derive a monadic version of a pure expression. Chung-chieh Shan recommended that person to take a look at OCaml…

– Oleg Kiselyov

Having quickly looked at it…according to the proposal, the following is “inefficient”:

data MutPair a b =
  MutPair {-# UNPACK #-} !(IORef a)
          {-# UNPACK #-} !(IORef b)

So why not just:

data MutPair a b =
  MutPair (IORef# a) (IORef# b)

where:

type IORef# a = MutVar# RealWorld a

The {-# UNPACK #-} pragma actually converts the data to something with the same representation as your suggestion. The problem is that the MutVar# itself is a pointer to a piece of mutable memory (it’s an UnliftedType not an unboxed type). The proposal wants to make it possible to have mutable memory inside the constructors of data types itself without that extra pointer indirection.

1 Like

…as described by State in Haskell:

  • (page 4 of 51)
    […] A reference can be thought of as the name of (or address of) a variable, an updatable location in the state capable of holding a value. […]

  • (page 5)
    […] Notice that the reference itself does not change; it is the state which is modified. […]

If these requirements can now be relaxed, then why not just (re)define MutVar# as an unboxed type?

If these requirements can now be relaxed

The proposal is about relaxing these requirements.

why not just (re)define MutVar# as an unboxed type?

I think we’d want to use a different name for backward compatibility and because pointers to mutable data are also still useful in some cases. Other than that, I think that is pretty much what the proposal sets out to do.

1 Like

I already did :wink: added the non-streamly IORefs version. I still wish I knew what’s making it slower than awk though.

1 Like

Isn’t that the role of lifted types like IORef and STRef (no trailing hash), in the same way that the lifted Int does for the unboxed Int#?


…through the selective use of reserved words in lieu of individual type annotations:

  • mutable - as a generic substitute for MutVar# s or MutVar# RealWorld.

I’m just wondering…could this also be used elsewhere in Haskell e.g:

  • floater - as a generic substitute for Float#, Double#, and presumably Quad# and more at some time in the future;
  • procedure - as a generic substitute for STRep# s, STRep# RealWorld (i.e. IO#) and presumably STM#.

Nice!

From Controlling Chaos: … by Stephan Herhut, Sven-Bodo and Scholz Clemens Grelck:

Single Assignment C, or SAC for short, is a strict, first-order functional language geared at high-performance numerical computation.

Furthermore:

Despite its syntactical proximity to the imperative language C, several carefully chosen restrictions of the core of SAC inhibit any form of side-effect and, thus, render it purely functional.

Based on that, I cautiously suspect that the interaction between two fundamental features in Haskell are primarily responsible:

  • non-strict semantics;
  • higher-order functions.

For example, Standard ML has higher-order functions and is strict.

One of the reasons for Haskell’s existence was to make research into, and application of these features easier - to remove them would make Haskell virtually irrelevant. There are only three widely-used non-strict languages with higher-order semantics:

  • Miranda(R),
  • Clean,
  • Haskell.

That means Haskell is very much “going first” in this field of research, so its implementation/s are going to be lacking in some ways, much in the same way early compilers for C or Pascal were slower than native assembly code - I for one am very glad that they kept on working on those old compilers! Such longevity for languages like Haskell (and their features) is what we’re interested in, because like assembly was all those decades ago, we want something better than what we have now.

It seems like you’re right that MutVar# will be unnecessary after this proposal. The proposal has a section about removing it. And there’s also some discussion about using a proper Mutable type instead of a mutable keyword.

Just want to add one datapoint to this discussion. Using the fiwiki dump, it takes 8 seconds for awk to finish, 20 seconds for ordered-containers and 19 seconds for unordered containers. This with absolutely no effort spent on optimizing.

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.Map.Strict as M
import qualified Data.Text.Lazy as Lazy
import qualified Data.Text.Lazy.IO as LazyIO
import qualified Data.Text as Strict
import Data.Map (Map)
import Data.Foldable (foldl', for_)
import qualified Data.Text.IO as StrictIO
import Data.List (sortOn)
import Data.Ord (Down(Down))
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM


orderedHisto :: Lazy.Text -> [(Strict.Text, Int)]
orderedHisto = M.toList . histo
  where
    histo :: Lazy.Text -> Map Strict.Text Int
    histo = foldl' (\acc x -> M.insertWith (+) (Lazy.toStrict x) 1 acc) M.empty . Lazy.lines

unorderedHisto :: Lazy.Text -> [(Strict.Text, Int)]
unorderedHisto = HM.toList . histo
  where
    histo :: Lazy.Text -> HashMap Strict.Text Int
    histo = foldl' (\acc x -> HM.insertWith (+) (Lazy.toStrict x) 1 acc) HM.empty . Lazy.lines

main :: IO ()
main = do
  contents <- LazyIO.getContents
  for_ (take 10 $ sortOn (Down . snd) $ unorderedHisto contents) $ \(key, val) ->
    StrictIO.putStrLn $ key <> " " <> Strict.pack (show val)
$ bzcat ~/temp/realytemp/fiwiki-20220520-pages-articles-multistream.xml.bz2 | head -1000000| tr ' ' '\n' | time ./test > /dev/null
./test > /dev/null  17.39s user 1.18s system 93% cpu 19.908 total
$ bzcat ~/temp/realytemp/fiwiki-20220520-pages-articles-multistream.xml.bz2 | head -1000000| tr ' ' '\n' | time awk '{s[$0]++} END{ PROCINFO["sorted_in"]="@val_num_desc"; i=1;for(x in s){if (i++>10)break;print x,s[x]}}' > /dev/null
awk  > /dev/null  7.34s user 0.25s system 87% cpu 8.629 total

Section 1.2.1 of @simonmar’s proposal:

module Data.Mutable.IO where
  type Mutable a = Ref# RealWorld a
  readRef :: Mutable a -> IO a
  writeRef :: Mutable a -> a -> IO ()

module Data.Mutable.ST where
  type Mutable s a = Ref# s a
  readRef :: Mutable s a -> ST s a
  writeRef :: Mutable s a -> a -> ST s ()

…has me wondering about Ref#'s state parameter s - maybe it should be for the monad providing the context, which presumably would have a MonadPrim instance:

data Ref# m a  -- unless it needs to be a primitive type
newRef# :: MonadPrim m => a -> m (Ref# m a)
readRef# :: MonadPrim m => Ref# m a -> m a
writeRef# :: MonadPrim m => Ref# m a -> a -> m ()

which then allows e.g:

type STRef# s a = Ref# (ST s) a
type IORef# a   = Ref# IO a
type TVar# a    = Ref# STM a

As for the original example…

data MutPair ... where
  MutPair :: Ref m1 a -> Ref m2 b -> MutPair (m1 a) (m2 b)

(…perhaps - completing that declaration is left as an exercise :-)

I think the spirit of PrimMonad would be to use the PrimState type family, like so:

data Ref# s a  -- unless it needs to be a primitive type
newRef# :: PrimMonad m => a -> m (Ref# (PrimState m) a)
readRef# :: PrimMonad m => Ref# (PrimState m) a -> m a
writeRef# :: PrimMonad m => Ref# (PrimState m) a -> a -> m ()

type STRef# s a = Ref# s a
type IORef# a   = Ref# RealWorld a

I don’t think STM can be made to fit this Ref# type. STM is not an instance of PrimMonad either.

As for:

data MutPair ... where
  MutPair :: Ref m1 a -> Ref m2 b -> MutPair (m1 a) (m2 b)

Constructing the MutPair must be an effectful action inside some monad, so it must be of the form:

data MutPair ... where
  MutPair :: Ref m1 a -> Ref m2 b -> m' (MutPair (m1 a) (m2 b))

But what would m' be there?

…I intended the use of the type class to restrict the use of Ref# variables to “appropriate” monadic types, so an alternative would be:

class Monad m => MonadCaptive m where ...

instance MonadCaptive (ST s) where ...
instance MonadCaptive IO where ...
instance MonadCaptive STM where ...

data Ref# m a  -- unless it needs to be a primitive type
newRef# :: MonadCaptive m => a -> m (Ref# m a)
readRef# :: MonadCaptive m => Ref# m a -> m a
writeRef# :: MonadCaptive m => Ref# m a -> a -> m ()

…how to prevent instances for e.g. Identity or Maybe from being defined is the obvious question there!


(…hence the embedding of m1 and m2 within the two arguments of MutPair.)

As for m', if m1 == m2 then the declaration can be simpler:

data MutPair ... where
  MutPair :: Ref m' a -> Ref m' b -> m' (MutPair a b)

For m1 /= m2 , some imaginative thinking could be required…

Did you know about Announcing the "stm-containers" library – Functional programming debugs you

How are you running your script? Because afaict it expects one word per line. If I tr ' ' '\n' before inputting into your script, my computer runs out of memory :slight_smile: