Mutable Data structures? Why so hard to find

Thanks, now this explains it. I did expect that those parts would be in hotspots of production codebases, what I missed is that those portions would be proprietary! Duh.
Indeed, I also don’t see mutation helps with library designs.

Means that e.g. hackage container libraries are mostly for implementing other libraries, right? I guess hotspot codes in production use their own code. I wonder how they would look like, but as it is proprietary, what could I expect…

For reference, the problem I was thinking about was huge hashtables with destructive lookups (with delete). The constant factor can become big even when I am using it for my hobby/utility code usage.

1 Like

If you want to get bare metal C/Rust-like performance your code might look like this: fasta Haskell GHC #4 program (Benchmarks Game).

I think the most important point is to use Data.Vector.Unboxed.Mutable. Make sure to use the unsafeWrite and unsafeRead functions to avoid bound checks. Also learn how to write tail-recursive loops with strict arguments like the go function inside the genSeeds function. And get the level of concurrency/parallelism right: make the chunks of work large enough that overheads don’t have much impact, but don’t make it too large so that some workers have no work left at the end.

3 Likes

…duly noted; …I’ve edited my earlier post to clarify that the “functional subset” of Standard Haskell (v. 2010) has no useful way to specify an evaluation order. Having said that:


Prelude.seq isn’t sequential:

1 Like

Typical example where mutability helps could be counting word occurrences from a huge dictionaries.

EDIT: how would you do the word counting? I.e. reading from a text file and reporting counts of each word in the file.

…Rosetta Code: that site appears often in my search results - let’s have a look:

https://rosettacode.org/wiki/Word_frequency#Haskell

Thing is, this is quite slow by many times (since Map suffers by log(N) factor, which is usually big for typical big text files). I’ve seen this performance drawback brought up as an evidence of why haskell is slow.

(…as opposed to the OS most people are currently [2022 May] using?)

If something’s too slow, you have two basic options in Haskell:

  1. use a better algorithm.
  2. use encapsulated state e.g. ST, runST, etc to implement an imperative algorithm.

I am asking why is the lack of ST-based container libraries agreed upon in the community.

Btw, programmers who complain about this usually do not use Windows OS, as far as I’ve seen.

2 Likes

The news of logarithmic time being slow is greatly exaggerated.

Should you insist on hashing, unordered-containers has a hash map; although a tree and logarithmic time is still involved, the branching factor and base is so big it can be fast enough.

I can be talked into Map/HashMap String (STRef Int) though.

Below, I couldn’t resist but over-analyze the word count problem, regardless of whether it’s a toy problem, a strawman problem, a proxy problem, or actually a mission-critical computation that some cars must perform in an embedded system in real time lest the cars would crash.

“Big file” is ambiguous in this context. Big as in a million distinct words but each occurs only once or twice? Big as in half a dozen distinct words but each occurs a million times?

In the former case, growing a balanced binary search tree is cheaper than reallocating+rehashing a hash table. In the latter case, lg(6) is small.

1 Like

Hm yes, let’s assume a typical Shakespear piece or famous speeches or something - which is where such script is most likely applied.
There would be enough variations in the words, where most words appearing perhaps 10-20 times & some words (e.g. “the”) appear a lot, like thousands to millions times.

I tried that rosetta solution, it scales pretty badly. On the first million lines of nynorsk wikipedia, a simple GNU Awk script

head -1000000| tr ' ' '\n' |\time awk '{s[$0]++} END{ PROCINFO["sorted_in"]="@val_num_desc"; i=1;for(x in s){if (i++>10)break;print x,s[x]}}'
takes just under 11s:

10.48user 0.27system 0:10.89elapsed 98%CPU (0avgtext+0avgdata 548484maxresident)k

whereas (after ghc --make -o wc WC.hs),
head -1000000|\time ./wc 10
takes nearly a minute

55.72user 0.54system 0:58.27elapsed 96%CPU (0avgtext+0avgdata 1288892maxresident)k

Switching to unordered-containers HashMap improves it a little:

39.29user 0.82system 0:42.30elapsed 94%CPU (0avgtext+0avgdata 1290264maxresident)k

but as far as I know there’s no (mutable or immutable) hash map in Haskell that gives you the performance of awk.

1 Like

Did you try with (one of the few) mutable hashmaps? It seems like those you tried are immutable maps.

(after ghc --make -o wc WC.hs),

Perhaps a naïve comment, but does this compile with optimisations (i.e.,
-O2) enabled?

Tried it again without firefox running so it’s a slightly more accurate comparison, adding -O2 seems to make it slower :slight_smile:

1 Like

Well, a plain conversion to (mutable) BasicHashTable (also getting rid of those confusing <<<'s) is 1m30s without -O2, and 56s with -O2, so not much gained there ¯\_(ツ)_/¯

Though that code also creates lists before and after the table, might not stream that well? There’s an example at streamly-examples/WordFrequency.hs at master · composewell/streamly-examples · GitHub – I tried this some months ago but I had to change it a bit to fit with my streamly version, gives about 23s on the same test set, though it’s still >2x slower than gnu awk. (I haven’t tested if it’s the list-streaming vs streamly or the IORef vs pure insertWith that makes the difference in this case.)

1 Like

@unhammer Interesting, and thank you for sharing the results!
Yes, haskell is slow (at mutability-involving tasks). However some people try to deny and think otherwise, which I think is quite unhealthy.

I still wish mutable data got more love in haskell, such data structures could be faster with devotion in place (while still could be slower than mutable-by-default languages).
Eh, but what can I do.

Haskell isn’t necessarily slower at tasks involving mutability, but for some tasks (such as word counts!) it’s worth adding a little mutability for the speedup; and mutable hash tables in Haskell could perhaps be better, in particular the hashtables package, though there is now vector-hashtables which seems like it could improve things.

Also, as the streamly example showed, it’s not hard to put mutable values in immutable HashMap. I tried it without streamly now, and it’s about as fast as with streamly as long as I use IORefs in the values instead of pure updates.

1 Like

Indeed. I said it in wrong way, but what I meant was that many mutable data structures in haskell are slow likely because of lack of love.
(Mutable vector is going to be an exception)

I was answering this question:

…I’ll let you provide Rosetta Code with a better-performing alternative.


…then start a new thread so it [any claims of “denialism” on this topic, along with supporting evidence] can be disputed discussed there.