Mutable Data structures? Why so hard to find

Abab9579 · May 19, 2022, 5:42am

Why is it hard to find mutable data structures in use?

There are quite a few cases where a mutable data structure is desirable: even a log N factor can be significant for a huge N, and a slowdown of a few times can be a real problem.

However, I find it hard to find mutable structures in common use in Haskell. Can I ask why? It seems like problems suited to destructive updates should be common in production use cases.

santiweight · May 19, 2022, 6:49am

The reality is just that such mutable cases are extremely rare, contrary to what my Java algorithms course taught me. Also, mutation is overhyped for many algorithms: purity allows unrestricted sharing in algorithms, and if you take a look at Seq, for example, the computational complexities are pretty great…

To the bigger point: the most expensive thing today is dev time, by A LOT. Devs and managers have a choice between:

Use mutable data structures for maybe a 5-10% application speedup, at the cost of the headache of managing mutable state throughout the codebase.
Don’t use mutable data structures, and don’t get harassed at 11pm on a Friday about some mutation-related bug report that requires more dev time as a result.

Are there cases that require mutable data structures? Yes. But they are very rarely required or worth it. Good devs will isolate such mutation like a virus, for example at DB boundaries or behind type-safe APIs (e.g. LinearTypes).

For example, in my current codebase at work, mutation happens all the time, but we hide it, as is the Haskell way. The codebase looks immutable because we’ve pushed the IO/STM stuff right down to the bottom of our application monad (in this case an event-based simulator).

But where are the critical examples? The most common cases for mutation are, as you say, production use cases, where heavier workloads necessitate mutation. The reason you don’t see such code is because it’s proprietary, and so these repos aren’t up on GitHub. Libraries rarely require mutation in order to function. I looked through my work’s dependencies, and found exactly zero libraries that expose any mutation (there’s barely even exposed IO). tasty has some MVars underneath to get the test runners going, but that’s the only case I can think of.

Your question is understandable, but it’s hard to think of where Haskellers are underusing mutation… If we get past our reflexive “surely Haskellers aren’t using mutation enough”, and ask more pointed questions, it’s hard to think where the community is erring:

What libraries are too slow to use in practice? Why don’t they use mutation? Should they use mutation?
What problems require mutation? Are those problems unaddressed in the Haskell community? If those problems are addressed, are they in fact using mutation?

Some open-source libraries I can think of that do require mutation include: FRP libraries (reflex, reactive-banana); reactive UIs (elm, miso et al.); mutable APIs in containers/vector; STM/ST monads and their reference types; Bodigrim’s recent text-builder library which leverages LinearTypes. But note that without exception their APIs are pure/mutation-free.

atravers · May 19, 2022, 8:05am

Primarily because Haskell is non-strict.

Standard Haskell (v. 2010) - the “functional subset” (in the mathematical sense) - provides no useful way to specify an evaluation order, which is needed to ensure mutation occurs in the correct sequence. This has two consequences:

the vast majority of Haskell code is, well, functional: expressions almost everywhere.
in the [ideally fewer and fewer] cases where mutation is needed, the sequencing it requires is specified via some generic interface: functor, applicative, monad, arrow, comonad, etc. Most Haskell programmers prefer the “functional style” of Haskell instead.

But Haskell, like any other programming language you’re using, runs on a solid-state Turing machine, driven by mutation and sequencing - fortunately for us, that memory-manipulation is now mostly tucked away inside the Haskell implementation and under the control of carefully-designed algorithms (and mostly out of the clutches of those error-prone “wet brains” !)

Surely this must be better than the alternative: each of us effectively “reinventing the wheel” of memory management ourselves at every third or fifth opportunity…

Here’s another way to think about it: when was the last time you saw native assembly code being used directly in an imperative language like C, Pascal or Fortran? All going well: very rarely. The majority of the “big” compilers for those languages are now so refined that the native assembly code they generate almost always beats what a “wet brain” can do!

Now if only those “soggy skulls” could do something similar about manually managing all that morose mutation - oh wait…

Abab9579 · May 19, 2022, 8:13am

Thanks, now this explains it. I did expect that those parts would be in hotspots of production codebases; what I missed is that those portions would be proprietary! Duh.
Indeed, I also don’t see how mutation helps with library design.

Does that mean that e.g. Hackage container libraries are mostly for implementing other libraries? I guess hotspot code in production uses its own data structures. I wonder what they would look like, but as it is proprietary, what could I expect…

For reference, the problem I was thinking about was huge hashtables with destructive lookups (with delete). The constant factor can become big even when I am using it for my hobby/utility code usage.

jaror · May 19, 2022, 8:15am

jaror · May 19, 2022, 8:29am

If you want to get bare metal C/Rust-like performance your code might look like this: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-ghc-4.html.

I think the most important point is to use Data.Vector.Unboxed.Mutable. Make sure to use the unsafeWrite and unsafeRead functions to avoid bound checks. Also learn how to write tail-recursive loops with strict arguments like the go function inside the genSeeds function. And get the level of concurrency/parallelism right: make the chunks of work large enough that overheads don’t have much impact, but don’t make it too large so that some workers have no work left at the end.
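As a rough illustration of the loop style described above, here is a minimal sketch of a sieve of Eratosthenes. It is not taken from the linked benchmark program. To avoid extra dependencies it swaps in STUArray from the array package (which ships with GHC) instead of Data.Vector.Unboxed.Mutable; the unsafeRead and unsafeWrite used here are the ones from Data.Array.Base, which likewise skip bounds checks. The name primesUpTo is made up for this example.

```haskell
{-# LANGUAGE BangPatterns #-}

import Control.Monad (when)
import Data.Array.Base (unsafeRead, unsafeWrite)
import Data.Array.ST (newArray, runSTUArray)
import Data.Array.Unboxed (UArray, assocs)

-- | All primes <= n, computed by mutating an unboxed Bool array in ST.
-- Swapped-in technique: STUArray rather than Data.Vector.Unboxed.Mutable.
primesUpTo :: Int -> [Int]
primesUpTo n
  | n < 2 = []
  | otherwise = [i | (i, True) <- assocs sieve]
  where
    sieve :: UArray Int Bool
    sieve = runSTUArray $ do
      arr <- newArray (0, n) True
      -- The array starts at index 0, so the raw offsets taken by
      -- unsafeRead/unsafeWrite coincide with the logical indices.
      unsafeWrite arr 0 False
      unsafeWrite arr 1 False
      let -- Tail-recursive loops with strict (banged) counters.
          outer !i
            | i * i > n = pure ()
            | otherwise = do
                isPrime <- unsafeRead arr i
                when isPrime (inner (i * i) i)
                outer (i + 1)
          inner !j !step
            | j > n = pure ()
            | otherwise = unsafeWrite arr j False >> inner (j + step) step
      outer 2
      pure arr
```

Callers only see the pure function primesUpTo; the unchecked writes are safe here only because every index is tested against n before it is used.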

atravers · May 19, 2022, 8:34am

…duly noted; …I’ve edited my earlier post to clarify that the “functional subset” of Standard Haskell (v. 2010) has no useful way to specify an evaluation order. Having said that:

…Prelude.seq isn’t sequential:

Abab9579 · May 19, 2022, 9:14am

A typical example where mutability helps could be counting word occurrences from a huge dictionary.

EDIT: how would you do the word counting? I.e. reading from a text file and reporting counts of each word in the file.
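To make the task concrete, a pure baseline could look roughly like the sketch below. It is only an assumption of how one might write it with Data.Map.Strict from containers, not a solution posted in this thread; the names wordCounts and topN and the normalisation rule (lower-case, alphabetic characters only) are invented for the example.

```haskell
import Data.Char (isAlpha, toLower)
import Data.List (sortBy)
import qualified Data.Map.Strict as M
import Data.Ord (Down (..), comparing)

-- | Count occurrences of each word. Non-letters are treated as
-- separators and everything is lower-cased (an assumed convention).
wordCounts :: String -> M.Map String Int
wordCounts text =
  M.fromListWith (+) [(w, 1) | w <- words (map normalise text)]
  where
    normalise c
      | isAlpha c = toLower c
      | otherwise = ' '

-- | The n most frequent words, most frequent first. The sort is stable,
-- so words with equal counts stay in alphabetical order.
topN :: Int -> M.Map String Int -> [(String, Int)]
topN n = take n . sortBy (comparing (Down . snd)) . M.toList
```

For a file this would be used along the lines of topN 10 . wordCounts applied to the result of readFile. Each insertion costs O(log n) in the number of distinct words, which is the factor under discussion.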

atravers · May 20, 2022, 3:00am

…Rosetta Code: that site appears often in my search results - let’s have a look:

Abab9579 · May 20, 2022, 3:04am

Thing is, this is many times slower (since Map suffers from a log(N) factor, which is usually big for typical big text files). I’ve seen this performance drawback brought up as evidence of why Haskell is slow.

atravers · May 20, 2022, 3:43am

(…as opposed to the OS most people are currently [2022 May] using?)

If something’s too slow, you have two basic options in Haskell:

use a better algorithm.
use encapsulated state e.g. ST, runST, etc to implement an imperative algorithm.
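A minimal sketch of the second option, assuming nothing beyond base: the function below runs an imperative loop over two STRef cells, yet runST gives it an ordinary pure type, so callers cannot observe the mutation. The name fibST is just for illustration.

```haskell
import Control.Monad (replicateM_)
import Control.Monad.ST (runST)
import Data.STRef (newSTRef, readSTRef, writeSTRef)

-- | n-th Fibonacci number (fibST 0 == 0), computed with two mutable cells.
fibST :: Int -> Integer
fibST n = runST $ do
  a <- newSTRef 0
  b <- newSTRef 1
  replicateM_ n $ do
    x <- readSTRef a
    y <- readSTRef b
    writeSTRef a y
    -- Force the sum before storing it so no chain of thunks builds up.
    writeSTRef b $! x + y
  readSTRef a
```

The same pattern scales up to mutable arrays and tables: allocate inside runST, mutate freely, and return only an immutable result.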

Abab9579 · May 20, 2022, 4:13am

I am asking why the community seems to accept the lack of ST-based container libraries.

Btw, programmers who complain about this usually do not use Windows OS, as far as I’ve seen.

treblacy · May 20, 2022, 5:55pm

The news of logarithmic time being slow is greatly exaggerated.

Should you insist on hashing, unordered-containers has a hash map; although a tree and logarithmic time is still involved, the branching factor and base is so big it can be fast enough.

I can be talked into Map/HashMap String (STRef Int) though.
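A sketch of what that hybrid could look like, under the assumption that an ordinary Data.Map.Strict holds one STRef counter per word (countWithRefs is a made-up name). The map is rebuilt only when a new word first appears; repeated words just bump their counter in place, although each word still pays for a lookup.

```haskell
import Control.Monad.ST (ST, runST)
import qualified Data.Map.Strict as M
import Data.STRef (STRef, modifySTRef', newSTRef, readSTRef)

-- | Count words using an immutable map of mutable counters.
countWithRefs :: [String] -> M.Map String Int
countWithRefs ws = runST $ do
  table <- go M.empty ws
  -- Read every counter out, yielding an ordinary immutable map.
  traverse readSTRef table
  where
    go :: M.Map String (STRef s Int) -> [String]
       -> ST s (M.Map String (STRef s Int))
    go m [] = pure m
    go m (w : rest) = case M.lookup w m of
      -- Known word: mutate its counter, leave the map untouched.
      Just ref -> modifySTRef' ref (+ 1) >> go m rest
      -- New word: allocate a counter and insert it once.
      Nothing -> do
        ref <- newSTRef 1
        go (M.insert w ref m) rest
```

Whether this beats a plain insertWith depends on the ratio of distinct words to total words, so it would need measuring on real input.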

Below, I couldn’t resist but over-analyze the word count problem, regardless of whether it’s a toy problem, a strawman problem, a proxy problem, or actually a mission-critical computation that some cars must perform in an embedded system in real time lest the cars would crash.

“Big file” is ambiguous in this context. Big as in a million distinct words but each occurs only once or twice? Big as in half a dozen distinct words but each occurs a million times?

In the former case, growing a balanced binary search tree is cheaper than reallocating+rehashing a hash table. In the latter case, lg(6) is small.

Abab9579 · May 20, 2022, 10:28pm

Hm yes, let’s assume a typical Shakespeare piece or famous speeches or something, which is where such a script is most likely applied.
There would be enough variation in the words, with most words appearing perhaps 10-20 times and some words (e.g. “the”) appearing a lot, like thousands to millions of times.

unhammer · May 27, 2022, 7:05am

I tried that Rosetta solution; it scales pretty badly. On the first million lines of Nynorsk Wikipedia, a simple GNU Awk script

head -1000000| tr ' ' '\n' |\time awk '{s[$0]++} END{ PROCINFO["sorted_in"]="@val_num_desc"; i=1;for(x in s){if (i++>10)break;print x,s[x]}}'
takes just under 11s:

10.48user 0.27system 0:10.89elapsed 98%CPU (0avgtext+0avgdata 548484maxresident)k

whereas (after ghc --make -o wc WC.hs),
head -1000000|\time ./wc 10
takes nearly a minute

55.72user 0.54system 0:58.27elapsed 96%CPU (0avgtext+0avgdata 1288892maxresident)k

Switching to unordered-containers HashMap improves it a little:

39.29user 0.82system 0:42.30elapsed 94%CPU (0avgtext+0avgdata 1290264maxresident)k

but as far as I know there’s no (mutable or immutable) hash map in Haskell that gives you the performance of awk.

Abab9579 · May 27, 2022, 8:17am

Did you try with (one of the few) mutable hashmaps? It seems like those you tried are immutable maps.

Solid · May 27, 2022, 8:55am

(after ghc --make -o wc WC.hs),

Perhaps a naïve comment, but does this compile with optimisations (i.e., -O2) enabled?

unhammer · May 27, 2022, 9:59am

Tried it again without Firefox running so it’s a slightly more accurate comparison; adding -O2 seems to make it slower.

unhammer · May 27, 2022, 10:48am

Well, a plain conversion to (mutable) BasicHashTable (also getting rid of those confusing <<<'s) is 1m30s without -O2, and 56s with -O2, so not much gained there ¯\_(ツ)_/¯

Though that code also creates lists before and after the table, might not stream that well? There’s an example at streamly-examples/WordFrequency.hs at master · composewell/streamly-examples · GitHub – I tried this some months ago but I had to change it a bit to fit with my streamly version, gives about 23s on the same test set, though it’s still >2x slower than gnu awk. (I haven’t tested if it’s the list-streaming vs streamly or the IORef vs pure insertWith that makes the difference in this case.)

Abab9579 · May 27, 2022, 10:57am

@unhammer Interesting, and thank you for sharing the results!
Yes, Haskell is slow (at mutability-involving tasks). However, some people try to deny this and think otherwise, which I think is quite unhealthy.

I still wish mutable data got more love in Haskell; such data structures could be faster with some devotion (while possibly still slower than in mutable-by-default languages).
Eh, but what can I do.
