Mutable Data structures? Why so hard to find

Why is it hard to find mutable data structures in use?

There are quite a bit of cased where mutable data structure is desirable, since even logN factor could be big enough for a huge N - a few times bigger factor could be problematic.

However, I find it hard to find mutable structures in common use in haskell. Can I ask why? It seems like problems suited for destructive updates should be common in production usecases.

3 Likes

The reality is just that such mutable cases are extremely rare, contrary to what my Java algorithms course taught me. Also, mutation is overhyped for many algorithms: pureness allows for un-restricted sharing in algorithms, and if you take a look at Seq, for example, the computational complexities are pretty greatā€¦

To the bigger point: the most expensive thing today is dev time, by A LOT. Devs and managers have a choice between:

  1. Use mutable data structures and cause dev problems for maybe 5-10% application speed up, but require a headache of managing mutable state throughout the codebase.
  2. Donā€™t use mutable data structures and donā€™t get harassed at 11pm on a Friday about some mutation-related bug report and require more dev time as a result

Are there cases that require mutable data structures? Yes. But they are very rarely required or worth it. Good devs will isolate such mutation like a virus, such as DB boundaries or behind type safe APIs (e.g. LinearTypes).

For example, my current workbase, mutation happens all the time, but we hide it, as is the Haskell way. The codebase looks immutable because weā€™ve pushed the IO/STM stuff right down to the bottom of our application monad (in this case a event-based simulator).

But where are the critical examples? The most common cases for mutation are, as you say, production use cases, where heavier workloads necessitate mutation. The reason you donā€™t see such code is because itā€™s proprietary, and so these repos arenā€™t up on GitHub. Libraries rarely require mutation in order to function. I looked through my workā€™s dependencies, and found exactly zero libraries that expose any mutation (thereā€™s barely even exposed IO). tasty has some MVars underneath to get the test runners going, but thatā€™s the only case I can think of.

Your question is understandable, but itā€™s hard to think of where Haskellers are underusing mutationā€¦ If we get past our reflexive ā€œsurely Haskellers arenā€™t using mutation enoughā€, and ask more pointed questions, itā€™s hard to think where the community is erring:

  • What libraries are too slow to use in practice? Why donā€™t they use mutation? Should they use mutation?
  • What problems require mutation? Are those problems unaddressed in the Haskell community? If those problems are addressed, are they in fact using mutation?

Some open-source libraries I can think of that do require mutation include: FRP libraries (reflex, reactive-banana); reactive UIs (elm, miso et al.); mutable APIs in containers/vector; STM/ST monads and their reference types; Bodigrimā€™s recent text-builder library which leverages LinearTypes. But note that without exception their APIs are pure/mutation-free.

5 Likes

Primarily because Haskell is non-strict.

Standard Haskell (v. 2010) - the ā€œfunctional subsetā€ (in the mathematical sense) - provides no useful way to specify an evaluation order, which is needed to ensure mutation occurs in the correct sequence. This has two consequences:

  • the vast majority of Haskell code is, well, functional: expressions almost everywhere.
  • if the [ideally fewer and fewer] cases where mutation is needed, the sequencing it requires is specified via some generic interface: functor, applicative, monad, arrow, comonad, etc - most Haskell programmers prefer the ā€œfunctional styleā€ of Haskell instead.

But Haskell, like any other programming language youā€™re using, runs on a solid-state Turing machine, driven by mutation and sequencing - fortunately for us, that memory-manipulation is now mostly tucked away inside the Haskell implementation and under the control of carefully-designed algorithms (and mostly out of the clutches of those error-prone ā€œwet brainsā€ !)

Surely this must be better than the alternative: each of us effectively ā€œreinventing the wheelā€ of memory management ourselves at every third or fifth opportunityā€¦

Hereā€™s another way to think about it: when was the last time you saw native assembly code being used directly in an imperative language like C, Pascal or Fortran? All going
well: very rarely. The majority of the ā€œbigā€ compilers for those languages are now so refined that the native assembly code they generate almost always beats what a ā€œwet brainā€ can do!

Now if only those ā€œsoggy skullsā€ could do something similar about manually managing all that morose mutation - oh waitā€¦

Thanks, now this explains it. I did expect that those parts would be in hotspots of production codebases, what I missed is that those portions would be proprietary! Duh.
Indeed, I also donā€™t see mutation helps with library designs.

Means that e.g. hackage container libraries are mostly for implementing other libraries, right? I guess hotspot codes in production use their own code. I wonder how they would look like, but as it is proprietary, what could I expectā€¦

For reference, the problem I was thinking about was huge hashtables with destructive lookups (with delete). The constant factor can become big even when I am using it for my hobby/utility code usage.

1 Like

If you want to get bare metal C/Rust-like performance your code might look like this: https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/fasta-ghc-4.html.

I think the most important point is to use Data.Vector.Unboxed.Mutable. Make sure to use the unsafeWrite and unsafeRead functions to avoid bound checks. Also learn how to write tail-recursive loops with strict arguments like the go function inside the genSeeds function. And get the level of concurrency/parallelism right: make the chunks of work large enough that overheads donā€™t have much impact, but donā€™t make it too large so that some workers have no work left at the end.

3 Likes

ā€¦duly noted; ā€¦Iā€™ve edited my earlier post to clarify that the ā€œfunctional subsetā€ of Standard Haskell (v. 2010) has no useful way to specify an evaluation order. Having said that:


ā€¦Prelude.seq isnā€™t sequential:

1 Like

Typical example where mutability helps could be counting word occurrences from a huge dictionaries.

EDIT: how would you do the word counting? I.e. reading from a text file and reporting counts of each word in the file.

ā€¦Rosetta Code: that site appears often in my search results - letā€™s have a look:

Thing is, this is quite slow by many times (since Map suffers by log(N) factor, which is usually big for typical big text files). Iā€™ve seen this performance drawback brought up as an evidence of why haskell is slow.

(ā€¦as opposed to the OS most people are currently [2022 May] using?)

If somethingā€™s too slow, you have two basic options in Haskell:

  1. use a better algorithm.
  2. use encapsulated state e.g. ST, runST, etc to implement an imperative algorithm.

I am asking why is the lack of ST-based container libraries agreed upon in the community.

Btw, programmers who complain about this usually do not use Windows OS, as far as Iā€™ve seen.

2 Likes

The news of logarithmic time being slow is greatly exaggerated.

Should you insist on hashing, unordered-containers has a hash map; although a tree and logarithmic time is still involved, the branching factor and base is so big it can be fast enough.

I can be talked into Map/HashMap String (STRef Int) though.

Below, I couldnā€™t resist but over-analyze the word count problem, regardless of whether itā€™s a toy problem, a strawman problem, a proxy problem, or actually a mission-critical computation that some cars must perform in an embedded system in real time lest the cars would crash.

ā€œBig fileā€ is ambiguous in this context. Big as in a million distinct words but each occurs only once or twice? Big as in half a dozen distinct words but each occurs a million times?

In the former case, growing a balanced binary search tree is cheaper than reallocating+rehashing a hash table. In the latter case, lg(6) is small.

1 Like

Hm yes, letā€™s assume a typical Shakespear piece or famous speeches or something - which is where such script is most likely applied.
There would be enough variations in the words, where most words appearing perhaps 10-20 times & some words (e.g. ā€œtheā€) appear a lot, like thousands to millions times.

I tried that rosetta solution, it scales pretty badly. On the first million lines of nynorsk wikipedia, a simple GNU Awk script

head -1000000| tr ' ' '\n' |\time awk '{s[$0]++} END{ PROCINFO["sorted_in"]="@val_num_desc"; i=1;for(x in s){if (i++>10)break;print x,s[x]}}'
takes just under 11s:

10.48user 0.27system 0:10.89elapsed 98%CPU (0avgtext+0avgdata 548484maxresident)k

whereas (after ghc --make -o wc WC.hs),
head -1000000|\time ./wc 10
takes nearly a minute

55.72user 0.54system 0:58.27elapsed 96%CPU (0avgtext+0avgdata 1288892maxresident)k

Switching to unordered-containers HashMap improves it a little:

39.29user 0.82system 0:42.30elapsed 94%CPU (0avgtext+0avgdata 1290264maxresident)k

but as far as I know thereā€™s no (mutable or immutable) hash map in Haskell that gives you the performance of awk.

1 Like

Did you try with (one of the few) mutable hashmaps? It seems like those you tried are immutable maps.

(after ghc --make -o wc WC.hs),

Perhaps a naĆÆve comment, but does this compile with optimisations (i.e.,
-O2) enabled?

Tried it again without firefox running so itā€™s a slightly more accurate comparison, adding -O2 seems to make it slower :slight_smile:

1 Like

Well, a plain conversion to (mutable) BasicHashTable (also getting rid of those confusing <<<'s) is 1m30s without -O2, and 56s with -O2, so not much gained there ĀÆ\_(惄)_/ĀÆ

Though that code also creates lists before and after the table, might not stream that well? Thereā€™s an example at streamly-examples/WordFrequency.hs at master Ā· composewell/streamly-examples Ā· GitHub ā€“ I tried this some months ago but I had to change it a bit to fit with my streamly version, gives about 23s on the same test set, though itā€™s still >2x slower than gnu awk. (I havenā€™t tested if itā€™s the list-streaming vs streamly or the IORef vs pure insertWith that makes the difference in this case.)

1 Like

@unhammer Interesting, and thank you for sharing the results!
Yes, haskell is slow (at mutability-involving tasks). However some people try to deny and think otherwise, which I think is quite unhealthy.

I still wish mutable data got more love in haskell, such data structures could be faster with devotion in place (while still could be slower than mutable-by-default languages).
Eh, but what can I do.