Base proposal around vector-like types

snoyberg · March 16, 2021, 7:09pm

Following up on the previous discussions around alternative standard libraries and preludes, I wrote up a proposal for what I’d like to see to address IMO a major pain point today: the proliferation of different packed memory representations.

https://www.snoyman.com/blog/2021/04/haskell-base-proposal/

I’m going to add a link to this Discourse thread in the blog post as the official discussion point.

iustin · March 16, 2021, 7:51pm

Good proposal. Seeing String finally going away (as the default type) would be awesome!

tonyday567 · March 16, 2021, 7:52pm

This is the base I dream about using!

There seems to be broad support that, yes, we would like to have a crack at fixing base, and we’d like to refactor across several fronts at once.

Can I suggest that a sane way of doing this, managing projects that are mixing core libraries and base, would be for the community to cohere around a few core projects, that have names like:

base-packed (ie this proposal)
base-num
base-nopartials
base-promoteseq

and then feed into something like base-foundation that is a draft new base, available via an import Prelude.Foundation.

Loosely coupled, I would say that they would best be hosted in the one spot.

oswald2 · March 16, 2021, 9:34pm

I think this is a great idea. It addresses some points that have always bothered me.

winterland1989 · March 17, 2021, 3:34am

Haskell Foundation sponsor here:

We’ve been working on this unified idea for a long time, and all the proposed stuff is already implemented in Z-Data, part of Z.Haskell. Plus we have:

Generic interface to manipulate array and array slices.
UTF8 based text with Unicode 13.0 processing.
Very fast Builder and Parser.
RE2 regex.

The original plan was to announce Z.Haskell when our crypto lib is ready. But it seems the topic of changing the base comes up a lot these days, and we do hope people become aware of our effort.

That being said, we do hope our effort can be merged into the base, and we’re ready to contribute if that happens. But that will be very difficult for other people because it will cause massive breakage. We would suggest another way:

Haskell Foundation could start a project incubator and support new eco-systems like Z.Haskell by giving incubation support (mainly publicity), like what the Apache Foundation does with their Java big data eco-systems. Haskell Foundation would guide the overall competition and iteration.

From our use case (mainly network engineering), it seems our community has spent too much time on topics such as extensible effect systems, which have their own merits. But to make a great engineering eco-system, programmers should focus on the basic parts, such as IO and crypto. People seem to have the impression that Haskell has solved those, but we’re really not satisfied with the status quo.

angerman · March 17, 2021, 3:49am

Yes! I do believe this is actually part of what the agenda said!

Ericson2314 · March 17, 2021, 4:28am

Hehe, I feel like a few things I said before are being responded to.

Unifying ByteString, Text, and Vector

I am absolutely for deduplicating Vector, Bytestring, and Text as described.

@snoyman says this is orthogonal to breaking up base, and I agree. The first step can be done by just making the other packages depend on vector. I think we agree there.

Unify unboxed and storable vectors

I don’t know enough about the stuff in this section but generally sounds nice.

Automatic Boxed vs Unpacked selection

I think the best tool we have for this is runtime reps, but those are egregiously underused. To make them ergonomic we need huge changes to libraries (see GitHub - ekmett/unboxed: experimenting with unlifted classes via backpack), but I also think we need a “monomorphizing”/“templating” quantifier just like Rust or (dare I say) the upcoming Go generics. (Or C++ templates, but no SFINAE and other jankiness that comes from the C++ designers arriving at templating from macro systems rather than type theory quantifiers.)

Put it in base!

I just don’t get why. I am still definitely for breaking up base and against putting this stuff in a non-broken up base.

Right now, Vector is often not used. I think a large part of that is the so-many-vector-types problem I’m trying to solve here.

I agree, and you do solve that! People don’t have an issue using Map, so I don’t think pulling stuff out of base is an issue. If you are worried about List being just too accessible, maybe let’s make List more external rather than making the unpacked ones more internal.

Another is that it takes so darn long for vector to compile. People don’t want to depend on it.

This I really don’t buy. If we can distribute prebuilt base, we can distribute pre-built other libraries. Full stop.

As it stands today, most of the I/O functions included in base work on Strings. We almost all agree that we shouldn’t be using String, but its usage persists because it is the only base-approved type. Let’s fix that.

A split base also fixes that by making that stuff less forced to be at the bottom of the dependency graph, and thus free to use better types.

With a solid Vector type in base, we can then begin to develop sets of functionality around that type in external libraries. People can iterate on those designs, and then we can consider standardizing, either in base, in a CLC-approved library, or in a new split-base world.

But we can experiment just as much, or more, while keeping things among vector, bytestring, and text. If we need to make more cross-cutting changes atomically, let’s put those 3 packages in a monorepo, but leave GHC off the critical path.

Adoption

I hadn’t realized that this post initially came off as a massive breaking change in the language and library ecosystem. So let me clarify. I believe that the changes above can be made immediately with zero breakage (though plenty of effort).

Agree, the internal representations of vector, bytestring, and text are not stable (as far as I know) so we should be able to de-dup without issue.

emilypi · March 17, 2021, 5:22am

I think the best tool we have for this is runtime reps, but those are egregiously underused. To make them ergonomic we need huge changes to libraries (see GitHub - ekmett/unboxed: experimenting with unlifted classes via backpack)…

This was my first thought when I saw this - a RuntimeRep-polymorphic vector class (type?). It would be a vast improvement over the existing Storable|Primitive|Unboxed situation. Carter and I proposed a backpacky version of this for Vector some time in 2019, but I am very willing to sacrifice that backpackiness of the implementation to just get it done using techniques like what we see in Ed’s unboxed.
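For readers who, like some later in the thread, have not met the technique: the following is a minimal sketch of what "classes over runtime reps" can look like, using only GHC.Exts from base. The class name Addable and the helper names are hypothetical and chosen for illustration; this is not the API of ekmett/unboxed or of any vector proposal. The point is only that a single class can have both boxed and unboxed instances once its parameter ranges over TYPE r.

```haskell
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE PolyKinds #-}

import GHC.Exts (Double (D#), Double#, Int (I#), Int#, TYPE, (+#), (+##))

-- Hypothetical class: the parameter may have any runtime representation r,
-- so instances can be boxed (Int) or unboxed (Int#, Double#).
class Addable (a :: TYPE r) where
  add :: a -> a -> a

instance Addable Int where
  add x y = x + y

instance Addable Int# where
  add x y = x +# y

instance Addable Double# where
  add x y = x +## y

-- The same method name used at a boxed type ...
sumBoxed :: Int
sumBoxed = add 3 4

-- ... and at unboxed types, with the results re-boxed for inspection.
sumUnboxedInt :: Int
sumUnboxedInt = I# (add 3# 4#)

sumUnboxedDouble :: Double
sumUnboxedDouble = D# (add 1.5## 2.0##)
```

Each instance is compiled at one concrete representation, which is why this works without any runtime dispatch on boxedness; writing code that is generic over r beyond class methods is where the ergonomic problems mentioned above begin.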

I just don’t get why. I am still definitely for breaking up base and against putting this stuff in a non-broken up base.

I think this gets to a point that I need a little convincing on as well. Personally, I would like to see two things happen with base, and neither necessarily contradicts what @snoyman is proposing, just refactors it slightly: I would like to see base be the compat layer between GHC and the user along with its Prelude, and propose the addition of a std library package which has the user-facing std library features we want exposed (e.g. re-exports from containers, built-in container types and so on) that hook into the compat layer in base.

To me, the notion of base is cluttered: it mixes GHC’s stable interfaces for its primops with its Prelude, and then also mixes those two with convenience modules. It’s not a very good example of a single-responsibility package if there ever were one!

This I really don’t buy. If we can distribute prebuilt base, we can distribute pre-built other libraries. Full stop.

Everything else I happen to agree with Michael on. I can’t wait to hash this out in our meetings.

snoyberg · March 17, 2021, 6:05am

@winterland1989 I won’t say I’d never heard of Z-Data before, but I definitely don’t remember hearing of it before. Looking at it now, it looks really nice. It’s not quite in line with what I lay out in my blog post, but a lot of that can be laid at the feet of lacking support within GHC. Specifically the temporary-pinning and semi-open type families would be relevant. From my understanding, it looks like PrimVector is using unpinned memory, and then making temporary copies for cases where pinned memory is needed; is that correct?

To the multiple responses along the lines of “it still shouldn’t be in base”: I disagree for two reasons:

We’re talking about a nebulous split-base alternative, which doesn’t exist today, and hasn’t come into existence. I urge everyone to stop making the perfect the enemy of the good. If (and it’s a big if) we can all agree that some vector representation is a good thing, and we all agree on what it should look like, and then we get blocked at the finish line on fighting about how to split up base, we all lose.
Even in a split-base world, I believe this kind of datatype is central enough that it will have specific tie-ins with the compiler, such as automatic typeclass deriving special-cased ala Show.

I hear the argument that having a library shipped with GHC is the same as having it in base. I understand the logic there. I’m observing that, empirically, it’s not true. People do treat it differently. Even the Map example applies here: I’ve seen plenty of newcomers accidentally use the almost-useless Data.HashTable instead of Data.Map simply because it was there. I’m completely convinced that will apply with String as well.

If you are worried about List being just too accessible, maybe let’s make List more external rather than making the unpacked ones more internal.

I’m not opposed, but good luck with that. Again, I’m trying to propose something that has a chance of succeeding. I think we have a better chance of including a packed representation in base than removing lists from their hallowed position.

I think the best tool we have for this is runtime reps, but those are egregiously underused.

This is a technique I’m not familiar with. I’ll have a look, sounds interesting.

To me, the notion of base is cluttered:

I don’t disagree. I also see this mentality leading towards no progress until the Gordian Knot of split-base is cut. I think we’ve done enough of that already.

winterland1989 · March 17, 2021, 6:42am

From my understanding, it looks like PrimVector is using unpinned memory, and then making temporary copies for cases where pinned memory is needed; is that correct?

Yes, but we only need explicitly pinned memory when the user has to deal with safe FFI, and the copy is not necessary when the byte array is large enough (implicitly pinned).
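As a rough illustration of the pinned versus implicitly pinned distinction, here is a small sketch using only the byte array primops that GHC.Exts exports from base. The two function names are hypothetical, and this is not Z-Data code; it only asks the runtime whether a freshly allocated array counts as pinned.

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}

import GHC.Exts
  ( Int (I#)
  , isMutableByteArrayPinned#
  , isTrue#
  , newByteArray#
  , newPinnedByteArray#
  )
import GHC.IO (IO (..))

-- Allocate n bytes with the ordinary (unpinned) primop and ask the RTS
-- whether the resulting array is treated as pinned.
pinnedAfterPlainAlloc :: Int -> IO Bool
pinnedAfterPlainAlloc (I# n) = IO (\s0 ->
  case newByteArray# n s0 of
    (# s1, mba #) -> (# s1, isTrue# (isMutableByteArrayPinned# mba) #))

-- Same question for an array allocated with the explicitly pinned primop.
pinnedAfterPinnedAlloc :: Int -> IO Bool
pinnedAfterPinnedAlloc (I# n) = IO (\s0 ->
  case newPinnedByteArray# n s0 of
    (# s1, mba #) -> (# s1, isTrue# (isMutableByteArrayPinned# mba) #))
```

A small array from pinnedAfterPlainAlloc is expected to report unpinned, while a sufficiently large one (for example 1 MiB) is allocated as a large object that the garbage collector does not move, so it reports pinned without any explicit request. That is the case in which the temporary copy can be skipped.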

In Z.Haskell, we provide enough tools to deal with directly passing unpinned bytes to unsafe FFI, so overall it’s rarely a problem for us. The function above is just provided for some exotic usage.

Unboxed vs boxed differences really don’t matter in Z.Haskell either: vector combinators work on all kinds of vectors and arrays, even on unlifted ones such as UnliftedArray. We have a general array class for dealing with different array types.

angerman · March 17, 2021, 7:15am

I am in general in favour of cleaning this up. However, this one has me concerned:

Another is that it takes so darn long for vector to compile. People don’t want to depend on it.

This I really don’t buy. If we can distribute prebuilt base, we can distribute pre-built other libraries. Full stop.

I’m not fundamentally opposed to putting vector into ghc, as it might provide better data structures for ghc as well. But if this means the validation times (compile + test) of ghc itself suffer significantly, I’ll be very much against it. Shipping vector as a prebuilt library is a different thing: we could just ship a larger set of libraries by default, and the cost would be paid only once per release. If, however, we end up adding vector to ghc, and that adds another 10min to ghc’s compilation time, and more time to the test suite, it will make ghc’s terrible CI turnaround times even worse.

So yes, I don’t want to prohibit people from using vector because it’s too slow to compile, at the same time I don’t want to be the person that ends up compiling vector all the time while working on ghc.

blamario · March 17, 2021, 2:35pm

Your concern is completely orthogonal to the question of whether vector should be merged into base or shipped with ghc but separately from base. The CI will need to build and test the vector code regardless of whether it’s inside or outside of base.

Ericson2314 · March 17, 2021, 2:54pm

@snoyberg I completely agree: don’t block this stuff on splitting base. With a first version of just gutting bytestring and text and making them use vector, I see no need to!

I would say start with that first version, and if that wraps up with no progress on splitting base, then we get serious about adding things to base. That’s a gamble I’m willing to take, if only to light a fire under splitting base :). I’m pretty confident in my “splitting base requires backpack and good orphans” thesis, both for why it hasn’t happened yet and for why there’s no strong reason it couldn’t still.

angerman · March 17, 2021, 2:59pm

@blamario if it’s not shipped with ghc and not merged into base, it won’t have any effect on ghc’s CI. That was what I was trying to bring across. If it’s merged into base, or shipped with GHC, that will affect ghc’s CI, and that is where I want to highlight that this has a non-negligible cost. It effectively means everyone working on ghc, will have to build vector as well during validation runs.

Then again if we just ship vector as an extra package during release, that is a different story. In that case CI won’t have to deal with vector until releases are cut; it is essentially just an additional battery that’s being included.

I am aware that my viewpoint is not necessarily similar to those who use ghc.

blamario · March 17, 2021, 6:55pm

Does the splitting of base really need to happen first? I may be missing some subtleties, but I don’t see what prevents us from renaming the existing base to old-base-to-be-split, and then shipping with GHC a new base package that consists solely of thin modules like

{-# LANGUAGE PackageImports #-}
module Data.List (module Data.List) where
import "old-base-to-be-split" Data.List

I expect there’d be no backward compatibility break here, right?

After this step, splitting any particular module out from old-base-to-be-split would consist of, well, splitting it out into another package, and at the same time changing the import declaration in the new thin base. Again no compatibility break.

carter · March 18, 2021, 2:13am

several tricky things I’d like to highlight:

bytestring and text both have adhoc copies of a particular flavor of stream fusion in the style of vector.
(and via something like backpack, their internals could largely be supplanted )
backpack!
a) type Bytestring = Vector Word8; type Text = Vector Word16 etc are viable “embeddings” (let’s ignore unicode code points vs unicode text graphemes for the moment)
b) this is ignoring that for all of these we want the whole 4 way cartesian product of pinned vs unpinned X onheap vs offheap
for the purposes of robust and predictably good performance (and good compile times!), stream fusion isn’t actually the current state of the art (of fusion engineering!). And any good approach should be coupled with making sure we improve how robustly and performantly ghc can support these!

a) the current “best” starting points for good end user fusion that isn’t fragile are tools like http://amosr.amospheric.com/papers/robinson2017merges.pdf (gh link at https://github.com/amosr/folderol), along with this very very different gem by our favorite https://arxiv.org/abs/1612.06668 (an oleg staged programming collaboration).

b) you’d then wanna pair these ideas with some of the improvements in how we can support these optimizations, https://github.com/egraphs-good/egg and associated reading https://egraphs-good.github.io/ have some really good ideas.

on a totally orthogonal note, i do think robust support for SOA (what we call unboxed) and AOS (what we call storable when it’s C compatible) is really important! and in dire need of some improvement.

Also how we design these has deep implications for how we can better do memmove/memset/memcpy for haskell native types. (I did a deep look at how those are implemented by default on many platforms, and outside of glibc, it looks like all bets are off unless cpus do magic optimizations when they see assembly loops that correspond to those operations). Some of these nuances only come up when dealing with multi core reads/writes, but there’s a lot left on the floor performance wise with what even new ghcs do around this stuff. (i started a patch for this last winter, but then the crazy year happened)

edit: i’m touching on a lot of ideas here that are pretty tricky, i’m happy to explain in longer form if anyone’s interested

edit2: point being, i support seriously looking at how we can improve the current landscape rather than get stuck with historical design choice inertia.

carter · March 18, 2021, 2:17am

important footnote i forgot:

folding vector et al into base brings up something very important: perhaps primitive/primmonad should be in base?

bgamari · March 18, 2021, 3:06am

bytestring and text both have adhoc copies of a particular flavor of stream fusion in the style of vector.
(and via something like backpack, their internals could largely be supplanted )

I have discussed this particular point with @snoyberg. My understanding is that @snoyberg’s proposal is to drop (implicit) stream fusion from the vector-like types that would live in the standard libraries in favor of an orthogonal, explicit stream type. Frankly, I think that dropping fusion would be a Good Thing: the status quo delivers questionable benefits and does so unreliably. Furthermore, when fusion fails to fire it creates a great deal of work for the compiler (and can even make the program slower than had we not attempted to fuse).
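To make "an orthogonal, explicit stream type" concrete, here is a minimal sketch in the usual stream-fusion style, written against base only. The names Stream, Step, fromListS, mapS, filterS and foldlS are illustrative assumptions, not part of the proposal or of vector’s API. The user opts into the stream explicitly and composes transformers on it; nothing depends on rewrite rules firing.

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream over a hidden state s.
data Step s a = Done | Skip s | Yield a s

-- A stream is a stepper function plus an initial state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Produce a stream from a list; the remaining list is the state.
fromListS :: [a] -> Stream a
fromListS = Stream next
  where
    next [] = Done
    next (x : xs) = Yield x xs

-- Transformers only wrap the stepper; no intermediate structure is built.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done -> Done
      Skip s' -> Skip s'
      Yield a s' -> Yield (f a) s'

filterS :: (a -> Bool) -> Stream a -> Stream a
filterS p (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done -> Done
      Skip s' -> Skip s'
      Yield a s'
        | p a -> Yield a s'
        | otherwise -> Skip s'

-- The consumer runs the composed stepper in a single strict loop.
foldlS :: (b -> a -> b) -> b -> Stream a -> b
foldlS f z0 (Stream next s0) = go z0 s0
  where
    go !z s = case next s of
      Done -> z
      Skip s' -> go z s'
      Yield a s' -> go (f z a) s'
```

For example, foldlS (+) 0 (mapS (* 2) (filterS even (fromListS [1 .. 10]))) runs as one loop over the source. A packed vector type would then only need conversions to and from such a stream, rather than carrying fusion machinery itself.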

@snoyberg’s proposal is essentially to fold a set of primitive, unfused types into the standard library. I believe this would render primitive (at least its types) largely redundant. Indeed you likely would want something like PrimMonad, however.

In any event, I think the idea here would be to reimplement text, bytestring, and primitive on top of the new types provided by the standard library, ensuring a reasonably low-cost migration path.
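A rough sketch of what "reimplement on top of one core type" could look like, using only base. Everything here is an assumption made for illustration: the Vector type below is a toy backed by a ForeignPtr (so it covers only the Storable-style, pinned case), and the names ByteString, Text16, fromList, toList and vlength are hypothetical stand-ins, not the real bytestring or text APIs.

```haskell
import Data.Word (Word16, Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrArray, withForeignPtr)
import Foreign.Marshal.Array (peekArray, pokeArray)
import Foreign.Storable (Storable)
import System.IO.Unsafe (unsafePerformIO)

-- Toy core type: a pointer to packed elements plus a length.
data Vector a = Vector !(ForeignPtr a) !Int

-- Hypothetical downstream types expressed in terms of the one core type.
type ByteString = Vector Word8
type Text16 = Vector Word16

-- Copy a list into freshly allocated packed memory.
fromList :: Storable a => [a] -> Vector a
fromList xs = unsafePerformIO $ do
  let n = length xs
  fp <- mallocForeignPtrArray n
  withForeignPtr fp $ \p -> pokeArray p xs
  pure (Vector fp n)

-- Read the packed elements back out as a list.
toList :: Storable a => Vector a -> [a]
toList (Vector fp n) =
  unsafePerformIO $ withForeignPtr fp $ \p -> peekArray n p

-- The length is stored, so this is O(1) for every element type.
vlength :: Vector a -> Int
vlength (Vector _ n) = n
```

With this shape, a function such as vlength is written once and serves both ByteString and Text16. Real packages would more likely wrap the core type in a newtype to keep their own instances and invariants (for example, valid encoding for text).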

winterland1989 · March 18, 2021, 5:14am

If such a rewrite is going to be done, could you please consider a better byte array type? I proposed one here today, which has been floating in my head for a long time: https://github.com/ZHaskell/ghc-proposals/blob/gaddr/proposals/0000-add-GC-traceable-pointers.md

We have already done a lot of work based on the current byte array type in Z-Data, and I think a rewrite would be a huge duplication. You are welcome to use our code under the BSD license.

snoyberg · March 18, 2021, 6:22am

Ben’s comment is a good representation of my stance. To highlight some points:

I’m only talking about the core types moving into base, leaving open experimentation in the library ecosystem for different interfaces
Stream fusion (or any other kind of rewrite-rule-enabled fusion) isn’t part of the core type, but part of the interface on top
