Base proposal around vector-like types

I’m not sure how robust/sound fusion on top would be, but it’s certainly a worthy experiment. I assume we are only discussing dense/fixed-size/boxed sequences? (e.g. neither sparse vectors nor grapheme-based indexing on top of Unicode are goals here?)

1 Like

I’m not sure how robust/sound fusion on top would be

I don’t see any reason to worry about soundness. As far as robustness is concerned, I think it would actually be an improvement over the status quo (at the expense of some verbosity). Fusion in text and vector is already a layer on top of a plain array. The difference is that these packages try to make fusion transparent to the user, despite the fact that it will very often fail unless the user makes a conscious effort to ensure that the consumer and producer see one another. This leads to performance that is difficult to reason about since an important property of the program’s runtime behavior is not tracked in its types.
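To make that concrete, here is a minimal sketch of the fragility (standard vector idioms; whether fusion actually fires depends on rewrite rules and optimization flags, not on anything visible in the types):

```haskell
import qualified Data.Vector.Unboxed as VU

-- Producer and consumer see each other: the intermediate vector
-- produced by `map` is typically fused away.
fused :: VU.Vector Int -> Int
fused = VU.sum . VU.map (* 2)

-- Sharing the intermediate result defeats fusion: `v` is consumed
-- twice, so it must be materialized, and nothing in the types says so.
notFused :: VU.Vector Int -> Int
notFused xs = VU.sum v + VU.product v
  where
    v = VU.map (* 2) xs
```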

I assume we are only discussing dense/fixed-size/boxed sequences? (e.g. neither sparse vectors nor grapheme-based indexing on top of Unicode are goals here?)

Correct, the suggestion is merely to introduce some simple vector types for common needs. Sparse vectors or any more sophisticated text structures would be out of scope in my opinion.

5 Likes

So a question I have is: what, if any, cool new features do we need to deduplicate vector, bytestring, and text? Because I would rather deduplicate them first and then worry about cool temp pinning, levity, etc. etc.

2 Likes

What do you mean by temp pinning?

I agree on the dedupe first.

1 Like

What do you mean by temp pinning?

The GC pinning ideas discussed earlier. I just use that as an example of a new feature not found in any of the libraries today that might need RTS support, and thus would be a reason to coordinate development with GHC.

I agree on the dedupe first.

Great! I know you’ve thought about the vector/bytestring duplication before, so it’s reassuring that we can in fact do it out of band.

1 Like

If GHC CI built amazonka, I’m sure we would improve GHC performance.

Even though I hate long compile times as much as you can imagine (I build amazonka a lot…), I’m still :+1: for taking this slight risk with vector, as (if it becomes a problem) it would inevitably lead to improvements in GHC performance (especially if we improve GHC internals with it, as you mention).

2 Likes

I highly respect the path you are working through, and I thank you for openly contributing to this effort. If anything, Z-haskell can give us a playground and space for validation. Unfortunately, the situation with GHC is such that, if the Z-haskell effort is going to merge with GHC’s path, you would need to reduce the amount of breakage or otherwise help make that transition more graceful - it won’t be accepted otherwise. As I see it, the challenge would then be about making the best of Z Haskell in GHC, upstreaming what can be upstreamed with minimal fuss. That is probably not easily workable right now, but maybe it could be in the future.

1 Like

I agree. Temporary pinning is not a trivial change. While I have recently been pondering the possibility of using the non-moving collector’s allocator for pinned allocations (which would significantly reduce the fragmentation potential of such allocations), I think we have bigger fish to fry. Moreover, I don’t think it is strictly necessary in order to improve on the status quo: large allocations are already pinned, and copying small values is generally quite cheap.

For this reason, my sense is that, if we want them, we could have uniform vector types with our current treatment of pinning.

2 Likes

If GHC CI built amazonka, I’m sure we would improve GHC performance.

Perhaps this is true; I certainly don’t doubt that it would help. We even have the infrastructure for making this happen: every night GHC’s CI builds the entirety of head.hackage. Various compile-time statistics are collected from these runs.

However, at the moment we do not have enough people to maintain head.hackage to allow us to keep a package like amazonka (or rather, its sizeable transitive dependency set) building. This is an area where we could really use more hands.

1 Like

And I’m choosing this because I think it’s an underappreciated problem, and one that we can actually solve, now that we have real collaboration between GHC, core library maintainers, and alternative prelude authors.

Hear, hear!!

So for our first step in this journey, my recommendation is simple: let’s unify these types! For ease of transition, we can think of ByteString and Text as newtype wrappers for now. But as you’ll see in the continuing discussion, I really have different plans.

And with that change in place, we no longer have a distinction between the representations of Text and ByteString. A Text in fact becomes a simple newtype around ByteString with some character-encoding rules (ideally UTF-8) layered on top. This would avoid a lot of the performance overhead around functions like encodeUtf8 and decodeUtf8, and hopefully make the entire I/O system a bit easier to work with.
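Concretely, the shape being described is something like the following sketch (Bytes and isValidUtf8 are placeholders for illustration, not a proposed API):

```haskell
import Data.Word (Word8)

newtype Bytes      = Bytes [Word8]     -- stand-in for the unified packed representation
newtype ByteString = ByteString Bytes  -- arbitrary bytes
newtype Text       = Text ByteString   -- invariant: the bytes are valid UTF-8

-- With a shared representation, encoding is free and decoding is only
-- a validation pass; neither direction has to copy.
encodeUtf8 :: Text -> ByteString
encodeUtf8 (Text bs) = bs

decodeUtf8 :: ByteString -> Maybe Text
decodeUtf8 bs
  | isValidUtf8 bs = Just (Text bs)
  | otherwise      = Nothing

-- Placeholder: a real implementation would validate UTF-8 byte sequences.
isValidUtf8 :: ByteString -> Bool
isValidUtf8 _ = True
```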

This type of unification helps us address so many other problems!

And finally, I think these types need to go into base. Yes, I already hear the arguments about split-base popping up. Let me clarify. I’m not saying the first version ever of these types goes into base. Let’s get the GHC changes necessary in place, write an external library providing these types, and then move them into base.

I like that this proposal can move forward without getting tripped up on tangential issues.

I’m :100: for finding a way forward via small changes we can all commit to, and leaving other topics like “splitting up base” for separate proposals.

I’m not recommending we put all of the vector package into base. Far from it. I’m recommending we construct a well designed, hopefully-universal packed data representation, standardize it, and put it in base where it can be used by everything.

As it stands today, most of the I/O functions included in base work on Strings. We almost all agree that we shouldn’t be using String, but its usage persists because it is the only base-approved type. Let’s fix that.
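For instance, the Prelude’s file I/O still hands you a lazy list of boxed Chars:

```haskell
import Data.Char (toUpper)

-- readFile and putStrLn from base work on String, i.e. [Char].
main :: IO ()
main = do
  s <- readFile "input.txt"
  putStrLn (map toUpper s)
```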

In any event, that’s my goal here: put these types deep into base, make them ubiquitous, and move most/all existing code over to use them.

Yes, all of that! Sounds great! :100:

3 Likes

What does that maintenance look like? Which tasks need more hands?

Can you write up a list somewhere and point us towards it? That sounds like work I might be able to help with, and I would guess others might be interested as well.

I firmly believe it’ll be easier to improve GHC’s performance if we are paying attention to how libraries like amazonka, vector, etc. perform.

3 Likes

If I have a Vector a with no constraints on a, it must be a boxed Vector. This is unfortunate, because for many types this is a very poor representation. We would like some method to automatically select the best representation for a Vector.

I think this is simply not possible in the current state of Haskell. If there’s no constraint on the element type, we could put anything into the vector. Boxing ensures that we can generate code which works with any value. But switching to a nonuniform representation means that polymorphic functions have to be able to work with all possible representations, and even be able to make that choice at runtime. I think that with enough indirection it’s always possible to push that choice to runtime, but such dispatch wouldn’t be free either.

It looks like a huge change to GHC; a multiyear undertaking at best. The current solution is to carry around a constraint which describes how to represent an array of things.
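That is what vector’s generic interface does today: the constraint is the evidence for the representation, and the caller picks it.

```haskell
import qualified Data.Vector         as V   -- boxed
import qualified Data.Vector.Generic as G
import qualified Data.Vector.Unboxed as VU  -- unboxed

-- The G.Vector v Int constraint describes how elements are stored;
-- the same code runs against either representation.
double :: G.Vector v Int => v Int -> v Int
double = G.map (* 2)

boxedResult :: V.Vector Int
boxedResult = double (V.fromList [1, 2, 3])

unboxedResult :: VU.Vector Int
unboxedResult = double (VU.fromList [1, 2, 3])
```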

2 Likes

Beyond this, I’m also far from convinced that this is actually something we want. Making the choice of representation automatic sounds good on paper, but in practice it introduces yet another source of uncertainty regarding performance. In general, data representation choices are some of the most critical decisions that an author must make when designing their program. Leaving these to the compiler seems to me to be a mistake.

In my opinion our libraries already make it harder to reason about performance than necessary (e.g. due to implicit fusion and the heavy reliance on poorly-tested rewrite rules). Making data representation implicit would only worsen this problem.

1 Like

Has this actually been tried in practice already?

I think all of the unpredictability would be caused by the unpredictability of laziness and strictness. If a data structure is used in a strict way and it is possible to unbox it, then it should always be unboxed. Arguably, Vectors should always be strict, or at least strict by default (I did find a compelling use case for lazy vectors), and it should be immediately obvious whether unboxing is possible by just checking if the element type is “unbox-able” (e.g. an instance of Unbox). If a function is polymorphic in the element type, then two versions of that function could be created, one for each possibility.

To draw an analogy with laziness: I think it would be really bothersome if I were forced to explicitly specify the strictness properties of all variables. So maybe Haskell would be a nicer language if the compiler could automatically figure out details like boxed vs. unboxed representations? Of course, it should always be possible to add annotations, but I don’t think they should be required.

1 Like

We could have our cake and eat it too if we just had a templating/monomorphizing quantifier for runtime reps. I’d argue even {-# UNPACK #-} is too magical compared to that, as it makes pattern matching potentially more costly: the unpacked contents are implicitly reboxed when they are used in boxed contexts.
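A small illustration of the reboxing point (standard GHC behavior; the actual cost depends on how much the optimizer can see):

```haskell
-- The two Doubles are stored flat inside Point...
data Point = Point {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- ...but returning one into a boxed context (here, Just) forces GHC
-- to allocate a fresh Double box for it.
firstCoord :: Point -> Maybe Double
firstCoord (Point x _) = Just x  -- x is reboxed here
```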

All this is more work than just deduplicating the libraries we’ve got, so I’d love for it to proceed in parallel, with us only worrying about the combination after we’ve done both.

2 Likes

Eric, is this the staged binder/quantifier we’ve talked about? (Because it sounds like that’s what you’re alluding to :slight_smile: )

1 Like

John :stuck_out_tongue: (no worries, the pandemic is long), but yes.

1 Like

I think I was just being slow and truncating internet handles yesterday :sweat_smile::sweat_smile:

2 Likes

Has this actually been tried in practice already?

I don’t think so; it looks very, very difficult to me.

The main problem with unboxing vectors is parametric polymorphism. As an example, consider a function with the type foo :: (∀a. Vector a → Vector (Bar a)) → X. Any function passed to foo has to work with each and every element type, whether it’s unboxed or not. It also must be able to construct values of Bar a, which again could be unboxed or not.
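Spelled out with boxed vectors (so it compiles today) and a hypothetical Bar, the argument to foo is used at several element types, each of which could in principle demand a different representation:

```haskell
{-# LANGUAGE RankNTypes #-}
import qualified Data.Vector as V

newtype Bar a = Bar a

-- The callback must work at every element type; with nonuniform
-- representations it would need a code path for each one.
foo :: (forall a. V.Vector a -> V.Vector (Bar a)) -> Int
foo f = V.length (f (V.fromList "abc"))
      + V.length (f (V.fromList [1 :: Int, 2, 3]))

-- e.g. foo (V.map Bar) == 6
```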

it should be immediately obvious if it is possible to unbox by just checking if the element type is “unbox-able” (e.g. an instance of Unbox)

It’s easy in the monomorphic case (relatively; orphans make it more complicated). But once you start writing polymorphic functions, it becomes an impossibility. There’s no way to recover dictionaries at run time.

Similarly, any monomorphisation scheme needs an answer for polymorphic recursion, which is allowed in Haskell.
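The classic example: the set of types at which depth is called is not statically bounded, so there is no finite set of monomorphic copies to generate.

```haskell
-- Polymorphic recursion: the recursive call is at type Nested [a],
-- so the element type keeps growing.
data Nested a = Flat a | Deeper (Nested [a])

depth :: Nested a -> Int
depth (Flat _)   = 0
depth (Deeper n) = 1 + depth n
```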

1 Like

@Shimuuar agreed, and dictating a “one true” unboxing has its own problems too :slight_smile: (I hope to revisit some ideas I’ve had for making such a toolkit in the future, but other things demand more work first :slight_smile: )

Hence the allusion by @Ericson2314 and myself that we need to think about “staged / C++ constexpr templating”-style lambdas and foralls if we want a “functor-friendly” auto-unboxing UX that has predictable rules and properties.

1 Like