Base proposal around vector-like types

adamse · April 1, 2021, 11:16am

With current ByteString it is possible to have it backed by a mmaped file, this seems like a good feature. Would we lose this if ByteString was moved to use something that is not ForeignPtr?

snoyberg · April 1, 2021, 1:35pm

mmaping came up in the discussions and @bgamari seemed to think it would be supported. But I don’t know enough of the details of how it will work.

bgamari · April 1, 2021, 6:20pm

Sigh, I actually realized that I may have missed something in my discussion with @snoyberg. In particular, I was thinking that we would take advantage of @dcoutts’s off-heap ByteArray# support to facilitate mmap'd ByteArray#s. However, this could be a bit tricky to accomplish since we would need to ensure that the ByteArray# header would need to precede the mapping. It’s not immediately clear to me how we can guarantee this in general. It’s possible that @dcoutts had something in mind here.

adamse · April 3, 2021, 6:38am

The recent updates to the ticket seem to address this. It looks like it would work to me :)!

fede · April 3, 2021, 9:58am

Improving the situation with vectors is great. I would like to mention that in scientific applications peoples like to work with multidimensional arrays. It would be great if the new vector would be suitable to perform fast operations on multidimensional arrays.

daylily · April 2, 2022, 3:22am

It’s been a year since this got posted (and Text is already UTF-8), I’m curious if there is any proposal as of now? Really interested to see this get implemented.

From what I read from Michael’s posts, what we hope to do is basically add a vector type to primitive, and move ByteString to be implemented on that (then maybe move Text to be implemented on ByteString). That looks like a very clear goal. Maybe I can try to help with that.

One question I have is: how will vector fit in? I do not use it frequently (it takes a decade to compile and most of the times either primitive or containers work better for me), but I have a vague idea that it tries to do data storage and streaming at the same time. The vector type being added to primitive could completely replace data storage function of vector, and there are other (probably better) streaming libraries too. Is it possible that vector will be eventually phased out?

daylily · April 2, 2022, 6:16am

I just found out there is a library contiguous that abstracts the array types from primitive quite nicely: contiguous: Unified interface for primitive arrays

It has a class Contiguous That unifies most array operations
It has a type Slice that provides basic vector functionalities for all Contiguous Types.

Bodigrim · April 2, 2022, 5:12pm

Not gonna happen. ByteString is to remain backed by ForeignPtr, and Text does not benefit from unification with vectors, because there are few common operations.

jaror · April 3, 2022, 1:52pm

Most of the API seems to overlap, but only if you can have vectors with variable size elements. Is that really impossible to overcome? I think the Contiguous class is general enough for that.

Also, text (and String/Char) uses unicode code points, but I would like to see a library that uses (extended) grapheme clusters. I think code points are a useless middle ground.

Ericson2314 · April 3, 2022, 4:55pm

Has that been elaborated upon in a blog post? It would be great to have a fully laid out "challenges unifying vector and bytestring" because I have no idea what they are.

Conversely, I do miss all the container interfaces of Java, and I wonder if we should lean on something like mono-traversable (or just optics?!?) more in the Prelude so that people feel less compelled to to make choices like Map vs HashMap or Bytestring vs Vector Word8 when it doesn’t in fact matter to what they are doing.

Bodigrim · April 3, 2022, 7:54pm

“Vector with variable size elements” is an oxymoron. If you read closer, the only thing Michael proposes to unify is Data.Text.Array and Data.Primitive.ByteArray, which from my perspective is of limited utility and not what people usually assume by unification of Text and Vector.

The challenges are the same as for unifying lists and trees, apples and oranges: ByteString and Vector are vastly different data types for different purposes. If ByteString does not serve well your use case, take Vector instead, problem solved.

Ericson2314 · April 3, 2022, 8:12pm

The challenges are the same as for unifying lists and trees, apples and oranges: ByteString and Vector are vastly different data types for different purposes. If ByteString does not serve well your use case, take Vector instead, problem solved.

Well in that case I think the “apple and orange”-ness is far less well known then!

Bodigrim · April 3, 2022, 8:27pm

You do not ask to explain why lists and trees are different, right? The default is that since ByteString and Vector do not share a single line of code, they are different, and GHC is happy to check this automatically. If you wish to argue otherwise, the burden of proof is on you, not on me.

jaror · April 3, 2022, 8:50pm

As mentioned in the blog post that started this thread, the main difference between the representation of bytestring and the representation of unboxed vectors of Word8s is that bytestring is pinned to make it easy to pass them to foreign C functions (such as printing functions) while unboxed vectors are unpinned which means they play better with the garbage collector.

I remember reading/hearing recently that large unboxed vectors are already pinned, so that isn’t a problem and small vector can easily be copied without too much overhead. So, I think with that strategy bytestring could switch over to a ByteArray# backed (vector-like) data type.

Ericson2314 · April 3, 2022, 9:11pm

@Bodigrim thats an odd argument to me. People write code which overlaps in purpose all the time. @jaror mentioned the original blog post, which comes to the opposite conclusion that the differences are superficial, and by being able to just mutate whether memory is pinned, we could erase the final non-trivial distinction.

All this just makes me want a blog post retort more!

Bodigrim · April 3, 2022, 9:20pm

The main difference is that ByteString is a foreign pointer (and pinnedness is just a consequence), while Vector Word8 is a ByteArray# heap object (which could be pinned or unpinned). Yet again, I must emphasize that these are just different data types serving different use cases, same as lists and trees. Why people feel an insistent urge to force their unification at cost of both maintainers and users is beyond my comprehension.

jaror · April 3, 2022, 9:48pm

As I see it, maintainers’ jobs would get easier if these types are unified because they would have to maintain less code. And the main reason for unification is to reduce the amount of choice for users and present one solution that fits all their needs.

What is the benefit of using a foreign pointer if it is not pinnedness? Which use-cases require foreign pointers that cannot also use “normal” ByteArray#s? I don’t see why ByteString has to be inherently “foreign”.

Ericson2314 · April 3, 2022, 9:58pm

Well let’s really drill down here.

Lists vs trees is about an intrinsic property of the data structure. Here, however, the intrinsic properties are roughly the same. What is varying instead how how the memory is managed. In a language like C++ or rust this is closer to the allocator parameter on std::vector/Vec.

I did some spelunking because I do not know the details:

bytestring/Data/ByteString/Internal.hs at 707e50bac1f1ef7c1729a9246e05f95a4909fd27 · haskell/bytestring · GitHub allocates bytes on master today
GHC.ForeignPtr.mallocPlainForeignPtrBytes is what it’s a wrapper around.

This function is similar to mallocForeignPtrBytes, except that the internally an optimised ForeignPtr representation with no finalizer is used. Attempts to add a finalizer will cause an exception to be thrown.

raises eyebrows. Foreign, but no finalizer??
newPinnedByteArray# is the underlying primop its the docs say

Like newByteArray# but GC guarantees not to move it.

Meanwhile…

vector/vector/src/Data/Vector/Primitive/Mutable.hs at dae9d1739f760027100458928ab96515ecb57cdf · haskell/vector · GitHub Internal module Data.Vector.Primitive.Mutable implements basicUnsafeNew with Data.Primitive.ByteArray.newByteArray.
Data.Primitive.ByteArray.newByteArray uses newByteArray#

So yes, while using ForeignPtr could be intersteresting allowing one to transfer ownership to a foreign library that would deallocate the bytestring, in practice we are actually just getting two marginally different GHC-managed heap objects after all!

@Bodigrim I am sorry but it feels like you boldly asserted something, which sent me on this wild goose chase chasing variable bindings, just to figure out the thing the original blog post from @snoyman casually asserted without much detail is in fact exactly true.

Bodigrim · April 3, 2022, 10:05pm

@Ericson2314 it seems you assume that all ByteStrings are created via mallocByteString. That’s not so, you can use Data.ByteString.Internal.fromForeignPtr with any ForeignPtr instead.

Ericson2314 · April 3, 2022, 10:21pm

OK! Then the last thing I wrote is wrong, and there are yet more subtleties. But then I switch back to arguing merely that all this is quite subtle, far from obvious, and deserving a nice blog post!

It is also interesting to me that ForeignPtr is such a highly flexible data type in many respects, but only supports pinned memory. That makes sense for something called ForeignPtr — and mere pinning is enough to safely make an Addr#. But the use-cases of extendable flat memory include the “pin it sometimes not others” usecase @snoyberg mentions, which still feels natural and obvious, and is a dynamism ForeignPtr doesn’t support.

I still feel what we have here is a tower of somewhat awkward abstractions, totally justified for some things, but used in other ways that are far less justified. The accumulation of stuff here, and end result of Vector Word8 vs ByteString does feel like an accident of history, and not something I feel like justifying to new users!

I really do want to clean all this stuff up, from the lowest level RTS assumption of pinness being something that doesn’t vary over the lifetime of a heap object, to the highest level bits of making a Vector/ByteString replacement that is parametric over these details.

For me, this sort of “pull things apart, shake them until they make sense” is why I like breaking up base, though I understand why other people agree with these problems, see the coordination failures, and draw the 180° opposite conclusion that we should throw it all in one monorepo to sort it out.

Topic		Replies	Views
Unified Vector :: Haskell Weekly podcast Links	1	632	March 22, 2021
Tech Agenda Track: Meeting Minutes 3/17 Haskell Foundation	0	778	March 19, 2021
Vector type proposal discussion, blog Haskell Foundation	11	1465	April 5, 2021
Tech Agenda Track: Meeting Minutes 3/30 Haskell Foundation	0	582	April 2, 2021
Informal discussion about the progression of `base` Core Libraries Committee	158	8339	May 6, 2024

Base proposal around vector-like types

Related topics