With current ByteString it is possible to have it backed by a mmap
ed file, this seems like a good feature. Would we lose this if ByteString
was moved to use something that is not ForeignPtr
?
mmap
ing came up in the discussions and @bgamari seemed to think it would be supported. But I don’t know enough of the details of how it will work.
Sigh, I actually realized that I may have missed something in my discussion with @snoyberg. In particular, I was thinking that we would take advantage of @dcoutts’s off-heap ByteArray#
support to facilitate mmap
'd ByteArray#
s. However, this could be a bit tricky to accomplish since we would need to ensure that the ByteArray#
header would need to precede the mapping. It’s not immediately clear to me how we can guarantee this in general. It’s possible that @dcoutts had something in mind here.
Improving the situation with vectors is great. I would like to mention that in scientific applications peoples like to work with multidimensional arrays. It would be great if the new vector would be suitable to perform fast operations on multidimensional arrays.
It’s been a year since this got posted (and Text is already UTF-8), I’m curious if there is any proposal as of now? Really interested to see this get implemented.
From what I read from Michael’s posts, what we hope to do is basically add a vector type to primitive
, and move ByteString to be implemented on that (then maybe move Text to be implemented on ByteString). That looks like a very clear goal. Maybe I can try to help with that.
One question I have is: how will vector
fit in? I do not use it frequently (it takes a decade to compile and most of the times either primitive
or containers
work better for me), but I have a vague idea that it tries to do data storage and streaming at the same time. The vector type being added to primitive
could completely replace data storage function of vector
, and there are other (probably better) streaming libraries too. Is it possible that vector
will be eventually phased out?
I just found out there is a library contiguous
that abstracts the array types from primitive
quite nicely: contiguous: Unified interface for primitive arrays
- It has a class
Contiguous
That unifies most array operations - It has a type
Slice
that provides basic vector functionalities for allContiguous
Types.
Not gonna happen. ByteString is to remain backed by ForeignPtr, and Text does not benefit from unification with vectors, because there are few common operations.
Most of the API seems to overlap, but only if you can have vectors with variable size elements. Is that really impossible to overcome? I think the Contiguous
class is general enough for that.
Also, text (and String
/Char
) uses unicode code points, but I would like to see a library that uses (extended) grapheme clusters. I think code points are a useless middle ground.
Has that been elaborated upon in a blog post? It would be great to have a fully laid out "challenges unifying vector
and bytestring
" because I have no idea what they are.
Conversely, I do miss all the container interfaces of Java, and I wonder if we should lean on something like mono-traversable (or just optics?!?) more in the Prelude so that people feel less compelled to to make choices like Map
vs HashMap
or Bytestring
vs Vector Word8
when it doesn’t in fact matter to what they are doing.
“Vector with variable size elements” is an oxymoron. If you read closer, the only thing Michael proposes to unify is Data.Text.Array
and Data.Primitive.ByteArray
, which from my perspective is of limited utility and not what people usually assume by unification of Text
and Vector
.
The challenges are the same as for unifying lists and trees, apples and oranges: ByteString
and Vector
are vastly different data types for different purposes. If ByteString
does not serve well your use case, take Vector
instead, problem solved.
The challenges are the same as for unifying lists and trees, apples and oranges:
ByteString
andVector
are vastly different data types for different purposes. IfByteString
does not serve well your use case, takeVector
instead, problem solved.
Well in that case I think the “apple and orange”-ness is far less well known then!
You do not ask to explain why lists and trees are different, right? The default is that since ByteString
and Vector
do not share a single line of code, they are different, and GHC is happy to check this automatically. If you wish to argue otherwise, the burden of proof is on you, not on me.
As mentioned in the blog post that started this thread, the main difference between the representation of bytestring
and the representation of unboxed vectors of Word8
s is that bytestring is pinned to make it easy to pass them to foreign C functions (such as printing functions) while unboxed vectors are unpinned which means they play better with the garbage collector.
I remember reading/hearing recently that large unboxed vectors are already pinned, so that isn’t a problem and small vector can easily be copied without too much overhead. So, I think with that strategy bytestring could switch over to a ByteArray#
backed (vector-like) data type.
@Bodigrim thats an odd argument to me. People write code which overlaps in purpose all the time. @jaror mentioned the original blog post, which comes to the opposite conclusion that the differences are superficial, and by being able to just mutate whether memory is pinned, we could erase the final non-trivial distinction.
All this just makes me want a blog post retort more!
The main difference is that ByteString
is a foreign pointer (and pinnedness is just a consequence), while Vector Word8
is a ByteArray#
heap object (which could be pinned or unpinned). Yet again, I must emphasize that these are just different data types serving different use cases, same as lists and trees. Why people feel an insistent urge to force their unification at cost of both maintainers and users is beyond my comprehension.
As I see it, maintainers’ jobs would get easier if these types are unified because they would have to maintain less code. And the main reason for unification is to reduce the amount of choice for users and present one solution that fits all their needs.
What is the benefit of using a foreign pointer if it is not pinnedness? Which use-cases require foreign pointers that cannot also use “normal” ByteArray#
s? I don’t see why ByteString
has to be inherently “foreign”.
Well let’s really drill down here.
Lists vs trees is about an intrinsic property of the data structure. Here, however, the intrinsic properties are roughly the same. What is varying instead how how the memory is managed. In a language like C++ or rust this is closer to the allocator parameter on std::vector
/Vec
.
I did some spelunking because I do not know the details:
-
bytestring/Internal.hs at 707e50bac1f1ef7c1729a9246e05f95a4909fd27 · haskell/bytestring · GitHub allocates bytes on master today
-
GHC.ForeignPtr.mallocPlainForeignPtrBytes
is what it’s a wrapper around.This function is similar to
mallocForeignPtrBytes
, except that the internally an optimised ForeignPtr representation with no finalizer is used. Attempts to add a finalizer will cause an exception to be thrown.raises eyebrows. Foreign, but no finalizer??
-
newPinnedByteArray#
is the underlying primop its the docs sayLike
newByteArray#
but GC guarantees not to move it.
Meanwhile…
-
vector/Mutable.hs at dae9d1739f760027100458928ab96515ecb57cdf · haskell/vector · GitHub Internal module
Data.Vector.Primitive.Mutable
implementsbasicUnsafeNew
withData.Primitive.ByteArray.newByteArray
. -
Data.Primitive.ByteArray.newByteArray
usesnewByteArray#
So yes, while using ForeignPtr
could be intersteresting allowing one to transfer ownership to a foreign library that would deallocate the bytestring, in practice we are actually just getting two marginally different GHC-managed heap objects after all!
@Bodigrim I am sorry but it feels like you boldly asserted something, which sent me on this wild goose chase chasing variable bindings, just to figure out the thing the original blog post from @snoyman casually asserted without much detail is in fact exactly true.
@Ericson2314 it seems you assume that all ByteString
s are created via mallocByteString
. That’s not so, you can use Data.ByteString.Internal.fromForeignPtr
with any ForeignPtr
instead.
OK! Then the last thing I wrote is wrong, and there are yet more subtleties. But then I switch back to arguing merely that all this is quite subtle, far from obvious, and deserving a nice blog post!
It is also interesting to me that ForeignPtr
is such a highly flexible data type in many respects, but only supports pinned memory. That makes sense for something called ForeignPtr
— and mere pinning is enough to safely make an Addr#
. But the use-cases of extendable flat memory include the “pin it sometimes not others” usecase @snoyberg mentions, which still feels natural and obvious, and is a dynamism ForeignPtr
doesn’t support.
I still feel what we have here is a tower of somewhat awkward abstractions, totally justified for some things, but used in other ways that are far less justified. The accumulation of stuff here, and end result of Vector Word8
vs ByteString
does feel like an accident of history, and not something I feel like justifying to new users!
I really do want to clean all this stuff up, from the lowest level RTS assumption of pinness being something that doesn’t vary over the lifetime of a heap object, to the highest level bits of making a Vector/ByteString replacement that is parametric over these details.
For me, this sort of “pull things apart, shake them until they make sense” is why I like breaking up base, though I understand why other people agree with these problems, see the coordination failures, and draw the 180° opposite conclusion that we should throw it all in one monorepo to sort it out.