Base proposal around vector-like types

Bodigrim · April 3, 2022, 10:41pm

Sorry, I do not have bandwidth for a blog post. My sole purpose was to answer @daylily that as a comaintainer of bytestring, text and vector I do not see this unification coming and no one made any practical effort towards it. I’m skeptical about it for various reasons, which again I do not have bandwidth to describe in full, but if someone wants to prove me wrong - you are most welcome to design bytestring or text over vector (or contiguous, or whatever) and achieve the same API and performance.

Ericson2314 · April 3, 2022, 10:47pm

Sounds good. I am very sympathetic to the idea that there is more writing to be done than time to do it.

Kleidukos · April 4, 2022, 10:13am

The trick is to specialise ByteString, for it is currently doing three things at once:

FFI blob
String literal and container
Binary blob (especially from the network)

If you can focus on one of those things you will definitely find optimisations. If you need to keep these three roles at once, compromises will have to be made.

jaror · April 4, 2022, 11:59am

I kind of like your idea, because from this thread I am beginning to think that ByteStrings are not actually what I thought they were. But I wouldn’t like to see even more fragmentation in string types.

I think I have only really used ByteString as an ascii string literal or byte container (which is also readable from stdin and printable to stdout*). From that perspective I think it would be fine to use a ByteArray# based solution. I also think the binary blob use-case could also use such a ByteArray# based solution. No foreign language is involved in these use-cases, so I think ForeignPtr is not appropriate, see the docs:

The type ForeignPtr represents references to objects that are maintained in a foreign language, i.e., that are not part of the data structures usually managed by the Haskell storage manager.

Instead, it seems like the current implementation using ForeignPtr’s is actually optimized for use with external C code over the FFI. I don’t think that is a very common use-case and I’d actually use something like CString for that. What is the difference between ByteString and CString? Could this situation be resolved by expanding the API of CString and changing ByteString to be backed by a ByteArray# (and unified with unboxed vectors, which have the same representation)?

*Actually, this might be the catch if Haskell’s IO is implemented in terms of FFI calls to C functions. But I believe it is now also possible to use UnliftedFFITypes to make foreign calls with the ByteArray# type as input or output.

Kleidukos · April 4, 2022, 12:17pm

A CString is a Ptr CChar, so a NUL-terminated list of C characters

hasufell · April 4, 2022, 3:53pm

There is ShortByteString

Bodigrim · April 4, 2022, 6:13pm

I’m sorry, but I don’t appreciate the optics that there is a “situation to be resolved”. If you do not like ByteString backed by ForeignPtr - no problem, use ByteArray or Vector or ShortByteString, godspeed, case closed. For the sake of what exactly would you like to break hundreds of existing consumers of a type, which works for them perfectly well? There must be a compelling reason for unification beyond “it is better to have less types” (which is a weird argument to make in a statically-typed ecosystem).

jaror · April 4, 2022, 6:22pm

I guess what I’m trying to figure out is what the ideal interface would look like. I’m not suggesting that we should switch to another system tomorrow and break all code. I just want to see what goal we could work towards. ~~The best strategy is probably to introduce a new library that presents a unified type for the most common use cases and which can be recommended to newcomers.~~ Edit: let me just quote this blog post:

This section was added at the suggestion of Ben Gamari. I hadn’t realized that this post initially came off as a massive breaking change in the language and library ecosystem. So let me clarify. I believe that the changes above can be made immediately with zero breakage (though plenty of effort). The problem would be further fracturing the ecosystem ala the classic XKCD. However, I think we have a great move here: compatibility shims.

In this world I’m picturing, most of the existing data types, and especially the two most common ones (strict ByteString and Text) can be expressed as newtype wrappers over the new types. Vector can probably have the same kind of shim too, though the strictness requirements I’m envisioning may change the semantics of boxed Vectors; I’m not sure yet.

In any event, that’s my goal here: put these types deep into base, make them ubiquitous, and move most/all existing code over to use them.

I think @snoyberg makes that argument in the blog post that we are discussing here: all the different string/sequence types are a significant barrier to adoption and onboarding. A secondary benefit is that it reduces maintenance burden because there will be less code duplication (or at least that is what I hope).

Bodigrim · April 4, 2022, 6:49pm

As I said, this is a weird argument to make in a statically typed language, which prides itself in making even minor distinctions type-level. Besides that I do not find having distinct types for binary blobs, unicode text and general-purpose arrays excessive. The situation is similar to many other languages.

As a maintainer of discussed libraries, I believe that such change is likely to come at a great cost to maintainers.

iustin · April 4, 2022, 7:13pm

It really seems to me, mostly as a bystander, that there are two people discussing two different topics here.

Nobody argues, I think, for a type that will ambiguously represent either binary blobs or text, depending on how you look at it (this was Python 2’s str type). What the original post argues for was this:

Text and ByteString have completely different representations, and therefore converting between them is an expensive operation. Note that, contrary to popular belief, switching to UTF-8 in text won’t solve this (though it’s still a good idea for reducing memory usage). The problem is, again, pinned versus unpinned memory.

The “unification” would keep separate types (of course), but with a closer representation such that converting of a large ASCII text between Text and ByteString would be very low cost or even a no-op, and ideally eliminated by the compiler.

Furthermore, the overall problem is this (again from the original post):

If you want to represent a sequence of data in Haskell today, you have too many options […] It’s a lot of cognitive overhead for new users, the libraries somewhat fight over which thing to use, and there are some hidden traps and performance costs.

So as I read the proposal, it’s about making the choice of which data type to use for a “sequence of data” much simpler and with less potential for performance degradation. Or something like that.

But nobody is arguing for a type PerfectString that can do all at once.

jaror · April 4, 2022, 8:17pm

Maybe we should list all the options for these text/sequence types:

boxed / unboxed
pinned / unpinned
codepoints (non-uniform) / graphemes (non-uniform) / bytes (uniform)
strict / lazy (or perhaps better: lifted / unlifted)
mutable / immutable
foreign / native (should this really be separate from the pinnedness axis?)
short / normal (save a little bit of space but lose ability to slice)
automatic fusion / manual fusion

I think the challenge is to try to fit as many of these in the same types without sacrificing too much ergonomics and performance. And if multiple types are required then I’d like to try to use type classes or backpack to provide a uniform interface to as many of these as possible, but even then I think there will need to be at least two interfaces for immutable and mutable sequences. And I’d like to see an easy way to convert between the different types.

Which of these can we combine into one type?

The blog post suggest to merge boxed and unboxed vectors into a single type that is able to use unboxed storage if possible. I think that sounds very difficult, but I believe C++ can do it with its vectors.
The blog also suggests to introduce temporary pinning. I think I heard recently that temporary pinning is not a good option, but unpinned memory could simply be copied. That means we could just always use unpinned memory and have a low level pinned sequence type for people who know what they are doing.
I think a text type (codepoints or graphemes) should be separate, but I think it can have mostly the same interface as other sequences and it should be possible to cheaply convert text into a sequence of bytes. But I wouldn’t complain if somebody figured out a good way to have a non-uniform contiguous sequence type. Then this could simply be a type synonym over that.
I think it is fine to pick strict/unlifted elements for these sequences because it is meant for storage. Maybe a lower level lazy/lifted type can be added separately for people who know what they’re doing.
Mutable and immutable versions probably require completely different APIs, but it should be possible to cheaply convert between them (unsafely or perhaps even safely using linear types).
I’d say we should pick native types by default and add a separate foreign type if necessary for people who know what they are doing.
I think we can make a best of both worlds type for short vs normal variants which is one byte longer than a short bytestring, but also supports slicing if the sequence is larger.
I think the trend is to have manual fusion as automatic fusion is just too difficult to get right.

To conclude, that would mean we have 4 main types: text, mutable text (why does that not exist yet?), vector, mutable vector. Maybe we will need explicit boxed and unboxed vector variants (text is always unboxed). But even that would be a huge improvement in my opinion.

So, in particular the changes with the status quo would be:

a more primitive vector type would be introduced which ideally would automatically choose an unboxed representation if possible.
bytestring would get discouraged and vector would get some bytestring/Vector Word8 specific functions for e.g. input and output.
text would use Unboxed.Vector Word8 internally instead of that custom array type.
text would get a mutable interface (maybe this is unnecessary?)
text would be cheaply convertible to the vector type.
all immutable sequence type would have the same interface (bar those bytestring specific functions) and a type class or backpack signature should enforce that

And maybe more, but I’m too tired now. Please fire away with your criticisms!

PS I see now that I have overlooked the important difference between lazy/strict sequences and unlifted/lifted elements. I think most sequences should be strict and explicit streaming should be used instead of lazy sequences. But lazy sequences are pretty nice to use for small programs…

rae · April 4, 2022, 8:56pm

I have not been following closely. But I’d like to amplify @iustin’s point, because I think I agree. It seems there may be an advantage (note that I say “may”: I have no idea whether there is or isn’t) if these fundamental types share a representation, not that they are the same type. So I don’t think the argument that we are reducing types in a statically typed languages is much in play. (I apologize @Bodigrim if I misunderstand you here.) Why might this be an advantage? Because it makes conversions among the types cheaper. Is that the entire advantage? Quite possibly. If so, is there evidence that real programs are hampered by slow conversions?

My bottom line: it’s clear that some people think this is a good idea and others think it’s a bad idea. @Bodigrim, as maintainer, has more knowledge of the internals, and likely a better handle on estimating the implementation / breakage cost of this change. So it would seem that others, if they think this change is a good idea, should back up their arguments with data, just to prevent this conversation from going around and around.

Abab9579 · April 5, 2022, 3:03am

Why should sequences be strict? Even haskell list is lazy, and ppl could achieve streaming behavior easily with laziness…

snoyberg · April 5, 2022, 12:48pm

I’m not pushing for this proposal anymore, since it seems unlikely to be worth the cost based on discussions with maintainers. However, since something just came up today to provide a reasonable justification, I thought I’d share it. A colleague asked about the best way to parse a Scientific value from a Text value. I had a look at how aeson does this, and discovered that it implements its own scientific Parser. This would be perfect, except that it works on ByteStrings, not Text. We’re left with a few different options:

Use encodeUtf8 to get a ByteString from the original Text. This is wasteful, as it causes an unnecessary allocation, buffer copy, and deallocation.
Reimplement the scientific parser for Text, leading to code duplication.

(There’s also the route of using String-based parsers within scientific, but that’s somewhat separate from this discussion.)

In my ideal world, where Text is a newtype wrapper around ByteString (or, better yet, Vector Word8), (1) would be the correct solution and would have zero cost. To address @Bodigrim’s point, this is less about library maintainer burden within the bytestring/text/vector ecosystem, but downstream libraries and applications.

A lot of the motivation for this proposal came from Rust, where String is a newtype wrapper around Vec<u8>, slices generalize over multiple data storage mechanisms, and a unified iteration interface generalizes over many different data structures. IMO it’s a game changer to writing great code.

hasufell · April 5, 2022, 5:38pm

There’s a third option without code duplication for your specific example, which I have done for the Abstract filepath proposal: CPP. It isn’t pretty, but works well if both text types roughly export the same API (and they do). E.g. I was able to abstract over both String and ShortByteString by simply removing the explicit String pattern matches and using uncons etc. instead, e.g.:

isExtensionOf :: STRING -> FILEPATH -> Bool
isExtensionOf ext = \fp -> case uncons ext of
  Just (x, _)
    | x == _period -> isSuffixOf ext . takeExtensions $ fp
  _ -> isSuffixOf (singleton _period <> ext) . takeExtensions $ fp

Then you just #include with the right imports and types set. Maybe not particularly elegant, but a practical workaround.

One could wonder if this shared API could be unified under a typeclass even.

santiweight · April 5, 2022, 6:28pm

This solution is the same solution as the backpack suggestion which was mentioned in @snoyberg’s original proposal or the follow-ups. I don’t Ocaml nor Haskell-backpack, but this might share the spirit of the corresponding solution:

signature ListLike L {
  uncons :: L a -> Maybe a
}

-- Defining a parser
signature Parsers {
  ListLike L
} where

isExtensionOf :: L -> L -> Bool
isExtensionOf ext = \fp -> case uncons ext of
  Just (x, _)
    | x == _period -> isSuffixOf ext . takeExtensions $ fp
  _ -> isSuffixOf (singleton _period <> ext) . takeExtensions $ fp

-- Using the signature
module Foo where
import Parsers @[]

...

The type class solution you mention is quite simple to implement, however, the pushback I imagine you will hear is that either:

The inlining pragmas will make binaries too large for non-trivial codebases using more than one list type
The type class will incur a runtime cost, which is a nono for something as core as vector types

daylily · April 7, 2022, 1:45am

Oh god. I didn’t expect my reply would generate this long a discussion. I’ll try to summarize some of the thoughts I have:

benefits

My idea of why it would be a good thing to merge the representations of Text, ByteString and Vector is that

It makes the conversions basically zero-cost
It makes future code easier to write with less duplication

I can’t say how this will affect existing packages’ maintenance, because I’m not a maintainer of any of the packages that could be involved.

problems

I skimmed the thread and it seems that the only “substantial” (i.e. alters performance) change is that ByteString becomes unpinned, which can be bad for large blobs passed through FFI. So if that is something we cannot do, then we have a choice:

Make Text implemented with primitive's ByteArrays, which is isomorphic to the current representation
- This is a “pointless” change in the sense that it has no substantial runtime effect
- This can, however, duduplicate some code in text (How much?), which may reduce or increase the burden of maintainers (?)
Make Text implemented with ByteStrings, which makes it zero-cost to transform between the two
- This one makes Texts pinned (which can change performance characteristics)

Again, I don’t know which one will be better, I’m just trying to summarize some information here.

execution

Sure, I believe this will not happen anytime soon:

If we are going to use Vectors as the underlying representation of ByteString & Text, then vector should first get into ghc boot packages
If we are going to use primitive, or contiguous, then they need to get into ghc boot packages;
and if contiguous becomes “blessed” in this way, then user will be confused as to whether use it or vector.

Bodigrim · October 8, 2022, 2:39pm

Leaving ByteString aside, there is an ongoing work to unify ShortByteString, Data.Text.Array and Data.Primitive.ByteArray. The underlying type, ByteArray is now available from base (Data.Array.Byte), and bytestring, text and primitive are to share it:

github.com/haskell/primitive

Use data-array-byte package to provide ByteArray for GHC < 9.4

haskell:master ← Bodigrim:master

opened 07:54PM - 07 Oct 22 UTC

Bodigrim

+6 -243

Follow up on #354. I copied [`Data.Array.Byte`](https://hackage.haskell.org/p…ackage/base-4.17.0.0/docs/Data-Array-Byte.html) from `base-4.17` into a standalone compatibility package [`data-array-byte`](https://hackage.haskell.org/package/data-array-byte-0.1) so that `bytestring` and `text` can switch to `ByteArray`, but maintain providing a stable interface across GHC versions. Now `primitive` can also use the same compatibility package, reducing amount of CPP and code to maintain and just importing `Data.Array.Byte` unconditionally. Cf. https://github.com/haskell/text/pull/474 and https://github.com/haskell/bytestring/pull/410

Kleidukos · October 11, 2022, 6:25am

That is fantastic, thank you very much @Bodigrim !!

danidiaz · October 11, 2022, 12:52pm

The underlying type, ByteArray is now available from base

That’s awesome! I have a small quibble with these sentences in the Haddocks for ByteArray:

Boxed wrapper for ByteArray#.
Since ByteArray# is an unlifted type and not a member of kind Type , things like [ByteArray#] or IO ByteArray# are ill-typed. To work around this inconvenience this module provides a standard boxed wrapper, inhabiting Type

They are correct, but to me the wording sort of suggests that the difference between ByteArray and ByteArray# is the boxity. But both ByteArray and ByteArray# are boxed: the difference lies instead in the “liftedness”. So perhaps it should be “A lifted wrapper for ByteArray#”.