Sorry, I do not have bandwidth for a blog post. My sole purpose was to answer @daylily that as a comaintainer of `bytestring`, `text` and `vector` I do not see this unification coming and no one has made any practical effort towards it. I’m skeptical about it for various reasons, which again I do not have bandwidth to describe in full, but if someone wants to prove me wrong: you are most welcome to design `bytestring` or `text` over `vector` (or `contiguous`, or whatever) and achieve the same API and performance.
Sounds good. I am very sympathetic to the idea that there is more writing to be done than time to do it.
The trick is to specialise ByteString, for it is currently doing three things at once:
- FFI blob
- String literal and container
- Binary blob (especially from the network)
If you can focus on one of those things you will definitely find optimisations. If you need to keep these three roles at once, compromises will have to be made.
I kind of like your idea, because from this thread I am beginning to think that ByteStrings are not actually what I thought they were. But I wouldn’t like to see even more fragmentation in string types.
I think I have only really used `ByteString` as an ASCII string literal or byte container (which is also readable from stdin and printable to stdout*). From that perspective I think it would be fine to use a `ByteArray#`-based solution. I think the binary blob use case could also use such a `ByteArray#`-based solution. No foreign language is involved in these use cases, so I think `ForeignPtr` is not appropriate, see the docs:

> The type `ForeignPtr` represents references to objects that are maintained in a foreign language, i.e., that are not part of the data structures usually managed by the Haskell storage manager.
Instead, it seems like the current implementation using `ForeignPtr`s is actually optimized for use with external C code over the FFI. I don’t think that is a very common use case, and I’d actually use something like `CString` for that. What is the difference between `ByteString` and `CString`? Could this situation be resolved by expanding the API of `CString` and changing `ByteString` to be backed by a `ByteArray#` (and unified with unboxed vectors, which have the same representation)?
*Actually, this might be the catch if Haskell’s IO is implemented in terms of FFI calls to C functions. But I believe it is now also possible to use `UnliftedFFITypes` to make foreign calls with the `ByteArray#` type as input or output.
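As a rough illustration (a sketch, not any library's actual API; `c_memcmp` is my own name, though I believe `bytestring` uses similar bindings internally for `ShortByteString`), such a binding can look like this:

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnliftedFFITypes #-}
module Main where

import GHC.Exts (ByteArray#)
import Foreign.C.Types (CInt (..), CSize (..))

-- Hypothetical binding: compare two byte arrays with C's memcmp. Each
-- ByteArray# argument is passed as a pointer to its payload. Because the
-- import is `unsafe`, the GC cannot run (and thus cannot move an unpinned
-- array) during the call, so no pinning is required.
foreign import ccall unsafe "string.h memcmp"
  c_memcmp :: ByteArray# -> ByteArray# -> CSize -> CInt

main :: IO ()
main = putStrLn "ok"  -- constructing a ByteArray# needs primops; omitted here
```

The notable point is that the argument type is the unlifted `ByteArray#` itself, with no `ForeignPtr` or pinned memory involved.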
I’m sorry, but I don’t appreciate the optics that there is a “situation to be resolved”. If you do not like `ByteString` backed by `ForeignPtr`, no problem: use `ByteArray` or `Vector` or `ShortByteString`, godspeed, case closed. For the sake of what exactly would you like to break hundreds of existing consumers of a type which works for them perfectly well? There must be a compelling reason for unification beyond “it is better to have fewer types” (which is a weird argument to make in a statically-typed ecosystem).
I guess what I’m trying to figure out is what the ideal interface would look like. I’m not suggesting that we should switch to another system tomorrow and break all code. I just want to see what goal we could work towards. The best strategy is probably to introduce a new library that presents a unified type for the most common use cases and which can be recommended to newcomers. Edit: let me just quote this blog post:
> This section was added at the suggestion of Ben Gamari. I hadn’t realized that this post initially came off as a massive breaking change in the language and library ecosystem. So let me clarify. I believe that the changes above can be made immediately with zero breakage (though plenty of effort). The problem would be further fracturing the ecosystem ala the classic XKCD. However, I think we have a great move here: compatibility shims.
>
> In this world I’m picturing, most of the existing data types, and especially the two most common ones (strict `ByteString` and `Text`) can be expressed as newtype wrappers over the new types. `Vector` can probably have the same kind of shim too, though the strictness requirements I’m envisioning may change the semantics of boxed `Vector`s; I’m not sure yet. In any event, that’s my goal here: put these types deep into `base`, make them ubiquitous, and move most/all existing code over to use them.
I think @snoyberg makes that argument in the blog post that we are discussing here: all the different string/sequence types are a significant barrier to adoption and onboarding. A secondary benefit is that it reduces maintenance burden because there will be less code duplication (or at least that is what I hope).
As I said, this is a weird argument to make in a statically typed language, which prides itself on making even minor distinctions at the type level. Besides that, I do not find having distinct types for binary blobs, Unicode text and general-purpose arrays excessive. The situation is similar to many other languages.
As a maintainer of the discussed libraries, I believe that such a change is likely to come at a great cost to maintainers.
It really seems to me, mostly as a bystander, that there are two people discussing two different topics here.
Nobody argues, I think, for a type that will ambiguously represent either binary blobs or text, depending on how you look at it (this was Python 2’s `str` type). What the original post argued for was this:
> `Text` and `ByteString` have completely different representations, and therefore converting between them is an expensive operation. Note that, contrary to popular belief, switching to UTF-8 in `text` won’t solve this (though it’s still a good idea for reducing memory usage). The problem is, again, pinned versus unpinned memory.
The “unification” would keep separate types (of course), but with a closer representation, such that converting a large ASCII text between `Text` and `ByteString` would be very low cost or even a no-op, ideally eliminated by the compiler.
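To make that concrete with stand-in types (these names are mine, not a proposed API; a real version would wrap `ByteArray#`, not a list): if text were stored as UTF-8 bytes in the same structure a byte blob uses, "encoding" to bytes is just unwrapping a newtype.

```haskell
module Main where

import Data.Word (Word8)

-- Stand-in for a ByteArray-backed binary blob.
newtype Bytes = Bytes [Word8]

-- Hypothetical UTF-8 text type sharing the blob's representation.
newtype Utf8Text = Utf8Text Bytes

-- With a shared representation this is O(1): no allocation, no copy,
-- nothing for the compiler to even need to eliminate.
encodeUtf8' :: Utf8Text -> Bytes
encodeUtf8' (Utf8Text bs) = bs

main :: IO ()
main =
  let Bytes ws = encodeUtf8' (Utf8Text (Bytes [104, 105]))  -- "hi"
  in print ws
```

Decoding in the other direction would still need a UTF-8 validity check, but no copy.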
Furthermore, the overall problem is this (again from the original post):
> If you want to represent a sequence of data in Haskell today, you have too many options […] It’s a lot of cognitive overhead for new users, the libraries somewhat fight over which thing to use, and there are some hidden traps and performance costs.
So as I read the proposal, it’s about making the choice of which data type to use for a “sequence of data” much simpler and with less potential for performance degradation. Or something like that.
But nobody is arguing for a type `PerfectString` that can do it all at once.
Maybe we should list all the options for these text/sequence types:
- boxed / unboxed
- pinned / unpinned
- codepoints (non-uniform) / graphemes (non-uniform) / bytes (uniform)
- strict / lazy (or perhaps better: lifted / unlifted)
- mutable / immutable
- foreign / native (should this really be separate from the pinnedness axis?)
- short / normal (save a little bit of space but lose ability to slice)
- automatic fusion / manual fusion
I think the challenge is to try to fit as many of these in the same types without sacrificing too much ergonomics and performance. And if multiple types are required then I’d like to try to use type classes or backpack to provide a uniform interface to as many of these as possible, but even then I think there will need to be at least two interfaces for immutable and mutable sequences. And I’d like to see an easy way to convert between the different types.
Which of these can we combine into one type?
- The blog post suggests merging boxed and unboxed vectors into a single type that is able to use unboxed storage if possible. I think that sounds very difficult, but I believe C++ can do it with its vectors.
- The blog post also suggests introducing temporary pinning. I think I heard recently that temporary pinning is not a good option, but unpinned memory could simply be copied. That means we could just always use unpinned memory and have a low-level pinned sequence type for people who know what they are doing.
- I think a text type (codepoints or graphemes) should be separate, but I think it can have mostly the same interface as other sequences and it should be possible to cheaply convert text into a sequence of bytes. But I wouldn’t complain if somebody figured out a good way to have a non-uniform contiguous sequence type. Then this could simply be a type synonym over that.
- I think it is fine to pick strict/unlifted elements for these sequences because it is meant for storage. Maybe a lower level lazy/lifted type can be added separately for people who know what they’re doing.
- Mutable and immutable versions probably require completely different APIs, but it should be possible to cheaply convert between them (unsafely or perhaps even safely using linear types).
- I’d say we should pick native types by default and add a separate foreign type if necessary for people who know what they are doing.
- I think we can make a best of both worlds type for short vs normal variants which is one byte longer than a short bytestring, but also supports slicing if the sequence is larger.
- I think the trend is to have manual fusion as automatic fusion is just too difficult to get right.
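The short-vs-normal point above could be sketched like this (a list stands in for a primitive array, and all names are hypothetical): one constructor carries no slicing bookkeeping, and slicing switches to the constructor that does.

```haskell
module Main where

import Data.Word (Word8)

-- "Best of both worlds" sequence: the unsliced form is compact (like
-- ShortByteString, no offset/length fields), while slicing produces the
-- bookkeeping-carrying form (like ByteString) without copying.
data Bytes
  = Whole  [Word8]          -- the whole array is the payload
  | Sliced [Word8] Int Int  -- array, offset, length

len :: Bytes -> Int
len (Whole a)      = length a
len (Sliced _ _ n) = n

-- Slicing never copies; it only records an offset and a length.
slice :: Int -> Int -> Bytes -> Bytes
slice off n (Whole a)      = Sliced a off n
slice off n (Sliced a o _) = Sliced a (o + off) n

main :: IO ()
main = print (len (slice 1 2 (Whole [1, 2, 3, 4])))
```

At runtime the extra cost over `ShortByteString` is roughly one constructor tag, which is where the "one byte longer" estimate comes from.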
To conclude, that would mean we have 4 main types: text, mutable text (why does that not exist yet?), vector, mutable vector. Maybe we will need explicit boxed and unboxed vector variants (text is always unboxed). But even that would be a huge improvement in my opinion.
So, in particular, the changes from the status quo would be:
- a more primitive `vector` type would be introduced, which ideally would automatically choose an unboxed representation if possible.
- `bytestring` would get discouraged and `vector` would get some `bytestring`/`Vector Word8`-specific functions for e.g. input and output.
- `text` would use `Unboxed.Vector Word8` internally instead of that custom array type.
- `text` would get a mutable interface (maybe this is unnecessary?)
- `text` would be cheaply convertible to the vector type.
- all immutable sequence types would have the same interface (bar those `bytestring`-specific functions), and a type class or Backpack signature should enforce that.
And maybe more, but I’m too tired now. Please fire away with your criticisms!
PS I see now that I have overlooked the important difference between lazy/strict sequences and unlifted/lifted elements. I think most sequences should be strict and explicit streaming should be used instead of lazy sequences. But lazy sequences are pretty nice to use for small programs…
I have not been following closely. But I’d like to amplify @iustin’s point, because I think I agree. It seems there may be an advantage (note that I say “may”: I have no idea whether there is or isn’t) if these fundamental types share a representation, not that they are the same type. So I don’t think the argument that we are reducing types in a statically typed language is much in play. (I apologize @Bodigrim if I misunderstand you here.) Why might this be an advantage? Because it makes conversions among the types cheaper. Is that the entire advantage? Quite possibly. If so, is there evidence that real programs are hampered by slow conversions?
My bottom line: it’s clear that some people think this is a good idea and others think it’s a bad idea. @Bodigrim, as maintainer, has more knowledge of the internals, and likely a better handle on estimating the implementation / breakage cost of this change. So it would seem that others, if they think this change is a good idea, should back up their arguments with data, just to prevent this conversation from going around and around.
Why should sequences be strict? Even Haskell’s list is lazy, and people can achieve streaming behavior easily with laziness…
I’m not pushing for this proposal anymore, since it seems unlikely to be worth the cost based on discussions with maintainers. However, since something just came up today that provides a reasonable justification, I thought I’d share it. A colleague asked about the best way to parse a `Scientific` value from a `Text` value. I had a look at how `aeson` does this, and discovered that it implements its own `scientific` `Parser`. This would be perfect, except that it works on `ByteString`s, not `Text`. We’re left with a few different options:
- Use `encodeUtf8` to get a `ByteString` from the original `Text`. This is wasteful, as it causes an unnecessary allocation, buffer copy, and deallocation.
- Reimplement the `scientific` parser for `Text`, leading to code duplication.
(There’s also the route of using `String`-based parsers within `scientific`, but that’s somewhat separate from this discussion.)
In my ideal world, where `Text` is a newtype wrapper around `ByteString` (or, better yet, `Vector Word8`), (1) would be the correct solution and would have zero cost. To address @Bodigrim’s point, this is less about library maintainer burden within the `bytestring`/`text`/`vector` ecosystem, and more about downstream libraries and applications.
A lot of the motivation for this proposal came from Rust, where `String` is a newtype wrapper around `Vec<u8>`, slices generalize over multiple data storage mechanisms, and a unified iteration interface generalizes over many different data structures. IMO it’s a game changer for writing great code.
There’s a third option without code duplication for your specific example, which I have done for the Abstract FilePath proposal: CPP. It isn’t pretty, but it works well if both text types export roughly the same API (and they do). E.g. I was able to abstract over both `String` and `ShortByteString` by simply removing the explicit `String` pattern matches and using `uncons` etc. instead, e.g.:
```haskell
isExtensionOf :: STRING -> FILEPATH -> Bool
isExtensionOf ext = \fp -> case uncons ext of
  Just (x, _)
    | x == _period -> isSuffixOf ext . takeExtensions $ fp
  _ -> isSuffixOf (singleton _period <> ext) . takeExtensions $ fp
```
Then you just `#include` with the right imports and types set. Maybe not particularly elegant, but a practical workaround.
One could wonder if this shared API could be unified under a typeclass even.
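A minimal sketch of that type class idea (the class and method names here are mine, not from any library): one interface over several sequence types, so code like `isExtensionOf` could be written once instead of via CPP.

```haskell
module Main where

-- A hypothetical class capturing the shared part of the API that the
-- CPP trick relies on.
class Sequence s where
  uncons    :: s a -> Maybe (a, s a)
  singleton :: a -> s a

-- Ordinary lists (and hence String) as one instance; Text, ByteString
-- etc. would get instances of their own.
instance Sequence [] where
  uncons []       = Nothing
  uncons (x : xs) = Just (x, xs)
  singleton x     = [x]

-- A generic helper written only against the class.
startsWith :: (Sequence s, Eq a) => a -> s a -> Bool
startsWith c s = case uncons s of
  Just (x, _) -> x == c
  Nothing     -> False

main :: IO ()
main = print (startsWith '.' ".hs")
```

Whether the methods can be inlined well enough to match the specialised implementations is exactly the performance question raised below.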
This is the same solution as the Backpack suggestion mentioned in @snoyberg’s original proposal or the follow-ups. I don’t know OCaml nor Haskell’s Backpack, but this might share the spirit of the corresponding solution:
```haskell
-- (pseudocode, not real Backpack syntax)
signature ListLike L {
  uncons :: L a -> Maybe (a, L a)
}

-- Defining a parser
signature Parsers {
  ListLike L
} where
  isExtensionOf :: L -> L -> Bool
  isExtensionOf ext = \fp -> case uncons ext of
    Just (x, _)
      | x == _period -> isSuffixOf ext . takeExtensions $ fp
    _ -> isSuffixOf (singleton _period <> ext) . takeExtensions $ fp

-- Using the signature
module Foo where
import Parsers @[]
...
```
The type class solution you mention is quite simple to implement, however, the pushback I imagine you will hear is that either:
- The inlining pragmas will make binaries too large for non-trivial codebases using more than one list type
- The type class will incur a runtime cost, which is a no-no for something as core as vector types
Oh god. I didn’t expect my reply would generate this long a discussion. I’ll try to summarize some of the thoughts I have:
benefits
My idea of why it would be a good thing to merge the representations of Text, ByteString and Vector is that
- It makes the conversions basically zero-cost
- It makes future code easier to write with less duplication
I can’t say how this will affect existing packages’ maintenance, because I’m not a maintainer of any of the packages that could be involved.
problems
I skimmed the thread and it seems that the only “substantial” (i.e. alters performance) change is that ByteString becomes unpinned, which can be bad for large blobs passed through FFI. So if that is something we cannot do, then we have a choice:
- Make `Text` implemented with `primitive`'s `ByteArray`s, which is isomorphic to the current representation
  - This is a “pointless” change in the sense that it has no substantial runtime effect
  - This can, however, deduplicate some code in `text` (how much?), which may reduce or increase the burden of maintainers (?)
- Make `Text` implemented with `ByteString`s, which makes it zero-cost to transform between the two
  - This one makes `Text`s pinned (which can change performance characteristics)
  - This one makes
Again, I don’t know which one will be better, I’m just trying to summarize some information here.
execution
Sure, I believe this will not happen anytime soon:
- If we are going to use `Vector`s as the underlying representation of `ByteString` & `Text`, then `vector` should first get into the GHC boot packages;
- if we are going to use `primitive` or `contiguous`, then they need to get into the GHC boot packages;
- and if `contiguous` becomes “blessed” in this way, then users will be confused as to whether to use it or `vector`.
Leaving `ByteString` aside, there is ongoing work to unify `ShortByteString`, `Data.Text.Array` and `Data.Primitive.ByteArray`. The underlying type, `ByteArray`, is now available from `base` (`Data.Array.Byte`), and `bytestring`, `text` and `primitive` are to share it:
> The underlying type, `ByteArray`, is now available from `base`
That’s awesome! I have a small quibble with these sentences in the Haddocks for `ByteArray`:
> Boxed wrapper for `ByteArray#`. Since `ByteArray#` is an unlifted type and not a member of kind `Type`, things like `[ByteArray#]` or `IO ByteArray#` are ill-typed. To work around this inconvenience this module provides a standard boxed wrapper, inhabiting `Type`.
They are correct, but to me the wording sort of suggests that the difference between `ByteArray` and `ByteArray#` is the boxity. But both `ByteArray` and `ByteArray#` are boxed: the difference lies instead in the “liftedness”. So perhaps it should be “A lifted wrapper for `ByteArray#`”.
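The definition itself makes the point (this sketch mirrors `base`'s wrapper, with my own type name to keep it self-contained):

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (ByteArray#)

-- An ordinary lifted data type whose single field is the unlifted
-- ByteArray#. Both the wrapper and the wrapped value are boxed, i.e.
-- represented by a pointer; only liftedness differs: a MyByteArray can
-- be a thunk or bottom, while a ByteArray# value is always evaluated.
data MyByteArray = MyByteArray ByteArray#

main :: IO ()
main = putStrLn "ok"
```

So the wrapper buys liftedness (membership in kind `Type`), not boxing, which is what the suggested wording change conveys.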