In this 2022 comment I highlighted the various responsibilities of ByteString today:
It is currently doing three things at once:
Container for data that crosses the FFI boundaries (needs to be pinned and kept alive so that the GC doesn’t bugger off with your data in the middle of a safe FFI call.)
String literal (where fromString truncates multi-byte characters to octets. That job is best left to Text or OsString)
Binary blob read from the file system or the network, which can be handled freely by the GC.
Since ByteString is backed by a ForeignPtr with finalisers, I imagine this makes it more suited for holding FFI data.
So, can an unpinned ByteArray be used as a replacement “destination” type when reading data from the outside and letting the GC handle things? After all, it does back Text, so it does seem suitable for Haskell-side data manipulation. Not being pinned avoids memory fragmentation as well, which is pretty nice.
If you’re writing code so low-level that pinnedness matters, you should just use a ByteArray. The primitive package is a must-have for this, I will accept no substitutes (although I do feel like it’s a bit too typeclass-heavy).
As I see it that’s what a strict ByteString originally represents, a single chunk of data with an undefined encoding. (and this is why IsString ByteString truncates characters to bytes)
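To make the truncation concrete, here is a minimal sketch using Data.ByteString.Char8.pack, which keeps only the low 8 bits of each Char; as noted above, the IsString instance behaves the same way. The names truncated and lambdaOctets are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Hypothetical helper: pack a String the way Char8.pack does and
-- return the octets that actually end up in the ByteString.
truncated :: String -> [Word8]
truncated = B.unpack . BC.pack

-- 'λ' is U+03BB; only the low byte 0xBB survives the packing.
lambdaOctets :: [Word8]
lambdaOctets = truncated "\x3BB"
```

Anything that must survive as text needs an explicit encoding step (for example via Text) before it becomes bytes.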
I don’t like the text package. Examples:
I’m writing bindings for a C API and I know that some endpoint returns a UTF-8 (or ASCII) string. I should be able to pack that string into a plain UTF-8 byte array in a single copy operation. Currently I have to either decode a ByteString to Text (linear waste) or repackage using Text internals (which have no obligation to uphold the PVP). Also in this particular case it’d be nice if there was no slice information stored, à la ShortByteString;
I have a library that returns arbitrary errors, always in UTF-8. The fastest way to build a string is with Data.ByteString.Builder, but it doesn’t have an encoding (except IsString Builder does, it’s UTF-8, funny how that works). I’m forced to either use Data.Text.Lazy.Builder (not interchangeable, far worse API), or String (has its own pitfalls);
I want to be extremely pedantic with my file name handling. User, on Linux, gives me a raw file name (OsString), I encode it using the current locale with errors encoded into UTF-8 surrogates (decodeFS), I… manually check if there are surrogates present in the resulting String and then if I wish to store it in Text it will implicitly wipe said surrogates. UTF-8 with surrogates is a special encoding that would make this both type-safe and fast, but there’s no way to represent it without writing a whole separate library.
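For the first case, one partial option that exists today is Data.ByteString.Short.packCString, which copies a NUL-terminated C string into a ShortByteString (an unsliced byte array) in a single pass. It neither validates nor records the encoding, so the UTF-8 guarantee still lives only in the programmer's head, which is exactly the complaint. A sketch, where copyCString and demo are hypothetical names and useAsCString merely stands in for a real C endpoint:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Short as SBS
import Foreign.C.String (CString)

-- Hypothetical wrapper: one copy from C memory into an unpinned,
-- unsliced byte array. No encoding check is performed.
copyCString :: CString -> IO SBS.ShortByteString
copyCString = SBS.packCString

-- Stand-in for a C call: useAsCString lends out a temporary
-- NUL-terminated copy of the given bytes (here UTF-8 for 'λ').
demo :: IO SBS.ShortByteString
demo = B.useAsCString (B.pack [0xCE, 0xBB]) copyCString
```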
I don’t think encodings should be superglued to types. The ecosystem should have generic string types (Short/Slice/Stream Short/Builder/Rope), and then have separate libraries for each encoding. Ideally you’d also have some way to hide encoding types so that you can talk about Short {UTF8} without having to mention UTF8 constantly (same with byte array pinnedness).
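One way to read this proposal is as a generic container with the encoding carried in a phantom type parameter. The sketch below is purely illustrative: Short, UTF8, Raw, byteLength, append, fromAscii and forget are all invented names, not an existing library, and it only covers the "Short" container.

```haskell
import qualified Data.ByteString.Short as SBS

-- Hypothetical encoding tags; they have no values, only types.
data UTF8
data Raw

-- Hypothetical generic short string: the payload is always a byte
-- array, the encoding exists only at the type level.
newtype Short enc = Short SBS.ShortByteString
  deriving (Eq, Show)

-- Encoding-agnostic operations work for every tag.
byteLength :: Short enc -> Int
byteLength (Short s) = SBS.length s

append :: Short enc -> Short enc -> Short enc
append (Short a) (Short b) = Short (a <> b)

-- An encoding-specific constructor: ASCII is a subset of UTF-8,
-- so pure-ASCII input can be tagged UTF8 without further checks.
fromAscii :: String -> Maybe (Short UTF8)
fromAscii str
  | all (< '\x80') str = Just (Short (SBS.pack (map (fromIntegral . fromEnum) str)))
  | otherwise = Nothing

-- Dropping the guarantee is always safe and costs nothing.
forget :: Short enc -> Short Raw
forget (Short s) = Short s
```

Hiding the tag so that it does not have to be written everywhere would need more machinery than this sketch shows.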
The ShortByteString objects from the bytestring package are precisely the (not necessarily or by default pinned) blobs of binary data that support a very similar API.
The main limitations with ShortByteString are:
There is no chunked variant, but various uses of Lazy bytestrings are in any case better served by monadic byte streams.
There is no support for Builders, precisely because builders leverage pinned memory to support efficient incremental construction (to chunked lazy bytestrings).
Otherwise ShortByteString is internally a ByteArray, but has a familiar, richer API. It is easy to convert between ByteArray and ShortByteString via the SBS pattern synonym.
As for the IsString instance, though criticised by some, it does support bona-fide octet string constants, such as SMTP verbs (“EHLO”, “MAIL FROM:”, “RCPT TO:”, “DATA”) or even sample binary data. For truly stripped-down compile-time read-only binary data one can also get away with just Addr# literals, and be duly warned about the 8-bit byte restriction:
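A minimal sketch of that conversion, assuming a base that ships Data.Array.Byte (4.17 or later) and a bytestring whose Data.ByteString.Short.Internal exposes SBS (a plain constructor in 0.11, a pattern synonym in 0.12). The names toByteArray and fromByteArray are made up for illustration.

```haskell
{-# LANGUAGE MagicHash #-}
import Data.Array.Byte (ByteArray (..))
import Data.ByteString.Short.Internal (ShortByteString (..))

-- Both directions only rewrap the same unlifted ByteArray#;
-- no bytes are copied.
toByteArray :: ShortByteString -> ByteArray
toByteArray (SBS ba#) = ByteArray ba#

fromByteArray :: ByteArray -> ShortByteString
fromByteArray (ByteArray ba#) = SBS ba#
```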
λ> :set -XMagicHash
λ> import qualified Data.ByteString as BS
λ> import qualified Data.ByteString.Internal as BS
λ> mylit = BS.unsafePackLiteral "foobar"#
λ> BS.length mylit
6
λ> show mylit
"\"foobar\""
λ> mylit2 = BS.unsafePackLiteral "Виктор"#
<interactive>:7:31: error: [GHC-43080]
primitive string literal must contain only characters <= '\xFF'
But see also unsafePackLenLiteral if the octet string needs to hold NUL bytes (0-valued octets).
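As a hedged illustration (not the example from the original post), unsafePackLenLiteral from Data.ByteString.Internal takes the length explicitly, so an embedded NUL does not cut the literal short. The name withNul is invented, and the caller is responsible for passing a length that matches the literal.

```haskell
{-# LANGUAGE MagicHash #-}
import qualified Data.ByteString as B
import Data.ByteString.Internal (unsafePackLenLiteral)

-- Seven octets: "foo", a NUL byte, then "bar". The length is given
-- by hand; a wrong value here would read out of bounds.
withNul :: B.ByteString
withNul = unsafePackLenLiteral 7 "foo\0bar"#
```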
There is no support for Builders, precisely because builders leverage pinned memory to support efficient incremental construction (to chunked lazy bytestrings).
I should perhaps have said that, more precisely, what a Builder outputs (directly) is a lazy ByteString, rather than ShortByteString. One can of course take that lazy ByteString (with its pinned chunks) and copy it to a ShortByteString, releasing the intermediate storage if the lazy ByteString is then no longer needed.
On the input side, a Builder can consume data from all sorts of serialisable objects, including ShortByteStrings for which a combinator already exists.
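Both directions can be sketched with the existing API: Data.ByteString.Builder.shortByteString feeds a ShortByteString into a Builder, and the output goes through a lazy ByteString before being copied into a ShortByteString. The name buildShort is made up for illustration.

```haskell
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Short as SBS

-- Run the Builder into (pinned) lazy chunks, flatten them, then copy
-- the result into a ShortByteString. The intermediate chunks become
-- garbage once nothing else refers to them.
buildShort :: BB.Builder -> SBS.ShortByteString
buildShort = SBS.toShort . BL.toStrict . BB.toLazyByteString

-- A ShortByteString as Builder input, followed by one extra octet.
example :: SBS.ShortByteString
example = buildShort (BB.shortByteString (SBS.pack [1, 2, 3]) <> BB.word8 4)
```

This costs extra copies compared with a Builder that writes straight into an unpinned array, which is the trade-off under discussion.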
Returning to your point, sure, if @bodigrim (a maintainer of both bytestring and the package you highlighted) is suitably motivated, the new linear builders based on Buffer could then be used to construct a ShortByteString “directly”, if there are advantages in doing so over the use of lazy ByteStrings as pinned chunked memory during construction.
Meanwhile, a more disciplined (than directly working with Addr# literals) construct for compile-time ByteString literals can be something like the below (for demo purposes stashed away in .../Data/ByteString/Literal.hs):
module Data.ByteString.Literal (byteStringLiteral) where

import qualified Data.ByteString.Char8 as B
import qualified Language.Haskell.TH.Syntax as TH

byteStringLiteral :: (MonadFail m, TH.Quote m) => String -> TH.Code m B.ByteString
byteStringLiteral s = TH.liftCode $ fmap TH.TExp $
    if all (<= '\xff') s
        then TH.lift (B.pack s)
        else fail $ "Input code points exceed 0xFF in: " ++ show s
which then makes possible:
$ cat test.hs
module Foo where
import Data.ByteString.Literal
$ ghci -v0 -XTemplateHaskell -ddump-splices -i... test.hs
λ> import qualified Data.ByteString as B
λ>
λ> bs = $$(byteStringLiteral "foo\0bar\xA0\&baz")
<interactive>:3:8-46: Splicing expression
byteStringLiteral "foo\0bar\xA0\&baz"
======>
bytestring-0.12.2.0-inplace:Data.ByteString.Internal.Type.unsafePackLenLiteral
11 "foo\NULbar\\160baz"#
λ>
λ> B.unpack bs
[102,111,111,0,98,97,114,160,98,97,122]
λ>
λ> bs = $$(byteStringLiteral "foo\0βαρ")
<interactive>:5:8: error: [GHC-39584]
• Input code points exceed 0xFF in: "foo\NUL\946\945\961"
• In the Template Haskell splice $$(byteStringLiteral "foo\0βαρ")
In the expression: $$(byteStringLiteral "foo\0βαρ")
In an equation for ‘bs’: bs = $$(byteStringLiteral "foo\0βαρ")
Large objects are pinned too (as constantly copying them on every major collection is not cheap). So if you create a large enough (about 4k or more) ByteArray from a ByteString, you will get another pinned object and an extra memory copy.
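This can be observed from Haskell with the isByteArrayPinned# primop, which reports whether the RTS considers an array immovable. The sketch below assumes Data.ByteString.Short.Internal exposes SBS; the name isPinned is invented. With GHC's default collector I would expect a small array to report False and a large one True, but treat that as an assumption about the RTS rather than a guarantee.

```haskell
{-# LANGUAGE MagicHash #-}
import qualified Data.ByteString as B
import Data.ByteString.Short (toShort)
import Data.ByteString.Short.Internal (ShortByteString (..))
import GHC.Exts (isByteArrayPinned#, isTrue#)

-- Ask the RTS whether the backing array is immovable (explicitly
-- pinned, or a large object that the GC will not copy).
isPinned :: ShortByteString -> Bool
isPinned (SBS ba#) = isTrue# (isByteArrayPinned# ba#)

-- A few bytes copied out of a ByteString.
small :: ShortByteString
small = toShort (B.pack [1, 2, 3])

-- Well past the large-object threshold.
large :: ShortByteString
large = toShort (B.replicate 100000 0)
```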
The fragmentation can easily be solved by using chunks of the same size. Lazy ByteString uses ~32kB chunks. Builders do the same (they accumulate data into chunks of fixed size). If the goal is to reduce fragmentation then it’s already achieved by lazy ByteStrings and builders.
Since OS I/O functions require valid memory buffers there’s no way not to use pinned memory. If there’s a need to store a subset of the data in a long-running program then I’m afraid there’s no other way than to create a Text or ShortByteString from a ByteString and make sure it’s forced, so it only contains a copy and doesn’t keep a pointer to the whole original ByteString.
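A minimal sketch of "copy and force", assuming the retained part is a prefix of the buffer; retainPrefix is a made-up name. evaluate forces the ShortByteString to weak head normal form, which is enough for the copy to have happened, so no thunk referring to the original buffer is kept.

```haskell
import Control.Exception (evaluate)
import qualified Data.ByteString as B
import qualified Data.ByteString.Short as SBS

-- Copy the first n bytes out of an I/O buffer and force the copy
-- right away, so the large pinned buffer can be collected once the
-- caller drops it.
retainPrefix :: Int -> B.ByteString -> IO SBS.ShortByteString
retainPrefix n buf = evaluate (SBS.toShort (B.take n buf))
```

Data.ByteString.copy plays the same role when the result should stay a (pinned) strict ByteString.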
Indeed this can help with fragmentation, but it’s not always possible or desirable to do this. Often it’s also very difficult to enforce. The 32KB chunk size is just a default in bytestring and it’s very easy to, for instance, append two chunks without thinking.
These large chunks are also often too big, so you still need to copy it into something smaller, and if that’s pinned then you have the issue again.
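Nothing in the lazy ByteString type enforces a chunk size, which a short sketch makes visible. The names chunkSizes, uneven and merged are invented for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL

-- Sizes of the strict chunks that make up a lazy ByteString.
chunkSizes :: BL.ByteString -> [Int]
chunkSizes = map B.length . BL.toChunks

-- fromChunks keeps whatever sizes it is given: a tiny chunk
-- followed by an oversized one.
uneven :: BL.ByteString
uneven = BL.fromChunks [B.replicate 10 0, B.replicate 70000 1]

-- Appending two 32 KiB strict chunks before going lazy produces a
-- single 64 KiB chunk.
merged :: BL.ByteString
merged = BL.fromStrict (B.replicate 32768 0 <> B.replicate 32768 1)
```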
Yeah, that’s because LazyByteString is just [StrictByteString], it’s a terrible format for anything except streaming from a socket or flattening a Builder to (stream to a socket). Best you can do to dissuade people from doing complex text manipulation with it is to make it clear that it’s a list and to offer a better suited type in a nearby module (e.g. a rope).
But yes, it’s very easy to create small bytestrings and get fragmentation issues.
Unfortunately, I/O is a complex task that cannot be solved simply and efficiently in a general way. One needs to know several things to design an efficient system for each use case:
I/O buffers are pinned.
Large Text or ShortByteString are also pinned.
Long-lived pinned data allocated at random points in time lead to fragmentation, but long-lived movable data slows down every major GC.
Chunks of the same size help a lot with fragmentation.
Builders try to keep chunks of the same size inside, but big ByteStrings could be added as is.
If Text/ShortByteString is not forced it will still contain a closure with a reference to the original large pinned buffer, but closures (or other slice data structures) can use less space than the copied string and be more performant.
A memory copy is not cheap compared to modern I/O speeds (easily 1-2GB/s per device). It’s ~20GB/s if not in the cache; much slower if not sequential; much faster if in the cache.
Different tasks have different trade-offs and the solution depends on the data you’re processing.
The simple solution of always copying pinned inputs to unpinned memory can lead to:
extra memory copying
extra load on every major GC to move the data
potentially the same pinned array
potentially greater memory usage if only a small slice of the input is used
It might be fine in many use cases but I wouldn’t add it as the default approach to basic libraries.
A StrictByteString can be constructed with externally allocated memory, so it can be used over any pointer you want as long as you manage memory lifetimes properly. I personally haven’t found any use for it, but there may be some unique cases where this is desirable.
A ByteArray on the other hand is always safe, and if you need to copy from it you can check if it’s pinned upfront.
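For the externally allocated case, a minimal sketch using Data.ByteString.Unsafe.unsafePackMallocCStringLen, which wraps a malloc'd buffer without copying and attaches free as its finaliser. The name wrapMalloc is made up, and mallocBytes stands in for memory handed over by a C library.

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackMallocCStringLen)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (mallocBytes)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (pokeByteOff)

-- Allocate n bytes outside the Haskell heap, fill them with
-- 0, 1, 2, ... and hand ownership to a ByteString. The buffer is
-- freed by the finaliser, so it must not be freed or mutated again.
wrapMalloc :: Int -> IO B.ByteString
wrapMalloc n = do
  p <- mallocBytes n :: IO (Ptr Word8)
  mapM_ (\i -> pokeByteOff p i (fromIntegral i :: Word8)) [0 .. n - 1]
  unsafePackMallocCStringLen (castPtr p, n)
```

The lifetime rules are entirely on the caller here, which is the contrast with ByteArray drawn above.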
Not really, C libraries I’ve used always juggle the following three approaches:
C string size is known upfront, so allocation is user’s responsibility (fread, ov_read, spng_decoded_image_size). This seems to always be when reading or parsing something, so the resulting arrays are large and uniform in size, but you generally feed these somewhere else after instead of keeping them, so I don’t think malloc would do anything here;
C string lifetime is tied to some object (glfwGetMonitorName, strings in FT_FaceRec, VkPhysicalDeviceProperties.deviceName). You may be safe to consume the string on the spot if your application is single-threaded, but in general using ForeignPtrs here is shooting yourself in the foot;
C string is baked in at compilation time (pretty much just version strings and error messages). These are safe to refer to by the pointer, but they’re also tiny, so it doesn’t matter either way.
As I see it ForeignPtr representation is really just there to allow me to use malloc instead of a pinned ByteArray, and if the difference between the two ever starts to matter I probably have far bigger issues than long garbage collection times.