The Quest to Completely Eradicate `String` Awkwardness

Bodigrim · August 10, 2024, 10:47pm

Yeah, good point.

I’d suggest Data.Text.Type.

rhendric · August 10, 2024, 11:09pm

Text depends on ByteArray, which depends on various things that use Prelude, which gets in the way of exporting Text in Prelude.

Should I:

Move data ByteArray to a new Data.Array.Byte.Type module and move its instances to their class modules
Make those various things depend on specific GHC.Internal.* modules instead of Prelude
Something else I haven’t thought of?

Bodigrim · August 10, 2024, 11:16pm

Data.Byte.Array already depends on a bunch of GHC.Internal.* modules, so option (2) seems the easiest.

hasufell · August 11, 2024, 3:09am

Just as a side note: base uses this property (surrogate range) for FilePath roundtripping (PEP-383).

So we wouldn’t be able to change the definition of Char, unless we also move OsPath to base.

Well, yes. But I’d also say that time has proven that base is the last place to receive a redesign of historical mistakes.

So where does that leave us?

Sure, but:

String is a bad default
and yet it is deeply ingrained in our standard library
it is a constant educational challenge

So what we’re facing here isn’t really lack of better alternatives, but sh*tty defaults.

And defaults do matter.

rhendric · August 11, 2024, 3:26am

I’ve spent a fair amount of time this evening trying to implement what @bodigrim wants, and while the additions to base didn’t end up being too egregious, moving instance Binary Text into binary is turning out quite ghastly. Here is an incomplete diff of just Data.Binary.Class.hs; it doesn’t compile yet because there are still more things to copy over.

It's long.

diff --git a/src/Data/Binary/Class.hs b/src/Data/Binary/Class.hs
index 0b79743..e91e533 100644
--- a/src/Data/Binary/Class.hs
+++ b/src/Data/Binary/Class.hs
@@ -4,6 +4,11 @@
 {-# LANGUAGE ScopedTypeVariables #-}
 {-# LANGUAGE PatternGuards #-}
 {-# LANGUAGE Trustworthy #-}
+{-# LANGUAGE UnboxedTuples #-}
+{-# LANGUAGE BangPatterns #-}
+{-# LANGUAGE MagicHash #-}
+{-# LANGUAGE RankNTypes #-}
+{-# LANGUAGE UnboxedTuples #-}
 
 #if __GLASGOW_HASKELL__ >= 706
 {-# LANGUAGE PolyKinds #-}
@@ -127,6 +132,18 @@ import GHC.Fingerprint
 
 import Data.Version (Version(..))
 
+import Control.Monad.ST.Unsafe (unsafeSTToIO)
+import qualified Data.ByteString.Internal as B
+import qualified Data.ByteString.Short.Internal as SBS
+import Data.Array.Byte (ByteArray(..), MutableByteArray(..))
+import Data.Char (chr)
+import Data.Text.Type (Text(..))
+import GHC.Exts (Addr#, Int(..), byteArrayContents#, copyByteArray#, indexWord8OffAddr#, newByteArray#, newPinnedByteArray#, unsafeCoerce#, unsafeFreezeByteArray#)
+import GHC.ForeignPtr (ForeignPtr(..), ForeignPtrContents(PlainPtr))
+import GHC.IO (unsafeDupablePerformIO)
+import GHC.ST (ST(..), runST)
+import GHC.Word (Word8(..))
+
 ------------------------------------------------------------------------
 
 -- Factored into two classes because this makes GHC optimize the
@@ -855,6 +872,288 @@ instance Binary a => Binary (NE.NonEmpty a) where
   put = put . NE.toList
 #endif
 
+------------------------------------------------------------------------
+-- Text
+
+instance Binary Text where
+    put t = put (encodeUtf8 t)
+    get   = do
+      bs <- get
+      case decodeUtf8' bs of
+        Left exn -> fail (show exn)
+        Right a -> return a
+
+-- | Decode a 'ByteString' containing UTF-8 encoded text.
+--
+-- If the input contains any invalid UTF-8 data, the relevant
+-- exception will be returned, otherwise the decoded text.
+decodeUtf8' :: B.ByteString -> Either UnicodeException Text
+decodeUtf8' = unsafeDupablePerformIO . try . evaluate . decodeUtf8With strictDecode
+{-# INLINE decodeUtf8' #-}
+
+-- | State of decoding a 'ByteString' in UTF-8.
+-- Enables incremental decoding ('validateUtf8Chunk', 'validateUtf8More',
+-- 'decodeUtf8Chunk', 'decodeUtf8More').
+
+-- Internal invariant:
+-- the first component is the initial state if and only if
+-- the second component is empty.
+--
+-- @
+-- 'utf9CodePointState' s = 'utf8StartState'
+-- <=>
+-- 'partialUtf8CodePoint' s = 'PartialUtf8CodePoint' 0
+-- @
+data Utf8State = Utf8State
+  { -- | State of the UTF-8 state machine.
+    utf8CodePointState :: {-# UNPACK #-} !DecoderState
+    -- | Bytes of the currently incomplete code point (if any).
+  , partialUtf8CodePoint :: {-# UNPACK #-} !PartialUtf8CodePoint
+  }
+  deriving (Eq, Show)
+
+-- | An exception type for representing Unicode encoding errors.
+data UnicodeException =
+    DecodeError String (Maybe Word8)
+    -- ^ Could not decode a byte sequence because it was invalid under
+    -- the given encoding, or ran out of input in mid-decode.
+  | EncodeError String (Maybe Char)
+    -- ^ Tried to encode a character that could not be represented
+    -- under the given encoding, or ran out of input in mid-encode.
+
+type OnDecodeError = String -> Maybe Word8 -> Maybe Char
+
+-- | A delayed representation of strict 'Text'.
+data StrictBuilder = StrictBuilder
+  { sbLength :: {-# UNPACK #-} !Int
+  , sbWrite :: forall s. MutableByteArray s -> Int -> ST s ()
+  }
+
+-- | Use 'StrictBuilder' to build 'Text'.
+toText :: StrictBuilder -> Text
+toText (StrictBuilder 0 _) = mempty
+toText (StrictBuilder n write) = runST (do
+  dst <- new n
+  write dst 0
+  arr <- unsafeFreeze dst
+  pure (Text arr 0 n))
+
+-- | Copy a 'ByteString'.
+--
+-- Unsafe: This may not be valid UTF-8 text.
+unsafeFromByteString :: ByteString -> StrictBuilder
+unsafeFromByteString bs =
+  StrictBuilder (B.length bs) (\dst ofs -> copyFromByteString dst ofs bs)
+
+decodeUtf8With :: OnDecodeError -> B.ByteString -> Text
+decodeUtf8With onErr = decodeUtf8With1 onErr invalidUtf8Msg
+
+-- | Helper for 'Data.Text.Encoding.decodeUtf8With'.
+--
+
+-- This could be shorter by calling 'decodeUtf8With2' directly, but we make the
+-- first call validateUtf8Chunk directly to return even faster in successful
+-- cases.
+decodeUtf8With1 :: OnDecodeError -> String -> B.ByteString -> Text
+decodeUtf8With1 onErr msg bs = validateUtf8ChunkFrom 0 bs $ \len ms -> case ms of
+  Just s
+    | len == B.length bs ->
+      let !(SBS.SBS arr) = SBS.toShort bs in
+      Text (ByteArray arr) 0 len
+    | otherwise -> toText $
+        unsafeFromByteString (B.take len bs) <> skipIncomplete onErr msg s
+  Nothing ->
+   let (builder, _, s) = decodeUtf8With2 onErr msg startUtf8State (B.drop (len + 1) bs) in
+   toText $
+     unsafeFromByteString (B.take len bs) <>
+     handleUtf8Error onErr msg (B.index bs len) <>
+     builder <>
+     skipIncomplete onErr msg s
+
+-- | Helper for 'Data.Text.Encoding.decodeUtf8With',
+-- 'Data.Text.Encoding.streamDecodeUtf8With', and lazy
+-- 'Data.Text.Lazy.Encoding.decodeUtf8With',
+-- which use an 'OnDecodeError' to process bad bytes.
+--
+-- See 'decodeUtf8Chunk' for a more flexible alternative.
+--
+-- @since 2.0.2
+decodeUtf8With2 :: OnDecodeError -> String -> Utf8State -> B.ByteString -> (StrictBuilder, B.ByteString, Utf8State)
+decodeUtf8With2 onErr msg s0 bs = loop s0 0 mempty
+  where
+    loop s i !builder =
+      let nonEmptyPrefix len = builder
+            <> utf8StateToStrictBuilder s
+            <> unsafeFromByteString (B.take len (B.drop i bs))
+      in validateUtf8MoreCont s (B.drop i bs) $ \len ms -> case ms of
+        Nothing ->
+          if len < 0
+          then
+            -- If the first byte cannot complete the partial code point in s,
+            -- retry from startUtf8State.
+            let builder' = builder <> skipIncomplete onErr msg s
+            -- Note: loop is strict on builder, so if onErr raises an error it will
+            -- be forced here, short-circuiting the loop as desired.
+            in loop startUtf8State i builder'
+          else
+            let builder' = nonEmptyPrefix len
+                  <> handleUtf8Error onErr msg (B.index bs (i + len))
+            in loop startUtf8State (i + len + 1) builder'
+        Just s' ->
+          let builder' = if len <= 0 then builder else nonEmptyPrefix len
+              undecoded = if B.length bs >= partUtf8Len (partialUtf8CodePoint s')
+                then B.drop (i + len) bs  -- Reuse bs if possible
+                else partUtf8ToByteString (partialUtf8CodePoint s')
+          in (builder', undecoded, s')
+
+invalidUtf8Msg :: String
+invalidUtf8Msg = "Data.Binary.Class: Invalid UTF-8 stream"
+
+-- | Encode text using UTF-8 encoding.
+encodeUtf8 :: Text -> B.ByteString
+encodeUtf8 (Text arr off len)
+  | len == 0  = B.empty
+  -- It would be easier to use Data.ByteString.Short.fromShort and slice later,
+  -- but this is undesirable when len is significantly smaller than length arr.
+  | otherwise = unsafeDupablePerformIO $ do
+    marr@(MutableByteArray mba) <- unsafeSTToIO $ newPinned len
+    unsafeSTToIO $ copyI len marr 0 arr off
+    let fp = ForeignPtr (byteArrayContents# (unsafeCoerce# mba))
+                        (PlainPtr mba)
+    pure $ B.fromForeignPtr fp 0 len
+
+-- | Copy some elements of an immutable array.
+-- This code is duplicated from Data.Text.Type for (mumble mumble) reasons.
+copyI :: Int                    -- ^ Count
+      -> MutableByteArray s     -- ^ Destination
+      -> Int                    -- ^ Destination offset
+      -> ByteArray              -- ^ Source
+      -> Int                    -- ^ Source offset
+      -> ST s ()
+copyI (I# count#) (MutableByteArray dst#) (I# dstOff#) (ByteArray src#) (I# srcOff#) =
+  ST $ \s1# ->
+    case copyByteArray# src# srcOff# dst# dstOff# count# s1# of
+      s2# -> (# s2#, () #)
+{-# INLINE copyI #-}
+
+-- | Create an uninitialized mutable array.
+-- This code is duplicated from Data.Text.Type for (mumble mumble) reasons.
+new :: Int -> ST s (MutableByteArray s)
+new (I# len#) = ST $ \s1# ->
+  case newByteArray# len# s1# of
+    (# s2#, marr# #) -> (# s2#, MutableByteArray marr# #)
+{-# INLINE new #-}
+
+-- | Create an uninitialized mutable pinned array.
+newPinned :: forall s. Int -> ST s (MutableByteArray s)
+newPinned (I# len#) =
+  ST $ \s1# ->
+    case newPinnedByteArray# len# s1# of
+      (# s2#, marr# #) -> (# s2#, MutableByteArray marr# #)
+{-# INLINE newPinned #-}
+
+-- | Freeze a mutable array. Do not mutate the 'MutableByteArray' afterwards!
+-- This code is duplicated from Data.Text.Type for (mumble mumble) reasons.
+unsafeFreeze :: MutableByteArray s -> ST s ByteArray
+unsafeFreeze (MutableByteArray marr) = ST $ \s1# ->
+  case unsafeFreezeByteArray# marr s1# of
+    (# s2#, ba# #) -> (# s2#, ByteArray ba# #)
+{-# INLINE unsafeFreeze #-}
+
+
+-- | Prefix of a UTF-8 code point encoded in 4 bytes,
+-- possibly empty.
+--
+-- - The most significant byte contains the number of bytes,
+--   between 0 and 3.
+-- - The remaining bytes hold the incomplete code point.
+-- - Unused bytes must be 0.
+--
+-- All of operations available on it are the functions below.
+-- The constructor should never be used outside of those.
+--
+-- @since 2.0.2
+newtype PartialUtf8CodePoint = PartialUtf8CodePoint Word32
+  deriving (Eq, Show)
+
+-- | Empty prefix.
+partUtf8Empty :: PartialUtf8CodePoint
+partUtf8Empty = PartialUtf8CodePoint 0
+
+-- | Length of the partial code point, stored in the most significant byte.
+partUtf8Len :: PartialUtf8CodePoint -> Int
+partUtf8Len (PartialUtf8CodePoint w) = fromIntegral $ w `shiftR` 24
+
+word8ToInt :: Word8 -> Int
+word8ToInt = fromIntegral
+
+-------------------------------------------------------------------------------
+-- Naive UTF8 decoder.
+-- See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for the explanation of the state machine.
+
+newtype ByteClass = ByteClass Word8
+
+byteToClass :: Word8 -> ByteClass
+byteToClass n = ByteClass (W8# el#)
+  where
+    !(I# n#) = word8ToInt n
+    el# = indexWord8OffAddr# table# n#
+
+    table# :: Addr#
+    table# = "\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\NUL\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\SOH\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\a\b\b\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\STX\n\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\ETX\EOT\ETX\ETX\v\ACK\ACK\ACK\ENQ\b\b\b\b\b\b\b\b\b\b\b"#
+
+newtype DecoderState = DecoderState Word8
+  deriving (Eq, Show)
+
+utf8AcceptState :: DecoderState
+utf8AcceptState = DecoderState 0
+
+utf8RejectState :: DecoderState
+utf8RejectState = DecoderState 12
+
+updateState :: ByteClass -> DecoderState -> DecoderState
+updateState (ByteClass c) (DecoderState s) = DecoderState (W8# el#)
+  where
+    !(I# n#) = word8ToInt (c + s)
+    el# = indexWord8OffAddr# table# n#
+
+    table# :: Addr#
+    table# = "\NUL\f\CAN$<`T\f\f\f0H\f\f\f\f\f\f\f\f\f\f\f\f\f\NUL\f\f\f\f\f\NUL\f\NUL\f\f\f\CAN\f\f\f\f\f\CAN\f\CAN\f\f\f\f\f\f\f\f\f\CAN\f\f\f\f\f\CAN\f\f\f\f\f\f\f\CAN\f\f\f\f\f\f\f\f\f$\f$\f\f\f$\f\f\f\f\f$\f$\f\f\f$\f\f\f\f\f\f\f\f\f\f"#
+
+updateDecoderState :: Word8 -> DecoderState -> DecoderState
+updateDecoderState b s = updateState (byteToClass b) s
+
+newtype CodePoint = CodePoint Int
+
+-- | @since 2.0
+data DecoderResult
+  = Accept !Char
+  | Incomplete !DecoderState !CodePoint
+  | Reject
+
+-- | @since 2.0
+utf8DecodeStart :: Word8 -> DecoderResult
+utf8DecodeStart !w
+  | st == utf8AcceptState = Accept (chr (word8ToInt w))
+  | st == utf8RejectState = Reject
+  | otherwise             = Incomplete st (CodePoint cp)
+  where
+    cl@(ByteClass cl') = byteToClass w
+    st = updateState cl utf8AcceptState
+    cp = word8ToInt $ (0xff `unsafeShiftR` word8ToInt cl') .&. w
+
+-- | @since 2.0
+utf8DecodeContinue :: Word8 -> DecoderState -> CodePoint -> DecoderResult
+utf8DecodeContinue !w !st (CodePoint !cp)
+  | st' == utf8AcceptState = Accept (chr cp')
+  | st' == utf8RejectState = Reject
+  | otherwise              = Incomplete st' (CodePoint cp')
+  where
+    cl  = byteToClass w
+    st' = updateState cl st
+    cp' = (cp `shiftL` 6) .|. word8ToInt (w .&. 0x3f)
+
+
 ------------------------------------------------------------------------
 -- Typeable/Reflection

Is this really a good idea?

Bodigrim · August 11, 2024, 8:10am

We can reverse order of dependencies and make binary depend on text, not vice versa as it currently is. Then all encodeUtf8 machinery stays in text and binary just imports Data.Text.Encoding and defines an instance.

Bodigrim · August 11, 2024, 9:16am

I mean, it’s valid to say that you want Text to be available out of the box without a single import / language extension. Is it the problem statement we are trying to solve here? Or is there something else?

ribosomerocker · August 11, 2024, 11:34pm

I think it’s inconvenient for a few reasons, though one is not related to the extension, but the ecosystem:

OverloadedStrings is kind of well known to make type inference more annoying and type errors a little bit obscure. Sometimes, it can’t really decide whether something should be a String or a Text and it also errors…
Even with OverloadedString, it’s not a panacea as while a string literal may be interpreted as any type, the resulting type of the function could be any other type of string, and you’d still need to do conversion there manually via pack/unpack, and the errors there are also a little annoying to spot… it doesn’t erase the Text/String split when it’s the output
Like for the above, you could end up doing really weird things with OverloadedStrings like using functions that deal with strings and midway switching to functions that deal with text in the same codebase if you’re not paying attention… though I think this might be a little uncommon, as I haven’t seen anyone complain about it. This is kind of related to the point about OverloadedStrings not solving the output type being different, so while you could run some algorithms on String because you’re using base and your functions just so happen to default to string, you could end up using a function in base or any other library like pretty-simple etc. that turns it into a string that you also need to process, so you actually switch to using Text there… and I know that this isn’t really sane to complain about, but I feel like just the fact that someone could do that is just really… I don’t really know what to say except it’s really jank, I guess.

simonmic · August 14, 2024, 5:53am

Data point: when hledger switched from (mostly) String to (mostly) strict Text there was only a small speedup.

simonmic · August 14, 2024, 7:04am

When testing in GHCI, you may get unclear errors because it’s not set. After setting it with :set, you may get unclear errors when you try to reload, so you have to unset it again. :seti sounds like a solution, but this is another thing to discover/explain.

simonmic · August 14, 2024, 7:08am

You need to add 1-3 lines of imports, in every file where you use it
You need to understand and choose Lazy vs Strict
You need to know how to import qualified to avoid clashes with built in names
You need to write verbose prefixes where using qualified names
You need to learn a new API (text’s) in addition to the one you already learned (Data.List’s)
Before ghc 8.4, you needed to ensure the text package was installed and in scope
You need to convert to and from all the libraries and builtins that use String
You need to explain all this to each new haskeller.

tomjaguarpaw · August 14, 2024, 7:16am

If we assume that String is never actually going to go away (because it would break too much code), numbers 2, 3, 4, 5, 7 and 8 can’t be resolved in any way, can they? And 6 by definition is no longer a problem, which leaves “you need to add 1-2 lines of imports” as the only item on your list which could actually be improved (by exporting everything you use in Text from the Prelude). Have I got that right?

Bodigrim · August 14, 2024, 9:32pm

This is to be resolved by a new {-# LANGUAGE NamedDefaults #-} pragma.

@ribosomerocker as for (2) and (3) even if there was no String at all, you’d still have to do conversions between lazy and strict Text, builders, bytestrings, etc. I’m not sure what can be meaningfully improved in this area short of using a dynamically typed language.

sgraf · August 18, 2024, 2:30pm

Re: template-haskell should not depend on arbitrary Text functions: Note that I presume that Text will live in ghc-internal, where all the implementation of template-haskell lives as well (at least on GHC master). It’s just a matter of selecting what to export where. Could even put the impl for encoding/decoding there; it’s unlikely that its API changes much, so the churn caused by supporting a big version range in text/binary etc. should be acceptable.

hasufell · August 19, 2024, 4:40am

I’m worried ghc-internal will become a god-package, making efforts of reinstallable base even harder.

The idea is not that all implementation lives there. The idea is that it contains ghc internals, so that packages can write portable code across GHC versions, covering for precisely the API differences in ghc-internal.

sgraf · August 19, 2024, 6:25am

That’s fair, I was just saying that moving shared code into ghc-internal is a relatively low-cost option if a dependency from binary to text or vice versa should be avoided.

It is even possible to move it into ghc-internal for now and decide later on what to move into which package. An incremental path to make more logic reinstallable.

sgraf · August 19, 2024, 6:25am

Great to see progress on moving Text into base, btw.!

I wonder if Text (or an adapter thereof) can be a transient data structure.
That is, a persistent data structure that can be converted into an ephemeral mutable one with a fast lazy copy-on-write conversion function (works best if the data structure is a tree to begin with).
One would use the persistent data structure when purity/convenience/persistence is more important than performance, and switch to the ephemeral ST-based data structure when performance is critical.
I think that would give users the best of both convenient String construction/pattern-matching as well as efficient representation.

Apart from Edward Kmett’s transients sketch (which is unfortunately buggy), I have not seen many people experiment with transients in Haskell, which I find a bit sad. I mentored a bachelor’s thesis on the topic, but the work was not merged into GHC.

But let’s not lose focus on step 1: Provide convenient access to Text by putting it into base.

atravers · September 21, 2024, 5:07am

There it is:

Compact List Representation: Definition, Garbage Collection, and System Implementation, Wilfred J. Hansen (1969).

Compact lists are stored sequentially in memory, rather than chained with pointers. Since this is not always convenient, the Swym system permits a list to be chained, compact, or any combination of the two. A description is given of that list representation and the operators implemented (most are similar to those of LISP 1.5).

This could be the basis for the serialised-tree format used in the aforementioned Gibbon compiler project…

BurningWitness · October 8, 2024, 5:25pm

I find that text happily misses most of my use cases:

For bulk processing the only thing I’m looking for is the proof that the data inside is UTF-8. A newtype over a LazyByteString should be more than good enough for this, and it would be nice to have packages for all commonly used encodings in a similar fashion;
For small dynamic strings (in my experience overwhelmingly logging and error reporting) I don’t want to use StrictText because I don’t want to pay O(n) for appends. Builder does not allow edits, String is too slow by definition, LazyText is biased in one direction. A rope would fill a niche here.
Small static strings could well be a newtype over ShortByteString;

I don’t know if the performance gains from using text's definitely-ByteArray-backed types as opposed to bytestring's generally-ByteArray-backed ones justify the migraine of having to convert between the two.

blamario · October 8, 2024, 8:07pm

Would ByteStringUTF work for you if it wrapped a Lazy bytestring instead of the strict one?