The Quest to Completely Eradicate `String` Awkwardness

My suggestion is… kind of to improve this in library land. base is a library, right? All we have to do is remove every single usage of String and replace any functions that use it with the better alternatives, whether it’s Text or something else.

1 Like

base is not capable of such drastic changes in the current Haskell ecosystem, or at least does not have the resources to do so. The CLC oversees proposals, which are submitted by members of the Haskell community, who wish to make a change. There has been no coordinated plan to address this problem in a serious way, engaging the community to accept and migrate, or otherwise.

Ah, yeah, I see that. I suppose this would best be a CLC proposal at some point. I think it is more beneficial for everyone to get all popular libraries moving off of String.

1 Like

I was making a broader point about all other string types, String is probably the least practical out of the bunch (albeit still neat).

os-string was released last year, alongside the changes to filepath, directory, unix and Win32 packages. You can already avoid pretty much all String-related IO, except perhaps reading locale-dependent C strings.

The topic of Strings being inefficient is not contentious, it’s about the single largest point of agreement in the entire community.

Type classes are bound by laws, and the laws forShow/Read is that they’re classes of things that roughly can be copied from stdout to your code file and back to stdin. You’re free to choose a lawless typeclass to simplify printing, but that set of personal preferences will necessarily exclude half the community who will happily disagree on every single instance.

Builder is a general way to serialize anything to raw bytes in the exact format you want it to be in, both the smallest necessary tool for any such job and a showcase demonstrating you don’t need type classes for this.

In my understanding Dependent Haskell is about being able to compile type-level data into the runtime to extend type-checking, so I don’t think it’s anywhere near.

While we’re in a tangent, I was working on solving this problem (GitHub - chrisdone-archive/present: Make presentations for data types) see the extension section and how the output contains many choices for a given type. The presentation generated lazily, so an editor could walk through a nested structure or even infinite ones. I had an Emacs mode which talks to GHCi.

I gave up after getting stuck on the implementation, but I think that’s the direction I’d take Haskell if I had a magic wand. Seeing what’s going in my Haskell programs is often not possible due to type erasure and lack of good debugging facilities. It’s not even that I want to walk through my code with a stepper, I just want to see the data, with a rich representation that I don’t have to prepare in advance.

8 Likes

Text is the better solution when you just want to shove bytes you can read and understand, somewhere (the web, a file). Which is 99% most of the time. So yeah, you want String to be Text and you want it to work like Java (it replaces String with StringBuilder when it can, do you really want to iterate all over each previous String to concat another when you just want a lone big String at the end?). This is very good because you help people fall on the pit of success.

Now, the current value of String is educational. It’s a list. List functions, functor functions, monad functions, all apply. You teach students to come up with reverse on a list, and then teach them how to use it. You make them compose the acquired list tools to solve String related problems.

This is the one in base, students can come up with either.

reverse                 :: [a] -> [a]
#if defined(USE_REPORT_PRELUDE)
reverse                 =  foldl (flip (:)) []
#else
reverse l =  rev l []
  where
    rev []     a = a
    rev (x:xs) a = rev xs (x:a)
#endif

A classic problem example and the solution:

isPalindrome :: String -> Bool
isPalindrome ss = ss == reverse ss

Now, if it was:

isPalindrome :: Text -> Bool
isPalindrome ss = ss == T.reverse ss

You would be showing them this:

reverse ::
#if defined(ASSERTS)
  HasCallStack =>
#endif
  Text -> Text
reverse (Text _ _ 0) = empty
reverse t            = reverseNonEmpty t

reverseNonEmpty ::
  Text -> Text
#if defined(PURE_HASKELL)
reverseNonEmpty (Text src off len) = runST $ do
    dest <- A.new len
    _ <- reversePoints src off dest len
    result <- A.unsafeFreeze dest
    pure $ Text result 0 len

reversePoints
    :: A.Array -- ^ Input array
    -> Int -- ^ Input index
    -> A.MArray s -- ^ Output array
    -> Int -- ^ Output index
    -> ST s ()
reversePoints src xx dest yy = go xx yy where
    go !_ y | y <= 0 = pure ()
    go x y =
        let pLen = utf8LengthByLeader (A.unsafeIndex src x)
            -- The next y is also the start of the current point in the output
            yNext = y - pLen
        in do
            A.copyI pLen dest yNext src x
            go (x + pLen) yNext
#else
reverseNonEmpty (Text (A.ByteArray ba) off len) = runST $ do
    marr@(A.MutableByteArray mba) <- A.new len
    unsafeIOToST $ c_reverse mba ba (fromIntegral off) (fromIntegral len)
    brr <- A.unsafeFreeze marr
    return $ Text brr 0 len
#endif

Monads, the ST monad, unsafe operations, Arrays and ByteArrays.
Hardly beginner material and not an inviting first look at Haskell.

Of course you can… not show them and “just trust me bro”. Or make them take the Text API for granted and work with that. Like how IO is represented and done and how the monad of the IO monad is glossed over at first.

2 Likes

In additions to the other problems mentioned here:

So while “just” replacing String with Text could reduce the existing awkwardness, new awkwardness would be introduced regarding where to use a lazy or strict version of some type.

Perhaps the lazy Text type could be replaced with a combination of the strict Text type and (lazy) lists i.e. [Text]:

(Lazy lists: in SWI-Prolog since 2019, in Haskell since 1990 ;-)

Then a merger of the two Text modules could at least be contemplated: if that really is possible, it would help to simplify the replacement of String.

2 Likes

I’m pretty sure LazyText chunks can start and end in the middle of a code point , so at best you’d have newtype LazyText = LazyText [TextChunk] (public) and newtype TextChunk = TextChunk StrictByteString1 (internal). You could force every chunk to end on code point boundaries, but then any LazyByteString that is known to be UTF-8 at the time of parsing would need to be explicitly rechunked to fit that expectation.

For ByteStrings however a public newtype LazyByteString = LazyByteString [StrictByteString1] definitely makes a lot of sense, that string being chunked doesn’t feel like an implementation detail.

compare

putStr . unlines  $ map show [1..]

with

import Data.Tex qualified as T
import Data.Text.IO qualified as T

T.putStr . T.unlines $ map (T.pack . show) [1..]

The first one start printing straight away, the second one hangs up until you run out of memory. Not exactly what I call “just better”.

However, the same code using Lazy.Text just work as the String version.

5 Likes

I confess I have not — yet! — read all of this discussion, but on Discourse there is no shame in opening a new topic if the discussion splinters in multiple sub-threads. It helps keep everything tidy.

This sort of gotcha is why “moving Text into base” is a huge task, not only for those adding to base but also for library maintainers who want their code to continue running on this new base version. It would be better to avoid touching Read and Show entirely, and provide a new Pretty class, which can be done in many ways, which is why it’s a problem better solved in library-land. Just depend on the library that has the Pretty you want in your own code! There’s no guarantee every base user will want the same one!

1 Like

Is there a library to get “natural looking in code” conversions between string types?

For example, I’d love if base would expose toString / toByteString / toText functions, instead of having to always import the pack / unpack functions from multiple separate modules. That’s what would be most helpful for me.

Also a toByteString should handle both strict and lazy return type based on what is expected in context, same with toText. Defaulting to UTF-8 for conversions (where applicable) would also feel like what I expect nowadays. And if you want a UTF-16 or whatever other encoding, reusing the default keyword for strings as well at the module level would make sense.

Let me hoogle that for you!

encode-string plus a couple of one line base functions works for me

import Text.Show.Pretty (ppShow)
import Data.String.Encode

-- alias for universal string converter (convertString) provided by encode-string
toS :: (ConvertString a b) => a -> b
toS = convertString

-- equivalent of show for text but always pretty print
txt :: (Show a) => a -> Text
txt = toS . ppShow

-- unpretty show hardly ever use
txtShow :: (Show a) => a -> Text
txtShow = toS . show

Now I default to text and throw in a toS whenever the compiler complains: e.g. String → Text above or
Text → String as below

resolveFile :: (MonadIO m) => Path Abs Dir -> Text -> m (Path Abs File)
resolveFile b = D.resolveFile b . toS
2 Likes

That would be a matter for implementors, not new Haskell arrivals:


Unfortunately, renaming Data.Text to e.g. Data.Text.Strict (so that Data.Text.Lazy can be the default) would now probably break too much existing code. It’s another reason to investigate the possibility of merging the two Text modules, using lazy lists of strict Text.


Nice. If there’s a similar way to replace the use of Read, a less-ambitious option (than eradicating String immediately) would be to simply mark String as deprecated, with a view to possibly moving all “stringy” types into libraries:

(yes, even one for type String = [Char]), especially considering just how many of them now exist:

But (again) it would also be nice to have only one Text type to nominate as the successor to String

2 Likes

As far as I remember, they cannot.

That is possible (in theory). String is a low-level type. I don’t think most haskellers really have a good intuition about what unicode code points are (which are not unicode scalar values which are not grapheme clusters).

So String is inherently a low-level type.

But I have not yet seen someone propose a migration strategy that won’t make it into history books as a giant failure.

4 Likes

I’m OK with this - having a world where strict Text is the default and saying "Text is the default string type, it is finite and strict." is a true statement. I think the only thing we actually do lose there is pure infinite strings like this, which would be perfectly fine as just [Text] or lazy text.

The change I’d make to base here to not have this example blow up, is making [1..] be an explicit infinite range type somehow (however large a bike-shed that is to paint); T.unlines should simply not be possible on it.

For the other kind of strings laziness where is used, lazy IO, currently supported by [Char] or lazy text I think is a tangible benefit to lose that. Not-so-hot take: It’s always better to use an actual stream type. Of course we could also keep [Char] and just drop the unsafeInterleaveIOs from the relevant functions, but while we are talking about pros and cons of switching to a strict text type as default, I think this would be one of them.

Alright, supposing that the intuitions of most Haskellers are lacking in that way…perhaps a newer programming language can help to provide the much-needed insight:

Hrm - does that mean the intuitions of most Rust users are also lacking in that exact same way? If so, then can anyone nominate a programming language that does support Unicode “properly” ? A migration strategy for Haskell that won’t make it into history books as a giant failure could then be based on that programming language…


Cold take: Browse and search packages | Hackage [for stream] - which one should a new Haskeller use?

As I see it, the only way a streaming type supersedes lazy lists (of characters or other values) in education is for it to be simple and standard Haskell. Educators can then rewrite all their course note with confidence, knowing that it will be supported indefinitely.

(I leave it as an exercise to determine the chances of a simple standard streaming type actually appearing…ever.)

1 Like

Perhaps no streaming library at all – many instructional programs using interact etc. (and needing lazy IO to produce immediate results, not blow up memory, etc.) look like forever (getLine >>= _ >>= mapM_ putStrLn). But streaming is a pretty good default.

1 Like