The Quest to Completely Eradicate `String` Awkwardness

mixphix · August 7, 2024, 1:38pm

This sort of gotcha is why “moving Text into base” is a huge task, not only for those adding to base but also for library maintainers who want their code to continue running on this new base version. It would be better to avoid touching Read and Show entirely, and provide a new Pretty class, which can be done in many ways, which is why it’s a problem better solved in library-land. Just depend on the library that has the Pretty you want in your own code! There’s no guarantee every base user will want the same one!

mhitza · August 7, 2024, 3:56pm

Is there a library to get “natural looking in code” conversions between string types?

For example, I’d love if base would expose toString / toByteString / toText functions, instead of having to always import the pack / unpack functions from multiple separate modules. That’s what would be most helpful for me.

Also a toByteString should handle both strict and lazy return type based on what is expected in context, same with toText. Defaulting to UTF-8 for conversions (where applicable) would also feel like what I expect nowadays. And if you want a UTF-16 or whatever other encoding, reusing the default keyword for strings as well at the module level would make sense.

mixphix · August 7, 2024, 3:59pm

Let me hoogle that for you!

theGhostJW · August 7, 2024, 8:20pm

encode-string plus a couple of one line base functions works for me

import Data.String.Encode
import Data.Text (Text)
import Text.Show.Pretty (ppShow)

-- Alias for the universal string converter (convertString) provided by encode-string
toS :: (ConvertString a b) => a -> b
toS = convertString

-- Equivalent of show for Text, but always pretty-printed
txt :: (Show a) => a -> Text
txt = toS . ppShow

-- Plain (non-pretty) show; hardly ever used
txtShow :: (Show a) => a -> Text
txtShow = toS . show

Now I default to Text and throw in a toS whenever the compiler complains, e.g. String → Text above or Text → String as below:

resolveFile :: (MonadIO m) => Path Abs Dir -> Text -> m (Path Abs File)
resolveFile b = D.resolveFile b . toS
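
For readers wondering how a "convert to whatever the context expects" helper can work at all, here is a minimal sketch using only text and bytestring. It is not encode-string's implementation: the class name Convert and the chosen set of instances are assumptions made for illustration, and UTF-8 is hard-wired as the encoding.

```haskell
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE MultiParamTypeClasses #-}

import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Hypothetical class (not encode-string's ConvertString): the target type
-- is chosen by whatever the surrounding context expects.
class Convert a b where
  convert :: a -> b

-- Conversions between String and strict/lazy Text.
instance Convert String T.Text where convert = T.pack
instance Convert String TL.Text where convert = TL.pack
instance Convert T.Text String where convert = T.unpack
instance Convert TL.Text String where convert = TL.unpack
instance Convert T.Text TL.Text where convert = TL.fromStrict
instance Convert TL.Text T.Text where convert = TL.toStrict

-- Conversions to bytes always encode as UTF-8 in this sketch.
instance Convert T.Text BS.ByteString where convert = TE.encodeUtf8
instance Convert T.Text BL.ByteString where convert = TLE.encodeUtf8 . TL.fromStrict
instance Convert String BS.ByteString where convert = TE.encodeUtf8 . T.pack
instance Convert String BL.ByteString where convert = TLE.encodeUtf8 . TL.pack

-- The same call yields a strict or a lazy ByteString depending on the signature.
greetingStrict :: BS.ByteString
greetingStrict = convert ("h\233llo" :: String)

greetingLazy :: BL.ByteString
greetingLazy = convert ("h\233llo" :: String)
```

The price of this style is that both the source and the target type must be known at each call site; an ambiguous one needs a type annotation.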

atravers · August 7, 2024, 10:33pm

That would be a matter for implementors, not new Haskell arrivals:

Unfortunately, renaming Data.Text to e.g. Data.Text.Strict (so that Data.Text.Lazy can be the default) would now probably break too much existing code. It’s another reason to investigate the possibility of merging the two Text modules, using lazy lists of strict Text.

Nice. If there’s a similar way to replace the use of Read, a less-ambitious option (than eradicating String immediately) would be to simply mark String as deprecated, with a view to possibly moving all “stringy” types into libraries:

(yes, even one for type String = [Char]), especially considering just how many of them now exist:

But (again) it would also be nice to have only one Text type to nominate as the successor to String…

Bodigrim · August 7, 2024, 11:33pm

As far as I remember, they cannot.

hasufell · August 8, 2024, 2:04am

That is possible (in theory). String is a low-level type. I don’t think most haskellers really have a good intuition about what unicode code points are (which are not unicode scalar values which are not grapheme clusters).

So String is inherently a low-level type.

But I have not yet seen someone propose a migration strategy that won’t make it into history books as a giant failure.
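
To make the code point versus scalar value distinction concrete, here is a small sketch, assuming GHC with the text package. A Haskell Char can hold a lone surrogate, which is a code point but not a scalar value, whereas Data.Text only stores scalar values and substitutes U+FFFD when packing. The names below are made up for illustration.

```haskell
import Data.Char (GeneralCategory (Surrogate), chr, generalCategory)
import qualified Data.Text as T

-- U+D800 is a code point, so it is a legal Char, but it is not a scalar value.
loneSurrogate :: Char
loneSurrogate = chr 0xD800

-- True for code points in the surrogate range.
isSurrogateCodePoint :: Char -> Bool
isSurrogateCodePoint c = generalCategory c == Surrogate

-- Round-tripping through Text replaces the surrogate with U+FFFD.
viaText :: String
viaText = T.unpack (T.pack [loneSurrogate])
```

So a String and the Text built from it are not guaranteed to hold the same characters.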

mikeplus64 · August 8, 2024, 2:51am

I’m OK with this - having a world where strict Text is the default and saying "Text is the default string type, it is finite and strict." is a true statement. I think the only thing we actually do lose there is pure infinite strings like this, which would be perfectly fine as just [Text] or lazy text.

The change I’d make to base here to not have this example blow up, is making [1..] be an explicit infinite range type somehow (however large a bike-shed that is to paint); T.unlines should simply not be possible on it.
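
A small sketch of the alternatives mentioned above, assuming the text package: the unbounded sequence of lines can be kept as a lazy Text or as a plain list of strict Text values, and both can be consumed incrementally. The names are invented for the example.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- An unbounded lazy Text: one decimal number per line.
numberedLines :: TL.Text
numberedLines = TL.unlines (map (TL.pack . show) [1 :: Integer ..])

-- Taking a finite prefix only forces the chunks that are needed.
firstChars :: TL.Text
firstChars = TL.take 20 numberedLines

-- The same data as an infinite list of strict Text values.
firstChunks :: [T.Text]
firstChunks = take 3 (map (T.pack . show) [1 :: Integer ..])
```

The strict counterpart, T.unlines over the same infinite list, has to materialise the whole result before returning anything, so it never finishes.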

The other place where string laziness is used is lazy IO, currently supported by [Char] or lazy Text; I think losing that is a tangible benefit. Not-so-hot take: it’s always better to use an actual stream type. Of course we could also keep [Char] and just drop the unsafeInterleaveIOs from the relevant functions, but while we are talking about the pros and cons of switching to a strict text type as default, I think this would be one of the pros.

atravers · August 8, 2024, 4:45am

Alright, supposing that the intuitions of most Haskellers are lacking in that way…perhaps a newer programming language can help to provide the much-needed insight:

What Is a String?

The String type, which is provided by Rust’s standard library rather than coded into the core language, is a growable, mutable, owned, UTF-8 encoded string type.

Storing UTF-8 Encoded Text with Strings - The Rust Programming Language

The Character Type

Rust’s char type is four bytes in size and represents a Unicode Scalar Value, which means it can represent a lot more than just ASCII. […] Unicode Scalar Values range from U+0000 to U+D7FF and U+E000 to U+10FFFF inclusive.

Data Types - The Rust Programming Language

…that seems simple enough:

However, a “character” isn’t really a concept in Unicode, so your human intuition for what a “character” is may not match up with what a char is in Rust. We’ll discuss this topic in detail in “Storing UTF-8 Encoded Text with Strings” in Chapter 8.

Hrm - does that mean the intuitions of most Rust users are also lacking in that exact same way? If so, then can anyone nominate a programming language that does support Unicode “properly” ? A migration strategy for Haskell that won’t make it into history books as a giant failure could then be based on that programming language…

Cold take: search Hackage for “stream” - which one of the results should a new Haskeller use?

As I see it, the only way a streaming type supersedes lazy lists (of characters or other values) in education is for it to be simple and standard Haskell. Educators can then rewrite all their course notes with confidence, knowing that it will be supported indefinitely.

(I leave it as an exercise to determine the chances of a simple standard streaming type actually appearing…ever.)

mikeplus64 · August 8, 2024, 5:20am

Perhaps no streaming library at all – many instructional programs using interact etc. (and needing lazy IO to produce immediate results, not blow up memory, etc.) look like forever (getLine >>= _ >>= mapM_ putStrLn). But streaming is a pretty good default.
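
A rough sketch of the two styles being compared, using only base. The respond function is a stand-in for whatever the instructional program computes per line, and the explicit loop checks isEOF rather than using forever so that it stops cleanly at end of input; both choices are assumptions made for the example.

```haskell
import Control.Monad (unless)
import Data.Char (toUpper)
import System.IO (isEOF)

-- Hypothetical per-line logic: each input line yields some output lines.
respond :: String -> [String]
respond line = [map toUpper line]

-- Lazy-IO style: the whole of stdin is presented as one lazy String.
viaInteract :: IO ()
viaInteract = interact (unlines . concatMap respond . lines)

-- Explicit loop: read one line, emit its responses, repeat until EOF.
viaLoop :: IO ()
viaLoop = do
  eof <- isEOF
  unless eof $ do
    line <- getLine
    mapM_ putStrLn (respond line)
    viaLoop
```

The loop version does not rely on lazy IO, so it works the same way if getLine and putStrLn are swapped for strict Text equivalents.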

hasufell · August 8, 2024, 7:27am

Well, if you open the book, it has a pretty good section on it: Storing UTF-8 Encoded Text with Strings - The Rust Programming Language

If we look at the Hindi word “नमस्ते” written in the Devanagari script, it is stored as a vector of u8 values that looks like this:
[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]
That’s 18 bytes and is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Rust’s char type is, those bytes look like this:
['न', 'म', 'स', '्', 'त', 'े']
There are six char values here, but the fourth and sixth are not letters: they’re diacritics that don’t make sense on their own. Finally, if we look at them as grapheme clusters, we’d get what a person would call the four letters that make up the Hindi word:
["न", "म", "स्", "ते"]

So you want a way to pattern match on all 3 variants, imo. And write literals for all 3 variants that are checked at compile time (not runtime or other trash like OverloadedStrings). Sorry for the language.

And then you still have to educate users about it.

So it actually does not get easier. It just gets more correct.
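
Two of the three views from the Rust book quote can be reproduced in Haskell with text and bytestring alone; the following is a sketch with invented names. Grapheme clusters are deliberately left out, because segmenting them needs Unicode data that neither base nor text provides.

```haskell
import Data.Word (Word8)
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The Hindi word from the quote, written as escapes for
-- U+0928 U+092E U+0938 U+094D U+0924 U+0947.
namaste :: T.Text
namaste = T.pack "\2344\2350\2360\2381\2340\2375"

-- View 1: the UTF-8 bytes (18 of them for this word).
utf8Bytes :: [Word8]
utf8Bytes = BS.unpack (TE.encodeUtf8 namaste)

-- View 2: the Unicode scalar values (6 of them), one Char each.
scalarValues :: String
scalarValues = T.unpack namaste
```

Neither length matches the four letters a reader would count, which is the gap the third view is meant to close.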

I had a conversation a while ago about AFPP/OsString and got asked what people should do who don’t care about the details. I actually wasn’t sure what to answer. The entire point of it is to force you to care.

atravers · August 8, 2024, 7:38am

I did open that book (to get those quotes).
Your quote still doesn’t justify the claim that “most Haskellers don’t really have a good intuition about what Unicode code points are”, as if it’s only a Haskell problem - it seems to be a problem for users of Rust too (otherwise they surely would have implemented Unicode “properly”).

So again:

BurningWitness · August 8, 2024, 7:53am

I got confused by the fact that simdutf8 needs extra code to work across chunk boundaries.

And now that I know that, this is an extremely annoying restriction for parsing, as merely knowing that a certain stretch of input is UTF-8 is not enough to transform that input into Text. I’ll now need a whole extra rechunking function just to conform to that expectation.
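
The chunk-boundary problem can be illustrated with a sketch, assuming the text and bytestring packages (this is not the simdutf-based code under discussion). A stream that is valid UTF-8 as a whole may be cut in the middle of a multi-byte sequence, so neither chunk decodes on its own, while a decoder that carries state across chunks handles it.

```haskell
import Data.Either (isLeft)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Encoding as TE
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- "café!" in UTF-8, split between the two bytes of U+00E9.
chunk1, chunk2 :: BS.ByteString
chunk1 = BS.pack [0x63, 0x61, 0x66, 0xC3]
chunk2 = BS.pack [0xA9, 0x21]

-- Decoding each chunk in isolation fails for both of them.
eachChunkAlone :: [Bool]
eachChunkAlone = map (isLeft . TE.decodeUtf8') [chunk1, chunk2]

-- The lazy decoder sees the chunks as one stream and succeeds.
joined :: TL.Text
joined = TLE.decodeUtf8 (BL.fromChunks [chunk1, chunk2])
```

A parser that wants to hand strict Text pieces to its caller therefore has to move the dangling bytes from the end of one chunk to the start of the next, which is the rechunking step described above.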

hasufell · August 8, 2024, 8:38am

You’re nitpicking and reading into my post.

My points are:

Unicode code points are very non-intuitive (Rust doesn’t have a type for them: good)
Improving the situation will likely make things less ergonomic (at least in part), because strings are hard (compare with Rust not allowing you to index strings, or OsString throwing stones at you if you want to convert to String/Text)
People want better ergonomics (zomg, too many string types), not to write more correct code

That’s why most of these string rants kinda miss the point, imo.

atravers · August 8, 2024, 9:19am

Some time ago this was written, presumably in passing joviality:

…but if things really are (already?) so bad:

data String a ...

unitString :: a -> String a
unitString x = ...

bindString :: String a -> (a -> String b) -> String b
bindString str k = ...

      ⋮

data Byte ...
type ByteString = String Byte
      ⋮

data OS ...
type OSString = String OS
      ⋮

data Char ...
type Text = String Char
      ⋮

data Bin ...
type BinString = String Bin
      ⋮

Or can some simplicity be salvaged from the morass of “stringy” types?

Bodigrim · August 8, 2024, 8:47pm

Before dreaming of eradication of String, someone would have to accomplish a much more mundane task of putting Text datatype into base. Yet I don’t see a queue of volunteers.

rhendric · August 8, 2024, 9:24pm

Mechanically, that doesn’t sound difficult. Out of idle curiosity, what additional work would one be volunteering to do beyond submitting an MR that moves the things from one library to the other and puts in aliases/deprecation warnings?

Bodigrim · August 8, 2024, 11:59pm

It would be easier to comment on additional work if someone describes first how exactly they would approach the task. Putting ByteArray from primitive into base is the closest example, which can serve as a model.

arybczak · August 9, 2024, 12:03am

If this would happen, the amount of breakage would be monumental, hence it’s not going to.

atravers · August 9, 2024, 12:49am

Either that or it would require Rust-style editions, as opposed to whatever this was supposed to be:

Seeking feedback on language editions proposal

…so both are probably just as unlikely to happen.
