The Quest to Completely Eradicate `String` Awkwardness

BurningWitness · October 8, 2024, 8:56pm

It would only really make sense in the context of packages that provide all the necessary functions around said newtypes, so 90% of the text package would still exist in that sense (though renaming it utf8 would make more sense).

Also I don’t like long names and wouldn’t want to specify the encoding every single time I talk about the type, so while I would want this to be roughly Bytes (encoding :: Type), I’d still want to only type Bytes and have GHC route the encoding behind my back.
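One way to approximate this in today's GHC is a phantom encoding parameter plus a short synonym for the common case. This is only a sketch: `BytesIn`, `UTF8`, `Latin1`, `byteLength` and `fromAscii` are names made up for illustration, `[Word8]` stands in for a real packed buffer, and a synonym fixes a default rather than having GHC infer the encoding.

```haskell
{-# LANGUAGE KindSignatures #-}

import Data.Char (isAscii, ord)
import Data.Kind (Type)
import Data.Word (Word8)

-- Hypothetical encoding tags; they exist only at the type level.
data UTF8
data Latin1

-- A byte sequence tagged with its encoding (a phantom type parameter).
-- [Word8] stands in for a real packed representation.
newtype BytesIn (encoding :: Type) = BytesIn [Word8]
  deriving (Eq, Show)

-- The short name picks one default encoding, so most signatures never mention it.
type Bytes = BytesIn UTF8

-- Encoding-agnostic operations stay polymorphic in the tag.
byteLength :: BytesIn encoding -> Int
byteLength (BytesIn ws) = length ws

-- Encoding-specific constructor: ASCII input is already valid UTF-8.
-- Returns Nothing for anything outside ASCII rather than guessing.
fromAscii :: String -> Maybe Bytes
fromAscii s
  | all isAscii s = Just (BytesIn (map (fromIntegral . ord) s))
  | otherwise     = Nothing
```

Code that only says `Bytes` never mentions the encoding, while `BytesIn Latin1` remains available to the few places that care.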

atravers · October 8, 2024, 9:45pm

hasufell:

If we look at them as Unicode scalar values, which are what Rust’s char type is, those bytes look like this:
['न', 'म', 'स', '्', 'त', 'े']
There are six char values here, but the fourth and sixth are not letters: they’re diacritics that don’t make sense on their own. Finally, if we look at them as grapheme clusters, we’d get what a person would call the four letters that make up the Hindi word:
["न", "म", "स्", "ते"]

Perhaps the “original mistake” in Haskell’s design (for this ongoing saga, anyway) wasn’t the definition type String = [Char], but having a Char type at all in Haskell, particularly as Unicode breaks the ASCII-era mapping of single letters (as seen on ASCII keyboards of that era) to machine-precision integers.

If String was abstract, then the encoding of the various individual Unicode glyphs (or “letters”, grapheme clusters, whatever) would be an internal implementation matter. This would have allowed the shortest possible String to be just one Unicode glyph:

"ते" :: String

which would be far more intuitive to a native writer of Devanagari than typing out:

['त', 'े'] :: [Char]

…especially considering that the computer was invented to serve our purposes.

From 1990 back to now - since someone here had an adverse recollection upon seeing those SML functions:

stretch :: Text -> [Text] 
scrunch :: [Text] -> Text

hence:

stretch "नमस्ते" = ["न", "म", "स्", "ते"]

and:

scrunch ["न", "म", "स्", "ते"] =
scrunch ["न", "मस्", "ते"]     =
scrunch ["न", "मस्ते"]         =
scrunch ["नम", "स्ते"]         =
scrunch ["नमस्", "ते"]         =
scrunch ["नमस्ते"]             = "नमस्ते"
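For anyone who wants to experiment, here is a rough sketch of the two functions using only base. It works on `String` rather than `Text`, and it is not a real grapheme-cluster segmenter: it simply attaches each run of combining marks (per `Data.Char.isMark`) to the preceding character, which happens to reproduce the split shown above for this word. A proper implementation would follow the Unicode text segmentation rules instead.

```haskell
import Data.Char (isMark)

-- Split a string into a base character plus any combining marks that follow it.
-- This is an approximation of grapheme clusters, not the full Unicode algorithm.
stretch :: String -> [String]
stretch []       = []
stretch (c : cs) = (c : marks) : stretch rest
  where
    (marks, rest) = span isMark cs

-- Joining the pieces back together is plain concatenation,
-- so any regrouping of the pieces scrunches to the same string.
scrunch :: [String] -> String
scrunch = concat
```

With this sketch `scrunch (stretch s)` gives back `s` for every input, while `length "ते"` on the underlying `[Char]` is still 2, which is exactly the mismatch being complained about.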

So forget String - at least it could eventually be changed to type String = Text - it’s Char that’s got to go (unless the Unicode Consortium suddenly decide that characters are necessary; a rather unlikely occurrence). Moreover, the decision to introduce a separate type for strings in SWI-Prolog means that a Haskell implementation of abstract Text could reuse recent research from another declarative language…

blamario · October 8, 2024, 11:03pm

100%. Historically speaking, programming languages designed for text processing (Icon, SNOBOL, OmniMark, …) always had string as the basic type. The Char type was a crutch used by system-programming languages to provide at least some text processing ability.

hasufell · October 10, 2024, 3:20am

Redefining String goes against the Haskell report and so does dropping Char.

If you design a new language, that might be feasible.

Ambrose · October 10, 2024, 3:29am

Per Wikipedia at least, UTF-8 didn’t rise to dominance until the mid/late '00s. Yet another fun example of Haskell being old.

If only people would treat this sort of thing as fun instead of complaining so much. We can deal with it - it’s just programming.

atravers · October 10, 2024, 6:37am

Actually I was thinking about a future Haskell Report, but that would be years away - I would be starting small with something like Hugs, and slowly switch its sources over to an abstract String/Text while removing Char, to get a better idea of where the problems are lurking…

AndreasPK · October 11, 2024, 5:34pm

hasufell:

Redefining String goes against the Haskell report and so does dropping Char.

Language editions when?

atravers · October 11, 2024, 6:06pm

If you mean language editions like what’s possible in Rust…I would suggest waiting for the second Rust compiler to be fully operational: it too will need to support language editions, which would then provide a point of comparison with the existing compiler. A Haskell language-edition extension could then use those two implementations of editions as points of reference.

hasufell · October 12, 2024, 6:18am

You mean when a certain language edition is enabled, it would also swap base and other libraries (because it can’t be cabal flags)?

I’m having a bit of a hard time imagining how that would work overall.

AndreasPK · October 16, 2024, 5:04pm

I doubt the idea I describe below is worth the complexity/effort. There will imo always be different types covering different needs: builders, human text, string-like byte sequences, and pinned vs unpinned. So I don’t believe one can eradicate the awkwardness completely. Merely reduce it!

But with that disclaimer out of the way:

I was more thinking of it swapping out what some names map to rather than whole packages. Something along the lines of -XFancyStrings that has fundamental compiler support inspired by backpack which means:

String literals are Text or String depending on whether -XFancyStrings is enabled.
There is a mechanism in place for libraries to provide different implementations for a given name depending on the language edition.
This can be used to map names from Data.String and other relevant modules to implementations/types using either [Char] or Text.
For example definitions in Data.String would mostly map to one of Data.String.Legacy or Data.String.Text depending on the edition used.

That should mean:

A code base would compile without change if:
- Strings are treated as opaque
- The code base uses the same edition for all modules.
- Dependencies provide an interface for the given edition.
Code that accesses string internals can be compatible with both editions by providing different implementations for each edition.
Code compiled using the new edition can use packages that only have an old-edition interface by explicitly converting arguments. (Or maybe GHC/HLS could even do this for us if desired.)

In theory that would allow everyone to use an opaque String type in a forward-compatible way, and once all their dependencies are up to date they can just switch to the new edition themselves.
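To make the "compiles without change" case concrete, here is a small sketch that uses the existing `OverloadedStrings` extension and the `IsString` class from base as a stand-in. It is not the edition mechanism itself: -XFancyStrings does not exist, and `Opaque` is a hypothetical newtype playing the role of an abstract string type such as Text. Type classes only approximate what name remapping per edition would do.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.String (IsString (..))

-- Hypothetical stand-in for an abstract string type such as Text.
newtype Opaque = Opaque String
  deriving (Eq, Show)

instance IsString Opaque where
  fromString = Opaque

instance Semigroup Opaque where
  Opaque a <> Opaque b = Opaque (a <> b)

-- Treats strings as opaque: only literals and concatenation, no list functions.
greeting :: (IsString s, Semigroup s) => s -> s
greeting name = "Hello, " <> name

-- The same definition used at both representations.
legacyGreeting :: String
legacyGreeting = greeting "world"

opaqueGreeting :: Opaque
opaqueGreeting = greeting "world"
```

The body of `greeting` never looks inside the string, so it does not care which representation it gets. That is the kind of code that could move between editions untouched.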

But as I said, I’m not convinced it’s worth it.

maxigit · October 17, 2024, 10:31am

Unless you don’t do anything with them, Strings are rarely treated as opaque. Strings being lists, pretty much everything done to them (split, map, combine, etc.) will be done with list functions from the Prelude. In fact, that’s even the point of using Strings: to use them as lists (the opposite of opaque).

AndreasPK · October 17, 2024, 1:06pm

Of course if your code base does that you would simply have to either change it or remain on the old language edition (potentially introducing explicit conversions if your dependencies drop the old-version interface).

But I still think the vast majority of String (or FilePath) use is opaque if you look at it on a per-function level. And on all those functions where you simply pass a String along, or store it in some data type, nothing would need to change. Which is why I think language editions could reduce the pain of an ecosystem-wide transition from String to Text a lot if it ever happened, but they still can’t make all code that worked with String magically work with Text.
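A tiny illustration of the two kinds of function, with made-up names (`User`, `rename`, `shout`). Only the last one depends on String being a list:

```haskell
import Data.Char (toUpper)

-- Opaque use: the strings are stored and handed along, never inspected.
data User = User { userName :: String, userHome :: FilePath }

rename :: String -> User -> User
rename new user = user { userName = new }

-- List use: this relies on String being [Char].
shout :: String -> String
shout = map toUpper
```

If String became an abstract type, `User` and `rename` would presumably compile as they are, while `shout` would need rewriting against the new type's own functions (something like `Data.Text.map` in the text package) or would have to stay on the old edition.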