The Quest to Completely Eradicate `String` Awkwardness

Alright. I’m sorry for taking such a long time to write another message, but I feel like I had to absorb everything that everyone is saying. I’m extremely thankful for all the interactions and I feel like I need to point out a couple of notes after what people are saying:

First off, perhaps in the long run it is more user friendly to eradicate Text rather than String. From everything mentioned thus far, it feels like people are imagining that, in a hypothetical world where we have carried out this herculean task, all code would be littered with Text instead of String. I think that this wouldn’t be a good experience for beginners, as the type is called String just about everywhere else. Of course, I don’t mean that we should just remove Text and its semantics, but more like keep String's name and replace its internals with Text…? This sounds kind of tough, but maybe it’s easier than I think?

The issues that come with doing something like that, as people mentioned, come in a lot of forms:
Firstly… one of the things at the heart of the problem is that, as someone previously mentioned, a String is ultimately just a [Char], and this is perhaps easier to abstract over than a monomorphic Text?

I’m not quite sure how one would solve this… I think it would be best if, outwardly, String still acted similarly to [Char], but it would actually be invalid to say type String = [Char], as the internals would be closer to Text. I am really quite unsure about this, as I don’t know the internals well, and all I can say is that it would be more convenient if we picked Text over the current String and somehow replaced String with it while attempting to make it still act like String but with the benefit of Text? To me, this sounds like the premise of a really bad piece of fiction, but maybe people can think up ways to make it a reality… In case people complain about the special-casing part and not the outrageous piece of fiction that I’ve written, I think it’s totally fair to do some fairly special things when it comes to String. If nothing else, it would be considerably more convenient if a string type could get some use out of typeclasses like Foldable and the others. Perhaps we could create some sort of StringContainer type for that purpose, if we don’t want to rely on Char for anything… though, maybe it could be made to work with Text?
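
To make the Foldable point a bit more concrete: Foldable expects a type constructor of kind * -> *, while Text is a plain type, so it cannot be given an instance directly. Below is a minimal sketch of what a "string container" class could look like. The names StringLike, unconsS, foldrS and countVowels are all made up for illustration; nothing like this exists in base or text.

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.Text as T

-- Hypothetical class (not from base or text): the minimum a "string
-- container" needs so that generic code can walk over its characters.
class StringLike s where
  unconsS :: s -> Maybe (Char, s)

instance StringLike [Char] where
  unconsS []       = Nothing
  unconsS (c : cs) = Just (c, cs)

instance StringLike T.Text where
  unconsS = T.uncons

-- A right fold written once against the class instead of against Foldable.
foldrS :: StringLike s => (Char -> b -> b) -> b -> s -> b
foldrS f z s = case unconsS s of
  Nothing      -> z
  Just (c, cs) -> f c (foldrS f z cs)

-- Works unchanged for both String and Text.
countVowels :: StringLike s => s -> Int
countVowels = foldrS (\c n -> if c `elem` "aeiou" then n + 1 else n) 0
```

This only shows that the two types can share a fold; it says nothing about whether String's internals could be swapped out.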

I think this will help with what @tomjaguarpaw previously mentioned, about being unable to stop people putting Chars inside []. I think that all we have to do is lie and make it so that [Char] is really an alias to the new String, or something similar?

I think this will also help with pattern matching, as it would now be simpler?
I’m really sorry if this all sounds really idiotic, but it sounds like the best case scenario, so I’m interested in what we would need to sacrifice to reach this…

Secondly… as previously mentioned, there’s the issue of whether we should pick lazy or strict variants of Text. There have also been recommendations to have Text be strict and use [Text] for laziness, I believe.

I’m not really too picky about this part, as I think it would be frankly equivalent in either situation… though, I suppose it would perhaps be better to go with the lazy text variant; I don’t see much benefit in having strict text be the default, and I feel like a “default” Text not breaking expectations from the demand of String would be without much fault. As I said though, I don’t really have any strong feelings here.

Once again, I’m really sorry if this is sounding outrageous…

SWI-Prolog recently made a similar change:

(In ISO Prolog, a double-quoted string literal is converted to a list of character-codes: What type is a single quoted string? - Help! - SWI-Prolog.)

It’s not idiotic nor outrageous. Two things that come to mind:

  • String (or rather [Char]) has the unfortunate property of being pattern-matched quite heavily, so a lot of people’s code – like (x:xs) – will become invalid suddenly

  • There also has been recommendations to have Text be strict and use [Text] for laziness, I believe.

No idea who recommended it, but it’s not good advice; a dedicated streaming library should be used (conduit, streamly, etc).

I’d prefer the other way round: Let GHC store all data in a serialized way, as demonstrated by the Gibbon project. Then String is stored as efficiently as Text, and we can drop Text and all of its flavors.

I see lots of enthusiasm and interest in this thread. May I urge it to be channeled into practical developments and not just dreaming how strings could have looked in an imaginary ecosystem? I’m sorry for the reality check, but String and functions to work with it are to remain in base, and String cannot be substituted by Text or [Text] under the hood automagically.

I see lots of proposed solutions in this thread, but I don’t know what problem they solve. If you want to use Text, go and use it. What exactly makes it inconvenient?

In my experience… mainly, the fact that Text and String overlap almost exactly 1:1 in their use cases, except as mentioned, Text is just a better String. Secondly, OverloadedStrings is annoying to say the least, and while it and a combination of another extension seem to help when it comes to defaulting string literals to Text, it’s not something you see often, and it’s still a bit more inconvenient to use Text than String because base is so string-oriented, so you’re forced to implement some of the basic features yourself, or use a library like text-show or text-display for something that should have been included in base or core (or is already included, just for the worse version). Of course, if I had to pick between using Text or String right now, I would exclusively use String because it’s just way more convenient and using Text as it is right now just adds some annoying overhead. And yet, I would still like to get the benefits that Text would give if I had used it. I think the point of this discussion is to get a better idea so that we can direct this fairyland story into knowing how or what would be a good way to simplify the ecosystem…
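
As a small illustration of the overhead described above, here is a sketch of a Text-based module. With OverloadedStrings the literals are Text, but show from base still returns String, so the result has to be packed by hand. tshow is a hypothetical helper name used only for this example; the libraries mentioned above provide their own, better-designed equivalents.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T

-- The literal is a Text here only because OverloadedStrings is on.
greeting :: Text
greeting = "hello"

-- base's show returns String, so a Text-based program converts by hand.
-- (tshow is a hypothetical helper, not part of base or text.)
tshow :: Show a => a -> Text
tshow = T.pack . show

describe :: Int -> Text
describe n = greeting <> " #" <> tshow n
```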

This statement is almost true, because every String (unless it contains surrogate characters!) can be replaced by a lazy Text, where each chunk contains exactly one Char. But I assume this is not what you mean?
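
A minimal sketch of that encoding, assuming the text package as it ships with GHC (oneCharPerChunk is a name invented for this example). As far as I know, T.singleton replaces surrogate code points with U+FFFD, which is where the "unless it contains surrogate characters" caveat comes from.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Build a lazy Text whose chunks each hold exactly one Char.
-- The chunk list is produced lazily, much like the original String.
oneCharPerChunk :: String -> TL.Text
oneCharPerChunk = TL.fromChunks . map T.singleton
```

This keeps the laziness of String but pays for one strict chunk per character, so it is an existence argument rather than something to use in practice.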

If the statement was to mean that String can be replaced by strict Text then it’s false, you can easily face major performance penalties if you do so.

For instance, tasty uses String. Never been an issue.

How exactly is [Data.Text.Text] better than Data.Text.Lazy.Text? They are almost isomorphic (chunks of lazy text are guaranteed to be nonempty) and the existing implementation of lazy Text is indeed built upon the implementation of strict Text, so I don’t see who is to win and what is to be saved. Mind you that one would still need to wrap [Text] into a newtype to provide instances.
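
To sketch the "almost isomorphic" point: the wrapper below (Chunked, toLazy and fromLazy are hypothetical names) converts to and from lazy Text with fromChunks and toChunks. The round trip drops empty chunks, and equality has to be defined on content rather than on the chunk list, which is one reason a bare [Text] would need a newtype anyway.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical wrapper: a bare [Text] would inherit the list instances,
-- which compare chunk boundaries instead of the text itself.
newtype Chunked = Chunked [T.Text]

toLazy :: Chunked -> TL.Text
toLazy (Chunked cs) = TL.fromChunks cs   -- empty chunks are dropped here

fromLazy :: TL.Text -> Chunked
fromLazy = Chunked . TL.toChunks

-- Equality on content, delegating to lazy Text's own Eq instance.
instance Eq Chunked where
  a == b = toLazy a == toLazy b
```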

I remember alternative preludes being all the rage around 2016. Pretty much all of them are abandoned or on life support in 2024, causing great pain to projects that embraced them. Alternative preludes do not stand the test of time.

As for text I would strongly advise against moving the entire package into base. But we can move data Text and its instances.

You can define (:+) as below and use it instead of (:):

{-# LANGUAGE PatternSynonyms, ViewPatterns #-}
import Data.Text (Text, cons, uncons)

pattern (:+) :: Char -> Text -> Text
pattern x :+ xs <- (uncons -> Just (x, xs)) where
  x :+ xs = cons x xs
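
For anyone wanting to try it, here is a small self-contained usage sketch. The pattern is repeated so the block compiles on its own; the fixity declaration and the capitalise function are additions for this example, not part of the suggestion above.

```haskell
{-# LANGUAGE PatternSynonyms, ViewPatterns #-}
import Data.Char (toUpper)
import Data.Text (Text)
import qualified Data.Text as T

pattern (:+) :: Char -> Text -> Text
pattern x :+ xs <- (T.uncons -> Just (x, xs)) where
  x :+ xs = T.cons x xs

-- Same fixity as (:), so chains associate to the right (my addition).
infixr 5 :+

-- Upper-case the first character, written the way one would for a list.
capitalise :: Text -> Text
capitalise (c :+ rest) = toUpper c :+ rest
capitalise t           = t   -- the empty Text falls through to here
```

Note that each match goes through uncons and each construction through cons, so this reads like list code but does not have list performance characteristics.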

If Text works better for your use cases, just use it.

Why is it annoying? You can put it into a Cabal file to apply it to all modules (or even into the global Cabal config, if you fancy so) and you can put it into .ghci to apply it to all GHCi sessions.

But you don’t have to pick one. Use whatever is appropriate and approachable in each specific instance.

it would be more convenient if we picked Text over the current String and somehow replaced String with it while attempting to make it still act like String but with the benefit of Text ?

One proposed motivation for the Backpack module system back in the day was to overcome the Text vs String impasse by letting packages depend on an abstract signature which could be filled with either one.

From Edward Z. Yang’s thesis:

But Backpack didn’t catch on for a variety of reasons.

Can you elaborate on where you would draw the line and why?

Just strict data Text and the API used by its instances, entirely ignoring Data.Text.Lazy, encompasses code currently living in 15 different modules in text. Here, courtesy of calligraphy, is a graph of those dependencies:

A complex dependency graph

(Link to SVG, if the above doesn’t render.)

Is extracting this specific bundle of code more appealing than absorbing text wholesale?

calligraphy sounds cool! Unfortunately I don’t see any image, I just see literally “A complex dependency graph” (which I guess is the alt text). Is something wrong on my end, or is that what everyone sees?

Sorry about that; does the link I just added work better for you?

(Direct SVG uploading is disabled on this Discourse; I just tried a PNG fallback but the server force-shrinks it down to an unusable size.)

Yeah, the link works, thanks!

Could you please remove instance Binary Text (which, if data Text is moved into base, would go into binary itself) and other instances for classes outside of base and rerun the analysis?

If the supporting code for instance Binary Text goes into binary, and the supporting code for instance Lift Text goes into template-haskell, we are down to 9 modules and this graph:

A less complex dependency graph

(Link to SVG)

But the supporting code for instance Lift Text, for example, includes the function Data.Text.Array.copyFromPointer. Is that function really going to live in template-haskell? Seems like a very weird place to import it from. Similarly, is Data.Text.Array.newPinned going to live in binary simply because encodeUtf8, and thus instance Binary Text, uses it?

I would have assumed that even if those instances go to their classes’ packages, the Text-related functions they use would go to base, which would make the first graph more reflective of how text is to be chopped up.

It can be whittled down further (fusion can go, Data.Text.Array can go), but yes, selected functions from 9 modules are infinitely easier to put into base than 55 modules in their full glory.

The instance is to be rewritten without them, using raw GHC.Exts, and the functions with the rest of Data.Text.Array will continue to live in the text package.

Same: reimplement the instance using GHC.Exts, keep Data.Text.Array in text intact. All of them are just thin wrappers.

So just to be clear, your vision is to expose data Text, its instances, (maybe?) pack and unpack, and nothing else? And have redundant implementations of any machinery beneath those exports in base and text?

Correct, only data Text and instances, and export it from Prelude.

Hmm: similar to how the standard Prelude re-exports a select group of entities from the Data.List module - it worked back then, so it would be very annoying for something like it to fail now…

What about the constructor? It has to be exported from somewhere in order for text to make use of it, but it doesn’t seem like a safe thing to export from Prelude. It should be exported from the implementation module, right?

What’s the correct naming convention for that module? GHC.Text? Something with Internal in it somewhere?