Text-2.0 with UTF-8 is finally released!

I’m happy to announce that text-2.0, with UTF-8 as its underlying representation, has finally been released: https://hackage.haskell.org/package/text-2.0. The release is identical to rc2, which was circulated earlier.

Changelog: https://hackage.haskell.org/package/text-2.0/changelog

Please give it a try. Here is a cabal.project template: https://gist.github.com/Bodigrim/9834568f075be36a1c65e7aaba6a15db

This work would not be complete without a blazingly fast UTF-8 validator, submitted by Koz Ross to bytestring-0.11.2.0, whose contributions were sourced via HF as an in-kind donation from MLabs. I would like to thank Emily Pillmore for encouraging me to take on this project and for helping with the proposal and permissions. I’m grateful to my fellow text maintainers, who have been carefully reviewing my work over the course of the last six months, as well as to the helpful and responsive maintainers of downstream packages and to GHC developers. Thanks all, it was a pleasant journey!

Thank you very much for all of your hard work, Bodigrim and everyone involved.
Congratulations on the release!

Wow!! Good stuff. Just tested it locally on one of my projects and it works 🙂

@Bodigrim,

Thank you for your efforts to see this through!

I don’t have the full picture, so forgive my ignorance on some of this… I am curious, how far down the road does this push us WRT simplifying our String/Text madness in Haskell?

I would not say that String is a weakness; it is actually a strength. There is no way not to have Char in base and there is no way not to have lists, so [Char] will always be available. It totally makes sense for base not to introduce more entities than needed and just use [Char] to represent strings. It’s also a wonderful educational tool: while there are certain pitfalls with readFile :: String -> IO String, a novice can start writing something useful knowing only about lists and basic polymorphic operations on them. This is as simple as possible.
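
To make the educational point concrete, here is a minimal sketch that needs nothing beyond base. The helper names shout and countWords are made up for this illustration.

```haskell
import Data.Char (toUpper)

-- String is a synonym for [Char], so ordinary list functions apply directly.
shout :: String -> String
shout = map toUpper

-- 'words' splits on whitespace; 'length' is the usual list length.
countWords :: String -> Int
countWords = length . words

main :: IO ()
main = do
  putStrLn (shout "hello, text")
  print (countWords "lists all the way down")
  putStrLn (reverse "abc")
```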

It is a weakness in the sense of making “bad” things easier than “good” ones.

You start with [Char], then you add some printfs, read a file or two, show some things… and then you find yourself deeply committed to your training wheels.

Yes, I know about some use cases where [Char] is okay. But you need to be on the lookout to prevent it from leaking into all the other places.

I’m even fine with the [Char] type by itself. The madness lies in the String.

Yea, I would agree with @wiz on this.

In practice, String and [Char] are warts that make implementations more complex, more verbose, and less straightforward for practical use of Haskell in production. They also complicate teaching/learning Haskell to/for beginners. To say otherwise is not wrong, but obviously not the conclusion you would arrive at as a new Haskeller uninhibited by experiential baggage from Haskell of yesteryear.

“The madness lies in the String.”

I’d argue the madness lies in the prolific use of String in the ecosystem, starting with base and extending to the rest of our “standard library”.

My point is that we cannot remove String from base, because Char and [] would always remain there.

The ecosystem should indeed use String less. The biggest hurdle here is the file API, which is now targeted by the AFP proposal, led by @hasufell.

I didn’t immediately know what you meant, so, for others, it is the abstract file path proposal.

Hey so what’s the upgrade path for this for library authors? Should I go around and make text-2.0 the lower bound for all my libs?

Why make it a lower bound? text is a boot library, so GHC < 9.4 will stick to text-1.2.

The majority of packages do not need any changes to upgrade; just relax the upper bound to text < 2.1. A template cabal.project can be found here: https://gist.github.com/Bodigrim/9834568f075be36a1c65e7aaba6a15db
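
As a rough illustration of why most code is unaffected: the public API works in terms of Unicode code points, not the internal encoding. The sketch below should give the same answers with text-1.2 and text-2.0; codePoints and utf8Bytes are names invented here, not part of the library.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of Unicode code points, independent of how Text stores them.
codePoints :: T.Text -> Int
codePoints = T.length

-- Number of bytes after explicitly encoding to UTF-8.
utf8Bytes :: T.Text -> Int
utf8Bytes = BS.length . TE.encodeUtf8

main :: IO ()
main = do
  let heart = T.pack "\128148" -- U+1F494, broken heart
      hello = T.pack "h\233llo" -- "héllo"
  print (codePoints heart, utf8Bytes heart)
  print (codePoints hello, utf8Bytes hello)
```

Code that reaches into internal modules or assumes a particular in-memory layout is a different story and may need real changes.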

And I see aeson doing this:

    , text              >=1.2.3.0  && <1.3 || >=2.0 && <2.1

So I’ll follow that pattern. Thank you!

I don’t understand why we can’t (conceptually) remove type String = [Char], change all the relevant functions such as putStrLn to use Text instead, and make string literals have the type Text?

This would pretty much solve the string situation afaict (but of course the cost of breakage would be very high).

Am I missing something here?
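
For context, something close to this is already available on an opt-in basis without touching base: the OverloadedStrings extension lets a literal be a Text, and Data.Text.IO provides a Text version of putStrLn. A minimal sketch (greeting is just an example name):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- With OverloadedStrings, the literal is elaborated via fromString to Text.
greeting :: Text
greeting = "Hello, Text!"

main :: IO ()
main = do
  TIO.putStrLn greeting
  TIO.putStrLn (T.toUpper greeting)
```

This does not remove String from any existing signature, of course; it only avoids it in new code.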

I think the problem is deeper than just choosing a type. Given that the correct way to interact with a terminal depends on the mode that terminal is in, I don’t think there is a unique choice that works properly. For example:

% cat /tmp/test.hs && ghc /tmp/test.hs && LC_ALL=C /tmp/test
main = putStrLn "💔"
Loaded package environment from /home/tom/.ghc/x86_64-linux-8.10.7/environments/default
test: <stdout>: commitBuffer: invalid argument (invalid character)

I think the system needs rethinking from the ground up.

I think one easy solution is to just always default to utf8, or does that not solve this problem?
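
A per-program version of this is possible today. The sketch below uses hSetEncoding and utf8 from System.IO to override the locale-derived encoding on stdout and stderr, so I would expect the putStrLn example above to emit UTF-8 bytes even under LC_ALL=C. Whether the terminal renders them correctly is a separate matter.

```haskell
import System.IO (hSetEncoding, stderr, stdout, utf8)

main :: IO ()
main = do
  -- Ignore the locale and always encode output as UTF-8.
  hSetEncoding stdout utf8
  hSetEncoding stderr utf8
  putStrLn "\128148" -- U+1F494, the character from the example above
```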

That would require that anyone using a Haskell system must have configured their terminal for UTF8 which seems rather a strong requirement.

It seems the hardest part of this fix is figuring out the best course of action.

Maybe we should advocate for the HF Technical WG to take this on as a next step?

Given how GHC breaks our world from time to time, and given how “graceful migrations” is a mostly solved problem, I don’t see why we couldn’t figure this out if we try and push hard on it.

Think of how nice it will be to not need to go through all the extra hoops when using Text!

A number of languages enforce UTF8 these days, so I don’t think it’s cruel to expect it.

Do they even force UTF8 for interacting with the terminal? I’m personally in favour of requiring UTF8 everywhere but I’m not sure if there are other encodings in popular use on interactive terminals.

Yeah, lots of environments have terminals and locales misconfigured. And then there’s a C/C.UTF8 fustercluck.

Personally I’d rather see mojibake once in a while than have some script/build crash for printing pretty Unicode lines.
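
In that spirit, here is a sketch of one way to trade the crash for degraded output. It relies on the //TRANSLIT suffix documented for mkTextEncoding, which asks for a replacement character when a code point cannot be encoded. translitName and relaxEncoding are hypothetical helper names, and the exact replacement output depends on the platform.

```haskell
import System.IO (Handle, hGetEncoding, hSetEncoding, mkTextEncoding, stdout)

-- Drop any existing "//SUFFIX" and request replacement characters instead.
translitName :: String -> String
translitName name = takeWhile (/= '/') name ++ "//TRANSLIT"

-- Keep the handle's current encoding, but make encoding failures non-fatal.
-- Binary handles have no encoding (Nothing), so they are left alone.
relaxEncoding :: Handle -> IO ()
relaxEncoding h = do
  menc <- hGetEncoding h
  case menc of
    Nothing  -> pure ()
    Just enc -> mkTextEncoding (translitName (show enc)) >>= hSetEncoding h

main :: IO ()
main = do
  relaxEncoding stdout
  putStrLn "pretty line: \9472\9472\9472" -- U+2500 box-drawing characters
```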