Text-2.0 with UTF8 is finally released!

Bodigrim · December 24, 2021, 7:15pm

I’m happy to announce that text-2.0, with UTF-8 as the underlying representation, has finally been released: https://hackage.haskell.org/package/text-2.0. The release is identical to rc2, circulated earlier.

Changelog: https://hackage.haskell.org/package/text-2.0/changelog

Please give it a try. Here is a cabal.project template: https://gist.github.com/Bodigrim/9834568f075be36a1c65e7aaba6a15db

This work would not be complete without a blazingly-fast UTF-8 validator, submitted by Koz Ross into bytestring-0.11.2.0, whose contributions were sourced via HF as an in-kind donation from MLabs. I would like to thank Emily Pillmore for encouraging me to take on this project and for helping with the proposal and permissions. I’m grateful to my fellow text maintainers, who’ve been carefully reviewing my work over the course of the last six months, as well as to the helpful and responsive maintainers of downstream packages and GHC developers. Thanks all, it was a pleasant journey!

gilmi · December 24, 2021, 8:36pm

Thank you very much for all of your hard work, Bodigrim and everyone involved.
Congratulations on the release!

vmchale · December 25, 2021, 5:57pm

Wow!! Good stuff. Just tested it locally on one of my projects and it works

ketzacoatl · December 29, 2021, 1:13pm

@Bodigrim,

Thank you for your efforts to see this thru!

I don’t have the full picture, so forgive my ignorance on some of this… I am curious, how far down the road does this push us WRT simplifying our String/Text madness in Haskell?

Bodigrim · December 29, 2021, 7:17pm

I would not say that String is a weakness, it is actually a strength. There is no way not to have Char in base and there is no way not to have lists, so [Char] will be always available. It totally makes sense for base not to introduce more entities than needed and just use [Char] to represent strings. It’s also a wonderful educational tool: while there are certain pitfalls with readFile :: String -> IO String, a novice can start writing something useful knowing only about lists and basic polymorphic operations on them. This is as simple as possible.

wiz · January 20, 2022, 8:50am

It is a weakness in the sense of making “bad” things easier than “good”.

You start with [Char], then you add some printfs, read a file or two, show some things… and then you find yourself deeply committed to your training wheels.

Yes, I know about some use cases where [Char] is okay. But you need to be on the lookout to prevent it from leaking into all the other places.

I’m even fine with the [Char] type by itself. The madness lies in the String.

ketzacoatl · January 20, 2022, 5:53pm

Yea, I would agree with @wiz on this.

In practice, String and [Char] are warts that make implementations more complex, more verbose, and less straightforward for practical use of Haskell in production. They also complicate teaching/learning Haskell to/for beginners. To say otherwise is not wrong, but obviously not the conclusion you would arrive at as a new Haskeller uninhibited by experiential baggage from Haskell of yesteryear.

“The madness lies in the String.”

I’d argue the madness lies in the prolific use of String in the ecosystem, starting with base and extending to the rest of our “standard library”.

Bodigrim · January 20, 2022, 8:49pm

My point is that we cannot remove String from base, because Char and [] would always remain there.

The ecosystem should use String less indeed. The biggest hurdle here is the file API, which is now targeted by the AFP proposal, led by @hasufell.

jaror · January 20, 2022, 9:04pm

I didn’t immediately know what you meant, so, for others, it is the abstract file path proposal.

fosskers · January 20, 2022, 10:58pm

Hey so what’s the upgrade path for this for library authors? Should I go around and make text-2.0 the lower bound for all my libs?

Bodigrim · January 20, 2022, 11:51pm

Why make it a lower bound? text is a boot library, so GHC < 9.4 will stick to text-1.2.

The majority of packages do not need any changes to upgrade, just relax the upper bound to text < 2.1. A template cabal.project can be found here: Template cabal.project with text-2.0 support · GitHub

fosskers · January 21, 2022, 12:55am

And I see aeson doing this:

    , text              >=1.2.3.0  && <1.3 || >=2.0 && <2.1

So I’ll follow that pattern. Thank you!
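
To illustrate why a bounds change is usually enough: code that sticks to the public Data.Text API does not observe the internal representation, so it should behave the same under text-1.2 and text-2.0. A minimal sketch, assuming the text and bytestring packages are available; charCount and utf8Size are hypothetical helper names, not part of either library.

```haskell
module Main (main) where

import qualified Data.ByteString as BS
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Number of Unicode code points in the text.
-- This goes through the public API only, so it does not depend on
-- whether the library stores UTF-16 (text-1.2) or UTF-8 (text-2.0).
charCount :: Text -> Int
charCount = T.length

-- | Number of bytes the text occupies once encoded as UTF-8.
utf8Size :: Text -> Int
utf8Size = BS.length . TE.encodeUtf8

main :: IO ()
main = do
  -- "\955" is U+03BB (Greek small letter lambda), two bytes in UTF-8.
  let sample = T.pack "\955x"
  print (charCount sample, utf8Size sample)
```

Packages that reach into Data.Text.Internal or Data.Text.Array are the ones more likely to need real changes.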

gilmi · January 21, 2022, 7:58am

I don’t understand why we can’t (conceptually) remove type String = [Char], change all the relevant functions such as putStrLn to use Text instead, and make string literals have the type Text?

This would pretty much solve the string situation afaict (but of course the cost of breakage would be very high).

Am I missing something here?
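
For reference, part of this is already available on an opt-in basis: the OverloadedStrings extension lets a string literal have type Text, and Data.Text.IO provides a Text version of putStrLn. A minimal sketch, assuming the text package is available (greeting is just an illustrative name); it does not remove String from base, which is what the question is really about.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- With OverloadedStrings the literal is elaborated through fromString,
-- so it can be given the type Text directly, without an explicit T.pack.
greeting :: Text
greeting = "Hello, text!"

main :: IO ()
main =
  -- Data.Text.IO.putStrLn takes a Text instead of a String.
  TIO.putStrLn (T.toUpper greeting)
```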

tomjaguarpaw · January 21, 2022, 10:13am

I think the problem is deeper than just choosing a type. Given that the correct way to interact with a terminal depends on the mode that terminal is in, I don’t think there is a unique choice that works properly. For example:

% cat /tmp/test.hs && ghc /tmp/test.hs && LC_ALL=C /tmp/test
main = putStrLn "💔"
Loaded package environment from /home/tom/.ghc/x86_64-linux-8.10.7/environments/default
test: <stdout>: commitBuffer: invalid argument (invalid character)

I think the system needs rethinking from the ground up.

jaror · January 21, 2022, 10:24am

I think one easy solution is to just always default to utf8, or does that not solve this problem?
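
For a single program this can already be done by hand with System.IO, which may help make the question concrete. A minimal sketch, not a proposal for what base should do: it overrides the locale-derived encoding on stdout only, and brokenHeart is a hypothetical name for the character from the example above.

```haskell
module Main (main) where

import System.IO (hSetEncoding, stdout, utf8)

-- | U+1F494 BROKEN HEART, written as a numeric escape so the source
-- file itself stays ASCII.
brokenHeart :: String
brokenHeart = "\128148"

main :: IO ()
main = do
  -- Replace whatever encoding the locale selected for this handle.
  -- After this, output is encoded as UTF-8 even under LC_ALL=C.
  hSetEncoding stdout utf8
  putStrLn brokenHeart
```

Whether the terminal then displays those bytes correctly is a separate matter.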

tomjaguarpaw · January 21, 2022, 10:34am

That would require that anyone using a Haskell system must have configured their terminal for UTF8, which seems rather a strong requirement.

ketzacoatl · January 21, 2022, 4:38pm

It seems the hardest part of this fix is figuring out the best course of action.

Maybe we should advocate for the HF Technical WG to take on this next step?

Given how GHC breaks our world from time to time, and given how “graceful migrations” is a mostly solved problem, I don’t see why we couldn’t figure this out if we try and push hard on it.

Think of how nice it will be to not need to go through all the extra hoops when using Text!

fosskers · January 21, 2022, 6:03pm

A number of languages enforce UTF8 these days, so I don’t think it’s cruel to expect it.

tomjaguarpaw · January 21, 2022, 8:38pm

Do they even force UTF8 for interacting with the terminal? I’m personally in favour of requiring UTF8 everywhere but I’m not sure if there are other encodings in popular use on interactive terminals.

wiz · January 22, 2022, 9:39am

Yeah, lots of environments have terminals and locales misconfigured. And then there’s a C/C.UTF8 fustercluck.

Personally I’d rather see mojibake once in a while than have some script/build crash for printing pretty unicode lines.
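
A related opt-in behaviour exists today: mkTextEncoding accepts iconv-style suffixes, and "//TRANSLIT" asks GHC to substitute a replacement character instead of raising an error for code points the encoding cannot represent. A minimal sketch, assuming the platform can construct the suffixed encoding; lenientName and makeLenient are hypothetical helper names.

```haskell
module Main (main) where

import System.IO
  ( Handle, hGetEncoding, hSetEncoding, mkTextEncoding, stdout )

-- | Drop any existing "//..." suffix from an encoding name and ask for
-- transliteration instead.
lenientName :: String -> String
lenientName name = takeWhile (/= '/') name ++ "//TRANSLIT"

-- | Keep the handle's current encoding, but make it substitute
-- unrepresentable characters rather than throw.
makeLenient :: Handle -> IO ()
makeLenient h = do
  menc <- hGetEncoding h
  case menc of
    Nothing  -> pure ()  -- binary-mode handle, nothing to adjust
    Just enc -> do
      -- 'show' on a TextEncoding yields its name.
      lenient <- mkTextEncoding (lenientName (show enc))
      hSetEncoding h lenient

main :: IO ()
main = do
  makeLenient stdout
  putStrLn "\128148"  -- U+1F494, not representable in a plain C locale
```

Under a locale that cannot represent the character, this should print a substitute instead of failing with the commitBuffer error shown earlier in the thread.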
