This work would not be complete without a blazingly fast UTF-8 validator, submitted by Koz Ross into bytestring-0.11.2.0, whose contributions were sourced via HF as an in-kind donation from MLabs. I would like to thank Emily Pillmore for encouraging me to take on this project and for helping with the proposal and permissions. I’m grateful to my fellow text maintainers, who’ve been carefully reviewing my work over the course of the last six months, as well as to the helpful and responsive maintainers of downstream packages and GHC developers. Thanks all, it was a pleasant journey!
I don’t have the full picture, so forgive my ignorance on some of this… I am curious, how far down the road does this push us WRT simplifying our String/Text madness in Haskell?
I would not say that String is a weakness, it is actually a strength. There is no way not to have Char in base and there is no way not to have lists, so [Char] will always be available. It totally makes sense for base not to introduce more entities than needed and just use [Char] to represent strings. It’s also a wonderful educational tool: while there are certain pitfalls with readFile :: String -> IO String, a novice can start writing something useful knowing only about lists and basic polymorphic operations on them. This is as simple as possible.
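To make the "only lists and basic operations" point concrete, here is a minimal sketch that needs nothing beyond base. The helper names (shout, trim, wordCount) are made up for illustration; the functions they are built from are ordinary Data.List and Data.Char ones.

```haskell
import Data.Char (isSpace, toUpper)
import Data.List (dropWhileEnd)

-- String is just [Char], so plain list functions apply directly.

-- Upper-case every character with an ordinary map.
shout :: String -> String
shout = map toUpper

-- Strip leading and trailing whitespace with two list traversals.
trim :: String -> String
trim = dropWhileEnd isSpace . dropWhile isSpace

-- Count whitespace-separated words.
wordCount :: String -> Int
wordCount = length . words
```

Loaded into GHCi, these can be tried immediately, e.g. shout "abc" or trim "  hi  ".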
It is a weakness in the sense of making “bad” things easier than “good”.
You start with [Char], then you add some printfs, read a file or two, show some things… and then you find yourself deeply committed to your training wheels.
Yes, I know about some use cases where [Char] is okay. But you need to be on the lookout to prevent it from leaking into all the other places.
I’m even fine with the [Char] type by itself. The madness lies in the String.
In practice, String and [Char] are warts that make implementations more complex, more verbose, and less straightforward for practical use of Haskell in production. They also complicate teaching/learning Haskell to/for beginners. To say otherwise is not wrong, but obviously not the conclusion you would arrive at as a new Haskeller uninhibited by experiential baggage from Haskell of yesteryear.
The madness lies in the String.
I’d argue the madness lies in the prolific use of String in the ecosystem, starting with base and extending to the rest of our “standard library”.
I don’t understand why we can’t (conceptually) remove type String = [Char], change all the relevant functions such as putStrLn to use Text instead, and make string literals have the type Text?
This would pretty much solve the string situation afaict (but of course the cost of breakage would be very high).
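For comparison, part of this can already be approximated per module today, assuming the text package is available: the OverloadedStrings extension lets a literal be typed as Text, and Data.Text.IO provides a Text-based putStrLn. A rough sketch (greeting and printGreeting are hypothetical names):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- With OverloadedStrings the literal is typed via IsString,
-- so it can be a Text without an explicit T.pack.
greeting :: Text
greeting = "Hello, Text!"

-- Data.Text.IO.putStrLn takes Text rather than String.
printGreeting :: IO ()
printGreeting = TIO.putStrLn (T.toUpper greeting)
```

This is opt-in and only covers literals and the functions you import yourself; everything in base that takes or returns String is unaffected, which is the part the proposal above would change.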
I think the problem is deeper than just choosing a type. Given that the correct way to interact with a terminal depends on the mode that terminal is in, I don’t think there is a unique choice that works properly. For example
It seems, the hardest part of this fix is figuring out the best course of action.
Maybe we should advocate for the HF Technical WG to take on this next step?
Given how GHC breaks our world from time to time, and given how “graceful migrations” is a mostly solved problem, I don’t see why we couldn’t figure this out if we try and push hard on it.
Think of how nice it will be to not need to go through all the extra hoops when using Text!
Do they even force UTF8 for interacting with the terminal? I’m personally in favour of requiring UTF8 everywhere but I’m not sure if there are other encodings in popular use on interactive terminals.
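As far as base goes, handles such as stdout are not forced to UTF-8; by default they use an encoding derived from the locale. A program can inspect and override that itself. A small sketch using only System.IO (forceUtf8Stdout is a hypothetical name):

```haskell
import System.IO (hGetEncoding, hSetEncoding, stdout, utf8)

-- Report the encoding stdout started with, then switch it to UTF-8.
forceUtf8Stdout :: IO ()
forceUtf8Stdout = do
  enc <- hGetEncoding stdout  -- Nothing means the handle is in binary mode
  putStrLn ("stdout encoding was: " ++ maybe "binary" show enc)
  hSetEncoding stdout utf8
  -- Non-ASCII output is encoded as UTF-8 from here on, whatever the locale says.
  putStrLn "λ → UTF-8"
```

Whether the terminal on the other end actually interprets those bytes as UTF-8 is a separate question that this does not answer.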