HF Tech Proposal #1: UTF-8 Encoded Text

This proposal outlines a project plan for the migration of the text package from its current default encoding (UTF-16) to a new default of UTF-8.

The lack of UTF-8 as a default in the text package has been a pain point raised by the Haskell community and many of our industry partners for many years. We have done our homework in soliciting feedback from the broader community and industry, and have received positive affirmation of the following with regard to the proposal:

  • This is desirable: Overwhelmingly, the community was positive on such a change, and industry was the same.
  • There is an appetite for breakage: neither the community nor the people we spoke to in industry were concerned about the cost of migration or breakage, though, as a deliverable, we will make sure to minimize the cost of migration by reusing the text package name, and providing alternatives in the case that users require UTF-16 text.
  • The text maintainers are on board: All text maintainers currently support this effort, and will act as reviewers and support the project leader throughout the course of the project.

You can view the proposal here: https://github.com/haskellfoundation/tech-proposals/pull/1

Apologies if I’m just confused and asking a silly question, but I am curious: the net gain for the community at large is not just performance, right? It is also about “cleaning up” the Text/String story? So does this proposal open the door towards improving the situation with String/Text/ByteString and the prolific use of String in our “core libraries” (despite the widespread recommendation to use Text/ByteString instead of String)? Or what does this “unblock”?

@ketzacoatl https://github.com/haskellfoundation/tech-proposals/pull/1/files#diff-499cc0f92214d3033964ae2965f771a95c284603bdd3ca1ce6336543c20549e9R18-R71

I think the rendered document might be easier to read: https://github.com/haskellfoundation/tech-proposals/blob/7b79d755f2d5b08c647ee841d608823bff6d9b48/proposals/002-text-utf-default.md#what-is-unicode

But this seems to only mention performance benefits: UTF-8 takes less space and it is more popular (so you probably need fewer conversions).
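
To make the space argument concrete, here is a minimal sketch that compares the encoded size of a few strings in both encodings. It assumes the `text` and `bytestring` packages that ship with GHC; `encodedSizes` is a name made up for this example, and it measures encoded bytes only, not the in-memory overhead of a `Text` value.

```haskell
module Main where

import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Encoded size in bytes as (UTF-8, UTF-16).
-- Hypothetical helper, not part of the text API.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  ( BS.length (TE.encodeUtf8 t)
  , BS.length (TE.encodeUtf16LE t)
  )

main :: IO ()
main = do
  -- ASCII: one byte per character in UTF-8, two in UTF-16.
  print (encodedSizes (T.pack "hello"))
  -- Cyrillic (six letters): two bytes per character in both encodings.
  print (encodedSizes (T.pack "\1087\1088\1080\1074\1077\1090"))
  -- CJK (three characters): three bytes each in UTF-8, two in UTF-16.
  print (encodedSizes (T.pack "\26085\26412\35486"))
```

The saving is largest for ASCII-heavy data such as source code, JSON keys, or markup; for some scripts UTF-16 is the smaller of the two.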

(I personally think the performance gains are convincing, but your answer does not seem to address @ketzacoatl’s question)

Thanks for the reference, that is a nice and concise description of the technical change. I do understand that general change (the switch to UTF-8 and making other string types available for the various use cases). As @jaror notes, my question is a little different, though I did not present it very clearly. I’m asking more about the opportunities that are more readily accessible, once the change has been made and has had time to percolate through the package ecosystem. For example, would GHC find these types useful? Or would we be closer to deprecating String across the ecosystem?

I’m not an expert in such matters but I can’t see how it gets us any closer to deprecating String (other than by generating community momentum towards improving text types in general). UTF8 Text will be API compatible with UTF16 text so anything that can be done API-wise with UTF8 Text could already have been done.
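
A small sketch of what "API compatible" means in practice, assuming the public `Data.Text` functions keep their current meaning as described above: code like this works in terms of characters and never observes the internal encoding. `sample` is just an illustrative value.

```haskell
module Main where

import qualified Data.Text as T

-- "naive cafe" written with two non-ASCII characters (i-diaeresis, e-acute).
sample :: T.Text
sample = T.pack "na\239ve caf\233"

main :: IO ()
main = do
  -- length counts characters, not UTF-8 or UTF-16 code units.
  print (T.length sample)
  -- words and reverse also operate on characters.
  print (length (T.words sample))
  print (T.unpack (T.reverse (T.pack "caf\233")) == "\233fac")
```

Only code that depends on the internal representation (for example, via the library's internal or unsafe modules) would need to care about the switch.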

> I’m not an expert in such matters but I can’t see how it gets us any closer to deprecating String (other than by generating community momentum towards improving text types in general).

This. I don’t think deprecating String is going to happen anytime in the next 5 years. There is simply too much code to contend with and upgrade at this point, without decent alternatives. The best we can do is make reasonable alternatives available and urge people to choose them over String until a critical mass is reached. Only then can we even think about it. However, I’m unsold on the notion as it is: String is not good for some programs, but it’s fine for others! String is not performant, per se, and we should not offer it as a default string representation along with Text variants (which only leads to a proliferation of mostly-the-same-but-subtly-different datatypes), but it’s not wrong to write String in many cases.

I think the attitude that we should deprecate string is a function of frustration with it being the default, rather than one made on technical grounds. In that sense, I can commiserate, and I hope we get there, but let’s not go too far.

It would certainly take time, and be mostly boring work, but (at least I believe) it would be worth the effort. We would then be able to skip all the (un)packing and translation we do. At least, it seems pretty common to use Text exclusively on internal datatypes, but then use translation functions to keep base happy. That then spills over into the other libraries in the ecosystem, because base did it and it’s easier to be in sync with base. I’ve gotten used to the translation, but it seems a bit unnecessary, and it adds overhead (physical and cognitive). Just think of how much code you’d remove if your app’s internal textual data structures matched the base and generic library functions you use.
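
For anyone who has not run into this, a minimal sketch of the kind of translation meant here. The `User` type and both helpers are hypothetical; `readMaybe` and `show` are the real base functions, and both work on String.

```haskell
module Main where

import qualified Data.Text as T
import Text.Read (readMaybe)

-- Hypothetical application type that stores its text as Text.
data User = User { userName :: T.Text, userAge :: Int }

-- base's readMaybe takes a String, so the Text is unpacked first.
parseAge :: T.Text -> Maybe Int
parseAge = readMaybe . T.unpack

-- base's show returns a String, so the result is packed back into Text.
describe :: User -> T.Text
describe u =
  userName u <> T.pack " (" <> T.pack (show (userAge u)) <> T.pack ")"

main :: IO ()
main = do
  print (parseAge (T.pack "42"))
  print (describe (User (T.pack "Ada") 36))
```

Every `T.pack` and `T.unpack` here exists only to cross the boundary with base; each one is also a full copy of the string.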

Thank you all for helping me better understand the impact (and not) of the proposal.

I believe we actually do aim to improve the String story in Haskell. Part of the current impasse is that there is no strictly preferable candidate to replace String in the core API. One could argue that it should be Text, but in many contexts, when data is ASCII, people stick to ByteString because it takes less memory. By backing Text with UTF-8 encoding we remove that incentive and promote greater convergence.

Ok, thank you for explaining @bodigrim. That is how I understood it - with the changes to Text, we have a better foundation for agreement on deprecating String… but even if there were enough interested contributors to make that happen, are there other technical hurdles? Or is the primary challenge going to be building consensus amongst ghc and base/core library developers for this change?

The primary challenge is to provide a migration strategy. If you just swap String for Text in base, everything is broken overnight. The original motivation behind Backpack, I believe, was to facilitate such a migration by making base polymorphic over a string interface, so that legacy users could instantiate it with String, and new ones with Text or whatever.
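
To illustrate the idea of code written against a string interface and instantiated at either type, here is a rough sketch. Note that it uses an ordinary type class as a stand-in, not Backpack: Backpack would express the interface as a module signature and pick the implementation at the package level. The class `StringLike`, its methods, and `greet` are all made up for this example.

```haskell
{-# LANGUAGE FlexibleInstances #-}
module Main where

import qualified Data.Text as T

-- Hypothetical string interface; names are illustrative only.
class StringLike s where
  fromStr :: String -> s
  toStr   :: s -> String
  append  :: s -> s -> s

-- Legacy instantiation: plain String.
instance StringLike [Char] where
  fromStr = id
  toStr   = id
  append  = (++)

-- Newer instantiation: Text.
instance StringLike T.Text where
  fromStr = T.pack
  toStr   = T.unpack
  append  = T.append

-- "Library" code written once against the interface.
greet :: StringLike s => s -> s
greet name = fromStr "Hello, " `append` name

main :: IO ()
main = do
  putStrLn (greet "world")          -- used at String
  print (greet (T.pack "world"))    -- used at Text
```

The type class version resolves the choice per call site through dictionaries, whereas the Backpack approach would fix it once for a whole library, which is part of why the two feel so different to use.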

Yeah, the details of how to migrate are important.

Is it easier to make the interface polymorphic, or just provide the hammer and nails à la some duplicate functions with differing names and such, and then work through the conversion across the ecosystem piecemeal?

I would guess a polymorphic interface would be easier, but is that actually true? Was backpack not successful or could it not be useful in the migration?

Thank you @Bodigrim, for your efforts on the proposal and explaining this clearly.

Backpack did not take off (yet?). This is often attributed to a lack of support in Stack, but I’m actually doubtful how significant it is. Maybe there is simply no Backpack champion to promote it.

I have not seen any specific proposals for how to switch to better string types in base, with or without Backpack.

ezyang did produce backpack-str way back when, so at least some of that work is likely usable. This is, in a very real way, the purpose of Backpack, and if you want to find a champion then “fix the most complaint-generating wart in the Haskell ecosystem” seems like a heck of a bait opportunity. I’d certainly be interested in working on it.

Sorry, I am probably ignorant of a bunch of backstory and context here, but what is wrong with backpack/backpack-str, or why haven’t we figured out how to integrate that into core and start smoothing out the bumps? Am I being naive, or isn’t this exactly what we’d want to improve the experience with our text types in Haskell?

I would really like to see Backpack get more traction too. One big argument against it is that it is yet another not-so-mature feature of Haskell that everyone would have to learn. Moreover, as far as I am aware, no existing alternative prelude/base has taken the backpack-str approach, so making such a library is still a task that needs to be done.

Let’s imagine a world in which the string problem was solved by Backpack. What would it look like?

All the important libraries (base, aeson…) would be indefinite: they would depend on a module signature for strings, instead of using a concrete type.

The decision about which kind of string to use would take place at executables and test suites, by making them depend on the corresponding implementation package.

Is this realistic?

I don’t see large swathes of the ecosystem choosing to become indefinite. A more conservative and backwards-compatible approach could involve the “multiple public libraries” feature of Cabal (do those work on Hackage yet?) For a certain package, the main library would be the traditional concrete form; the indefinite form could be exposed as a sublibrary for those interested. To avoid duplication, the concrete form should be implemented in terms of the indefinite one (this would require the use of reexported-modules:, which don’t play well with Haddocks IIRC.)

Defining an executable or test suite would often require an annoying extra dependency: even if your Main.hs doesn’t handle strings directly, some indefinite dependency might, so you’d need to throw in some string implementation.

There’s the issue of how to handle versioning of module signatures. They’re a bit weird in that respect, because merely adding a function to them is a breaking change (because some implementation packages might suddenly become incomplete!)

So far, few actual libraries on Hackage have chosen to use Backpack to provide flexibility in string representation. I expected wider adoption. But in retrospect it’s understandable why it didn’t catch on: fiddling with compilation units involves a lot of ceremony, and module signatures are very different from the typeclass-based polymorphism Haskellers are familiar with.

This is the same with type classes, so I don’t think it is that weird.

I also wonder if something like type class defaulting rules could be added to backpack to keep the same end-user experience as we have now, while allowing people to explicitly declare that they want a different string implementation. Library authors can already create fully instantiated versions of their packages. That is what base should do imo. Then you can view backpack simply as a tool that library authors can use to write multiple versions of their package in parallel without having to duplicate code.

OK, so backpack is one of the “alternative preludes”, and backpack-str is its internal representation of the text/string types. That makes more sense, and I didn’t realize that distinction (thank you for clarifying).

I appreciate the guarantees Haskell provides, and I’m not advocating for adoption of backpack as much as asking: what is the (correct) short path (for GHC/base/core-libraries) to take towards deprecating String and cleaning up the types we have to represent the different text use cases (now that we know more about those use cases and the differing requirements, which led to all the different types)? backpack-str seems to be one option, either directly or by generalizing its approach and adapting to base/core-libraries.