Meeting Minutes - Text Maintainers Meeting
2021-04-15
Attendees:
Li-Yao
Emily Pillmore
Andreas Abel
Andrew Lelechenko
Background
This meeting is for the text package maintainers as we begin project planning for the text-utf8 conversion work.
Action Items:
Engage with users of text-icu to determine an optimal path going forward for Unicode as we begin the steps towards a utf-8 text migration.
Engage with maintainers who use text in serialization-heavy libraries to coordinate utf8 changes
Investigate the performance regressions from GHC 8.10 to 9.0, and the further regressions from 9.0 to 9.2
Text-utf8 package migration:
Performance:
Emily:
It’s important to get messaging right for this. Many expect a huge performance increase.
Andrew:
Changes to branch prediction will probably mean there is not much of a performance benefit aside from serialization.
Expectations: the benefit will show up in aeson, serialization, etc., not necessarily in text itself. We’re excited by Z-Haskell’s Z-Data approach. text-icu is perhaps slow on the Haskell side.
Andrew:
We have a few options for handling text-icu in the UTF-8 conversion:
Maintainers of text-icu could find native UTF-8 bindings and provide an interface to them (best from the perspective of the text maintainers).
We can provide a converter between UTF-8 text and UTF-16.
Abandon text-icu and implement all of its functionality as native Haskell libraries.
Harendra Kumar was working on this with haskell-data. Text-icu was already requiring the icu library internally.
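As a rough illustration of the second option, a converter between the two encodings can be sketched with functions that already exist in Data.Text.Encoding. The names utf8ToUtf16LE and utf16LEToUtf8 are hypothetical and not part of text or text-icu, and a real converter would likely avoid going through an intermediate Text value.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- | Re-encode UTF-8 bytes as UTF-16 (little endian) bytes.
-- Hypothetical helper: returns Left with the decoder's message
-- when the input is not valid UTF-8.
utf8ToUtf16LE :: B.ByteString -> Either String B.ByteString
utf8ToUtf16LE bytes =
  case TE.decodeUtf8' bytes of
    Left err  -> Left (show err)
    Right txt -> Right (TE.encodeUtf16LE txt)

-- | Re-encode UTF-16LE bytes as UTF-8.
-- Hypothetical helper: invalid input is replaced with U+FFFD
-- (lenientDecode) instead of throwing an exception.
utf16LEToUtf8 :: B.ByteString -> B.ByteString
utf16LEToUtf8 = TE.encodeUtf8 . TE.decodeUtf16LEWith lenientDecode
```

Both directions pay for a full decode and re-encode, which is the cost this option would impose on every call that crosses into a UTF-16 API.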
Andreas:
The last point was a nuisance historically in Agda
Agda uses text-icu to compute the width of characters. We are not married to the idea.
Andrew:
Unsure of the users of text-icu aside from folks like Ed. Many have switched to unicode-transforms.
Emily:
We’ve gone to a handful of industry partners and have had no negative feedback on the issue; the response was lukewarm at worst.
Community members, especially folks like Michael, Ed, and others, have given the thumbs up on the idea, so I imagine there will be strong backing behind it.
Andrew:
Step 1 is to understand why text regressed 7x between GHC 8.10 and 9.0, and further between 9.0 and 9.2.
Then we can decide on next steps. It’s hard to say what to do until we figure out what’s wrong.
We also want to talk with stakeholders of text and text-icu to figure out what the timeline should be for migration, and how they will be affected.
Emily:
Fair. We’ll do our investigation with the maintainers of megaparsec, attoparsec, http-client, hexml, aeson etc. to gauge tolerance for migration and what we can do to make it easier.
We’ll schedule it in the coming weeks. Hopefully we can get some GHC folks like AndreasK or Bgamari to help out with diagnosing the regressions.
I think you can ask @jaspervdj to give you more permissions.
Edit: I remember now: in the comment here he said he granted a higher trust level to working group members, so maybe the same can be done for you. Otherwise you can move up in the trust hierarchy just by interacting with Discourse. Here is an overview of the trust levels: Understanding Discourse Trust Levels. I think you can already post many links once you’ve reached level 1, which only requires reading for 10 minutes across at least 5 different topics and 30 posts in total.
We have a write-up on the state of Unicode in Haskell and our plans for it: https://github.com/composewell/streamly/blob/master/design/unicode.md . We have factored the unicode-data package out of unicode-transforms to make it more modular. unicode-transforms supports Unicode normalization; other features from text-icu can also be supported with some effort.
However, ideally I would like the Haskell ecosystem to not use text- and bytestring-like libraries in the long run. Instead, use streams and arrays as the fundamental constructs. We can encode and serialize streams to arrays of any type of data, be it UTF8 or UTF16. Similarly, we can de-serialize UTF8 or UTF16 data from an array to a stream and process it in whatever way. Streamly supports this already. The text type could then just be sugar on top of the array type if it’s really wanted for convenience.
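To make the encoding step concrete, here is a minimal base-only sketch in which plain lists stand in for both streams and arrays. It is not Streamly's API; every name in it is hypothetical. It only shows that the same stream of Char can be serialized to either UTF-8 bytes or UTF-16 code units.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word16, Word8)

-- | Encode one code point as 1 to 4 UTF-8 bytes.
-- Note: surrogate code points (U+D800..U+DFFF) are not rejected here;
-- a real encoder would have to replace or refuse them.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6, cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n :: Int
    n = ord c
    -- Payload bits of the leading byte.
    lead :: Int -> Word8
    lead k = fromIntegral (n `shiftR` k)
    -- Continuation byte: 10xxxxxx with six payload bits.
    cont :: Int -> Word8
    cont k = 0x80 .|. fromIntegral ((n `shiftR` k) .&. 0x3F)

-- | Encode one code point as 1 or 2 UTF-16 code units.
encodeCharUtf16 :: Char -> [Word16]
encodeCharUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 .|. fromIntegral (m `shiftR` 10)
                  , 0xDC00 .|. fromIntegral (m .&. 0x3FF) ]
  where
    n, m :: Int
    n = ord c
    m = n - 0x10000

-- | "Stream of Char" to "array of code units", with lists for both.
toUtf8 :: [Char] -> [Word8]
toUtf8 = concatMap encodeCharUtf8

toUtf16 :: [Char] -> [Word16]
toUtf16 = concatMap encodeCharUtf16
```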
That’s interesting. Streamly also regressed with GHC 9.0; we have not investigated it yet. The fundamental benchmarks seem to be fine, but the high-level benchmarks regressed, which makes me suspect that GHC 9.0 changed how rewrite rules fire.
I definitely ran into a rewrite rule regression between text on 8.6 and 8.10, though I couldn’t distill it down into a small case. Inspecting core revealed that a fusion took place that hadn’t before, which resulted in a whole lot of effectively useless passing through stream and unstream conversions, causing a huge blowup in time and allocations. The culprit was fused append calls in the context of some rather tricky cps-style code.
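For readers unfamiliar with the mechanism, the following is a toy sketch of stream fusion with a stream/unstream rewrite rule. It uses lists instead of Text and is not the code in the text package; all names are illustrative. When the rule fires (with optimization on), the intermediate list between two mapL calls disappears. When a rule like this fires or fails to fire in an unexpected place, the program still computes the same result but may do far more conversion work.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: either finished, or an element plus the next state.
data Step s a = Done | Yield a s

-- A stream is a step function together with a hidden state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Convert a list into a stream. Inlining is delayed to the last
-- simplifier phase so the rewrite rule below gets a chance to match.
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []       = Done
    next (x : xs) = Yield x xs
{-# INLINE [0] stream #-}

-- Convert a stream back into a list.
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Yield x s' -> x : go s'
{-# INLINE [0] unstream #-}

-- The fusion rule: converting to a list and straight back is a no-op.
{-# RULES "toy stream/unstream" forall s. stream (unstream s) = s #-}

-- Map over a stream without building any intermediate structure.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Yield x s' -> Yield (f x) s'
{-# INLINE mapS #-}

-- List-level map defined through streams. In `mapL f (mapL g xs)` the
-- inner unstream meets the outer stream, which is where the rule applies.
mapL :: (a -> b) -> [a] -> [b]
mapL f = unstream . mapS f . stream
{-# INLINE mapL #-}
```

Inspecting the Core output (for example with -ddump-simpl, or -ddump-rule-firings to see which rules fired) is the usual way to check whether the conversions were actually eliminated.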
Unifying bytestring and text under an umbrella vector type is another HF topic (see https://www.snoyman.com/blog/2021/03/haskell-base-proposal-2/), but this is probably more of a long-term nature and very much in flux (e.g., it’s not quite clear how much of the existing API can be salvaged). On the other hand, switching to utf8 - at least in theory - can be accomplished without any API breakage (modulo internal modules, of course), and thus is a feasible short-term goal.