Text Maintainers: Meeting Minutes 2021-04-15

emilypi · April 15, 2021, 9:08pm

Meeting Minutes - Text Maintainers Meeting
2021-04-15

Attendees:

Li-Yao
Emily Pillmore
Andreas Abel
Andrew Lelechenko

Background

This meeting is for the text package maintainers as we begin project planning for the text-utf8 conversion work.

Action Items:

Engage with users of text-icu to determine an optimal path going forward for Unicode as we begin the steps towards a utf-8 text migration.
Engage with maintainers who use text in serialization-heavy libraries to coordinate utf8 changes.
Investigate the performance regressions from GHC 8.10 to 9.0, and further regressions from 9.0 to 9.2.

Text-utf8 package migration:

Performance:

Emily:

It’s important to get messaging right for this. Many expect a huge performance increase.

Andrew:

Changes to branch prediction will probably mean not much of a benefit performance-wise aside from serialization.

Expectations: the benefit will come from aeson, serialization and similar workloads, not necessarily from text itself. We’re excited by z-haskell’s Z-Data approach. Text-icu is perhaps slow on the Haskell side.

Andrew:

There are a few options we have for implementing the conversion from text to unicode:
- Maintainers of text-icu would be able to find native utf8 bindings and provide an interface to use these bindings (best from the perspective of text maintainers).
- We can provide a converter between utf8 text and utf16.
- Abandon text-icu and implement all of its functionality as native Haskell libraries.
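As a rough illustration of the second option, a converter between the two encodings can already be sketched with the existing `Data.Text.Encoding` functions. This is only a sketch, not the design discussed in the meeting: the helper names `utf8ToUtf16LE` and `utf16LEToUtf8` are hypothetical, and the conversion goes through a `Text` value rather than transcoding directly.

```haskell
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: re-encode UTF-8 bytes as UTF-16 (little endian).
-- Note: decodeUtf8 throws an exception on invalid input; decodeUtf8'
-- returns an Either instead and would be the safer choice in real code.
utf8ToUtf16LE :: B.ByteString -> B.ByteString
utf8ToUtf16LE = TE.encodeUtf16LE . TE.decodeUtf8

-- Hypothetical helper: the reverse direction, UTF-16LE bytes to UTF-8.
utf16LEToUtf8 :: B.ByteString -> B.ByteString
utf16LEToUtf8 = TE.encodeUtf8 . TE.decodeUtf16LE

main :: IO ()
main = do
  -- A short string with two non-ASCII characters (U+00EF and U+03BB).
  let original = TE.encodeUtf8 (T.pack "na\239ve \955")
      utf16    = utf8ToUtf16LE original
  -- Byte lengths differ between the two encodings.
  print (B.length original, B.length utf16)
  -- Round-tripping valid input should give back the original bytes.
  print (utf16LEToUtf8 utf16 == original)
```

Each call here pays for a full decode and re-encode, which is the kind of cost a binding such as text-icu would incur at its boundary if it kept a utf16 interface on top of utf8 text.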

Harendra Kumar was working on this with haskell-data. Text-icu was already requiring the icu library internally.

Andreas:

The last point has historically been a nuisance in Agda.
Agda uses text-icu to compute the width of characters. We are not married to the idea.

Andrew:

Unsure of the users of text-icu aside from folks like Ed. Many have switched to unicode-transform.

Emily:

We’ve gone to a handful of industry partners and have had no negative feedback on the issue; the response was lukewarm at worst.
Community members, especially folks like Michael and Ed, have given the thumbs up on the idea, so I imagine there will be strong backing behind it.

Andrew:

Step 1 is to understand why text is 7x regressed between 8.10 and 9.0, and further between 9.0 and 9.2.
Then we can decide on next steps. It’s hard to say what to do until we figure out what’s wrong.
We also want to talk with stakeholders of text and text-icu to figure out what the timeline should be for migration, and how they will be affected.

Emily:

Fair. We’ll do our investigation with the maintainers of megaparsec, attoparsec, http-client, hexml, aeson etc. to gauge tolerance for migration and what we can do to make it easier.
We’ll schedule it in the coming weeks. Hopefully we can get some GHC folks like AndreasK or Bgamari to help out with diagnosing the regressions.

Andrew:

We need someone to lead the benchmarking work.

Bodigrim · April 15, 2021, 10:47pm

With respect to native Haskell libraries, covering some features of text-icu:
https://hackage.haskell.org/package/unicode-transforms for normalization
https://github.com/jgm/unicode-collation for collation

Bodigrim · April 15, 2021, 10:48pm

https://github.com/composewell/unicode-data for character database
https://hackage.haskell.org/package/unicode-general-category for character database

I would like to hear from text-icu users, which features remain missing.
(Discourse prohibits me from posting more than two links per message %)

jaror · April 15, 2021, 11:36pm

I think you can ask @jaspervdj to give you more permissions.

Edit: I remember now, in the comment here he said he granted a higher trust level to working group members, maybe the same can be done for you. But otherwise you can just move up in the trust hierarchy by interacting with discourse. Here is an overview of the trust levels: Understanding Discourse Trust Levels. So I think you can already post many links when you’ve reached level 1, which just requires 10 minutes of reading at least 5 different topics and 30 posts in total.

Bodigrim · April 16, 2021, 12:08am

So in this dystopian world I need to gain trust of a robot by doing meaningless chores?

My comments in full can be found at https://www.reddit.com/r/haskell/comments/mromg9/text_maintainers_textutf8_migration_discussion/gunucar

f-a · April 16, 2021, 12:23am

Or that of a human — you should now be able to post as many links per message as you wish.

ketzacoatl · April 16, 2021, 1:30am

Step 1 is to understand why text is 7x regressed between 8.10 and 9.0, and further between 9.0 and 9.2.

Yikes!

Then we can decide on next steps. It’s hard to say what to do until we figure out what’s wrong.

Thank you for making that a priority!

jaspervdj · April 16, 2021, 9:54am

I bumped the level manually. Sorry for the hassle.

harendra · April 17, 2021, 7:23pm

We have a write-up on the state of Unicode in Haskell and our plans for it: https://github.com/composewell/streamly/blob/master/design/unicode.md . We have factored the unicode-data package out of unicode-transforms to make it more modular. unicode-transforms supports Unicode normalization; other features from text-icu can also be supported with some effort.

However, ideally I would like the Haskell ecosystem to not use text- and bytestring-like libraries in the long run, and instead use streams and arrays as the fundamental constructs. We can encode and serialize streams to arrays of any type of data, be it UTF8 or UTF16. Similarly, we can de-serialize UTF8 or UTF16 data from an array to a stream and process it in whatever way. Streamly supports this already. The text type could then just be sugar on top of the array type if it’s really wanted for convenience.

I would be happy to help in any case.

harendra · April 17, 2021, 7:27pm

That’s interesting. Streamly also regressed with GHC 9.0; we have not investigated it yet. The fundamental benchmarks seem to be fine, but the high-level benchmarks regressed, which makes me suspect that GHC 9.0 changed something in how rewrite rules fire.

sclv · April 18, 2021, 4:28pm

I definitely ran into a rewrite rule regression between text on 8.6 and 8.10, though I couldn’t distill it down into a small case. Inspecting core revealed that a fusion took place that hadn’t before, which resulted in a whole lot of effectively useless passing through stream and unstream conversions, causing a huge blowup in time and allocations. The culprit was fused append calls in the context of some rather tricky cps-style code.
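For readers unfamiliar with the mechanism, here is a toy sketch of stream/unstream fusion. It is not text's actual internals: the `Arr` and `Stream` types, the function bodies, and the rule name are simplified stand-ins. It only shows the shape of the rewrite rule involved, and the rule only fires when compiling with optimization (`-O`).

```haskell
module Main (main) where

-- Toy stand-ins: Arr plays the role of a packed text value,
-- Stream the role of the intermediate fusible representation.
newtype Arr    = Arr String
newtype Stream = Stream String

stream :: Arr -> Stream
stream (Arr cs) = Stream cs
{-# NOINLINE [1] stream #-}   -- keep the call visible so the rule can match

unstream :: Stream -> Arr
unstream (Stream cs) = Arr cs
{-# NOINLINE [1] unstream #-}

-- The fusion rule: converting to an array and straight back is a no-op.
{-# RULES "toy stream/unstream" forall s. stream (unstream s) = s #-}

-- The stream-level operation that does the actual work.
appendS :: Stream -> Stream -> Stream
appendS (Stream a) (Stream b) = Stream (a ++ b)

-- The array-level API is a wrapper; inlining it exposes
-- stream (unstream ...) pairs between nested calls.
append :: Arr -> Arr -> Arr
append a b = unstream (appendS (stream a) (stream b))
{-# INLINE append #-}

main :: IO ()
main = do
  let Arr out = append (append (Arr "ab") (Arr "cd")) (Arr "ef")
  putStrLn out
```

When the rule fires in the right places, nested appends skip the intermediate array. The regression described above is the opposite situation: code ends up passing through conversions that do no useful work, and the result is extra time and allocation rather than savings.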

Bodigrim · April 19, 2021, 8:50pm

Unifying bytestring and text under an umbrella vector type is another HF topic (see https://www.snoyman.com/blog/2021/03/haskell-base-proposal-2/), but this is probably more of a long-term effort and very much in flux (e.g., it’s not quite clear how much of the existing API can be salvaged). On the other hand, switching to utf8 can, at least in theory, be accomplished without any API breakage (modulo internal modules, of course), and thus is a feasible short-term goal.
