Text Maintainers: Meeting Minutes 2021-04-15

Attendees:

Li-Yao
Emily Pillmore
Andreas Abel
Andrew Lelechenko

Background

This meeting is for the text package maintainers as we begin project planning for the text-utf8 conversion work.

Action Items:

  • Engage with users of text-icu to determine an optimal path going forward for Unicode as we begin the steps towards a utf-8 text migration.
  • Engage with maintainers who use text in serialization-heavy libraries to coordinate utf8 changes
  • Investigate the performance regressions from GHC 8.10 to 9.0, and the further regressions from 9.0 to 9.2

Text-utf8 package migration:

Performance:

Emily:

  • It’s important to get messaging right for this. Many expect a huge performance increase.

Andrew:

  • Changes to branch prediction will probably mean not much of a benefit performance-wise aside from serialization.

Expectations: the benefit will show up in aeson, serialization, and similar consumers, not necessarily in text itself. We’re excited by z-haskell’s Z-Data approach. Text-icu is perhaps slow on the Haskell side.

Andrew:

  • There are a few options we have for implementing the conversion from text to unicode:
    • Maintainers of text-icu would be able to find native utf8 bindings and provide an interface to use these bindings (best from the perspective of the text maintainers).
    • We can provide a converter between utf8 text and utf16.
    • Abandon text-icu. Implement all of its functionality as native Haskell libs.

Harendra Kumar was working on this with haskell-data. Text-icu was already requiring the icu library internally.

Andreas:

  • The last point was a nuisance historically in Agda
  • Agda uses text-icu to compute the width of characters. We are not married to the idea.
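
As an illustration of the converter option listed above, here is a minimal sketch built only on functions that already exist in Data.Text.Encoding. The names utf8ToUtf16LE and utf16LEToUtf8 are hypothetical, and a real converter for text-icu would more likely work on the underlying buffers directly instead of going through a Text value.

```haskell
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Hypothetical helper: re-encode UTF-8 bytes as UTF-16 (little endian).
-- Invalid UTF-8 input is reported as a Left value instead of throwing.
utf8ToUtf16LE :: ByteString -> Either UnicodeException ByteString
utf8ToUtf16LE = fmap TE.encodeUtf16LE . TE.decodeUtf8'

-- Hypothetical helper: re-encode UTF-16 (little endian) bytes as UTF-8.
-- Note: decodeUtf16LE throws an exception on invalid input.
utf16LEToUtf8 :: ByteString -> ByteString
utf16LEToUtf8 = TE.encodeUtf8 . TE.decodeUtf16LE
```

Both directions pay for an intermediate Text, so this only shows the shape of the interface, not its likely cost.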

Andrew:

  • Unsure of the users of text-icu aside from folks like Ed. Many have switched to unicode-transforms.

Emily:

  • We’ve asked a handful of industry partners and have had no negative feedback on the issue. If anything, the opposite: the response was lukewarm at worst.
  • Community buy-in looks good: folks like Michael, Ed, and others have given the thumbs-up, so I imagine there will be strong backing behind this idea.

Andrew:

  • Step 1 is to understand why text regressed 7x between GHC 8.10 and 9.0, and further between 9.0 and 9.2.
  • Then we can decide on next steps. It’s hard to say what to do until we figure out what’s wrong.
  • We also want to talk with stakeholders of text and text-icu to figure out what the timeline should be for migration, and how they will be affected.

Emily:

  • Fair. We’ll do our investigation with the maintainers of megaparsec, attoparsec, http-client, hexml, aeson etc. to gauge tolerance for migration and what we can do to make it easier.
  • We’ll schedule it in the coming weeks. Hopefully we can get some GHC folks like AndreasK or Bgamari to help out with diagnosing the regressions.

Andrew:

  • We need someone to lead the benchmarking work.
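
As a rough starting point for the regression investigation, the sketch below times one text workload with plain CPU time, so that the same file can be compiled with GHC 8.10, 9.0 and 9.2 and the numbers compared. The workload and the names timed, workload and reportWorkload are assumptions made for illustration; they are not taken from the text benchmark suite, and a proper benchmarking library would give far more reliable numbers.

```haskell
import Control.Exception (evaluate)
import qualified Data.Text as T
import System.CPUTime (getCPUTime)

-- Run an action once; return its result and the CPU time used, in seconds.
-- getCPUTime reports picoseconds, hence the division by 1e12.
timed :: IO a -> IO (a, Double)
timed act = do
  start <- getCPUTime
  r <- act
  end <- getCPUTime
  pure (r, fromIntegral (end - start) / 1e12)

-- Hypothetical workload: build a Text of n repeated chunks, upper-case it,
-- and force the result by taking its length.
workload :: Int -> Int
workload n = T.length (T.toUpper (T.replicate n (T.pack "text-utf8 ")))

-- Time the workload once and print the measurement.
reportWorkload :: Int -> IO ()
reportWorkload n = do
  (len, secs) <- timed (evaluate (workload n))
  putStrLn ("length " ++ show len ++ ", cpu seconds " ++ show secs)
```

Calling reportWorkload from main with a large n under each compiler, with -O enabled, gives a first coarse comparison; a single run like this is noisy and should only be used to decide where to look more closely.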

With respect to native Haskell libraries, covering some features of text-icu:
https://hackage.haskell.org/package/unicode-transforms for normalization
https://github.com/jgm/unicode-collation for collation

https://github.com/composewell/unicode-data for character database
https://hackage.haskell.org/package/unicode-general-category for character database

I would like to hear from text-icu users, which features remain missing.
(Discourse prohibits me from posting more than two links per message %)

I think you can ask @jaspervdj to give you more permissions.

Edit: I remember now, in the comment here he said he granted a higher trust level to working group members, maybe the same can be done for you. But otherwise you can just move up in the trust hierarchy by interacting with discourse. Here is an overview of the trust levels: https://blog.discourse.org/2018/06/understanding-discourse-trust-levels/. So I think you can already post many links once you’ve reached level 1, which just requires spending 10 minutes reading at least 5 different topics and 30 posts in total.

So in this dystopian world I need to gain trust of a robot by doing meaningless chores?

My comments in full can be found at https://www.reddit.com/r/haskell/comments/mromg9/text_maintainers_textutf8_migration_discussion/gunucar

Or that of a human — you should now be able to post as many links per message as you wish.

Step 1 is to understand why text is 7x regressed between 8.10 and 9.0, and further between 9.0 and 9.2.

Yikes!

Then we can decide on next steps. It’s hard to say what to do until we figure out what’s wrong.

Thank you for making that a priority!

I bumped the level manually. Sorry for the hassle.

We have a write-up on the state of Unicode in Haskell and our plans for it: https://github.com/composewell/streamly/blob/master/design/unicode.md. We have factored out the unicode-data package from unicode-transforms to make it more modular. unicode-transforms supports Unicode normalization; other features from text-icu can also be supported with some effort.

However, ideally I would like the Haskell ecosystem to not use text- and bytestring-like libraries in the long run. Instead, use streams and arrays as the fundamental constructs. We can encode and serialize streams to arrays of any type of data, be it UTF8 or UTF16. Similarly, we can de-serialize UTF8 or UTF16 data from an array to a stream and process it in whatever way. Streamly supports this already. The text type could then just be sugar on top of the array type if it’s really wanted for convenience.

I would be happy to help in any case.

That’s interesting. Streamly also regressed with GHC 9.0; we have not investigated it yet. Though the fundamental benchmarks seem to be fine, the high-level benchmarks regressed, which makes me wonder whether GHC 9.0 messed something up with rewrite rules.

I definitely ran into a rewrite rule regression between text on 8.6 and 8.10, though I couldn’t distill it down into a small case. Inspecting core revealed that a fusion took place that hadn’t before, which resulted in a whole lot of effectively useless passing through stream and unstream conversions, causing a huge blowup in time and allocations. The culprit was fused append calls in the context of some rather tricky cps-style code.
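
For readers unfamiliar with the mechanism, here is a toy model of stream fusion over plain lists. It is a sketch for illustration, not text's actual implementation: the Step and Stream types, mapS and mapL are simplified stand-ins, and only the general shape of the stream/unstream rewrite rule mirrors what text does.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: finished, skip to a new state, or yield an element.
data Step s a = Done | Skip s | Yield a s

-- A stream is a step function plus a hidden initial state.
data Stream a = forall s. Stream (s -> Step s a) s

-- Convert a list into a stream.
stream :: [a] -> Stream a
stream = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs
{-# INLINE [0] stream #-}

-- Convert a stream back into a list.
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield x s' -> x : go s'
{-# INLINE [0] unstream #-}

-- With optimisation on, this rule removes a back-to-back conversion pair.
{-# RULES "stream/unstream" forall s. stream (unstream s) = s #-}

-- Map over a stream without building an intermediate list.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield x s' -> Yield (f x) s'

-- List-level map expressed through streams, so that composed calls can fuse.
mapL :: (a -> b) -> [a] -> [b]
mapL f = unstream . mapS f . stream
{-# INLINE mapL #-}
```

In an expression such as mapL f (mapL g xs), inlining exposes stream (unstream ...) in the middle, which the rule is meant to delete. When a rule fires or fails to fire in a context where it did not before, the amount of work done in conversions can change a lot, which is one way a compiler upgrade can shift performance without any source change.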

Unifying bytestring and text under an umbrella vector type is another HF topic (see https://www.snoyman.com/blog/2021/03/haskell-base-proposal-2/), but this is probably more of a long-term nature and very much in flux (e.g., it’s not quite clear how much of the existing API can be salvaged). On the other hand, switching to utf8 - at least in theory - can be accomplished without any API breakage (modulo internal modules, of course), and thus is a feasible short-term goal.
