[ANN] unicode-data-0.3.0: APIs to efficiently access the Unicode character database

Wismill · January 2, 2022, 11:25am

On behalf of the maintainers team I’m happy to announce unicode-data-0.3.0.

unicode-data provides Haskell APIs to efficiently access the latest Unicode character database.
It is up to 5 times faster than base:Data.Char.

This release features:

Support for big-endian architectures.
Support for General_Category property.
All the predicates and case mapping functions found in base:Data.Char.

It follows the 0.2.0 release which added support for:

Unicode 14.0.0.
Unicode Identifier and Pattern Syntax.

See the complete ChangeLog for details.

nomeata · January 2, 2022, 6:19pm

Is there a reason why we can’t make Data.Char just as fast?

Wismill · January 2, 2022, 6:36pm

We use a different approach: Data.Char relies on FFI (see GHC.Unicode) while unicode-data uses only pure Haskell (bitmaps and simple functions). Maybe @Bodigrim or @harendra could explain this better?
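
To make the "bitmaps and simple functions" idea concrete, here is a minimal sketch of a bitmap-backed predicate in pure Haskell. It is not unicode-data's actual code: the names `asciiUpperBitmap` and `isUpperSketch` are hypothetical, and the toy table covers only code points 0 to 127, whereas a real table would be generated from the Unicode character database.

```haskell
import Data.Bits (shiftR, testBit, (.&.))
import Data.Char (ord)
import Data.Word (Word64)

-- Hypothetical toy bitmap: two 64-bit words covering code points 0..127.
-- Bit (cp mod 64) of word (cp div 64) is set when cp is an ASCII uppercase letter.
asciiUpperBitmap :: Int -> Word64
asciiUpperBitmap 1 = 0x0000000007FFFFFE -- bits 1..26 correspond to 'A'..'Z' (65..90)
asciiUpperBitmap _ = 0

-- Membership test: select the word, then test a single bit. No FFI involved.
isUpperSketch :: Char -> Bool
isUpperSketch c
  | cp >= 128 = False -- outside the toy table (assumption: ASCII only)
  | otherwise = testBit (asciiUpperBitmap (cp `shiftR` 6)) (cp .&. 63)
  where
    cp = ord c

main :: IO ()
main = print (map isUpperSketch "aZ[A")
```

The point of the sketch is only that a predicate reduces to an index computation plus a bit test, which the compiler can inline and optimise like any other Haskell code.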

chreekat · January 2, 2022, 6:45pm

Maybe I’m dense, but that sounds like a “how” rather than a “why”. Put it another way, is there a reason Data.Char can’t be reimplemented to use your approach? A pure-Haskell approach sounds better, anyway.

Anyway, good work, thanks for the library!

nomeata · January 2, 2022, 6:49pm

Indeed, pure Haskell and faster sounds like a clear win for everyone (especially people porting Haskell to odd environments). Sounds like folding this into base, or reversing the dependency direction, is worth discussing!

How does code size change?

Bodigrim · January 2, 2022, 6:59pm

It is technically possible to fold unicode-data into base, but I would advise against it. The reason is that Unicode is an evolving standard, and it is desirable to be able to update to the latest version without upgrading your compiler. A standalone package provides such a possibility, but base does not: you are bound to whichever Unicode version was wired in. I suspect not all developers fully realise that the behaviour of a Unicode-aware application depends on the GHC used to build it. In fact, it would be better to deprecate the Unicode API in Data.Char and refer users to unicode-data or similar packages.
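
As a small illustration of that coupling, the following sketch prints the Unicode version wired into the base library in use. It assumes base 4.15 or later, where `GHC.Unicode` exports `unicodeVersion`; on older versions this will not compile.

```haskell
import Data.Version (showVersion)
import GHC.Unicode (unicodeVersion) -- assumption: available since base-4.15

main :: IO ()
main = putStrLn ("base was built against Unicode " ++ showVersion unicodeVersion)
```

Running the same program with two different GHC releases may print two different versions, which is the dependency described above.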

nomeata · January 2, 2022, 7:23pm

Well, Data.Char is currently there, and is widely used. As long as it is there and not deprecated, surely changing an FFI implementation to a faster and pure implementation seems desirable – the problems with outdated unicode standards (which probably not all developers care about) are orthogonal to that. And those developers who do care can of course still use such a dedicated library.

ketzacoatl · January 3, 2022, 5:54am

IMHO, we should not restrict the correctness or capabilities of a basic library such as unicode-data because of GHC/base versioning issues. We should instead separate the GHC-dependent and GHC-independent parts of base, and otherwise decouple the versioning there. I know this is more complicated, but I’m sure we can figure it out (for example, the core parts of unicode-data that need to be in base can be, with the “faster moving stuff” in a separate package). But even so, let’s clean up GHC internals so we stop crippling ourselves.

Bodigrim · January 3, 2022, 5:25pm

@nomeata there is more motivation to use a proper solution when it is both faster and correct than when it is just correct. But anyway, unicode-data is open source, so nothing prevents a motivated individual from putting in the effort to merge it into base, I believe.

nomeata · January 3, 2022, 6:07pm

Is that an official approval by the CLC?
Or who’d be the one to tell such a motivated individual that such a change would be welcome?

@Wismill, do you know off hand if your approach can replace all of the following (from include/WCsubst.h)

HsInt u_iswupper(HsInt wc);
HsInt u_iswdigit(HsInt wc);
HsInt u_iswalpha(HsInt wc);
HsInt u_iswcntrl(HsInt wc);
HsInt u_iswspace(HsInt wc);
HsInt u_iswprint(HsInt wc);
HsInt u_iswlower(HsInt wc);

HsInt u_iswalnum(HsInt wc);

HsInt u_towlower(HsInt wc);
HsInt u_towupper(HsInt wc);
HsInt u_towtitle(HsInt wc);

HsInt u_gencat(HsInt wc);

Bodigrim · January 3, 2022, 9:04pm

Or who’d be the one to tell such a motivated individual that such a change would be welcome?

It depends on how and where it is implemented. E.g., you can just rewrite Files · master · Glasgow Haskell Compiler / GHC · GitLab using lookup tables (similar to unicode-data) instead of binary search. This will give you the ultimate performance, ~~and no need for CLC approval, because this is not even Haskell~~ (this C module is a part of base and technically still falls under CLC supervision, but I do not expect much of a fight over an autogenerated file).
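
The difference between the two strategies can be sketched as follows. The file under discussion is C; the sketch is in Haskell only for consistency with the rest of the thread. All names and the tiny ASCII-only data set are hypothetical, and `Data.Array` comes from the `array` package that ships with GHC.

```haskell
import Data.Array (Array, bounds, listArray, (!))
import Data.Char (ord)

-- Hypothetical sorted, non-overlapping (start, end) code point ranges.
-- Toy data: ASCII digits, uppercase letters, lowercase letters.
alnumRanges :: Array Int (Int, Int)
alnumRanges = listArray (0, 2) [(48, 57), (65, 90), (97, 122)]

-- Strategy 1: binary search over the ranges, O(log n) comparisons per query.
inRanges :: Array Int (Int, Int) -> Int -> Bool
inRanges ranges cp = go lo0 hi0
  where
    (lo0, hi0) = bounds ranges
    go lo hi
      | lo > hi = False
      | cp < start = go lo (mid - 1)
      | cp > end = go (mid + 1) hi
      | otherwise = True
      where
        mid = (lo + hi) `div` 2
        (start, end) = ranges ! mid

-- Strategy 2: a precomputed table with one entry per code point,
-- so each query is a single array index (at the cost of a larger table).
alnumTable :: Array Int Bool
alnumTable = listArray (0, 127) [inRanges alnumRanges cp | cp <- [0 .. 127]]

isAlnumTable :: Char -> Bool
isAlnumTable c = cp <= 127 && alnumTable ! cp
  where
    cp = ord c

main :: IO ()
main = print (all (\c -> isAlnumTable c == inRanges alnumRanges (ord c)) ['\0' .. '\127'])
```

A production table would typically be compressed (for example as a bitmap or a multi-stage table) rather than storing one full entry per code point.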

harendra · January 11, 2022, 5:36am

Yes, all of these are implemented. In fact, unicode-data is a drop-in replacement for Data.Char except for a couple of Show/Read related functions.

nomeata · February 22, 2022, 7:11pm

On the page “javascript backend · Wiki · Glasgow Haskell Compiler / GHC · GitLab” I spotted:

What is the plan for handling the various C bits in base/text/bytestring/etc.? Specifically, has there been any motion on discussions with upstreams regarding upstreaming the Javascript implementations?

We plan to propose patches for these libraries. We can use CPP to condition JS specific code to HOST_OS=ghcjs and similarly in .cabal files.

I guess a pure Haskell solution for Data.Char would help with that kind of work.
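
For readers unfamiliar with the CPP conditioning mentioned in the quoted answer, here is a minimal sketch. It assumes the `ghcjs_HOST_OS` macro is defined when targeting the JavaScript backend; `backendName` is a hypothetical name used only for illustration.

```haskell
{-# LANGUAGE CPP #-}

-- Hypothetical example: pick an implementation depending on the target.
backendName :: String
#if defined(ghcjs_HOST_OS)
backendName = "JavaScript"
#else
backendName = "native"
#endif

main :: IO ()
main = putStrLn backendName
```

With a pure Haskell implementation of the character functions, both branches of such a conditional would be the same code, so the conditional would not be needed for them at all.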

nomeata · April 11, 2022, 1:00pm

I started a discussion about how base can benefit here: https://gitlab.haskell.org/ghc/ghc/-/issues/21375. @wismill, your thoughts will be appreciated there
