[ANN] unicode-data-0.3.0: APIs to efficiently access the Unicode character database

On behalf of the maintainers team I’m happy to announce unicode-data-0.3.0.

unicode-data provides Haskell APIs to efficiently access the latest Unicode character database.
It is up to 5 times faster than base:Data.Char.

This release features:

  • Support for big-endian architectures.
  • Support for General_Category property.
  • All the predicates and case mapping functions found in base:Data.Char.

It follows the 0.2.0 release which added support for:

See the complete ChangeLog for details.

Is there a reason why we can’t make Data.Char just as fast?

We use a different approach: Data.Char relies on FFI (see GHC.Unicode) while unicode-data uses only pure Haskell (bitmaps and simple functions). Maybe @Bodigrim or @harendra could explain this better?
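
To make the "bitmaps and simple functions" idea concrete, here is a minimal sketch using only base. It is not how unicode-data is actually implemented: the names `mkBitmap`, `asciiDigitBitmap` and `isAsciiDigit` are hypothetical, and an `Integer` stands in for whatever packed representation the library really uses. It only shows that a property lookup can be a single bit test in pure Haskell.

```haskell
import Data.Bits (setBit, testBit)
import Data.Char (ord)
import Data.List (foldl')

-- Hypothetical helper: pack a set of characters into one Integer,
-- where bit n is set iff code point n has the property.
mkBitmap :: [Char] -> Integer
mkBitmap = foldl' (\acc c -> setBit acc (ord c)) 0

-- Illustrative property table: ASCII decimal digits only.
asciiDigitBitmap :: Integer
asciiDigitBitmap = mkBitmap ['0' .. '9']

-- The predicate is a single bit test; no FFI call is involved.
isAsciiDigit :: Char -> Bool
isAsciiDigit c = testBit asciiDigitBitmap (ord c)

main :: IO ()
main = print (map isAsciiDigit "a1 9")
```

A real table would cover the whole Unicode range and be generated from the Unicode character database rather than written by hand.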

Maybe I’m dense, but that sounds like a “how” rather than a “why”. Put it another way, is there a reason Data.Char can’t be reimplemented to use your approach? A pure-Haskell approach sounds better, anyway. :slight_smile:

Anyway, good work, thanks for the library!

Indeed, pure Haskell and faster sounds like a clear win for everyone (especially people porting Haskell to odd environments). Sounds like folding this into base, or reversing the dependency direction, is worth discussing!

How does code size change?

It is technically possible to fold unicode-data into base, but I would advise against it. The reason is that Unicode is an evolving standard, and it is desirable to be able to update to the latest version without upgrading your compiler. A standalone package makes that possible, but base does not: you are bound to whichever Unicode version was wired in. I suspect not all developers fully realise that the behaviour of a Unicode-aware application depends on the GHC version used to build it. In fact, it would be better to deprecate the Unicode API in Data.Char and refer users to unicode-data or similar packages.
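
A small way to observe this dependence, as a sketch using only base: print the compiler version next to the general category that Data.Char reports for a recently assigned code point. The choice of U+1FAE8 is an assumption for illustration (it is taken here to be a character added in a recent Unicode version); the point is only that the answer comes from the tables shipped with the compiler, so different GHC versions may print different categories.

```haskell
import Data.Char (generalCategory)
import Data.Version (showVersion)
import System.Info (compilerName, compilerVersion)

main :: IO ()
main = do
  -- Which compiler (and therefore which bundled Unicode tables) built this.
  putStrLn (compilerName ++ " " ++ showVersion compilerVersion)
  -- Assumption: U+1FAE8 is assigned only in recent Unicode versions, so an
  -- older base may report NotAssigned where a newer one reports a symbol.
  print (generalCategory '\x1FAE8')
```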

Well, Data.Char is currently there, and is widely used. As long as it is there and not deprecated, surely changing an FFI implementation to a faster and pure implementation seems desirable – the problems with outdated Unicode standards (which probably not all developers care about) are orthogonal to that. And those developers who do care can of course still use such a dedicated library.

IMHO, we should not restrict the correctness or capabilities of a basic library such as unicode-data because of GHC/base versioning issues. We should instead separate the GHC-dependent and GHC-independent parts of base, and otherwise decouple the versioning there. I know this is more complicated, but I’m sure we can figure it out (for example, the core parts of unicode-data that need to be in base can be, with “faster moving stuff” in a separate package). But even so, let’s clean up GHC internals so we stop crippling ourselves.

@nomeata there is more motivation to use a proper solution when it is both faster and correct than when it is just correct :wink: But anyway, unicode-data is open source, so nothing prevents a motivated individual from putting in the effort to merge it into base, I believe.

Is that an official approval by the CLC? :wink:
Or who’d be the one to tell such a motivated individual that such a change would be welcome?

@Wismill, do you know offhand if your approach can replace all of the following (from include/WCsubst.h)?

HsInt u_iswupper(HsInt wc);
HsInt u_iswdigit(HsInt wc);
HsInt u_iswalpha(HsInt wc);
HsInt u_iswcntrl(HsInt wc);
HsInt u_iswspace(HsInt wc);
HsInt u_iswprint(HsInt wc);
HsInt u_iswlower(HsInt wc);

HsInt u_iswalnum(HsInt wc);

HsInt u_towlower(HsInt wc);
HsInt u_towupper(HsInt wc);
HsInt u_towtitle(HsInt wc);

HsInt u_gencat(HsInt wc);

Or who’d be the one to tell such a motivated individual that such a change would be welcome?

It depends on how and where it is implemented. E.g., you can just rewrite libraries/base/cbits/WCsubst.c (in the GHC repository on GitLab) using lookup tables (similar to unicode-data) instead of binary search. This will give you the ultimate performance, and there is no need for CLC approval, because this is not even Haskell (this C module is a part of base and technically still falls under CLC supervision, but I do not expect a lot of fight over an autogenerated file).
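
For readers unfamiliar with the two strategies, here is a rough sketch of the difference, written in Haskell rather than C and using toy data. Everything in it is illustrative: `ranges`, `inRanges`, `table` and `inTable` are hypothetical names, the data covers only ASCII letters, and `Data.Array` comes from the `array` package that ships with GHC. It is not the code in WCsubst.c or in unicode-data.

```haskell
import Data.Array (Array, accumArray, bounds, elems, listArray, (!))

-- Illustrative data only: inclusive code point ranges of ASCII letters,
-- sorted and non-overlapping.
ranges :: Array Int (Int, Int)
ranges = listArray (0, 1) [(65, 90), (97, 122)]

-- Approach 1: binary search over the range table, O(log n) per query.
inRanges :: Int -> Bool
inRanges cp = go lo0 hi0
  where
    (lo0, hi0) = bounds ranges
    go lo hi
      | lo > hi    = False
      | cp < start = go lo (mid - 1)
      | cp > end   = go (mid + 1) hi
      | otherwise  = True
      where
        mid = (lo + hi) `div` 2
        (start, end) = ranges ! mid

-- Approach 2: a flat table with one entry per code point (0..127 here),
-- so a query is a bounds check plus one array read.
table :: Array Int Bool
table = accumArray (\_ new -> new) False (0, 127)
          [ (cp, True) | (s, e) <- elems ranges, cp <- [s .. e] ]

inTable :: Int -> Bool
inTable cp = cp >= 0 && cp <= 127 && table ! cp

main :: IO ()
main = print (all (\cp -> inRanges cp == inTable cp) [0 .. 127])
```

The table trades memory for speed, which is why the code-size question raised earlier in the thread matters for a table covering all of Unicode.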

Yes, all of these are implemented. In fact, unicode-data is a drop-in replacement for Data.Char except for a couple of Show/Read related functions.

On the GHC wiki page for the javascript backend (on GitLab) I spotted:

What is the plan for handling the various C bits in base/text/bytestring/etc.? Specifically, has there been any motion on discussions with upstreams regarding upstreaming the Javascript implementations?

We plan to propose patches for these libraries. We can use CPP to condition JS specific code to HOST_OS=ghcjs and similarly in .cabal files.

I guess a pure Haskell solution for Data.Char would help with that kind of work.
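
The CPP conditioning mentioned in the quoted plan could look roughly like the sketch below. `ghcjs_HOST_OS` is the macro GHCJS defines for its target; whether the merged backend keeps that name is an assumption here, and `backendName` is a hypothetical definition used only to show the shape of the conditional.

```haskell
{-# LANGUAGE CPP #-}
module Main (main) where

-- Assumption: the JS target defines ghcjs_HOST_OS, as GHCJS does.
-- On any other target the #else branch is compiled instead.
backendName :: String
#if defined(ghcjs_HOST_OS)
backendName = "javascript"
#else
backendName = "native"
#endif

main :: IO ()
main = putStrLn backendName
```

A pure Haskell implementation needs no such branch at all, which is the appeal for backends that cannot link C code.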

I started a discussion about how base can benefit here: https://gitlab.haskell.org/ghc/ghc/-/issues/21375. @wismill, your thoughts would be appreciated there.
