[ANN] unicode-data-0.3.0: APIs to efficiently access the Unicode character database

On behalf of the maintainers team I’m happy to announce unicode-data-0.3.0.

unicode-data provides Haskell APIs to efficiently access the latest Unicode character database.
It is up to 5 times faster than base:Data.Char.

This release features:

  • Support for big-endian architectures.
  • Support for General_Category property.
  • All the predicates and case mapping functions found in base:Data.Char.

It follows the 0.2.0 release.

See the complete ChangeLog for details.

Is there a reason why we can’t make Data.Char just as fast?

We use a different approach: Data.Char relies on FFI (see GHC.Unicode) while unicode-data uses only pure Haskell (bitmaps and simple functions). Maybe @Bodigrim or @harendra could explain this better?

Maybe I’m dense, but that sounds like a “how” rather than a “why”. Put it another way, is there a reason Data.Char can’t be reimplemented to use your approach? A pure-Haskell approach sounds better, anyway. :slight_smile:

Anyway, good work, thanks for the library!

Indeed, pure Haskell and faster sounds like a clear win for everyone (especially people porting Haskell to odd environments). Sounds like folding this into base, or reversing the dependency direction, is worth discussing!

How does code size change?

It is technically possible to fold unicode-data into base, but I would advise against it. The reason is that Unicode is an evolving standard, and it is desirable to be able to update to the latest version without upgrading your compiler. A standalone package makes this possible, but base does not: you are bound to whichever Unicode version was wired in. I suspect not all developers fully realise that the behaviour of a Unicode-aware application depends on the GHC version used to build it. In fact, it would be better to deprecate the Unicode API in Data.Char and refer users to unicode-data or similar packages.

Well, Data.Char is currently there, and is widely used. As long as it is there and not deprecated, surely changing an FFI implementation to a faster and pure implementation seems desirable. The problems with outdated Unicode standards (which probably not all developers care about) are orthogonal to that. And those developers who do care can of course still use such a dedicated library.

IMHO, we should not restrict the correctness or capabilities of a basic library such as unicode-data because of GHC/base versioning issues. We should instead separate the GHC-dependent and GHC-independent parts of base, and otherwise decouple the versioning there. I know this is more complicated, but I’m sure we can figure it out (for example, the core parts of unicode-data that need to be in base can be, with “faster moving stuff” in a separate package). But even so, let’s clean up GHC internals so we stop crippling ourselves.

@nomeata there is more motivation to use a proper solution when it is both faster and correct than when it is just correct :wink: But anyway, unicode-data is open source, so I believe nothing prevents a motivated individual from putting in the effort to merge it into base.

Is that an official approval by the CLC? :wink:
Or who’d be the one to tell such a motivated individual that such a change would be welcome?

@Wismill, do you know offhand whether your approach can replace all of the following (from include/WCsubst.h)?

HsInt u_iswupper(HsInt wc);
HsInt u_iswdigit(HsInt wc);
HsInt u_iswalpha(HsInt wc);
HsInt u_iswcntrl(HsInt wc);
HsInt u_iswspace(HsInt wc);
HsInt u_iswprint(HsInt wc);
HsInt u_iswlower(HsInt wc);

HsInt u_iswalnum(HsInt wc);

HsInt u_towlower(HsInt wc);
HsInt u_towupper(HsInt wc);
HsInt u_towtitle(HsInt wc);

HsInt u_gencat(HsInt wc);

Or who’d be the one to tell such a motivated individual that such a change would be welcome?

It depends on how and where it is implemented. E.g., you can just rewrite libraries/base/cbits/WCsubst.c in the GHC repository using lookup tables (similar to unicode-data) instead of binary search. This will give you the ultimate performance, and there is no need for CLC approval, because this is not even Haskell (this C module is part of base and technically still falls under CLC supervision, but I do not expect much of a fight over an autogenerated file).
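
To make the difference concrete, here is a rough sketch of the two lookup strategies for an "is uppercase" predicate. It is a toy under stated assumptions, not the actual contents of WCsubst.c or of unicode-data: the names is_upper_bsearch, is_upper_bitmap, upper_ranges and upper_bitmap are made up for illustration, and the data covers only uppercase letters below U+0100. A real generated table would cover all of Unicode and would probably be split or compressed in some way.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy data (assumption): uppercase letters below U+0100 only. */
struct range { uint32_t lo, hi; };

static const struct range upper_ranges[] = {
    {0x41, 0x5A},  /* U+0041..U+005A, A to Z */
    {0xC0, 0xD6},  /* U+00C0..U+00D6 */
    {0xD8, 0xDE},  /* U+00D8..U+00DE (U+00D7 is the multiplication sign) */
};

/* Approach 1: binary search over sorted, non-overlapping ranges.
   Cost grows with log2(number of ranges) and involves branching. */
int is_upper_bsearch(uint32_t cp) {
    size_t lo = 0;
    size_t hi = sizeof upper_ranges / sizeof upper_ranges[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < upper_ranges[mid].lo) {
            hi = mid;
        } else if (cp > upper_ranges[mid].hi) {
            lo = mid + 1;
        } else {
            return 1;
        }
    }
    return 0;
}

/* Approach 2: bitmap with one bit per code point (256 bits = 32 bytes).
   Bit (cp & 7) of byte (cp >> 3) is set when cp is uppercase. */
static const uint8_t upper_bitmap[32] = {
    [8]  = 0xFE, [9]  = 0xFF, [10] = 0xFF, [11] = 0x07,  /* U+0041..U+005A */
    [24] = 0xFF, [25] = 0xFF, [26] = 0x7F,               /* U+00C0..U+00D6 */
    [27] = 0x7F,                                         /* U+00D8..U+00DE */
};

/* Constant-time lookup: one bounds check, one load, one shift and mask. */
int is_upper_bitmap(uint32_t cp) {
    if (cp >= 0x100) {
        return 0;  /* outside the toy table */
    }
    return (upper_bitmap[cp >> 3] >> (cp & 7)) & 1;
}
```

Both functions give the same answers on this toy range. The bitmap trades a little static data for a lookup whose cost does not depend on how many ranges the property has.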

Yes, all of these are implemented. In fact, unicode-data is a drop-in replacement for Data.Char except for a couple of Show/Read related functions.