We use a different approach: Data.Char relies on FFI (see GHC.Unicode) while unicode-data uses only pure Haskell (bitmaps and simple functions). Maybe @Bodigrim or @harendra could explain this better?
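For a flavour of the idea, here is a minimal sketch (invented names, not unicode-data's actual internals; the real tables are generated from the Unicode Character Database and stored for constant-time access, e.g. as raw `Addr#` literals):

```haskell
-- One bit per code point: bit n is set iff the property holds for
-- code point n. This toy bitmap covers 0..127 and encodes the ASCII
-- uppercase letters (code points 65..90).
import Data.Bits (testBit, shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

upperBitmap :: [Word8]
upperBitmap =
  [ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
  , 0xFE, 0xFF, 0xFF, 0x07, 0x00, 0x00, 0x00, 0x00 ]

-- Pure bit arithmetic, no FFI: the byte index is n `div` 8 and the
-- bit within the byte is n `mod` 8.
isUpperAscii :: Char -> Bool
isUpperAscii c =
  let n = ord c
  in n < 128 && testBit (upperBitmap !! (n `shiftR` 3)) (n .&. 7)
```

The real tables are of course much larger and multi-level, but the queries stay pure Haskell all the way down.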
Maybe I’m dense, but that sounds like a “how” rather than a “why”. To put it another way, is there a reason Data.Char can’t be reimplemented to use your approach? A pure-Haskell approach sounds better, anyway.
Indeed, pure Haskell and faster sounds like a clear win for everyone (especially people porting Haskell to unusual environments). It sounds like folding this into base, or reversing the dependency direction, is worth discussing!
It is technically possible to fold unicode-data into base, but I would advise against it. The reason is that Unicode is an evolving standard, and it is desirable to be able to update to the latest version without upgrading your compiler. A standalone package provides that possibility, but base does not - you are bound to whichever Unicode version was wired in. I suspect not all developers fully realise that the behaviour of a Unicode-aware application depends on the GHC used to build it. In fact, it would be better to deprecate the Unicode API in Data.Char and refer users to unicode-data or similar packages.
Well, Data.Char is currently there, and is widely used. As long as it is there and not deprecated, surely replacing the FFI implementation with a faster, pure one seems desirable – the problems with outdated Unicode versions (which probably not all developers care about) are orthogonal to that. And those developers who do care can of course still use such a dedicated library.
IMHO, we should not restrict the correctness or capabilities of a basic library such as unicode-data because of GHC/base versioning issues. We should instead separate the GHC-dependent and GHC-independent parts of base, and otherwise decouple the versioning there. I know this is more complicated, but I’m sure we can figure it out (for example, the core parts of unicode-data that need to be in base can live there, with the “faster moving stuff” in a separate package). But even so, let’s clean up GHC internals so we stop crippling ourselves.
@nomeata there is more motivation to use a proper solution when it is both faster and correct than when it is merely correct. But anyway, unicode-data is open source, so nothing prevents a motivated individual from putting in the effort and merging it into base, I believe.
Or who’d be the one to tell such a motivated individual that such a change would be welcome?
It depends on how and where it is implemented. E.g., you can just rewrite the C module linked here (Files · master · Glasgow Haskell Compiler / GHC · GitLab) using lookup tables (similar to unicode-data) instead of binary search. This will give you the ultimate performance, and no need for CLC approval, because this is not even Haskell (this C module is a part of base and technically still falls under CLC supervision, but I do not expect much of a fight over an autogenerated file).
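For context, the C module in question essentially performs a binary search over sorted code-point ranges. A rough Haskell rendering, with a toy table and invented names, just to illustrate the access pattern:

```haskell
-- Toy excerpt of a range table (the real one is autogenerated from
-- the Unicode Character Database): inclusive code-point ranges, each
-- carrying a category.
ranges :: [(Int, Int, String)]
ranges = [(0x30, 0x39, "Nd"), (0x41, 0x5A, "Lu"), (0x61, 0x7A, "Ll")]

-- Binary search over sorted, non-overlapping ranges: O(log n) probes
-- per query, versus a constant-time index into a lookup table.
lookupCategory :: Int -> Maybe String
lookupCategory n = go 0 (length ranges - 1)
  where
    go lo hi
      | lo > hi   = Nothing
      | n < l     = go lo (mid - 1)
      | n > u     = go (mid + 1) hi
      | otherwise = Just cat
      where
        mid = (lo + hi) `div` 2
        (l, u, cat) = ranges !! mid
```

A generated lookup table (the unicode-data approach) replaces the logarithmic number of probes with a single constant-time index, at the cost of a somewhat larger table.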
What is the plan for handling the various C bits in base/text/bytestring/etc.? Specifically, has there been any movement on discussions with upstream maintainers about upstreaming the JavaScript implementations?
We plan to propose patches for these libraries. We can use CPP to make the JS-specific code conditional on HOST_OS=ghcjs, and add similar conditionals in the .cabal files.
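For illustration, such a guard could look like this (a minimal sketch assuming the `ghcjs_HOST_OS` macro that Cabal derives from the host OS; the module and names are hypothetical):

```haskell
{-# LANGUAGE CPP #-}
module Backend (backendName) where

-- Cabal defines an <os>_HOST_OS macro for the host, so a GHCJS host
-- sees ghcjs_HOST_OS and the JS path is selected at compile time.
backendName :: String
#if defined(ghcjs_HOST_OS)
backendName = "javascript"  -- JS-specific implementation goes here
#else
backendName = "native"      -- default (C/Haskell) implementation
#endif
```

On the .cabal side, the analogous conditional would be something like `if os(ghcjs)` in the relevant stanzas, I believe.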
I guess a pure Haskell solution for Data.Char would help with that kind of work.