When does the "Not pedantically Unicode-correct case" occur when using String?

chansey97 · March 10, 2024, 8:48pm

I was just reading the Haskell wiki, which lists these cons of String:

Cons:

massive overhead, up to 4 words per character, which also has speed implications

not pedantically Unicode-correct in some cases (e.g. there are strings which change length when changing case, so map toLower is not accurate in that case)

A String is just a list of Char, so why would map toLower cause the length to change? I can’t imagine such a case.

Can anyone give me an example?

Thanks.

Bodigrim · March 10, 2024, 9:13pm

Capital I with dot above, İ (U+0130), is an example of a character that becomes two code points in lower case: i (U+0069) followed by combining dot above (U+0307).
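A minimal sketch of the difference, assuming the text package is installed alongside base. The names dottedI, viaString and viaText are made up for illustration, and the character is written as an escape to sidestep source-encoding issues:

```haskell
import Data.Char (toLower)
import qualified Data.Text as T

-- U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE, written as an escape.
dottedI :: String
dottedI = "\x130"

-- Per-code-point lowering: map can never change the length of the list.
viaString :: String
viaString = map toLower dottedI

-- Data.Text.toLower works on the whole text and may return a longer result.
viaText :: String
viaText = T.unpack (T.toLower (T.pack dottedI))
```

In GHCi, length viaString is 1, while viaText is "i\775", which has length 2.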

tdammers · March 11, 2024, 10:50am

That is in fact exactly the problem. A String is a list of Char, but a string is not just a list of characters. The Char type in Haskell represents code points, not characters (and note that Unicode carefully avoids using the word “character” entirely, they instead talk about “code points”, “glyphs”, etc.), and different “characters” (whichever Unicode-defined meaning you pick) do not necessarily correspond to single code points.

So, with String being a list of code points, any element-wise transform can only ever transform code points on a 1:1 basis, but this is not necessarily correct, because there are Unicode “characters” that are composed of multiple code points. This includes things like flags and emoji, but also diacritics and many other use cases. And it is not at all impossible for a locale to define case conversions that turn a single-code-point entity into a multi-code-point one, or vice versa.
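To make the code-point view concrete, here is a small sketch using only base. The names precomposed, decomposed, flag and upperDecomposed are hypothetical, chosen for illustration:

```haskell
import Data.Char (toUpper)

-- "á" as a single precomposed code point (U+00E1).
precomposed :: String
precomposed = "\xE1"

-- "á" as the letter 'a' followed by COMBINING ACUTE ACCENT (U+0301).
decomposed :: String
decomposed = "a\x301"

-- A flag is a pair of regional indicator symbols (here D and E).
flag :: String
flag = "\x1F1E9\x1F1EA"

-- map toUpper touches each code point on its own; the accent is kept as-is.
upperDecomposed :: String
upperDecomposed = map toUpper decomposed
```

Both spellings render as one “character”, yet length precomposed is 1 and length decomposed is 2, and the two Strings are not equal under (==). The flag likewise has length 2.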

The example of Turkish ı and İ is actually not a problem with String, because all four of i, ı, İ and I fit into a single code point each, but still - manipulating individual parts of a multi-code-point entity separately is wrong.

An example, off the top of my head, would be languages that only write diacritics on lowercase letters, but not uppercase; in such a language, uppercasing the letter á when written as the letter “a” plus combining diacritic accent will turn it into plain “A”, going from two code points (letter + combining diacritic) to one (just the letter).

Another example would be the treatment of “ß” in German - by convention, there is no uppercase form of “ß” (although one has been proposed and adopted by some typographers and publishers); instead, the traditional way of dealing with “ß” in all-caps is to replace it with “SS” (e.g. “Maß” → “MASS”). And of course this means that the single-code-point “ß” is turned into a two-code-point string, “SS”.

Neither of these can be modeled as a Char -> Char function such as toLower or toUpper; they can, however, be modeled as a string-level function such as strToLower :: [Char] -> [Char] (or Text -> Text).
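As a rough sketch of the string-level shape, here is the uppercase direction for the “ß” example. The names upperChar and strToUpper are hypothetical, only “ß” is special-cased, and nothing here is locale-aware:

```haskell
import Data.Char (toUpper)

-- Map one code point to zero or more code points.
-- Only U+00DF (ß) is special-cased; everything else falls back to toUpper.
upperChar :: Char -> String
upperChar '\xDF' = "SS"
upperChar c      = [toUpper c]

-- Because each code point may expand, the result can be longer than the input.
strToUpper :: String -> String
strToUpper = concatMap upperChar
```

With this, strToUpper "Ma\xDF" evaluates to "MASS", going from three code points to four. A real implementation would need the full Unicode special-casing tables and, for cases like Turkish, knowledge of the language.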

In any case, toLower in Haskell is pretty primitive and shouldn’t be expected to do correct case conversions regardless of culture / language / locale; for typical use cases, we can squint and call it “good enough”. If you need something better, you will have to look for locale-aware string manipulation APIs, since you cannot possibly get things like ß or İ right without telling the upper/lowercasing functions which language and culture you want them to use (and just assuming that the current locale as per the environment is the correct one to use isn’t necessarily right either).
