That is in fact exactly the problem. A String is a list of Char, but a string is not just a list of characters. The Char type in Haskell represents code points, not characters. Unicode itself is careful with the word “character” and prefers more precise terms such as “code point”, “grapheme cluster” and “glyph”, and a “character” (whichever Unicode-defined meaning you pick) does not necessarily correspond to a single code point.
So, with String being a list of code points, any element-wise transform can only ever transform code points on a 1:1 basis. That is not necessarily correct, because there are Unicode “characters” that are composed of multiple code points. This includes things like flags and emoji, but also letters with combining diacritics and many other cases. And it is not at all impossible for a locale to define case conversions that turn a single-code-point entity into a multi-code-point one, or vice versa.
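To make the code point vs. “character” distinction concrete, here is a small sketch using only Data.Char from base. The names decomposed, precomposed and codePoints are made up for this illustration.

```haskell
import Data.Char (ord, toUpper)

-- "é" written as a base letter followed by U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "e\x0301"

-- The same visible letter as the single precomposed code point U+00E9.
precomposed :: String
precomposed = "\x00E9"

-- Numeric code points of a String, one per Char.
codePoints :: String -> [Int]
codePoints = map ord

-- An element-wise transform sees the letter and the accent as two
-- unrelated elements and handles each one on its own.
upperDecomposed :: String
upperDecomposed = map toUpper decomposed
```

Here length decomposed is 2 while length precomposed is 1, even though both render as one letter, and upperDecomposed is just 'E' followed by the untouched combining accent.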
The example of Turkish ı and İ is actually not a problem with String, because all four of i, ı, İ and I fit into a single code point each. Still, manipulating the individual parts of a multi-code-point entity separately is wrong.
An example, off the top of my head, would be languages that only write diacritics on lowercase letters, but not uppercase; in such a language, uppercasing the letter á when written as the letter “a” plus combining diacritic accent will turn it into plain “A”, going from two code points (letter + combining diacritic) to one (just the letter).
Another example would be the treatment of “ß” in German: by convention, there is no uppercase form of “ß” (although one has been proposed and adopted by some typographers and publishers); instead, the traditional way of dealing with “ß” in all-caps is to replace it with “SS” (e.g. “Maß” → “MASS”). And of course this means that the single-code-point “ß” is turned into a two-code-point string, “SS”.
Neither of these can be modeled as a function of type Char -> Char, which is what toLower and toUpper are; they can, however, be modeled as a whole-string function such as strToLower :: [Char] -> [Char] (or Text -> Text).
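A minimal sketch of the whole-string shape, applied to the “ß” case. The names upperExpand and strToUpper are hypothetical (an upper-casing counterpart to the strToLower signature mentioned above), and the single special case is hard-coded purely for illustration; it is not a complete or locale-correct table.

```haskell
import Data.Char (toUpper)

-- Hypothetical helper: upper-case one code point, but return a String
-- so that the result is allowed to be longer than the input.
upperExpand :: Char -> String
upperExpand '\xDF' = "SS"          -- U+00DF is "ß"
upperExpand c      = [toUpper c]

-- Hypothetical whole-string upper-casing built from the helper above.
strToUpper :: String -> String
strToUpper = concatMap upperExpand
```

With this, strToUpper "Maß" gives "MASS", whereas map toUpper "Maß" leaves the “ß” as it is. Note that this sketch still looks at one code point at a time; rules that depend on neighbouring code points, such as dropping a combining accent after an uppercased letter, would need a function that inspects more of the string, which the String -> String type permits and Char -> Char does not.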
In any case, toLower in Haskell is pretty primitive and shouldn’t be expected to do correct case conversions for every culture / language / locale; for typical use cases, we can squint and call it “good enough”. If you need something better, you will have to look for locale-aware string manipulation APIs, since you cannot possibly get things like ß or İ right without telling the upper/lowercasing functions which language and culture to use (and just assuming that the current locale as per the environment is the correct one isn’t necessarily right either).
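As a rough sketch of what “telling the function which language to use” could look like, here is a toy version with an explicit language argument. The Lang type, the upperIn function and the two hard-coded rules are all invented for this example; a real program would rely on proper locale data (for instance via an ICU binding) rather than a hand-written table.

```haskell
import Data.Char (toUpper)

-- Hypothetical, deliberately tiny stand-in for a real locale type.
data Lang = Turkish | German | Other
  deriving (Eq, Show)

-- Hypothetical upper-casing that takes the language explicitly
-- instead of guessing it from the environment.
upperIn :: Lang -> String -> String
upperIn lang = concatMap step
  where
    step 'i'    | lang == Turkish = "\x0130"  -- dotted capital İ
    step '\xDF' | lang == German  = "SS"      -- "ß" becomes "SS"
    step c      = [toUpper c]                 -- everything else: simple mapping
```

With these toy rules, upperIn Turkish "izmir" produces “İZMİR” while upperIn Other "izmir" produces “IZMIR”; the same input gives different, equally valid results depending on the language the caller asks for.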