I confess I have not - yet! - read all of this discussion, but on Discourse there is no shame in opening a new topic if the discussion splinters into multiple sub-threads. It helps keep everything tidy.
This sort of gotcha is why "moving `Text` into `base`" is a huge task, not only for those adding to `base` but also for library maintainers who want their code to continue running on this new `base` version. It would be better to avoid touching `Read` and `Show` entirely, and to provide a new `Pretty` class instead. That can be done in many ways, which is why it's a problem better solved in library-land: just depend on the library that has the `Pretty` you want in your own code! There's no guarantee that every `base` user will want the same one!
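For illustration, a minimal sketch of what such a class might look like (the method name `pretty` and the instances here are my own assumptions, not an existing API):

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- A hypothetical Pretty class; one of the "many ways" mentioned above.
class Pretty a where
  pretty :: a -> Text

-- Illustrative instances: fall back to show for Int, identity for Text.
instance Pretty Int where
  pretty = T.pack . show

instance Pretty Text where
  pretty = id
```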
Is there a library that provides "natural-looking in code" conversions between string types? For example, I'd love it if `base` exposed `toString` / `toByteString` / `toText` functions, instead of always having to import the `pack` / `unpack` functions from multiple separate modules. That's what would be most helpful for me.
Also, `toByteString` should produce either a strict or a lazy result based on what is expected in context, and the same goes for `toText`. Defaulting to UTF-8 for conversions (where applicable) would also match what I expect nowadays. And if you want UTF-16 or some other encoding, reusing the `default` keyword for strings at the module level would make sense.
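A minimal sketch of how that strict-or-lazy-by-context behaviour could be encoded (the `ToText` class here is hypothetical, not an existing API):

```haskell
import qualified Data.Text as TS
import qualified Data.Text.Lazy as TL

-- Hypothetical class: the result type is chosen by the calling context.
class ToText a where
  toText :: String -> a

instance ToText TS.Text where
  toText = TS.pack        -- context expects strict Text

instance ToText TL.Text where
  toText = TL.pack        -- context expects lazy Text
```

With this in scope, `toText s :: TL.Text` selects the lazy instance from the type annotation alone.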
`encode-string` plus a couple of one-line helper functions works for me:
```haskell
import Data.Text (Text)
import Text.Show.Pretty (ppShow)
import Data.String.Encode

-- alias for the universal string converter (convertString) provided by encode-string
toS :: (ConvertString a b) => a -> b
toS = convertString

-- equivalent of show for Text, but always pretty-printed
txt :: (Show a) => a -> Text
txt = toS . ppShow

-- plain (unpretty) show; hardly ever used
txtShow :: (Show a) => a -> Text
txtShow = toS . show
```
Now I default to `Text` and throw in a `toS` whenever the compiler complains, e.g. String → Text above, or Text → String as below:
```haskell
import Control.Monad.IO.Class (MonadIO)
import Data.Text (Text)
import Path (Abs, Dir, File, Path)
import qualified Path.IO as D  -- assuming D is Path.IO

resolveFile :: (MonadIO m) => Path Abs Dir -> Text -> m (Path Abs File)
resolveFile b = D.resolveFile b . toS
```
That would be a matter for implementors, not new Haskell arrivals:
Unfortunately, renaming `Data.Text` to e.g. `Data.Text.Strict` (so that `Data.Text.Lazy` can become the default) would now probably break too much existing code. It's another reason to investigate the possibility of merging the two `Text` modules, using lazy lists of strict `Text`.
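For context, lazy `Text` already works this way internally; a simplified rendering of its representation (field strictness kept, UNPACK pragma omitted):

```haskell
import qualified Data.Text as TS

-- Simplified form of Data.Text.Internal.Lazy's representation:
-- a cons-list of strict chunks, with a lazy tail.
data LazyText
  = Empty
  | Chunk !TS.Text LazyText
```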
Nice. If there's a similar way to replace the use of `Read`, a less-ambitious option (than eradicating `String` immediately) would be to simply mark `String` as deprecated, with a view to possibly moving all "stringy" types into libraries (yes, even one for `type String = [Char]`), especially considering just how many of them now exist. But (again) it would also be nice to have only one `Text` type to nominate as the successor to `String`…
As far as I remember, they cannot.
That is possible (in theory). But `String` is inherently a low-level type, and I don't think most Haskellers really have a good intuition about what Unicode code points are (which are not Unicode scalar values, which in turn are not grapheme clusters). And I have not yet seen someone propose a migration strategy that won't make it into the history books as a giant failure.
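A two-line demonstration of why that intuition is hard, using a combining accent (assumes a terminal that renders combining characters):

```haskell
-- 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points,
-- one grapheme cluster.
main :: IO ()
main = do
  let s = "e\x0301"
  print (length s)   -- 2: String counts Chars, i.e. code points
  putStrLn s         -- renders as a single "é"
```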
I'm OK with this - having a world where strict `Text` is the default, so that saying "`Text` is the default string type; it is finite and strict" is a true statement. I think the only thing we actually lose there is pure infinite strings like this one, which would be perfectly fine as just `[Text]` or lazy text.
The change I'd make to `base` here, so that this example doesn't blow up, is to make `[1..]` an explicit infinite-range type somehow (however large a bike-shed that is to paint); `T.unlines` should simply not be possible on it.
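One hypothetical shape for such an explicit infinite-range type (entirely a sketch, not a proposal in `base`): with no nil constructor, a terminating consumer like `T.unlines` simply cannot be written for it.

```haskell
-- Hypothetical infinite-only stream: no Nil constructor, so no total
-- function can consume the whole thing (e.g. an unlines analogue).
data Stream a = Cons a (Stream a)

enumFromStream :: Enum a => a -> Stream a
enumFromStream x = Cons x (enumFromStream (succ x))

takeStream :: Int -> Stream a -> [a]
takeStream n (Cons x xs)
  | n <= 0    = []
  | otherwise = x : takeStream (n - 1) xs
```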
As for the other place where string laziness is used, lazy IO, currently supported by `[Char]` or lazy text: I think it is a tangible benefit to lose that. Not-so-hot take: it's always better to use an actual stream type. Of course, we could also keep `[Char]` and just drop the `unsafeInterleaveIO`s from the relevant functions, but while we're talking about the pros and cons of switching to a strict text type as the default, I think this would be one of them.
Alright, supposing that the intuitions of most Haskellers are lacking in that way…perhaps a newer programming language can help to provide the much-needed insight:
…that seems simple enough:
Hrm - does that mean the intuitions of most Rust users are also lacking in that exact same way? If so, can anyone nominate a programming language that does support Unicode "properly"? A migration strategy for Haskell that won't make it into the history books as a giant failure could then be based on that programming language…
Cold take: Browse and search packages | Hackage [for `stream`] - which one should a new Haskeller use?
As I see it, the only way a streaming type supersedes lazy lists (of characters or other values) in education is for it to be simple, standard Haskell. Educators can then rewrite all their course notes with confidence, knowing that it will be supported indefinitely.
(I leave it as an exercise to determine the chances of a simple, standard streaming type actually appearing…ever.)
Perhaps no streaming library at all - many instructional programs using `interact` etc. (which need lazy IO to produce immediate results, not blow up memory, and so on) look like `forever (getLine >>= _ >>= mapM_ putStrLn)`. But `streaming` is a pretty good default.
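For concreteness, the explicit-loop shape mentioned above, with a hypothetical `processLine` filled in for the hole:

```haskell
import Control.Monad (forever)

-- Line-by-line loop: immediate results, constant memory, no lazy IO.
main :: IO ()
main = forever (getLine >>= mapM_ putStrLn . processLine)
  where
    processLine :: String -> [String]
    processLine l = [l, reverse l]   -- hypothetical per-line processing
```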
Well, if you open the book, it has a pretty good section on it: Storing UTF-8 Encoded Text with Strings - The Rust Programming Language
If we look at the Hindi word "नमस्ते" written in the Devanagari script, it is stored as a vector of `u8` values that looks like this:

`[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]`

That's 18 bytes and is how computers ultimately store this data. If we look at them as Unicode scalar values, which are what Rust's `char` type is, those bytes look like this:

`['न', 'म', 'स', '्', 'त', 'े']`

There are six `char` values here, but the fourth and sixth are not letters: they're diacritics that don't make sense on their own. Finally, if we look at them as grapheme clusters, we'd get what a person would call the four letters that make up the Hindi word:

`["न", "म", "स्", "ते"]`
So you want a way to pattern match on all 3 variants, imo. And a way to write literals for all 3 variants that are checked at compile time (not at runtime, or via other trash like `OverloadedStrings`). Sorry for the language.
And then you still have to educate users about it.
So it actually does not get easier. It just gets more correct.
I had a conversation a while ago about AFPP/OsString and got asked what people should do who don’t care about the details. I actually wasn’t sure what to answer. The entire point of it is to force you to care.
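To make the "pattern match on all 3 variants" idea concrete, a hypothetical sketch (none of these types exist today; the names are made up):

```haskell
import Data.Word (Word8)

-- Three hypothetical views of the same string, each with its own
-- pattern-matchable structure.
newtype Bytes     = Bytes     [Word8]   -- UTF-8 code units
newtype Scalars   = Scalars   [Char]    -- Unicode scalar values
newtype Graphemes = Graphemes [String]  -- grapheme clusters

firstGrapheme :: Graphemes -> Maybe String
firstGrapheme (Graphemes (g : _)) = Just g
firstGrapheme (Graphemes [])      = Nothing
```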
- I did open that book (to get those quotes);
- Your quote still doesn't justify the claim that "most Haskellers don't really have a good intuition about what Unicode code points are", as if it were only a Haskell problem - it seems to be a problem for users of Rust too (otherwise they surely would have implemented Unicode "properly").
So again:
I got confused by the fact that simdutf8 needs extra code to work across chunk boundaries. And now that I know that, this is an extremely annoying restriction for parsing, as merely knowing that a certain stretch of input is UTF-8 is not enough to transform that input into `Text`. I'll now need a whole extra rechunking function just to conform to that expectation.
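A sketch of what such a rechunking function has to do (my own code, written under the assumption that the validator cannot span chunk boundaries): trim any incomplete trailing UTF-8 sequence off a chunk and carry it over to the next one.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B
import Data.Word (Word8)

-- Number of bytes at the end of the chunk that form the start of an
-- incomplete multi-byte UTF-8 sequence (0 if the chunk ends cleanly).
incompleteTail :: B.ByteString -> Int
incompleteTail bs = go 1
  where
    go n
      | n > 3 || n > B.length bs = 0         -- >3 trailing continuations: invalid anyway
      | isContinuation w         = go (n + 1)
      | sequenceLength w > n     = n          -- lead byte of an unfinished sequence
      | otherwise                = 0          -- sequence is complete (or invalid)
      where
        w = B.index bs (B.length bs - n)

isContinuation :: Word8 -> Bool
isContinuation w = w .&. 0xC0 == 0x80

sequenceLength :: Word8 -> Int
sequenceLength w
  | w .&. 0x80 == 0x00 = 1
  | w .&. 0xE0 == 0xC0 = 2
  | w .&. 0xF0 == 0xE0 = 3
  | w .&. 0xF8 == 0xF0 = 4
  | otherwise          = 1

-- Split a chunk into (validatable prefix, leftover for the next chunk).
splitComplete :: B.ByteString -> (B.ByteString, B.ByteString)
splitComplete bs = B.splitAt (B.length bs - incompleteTail bs) bs
```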
You're nitpicking and reading into my post.
My points are:
- Unicode code points are very non-intuitive (Rust doesn't have a type for them: good)
- improving the situation will likely make things more unergonomic (at least in part), because strings are hard (compare with Rust not allowing you to index strings, or with OsString throwing stones at you if you want to convert to String/Text)
- people want better ergonomics (zomg, too many string types), not to write more correct code

That's why most of these string rants kinda miss the point, imo.
Some time ago this was written, presumably in passing joviality:
…but if things really are (already?) so bad:
-   ```haskell
    data String a ...

    unitString :: a -> String a
    unitString x = ...

    bindString :: String a -> (a -> String b) -> String b
    bindString str k = ...
    ⋮
    ```

-   ```haskell
    data Byte ...

    type ByteString = String Byte
    ⋮
    ```

-   ```haskell
    data OS ...

    type OSString = String OS
    ⋮
    ```

-   ```haskell
    data Char ...

    type Text = String Char
    ⋮
    ```

-   ```haskell
    data Bin ...

    type BinString = String Bin
    ⋮
    ```
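Purely for illustration, one hypothetical way to make the shared `String a` skeleton above concrete (a list representation, just so the pieces type-check; all names are made up):

```haskell
newtype String' a = String' [a]

unitString :: a -> String' a
unitString x = String' [x]

bindString :: String' a -> (a -> String' b) -> String' b
bindString (String' xs) k =
  String' (concatMap (\x -> case k x of String' ys -> ys) xs)
```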
Or can some simplicity be salvaged from the morass of “stringy” types?
Before dreaming of the eradication of `String`, someone would first have to accomplish the much more mundane task of putting the `Text` datatype into `base`. Yet I don't see a queue of volunteers.
Mechanically, that doesn't sound difficult. Out of idle curiosity, what additional work would one be volunteering to do beyond submitting an MR that moves things from one library to the other and puts in aliases/deprecation warnings?
It would be easier to comment on the additional work if someone first described exactly how they would approach the task. Putting `ByteArray` from `primitive` into `base` is the closest example, and it can serve as a model.
If that were to happen, the amount of breakage would be monumental; hence, it's not going to.