The Quest to Completely Eradicate `String` Awkwardness

flamewanes · August 6, 2024, 12:53pm

Originally titled "The Quest to Completely Eradicate String"

I have been recently reminded of this thread: Informal discussion about the progression of `base`:

These are some of the most awkward parts when it comes to handling strings in Haskell right now:
Show, String vs Text, OverloadedStrings, and string interpolation

And it is kind of something that I want to rehash right now for no particular reason. It feels like these days, we’re making a lot of changes to how Haskell works by default and violating some past expectations, so hopefully people are more agreeable when it comes to some things I mention here. A lot of this has been spoken of over and over but simply never implemented… yet!

My feelings about the issue with strings being something that people have been working around for a little bit now, and I feel like all the solutions people come up with have been bandages that work around the main issue at best.

The whole way the Haskell ecosystem deals with strings feels like one of the most awkward and flimsy of any language I’ve dealt with which is really annoying.
One of the first issues people deal with are whether to use String, or Text. And it is always better to use Text, but you’d usually see people avoid using it because it isn’t in base, and also because it is just more convenient to work with Strings. Especially without OverloadedStrings… and even those are not a panacea, either, as they’re known to introduce some issues when it comes to inference in particular. Unfortunately, educational material mostly uses String. I don’t know of many well known projects that use String instead of Text, personally, but I know that there is a fracture in terms of which to use, because String is more convenient for being in base, despite Text being the better solution (though more awkward because some thing are built around Strings, and because of OverloadedStrings)

That covers the last 3 things in the title, but what about Show? It’s kind of well known that Show is an annoying typeclass in the community, I believe. It doesn’t really match well between how you’d have other usages of similar functions like for other languages, like Rust’s Display and Debug. That is because it’s intended more of a counterpart to Read… And the issue of Show revolving around String, rather than using the better option of Text.

There’s a few different workarounds for Show, like text-show. And for example, pretty-simple for pretty-printing. But the need to rely on these libraries for what I would say is just behavior that’s simple enough to build into base (or core libraries) by default is really irritating, and it causes people to reach for the more ad-hoc, less reliable solutions, which further cause a fracture in terms of usage.

In ancient times (2 years ago), there has been some great discussion about merging Data.Text into base, which I totally agree with, and while it would probably help a lot with the String/Text/OverloadedStrings issue… the mechanics of Show would need to be re-discussed IMO, and if we should remove it, change it’s behavior, or add entirely different typeclasses like Display and Pretty for user-facing output and pretty printing respectively. That being said, OverloadedStrings wouldn’t be required anymore for Text, but it could be useful for bytestrings, and it feels like from there we’d kind of fall into the same issues we get today, although less so, since I presume there would be less inference issues.

[NOTE: The post effectively ends there. The next part kind of involves some incoherent ramblings that I’ll have to edit later… If you want to ignore this, you can skip until the next note you see]
Recently, there has been a little bit more discussion on the delightful proposal that intends to add convenient string interpolation into Haskell, and in such discussion I feel there have been some issues with the design of how they exactly want to present it to get it working in a generic way for all the different forms of strings. Personally, I think a part of this stems from the issue that comes from Show and how it’s simply not doing what people need it to do.

Nevertheless, there’s also the issue of whether to distinguish a normal string literal from an interpolated one, and if we need to distinguish an interpolated one… I think we have some nice things to add, but I’m not sure.

This is (well, barely, really), a bit of an odd solution to how we would do things without OverloadedStrings. I don’t particularly have a good syntactical suggestion for this, I would probably have to ask people to recommend them, but distinguishing string literals allows us to not only distinguish an interpolated string from an interpolated one, it would allow us to distinguish string literals that represent different types. so for example, you could do b"hello world" to represent a Strict ByteString. Of course, that’s just syntax I came up with on the fly, but I think that it’ll help make errors a little easier to understand than OverloadedStrings. Specifically, I’m imagining this in a world where String was removed from base entirely and replaced with Text, in which case the default string literal would simply evaluate to Text. I know that one thing that should be thought about is whether to provide possibilities for all the different forms of bytestrings, like short bytestrings, lazy, strict, etc. I suppose at that point it gets a little too much, and the idea starts to sound really stupid. Uh, I guess you can ignore it, if so…

Nevertheless, the point I brought this up isn’t exactly to theorize about some badly thought out form of string literal, but to show that I think it’s an example where if Show was a little bit better, and if people were to do away with String entirely, it could have been made a little easier. Or maybe not, now that I think about it. So perhaps just ignore me mentioning that PR at all other than saying that it’s delightful… it would definitely be a very nice QoL addition to Haskell…

While the main topic of this post is about strings and how we handle them, I also feel like some part of the issue has to do with the fact that base is kind of in lockstep with the compiler, there has been some discussion about separating it from the compiler, but I’m not quite sure how helpful that is. It feels like another part of the issue has to do with the fact that typeclasses are kind of difficult to deal with if you change them across versions… it turns out, some module shenanigans (via cabal?) are a possible solution, but I have never seen them touted or recommended…
[NOTE: The rant ends here!]

Sorry for this being such a messy, stream of consciousness type of post… I think to help make it clear what my intentions are, I think the best way forward for the Haskell ecosystem is to erase String everywhere, not just in base, and just start using Text instead! It’s probably one of the most difficult things to do in Haskell right now, but I think it’s necessary, it feels like it has been for a while… it seems to me like delaying it will only make the issue worse in the future.

sand-witch · August 6, 2024, 1:45pm

To be more accurate, it’s not only for different forms of strings, here is an example of an SQL interpolation authored by @brandonchinn178:

let val = "true; drop table tab;"
s"SELECT * FROM tab WHERE col = ${val}"

-- should result in:
SqlQuery
    "SELECT * FROM tab WHERE col = ?"
    [SqlString "true; drop table tab;"]

flamewanes · August 6, 2024, 2:18pm

I see! Thank you so much for noting that!

BurningWitness · August 6, 2024, 3:35pm

People avoid a lot of things and good programming is one of them. Every single string type in the boot libraries has its own domain of usefulness, and to simply say “everything has to be Text” is to make the language less explicit at the cost of safety.

If the language were perfect, you’d still have to deal with string conversions, it would just be far more organized and straightforward.

Show is good enough for dumping debug traces, for everything highbrow there’s Builder. Serialization typeclasses may look convenient at the first glance, having to bend over backwards to “override” an instance is anything but.

In my view the fact that I can’t execute code at compilation time is far more annoying, as it is impossible to validate (can’t prove string literals are well-formed) and precompute (can’t collapse a bunch of tree insertions into a tree) expressions, leading to code that is both less secure and slower respectively. I’m not expecting any advances on it in the coming decades, since “what should be allowed at compilation time” is a antediluvian quagmire of a discussion, but it does feel like something a perfect iteration of Haskell would have.

Yes, and it will take about a decade for that to get fixed, but consider that base also doesn’t have all that many String-based things in it anyway. There’s Show, printing, environment queries, ancient FFI, … TypeLits?.. that’s it. I don’t see a big need to eradicate String, a road to slowly phase it out is more than enough for the job and it’s probably already in the hopper as of right now. It’ll just take years, as everything does here.

brandonchinn178 · August 6, 2024, 5:41pm

Related to validating literals at compile time:

github.com/ghc-proposals/ghc-proposals

Break out fromInteger/fromRational from Num/Fractional?

opened 07:37PM - 12 Oct 21 UTC

brandonchinn178

There have been a few cases where I would like to write a numeric literal in pla…ce of a custom data type, but that data type does not have valid implementations for the other functions in the type class (or you don't want to implement the other functions for various reasons). If I were redesigning everything from scratch, I'd do: ```hs -- integral literals class FromInteger a where fromInteger :: Integer -> a -- decimal literals class FromRational a where fromRational :: Rational -> a -- string literals, renamed from IsString class FromString s where fromString :: String -> s -- new IsString; I've always wanted this anyway, -- to generalize functions that can take in any -- string-like thing (e.g. logging or golden tests) class FromString s => IsString s where toString :: s -> String -- list literals, broken out from IsList class FromList l where type Item l fromList :: [Item l] -> l -- new IsList class FromList l => IsList l where toList :: l -> [Item l] ``` For example, we have a `BigDecimal` newtype over `Scientific` because `Scientific` has unsafe instances. This way, we can provide explicit `unsafeAdd` functions and prevent people from accidentally using `+`. But currently, there's no way to use a literal `10` to be `BigDecimal 10`. With this proposal, one could implement `FromInteger` but not `Num` as a whole. This is mentioned as part of https://github.com/ghc-proposals/ghc-proposals/pull/124, but I think it's a worthwhile unit separate from the "prevent `fromInteger` from being partial" discussion. Obviously, this is a big breaking change, but I just wanted to write this up and see if others have any thoughts on this before trying to come up with a proposal/migration plan.

BurningWitness · August 6, 2024, 6:09pm

Kinda. It doesn’t feel right to have to use Template Haskell to get baseline features across, that extension just does too many things.

At the same time I don’t know if it’s really possible to both outline a compilation-time safe subset of Haskell and to have the ability to precompile things nicely, the latter must require array indexing, which can obviously segfault and/or corrupt unrelated memory. Same with non-termination. The status quo clearly sucks, but fixing it requires drawing a line which has been avoided this entire time.

L0neGamer · August 6, 2024, 6:16pm

Making more of the language safe doesn’t mean we need the whole thing to be safe; we can increase the safety of using OverloadedLists with NonEmpty, send that doesn’t mean we have to give up on unsafe indexing

flamewanes · August 6, 2024, 6:19pm

This doesn’t make any sense to me. All code that uses String can be rewritten to use Text and it would just be better. And you do deal with string conventions in Text. I don’t really understand what this is attempting to say, I suppose? The point is that String and Text are essentially types that are used identically, but one is a far better option than the other for ALL (every single use-case) that the former is being used for. Furthermore, I only recommend removing Strings from base and not other string types. It seems to me all the string types except String have their valid use-cases. And the only reason, also seemingly to me, is that String was the one first implemented. I’m not quite sure what form of explicitness we would be get rid of at the cost of safety…

And you kind of do say that it’s getting phased out:

Which does imply that it has no use… doesn’t it?
To respond to the other parts of that quote, I don’t really see why them being so few means that there’s no need to eradicate them. The few things you mention already have a bit of a big impact as it is, well, specifically Show, printing, and FFI.

Nevertheless, it doesn’t seem to me like the process to phase it out has started. I think for something like that to happen, there needs to be discussions in the community like this post, and the one 2 years ago about adding Data.Text into base, so that people do work towards it, at whatever speed they wish. In the end, I think that the final outcome will need to be eradicating the String type from common use. I think that aiming for that from the start will help solidify some efforts.

I don’t think Show is near good enough, personally. Show has not been good enough for me in my codebase for anything that is the very, very bare basic form of “I want to get a string representation of this value”, and even that form is, as aforementioned, quite lacking because getting a text representation of the value would be the better way to handle things.

I have never been recommended nor used Builder in my projects, so I’m not really quite sure what use-cases you’re thinking about. In the post above I mentioned pretty printing and pretty-simple, and it looks like that’s just a more convenient way to do pretty printing, and it is annoying that it’s missing from base, or boot libraries. There are also very many pretty printing libraries, and a lot of them have been way more complex than necessary, which is why I prefer pretty-simple. The deal with Show is basically it’s relation to Read, which results in the default instances of a few things being surprising, like Showing strings results with them surrounded in quotes, Showing something like Set or Vectors results in showing fromList rather than anything helpful… so you end up needing a “minimal” pretty-printing that’s not show, which is kind of like text-display or text-show, and then you have a more “fancy” pretty-printing which just prints long text in a pretty way, like pretty-simple.

To answer the off-topic part of your post (well, more like to question it?):

Hmmm… isn’t this kind of what Dependent Haskell buys us, which is getting heavily worked on? Maybe not…

mixphix · August 6, 2024, 6:35pm

Multiline strings (with optional indentation removal) were already possible in Haskell through the use of QuasiQuotes, such as the shakespeare and hsx2hs libraries, which provide ways of producing typed content (Text, HTML, JavaScript) into Haskell files, including variable interpolation, embedded in a Haskell source file or in another location. While it’s great that multiline string support has been made a compiler feature, the ecosystem already has Builder-backed Text construction-with-interpolation, so I don’t see myself using it very soon.

flamewanes · August 6, 2024, 6:39pm

I’m going to guess that mentioning multiline strings was to bring in quasiquotes and how you can implement string interpolation using them? Nevertheless, while they were already possible, the need to use quasiquotes was generally frowned upon which is why the proposal to include them was accepted (and I believe merged in? just not released) into GHC. I definitely support the efforts to lean less and less on quasiquotes and TemplateHaskell in general, as they feel like hacky workarounds at best for things like these when built-in compiler support feels like the best way to do things.

mixphix · August 6, 2024, 7:03pm

The purpose of Builder is to provide O(1) concatenation. In apps that must compile a lot of Text or embedded code, this is crucial for adequate running times. Writing to type String = [Char] or even Text directly incurs a performance hit due to slower concatenation. This is one of the main reasons Text was to be brought into base, was to improve upon the linked list, though that brings with it the style of Show and Read. Why not solve this problem in library-land?

flamewanes · August 6, 2024, 7:05pm

My suggestion is… kind of to improve this in library land. base is a library, right? All we have to do is remove every single usage of String and replace any functions that use it with the better alternatives, whether it’s Text or something else.

mixphix · August 6, 2024, 7:09pm

base is not capable of such drastic changes in the current Haskell ecosystem, or at least does not have the resources to do so. The CLC oversees proposals, which are submitted by members of the Haskell community, who wish to make a change. There has been no coordinated plan to address this problem in a serious way, engaging the community to accept and migrate, or otherwise.

flamewanes · August 6, 2024, 7:12pm

Ah, yeah, I see that. I suppose this would best be a CLC proposal at some point. I think it is more beneficial for everyone to get all popular libraries moving off of String.

BurningWitness · August 6, 2024, 7:28pm

I was making a broader point about all other string types, String is probably the least practical out of the bunch (albeit still neat).

os-string was released last year, alongside the changes to filepath, directory, unix and Win32 packages. You can already avoid pretty much all String-related IO, except perhaps reading locale-dependent C strings.

The topic of Strings being inefficient is not contentious, it’s about the single largest point of agreement in the entire community.

Type classes are bound by laws, and the laws forShow/Read is that they’re classes of things that roughly can be copied from stdout to your code file and back to stdin. You’re free to choose a lawless typeclass to simplify printing, but that set of personal preferences will necessarily exclude half the community who will happily disagree on every single instance.

Builder is a general way to serialize anything to raw bytes in the exact format you want it to be in, both the smallest necessary tool for any such job and a showcase demonstrating you don’t need type classes for this.

In my understanding Dependent Haskell is about being able to compile type-level data into the runtime to extend type-checking, so I don’t think it’s anywhere near.

chrisdone · August 6, 2024, 8:48pm

While we’re in a tangent, I was working on solving this problem (GitHub - chrisdone-archive/present: Make presentations for data types) see the extension section and how the output contains many choices for a given type. The presentation generated lazily, so an editor could walk through a nested structure or even infinite ones. I had an Emacs mode which talks to GHCi.

I gave up after getting stuck on the implementation, but I think that’s the direction I’d take Haskell if I had a magic wand. Seeing what’s going in my Haskell programs is often not possible due to type erasure and lack of good debugging facilities. It’s not even that I want to walk through my code with a stepper, I just want to see the data, with a rich representation that I don’t have to prepare in advance.

darkxero · August 7, 2024, 1:44am

Text is the better solution when you just want to shove bytes you can read and understand, somewhere (the web, a file). Which is 99% most of the time. So yeah, you want String to be Text and you want it to work like Java (it replaces String with StringBuilder when it can, do you really want to iterate all over each previous String to concat another when you just want a lone big String at the end?). This is very good because you help people fall on the pit of success.

Now, the current value of String is educational. It’s a list. List functions, functor functions, monad functions, all apply. You teach students to come up with reverse on a list, and then teach them how to use it. You make them compose the acquired list tools to solve String related problems.

This is the one in base, students can come up with either.

reverse                 :: [a] -> [a]
#if defined(USE_REPORT_PRELUDE)
reverse                 =  foldl (flip (:)) []
#else
reverse l =  rev l []
  where
    rev []     a = a
    rev (x:xs) a = rev xs (x:a)
#endif

A classic problem example and the solution:

isPalindrome :: String -> Bool
isPalindrome ss = ss == reverse ss

Now, if it was:

isPalindrome :: Text -> Bool
isPalindrome ss = ss == T.reverse ss

You would be showing them this:

reverse ::
#if defined(ASSERTS)
  HasCallStack =>
#endif
  Text -> Text
reverse (Text _ _ 0) = empty
reverse t            = reverseNonEmpty t

reverseNonEmpty ::
  Text -> Text
#if defined(PURE_HASKELL)
reverseNonEmpty (Text src off len) = runST $ do
    dest <- A.new len
    _ <- reversePoints src off dest len
    result <- A.unsafeFreeze dest
    pure $ Text result 0 len

reversePoints
    :: A.Array -- ^ Input array
    -> Int -- ^ Input index
    -> A.MArray s -- ^ Output array
    -> Int -- ^ Output index
    -> ST s ()
reversePoints src xx dest yy = go xx yy where
    go !_ y | y <= 0 = pure ()
    go x y =
        let pLen = utf8LengthByLeader (A.unsafeIndex src x)
            -- The next y is also the start of the current point in the output
            yNext = y - pLen
        in do
            A.copyI pLen dest yNext src x
            go (x + pLen) yNext
#else
reverseNonEmpty (Text (A.ByteArray ba) off len) = runST $ do
    marr@(A.MutableByteArray mba) <- A.new len
    unsafeIOToST $ c_reverse mba ba (fromIntegral off) (fromIntegral len)
    brr <- A.unsafeFreeze marr
    return $ Text brr 0 len
#endif

Monads, the ST monad, unsafe operations, Arrays and ByteArrays.
Hardly beginner material and not an inviting first look at Haskell.

Of course you can… not show them and “just trust me bro”. Or make them take the Text API for granted and work with that. Like how IO is represented and done and how the monad of the IO monad is glossed over at first.

atravers · August 7, 2024, 4:36am

In additions to the other problems mentioned here:

So while “just” replacing String with Text could reduce the existing awkwardness, new awkwardness would be introduced regarding where to use a lazy or strict version of some type.

Perhaps the lazy Text type could be replaced with a combination of the strict Text type and (lazy) lists i.e. [Text]:

(Lazy lists: in SWI-Prolog since 2019, in Haskell since 1990 ;-)

Then a merger of the two Text modules could at least be contemplated: if that really is possible, it would help to simplify the replacement of String.

BurningWitness · August 7, 2024, 7:59am

I’m pretty sure LazyText chunks can start and end in the middle of a code point ~~, so at best you’d have newtype LazyText = LazyText [TextChunk] (public) and newtype TextChunk = TextChunk StrictByteString1 (internal)~~. You could force every chunk to end on code point boundaries, but then any LazyByteString that is known to be UTF-8 at the time of parsing would need to be explicitly rechunked to fit that expectation.

For ByteStrings however a public newtype LazyByteString = LazyByteString [StrictByteString1] definitely makes a lot of sense, that string being chunked doesn’t feel like an implementation detail.

f-a · August 7, 2024, 9:39am

I confess I have not — yet! — read all of this discussion, but on Discourse there is no shame in opening a new topic if the discussion splinters in multiple sub-threads. It helps keep everything tidy.

Topic		Replies	Views
HF Tech Proposal #1: UTF-8 Encoded Text Haskell Foundation	40	6368	June 2, 2021
Informal discussion about the progression of `base` Core Libraries Committee	158	8380	May 6, 2024
Bringing Data.Text into `base`: What is the next step?	52	3942	October 20, 2022
Text-2.0 with UTF8 is finally released! Announcements	20	2822	January 27, 2022
Base proposal around vector-like types Haskell Foundation	85	7508	August 26, 2023

The Quest to Completely Eradicate `String` Awkwardness

Related topics