Bringing Data.Text into `base`: What is the next step?

Hey everyone,

I was wondering if there were concrete blockers to the upstreaming of Data.Text (and associates) into base. I think it’s high time we stopped settling for String.

Amongst the positive changes that I can see happening, I’d like to underline two specifically:

  1. We change the culture towards a legitimisation of Text as part of the basic toolkit of the Haskeller, especially in Cabal projects where text isn’t readily available unless it is added manually to the dependencies.

  2. New APIs inside of base can adopt Text. One of the latest examples is GHC Proposal 330 - Decorate exceptions with backtrace information. The proposed API uses String, seemingly not for the (real and valid) properties of a lazy linked list, but because base cannot depend on text.
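To make point 2 concrete, here is a minimal sketch of the conversion that Text-based code has to perform today at the boundary of a String-returning base API. It uses the existing `displayException` from `Control.Exception`; `describeFailure` and `sampleFailure` are hypothetical names for illustration only, and this does not reproduce the API from Proposal 330.

```haskell
import Control.Exception (SomeException, displayException, toException)
import Data.Text (Text)
import qualified Data.Text as T

-- Hypothetical helper (not part of base or text): render an exception as Text.
-- displayException returns a String, so the result is packed explicitly.
describeFailure :: SomeException -> Text
describeFailure e = T.pack "failure: " <> T.pack (displayException e)

-- Example value built from an ordinary IOException created with userError.
sampleFailure :: Text
sampleFailure = describeFailure (toException (userError "boom"))
```

If base APIs could return Text directly, the `T.pack` call around `displayException` would not be needed.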

Moreover, I am explicitly not talking about replacing pre-existing usages of String in base, like FilePath (as that was raised on Reddit).

Please, let’s split base first. More code in the GHC repo inexorably tied to the GHC version is a recipe for disaster.

You’ve my worthless support :relieved:

I’d like to apologize in advance if I’ve missed it, but I don’t think the community has recently discussed the purpose of the base library.

For a long time I was quite satisfied with it (mainly for spare-time projects and teaching Haskell). IMO, base is well suited for onboarding new users to Haskell: it is well designed overall, its functions and types are easy to implement and understand, and it uses many concepts.

However, when I started to use Haskell professionally, a lot of issues arose (mainly due to partial functions). I don’t consider the lack of “advanced” data structures a defect.

So we switched to an alternative Prelude (for the record, it is Universum, which is a good tradeoff between divergence from base and structural improvements). It did great on our codebase but increased onboarding time.

Adding Text would increase base’s incoherence: many functions defined in Data.Text have the same name as functions in base. That is perfectly acceptable in a third-party library, but it will create confusion within a single library, and it will require a redesign.

A few ideas I had in mind (no strong opinion):

  • base as the minimal subset that every Haskeller works with. It would need a bigger Prelude to be usable, but this could guarantee a level of stability and better cohesion among non-base users.
  • base as a go-to library (like Ruby’s std). This would require a complete redesign and, because of the bigger size, may decrease coherence and the ability to make backward-compatible changes (or at least make them more costly).

I think there ought to be an end goal in mind for this change. Should this be the first step in a concerted effort to move away from String? Or is it merely for convenience? As it stands, making changes to text does not require CLC approval, but it would after this change. Is it worth bringing in that level of scrutiny over the implementation details? I’m not so sure, but with clear community support for a detailed proposal I could be swayed.

Either way, I think that a merge of this weight merits serious consideration of base-5.0, what it should look like, and how closely it should adhere to the Haskell Report.

I would love it if the base version could be decoupled from GHC, so I am curious about the progress of the base split. How close is it to being implemented?

I don’t see why this ought to be the case. They’ll still be in different modules. It doesn’t seem particularly confusing to me, as long as it’s clearly documented that Data.Text is designed to be imported qualified (which it is).
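As a minimal sketch of what qualified import looks like in practice, the functions below use same-named operations from the Prelude and from Data.Text side by side without ambiguity. The function names (`wordCountString`, `wordCountText`, `charCountText`) are made up for this illustration.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- Prelude's list-based 'words' and 'length' keep their unqualified names.
wordCountString :: String -> Int
wordCountString = length . words

-- The Data.Text function with the same name is reached through the T. prefix.
-- T.words returns a list of Text, so Prelude's 'length' counts the pieces.
wordCountText :: Text -> Int
wordCountText = length . T.words

-- T.length and Prelude's length coexist because only one is unqualified.
charCountText :: Text -> Int
charCountText = T.length
```

Only the type name `Text` is imported unqualified here; everything else goes through `T.`, which is the convention the text documentation recommends.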

As I understand it, when the UTF-8 text change was made in text-2.0, users of older GHCs could upgrade to it straight away just by depending on text-2.0. Is that correct?

On the other hand, if Data.Text were in base then users of older GHCs could not have upgraded to it; instead they would have had to wait for the next release of GHC and upgrade to that GHC version. Is that correct?

@tomjaguarpaw You are absolutely right. However, I would make the case that I don’t see us moving to another character encoding in the future, as the initial choice of UTF-16 was misguided. As such, this is not something I’m taking into account (but I am fully open to other potential pain points!).

Moreover, having experience with other languages, having a decent textual representation as part of the base social contract has never been a contention point, or even controversial.
Even our cousin from the 80’s Erlang has switched to having binaries with UTF-8-encoded codepoints as a way to encode its string literals, and its string module operates on grapheme clusters.

Indeed, I don’t expect us to change encoding, but the example serves as a caution against moving things into base: it ties upgrades to GHC. That doesn’t seem desirable.

Perhaps you can elaborate on the positive aspects of moving text into base. What would we achieve by it that we can’t already do?

Sure, let me amend my initial post. :slight_smile:

Thanks. So I agree with 2 (New APIs inside of base can adopt Text). That would be nice. But I think 1 should be solved by different means. If it’s difficult to use text in a Cabal project, make it easier! That shouldn’t require radically restructuring base libraries.

For reference:

  1. We change the culture towards a legitimisation of Text as part of the basic toolkit of the Haskeller, especially in Cabal projects where text isn’t readily available unless it is added manually to the dependencies.

If it’s difficult to use text in a Cabal project, make it easier! That shouldn’t require radically restructuring base libraries.

Are you thinking about a solution specifically to make it easier? I know of Cabal mixins but they would probably need base to re-export text (and thus depend on it). Or we can have something radical like a std library that re-exports the Core Libraries (with sensible PVP bounds), and have cabal-install amend its default template.

Regarding a radical restructuring of the base library, I’m afraid I don’t quite understand what you mean by that; this discussion is about having Data.Text.* in base. But perhaps you’re seeing something that I do not?

Are you thinking about a solution specifically to make it easier?

I don’t have any particular solution in mind, but it shouldn’t be beyond the wit of humanity to come up with some reasonable solution, less drastic than including Data.Text in base.

To me, including Data.Text in base is a radical restructuring, because it prevents the evolution of Data.Text as distinct from GHC.

I have a simple proposal: let there be a new package standard. standard includes and re-exports modules from base and text (and possibly a few more). We teach cabal to either default to including standard as a dependency or ask users during the interactive init. Problem solved?

It doesn’t sound appealing to hardcode a third-party library name into cabal and treat it specially. If anything, this should be a generic feature that supports arbitrary alternative preludes, and that’s a more complicated design space.

I don’t mean to grill you on that topic, but do you have anything that is not covered by Cabal mixins today? :slight_smile:

While this is a technically valid solution, I think that from a PL design perspective, not only having but maintaining the textual type in a separate library outside of the lowest common denominator (in our case, base) really sends out the wrong message.
I do not know of any other language that has decided to do such a thing in favour of a data structure that has a useless API and cannot even give me the number of graphemes in a character string vs. the number of code points.
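To illustrate the code point versus grapheme distinction mentioned above, here is a small sketch using only base. The names `precomposed`, `decomposed`, and `codePoints` are chosen for this example; counting graphemes properly needs Unicode segmentation rules, which this sketch does not attempt.

```haskell
-- "é" written as a single precomposed code point (U+00E9).
precomposed :: String
precomposed = "\x00E9"

-- "é" written as 'e' followed by the combining acute accent (U+0301).
decomposed :: String
decomposed = "e\x0301"

-- A String is a list of Char, and each Char is one Unicode code point,
-- so 'length' counts code points, not user-perceived characters.
codePoints :: String -> Int
codePoints = length
```

Both strings display as one user-perceived character, yet `codePoints precomposed` is 1 while `codePoints decomposed` is 2.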

(rant: And to be frank, I think that I am quite fed up with this Haskell Exceptionalism that is so pervasive in our culture, to the point of justifying the worst design decisions from the last millennium as if their rectification were an intolerable attack on our very being. I’m not only speaking of what this discussion brings; it is a general sentiment that strikes me when I read similar community debates.)

What about bytestring? It seems like both text and bytestring are already provided with GHC, according to Haskell Hierarchical Libraries. Why can Cabal projects use bytestring but not text?

This page lists the libraries shipped with GHC distributions (both text and bytestring are listed there) but this is not something that is reflected in Cabal projects, no.

Also I don’t understand your last sentence. Of course Cabal projects can use bytestring and text.