A different approach to emancipating `base`

There has been much talk of the challenge of emancipating base from the GHC release process over the years. Challenges have been listed. I have not followed these conversations deeply, but I respect those who have written about how hard it would be. In this post, I describe a different way to structure the relationship between GHC and base than we have today. I don’t have an opinion on whether this is a better way than the status quo, and I’m not advocating for change here; instead, I’m laying out an alternative design so that more-informed others can consider it and decide whether it is in fact better than the status quo.

Assumption

The key challenge of emancipating base is that GHC has special treatment of some of the definitions in base, notably Monad and friends (which GHC must know about in order to compile do blocks) and Num and friends (which GHC must know about in order to compile numeric literals). Because GHC knows about these definitions, the modules that define them must be shipped with GHC itself, causing a key challenge in emancipating (the rest of) base. This post is all about defining a different potential relationship between e.g. the definition of Monad and GHC.
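
To make the coupling concrete, here is a minimal sketch in ordinary Haskell of what "GHC must know about Monad and Num" means in practice. The function names (pairUp, pairUpDesugared, three, threeDesugared) are made up for illustration; the point is only that do blocks and numeric literals are shorthand for calls to class methods that the compiler has to be able to find.

```haskell
-- A do block as the programmer writes it.
pairUp :: Maybe Int -> Maybe Int -> Maybe (Int, Int)
pairUp mx my = do
  x <- mx
  y <- my
  return (x, y)

-- Roughly what that do block means: explicit calls to the Monad method (>>=).
pairUpDesugared :: Maybe Int -> Maybe Int -> Maybe (Int, Int)
pairUpDesugared mx my =
  mx >>= \x ->
  my >>= \y ->
  return (x, y)

-- A numeric literal as the programmer writes it.
three :: Double
three = 3

-- Roughly what that literal means: a call to the Num method fromInteger.
threeDesugared :: Double
threeDesugared = fromInteger (3 :: Integer)
```

Loading this in GHCi and comparing each pair of definitions shows they agree; the compiler could not produce the second form of each without knowing where (>>=) and fromInteger live.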

Background

GHC currently is aware of at least four categories of definitions in libraries:

  1. Primitive types. There are a number of type definitions that have no phrasing in Haskell. A good example is Int#. GHC gives these types meaning with a little magic in the compiler (that is, they are treated in a different way from ordinary types), and they are exported from the magic module GHC.Prim. The GHC.Prim module has no source code (in its place is a stub just used for creating Haddock output) and lives in the special package ghc-prim. GHC knows that when GHC.Prim.Int# is looked up, it should provide its internal magical type definition; the type is identified by the name GHC.Prim.Int#.

  2. Primitive values. There are a number of primitive values, like negateInt#, that cannot be written in Haskell. For the purposes of this post, these are treated identically to primitive types, in that they are exported from GHC.Prim and are identified by their name (e.g. GHC.Prim.negateInt#). The GHC implementation routes these through quite a different path than primitive types, but the differences don’t matter here.

  3. Wired-in types. There are a number of Haskell type definitions that GHC knows intimately. A good example is Bool. GHC knows where Bool is declared, that it has exactly two constructors False and True, and what order these constructors are declared in. It must know this in order to compile guards and if expressions. Most of these live in the module GHC.Types in the ghc-prim package; some live elsewhere in the ghc-prim package. None lives in base. When GHC compiles the actual definition of these types (GHC.Types does have source code), it essentially ignores the definition provided and uses its own internal definition instead. These types are identified by their name (e.g. GHC.Types.Bool).

  4. Known-key types and values. There are a number of Haskell type and value definitions that GHC knows how to find. A good example is Monad. GHC knows the package, module, and name of Monad (to wit, base, GHC.Base, and Monad). When it needs to emit, say, a Monad m constraint, it looks up the type with that name from that module and proceeds. These definitions might appear outside of ghc-prim, and they pose the challenge: today, the packages that define them cannot be released separately from GHC.
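
As a small illustration of the first three categories, the sketch below uses the MagicHash extension to touch a primitive type and a primitive value directly, and spells out the case expression that an if on the wired-in Bool stands for. The helper names (negateBoxed, choose, chooseDesugared) are invented for this example; it imports from GHC.Exts, which re-exports the primitives.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), negateInt#)

-- Int is an ordinary data type whose constructor I# wraps the primitive
-- type Int#. negateInt# is a primitive value with no Haskell definition.
negateBoxed :: Int -> Int
negateBoxed (I# n) = I# (negateInt# n)

-- Bool is wired in: GHC relies on its two constructors to compile `if`.
choose :: Bool -> a -> a -> a
choose c t e = if c then t else e

-- Roughly what the `if` above stands for.
chooseDesugared :: Bool -> a -> a -> a
chooseDesugared c t e = case c of
  True  -> t
  False -> e
```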

The challenge is all around these known-key definitions. I don’t think anyone is talking about making changes to the treatment or packaging of any of the first three cases; only the known-key definitions cause pain.

“Don’t call us; we’ll call you!”

The key challenge here is simply that GHC knows exactly where the known-key definitions are declared, and thus the package including these must be released with GHC. My idea here is simply not to do this, but instead to have the types declare themselves to be e.g. the “real” Monad. This is what Agda does, for example, with its BUILTIN directive and how OCaml works with its external declarations.

Concretely, we might imagine a declaration like

class {-# BUILTIN Monad #-} Applicative m => Monad m where ...

where the pragma tells GHC that this definition is the Monad definition. When GHC reads this definition, it remembers that the class being defined is the Monad, and so when it needs to produce a Monad m constraint, it knows where to go. This approach means that base is no longer special – instead, it just means that there must be one class definition labeled as Monad loaded when compiling any module that uses do. base can evolve independently from GHC – or community members could write completely alternative standard libraries, as long as they have a Monad class.

We could imagine an even-more Haskelly approach of having class Builtin (name :: Symbol) (ty :: k) and the definition of Monad would come with instance Builtin "Monad" Monad. The class-instance solver would then be used to find the Monad class. This would likely be considerably less performant.

This route also interacts nicely with -XRebindableSyntax: instead of looking for one definition with BUILTIN Monad, GHC could look for the in-scope type with BUILTIN REBINDABLE Monad. This would likely be considerably easier to configure than the current -XRebindableSyntax.
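
For comparison, here is a minimal sketch of how -XRebindableSyntax behaves today (not the hypothetical BUILTIN REBINDABLE variant): the do block picks up whatever (>>=), (>>) and return happen to be in scope. The Logged type and the say helper are invented for this example, and Logged is deliberately not an instance of base's Monad.

```haskell
{-# LANGUAGE RebindableSyntax #-}

import Prelude hiding ((>>=), (>>), return)

infixl 1 >>=, >>

-- A value together with accumulated log messages.
data Logged a = Logged [String] a

(>>=) :: Logged a -> (a -> Logged b) -> Logged b
Logged w1 a >>= f = case f a of
  Logged w2 b -> Logged (w1 ++ w2) b

(>>) :: Logged a -> Logged b -> Logged b
m >> k = m >>= \_ -> k

return :: a -> Logged a
return = Logged []

say :: String -> Logged ()
say msg = Logged [msg] ()

-- This do block uses the operators defined above, not base's Monad methods.
example :: Logged Int
example = do
  say "start"
  x <- return 20
  say "done"
  return (x + 22)
```

Note that the whole module has to shadow the Prelude names by hand, which is the kind of configuration burden the paragraph above suggests a pragma-based lookup might reduce.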

Left unsolved: what to do with a NoImplicitPrelude module that doesn’t import anything. In this case, no other modules are loaded, and so GHC doesn’t have a chance to find the BUILTIN Monad. I actually think it’s reasonable to error in such a file if it uses a do, but others may feel differently (and this would not be backward compatible). We could alternatively say that NoImplicitPrelude really means import "base" GHC.Required () or something, which looks for a package named base and a module named GHC.Required, loading the definitions therein. Then we would make sure that base had such a module, and that it depended on all the modules with BUILTINs. But base could still evolve independently from GHC.

Conclusion

Perhaps I’m addressing the wrong problems here, and perhaps there are other unseen challenges in this approach. But it might be that this idea – changing GHC instead of base – allows things to move forward in this space.


I don’t know enough about the code to evaluate any concrete approach.

But what about other parts of base that are strongly coupled to GHC? Here I’m thinking of things like GHC.Exts, which are somewhat stable and could use this at the cost of there being lots and lots of BUILTIN options, but even more so about things like GHC.Stats.RTSStats, which seem to me to be more likely to change from release to release.

I suspect that the answer is to put those into a package that does ship with GHC, rather than base, and deprecate their current positions in base. But it seems to me that the real decoupling work will spend a short time on Monad, and a very long time figuring out which of these definitions remain strongly coupled and which can be loosely coupled, all with a good deprecation story. So I worry that this proposal (while elegant) is mostly solving the easy problem rather than the hard one, unfortunately.


I still don’t understand why we can’t “just” rename base to ghc-base, ship the latter with GHC, and make a new base package that re-exports modules/definitions from ghc-base. Crucially, that means that GHC devs can make changes to ghc-base without needing CLC approval (since the CLC is now claiming to have authority over every change to the exposed API of base). See Specification of the stable API of `base` · Issue #105 · haskell/core-libraries-committee · GitHub for a recent related discussion.

Of course this doesn’t completely decouple everything, because ghc-base will still be a big tangle, and known-key types have to live in ghc-base and be re-exported by base, so they can’t change freely. But it’s a smallish step in the right direction. Or am I missing a reason this is fundamentally difficult?


It’s not just known-key types, but also known-key values (e.g. the Monad members) isn’t it?

Richard’s proposed change would make it possible for base to be built or rebuilt normally by Cabal, and thus fetched from Hackage and built independently, right? I think this used to be discussed under the term “upgradeable base”, but I am not sure of the state of affairs. I vaguely recall that this already works to some degree?

It would still require there to be exactly one BUILTIN Monad pragma visible to GHC, right? The pragma approach would create new error conditions, such as multiple such pragmas visible.
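
The new error conditions can be sketched with a toy model. Everything here is hypothetical and is not GHC's API: a BUILTIN pragma is modelled as a pair of the role it claims and the qualified name that claims it, and resolveBuiltin is an invented name for the lookup.

```haskell
import Data.List (nub)

-- Hypothetical model: (builtin role, qualified name of the claiming definition).
type BuiltinDecl = (String, String)

data Resolution
  = Found String        -- exactly one definition claims the role
  | Missing             -- no visible definition claims the role
  | Ambiguous [String]  -- several distinct definitions claim the role
  deriving (Eq, Show)

-- Look up a role among the visible pragmas. The same definition reached
-- through two import paths is not treated as a conflict.
resolveBuiltin :: String -> [BuiltinDecl] -> Resolution
resolveBuiltin role visible =
  case nub [name | (r, name) <- visible, r == role] of
    []     -> Missing
    [name] -> Found name
    names  -> Ambiguous names
```

In this model, Missing corresponds to a module that uses do with no BUILTIN Monad in view, and Ambiguous to two libraries both claiming to be the Monad.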

BTW, GHC also has known-key types in other packages: ghc-bignum, base and template-haskell.


I think this used to be discussed under the term “upgradeable base”, but I am not sure of the state of affairs. I vaguely recall that this already works to some degree?

Neither cabal nor stack likes the idea of rebuilding base, but technically it’s possible to create a tarball for base and invoke Setup.hs to build it and install it to other package databases.

The real problem is: that tarball would need to contain certain files auto-generated by configure/hadrian. And those are target-specific. So you can’t really have a single base package uploaded to Hackage and expect it to work for all targets.

Thanks for the reminder, @TerrorJack! So while Richard’s idea sounds reasonable, I’m not sure what exactly we’d gain.

Could we move the target-specific files to ghc-prim?

Or spin out ghc-base from base. I think @Ericson2314 is working on this.

Concretely, we might imagine a declaration like

class {-# BUILTIN Monad #-} Applicative m => Monad m where ...

For what it’s worth, this is also very similar to how rustc solves this problem. Concretely, they have so-called lang-items, which are declarations marked with a lang pragma which identifies the declaration as somehow magical to the compiler.

It’s a bit unclear to me where/how these BUILTIN pragmas are recorded. In principle we could record them in much the same way we record orphan information: every interface file will contain all of the BUILTIN things visible in its transitive imports. However, it seems like doing this for every module would be rather costly. One advantage of the status quo is that we know precisely which interface we need to load for a particular builtin declaration.
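
A toy model of the orphan-style recording might look like the following. The Iface type and its fields are made up for illustration and do not reflect GHC's actual interface-file format; the sketch only shows why the set has to be computed and carried for every module.

```haskell
import Data.List (nub)

-- Hypothetical model of an interface file.
data Iface = Iface
  { ifaceModule   :: String
  , ifaceImports  :: [Iface]             -- direct imports
  , ifaceBuiltins :: [(String, String)]  -- (builtin role, defining name) declared here
  }

-- Orphan-style recording: a module's interface lists every BUILTIN visible
-- through its transitive imports, not just the ones it declares itself.
visibleBuiltins :: Iface -> [(String, String)]
visibleBuiltins iface =
  nub (ifaceBuiltins iface ++ concatMap visibleBuiltins (ifaceImports iface))
```

Under this scheme even a user module that declares no builtins still carries the full list inherited from its imports, which is the per-module cost mentioned above.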

I believe that you could upgrade/replace base today provided:

  • Known-key things and wired-in things remain defined in the same module, with the same definition. E.g. Show is a known-key thing (not a wired-in thing), but if you made it into a multi-parameter type class, deriving( Show ) would fail badly. (Precisely what GHC relies on about the definition of each known-key thing will vary from one to another.)
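
To illustrate the Show example, the sketch below puts a derived instance next to a hand-written one in the style that deriving produces. The types Colour, Disc and Shape are invented for this example. The hand-written instance shows what the generated code leans on: a single-parameter class with a showsPrec method of the usual type.

```haskell
-- Instances generated by the compiler.
data Colour = Red | Green
  deriving (Show)

data Disc = Disc Int
  deriving (Show)

-- A hand-written instance in the same style as a derived one. It only makes
-- sense while Show has one parameter and a showsPrec :: Int -> a -> ShowS method.
data Shape = Circle Int

instance Show Shape where
  showsPrec d (Circle r) =
    showParen (d > 10) (showString "Circle " . showsPrec 11 r)
```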

I think that’s all; keep that stuff stable and you can upgrade base. I don’t know how hard it would be to actually do such an upgrade, but if it was desirable, I bet we could do it.

Is keeping the above known-key/wired-in things stable a major impediment on those who want to upgrade base? I doubt it. Richard’s post suggests the ability to move the definition of a known-key thing from one module to another. Maybe we could do that, with work, but the thing you moved would still have to be defined in the same way, so the gain would be extremely modest. I don’t think this is the problem we need to solve.

To my mind, a far bigger impediment is that, if you want the upgrade to be backward compatible, you need to support the API of base; and that API currently consists of every export of every exposed module, which is far too much.

That lies at the root of Adam’s suggestion:

I still don’t understand why we can’t “just” rename base to ghc-base, ship the latter with GHC, and make a new base package that re-exports modules/definitions from ghc-base. Crucially, that means that GHC devs can make changes to ghc-base without needing CLC approval

Nothing would be gained here if the new base had modules corresponding 1-1 with ghc-base, and exported precisely the same things. Presumably, then, the new base would export less. But what would that really gain compared to simply

  • defining the (narrower) API we want base to have, and
  • giving it that API

Then the exposed modules are the API of base (under the purview of CLC) and the non-exposed ones are GHC-internals (and not under CLC’s purview). That is, the same as today! But the API is narrower.

Well, two things would be gained:

  1. Psychological. Defining a new package may make it easier to have the conversation about the API we want.
  2. Access to internals. It seems extremely likely that some users will want to reach inside GHC’s implementation, and exploit something about the way it works, perhaps to improve performance. They may fully understand that in doing so they are exposing themselves to changes that GHC implementors may make. But if the GHC-internals are in un-exposed modules, they simply can’t import them. So it would be helpful to have them as the exposed modules of a GHC-internal package ghc-base.

So thumbs up to Adam’s suggestion.

The big challenge would be that the API of base would change (at least by shrinking, substantially) and that risks breakage. Of course, breakage in service of less future breakage, but still breakage.


In my opinion, this requirement is the biggest issue standing between us and a decoupled base and I’m not sure we can so easily sweep it under the rug. As far as I can tell, most of the high-impact CLC proposals in the last several years have broken this:

  • the Applicative/Monad proposal added an Applicative superclass to Monad
  • the Semigroup/Monoid proposal added a Semigroup superclass to Monoid
  • the MonadFail proposal split fail into a new MonadFail class
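
As a reminder of what these three changes look like from the user's side, here is a small sketch written against the post-proposal class hierarchy. The types Counter and Total and the function justs are invented for illustration.

```haskell
-- Applicative/Monad: a Monad instance now needs Functor and Applicative too.
newtype Counter a = Counter (Int -> (a, Int))

runCounter :: Counter a -> Int -> (a, Int)
runCounter (Counter f) = f

instance Functor Counter where
  fmap g (Counter f) = Counter $ \n -> let (a, n') = f n in (g a, n')

instance Applicative Counter where
  pure a = Counter $ \n -> (a, n)
  Counter ff <*> Counter fa = Counter $ \n ->
    let (g, n')  = ff n
        (a, n'') = fa n'
    in (g a, n'')

instance Monad Counter where
  Counter f >>= k = Counter $ \n ->
    let (a, n') = f n
    in runCounter (k a) n'

tick :: Counter Int
tick = Counter $ \n -> (n, n + 1)

-- Semigroup/Monoid: a Monoid instance now needs a Semigroup instance.
newtype Total = Total Int deriving (Eq, Show)

instance Semigroup Total where
  Total a <> Total b = Total (a + b)

instance Monoid Total where
  mempty = Total 0

-- MonadFail: a failable pattern in a do block needs MonadFail, not just Monad.
justs :: [Maybe Int] -> [Int]
justs ms = do
  Just x <- ms
  return x
```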

Perhaps the fact that they changed these declarations was a sign that they were too impactful and should not have been made to begin with. However, there are plenty of other less impactful changes that similarly affected builtin declarations:


But that implies changing the definition of Monad; it does not imply moving the declaration from one module to another (which is what Richard wants to support). Now, it might be that adding a superclass to a known-key class is fine – but we would need to check that on a case-by-case basis. Some changes are OK; others are not; and it’ll probably vary from one known-key thing to another.

It’s undeniable that some of the user-facing API of base is baked into GHC. But I don’t think that coupling is easily removed. The ability to move from one module to another doesn’t seem to me to contribute much.


A bit off topic, but being able to define the built ins could be very good regardless of base:

If we can define the built ins as actual built ins rather than as simple data types and functions in an alternative prelude, the alternative preludes could become much more useful and possibly even help in avoiding breakage (by e.g. implementing upcoming changes ahead of time)

Consider the definition of the linearity-aware Monad in linear-base: even if we import the linear-base prelude and use NoImplicitPrelude and RebindableSyntax, every function that depends on base’s Monad is unusable in combination with the linear Monad because they simply aren’t the same type. However, if the linear Monad was defined with BUILTIN, and only that one imported and not the one from base, wouldn’t all occurrences of Monad refer to the linear Monad?

It sounds great.

I very much agree with @rae’s analysis. GHC currently thinks of base coupling in a fuzzy undirected way, and the key way to tease things apart and simplify them is to focus on direction: magic stuff like Int#, where GHC defines and base uses, is very different from, and a stronger coupling than, stuff like Bool or Monad, where base defines and GHC uses.

@adamgundry We can do ghc-base first and I think we should. That is simpler. But I think things like this are still a very good follow-up, and we can and should do them.

@rae to flip the above around, there is also tons of code in base that has nothing to do with GHC integration — it is simply there because of the “social purpose” of base being where standardized things go, as opposed to any technical reason. So the ghc-base thing is just a way to start plucking that lowest hanging fruit.
