Language, library, and compiler stability (moved from GHC 9.6 Migration guide)

To elaborate a bit here:

Would it have been technically possible to continue accepting these programs (presumably with a warning and deprecation cycle)? Very likely, yes.

However, we are very reluctant to start letting implementation bugs supersede the specification. After all, what if, when adding logic to accept these programs, we introduced yet another bug? Are we then also obligated to emulate that bug, with its own deprecation cycle? Where does this end? How do we manage the technical complexity of validating these layers of bug emulation? Avoiding this is why we specify behavior and reserve the right to change behavior which is not specified (e.g. the GADT update behavior noted in the migration guide).

Now, this is not to say that we will never add deprecation cycles around standards-non-compliant behavior. Had the user-facing impact been high enough, we could have done so here. However, we judged (using empirical evidence provided by head.hackage) that this particular issue was of sufficiently low impact that users would be better served by our time being spent elsewhere. As with most things in engineering, this is a trade-off that must be struck on a case-by-case basis.

6 Likes

I don’t follow. GHC is pretty much the de facto Haskell compiler. We all write GHC Haskell today. I doubt many people read the spec (if there is one), write code according to it, and are then surprised that GHC doesn’t accept their code. We write code to be accepted by GHC. Thus if GHC starts rejecting code, whether or not that is technically a bug, it is a user-facing change.

head.hackage is pretty much the database of likely known “changes”.

Anything that lets a codebase compile (warning-free, and with a hypothetical -fno-experimental) with GHC N-2 but fails to compile with GHC N-1 or GHC N is a change.

I think that collecting examples of these would be one of the highest-value activities we could do around improving the stability story.

3 Likes

The role of a specification is not solely to inform users of what behavior to expect; it is equally (if not more) important as a tool for implementors to judge what behavior is correct. Without a specification, it is impossible to say whether a particular behavior is self-consistent, interacts poorly with other behaviors, or is implemented correctly.

This is why language extensions tend to change: we cannot predict all interactions without a thorough specification (and even with one it is quite hard). Consequently, we may later notice (often by chance) an internal inconsistency. We generally try to fix such inconsistencies by appealing to whatever existing specification applies, since even under ideal conditions developing a language implementation is hard; doing so without a clear specification is significantly harder.

Yes, this is why I’m asking for concrete examples; when I look at head.hackage I see very few changes that stem from language or compiler changes. The vast majority of patches are due to:

  • changes in base
  • changes in template-haskell
  • changes in the boot libraries

Changes in the surface language itself are quite rare. The two notable exceptions that I can think of are:

  1. simplified subsumption, which we reverted
  2. 9.2’s change in representation of boxed sized integer types, which was sadly necessary for correctness on AArch64/Darwin.

As @tomjaguarpaw says, I think starting to collect a list of high-cost past changes would be an incredibly useful exercise. We can’t know how to improve until we know what we have done poorly in the past. Certainly the examples above should be on this list but surely there are others as well.

7 Likes

I know I’m a tiny minority, but I try to write code that will be accepted by either GHC or Hugs. This includes features way beyond what’s documented in Haskell 2010.

Yeah, that would be nice: overlapping instances are poorly spec’d, especially how they interact with FunDeps. There are significant differences between what code the two compilers accept. GHC is too liberal (you can get incoherence); Hugs is too strict; and nowhere explains the differences or why.
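
To make the incoherence point concrete, here is a minimal, GHC-only sketch (my own invented class and instances, using an INCOHERENT pragma rather than the FunDeps interaction above): which instance is selected depends on how much the call site knows about the type, so the same value can be described in two different ways.

    {-# LANGUAGE FlexibleInstances #-}

    class Describe a where
      describe :: a -> String

    -- General instance, explicitly marked incoherent so GHC will commit to it
    -- even where a more specific instance might otherwise apply.
    instance {-# INCOHERENT #-} Describe a where
      describe _ = "something"

    -- Specific instance for Int.
    instance Describe Int where
      describe _ = "an Int"

    -- In a polymorphic context GHC commits to the general instance.
    viaPoly :: a -> String
    viaPoly = describe

    main :: IO ()
    main = do
      putStrLn (describe (1 :: Int))  -- "an Int": the specific instance is chosen
      putStrLn (viaPoly  (1 :: Int))  -- "something": the general instance was chosen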

1 Like

Let me suggest an extreme example, as a thought experiment. Suppose that a new feature, say, FrobnicateTypes, allowed users to write unsafeCoerce. And now suppose a check were added to GHC to disallow precisely those cases where users were writing unsafeCoerce. Would we want a deprecation cycle specifically for people who made use of what seems to me an evident soundness bug in the feature? I think we would not, and that suggests to me that there’s a continuum between “bug-like” and “unintended-feature-like” which requires a little care to evaluate in each specific case.

2 Likes

Should this warn loudly? Yes!
Should this abruptly break code? No.

1 Like

Let’s spell this out so we are all on the same page.

Yes, the thesis is that any change that causes GHC to reject code that was previously accepted needs to have a “deprecation cycle” (for lack of a better term).

I agree that there is a continuum, and I agree that this can still be best effort. As @bgamari points out, since there’s no specification of what the behaviour is supposed to be, it’s hard to know if we are changing it by accident; but we also have Hackage with its ~17k packages, which can be used as unit tests. So, once the intention is clear, best effort can be a good enough effort.

1 Like

I would be surprised if there were any language implementation which would deal with a practically-observable unsoundness bug by mere warning. For instance, I know that Rust routinely fixes soundness issues (and even less-critical errors) in minor releases via breaking changes (these are all taken from rustc releases of the last twelve months).

Naturally, if it is possible to fix the soundness issue while avoiding breakage (even potentially at the cost of, e.g., performance) then that is perhaps preferable. Otherwise, the compiler merely warning the user that it just produced an executable with undefined behavior seems like a medicine (unsoundness) that is far worse than the disease (breakage).

4 Likes

Are you saying all the breakages since, say, 8.0.2 were due to unsoundness bugs or boot library changes?

I feel this discussion is getting derailed by arguing about corner cases.

Was simplified subsumption fixing an unsoundness bug? Maybe, but even if so, the programs that were accepted are not de-facto wrong. There was no surprising behavior like broken code generation that threw end users off.

So I don’t think the view about type system shenanigans and complex specification issues is helpful when evaluating what type of bugfixes end users want to deal with without a deprecation period. Because this is a constant research area, I doubt we’ll reach a place where we can sit down and say “ok, we’re done with the theory”.

Otherwise this becomes an argument why we can break anything, all the time, because we figured out better specifications.

I feel there might be issues with understanding the POV of regular end users when dealing with all the complex compiler stuff.

2 Likes

Who says we cannot do a better job than Rust?

Since we are comparing ourselves to Rust, the discussion in the linked PR shows they used their crater tool to assess the breakage (link). They seem to have thought a lot about this in the past years (see Stability as a Deliverable | Rust Blog).

There they even say:

We reserve the right to fix compiler bugs, patch safety holes, and change type inference in ways that may occasionally require new type annotations. We do not expect any of these changes to cause headaches when upgrading Rust.

Don’t let “perfect” be the enemy of “good”.

3 Likes

I think this data could be collected by testing GHC nightlies against Hackage. As @andreabedini said, that’s our best (only?) battery of unit tests.

2 Likes

Indeed, and crater is a great idea! This is why we have developed our own tool, head.hackage, which GHC’s developers use to perform similar assessments: any GitLab merge request bearing the user-facing label has a corresponding head.hackage job.

That being said, head.hackage is no silver bullet. Ideally head.hackage would include the entirety of Stackage, mirroring how crater includes all of crates.io. However, for a variety of reasons (in part lack of hands, in part rate of breakage), this is not currently the case. Moreover, it is underutilized outside of GHC development: the GHC Steering Committee does not explicitly require that head.hackage be used to audit the impact of proposals; perhaps this is something that we could change.

Agreed! This is precisely why I am skeptical of @angerman’s point that even soundness bugs addressed by surgical fixes do not justify breakage, however small. There needs to be a trade-off here.

To clarify, my previous message was intended only as a reply to @angerman’s response to @sclv’s hypothetical. I only meant to say that merely warning for soundness bugs seemed to me like a somewhat extreme take on stability. After all, a program silently yielding the wrong answer can be far worse than a program not compiling. However, as pointed out above, this in general needs to be taken on a case-by-case basis.

I am not arguing that simplified subsumption fixed a soundness bug. However, like all language changes since ~2017, it was a change which was approved by the GHC Steering Committee in light of community feedback. Should the proposal have been approved? Perhaps not, but I can easily see why it was: it’s a technical proposal with non-obvious implications. While the proposal mentioned the sort of breakage that could be expected from the change, a head.hackage assessment was not undertaken until a few months after the proposal was accepted and implemented. This meant that the committee had little concrete evidence on which to judge the extent of the breakage.

In hindsight, the GHC developers, myself included, should have pushed back against releasing with the change when we realized the impact on head.hackage. However, we were reluctant to do so in part due to a feeling that the power to decide the merits of the change rested with the Steering Committee.

Eventually, when 9.0 started seeing real-world usage and the true scope of the problem became apparent, we proposed to reintroduce deep subsumption. This was quickly implemented, and we took the rather unprecedented step of backporting it to GHC 9.0, despite GHC 9.2 being the active development branch, to minimize user impact.

Would it have been better had we realized the true extent of the breakage prior to the release of 9.0? Certainly, and from this experience we have learned: if faced with a similar situation today, I believe we would be better equipped to characterise and draw attention to the scope of the breakage in a prompt manner, well before the change lands in a release.

Yes, and it is important that we see the forest for the trees. However, I do feel it is hard to discuss this matter in the abstract. After all, in the abstract we all agree that stability is a Good Thing. However, many things can be meant by stability:

  1. interface stability of the ghc library
  2. stability of GHC’s binary distributions
  3. stability of GHC’s command-line interface
  4. stability of primops
  5. stability of the GHC/Haskell language
  6. stability of base
  7. stability of boot libraries

My sense is that the present discussion is centering around (5), (6), and (7). However, in each of these areas we have bodies outside of GHC’s maintainers which rightly oversee these changes: the GHC Steering Committee oversees (5), whereas (6) is managed by the CLC, and (7) by the boot libraries’ respective maintainers. For the most part, GHC’s maintainers are simply the mechanism by which these bodies’ policies are realized.

Are there exceptions to this? Certainly: the change in representation of Int16 and friends is one for which GHC, and GHC alone, was responsible, and a step taken only to fix a soundness issue affecting AArch64/Darwin, which was quickly gaining adoption. Likewise we have historically considered some modules in base's GHC.* namespace to be internal to GHC (something we are discussing addressing at the moment).

However, this is the exception, not the rule. Currently all of us — the GHC maintainers, the GHC Steering Committee, and the Core Libraries Committee — are making the decisions we do because we believe that they are the correct decisions in the particular cases we are considering. However, there is clearly a good number of people who disagree with these outcomes. This is why I think that @tomjaguarpaw is absolutely correct: one of the most important things we can do is collect and consider decisions made in the past. This gives us a basis for discussion and understanding, and allows us to distill lessons for the future.

Indeed we do precisely this with head.hackage. However, this will only catch future breakage; it does little to help us understand what we can learn from the past decisions which have given rise to the stability concerns expressed in this thread.

6 Likes

@bgamari I think what we (GHC developers, GHC Steering Committee, CLC) historically fail to recognise is that stability is not just a technical question but also a social one. It’s too easy to take a defensive stance (“this is a bug fix, not a breaking change”, “there is no such thing as absolute stability”, “GADTs are an experimental feature”) or come up with a plausible technical justification for why the breakage is for the greater good. I do exactly the same in my CLC hat.

Things however change when you are at the receiving end. At some point it does not matter how well motivated the change was or how many papers and committees support it. Enough is enough.

10 Likes

Yes, I’d like to lend my support to @bodigrim’s post, and add that I don’t think we will ever come up with a satisfactory formal definition of “stability” or “internal”. I wonder if that’s why it’s proving to be such a sticky issue in the Haskell world. We have a lot of world experts in expressing complex things formally, and we have a tendency to apply the same techniques to things where they don’t work so well.

6 Likes

What does this mean? Are you arguing for an absolute standard because a relative one sometimes gets it wrong? I am not sure what is being advocated here, though I do recognize the sentiment.

1 Like

Over in OCaml-land, there is a compiler flag -principal which says whether or not the compiler should be principled in its type inference. (The flag name is about whether or not the compiler infers “principal types” – a concept from type theory – but the way most users experience this is that -principal type inference is more predictable, and hence principled.)

Maybe we can have something similar in GHC?

That is, the example of TypeFamilies is a good one. There are aspects of that feature of GHC which are more reliable than others. For example, if I have type instance F Int = Bool, and I see F Int in my code, I can be confident it means the same as Bool. But if I have G1 [x] = Double and G2 x = [G2 x] and I wish to work with G1 (G2 Bool), then it’s much subtler to determine that G1 (G2 Bool) should reduce to Double. So part of the feature is nice, and part is not so nice.
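
Spelling out the definitions this example assumes (a sketch only; the G2 equation needs UndecidableInstances precisely because it never terminates):

    {-# LANGUAGE TypeFamilies, UndecidableInstances #-}

    type family F a
    type instance F Int = Bool
    -- Reducing F Int is a single, predictable step: F Int is just Bool.

    type family G1 a
    type instance G1 [x] = Double

    type family G2 a
    type instance G2 x = [G2 x]   -- never reduces to anything finite
    -- Seeing that G1 (G2 Bool) should be Double means unfolding G2 Bool exactly
    -- one step, to [G2 Bool], so that G1's [x] pattern matches; unfolding the
    -- argument eagerly instead never terminates (GHC would give up at its
    -- reduction-depth limit).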

I think we GHC developers might be able to differentiate these cases – at least on a best-effort basis. (OCaml has been more principled in its application of -principal than I think GHC can realistically be.) That is, there are many places in GHC where we try the “obvious” thing, it doesn’t work, and so we do something clever that happens to accept the handful of examples we have in front of us. We like to convince ourselves that we’re coming up with a general solution that will address all likely examples that will come up in practice, but experience suggests we’re actually just rationalizing accepting the few examples we’ve been creative enough to write down. Most of the time, we know when we’re playing this game – it’s usually evidenced by so-called Wrinkles (example) in the code base. So maybe we could have a warning (off by default!) that bleats whenever we know that GHC is going outside its usual envelope?

This is something that would grow over time and would always be a partial solution, so maybe it’s just not fit for purpose. But my gut feeling is that, if we did this well, most of the type-checker changes that occur release-to-release would affect only code that uses these wrinkles. Unfortunately, the warnings produced would generally be unhelpful – something along the lines of GHC had to think hard to accept your program; your mileage -- and stability -- may vary. Doing better would be very hard, in general, and trying to do better would likely demotivate the GHC authors from adding new cases for the warning. But maybe the opaque warning (maybe with some lines highlighted) is enough?

3 Likes

Likewise we have historically considered some modules in base's GHC.* namespace to be internal to GHC (something we are discussing addressing at the moment).

I really feel like all the drama here is a matter of unclear social signalling. I am very thrilled to see more people get excited about a ghc-base vs base split to help with library signalling [and the ball is squarely in my court to go edit that proposal!].

Likewise, things like -X.... and -fglasgow-exts make perfect sense in theory, but the fact that the stable thing they refer to (the Haskell Report) does not have a position in our community like stable Rust (compare Rust’s unstable features) or standard C++ (compare -std=gnu++17) means that people have all sorts of diverging interpretations.

Either we revive the Haskell Report (the C/C++ route) or we make a GHC-specific notion of stability (as @rae and others earlier in this thread suggest), and then…problem solved!

All of this would “restore the social contract”, which means we could fix bugs without acrimony, which is good.

5 Likes

Instead of attempting to define what the “stability” of a language extension (or feature) should be, perhaps a simpler heuristic could be based on “conflict counts”: the number of other extensions it conflicts with. An extension which conflicts with many others would then have an adverse “ranking”, alerting potential users to the possibility of future difficulties.

The problem is that “conflict count” isn’t something that we already know but haven’t yet written down. In many cases we don’t know that a “conflict” exists without hard thought and a bit of luck. There are countless ways in which extensions might interact and many of these live in the darker corners of the language. Of these potential interactions only some are problematic; seeing which those are can be a tricky exercise.

2 Likes