Language, library, and compiler stability (moved from GHC 9.6 Migration guide)

I think this discussion points me to a different conclusion. GHC has at times really highlighted that features are in some way experimental or have rough edges, and certainly people have over time encountered those rough edges themselves and learned the hard way. I think the issue is that many of the developers of libraries in Haskell, including real, intended-for-production libraries, are by nature early adopters who are very excited about getting and using all the new features, and rush to try them out.

So I doubt that putting these features behind a further experimental or -f-i-solemnly-swear-i-invoke-experimental-features flag or whatever would make much difference – the same folks that enthusiastically rush to make use of InvertedGeneralizedFamilyRecordTypes or whatever will still rush to do so even with more signposting.

(And honestly, this is something I tend to think is a great feature of the Haskell world, culturally, and I don’t necessarily wish to discourage it – it is just that when one is at the edges of what is new, then there is necessarily going to be more maintenance as those things develop and firm up).

4 Likes

I think the issue here would be that the vast majority of Hackage would have -fexperimental turned on anyway. It’s the nature of the Haskell community at large to want new shiny things. Of course it might be useful to have libraries with huge sets of reverse dependencies opt out of ‘experimental’ features (e.g. aeson, bytestring, text), but AFAICT those don’t tend to be the ones that break anyway when a new version of GHC is out. (Well, they might break internally, but nothing that forces a change in the external API.)

We can see this writ large with how macros in Scala 2.x played out. They were very clearly marked experimental, but they were just so incredibly useful (in spite of all the flaws) that libraries that are used nearly everywhere came to fundamentally rely on them. This left a huge porting and/or rewriting effort when migrating from Scala 2.x to Scala 3.x. I don’t see Haskell being much different in that regard.

(I don’t have any solutions to offer, I’m afraid. It’s an extremely hard problem. I do think industrial users could make a significant contribution here if they were so inclined by throwing more money at the problem, tho.)

3 Likes

This might be true. However, if the compiler rejected those packages as experimental (when told not to use experimental code), policies might make those packages simply ineligible for use in production in some places.

Enforcing a policy that says the codebase (including dependencies) must compile with -fno-experimental (ideally via a stable compiler rather than a runtime flag) would be fairly easy for “industrial users”. Especially if that comes with a guarantee that there is no breakage.

I am starting to have serious doubts about this.


I really hope choosing Haskell doesn’t mean: either you accept eternal breakage, write every piece of library code yourself, or use hackage.

I also have serious doubts that many maintainers enjoy the constant churn to make their code compile with the latest GHC release (even if that just means figuring out which of the libraries among their dependencies broke and needs to be updated now, and thus a new release needs to be cut).

I am skeptical that this is a viable option. As I say above, it’s not at all clear (to me, at least) what “experimental” means. There is a continuum of possible degrees of stability and historically Haskellers have been eager to use features which lie pretty far from the “stable” side of this continuum.

To continue with the TypeFamilies example used above: would you give up use of Generics, servant, lens, and vector just to gain a small degree of (likely only theoretical) “stability”? I suspect that most commercial users would not.
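To make that concrete: even entirely pedestrian code picks up type families through deriving. Here is a minimal sketch (module and names are made up) of how a plain `deriving Generic` already commits you to the associated type family `Rep`, which generics-driven libraries such as aeson compute over:

```haskell
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeFamilies  #-}

-- Hypothetical module; the point is only that `deriving Generic` brings
-- the associated type family Rep into play, so "avoiding type families"
-- would also mean avoiding generics-based deriving.
module PointExample where

import GHC.Generics (Generic (..))

data Point = Point { px :: Int, py :: Int }
  deriving (Generic)

-- Rep Point is a type-family application; generic encoders/decoders
-- (e.g. aeson's generic JSON instances) are defined by recursion over it.
type PointRep = Rep Point
```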

@angerman, if you have a list handy, it would be great to know which compiler/language changes have been the largest offenders in terms of upgrade cost. I don’t doubt that there are things we could do here, but it’s hard to know what they are without making the discussion a bit more concrete.

2 Likes

I doubt anyone wants to prevent Haskell pushing the boundaries of what we know.

This is not the stable/experimental distinction I would make. Stable would mean: we consider this feature to add enough value to commit to putting effort into proper deprecation notices and cycles when it becomes necessary to make changes.

Again, I am not against changing and breaking stuff as long as it comes with proper deprecation warnings and lead times. Stable does not necessarily have to mean: this will never change in all eternity. It (to me) just means: this feature is ready for production and changes to it will be preceded by deprecation warnings.
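For what it’s worth, the library-level version of this already exists, and it is the style of change I mean. A minimal sketch with made-up names: the old entry point keeps compiling, but every use site is warned well before the removal actually lands.

```haskell
-- Sketch only; oldApi/newApi are hypothetical names.
module Sketch (newApi, oldApi) where

newApi :: Int -> Int
newApi = (* 2)

-- The old name still works, but emits a warning at every use site,
-- giving downstream maintainers a release or two to migrate.
{-# DEPRECATED oldApi "Use newApi instead; oldApi will be removed in a future major release" #-}
oldApi :: Int -> Int
oldApi = newApi
```

Something analogous at the language level (for example, warn in release N and only change behaviour a couple of releases later) is all that “deprecation cycle” means here.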

6 Likes

This is not the stable/experimental distinction I would make. Stable would mean: we consider this feature to add enough value to commit to putting effort into proper deprecation notices and cycles when it becomes necessary to make changes.

The problem is that we need to have a specification to say what “change” means. Many of the language extensions in common use today have no such specification; in many cases this is precisely why they haven’t been folded into the Haskell Report.

To take the GADT record example above, the previous behavior was neither intended nor documented. It was rather an emergent behavior resulting from the interaction of one loosely-specified feature (GADT record update) with another imprecisely-specified feature (type families). The specification we have, the Haskell Report, stipulated that programs of the form mentioned in the migration guide should be rejected, making the fact that they were previously accepted a bug.

5 Likes

To elaborate a bit here:

Would it have been technically possible to continue accepting these programs (presumably with a warning and deprecation cycle)? Very likely, yes.

However, we are very reluctant to start letting implementation bugs supersede specification. After all, what if, when adding logic to accept these programs, we introduced yet another bug? Are we also now obligated to emulate this bug, with a deprecation cycle? Where does this end? How do we manage the technical complexity of validating these layers of bug-emulations? Avoiding this is why we specify behavior and reserve the right to change behavior which is not specified (e.g. the GADT update behavior noted in the migration guide).

Now, this is not to say that we will never add deprecation cycles around standards-non-compliant behavior. If the user-facing impact were high enough in this case, we could have done so here. However, we judged (using empirical evidence provided by head.hackage) that this particular issue was of sufficiently low impact that users would be better served by our time being spent elsewhere. As with most things in engineering, this is a trade-off that must be struck on a case-by-case basis.

6 Likes

I don’t follow. GHC is pretty much the de-facto Haskell compiler. We all write GHC Haskell today. I doubt many read the spec and write code according to spec (if there is one) only to be surprised that GHC doesn’t accept their code. We write code to be accepted by GHC. Thus if GHC starts rejecting code, whether or not that is technically a bug, it is a user-facing change.

head.hackage is pretty much the database of likely known “changes”.

Anything that compiles (warning-free, and with a hypothetical -fno-experimental) with GHC N-2 but fails to compile with GHC N-1 or GHC N is a change.

I think that collecting examples of these would be one of the highest-value activities we could do around improving the stability story.

3 Likes

The role of a specification is not solely to inform users of what behavior to expect; it is equally (if not more) important as a tool for implementors to judge what behavior is correct. Without a specification, it is impossible to say whether a particular behavior is self-consistent, interacts poorly with other behaviors, or is implemented correctly.

This is why language extensions tend to change: we cannot predict all interactions without a thorough specification (and even with one it is quite hard). Consequently, we may later notice (often by chance) an internal inconsistency. We generally try to fix such inconsistencies by appealing to whatever existing specification applies, since even under ideal conditions developing a language implementation is hard; doing so without a clear specification is significantly harder.

Yes, this is why I’m asking for concrete examples; when I look at head.hackage I see very few changes that stem from language or compiler changes. The vast majority of patches are due to:

  • changes in base
  • changes in template-haskell
  • changes in the boot libraries

Changes in the surface language itself are quite rare. The two notable exceptions that I can think of are:

  1. simplified subsumption, which we reverted
  2. 9.2’s change in representation of boxed sized integer types, which was sadly necessary for correctness on AArch64/Darwin.

As @tomjaguarpaw says, I think starting to collect a list of high-cost past changes would be an incredibly useful exercise. We can’t know how to improve until we know what we have done poorly in the past. Certainly the examples above should be on this list but surely there are others as well.

7 Likes

I know I’m a tiny minority, but I try to write code that will be accepted by either GHC or Hugs. This includes features way beyond what’s documented in Haskell 2010.

Yeah, that would be nice: overlapping instances are poorly spec’d, especially how they interact with FunDeps. There are significant differences between what code the two compilers accept. GHC is too liberal (you can get incoherence); Hugs is too strict; and nowhere explains the differences or why.

1 Like

Let me suggest an extreme example, as a thought experiment. Suppose that a new feature, say, FrobnicateTypes, allowed users to write unsafeCoerce. And now suppose a check were added to GHC to disallow precisely those cases where users were writing unsafeCoerce. Would we want a deprecation cycle specifically for people who made use of what seems to me an evident soundness bug in the feature? I think we would not, and that suggests to me that there’s a continuum between “bug-like” and “unintended-feature-like” which requires a little care in evaluating in each specific case.
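To make the thought experiment concrete (FrobnicateTypes is of course invented, but unsafeCoerce is real), this is roughly what such a soundness hole amounts to in practice:

```haskell
-- Illustration only: what "the feature lets you write unsafeCoerce" means.
import Unsafe.Coerce (unsafeCoerce)

notReallyABool :: Bool
notReallyABool = unsafeCoerce (42 :: Int)  -- typechecks, but the value is meaningless

main :: IO ()
main = print notReallyABool
-- Undefined behaviour: depending on the runtime representation this may
-- print an arbitrary Bool, print garbage, or crash.
```

Code that only works because of such a hole is exactly the kind of thing I would not want to keep accepting through a deprecation cycle.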

2 Likes

Should this warn loudly? Yes!
Should this abruptly break code? No.

1 Like

Let’s spell this out so we are all on the same page.

Yes, the thesis is that any changes that cause GHC to reject code that would previously be accepted need to have a “deprecation cycle” (for lack of a better way to call this).

I agree that there is a continuum, and I agree that this can still be best effort. As @bgamari points out, since there’s no specification of what the behaviour is supposed to be, it’s hard to know if we are changing it by accident; but we also have Hackage with its ~17k packages, which can be used as unit tests. So, once the intention is clear, best effort can be a good enough effort.

1 Like

I would be surprised if there were any language implementation which would deal with a practically-observable unsoundness bug by mere warning. For instance, I know that Rust routinely fixes soundness issues (and even less-critical errors) in minor releases via breaking changes; examples of these can be found throughout the rustc releases of the last twelve months.

Naturally, if it is possible to fix the soundness issue while avoiding breakage (even potentially at the cost of, e.g., performance) then that is perhaps preferable. Otherwise, the compiler merely warning the user that it has just produced an executable with undefined behavior seems like a medicine (unsoundness) that is far worse than the disease (breakage).

4 Likes

Are you saying all the breakages since, say, 8.0.2 were due to unsoundness bugs or boot library changes?

I feel this discussion is getting derailed by arguing about corner cases.

Was simplified subsumption fixing an unsoundness bug? Maybe, but even if so, the programs that were accepted are not de-facto wrong. There was no surprising behavior like broken code generation that threw end users off.
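To make concrete what kind of programs those were, here is a minimal sketch (names made up) of the pattern simplified subsumption stopped accepting; nothing about it is wrong at runtime, it merely relied on deep subsumption during type checking:

```haskell
{-# LANGUAGE RankNTypes #-}

apply :: ((forall a. a -> a) -> Int) -> Bool
apply _ = True

useArg :: forall b. (b -> b) -> Int
useArg _ = 0

-- broken = apply useArg
--   Accepted by GHC <= 8.10 via deep subsumption, rejected by 9.0/9.2 as
--   originally released (later restored behind the DeepSubsumption extension).
fixed :: Bool
fixed = apply (\x -> useArg x)   -- the eta-expanded form everyone migrated to
```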

So I don’t think the view through the lens of type-system shenanigans and complex specification issues is helpful when evaluating what kinds of bug fixes end users are willing to accept without a deprecation period. Because this is a constant research area, I doubt we’ll reach a place where we can sit down and say “ok, we’re done with the theory”.

Otherwise this becomes an argument why we can break anything, all the time, because we figured out better specifications.

I feel there might be issues with understanding the POV of regular end users when dealing with all the complex compiler stuff.

2 Likes

Who says we cannot do a better job than Rust?

Since we are comparing ourselves to Rust, the discussion in the linked PR shows they used their crater tool to assess the breakage (link). They seem to have thought a lot about this in the past years (see Stability as a Deliverable | Rust Blog).

There they even say:

We reserve the right to fix compiler bugs, patch safety holes, and change type inference in ways that may occasionally require new type annotations. We do not expect any of these changes to cause headaches when upgrading Rust.

Don’t let “perfect” be the enemy of “good”.

3 Likes

I think this data could be collected by testing GHC nightlies against Hackage. As @andreabedini said, that’s our best (only?) battery of unit tests.

2 Likes

Indeed, and crater is a great idea! This is why we have developed our own tool, head.hackage, which GHC’s developers use to perform similar assessments: any GitLab merge request bearing the user-facing label has a corresponding head.hackage job.

That being said, head.hackage is no silver bullet. Ideally head.hackage would include the entirety of Stackage, mirroring how crater includes all of crates.io. However, for a variety of reasons (in part lack of hands, in part rate of breakage), this is not currently the case. Moreover, it is underutilized outside of GHC development: the GHC Steering Committee does not explicitly require that head.hackage be used to audit the impact of proposals; perhaps this is something that we could change.

Agreed! This is precisely why I am skeptical of @angerman’s point that even soundness bugs addressed by surgical fixes do not justify breakage, however small. There needs to be a trade-off here.

To clarify, my previous message was intended only as a reply to @angerman’s response to @sclv’s hypothetical. I only meant to say that merely warning for soundness bugs seemed to me like a somewhat extreme take on stability. After all, a program silently yielding the wrong answer can be far worse than a program not compiling. However, as pointed out above, this in general needs to be taken on a case-by-case basis.

I am not arguing that simplified subsumption fixed a soundness bug. However, like all language changes since ~2017, it was a change which was approved by the GHC Steering Committee in light of community feedback. Should the proposal have been approved? Perhaps not, but I can easily see why it was: it’s a technical proposal with non-obvious implications. While the proposal mentioned the sort of breakage that could be expected from the change, a head.hackage assessment was not undertaken until a few months after the proposal was accepted and implemented. This meant that the committee had little concrete evidence on which to judge the extent of the breakage.

In hindsight, the GHC developers, myself included, should have pushed back against releasing with the change when we realized the impact on head.hackage. However, we were reluctant to do so in part due to a feeling that the power to decide the merits of the change rested with the Steering Committee.

Eventually, when 9.0 started seeing real-world usage and the true scope of the problem was apparent, we proposed to reintroduce deep subsumption. This was quickly implemented and we took the rather unprecedented step of backporting it to GHC 9.0, despite GHC 9.2 being the active development branch, to minimize user impact.

Would it have been better had we realized the true extent of the break prior to release of 9.0? Certainly, and from this experience we have learned: if faced with a similar situation today, I believe we would be better equipped to characterise and draw attention to the scope of the breakage in a prompt manner, well before the change lands in a release.

Yes, and it is important that we see the forest for the trees. However, I do feel like it is hard to discuss this matter in the abstract. After all, in the abstract we all agree that stability is a Good Thing. However, many things can be meant by stability:

  1. interface stability of the ghc library
  2. stability of GHC’s binary distributions
  3. stability of GHC’s command-line interface
  4. stability of primops
  5. stability of the GHC/Haskell language
  6. stability of base
  7. stability of boot libraries

My sense is that the present discussion is centering around (5), (6), and (7). However, in each of these areas we have bodies outside of GHC’s maintainers which rightly oversee these changes: the GHC Steering Committee oversees (5), whereas (6) is managed by the CLC, and (7) by the boot libraries’ respective maintainers. For the most part, GHC’s maintainers are simply the mechanism by which these bodies’ policies are realized.

Are there exceptions to this? Certainly: the change in representation of Int16 and friends is one for which GHC, and GHC alone, was responsible, and a step taken only to fix a soundness issue affecting AArch64/Darwin, which was quickly gaining adoption. Likewise we have historically considered some modules in base's GHC.* namespace to be internal to GHC (something we are discussing addressing at the moment).
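For concreteness, that representation change mostly bit code that reaches under the boxed types. A sketch of the kind of code affected (module name hypothetical; assuming the GHC >= 9.2 primop names such as int16ToInt#):

```haskell
{-# LANGUAGE MagicHash #-}

-- Before 9.2 the I16# constructor wrapped a plain Int#; from 9.2 onwards
-- it wraps an Int16#, so unboxed pattern matches like this one needed
-- explicit conversions when upgrading.
module Int16Sketch where

import GHC.Exts (Int (I#), int16ToInt#)
import GHC.Int  (Int16 (I16#))

lowLevelToInt :: Int16 -> Int
lowLevelToInt (I16# x) = I# (int16ToInt# x)  -- GHC >= 9.2 spelling
```

Code that stayed at the level of the ordinary Int16 API was unaffected by the change.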

However, this is the exception, not the rule. Currently all of us — the GHC maintainers, GHC Steering Committee, and Core Libraries Committee — are making the decisions we do because we believe that they are the correct decisions in the particular cases we are considering. However, there is clearly a good number of people who disagree with these outcomes. This is why I think that @tomjaguarpaw is absolutely correct: one of the most important things we can do is collect and consider decisions made in the past. This gives us a basis for discussion and understanding, and allows us to distill lessons for the future.

Indeed we do precisely this with head.hackage. However, this will only catch future breakage; it does little to help us understand what we can learn from the past decisions which have given rise to the stability concerns expressed in this thread.

6 Likes

@bgamari I think what we (GHC developers, GHC Steering Committee, CLC) have historically failed to recognise is that stability is not just a technical but also a social question. It’s too easy to take a defensive stance (“this is a bug fix, not a breaking change”, “there is no such thing as absolute stability”, “GADTs are an experimental feature”) or come up with a plausible technical justification why the breakage is for the greater good. I do exactly the same with my CLC hat on.

Things, however, change when you are at the receiving end. At some point it does not matter how well motivated the change was or how many papers and committees support it. Enough is enough.

10 Likes