Most of the points in this document were first discussed with @bgamari in a meeting.
This document is essentially a refinement of those points with additional context.
Some actions have already been taken, so we want to use this thread to track
progress and also inform the public.
Although the title is very broad, we focus on release and bindist issues here, because those are the most impactful at the moment.
## Context
The tooling ecosystem today consists of at least the following major components:
1. GHC
2. Cabal
3. Stack
4. HLS
5. GHCup
6. VSCode Haskell extension
The end-user experience depends on how well these tools interact and synchronize
releases/updates.
Although the tooling ecosystem has improved, we frequently get user reports about issues
that require deep knowledge of existing bugs, known workarounds or how these
tools actually interact, which breaks the plug-and-play experience.
## Problem statement
Most of the tools mentioned above have different maintainer teams, with
some overlaps and individuals that regularly contribute across them.
We present here a dependency table between those tools:
| | **GHC** | **Cabal** | **Stack** | **HLS** | **GHCup** | **VSCode Haskell** |
|--------------------|---------|-----------|-----------|---------|--------------------|--------------------|
| **GHC** | - | (hadrian) | - | - | (CI) | - |
| **Cabal** | X | - | - | - | (GHC installation) | - |
| **Stack** | X | - | - | - | (GHC installation) | - |
| **HLS** | X | X | X | - | X | - |
| **GHCup** | - | X | X | X | X | - |
| **VSCode Haskell** | X | X | X | X | X | - |
This gives a superficial overview of how intertwined the tools are.
The end user often doesn't know these details if they follow the
installation instructions of the VSCode Haskell extension or GHCup.
Frequently raised issues include:
1. GHC does not work on their system
   - missing bindists for minor platforms
   - issues with darwin M1 or windows
   - GHCup or stack install incompatible bindists on their system
2. HLS does not work on their system
   - missing binaries for a particular GHC version
   - GHCup installs incompatible bindists on their system
3. Releases are out of sync
   - GHCup or stack lag behind GHC releases
   - HLS lags considerably behind GHC releases
The reasons for these issues are multi-faceted. Some of them are not of a technical
nature.
I want to focus on a couple of interactions only and provide evidence of problems:
1. GHC and the downstream consumers GHCup and stack that consume the bindists
2. HLS and all other tools, because it's the most challenging to release
### GHC bindists
GHC bindists are consumed mainly by GHCup and by stack:
* https://github.com/haskell/ghcup-metadata/blob/c42bb4a2ffb681a372210ff29b0e00939e281080/ghcup-0.0.7.yaml#L160
* https://github.com/commercialhaskell/stackage-content/blob/8a469c0ff9e5061d0e6753bee09a71bf42513184/stack/stack-setup-2.yaml#L87
Both these tools provide means of installing GHC versions by providing a unique mapping.
GHCup uses a distro+distroversion mapping, while stack uses platform+library-sonames as index.
Both have their own caveats.
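
To make the difference between the two indexing schemes concrete, here is a minimal Python sketch. The index contents, key names and `example.invalid` URLs are hypothetical placeholders and do not reproduce the real ghcup-metadata or stack-setup schema; the sketch only shows the shape of a distro-based versus a soname-based lookup.

```python
from typing import Optional

# Hypothetical distro-based index in the style GHCup uses:
# (distro, distro version) -> bindist URL. A version of None is the
# fallback entry for every other release of that distro.
DISTRO_INDEX = {
    ("Ubuntu", "20.04"): "https://example.invalid/ghc-x86_64-ubuntu20.04.tar.xz",
    ("Ubuntu", None): "https://example.invalid/ghc-x86_64-ubuntu18.04.tar.xz",
    ("Alpine", None): "https://example.invalid/ghc-x86_64-alpine.tar.xz",
}

# Hypothetical soname-based index in the style stack uses: entries are
# tried in order, and the first one whose library exists on the host wins.
SONAME_INDEX = [
    ("linux64", "libtinfo.so.6", "https://example.invalid/ghc-x86_64-tinfo6.tar.xz"),
    ("linux64", "libtinfo.so.5", "https://example.invalid/ghc-x86_64-tinfo5.tar.xz"),
]


def lookup_by_distro(distro: str, version: str) -> Optional[str]:
    """Prefer an exact (distro, version) match, then the distro fallback."""
    exact = DISTRO_INDEX.get((distro, version))
    if exact is not None:
        return exact
    return DISTRO_INDEX.get((distro, None))


def lookup_by_soname(platform: str, host_sonames: set) -> Optional[str]:
    """Return the first bindist whose required soname exists on the host."""
    for entry_platform, soname, url in SONAME_INDEX:
        if entry_platform == platform and soname in host_sonames:
            return url
    return None
```

In this sketch, a distro that is missing from the index gets no bindist at all, and a newer release of a known distro silently falls back to an older build. The soname lookup does not care about the distro, but it also says nothing about other properties of the host such as its libc version.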
When new releases are announced, the GHCup and stack maintainers investigate the available
bindists GHC has provided with the release and decide on how to update their mappings,
so that users of any platform and linux distro get a working bindist.
The main issue here is that neither stack nor GHCup maintainers have direct influence on
the decisions about which bindists are created and how, although they have a high stake in them.
They are consulted or informed only occasionally, not consistently on every change,
and are not required to sign off on any non-trivial changes.
Rather, they are downstream consumers that, most of the time, have to deal with
the end result post-release. There are rarely testing or RC phases for bindists and
GHC maintainers are reluctant to update bindists when they turn out to have issues (this
happened only once when profiling libs were missing).
As a consequence, GHCup maintainers have had to invest a considerable amount of work
into curating GHC bindists. These efforts can be seen here: https://downloads.haskell.org/ghcup/unofficial-bindists/ghc
The reasons so far included:
* adding missing bindists for various alpine versions
* adding missing bindists for FreeBSD (which is now discontinued and no binaries are built anymore)
* fixing packaging bugs
  - most prominent is the [DESTDIR bug](https://gitlab.haskell.org/ghc/ghc/-/issues/19646), which caused GHCup maintainers to fix 32 bindists manually (unpack, patch, pack, upload, change/update ghcup metadata)
The most recent disruption was an update of the Fedora bindist, which was regarded by
both GHCup and stack maintainers as the linux bindist with the highest portability. When
the CI configuration was changed to use Fedora 33 instead of 27 to build the binaries,
the portability assumption broke because the resulting binaries required a GLIBC version
too recent for many systems, and no testing caught it.
It is still causing disruption today:
* https://gitlab.haskell.org/ghc/ghc/-/issues/22268
* https://github.com/haskell/haskell-language-server/issues/3367
* https://github.com/haskell/haskell-language-server/issues/3247
* https://github.com/commercialhaskell/stackage-content/pull/117
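
The portability problem above comes down to a version comparison: a dynamically linked binary only loads if the host's glibc is at least as new as the one it was linked against. The following Python sketch shows the kind of check that could run in testing. The version numbers and the host and bindist names are assumed for illustration and are not taken from the linked issues.

```python
def parse_version(text: str) -> tuple:
    """Turn "2.32" into (2, 32) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in text.split("."))


def can_run(host_glibc: str, required_glibc: str) -> bool:
    """A dynamically linked binary needs a host glibc at least as new as
    the newest glibc version it was linked against."""
    return parse_version(host_glibc) >= parse_version(required_glibc)


# Illustrative values only (assumed, not taken from the linked issues).
BINDIST_REQUIRES = {"built-on-older-distro": "2.26", "built-on-newer-distro": "2.32"}
HOSTS = {"older-host": "2.27", "newer-host": "2.35"}


def compatibility_matrix() -> dict:
    """Map each (host, bindist) pair to whether the bindist would load."""
    return {
        (host, bindist): can_run(host_version, required_version)
        for host, host_version in HOSTS.items()
        for bindist, required_version in BINDIST_REQUIRES.items()
    }
```

With these assumed values, the bindist built on the older distro runs on both hosts, while the one built on the newer distro fails on the older host. That is the pattern a "most portable bindist" assumption depends on.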
### GHC releases
GHC releases have become more frequent in recent times and have increased
pressure on tooling to keep up and support these new versions.
Until recently, the GHC release process didn't include many downstream consumers
and there was no explicit communication. Some release candidates existed only
for major versions.
Tooling maintainers would have to follow the issue tracker and the mailing list closely
to be up to date with issues that are relevant to them.
As such, releases tended to overwhelm not only GHCup, but especially HLS maintainers, where
supporting new GHC versions is non-trivial.
### HLS releases
Releasing HLS is non-trivial and includes:
1. updating the codebase to support a new GHC version if one was just released
2. fixing plugins, pinging their maintainers or disabling them when they stopped working
3. dealing with the complex release CI, which takes many hours to run
In terms of dependencies, the cycle of releasing a new HLS version may look like this:
1. new GHC version gets released
2. HLS needs to wait for GHCup to support the new version, because GHCup is needed to build the binaries for the correct platforms (more on ABI compatibility later)
3. HLS can start its release process, deal with CI, then publish a release
4. now GHCup metadata needs to be updated as well, for the new HLS version to land
Issues with GHC bindists will bubble up to HLS and impact the end-user experience of someone
who just installed the Haskell VSCode extension and doesn't know anything about the tooling
(see the GLIBC issues described above).
So far, this pipeline is quite separated with no tight coordination. As such, support for
GHC 9.2.5 took a long time to finally land in HLS 1.9.0.0, which has been impacting
users as well.
One major reason these things became more complicated is that HLS had to switch to
dynamically linked bindists in order to avoid issues with the internal linker and TH, see:
* https://github.com/haskell/haskell-language-server/issues/2000
* https://gitlab.haskell.org/ghc/ghc/-/issues/20628
## Ways to improve
The main goal of the suggested improvements is that the end user never knows
what is going on behind the scenes to make the tools work smoothly with each other
and have releases that are tightly in sync. The solutions are not self-serving
and should be rejected or reconsidered if they don't bring us closer to this goal.
### Binary distribution metadata
https://gitlab.haskell.org/ghc/ghc/-/issues/22686
GHC maintainers should maintain a metadata file that describes each bindist (in e.g. json or yaml) and includes at least the following information:
* package versions in bindist package database (addressing #21996)
* operating system the bindist was built on
* `gmp`, `ncurses`, `libc` versions
* perhaps `gcc` and `binutils` versions
* build configuration (e.g. flavour)
* `ghc --info`
The idea behind this is that all changes to bindists are at least clearly visible post-release and can be consumed by tools, linters or just diffed by a human to get an overview.
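
As a rough illustration of the "diffed by a human or a tool" idea, the Python sketch below compares two metadata records. The field names loosely follow the list above but are hypothetical, and all values are made up for illustration; no actual metadata format is defined yet.

```python
# Hypothetical metadata records for two builds of the "same" bindist.
# Field names and values are assumptions made for this example.
OLD = {
    "build_os": "Fedora 27",
    "libc": "glibc 2.26",
    "gmp": "6.1.2",
    "ncurses": "6.0",
    "flavour": "release",
    "packages": {"base": "4.17.0.0", "Cabal": "3.8.1.0"},
}
NEW = dict(OLD, build_os="Fedora 33", libc="glibc 2.32")


def diff_metadata(old: dict, new: dict) -> dict:
    """Return {field: (old value, new value)} for every field that differs.

    A field present on only one side is reported with None on the other.
    """
    changes = {}
    for key in sorted(old.keys() | new.keys()):
        if old.get(key) != new.get(key):
            changes[key] = (old.get(key), new.get(key))
    return changes


if __name__ == "__main__":
    for field, (before, after) in diff_metadata(OLD, NEW).items():
        print(f"{field}: {before} -> {after}")
```

A downstream tool could run such a diff when a release is announced and flag any change in the build operating system or libc before updating its bindist mappings.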
### Synchronous release meetings
However, the previous point does not solve communication issues and does not prevent mistakes pre-release.
To achieve that, I propose to hold synchronous meetings 1-2 weeks prior to a release to discuss:
1. changes that affect tooling maintainers (cabal, HLS, ghcup, stack), especially any changes to bindists or build system
2. getting last input from tooling maintainers (e.g. outstanding bugs that are still affecting them, but are not regressions)
3. sharing current state of most prominent issues that affect end-users (on any level of tooling) and evaluating whether the upcoming release can make a difference
4. getting in-person updates about things that affect each other, without expecting everyone to follow every single bug tracker and mailing list of every other tool
This should make it possible to catch issues before a release and make maintainers aware of upcoming issues.
### Making it feasible to change bindist mappings
A bindist mapping is a mapping of a platform (e.g. Ubuntu 20.04) to a binary distribution of a GHC version. At the moment, these are the problems when changing bindist mapping in ghcup after they've already been released (mutating them, so to speak):
1. cabal indexes the store by ghc version only (not ABI) and the boot packages in the store metadata are generally not hashed. This can lead to miscompilations and has a large chance to break CIs. Cabal would either have to index the store by GHC ABI or somehow calculate hashes for boot packages as well. Also see https://github.com/haskell/cabal/issues/8114.
2. Stack uses `ghc --info` to index package databases, which is a little more robust, but also fallible.
3. When an HLS version is already released, changing mappings requires a new HLS release as well, to prevent it from crashing on startup with "ABI mismatch" errors. Even if new binaries are out, users who don't do a fully consistent update may still face such issues.
4. ghcup itself currently has no concept of "revisions" (that is: a new packaged version of the same tool version, e.g. GHC 9.4.4-r1). The reason is that this is hard in terms of usability and design space. Also see https://github.com/haskell/ghcup-hs/issues/361
@bgamari opines that the right solution to (1) is for boot packages to have proper ABI hashes computed by Hadrian. These could then be factored in to the unit ids of a package's dependencies. Some refactoring of GHC's internal tracking of interface hashes may be necessary for this (see https://gitlab.haskell.org/ghc/ghc/-/issues/20810). See also https://gitlab.haskell.org/ghc/ghc/-/issues/21590.
This may be a major effort that we need to find a contractor for.
It's the main reason why GHCup maintainers have to be so paranoid about bindists being correct and often leads to bugfixes having to wait for the next GHC release (as is the case right now... Fedora 27 bindists were added back to GHC release CI, but tooling maintainers were not aware and ghcup and stack metadata was already released).
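
The store problem from point (1) can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the key formats, function names and ABI hash strings are hypothetical and do not reflect how cabal actually lays out its store or how Hadrian would compute hashes.

```python
import hashlib


def store_key_by_version(ghc_version: str, boot_abis: dict) -> str:
    """Model of a store keyed by GHC version only; boot package ABIs are ignored."""
    return f"ghc-{ghc_version}"


def store_key_by_abi(ghc_version: str, boot_abis: dict) -> str:
    """Model of a store keyed by GHC version plus a digest of boot package ABIs."""
    digest = hashlib.sha256()
    # Sort by package name so the key does not depend on dict ordering.
    for name in sorted(boot_abis):
        digest.update(f"{name}={boot_abis[name]}\n".encode("utf-8"))
    return f"ghc-{ghc_version}-{digest.hexdigest()[:12]}"


# Two bindists of the same GHC version whose boot packages differ in ABI
# (made-up hash strings, for illustration only).
ORIGINAL_BINDIST = {"base": "abi-aaaa", "ghc-prim": "abi-bbbb"}
REPLACED_BINDIST = {"base": "abi-cccc", "ghc-prim": "abi-bbbb"}
```

In this model, `store_key_by_version` returns the same key for both bindists, so packages compiled against the original bindist would be reused after the mapping changes. `store_key_by_abi` yields distinct keys, which is the effect that indexing by ABI or hashing boot packages is meant to achieve.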
### Developing a holistic tooling release workflow
The current release workflow can benefit from more automation as well as more cross-checking between maintainers, e.g.:
1. GHC automatically sends PRs to ghcup-metadata repo on releases
2. significant ghcup-metadata changes (adding of new tool versions) are propagated in some way that can be subscribed to
3. HLS automatically sends PRs to ghcup-metadata repo on releases
4. there is a cross-maintainer sign-off policy, where
   - significant PRs in GHC that affect bindists or build system need approval by stack/ghcup/HLS maintainers
   - ghcup-metadata PRs need approval by the respective tool maintainer (GHC, stack, HLS, cabal)
   - HLS releases need final sign-off by HLS and ghcup maintainers before publish
5. Document the workflow sufficiently and describe current decisions that are in effect (e.g. why dynamically linked HLS is used)
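
The sign-off policy in point 4 could be encoded as a simple check, for example in a CI job or a bot. The Python sketch below is one possible reading: the change kinds and team names are hypothetical labels, and it assumes that every listed team must approve, which the policy above leaves open.

```python
# Hypothetical change kinds mapped to the teams whose approval is required.
# Assumption: all listed teams must approve, not just one of them.
REQUIRED_SIGNOFFS = {
    "ghc-bindist-or-build-system": {"stack", "ghcup", "hls"},
    "hls-release": {"hls", "ghcup"},
}


def required_approvers(kind: str, tool: str = "") -> set:
    """Return the teams that must sign off on a change of the given kind.

    For "ghcup-metadata" changes, the approver is the maintainer team of
    the tool whose metadata is being changed, passed in as `tool`.
    """
    if kind == "ghcup-metadata":
        if not tool:
            raise ValueError("ghcup-metadata changes must name the affected tool")
        return {tool}
    return set(REQUIRED_SIGNOFFS[kind])


def missing_signoffs(kind: str, approvals: set, tool: str = "") -> set:
    """Return the teams that still need to approve; empty means ready."""
    return required_approvers(kind, tool) - set(approvals)
```

For example, `missing_signoffs("hls-release", {"hls"})` returns `{"ghcup"}`, which a release job could use to block publishing until the remaining team has signed off.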
This would synergize with the synchronous GHC release meetings and may be expanded. E.g. currently, we have monthly HLS meetings and @mpickering is already attending them.
The idea behind this is to make the tool boundaries a little less strict, take shared responsibility, and make it possible to discuss the end-user experience regularly and work towards shared solutions (relevant for the next section).
### Creating a shared project for tooling
I suggest setting shared goals for the entire tooling suite that focus on the end user
and then creating a project on GitHub or an epic on GitLab to track the progress. The reasoning behind this
is that a lot of issues (like GHC ABI) affect multiple or all tools, and tracking progress, discussions etc.
across multiple issue trackers is cumbersome and causes poor visibility of high priority issues.
Issues living in this project should naturally have high priority for all designated maintainer teams.
This could be combined with a bounty system backed by the Haskell Foundation.
### Decreasing bus factor across critical tooling
Many of the mentioned issues exist because all tools are short of maintainers and only
some tools have paid contributors.
GHCup has only one maintainer, who is not paid.
Not many people are able to do HLS releases and understand all the existing issues.
Cabal is understaffed too.
If we develop no strategy for dealing with the bus factors, all efforts to improve the situation may eventually regress when knowledge and engagement get lost.
### Open question: outsourcing of GHC bindist creation
There has been some discussion whether outsourcing of bindist creation (e.g. for esoteric linux distros or tier3 platforms) would be an option too.
Practically, this is already happening partly in GHCup, but the idea was that GHCup would rely completely on its own bindists.
This is more of an open thought than an actual proposal at the moment.
## Closing thoughts
As can be seen, not all issues here are technical and it has been my impression that we've focussed too much on that part.
Automation and CI are all great. But they don't ultimately solve the coordination issue.
It can be debated whether we would actually need a proper release coordinator who works across tooling boundaries.
Right now, I'm not sure what that would look like. I propose to try the above improvements first and maybe some of the friction will already be gone.
There are also more topics to be covered, e.g. how to unify the GHC installation methods to cause less friction with HLS+stack. But I feel that is for another day.