Opportunity for improving GHC: fixing some spurious failures in CI

I was asked about spurious failures in GHC CI, and figured the answer was worth a public post. There are three frequent causes of spurious failure that someone with a bit of experience could take a look at. I would be happy to assist.

One prevalent cause may have been figured out last week, as discussed on this thread, but the fix hasn’t been applied yet. It belongs in the ghc/ci-config repo, if anyone wants to take a crack at it. As I said, I would be happy to assist. Otherwise I’ll get to it myself when I get back to spurious failures.

Excitingly, there seems to be a bug in ghc-pkg that someone could try to fix. There’s an issue where Matthew Pickering has analyzed it and potentially found a solution: #22870.

Another exciting source of failures is potentially a cabal or hadrian segfault on aarch64, tracked at #22408.

In general, I have identified the specific spurious failures seen on this dashboard:

https://grafana.gitlab.haskell.org/d/167r9v6nk/ci-spurious-failures?orgId=2&var-types=All&var-jobs=All&var-runners=All&from=now-30d&to=now&refresh=15m

However, I never finished going through all reports of possible spurious failures, so there may be others.
Many of these issues have to do with the self-hosted GitLab server and CI runners, which is why it might be a good idea for me to focus on that sysadmin work sometime in the future.
