Haskell Foundation DevOps Weekly Log, 2023-03-22

Welcome to another weekly log!

In the last week, my time was spent on ops incidents and bug fixes:

  • Worked on new provisioning automation for GHC CI runners on Equinix Metal (ci-config !16: Draft: use zfs on ghc-ci-x86)
  • Fixed a trio of bugs in my retry bot
    • False positives for “out of disk space” failures caused by erroneously matching on legitimate test output
    • False positives for “out of memory” failures with the same root cause
    • False positives for a few types of failures because of a logic bug in my code
  • Fixed an outage of the GitLab server by restarting it. Apparently the server had run out of disk space earlier in the weekend; although the space had since been recovered, that initial outage was probably the cause of the lingering problems.
  • Identified and mitigated a couple of new spurious failure types.
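To give a flavor of the false-positive fixes above, here is a minimal sketch (not the bot's actual code) of the kind of change involved: instead of substring-searching the whole log, match only whole lines that look like they came from the runner itself, so a test that merely *prints* an error string in its expected output no longer trips the matcher. The "fatal:" prefix and the error string are illustrative assumptions, not the real patterns.

```haskell
import Data.List (isInfixOf, isPrefixOf)

-- Match only lines the runner emits (here, hypothetically, lines starting
-- with "fatal:"), not arbitrary occurrences anywhere in the job log.
looksLikeRunnerDiskError :: String -> Bool
looksLikeRunnerDiskError jobLog = any isDiskErrorLine (lines jobLog)
  where
    isDiskErrorLine l =
      "fatal:" `isPrefixOf` l && "No space left on device" `isInfixOf` l
```

A log line like `fatal: write error: No space left on device` matches, while a testsuite that merely quotes the same string in its expected output does not.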

Coming up, I intend to stop focusing on fixing these spurious failures piecemeal and start focusing on scaling up the effort. Right now, if you want to do something about a spurious failure on GitLab, you ping me on your merge request and cross your fingers that something gets done about it. I’d like to think I have a pretty good track record, but it would be a bad use of Foundation resources if I never found a way to put more power in the hands of contributors. That will be my focus for the coming months.

See ya!


What do you have in mind for scaling up this effort?


Hey, well, the first thing I need to do is describe how the system currently works (which I’ll briefly sketch out here). Then we need to finish splitting one service into its own project and making it more robust with a testsuite and some improved behavior. Finally I need to work with the people who already use the system to find out how it could give them more power. I already have one rough idea from GHC maintainers that I want to work on.

Before describing the system, recall its purpose: it exists to track and mitigate spurious failures in CI.[1]

Here’s how the system works, described as a series of steps.

  1. A GHC contributor notices a strange failure on their own code (example) and pings me.
  2. I read the job log and use my noggin to figure out what’s strange about the failure.
  3. I do some investigation: does this happen often? How often? Which jobs or platforms are involved? Is it really spurious, or is it rather just a fragile test?
  4. Based on my findings, I open tickets (example) to get more eyes on the problem.
  5. If I positively identify a spurious failure that won’t be going away any time soon, I add it to the “retry bot”.

Aside: What is the retry bot, you ask? It is a service that listens to GitLab job events. (The service does many things that I won’t talk about now—this is why I want to split the retry functionality into its own project.) I have programmed it to identify spurious failures with pattern matching (example) that mimics the full-text searches I used during my investigation. If one or more failures are identified when a job is processed, a record is added to a central database, and the job is retried via the GitLab API. The database powers my spurious failure dashboard.
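The per-job flow described above can be sketched roughly like this. Everything here is a hypothetical stand-in: the failure patterns are illustrative, and the two IORefs stand in for the central database and the GitLab retry API call, which the real service performs over HTTP.

```haskell
import Control.Monad (unless)
import Data.IORef
import Data.List (isInfixOf)

data SpuriousFailure = OutOfDisk | OutOfMemory | NetworkError
  deriving (Show, Eq)

-- Illustrative patterns mimicking the full-text searches used during
-- investigation; not the bot's real pattern set.
failurePatterns :: [(String, SpuriousFailure)]
failurePatterns =
  [ ("No space left on device", OutOfDisk)
  , ("Cannot allocate memory",  OutOfMemory)
  , ("Could not resolve host",  NetworkError)
  ]

matchFailures :: String -> [SpuriousFailure]
matchFailures jobLog =
  [ f | (pat, f) <- failurePatterns, pat `isInfixOf` jobLog ]

-- Process one job: record each identified failure, then retry the job.
-- The IORefs are stand-ins for the database and the GitLab API.
processJob :: IORef [(Int, SpuriousFailure)]  -- "central database"
           -> IORef [Int]                     -- jobs retried via the API
           -> Int -> String -> IO ()
processJob db retried jobId jobLog = do
  let failures = matchFailures jobLog
  unless (null failures) $ do
    mapM_ (\f -> modifyIORef db ((jobId, f) :)) failures
    modifyIORef retried (jobId :)
```

A job whose log contains none of the patterns is left alone; a job with one or more matches gets one database record per match and a single retry.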

If you look at that list of steps, only the first one scales easily. I am the bottleneck on 2, 3, 4, and 5. Figuring out how to break that bottleneck is my goal.

[1]: A spurious failure is one that causes a CI job to fail for reasons that have nothing to do with the code or tests. An example is a job that fails because of a network failure between GitLab and the machine running the testsuite. Oh, and what’s a “job”? You can think of it as a process that runs the testsuite and reports on its results.