Haskell Foundation DevOps Weekly Log, 2023-03-22

Hey! The first thing I need to do is describe how the system currently works, which I’ll briefly sketch out here. Then I need to finish splitting one service into its own project and make it more robust with a testsuite and some improved behavior. Finally, I need to work with the people who already use the system to find out how it could give them more power. I already have one rough idea from GHC maintainers that I want to work on.

Before describing the system, recall its purpose: it exists to track and mitigate spurious failures in CI.[1]

Here’s how the system works, described as a series of steps.

  1. A GHC contributor notices a strange failure on their own code (example) and pings me.
  2. I read the job log and use my noggin to figure out what’s strange about the failure.
  3. I do some investigation: does this happen often? How often? Which jobs or platforms are involved? Is it really spurious, or just a fragile test? (There’s a small sketch of this kind of tallying right after this list.)
  4. Based on my findings, I open tickets (example) to get more eyes on the problem.
  5. If I positively identify a spurious failure that won’t be going away any time soon, I add it to the “retry bot”.
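
To make step 3 a little more concrete, here is a minimal Haskell sketch of what that investigation boils down to: given a batch of job logs and a suspected failure signature, count hits per job and platform. Everything in it is hypothetical for illustration, including the `JobLog` type, its field names, and the sample pattern; in practice this is just full-text search over GitLab job logs.

```haskell
-- Hypothetical sketch: tally suspected spurious failures per (job, platform).
-- The JobLog type, its fields, and the sample pattern are all made up
-- for illustration.
import Data.List (isInfixOf, sortOn)
import Data.Ord (Down (..))
import qualified Data.Map.Strict as Map

data JobLog = JobLog
  { jobName  :: String   -- e.g. "x86_64-linux-deb10-validate"
  , platform :: String   -- e.g. "linux", "darwin", "windows"
  , logText  :: String   -- the full job log
  }

-- Does this log contain the tell-tale text of the suspected spurious failure?
matchesPattern :: String -> JobLog -> Bool
matchesPattern pat job = pat `isInfixOf` logText job

-- Count matching failures per (job, platform), most frequent first.
tally :: String -> [JobLog] -> [((String, String), Int)]
tally pat logs =
  sortOn (Down . snd) . Map.toList $
    Map.fromListWith (+)
      [ ((jobName j, platform j), 1 :: Int) | j <- logs, matchesPattern pat j ]

main :: IO ()
main = do
  let logs =
        [ JobLog "deb10-validate"  "linux"  "... Connection reset by peer ..."
        , JobLog "deb10-validate"  "linux"  "... all tests passed ..."
        , JobLog "darwin-validate" "darwin" "... Connection reset by peer ..."
        ]
  mapM_ print (tally "Connection reset by peer" logs)
```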

Aside: What is the retry bot, you ask? It is a service that listens to GitLab job events. (The service does many things that I won’t talk about now; that’s why I want to split the retry functionality into its own project.) I have programmed it to identify spurious failures with pattern matching (example) that mimics the full-text searches I used during my investigation. If one or more failures are identified when a job is processed, a record is added to a central database, and the job is retried via the GitLab API. The database powers my spurious failure dashboard.
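
For a rough picture of that flow, here is a sketch in Haskell. To be clear, this is not the bot’s actual code: the `JobEvent` shape, the pattern list, the `recordFailure` stub, and the function names are all invented for illustration; the only non-invented detail is GitLab’s job retry endpoint (`POST /projects/:id/jobs/:job_id/retry`).

```haskell
-- Rough sketch of the job-event handling, not the bot's actual code.
-- The JobEvent shape, pattern list, and recordFailure stub are illustrative.
{-# LANGUAGE OverloadedStrings #-}
import Data.List (isInfixOf)
import qualified Data.ByteString.Char8 as BS
import Network.HTTP.Simple (httpNoBody, parseRequest, setRequestHeader)

-- The bits of a GitLab job event we care about (simplified).
data JobEvent = JobEvent
  { projectId :: Int
  , jobId     :: Int
  , jobStatus :: String
  , jobLog    :: String    -- fetched separately via the jobs API
  }

-- Patterns mimicking the full-text searches used during investigation.
spuriousPatterns :: [String]
spuriousPatterns =
  [ "Connection reset by peer"
  , "No space left on device"
  ]

matchedPatterns :: String -> [String]
matchedPatterns logText = [ p | p <- spuriousPatterns, p `isInfixOf` logText ]

-- Stub: in the real service this writes to the central database
-- that powers the dashboard.
recordFailure :: JobEvent -> String -> IO ()
recordFailure ev pat =
  putStrLn ("job " ++ show (jobId ev) ++ " matched: " ++ pat)

-- Retry a job through the GitLab API: POST /projects/:id/jobs/:job_id/retry
retryJob :: String -> JobEvent -> IO ()
retryJob token ev = do
  req <- parseRequest $
    "POST https://gitlab.haskell.org/api/v4/projects/" ++ show (projectId ev)
      ++ "/jobs/" ++ show (jobId ev) ++ "/retry"
  _ <- httpNoBody (setRequestHeader "PRIVATE-TOKEN" [BS.pack token] req)
  pure ()

-- Called once per job event: if the job failed and a known pattern
-- matches, record each match and retry the job.
handleJobEvent :: String -> JobEvent -> IO ()
handleJobEvent token ev
  | jobStatus ev == "failed"
  , matches@(_ : _) <- matchedPatterns (jobLog ev) = do
      mapM_ (recordFailure ev) matches
      retryJob token ev
  | otherwise = pure ()
```

In the real service the event arrives as a webhook payload and the log is fetched through the jobs API; the sketch skips both to stay focused on the match, record, and retry step.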

If you look at that list of steps, only the first one scales easily. I am the bottleneck on 2, 3, 4, and 5. Figuring out how to break that bottleneck is my goal.


[1]: A spurious failure is one that makes a CI job fail for reasons that have nothing to do with the code or tests. An example is a job that fails because of a network failure between GitLab and the machine running the testsuite. Oh, and what’s a “job”? You can think of it as a process that runs the testsuite and reports on its results.
