Welcome to week 11 of the devops log!
This week saw real results from my work to reduce spurious failures: jobs can now be safely retried. Automatically retrying the affected jobs allowed Thursday morning's head.hackage pipelines (lint, perf) to eventually succeed! The incident also gave me more useful data, since several failures occurred in a row.
Also on the subject of spurious failures:
- @simonpj highlighted another failure type: T16916 is a particularly flaky test. I found that it happens regularly, but less frequently than other failures, so I will deal with it later.
- I got access to more runners to inspect runner-related failures.
- @angerman and I began collaborating on “signal 9” failures.
Finally, there were also a few devops-y developments outside of spurious failures. I configured marge-bot to restart on failure, as I discussed in my last devops log. I was also alerted to two potential issues with the GitLab server itself. First, disk usage is growing quickly, which may be related to my head.hackage residency profiling. Second, and unrelated to anything else, GitLab reports that more repositories are “failing their repository check”, which needs a deeper look.
So I’ll be looking into these GitLab server problems next, before continuing with signal_9 and other spurious failures!
-Bryan