Hello, welcome to week 8 of the DevOps log.
The main result of this week was a dashboard for spurious failures. It updates in real time for every CI job that runs in the GHC project. I also backfilled some historical data.
Thanks to the backfill, we can now see the effect of disabling the small number of runners that were causing a particular sort of error. That error used to happen at least once a day, sometimes much more often, and now it’s gone! It’s great to be able to show this at a glance, and I’m glad to have a framework for tracking other failures as well.
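For readers wondering what tracking a kind of failure per day could look like, here is a minimal sketch. It is not the dashboard's actual code: the failure names, the regular expressions, and the sample job data are all made up for illustration, and the real system may work quite differently.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical failure patterns; the dashboard's real rules are not shown in this post.
KNOWN_FAILURES = {
    "disk_full": re.compile(r"no space left on device", re.IGNORECASE),
    "network_timeout": re.compile(r"connection timed out", re.IGNORECASE),
}


def classify(log_text):
    """Return the names of all known failure patterns found in a job log."""
    return [name for name, pattern in KNOWN_FAILURES.items()
            if pattern.search(log_text)]


def daily_counts(jobs):
    """Count (day, failure name) occurrences over (day, log_text) pairs."""
    counts = Counter()
    for day, log_text in jobs:
        for name in classify(log_text):
            counts[(day, name)] += 1
    return counts


# Invented sample data standing in for CI job logs.
jobs = [
    (date(2023, 1, 1), "fatal: No space left on device"),
    (date(2023, 1, 1), "build ok"),
    (date(2023, 1, 2), "curl: Connection timed out"),
]
counts = daily_counts(jobs)
```

Plotting counts like these over time is one way a drop in a particular failure would show up at a glance. Matching on log text alone can also misfire when an unrelated job happens to print the same message.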
In fact, I haven’t seen any of the known errors happen since I started recording them (see below for a caveat). Therefore I expanded my scope to include the head.hackage, ghc-debug, ghcup, and filesystem projects, which were chosen only because they’re on gitlab.haskell.org and it was easy to drop the monitoring onto them as well.
I’ll also be expanding my scope by finding other kinds of errors. One sore spot for GHC devs has been builds on Darwin and FreeBSD, so I’ll start trying to categorize the problems experienced on those systems.
The error-tracking dashboard also needs additional development. The code for backfilling data needs to be published and improved in one or two important ways. I also need to document what the system is doing, so that others can understand it and even pitch in. Finally, I just discovered that the system is reporting a false positive, so I need to fix that! (That is the caveat I mentioned a couple of paragraphs back.)
Until next time,
-Bryan