Hello, welcome to another DevOps weekly log.
As is often the case with DevOps, this week I spent time on reactive operational work. An important part of the GHC DevOps machinery had begun misbehaving. As a result, CI became less reliable and GHC performance metrics were at risk of being lost. By investigating the failures, I was able to identify a way to improve the component such that the same kind of error is unlikely to happen again. That improvement is now deployed and operational.
Other reactive work included setting up a few new machines as CI runners to take the load off others that were failing. This work also has strategic value, however, since the new machines are special-purpose CI runners running on managed hardware, meaning they should be more reliable and maintainable going forward.
I also set myself up as an early-warning system for new failures in head.hackage. By that I mean: since head.hackage is a useful tool for GHC devs to test compatibility of their patches against a subset of Hackage, it’s important for it to be reliable. (It’s part of the “user interface” for GHC DevOps.) Until recently, however, if it broke it was difficult to notice. Now, I get emails as soon as it happens, so I can quickly notice and report new problems.
Finally, I also investigated reports of spurious CI failures, which is my main strategic goal. I did the usual amount of admin work, discussed the upcoming GHC Contributors Workshop, which I will attend, and worked some more with a volunteer on splitting up the very same DevOps component that required emergency work earlier this week.
Coming up next, I will do some more reactive work, and look into why we ran out of disk space on a handful of CI machines all at the same time…
See you next week!