DevOps Log, 2024-01-24

Welcome to another weekly log!

There was a bit of variety this week. The Stackage migration is still my main task, but when you’re responsible for running systems, you can never focus 100% of your effort in one place.

For example, I made a small improvement to GitLab emails by fixing a typo in the SPF record (don’t ask). Since we had to disable the GitHub OAuth integration due to spammers, people need to use the password reset option to log back in to their accounts, and that option depends on email delivery. The typo usually didn’t cause any trouble, but it may have done so for some people. Let me know if you ever have problems receiving emails.
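For the curious, here is a minimal Python sketch of how a typo in an SPF record could be caught mechanically. It only checks that each term uses a mechanism or modifier name defined in RFC 7208; the function name and the example.com records are made up for illustration, and this is not the actual record or the check that was used.

```python
import re

# Mechanism and modifier names defined by RFC 7208.
MECHANISMS = {"all", "include", "a", "mx", "ptr", "ip4", "ip6", "exists"}
MODIFIERS = {"redirect", "exp"}


def find_spf_typos(record: str) -> list[str]:
    """Return the terms of an SPF record whose names are not recognized.

    Hypothetical helper: a name-level sanity check, not a full SPF validator.
    Note that RFC 7208 tells receivers to ignore unknown modifiers, so this
    check is deliberately stricter than the specification.
    """
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return ["missing v=spf1 version tag"]

    problems = []
    for term in terms[1:]:
        # Drop an optional qualifier (+, -, ~, ?) in front of the term.
        body = term[1:] if term[0] in "+-~?" else term
        # The name ends at the first ':', '/', or '='.
        name = re.split(r"[:/=]", body, maxsplit=1)[0].lower()
        if name not in MECHANISMS and name not in MODIFIERS:
            problems.append(term)
    return problems


# Made-up records: the second one misspells "include".
good = "v=spf1 mx include:_spf.example.com ~all"
bad = "v=spf1 mx inlcude:_spf.example.com ~all"
print(find_spf_typos(good))
print(find_spf_typos(bad))
```

A check like this only catches misspelled names. It says nothing about whether the listed hosts and addresses are the right ones.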

Another highlight from this week was that I dealt with spurious failures in GHC CI again. The work of reducing spurious failures is never-ending. As with any working piece of machinery, little flaws build up over time. It’s impossible to notice every little problem when it gets introduced, so occasionally the system needs maintenance.

Some systems are more reliable than others, of course. To make this one more reliable, I have a few projects in mind. Just off the top of my head:

  • Visualize test failures on the master branch
  • Automate more of the CI infrastructure
  • Encourage more contributions to GHC, the testsuite, and CI by improving docs and onboarding

Anyway, it turns out the biggest problem was a misbehaving CI runner, which has been taken out of the pool now. I still want to look at a few more failed jobs just to be sure there’s nothing else bad happening. I have opened a number of tickets for the other problems that I’ve found so far. See issue list below!
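As a rough illustration of how a misbehaving runner can stand out from the rest of the pool, here is a minimal Python sketch that computes a failure rate per runner from a list of job records. The record shape, runner names, and thresholds are all assumptions made for the example; this is not the tooling used for the actual investigation.

```python
from collections import Counter


def failure_rate_by_runner(jobs):
    """Map each runner name to the fraction of its jobs that failed.

    `jobs` is an iterable of (runner_name, status) pairs, a made-up shape
    chosen for this sketch.
    """
    totals = Counter()
    failures = Counter()
    for runner, status in jobs:
        totals[runner] += 1
        if status == "failed":
            failures[runner] += 1
    return {runner: failures[runner] / totals[runner] for runner in totals}


def suspicious_runners(jobs, threshold=0.5, min_jobs=3):
    """List runners with at least `min_jobs` jobs and a failure rate above `threshold`."""
    jobs = list(jobs)
    totals = Counter(runner for runner, _ in jobs)
    rates = failure_rate_by_runner(jobs)
    return sorted(
        runner
        for runner, rate in rates.items()
        if totals[runner] >= min_jobs and rate > threshold
    )


# Hypothetical job history: runner-b fails far more often than runner-a.
sample_jobs = [
    ("runner-a", "success"), ("runner-a", "success"),
    ("runner-a", "failed"), ("runner-a", "success"),
    ("runner-b", "failed"), ("runner-b", "failed"),
    ("runner-b", "success"), ("runner-b", "failed"),
]
print(suspicious_runners(sample_jobs))
```

The `min_jobs` cutoff is there so that a runner with one unlucky job does not get flagged. A high failure rate is only a hint, since a runner may simply have been handed more of the flaky jobs.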

As a bonus, I cleaned up my local tooling for investigating spurious failures, so it should now be easier for other people to use. Check it out at local-tooling · master · Bryan R / spurious-failures · GitLab.

Once I’ve made enough progress, I will return to the Stackage migration. There are still two processes that need to be updated to write to the new HF haddocks bucket. I found out the hard way that these two need to be migrated simultaneously. Planning for how to deal with that, as well as figuring it out in the first place, has slowed me down a little bit. I still need to come up with a way to test the changes before deploying them. I want to do the synchronized deployment right the first time.

Well, that’s all for now! See you next time.

And here’s the next batch of issues discussed in the GHC triage meeting. Unlike most weeks, I stayed for the entire 2.25 hours that it took to work through these. It’s a lot of work!

P.S. Did you know there is a weekly open GHC call? If you have an interest in GHC development, you should join! See SPJ’s invitation in the ghc-devs archive: GHC Weekly call

GHC issues triaged this week