Haskell Foundation DevOps Weekly Log, 2023-08-23

Hello, and welcome to the next weekly log!

The last few logs have been fairly brief, so in this one I’ll expand a bit on my
current context.

Spurious Failures

I started this job by looking at spurious failures in GHC CI. Now I am wrapping up that project. Not because I expect to get rid of all spurious failures, but because I have identified the point where the cost/benefit ratio tips towards working on other projects.

Here is the roadmap for spurious failures:

| Status | Task Number | Description |
| --- | --- | --- |
| :white_check_mark: | 1 | Put out a call for data, asking the experts to report failed jobs that look spurious |
| :white_check_mark: | 2 | Build tools for exploring the jobs and characterizing types of failures |
| :white_check_mark: | 3 | Continually identify new types of spurious failures and create tickets to raise awareness |
| :white_check_mark: | 4 | Build a service that listens to GitLab job events, identifies spurious failures, and restarts affected jobs |
| :white_check_mark: | 5 | Rewrite the service in Haskell, dub it ‘spuriobot’, and add quality-of-life features |
| :white_check_mark: | 6 | Extend spuriobot to automatically protect new forks on GHC GitLab, not just the main GHC repository |
| :hourglass_flowing_sand: | 7 | (in progress) Add spuriobot to all existing forks on GHC GitLab |
| :black_square_button: | 8 | Build a public web app for exploring failures and characterizing types of failures (based on my existing local tools) |
| :black_square_button: | 9 | Encourage others to use the app and start characterizing spurious failures! |
| :black_square_button: | 10 | Use the app myself until these numbers are all hovering near 0.0% (see my yearly log for an explanation of “Post-Merge Validate Pipeline”) |

To be fair, that’s still a bit of work! But there are no surprises and no reasons to expect the scope to grow further.
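
For readers curious what the "identify and restart" step in task 4 might look like, here is a rough sketch in Haskell. It is not spuriobot's actual code. The type names, the failure labels, and the log markers are all made up for illustration, and a real service would also need to receive GitLab job events and call the GitLab API to retry a job, neither of which is shown here.

```haskell
import Data.List (isInfixOf)
import Data.Maybe (listToMaybe)

-- | A hypothetical failure category: a short label plus a substring
-- that marks this kind of failure in a job log.
data FailureKind = FailureKind
  { label  :: String
  , marker :: String
  } deriving (Show, Eq)

-- | Illustrative patterns only; these are assumptions, not spuriobot's list.
knownFailures :: [FailureKind]
knownFailures =
  [ FailureKind "no_space" "No space left on device"
  , FailureKind "connection_reset" "Connection reset by peer"
  ]

-- | Return the first known failure kind whose marker appears in the log.
classify :: String -> Maybe FailureKind
classify logText =
  listToMaybe [k | k <- knownFailures, marker k `isInfixOf` logText]

-- | What the service would do with a failed job.
data Action = Retry String | LeaveAlone
  deriving (Show, Eq)

-- | Retry jobs that match a known spurious failure; leave the rest alone.
decide :: String -> Action
decide logText = maybe LeaveAlone (Retry . label) (classify logText)

main :: IO ()
main = do
  print (decide "fatal: write error: No space left on device")
  print (decide "error: type mismatch in Foo.hs")
```

The point of the sketch is the split: classification is a pure function over the log text, so new failure types can be added to the list and checked without touching anything that talks to GitLab.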

Medium Term Projects

So, what’s after spurious failures then? In short, more of the “ops” side of DevOps.

1. Build a Haskell Foundation GHC CI build cluster

HF has the funds for sponsoring CI build machines. Somebody (me, for instance) just needs to put together a proposal and execute!

This isn’t just about creating a bunch of servers, though. It’s about streamlining management. Today’s CI cluster has a rainbow of owners, access methods, and configurations. While some of this is necessary complexity, the Haskell Foundation DevOps role is uniquely placed to reduce the accidental complexity. Along the way, I want to improve our security standards, system observability, and service reliability. Honestly, improving the actual throughput of the cluster–while desirable!–is just gonna be a nice consequence of the other work.

What makes this project interesting is the breadth of platforms that GHC supports. This support is a Very Good Thing, and I believe with good onboarding and documentation, even niche platforms with smaller groups of aficionados will be able to contribute to the ecosystem.

Oh, one last thing. Before anyone jumps to conclusions, don’t worry. I intend to look at virtualization options, managed options, and whatever else might reduce our operations load. (Get in touch if you have suggestions.) While the NIH pox has certainly left its scars, I am all the wiser for it now.

2. Take over GitLab administration

Ben Gamari, besides knowing GHC (and perhaps the entire universe itself) frontwards and backwards, is the primary administrator for the GHC GitLab instance. I believe one of the best things the HF DevOps role could possibly do is reduce some of that burden.

Similar to the CI build cluster, this project will be about streamlining management. Although it’s already automated to a great degree, there are a million little papercuts, jobs going undone, and questions better left unasked.

My plan is honestly to just chip away at one thing at a time until the overall picture starts to take shape. That’s exactly what I did with spurious failures, and I think it turned out great.

That’s all for this week’s log—see you next time!