Haskell Foundation DevOps Weekly Log, 2022-06-17

Sorry for skipping last week’s log, but thanks to everyone who spoke with me at Zurihac! I had fun talking and meeting people, and even managed to do a little hacking.

Since the last update, I have finished everything needed to add residency profiling to nightly head.hackage jobs, except for testing that the results actually show up in the database. Although that took most of my time, I have also begun shifting to my main focus area: CI stability.
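For the curious: the residency numbers come from GHC’s RTS statistics, whose +RTS -s summary includes a “bytes maximum residency” line. As a minimal sketch of the kind of extraction a nightly job could do before writing to the database (the module and function names here are hypothetical, not the actual head.hackage tooling):

```haskell
-- Illustrative only: pull the "maximum residency" figure out of a
-- `+RTS -s` summary line such as
--   "   1,234,567 bytes maximum residency (5 sample(s))"
-- The real head.hackage tooling may do this differently.
module ParseResidency (maxResidencyBytes) where

import Data.Char (isDigit)
import Data.List (isInfixOf)

maxResidencyBytes :: String -> Maybe Integer
maxResidencyBytes rtsSummary =
  case filter ("bytes maximum residency" `isInfixOf`) (lines rtsSummary) of
    []      -> Nothing
    (l : _) ->
      -- Keep only the digits before the word "bytes", ignoring the commas.
      case filter isDigit (takeWhile (/= 'b') l) of
        ""     -> Nothing
        digits -> Just (read digits)
```

A job would then attach that number to the package and commit it came from and insert the row into the database mentioned above.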

GHC residency profiling

While this was mostly “code complete” before Zurihac, I still had to spend a lot of time on it. Primarily, I was learning how to deploy changes to the GitLab server. Keeping such a system running is a herculean task, and it’s no wonder it has been a timesink for the GHC team.

Luckily, the server is maintained in a NixOS configuration, meaning even as a newbie I can discover the current state of the system and follow recent changes. This really minimizes the amount of folklore that must be passed down from admin to admin!

Not everything on the server is rigidly defined, however. There is still work done directly on the server, there are git submodules that can accidentally acquire local commits that never get pushed upstream, and, oh hey, it turns out the repo actually defines multiple servers, with work done directly on those as well, so watch out for conflicts…

This is all totally normal, though. The time spent easing onboarding for newbies needs to be weighed against how often someone new actually joins, and the rigid definition of systems needs to be balanced against ease and speed of deployment. As long as the team is small, the current set of tradeoffs is reasonable.

Anyway, here’s another boring list of links summarizing my code changes that, by its breadth alone, gives some insight into the complexity of devops:

CI stability

Ironically, the task of adding residency profiling isn’t my main focus, and it took a lot longer than anybody expected. But I think that’s great. I got to experience GHC CI as a user, and I already have much better insight into the assemblage of servers, repos, pipelines, jobs, and workflows that gives rise to the emergent phenomenon known as “GHC CI”. Plus, observant readers will note that some of the commits listed above fix CI failures I ran into!

I can now turn my focus completely towards making CI less painful. The meta issue about intermittent failures continues to bear fruit, and I’ve started to characterize the kinds of failures people (including me) experience. We already have an idea of how to solve one whole class of problems, but I’ll leave that for next time, when I have harder data…
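To give a flavour of what “characterizing” could look like (the categories and log patterns below are placeholders I made up, not an actual taxonomy), a first pass might be as crude as bucketing failed-job logs by what they mention:

```haskell
-- Purely illustrative: bucket CI job logs by the kind of failure they show.
-- The categories and the search strings are placeholders, not GHC's real taxonomy.
module ClassifyFailure (FailureKind (..), classify) where

import Data.List (isInfixOf)

data FailureKind
  = RunnerLost    -- ^ the runner disconnected or was killed mid-job
  | Timeout       -- ^ the job hit its time limit
  | NetworkFlake  -- ^ fetching a dependency or artifact failed
  | TestFailure   -- ^ the testsuite itself reported failures
  | Unknown       -- ^ needs a human to look at it
  deriving (Show, Eq)

classify :: String -> FailureKind
classify jobLog
  | any (`isInfixOf` jobLog) ["Runner system failure", "lost connection"] = RunnerLost
  | "exceeded the timeout" `isInfixOf` jobLog                             = Timeout
  | "Could not resolve host" `isInfixOf` jobLog                           = NetworkFlake
  | "Unexpected failures" `isInfixOf` jobLog                              = TestFailure
  | otherwise                                                             = Unknown
```

Even something this rough, run over a batch of failed jobs, starts to show which buckets dominate and which are worth tackling first.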


I don’t love TeamCity, but I liked using it to browse the history of test runs and to see trends in flakiness over time (in combination with JUnit output, for a large Scala monorepo). I imagine CircleCI and other vendors have built some useful reports for tracking flakiness over time as well.

I haven’t dug into it yet, but the GHC test infrastructure has a bunch of stuff for tracking flaky tests. None of it (or not much of it) is integrated with GitLab, and I’m still learning how much of it is custom to GHC and how much is off-the-shelf.