Hello, it’s another edition of the HF DevOps weekly log.
This Week: Tracking Spurious Failures
This week I focused on understanding spurious failures and designing a system to track them. I’m starting small because there is still so much for me to learn about the overall system, but it’s progressing well.
I’m narrowing my scope by focusing on one clear source of errors: “Cannot connect to the Docker daemon”. This error is easy to spot, clearly represents something that should not happen, and occurs often enough that GHC contributors recognize it instantly. In fact, it accounted for 4% of all failed jobs between May 20 and June 20 (145 of 3500 jobs).
There may actually be other spurious failures that are more numerous, but they are harder to spot! My strategy is to pick off an easy case first while building the reporting infrastructure, and deal with the harder cases once all the infrastructure is in place.
The system for tracking spurious failures will be built into the existing reporting system:
- Grafana instance for dashboards
- PostgreSQL database for storing metrics
- Webhooks, which the GitLab server calls for every job, for generating metrics (a minimal sketch of such a receiver follows the list)
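To make the webhook piece concrete, here is a minimal sketch of a receiver that records each job’s outcome as it arrives. To be clear about my assumptions: the real system writes to PostgreSQL, and the payload fields used here (object_kind, build_id, build_status, runner) are my reading of GitLab’s job-event webhooks; I use SQLite and the standard-library HTTP server only to keep the sketch self-contained.

```python
# Minimal sketch of a GitLab job-event webhook receiver.
# Assumptions: payload field names follow GitLab's job-event ("build") hook;
# the real system stores metrics in PostgreSQL, not SQLite.
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

db = sqlite3.connect("jobs.db")
db.execute(
    """CREATE TABLE IF NOT EXISTS job_events (
           job_id INTEGER PRIMARY KEY,
           status TEXT,
           runner TEXT
       )"""
)

class JobHook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        event = json.loads(body)
        if event.get("object_kind") == "build":  # GitLab calls job events "build"
            runner = (event.get("runner") or {}).get("description", "unknown")
            db.execute(
                "INSERT OR REPLACE INTO job_events VALUES (?, ?, ?)",
                (event["build_id"], event["build_status"], runner),
            )
            db.commit()
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8000), JobHook).serve_forever()
```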
Failure data
I got the numbers about the “Docker daemon” failure from my local database of job data. I mentioned in my last update that I was pulling the data for a month of failed jobs. This week I got all of it into an SQLite database, complete with full-text search. This has let me start exploring failures in a faster and more rigorous way; a sketch of the kind of query it enables follows below. Data is fun! (Data are fun?)
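As an illustration, here is a hypothetical query against such a database. The schema (a job_logs table with a trace column) and the sample row are invented for the example, and the full-text index assumes SQLite’s FTS5 extension.

```python
# Illustrative only: the real table and column names may differ.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE VIRTUAL TABLE job_logs USING fts5(job_id UNINDEXED, runner UNINDEXED, trace)"
)
db.execute(
    "INSERT INTO job_logs VALUES (?, ?, ?)",
    ("12345", "runner-x", "docker: Cannot connect to the Docker daemon ..."),
)

# How many job traces mention the Docker daemon error?
count, = db.execute(
    """SELECT count(*) FROM job_logs
       WHERE job_logs MATCH 'trace: "Cannot connect to the Docker daemon"'"""
).fetchone()
print(count)  # -> 1
```

A phrase query along these lines is enough to produce counts like the 4% figure above.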
GHC residency profiling
Although my previous work on profiling GHC is mostly done for now, I noticed that the query populating my Grafana panel was slowing down too quickly. A few EXPLAINs and one new index later, I think it’s good enough for now.
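For anyone unfamiliar with that workflow, here is a rough illustration of the EXPLAIN-then-index loop. The real panel query runs against PostgreSQL, where plain EXPLAIN (or EXPLAIN ANALYZE) prints the plan; I use SQLite here only to keep the sketch self-contained, and the residency table is made up.

```python
# Illustration of checking a query plan and adding an index.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE residency (job_id INTEGER, recorded_at TEXT, bytes INTEGER)")

query = "SELECT recorded_at, bytes FROM residency WHERE recorded_at >= ? ORDER BY recorded_at"
since = ("2000-01-01",)  # any cutoff; the value does not affect the plan

def plan():
    for row in db.execute("EXPLAIN QUERY PLAN " + query, since):
        print(row)

plan()  # before: a full table scan
db.execute("CREATE INDEX residency_recorded_at ON residency (recorded_at)")
plan()  # after: a search using the new index
```

The point of the index is that a time-bounded panel query can seek straight to the relevant rows instead of scanning the whole table.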
Coming next
After I finish my spurious failure dashboard, I want to actually solve the “Docker daemon” failure. Or rather, I want to confirm that it has already been solved: my data also showed that a small handful of runners were responsible for the problem, and those runners have already been disabled. (While that will reduce CI throughput, we can solve that problem separately.) Perhaps that will fix the problem for now.
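Attributing the failures to runners only takes a simple aggregation. The schema and data below are invented for illustration; they are not the real job records.

```python
# Hypothetical sketch: which runners account for the Docker daemon failures?
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE failed_jobs (job_id INTEGER, runner TEXT, docker_error INTEGER)")
db.executemany(
    "INSERT INTO failed_jobs VALUES (?, ?, ?)",
    [(1, "runner-a", 1), (2, "runner-a", 1), (3, "runner-b", 0)],
)

for runner, n in db.execute(
    """SELECT runner, count(*) FROM failed_jobs
       WHERE docker_error = 1
       GROUP BY runner ORDER BY count(*) DESC"""
):
    print(runner, n)  # e.g. runner-a 2
```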
The next large source of spurious failures appears to be a network timeout connecting to registry.gitlab.haskell.org, so that would be next on the agenda.
Until next time,
-Bryan