Hello, it’s another edition of the HF DevOps weekly log.
This week I focused on understanding spurious failures and designing a system to track them. I’m starting small because there is still so much for me to learn about the overall system, but it’s progressing well.
I’m narrowing my scope by focusing on one clear source of errors: “Cannot connect to the Docker daemon”. This error is easy to spot, clearly represents something that should not happen, and happens often enough that GHC contributors recognize it instantly. In fact, it accounted for 4% of all failed jobs between May 20 and June 20 (145 of 3,500 jobs).
There may actually be other spurious failures that are more numerous, but they are harder to spot! My strategy is to pick off an easy case first while building the reporting infrastructure, and deal with the harder cases once all the infrastructure is in place.
The system for tracking spurious failures will be built into the existing reporting system:
- Grafana instance for dashboards
- PostgreSQL database for storing metrics
- Webhooks, which the GitLab server calls for every job, for generating metrics
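The webhook piece of that pipeline boils down to: receive a job-event payload from GitLab, pull out the handful of fields worth charting, and write them as a row in the metrics database. Here is a minimal sketch of that extraction step; the payload field names are illustrative guesses at the GitLab job-hook schema, and SQLite stands in for the real PostgreSQL database.

```python
import sqlite3

def job_row(payload):
    """Extract the fields we chart from a (hypothetical) job-event payload."""
    return (
        payload["build_id"],
        payload["build_name"],
        payload["build_status"],
        payload.get("runner", {}).get("description"),
    )

# SQLite in memory, standing in for the real PostgreSQL metrics database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER, name TEXT, status TEXT, runner TEXT)")

# Insert one illustrative job event, as the webhook handler would on each call.
db.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)", job_row({
    "build_id": 101,
    "build_name": "validate-x86_64-linux",
    "build_status": "failed",
    "runner": {"description": "runner-7"},
}))

(status,) = db.execute("SELECT status FROM jobs WHERE id = 101").fetchone()
```

The nice property of this design is that the webhook handler stays dumb: it records facts, and all the interpretation (what counts as spurious, which panel it feeds) lives in queries on top.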
I got the numbers about the “Docker daemon” failure from my local database of job data. I mentioned in my last update that I was pulling the data for a month of failed jobs. This week I got all of it into a SQLite database, complete with full-text search. This has let me start exploring failures in a faster and more rigorous way. Data is fun! (Data are fun?)
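The full-text search part is what makes this kind of exploration fast: counting a failure like the “Docker daemon” one becomes a single phrase query instead of grepping gigabytes of traces. A small sketch of the idea using SQLite’s FTS5 extension (the table layout and log lines are illustrative, not my actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table over job traces; requires an SQLite build with FTS5,
# which standard Python builds generally include.
db.execute("CREATE VIRTUAL TABLE job_logs USING fts5(job_id, trace)")

db.executemany("INSERT INTO job_logs VALUES (?, ?)", [
    ("1", "error: Cannot connect to the Docker daemon at unix:///var/run/docker.sock"),
    ("2", "all tests passed"),
    ("3", "Cannot connect to the Docker daemon"),
])

# Double quotes inside the MATCH expression make this a phrase query,
# so only traces containing the exact phrase are returned.
hits = db.execute(
    "SELECT job_id FROM job_logs WHERE job_logs MATCH ?",
    ('"Cannot connect to the Docker daemon"',),
).fetchall()
```

From there, joining the matching job IDs against job metadata (date, runner, pipeline) gives the kind of breakdown quoted above.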
Although my previous work on profiling GHC is mostly done for now, I noticed that the query populating my Grafana panel was getting slower as the data grew. A few EXPLAINs and one new index later, I think it’s fast enough for now.
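That EXPLAIN-then-index loop is easy to demonstrate. The real panel query runs against PostgreSQL, but SQLite’s `EXPLAIN QUERY PLAN` shows the same before/after shape in a self-contained way (table and column names here are made up for the sketch):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER, finished_at TEXT, status TEXT)")

query = "SELECT count(*) FROM jobs WHERE finished_at > ?"

# Without an index, the plan is a full table scan ("SCAN" over jobs).
plan_before = db.execute("EXPLAIN QUERY PLAN " + query, ("2024-05-20",)).fetchall()

# Adding an index on the filtered column turns the scan into an index search.
db.execute("CREATE INDEX jobs_finished_at ON jobs (finished_at)")
plan_after = db.execute("EXPLAIN QUERY PLAN " + query, ("2024-05-20",)).fetchall()
```

The plan rows come back as `(id, parent, notused, detail)` tuples; the `detail` string is what changes from a scan to a search using the new index.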
After I finish my spurious failure dashboard, I want to actually solve the “Docker daemon” failure. Or rather, I want to confirm that it has already been solved: my data showed that a small handful of runners were responsible for the problem, and those runners have already been disabled. (While that reduces CI throughput, we can solve that problem separately.) Perhaps that is fix enough for now.
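Spotting that “small handful of runners” is just a group-by over the tagged failures. A sketch of the kind of query involved, on an illustrative table of jobs tagged with a failure category:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER, runner TEXT, failure TEXT)")

# Illustrative data: failures clustered on a couple of runners.
db.executemany("INSERT INTO jobs VALUES (?, ?, ?)", [
    (1, "runner-a", "docker-daemon"),
    (2, "runner-a", "docker-daemon"),
    (3, "runner-b", "docker-daemon"),
    (4, "runner-c", None),
])

# Count "docker-daemon" failures per runner, worst offenders first.
top = db.execute(
    "SELECT runner, count(*) AS n FROM jobs"
    " WHERE failure = 'docker-daemon'"
    " GROUP BY runner ORDER BY n DESC"
).fetchall()
```

If the failure really is runner-local, a query like this makes it obvious, and it also gives a way to verify the fix: after the offending runners are disabled, the counts for new jobs should drop to zero.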
The next large source of spurious failures appears to be a network timeout connecting to registry.gitlab.haskell.org, so that would be next on the agenda.
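Once there is more than one known failure pattern, tagging jobs by category is the natural next step for the dashboard. A minimal sketch of such a classifier; the patterns and category names are hypothetical stand-ins, not the exact log lines:

```python
import re

# Ordered list of (category, pattern) pairs for known spurious failures.
# The regexes here are illustrative guesses at what the traces contain.
PATTERNS = [
    ("docker-daemon", re.compile(r"Cannot connect to the Docker daemon")),
    ("registry-timeout",
     re.compile(r"registry\.gitlab\.haskell\.org.*(timed? ?out|timeout)", re.I)),
]

def classify(trace):
    """Return the first matching failure category, or None for 'unexplained'."""
    for tag, pattern in PATTERNS:
        if pattern.search(trace):
            return tag
    return None
```

The `None` bucket is arguably the most useful output: whatever fails without matching any known pattern is exactly the pool of harder-to-spot spurious failures mentioned earlier.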
Until next time,