Haskell Foundation DevOps Weekly Update, 2023-02-08

Welcome to the 30th weekly update.

My primary focus over the last week was spurious failures, although other things happened along the way.

First, spurious failures. I’ve looked into five more in the last week. It’s a fascinating process. Sometimes I discover a test that has always been fragile, like #22872 or #22910, which seems to point at some hard-to-reproduce bug in GHC. Other times a failure looks like a networking problem. In fact, after I look into N (for some N) more spurious failures, I will take some time to strengthen our network monitoring to see if we can pinpoint any problems. I suspect there’s a whole bunch of weird behavior around GitLab that will boil down to some networking issue.

Second was the follow-up regarding Mac notarization. I wrote a separate Discourse post, which has a little follow-up discussion as well.

Third was fixing up my full-text search for job logs. I rewrote it to avoid storing the actual logs, which were too big for my puny laptop to store. Now I can quickly search against hundreds of thousands of jobs to look for trends in failures. Nine months of CI is ~150,000 jobs and 33GB of disk space (which includes all job metadata, too). It allows me to write queries like,

select job_id, json ->> '$.created_at'
from job
where job_id in (
    select rowid from job_trace
    where trace match '"Timed out waiting for the build to finish" OR "Job failed execution took longer than"'
)
and json ->> '$.name' not like '%thread_sanitizer%'
order by json ->> '$.created_at' desc

which I will just leave without comment. Sorry.
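
For anyone curious how a search like this can work without keeping the logs around, here is a minimal sketch. It assumes SQLite's FTS5 extension with a contentless table (content=''), which stores only the search index and not the text. That is one plausible way to get this behavior, not necessarily what the tool described above does. The table and column names mirror the query above; the sample jobs, function names, and timestamps are made up for illustration, and json_extract is used in place of ->> so the sketch also runs on older SQLite builds.

```python
import json
import sqlite3

# Hypothetical sample data: (job_id, metadata, trace text).
SAMPLE_JOBS = [
    (1, {"name": "x86_64-linux-validate", "created_at": "2023-02-01T10:00:00Z"},
     "ERROR: Job failed: execution took longer than 8h0m0s seconds"),
    (2, {"name": "aarch64-darwin-validate", "created_at": "2023-02-03T12:00:00Z"},
     "Timed out waiting for the build to finish"),
    (3, {"name": "x86_64-linux-validate", "created_at": "2023-02-05T08:00:00Z"},
     "All tests passed"),
    (4, {"name": "x86_64-linux-thread_sanitizer", "created_at": "2023-02-06T09:00:00Z"},
     "Timed out waiting for the build to finish"),
]

QUERY = ('"Timed out waiting for the build to finish" '
         'OR "Job failed execution took longer than"')


def build_index(jobs):
    """Load job metadata and index each trace without storing its text."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE job (job_id INTEGER PRIMARY KEY, json TEXT)")
    # content='' makes this a contentless FTS5 table: the full-text index is
    # kept, but the original trace text is not (assumed approach).
    db.execute("CREATE VIRTUAL TABLE job_trace USING fts5(trace, content='')")
    for job_id, meta, trace in jobs:
        db.execute("INSERT INTO job (job_id, json) VALUES (?, ?)",
                   (job_id, json.dumps(meta)))
        # The rowid ties each index entry back to its job.
        db.execute("INSERT INTO job_trace (rowid, trace) VALUES (?, ?)",
                   (job_id, trace))
    db.commit()
    return db


def search(db, fts_query, exclude_name="%thread_sanitizer%"):
    """Return (job_id, created_at) for matching jobs, newest first."""
    return db.execute(
        """
        SELECT job_id, json_extract(json, '$.created_at')
        FROM job
        WHERE job_id IN (
            SELECT rowid FROM job_trace WHERE trace MATCH ?
        )
        AND json_extract(json, '$.name') NOT LIKE ?
        ORDER BY json_extract(json, '$.created_at') DESC
        """,
        (fts_query, exclude_name),
    ).fetchall()
```

Calling search(build_index(SAMPLE_JOBS), QUERY) returns jobs 2 and 1, newest first; job 4 also timed out but is filtered out by name. The trade-off of a contentless index is that you can find which jobs match but cannot read the matching log lines back out of the database.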

Honorable mentions for my time go to

  • Reading and interacting with Language, Library, and Compiler Stability
  • Looking at Cabal CI
  • Reviewing a contributor’s project that reimplements my CI retry bot in Haskell. If you’re also looking for projects, ask me and I will have suggestions!

Coming up, I will continue with spurious failures: six more until I think about taking a break.
