Hello again!
During week 12, I identified new types of spurious CI failures and kicked over a rock that may have been hiding a few more. There were a few ops-y tasks and a little community involvement as well.
Done recently
- Handled a minor GitLab incident involving abuse of the service
- Looked into signal 9 errors
- I suspected the OOM-killer, but free memory reporting looks ok
- Though the reporting is only at a 15-second granularity
- On their own, these errors are hard to diagnose, but see below
- Looked at other errors
- Discovered cannot_allocate, segfault, and T16916
- Segfaults can be legitimate, but they seem to be happening in a clear pattern of spurious failures
- “Cannot allocate memory” errors (e.g. nightly-x86_64-linux-deb10-no_tntc-validate, job #1130939) seem to be much more clearly about hitting memory limits! So I can focus there and see if the signal 9s decrease as well
- T16916 just seems to be on the low side of reliability; not enough to get noticed as flaky but still showing up from time to time
- Or who knows, maybe it’s a regression
- Started looking at errors related to missing output
- A random sample makes it look like it’s the same set of tests failing this way
- Possibly similar to GHC issue #20791, “Something strange going on with windows CI and flushing”?
- Alerted to problems with CI on bgamari/ghcs-nix
- Mitigated by using an older version of cachix
- Full fix is to upgrade Docker on the affected CI runner systems
- It wasn’t a cachix bug, but https://github.com/cachix/cachix/issues/427 has some details
- Commented on a few discussions on various proposals floating around. Generally I try to stay neutral and uninvolved, so I should probably scale the commenting back. I think staying uninvolved is an important way to avoid burnout (since I already work on supporting the community full-time). But community involvement is also important!
- [Shameless HF promotion] Check out https://github.com/haskellfoundation/tech-proposals/pull/37 which is fantastic
Coming up
- Try to ensure jobs are not running out of memory
- See if I can positively identify spurious errors out of the patterns of segfaults and missing output I’m seeing
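For the curious, here is a minimal sketch of how this kind of pattern-spotting could be automated, assuming the job logs are already available as plain text. The signature names and regular expressions are illustrative guesses based on the error types mentioned above, not what I actually run, and `classify` and `tally` are hypothetical helpers.

```python
import re
from collections import Counter

# Hypothetical failure signatures; the patterns are assumptions about
# what the relevant log lines look like, not a verified list.
SIGNATURES = {
    "cannot_allocate": re.compile(r"cannot allocate memory", re.IGNORECASE),
    "signal_9": re.compile(r"signal 9\b|\bSIGKILL\b", re.IGNORECASE),
    "segfault": re.compile(r"segmentation fault|\bSIGSEGV\b", re.IGNORECASE),
}


def classify(log_text):
    """Return the sorted names of all signatures found in one job log."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern.search(log_text)
    )


def tally(logs):
    """Count how many logs match each signature."""
    counts = Counter()
    for text in logs:
        counts.update(classify(text))
    return counts
```

Running `tally` over a batch of failed-job logs gives a rough count per failure type, which might show whether signal 9s drop once the memory limits are addressed.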
Also in “Coming up”, the dates were confirmed for my absence: 18.8–25.9. I will delay next week’s log until Wednesday 17.8 so I can wrap up everything at once.
That’s all, have a nice weekend!
-Bryan
aka chreekat