One month ago, the automatic update began failing. This seems to have been caused by conflicting configuration in our Fastly and Cloudflare accounts, although that is still being investigated. Today, the failures caught up to us and the last good cert expired.
Sadly, we don’t have any monitoring or alerting in place for this update process, so although the logs are there, we didn’t spot the errors before today.
Fastly has sponsored CDN and TLS services for Haskell domains for a very long time. Some of our configuration predates ACME wildcard domains, if not ACME itself. Given all the legacy stuff, it’s hard to know why everything is the way it is.
Recently, some changes pushed the problem over the line. ACME config on the servers, DNS zone config in Cloudflare (another sponsor), and TLS config in Fastly conflicted in a way that broke the ACME update.
The fix involved removing old conflicting certs in Fastly, removing or correcting DNS entries in Cloudflare, and rerunning the ACME update. This cleaned up other potential lurking issues, as well.
It looks like there may have been migrations done in the past that were partially incomplete. We believe—well, I believe in particular—a better focus on understanding systems, planning changes, and following up on them would have prevented this. But the real failure here was the lack of alerting. The cert tried to tell us for a whole month that something was wrong, but we never heard it.
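The gap described above is a cert that had been expiring for a month without anyone being told. The check below is an added example, not code from the Haskell Infra team, of the kind of alerting that would have caught it: it reads the leaf certificate's `notAfter` date from a live TLS handshake and classifies it as OK, WARNING or CRITICAL against a threshold. The host list, the 14-day threshold, and the `fetch` hook used to run it offline are all assumptions of this example. Because Python's default `ssl` context verifies the chain, an already-expired cert makes the handshake itself fail, so that failure is reported as CRITICAL rather than swallowed.

```python
import socket
import ssl
import time
from typing import Callable, Dict, List, Optional

# Assumed threshold: ACME clients typically renew ~30 days before expiry,
# so a cert still unrenewed at 14 days out means the renewal path is broken.
WARN_DAYS = 14


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a verifying TLS connection and return the leaf cert's notAfter
    string (ssl's "Mon DD HH:MM:SS YYYY GMT" format). Network call; the
    offline test below injects a stub instead."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until the notAfter timestamp; negative once the cert has expired."""
    expires = ssl.cert_time_to_seconds(not_after)  # stdlib parser for that format
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def classify(not_after: str, now: Optional[float] = None,
             warn_days: int = WARN_DAYS) -> str:
    """Map days remaining onto an alert level."""
    d = days_remaining(not_after, now)
    if d < 0:
        return "CRITICAL"
    if d < warn_days:
        return "WARNING"
    return "OK"


def check_hosts(hosts: List[str], now: Optional[float] = None,
                fetch: Callable[[str], str] = fetch_not_after) -> Dict[str, str]:
    """Check every host and return {host: level}. A failed verifying handshake
    (expired cert, wrong chain, bad hostname) is itself a CRITICAL condition,
    which is exactly the case the original incident went a month without seeing."""
    results: Dict[str, str] = {}
    for host in hosts:
        try:
            results[host] = classify(fetch(host), now)
        except ssl.SSLCertVerificationError:
            results[host] = "CRITICAL"
        except (OSError, KeyError):
            # Connection refused, timeout, or no cert returned: worth a page too.
            results[host] = "UNKNOWN"
    return results


if __name__ == "__main__":
    # Constructed host list; substitute the domains you actually serve.
    for host, level in check_hosts(["example.org", "www.example.org"]).items():
        print(f"{level:8} {host}")
```

A scheduled run of this (cron, a CI job, or the team's monitoring system) that pages on anything other than OK is enough to turn a silent renewal failure into a noisy one weeks before the last good cert runs out.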
The Infra team has been getting more organized recently. The Haskell Foundation has devoted some resources to that work. Getting a team alerting system set up is high on our wishlist.