Between December 1 12:20 UTC and December 2 1:05 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. Affected users saw workflows queued, waiting for a runner. On average, 8% of all workflows requiring large runners during the incident were affected, peaking at 37.5% of requests. There were also lower levels of intermittent queuing on December 1 beginning around 3:00 UTC. Standard and Mac runners were not affected. The job failures were caused by timeouts to a dependent service in the VM provisioning flow and gaps in the jobs’ resilience to those timeouts. The incident was mitigated by bypassing the dependency, as it was not in the critical path of VM provisioning.

There are a few immediate improvements we are making in response to this. We are addressing the causes of the failed calls to improve the availability of that backend service. Even with those failures, the critical flow of large VM provisioning should not have been affected, so we are improving the client behavior to fail fast and circuit break non-critical calls. Finally, the alerting for this service was not adequate in this scenario to ensure a fast response by our team. We are improving our automated detection to reduce our time to detection and mitigation of issues like this one in the future.
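As an illustration of the fail-fast, circuit-breaking behavior described above, here is a minimal sketch in Go. The `Breaker` type, its thresholds, and the idea of wrapping the non-critical dependency call are assumptions for illustration, not the actual provisioning code.

```go
package provisioner

import (
	"context"
	"errors"
	"sync"
	"time"
)

// ErrCircuitOpen is returned when the breaker is open and the
// non-critical call is skipped entirely.
var ErrCircuitOpen = errors.New("circuit open: skipping non-critical call")

// Breaker is a minimal circuit breaker: after maxFailures consecutive
// failures it rejects calls immediately until cooldown has elapsed.
type Breaker struct {
	mu          sync.Mutex
	failures    int
	maxFailures int
	cooldown    time.Duration
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs fn with a short timeout so the calling flow fails fast
// instead of blocking on a slow dependency.
func (b *Breaker) Do(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
	b.mu.Lock()
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		b.mu.Unlock()
		return ErrCircuitOpen
	}
	b.mu.Unlock()

	cctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	err := fn(cctx)

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now()
		}
		return err
	}
	b.failures = 0
	return nil
}
```

In a flow like this, the call to the non-critical service would go through `Do` with a short timeout, and a timeout or `ErrCircuitOpen` result would be logged and ignored rather than failing the VM provisioning job.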
Between 13:30 and 15:00 UTC, repository searches were timing out for most users. Ongoing efforts from a similar incident last week helped uncover the main contributing factors. We have deployed short-term mitigations and identified longer-term work to proactively identify and limit resource-intensive searches.
On November 25, 2024 between 10:38 UTC and 12:00 UTC, the Claude model for GitHub Copilot Chat experienced degraded performance. During the impact, all requests to Claude resulted in an immediate error to the user. This was due to upstream errors with one of our infrastructure providers, which have since been mitigated. We are working with our infrastructure providers to reduce time to detection and implement additional failover options to mitigate issues like this one in the future.
Between 2024-11-06 11:14 UTC and 2024-11-08 18:15 UTC, pull requests added to merge queues in some repositories were not processed. This was caused by a bug in a new version of the merge queue code and was mitigated by rolling back a feature flag. Around 1% of enqueued PRs were affected, and around 7% of repositories that use a merge queue were impacted at some point during the incident.
Queues were impacted if their target branch had the “require status checks” setting enabled, but did not have any individual required checks configured. Our monitoring strategy only covered PRs automatically removed from the queue, which was insufficient to detect this issue.
We are improving our monitors to cover anomalous manual queue entry removal rates, which will allow us to detect this class of issue much sooner.
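To make the affected configuration concrete, the sketch below shows the condition described above as a simple predicate in Go. The `BranchProtection` struct and its field names are hypothetical simplifications for illustration, not GitHub's actual data model.

```go
package mergequeue

// BranchProtection is a hypothetical, simplified view of the target
// branch's protection settings relevant to this incident.
type BranchProtection struct {
	RequireStatusChecks bool     // "require status checks" enabled
	RequiredChecks      []string // individually required check names
}

// inAffectedConfiguration reports whether a queue matched the condition
// that triggered the bug: status checks required, but no individual
// required checks configured.
func inAffectedConfiguration(p BranchProtection) bool {
	return p.RequireStatusChecks && len(p.RequiredChecks) == 0
}
```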
On November 21, 2024, between 14:30 UTC and 15:53 UTC, search services at GitHub were degraded and CPU load on some nodes hit 100%. The error rate averaged 22 requests per second and peaked at 83 requests per second. During this incident, Enterprise Profile pages were slow to load and searches may have returned low-quality results. The CPU load was mitigated by redeploying portions of our web infrastructure. We are still working to identify the cause of the increase in CPU usage and are improving our observability tooling to better expose the cause of an incident like this in the future.
On November 19, 2024, between 10:56 UTC and 12:03 UTC, the notifications service was degraded and stopped sending notifications. On average, notification delivery was delayed by about one hour. This was due to a database host coming out of a regular maintenance process in read-only mode. We mitigated the incident by making the host writable again. After that, notification delivery recovered and any delivery job that had failed during the incident was successfully retried. We are working to improve our observability across database clusters to reduce our time to detection and mitigation of issues like this one in the future.
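As a sketch of the kind of post-maintenance check that catches this failure mode, the Go snippet below verifies that a host is writable before traffic is routed back to it. It assumes a MySQL backend and the go-sql-driver/mysql driver; the function name and DSN handling are illustrative, not GitHub's actual maintenance tooling.

```go
package dbhealth

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/go-sql-driver/mysql" // MySQL driver (assumed backend)
)

// ensureWritable returns an error if the host is still in read-only
// mode, so automation can refuse to route writes to it after maintenance.
func ensureWritable(ctx context.Context, dsn string) error {
	db, err := sql.Open("mysql", dsn)
	if err != nil {
		return err
	}
	defer db.Close()

	var readOnly int
	if err := db.QueryRowContext(ctx, "SELECT @@global.read_only").Scan(&readOnly); err != nil {
		return err
	}
	if readOnly != 0 {
		return fmt.Errorf("host is still read-only after maintenance")
	}
	return nil
}
```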
On October 30, 2024, between 5:45 and 9:42 UTC, the Actions service was degraded, causing run delays. On average, Actions workflow run, job, and step updates were delayed by as much as one hour. The delays were caused by updates in a dependent service that led to failures in Redis connectivity. Delays recovered once Redis cluster connectivity was restored at 8:16 UTC, and the incident was fully mitigated once the job queue had been processed at 9:24 UTC. This incident followed an earlier short period of impact on hosted runners due to a similar issue, which was mitigated by failing over to a healthy cluster. In response, we are working to improve our observability across Redis clusters to reduce our time to detection and mitigation of issues like this one, where multiple clusters and services are impacted. We will also work to reduce the time to mitigation and improve our general resilience to this dependency.
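A minimal sketch of the failover behavior mentioned above, assuming the go-redis client: ping each candidate cluster with a short timeout and use the first healthy one. The addresses, timeouts, and function name are illustrative; this is not the Actions service's actual Redis handling.

```go
package redisfailover

import (
	"context"
	"errors"
	"time"

	"github.com/redis/go-redis/v9"
)

// pickHealthyCluster pings each candidate cluster with a short timeout
// and returns a client for the first one that responds, so callers can
// fail over instead of waiting on a cluster with broken connectivity.
func pickHealthyCluster(ctx context.Context, addrs []string) (*redis.Client, error) {
	for _, addr := range addrs {
		client := redis.NewClient(&redis.Options{Addr: addr, DialTimeout: 2 * time.Second})
		pctx, cancel := context.WithTimeout(ctx, 2*time.Second)
		err := client.Ping(pctx).Err()
		cancel()
		if err == nil {
			return client, nil
		}
		client.Close()
	}
	return nil, errors.New("no healthy Redis cluster available")
}
```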
On October 24, 2024 at 06:55 UTC, a discussion template YAML config file that was syntactically correct but semantically invalid was committed to the community/community repository. As a result, any user of that repository who tried to access a discussion template or create a discussion received a 500 error response. We mitigated the incident by manually reverting the invalid template changes. We are adding support to detect and prevent invalid discussion template YAML from causing user-facing errors in the future.
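The sketch below illustrates validation that goes beyond YAML syntax, using gopkg.in/yaml.v3 in Go. The `discussionTemplate` and `bodyItem` structs are a simplified, assumed subset of the discussion template schema, and the checks are examples of semantic rules rather than GitHub's actual validation.

```go
package templatecheck

import (
	"errors"
	"fmt"

	"gopkg.in/yaml.v3"
)

// bodyItem is a simplified, assumed subset of a discussion template's
// form elements; the real schema has more fields and rules.
type bodyItem struct {
	Type       string `yaml:"type"`
	Attributes struct {
		Label string `yaml:"label"`
	} `yaml:"attributes"`
}

type discussionTemplate struct {
	Title string     `yaml:"title"`
	Body  []bodyItem `yaml:"body"`
}

// validate rejects templates that parse as YAML but would still break
// rendering, e.g. a body element missing its type.
func validate(raw []byte) error {
	var t discussionTemplate
	if err := yaml.Unmarshal(raw, &t); err != nil {
		return fmt.Errorf("invalid YAML syntax: %w", err)
	}
	if len(t.Body) == 0 {
		return errors.New("template has no body elements")
	}
	for i, item := range t.Body {
		if item.Type == "" {
			return fmt.Errorf("body element %d is missing a type", i)
		}
	}
	return nil
}
```

Running a check like this in CI or at commit time would surface the error to the template author instead of turning it into a 500 for everyone visiting the repository's discussions.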