On December 4, 2024, between 18:52 UTC and 19:11 UTC, several GitHub services were degraded with an average error rate of 8%. The incident was caused by a change to a centralized authorization service that contained an unoptimized database query. This increased the overall load on a shared database cluster, with a cascading effect on multiple services, most notably repository access authorization checks. We mitigated the incident by rolling back the change at 19:07 UTC, fully recovering within 4 minutes. While this incident was caught and remedied quickly, we are implementing process improvements around recognizing and reducing the risk of changes that involve high-volume authorization checks. We are also investing in broad improvements to our safe rollout process, such as improving early detection mechanisms.
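The unoptimized query itself isn't included in this summary, but the failure mode is a familiar one: an authorization check that issues one query per resource multiplies load on a shared cluster as call volume grows. A minimal, hypothetical sketch (the schema, function names, and database handle are assumptions, not GitHub's actual code):

```python
# Hypothetical sketch of the "unoptimized query" failure mode; table,
# column, and function names are illustrative, not GitHub's actual schema.

def authorized_repo_ids_slow(db, user_id, repo_ids):
    # N+1 pattern: one round trip per repository. On a high-volume
    # authorization path, load on the shared cluster scales with len(repo_ids).
    allowed = []
    for repo_id in repo_ids:
        row = db.execute(
            "SELECT 1 FROM repo_permissions WHERE user_id = %s AND repo_id = %s",
            (user_id, repo_id),
        ).fetchone()
        if row:
            allowed.append(repo_id)
    return allowed

def authorized_repo_ids_fast(db, user_id, repo_ids):
    # Batched pattern: a single indexed query, a constant number of round
    # trips regardless of how many repositories are being checked.
    rows = db.execute(
        "SELECT repo_id FROM repo_permissions"
        " WHERE user_id = %s AND repo_id = ANY(%s)",
        (user_id, list(repo_ids)),
    ).fetchall()
    return [r[0] for r in rows]
```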
On December 3rd, between 23:29 and 23:43 UTC, Pull Requests experienced a brief outage; teams have confirmed the issue is resolved. Due to the brevity of the incident, it was not publicly statused at the time; however, an RCA will be conducted and shared in due course.
On December 3, 2024, between 19:35 UTC and 20:05 UTC, API requests, Actions, Pull Requests, and Issues were degraded. Web and API requests for Pull Requests experienced a 3.5% error rate, and Issues had a 1.2% error rate. The highest impact was for users who experienced errors while creating and commenting on Pull Requests and Issues. Actions had a 3.3% error rate in jobs and delays on some updates during this time. This was due to an erroneous database credential change impacting write access to Issues and Pull Requests data. We mitigated the incident by reverting the credential change at 19:52 UTC. We continued to monitor service recovery before resolving the incident at 20:05 UTC. There are a few improvements we are making in response to this. We are investing in safeguards to the change management process in order to prevent erroneous database credential changes. Additionally, the initial rollback attempt was unsuccessful, which led to a longer time to mitigate. We were able to revert through an alternative method and are updating our playbooks to document this mitigation strategy.
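The summary doesn't spell out the safeguards, but one common pattern is a pre-promotion check that exercises a new credential's read and write paths before it reaches production. A minimal sketch, assuming a Postgres backend; the canary table and helper are illustrative, not the actual change management tooling:

```python
# Hypothetical pre-promotion check for a database credential change; the
# canary table and DSN handling are assumptions for illustration.
import psycopg2

def validate_credentials(dsn):
    """Fail the rollout early if the new credential cannot read and write."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")  # read path
            cur.execute(
                "INSERT INTO credential_canary (checked_at) VALUES (now())"
            )  # write path
    finally:
        conn.close()

# In a deploy pipeline, this would run before the credential is promoted:
#     validate_credentials(new_dsn)  # raises on failure, blocking the rollout
```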
Between Dec 3 03:35 UTC and 04:35 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. This was a shorter recurrence of the issue that occurred the previous day. Users would see workflows queued waiting for a large runner. On average, 13.5% of all workflows requiring large runners over the incident time were affected, peaking at 46% of requests. Standard and Mac runners were not affected. Following the Dec 1 incident, we had disabled non-critical paths in the provisioning job and believed that would eliminate any impact while we understood and addressed the timeouts. Unfortunately, the timeouts were a symptom of broader job health issues, so those changes did not prevent this second occurrence the following day. We now understand that other jobs on these agents had issues that caused them to hang and consume available job agent capacity. The reduced capacity led to saturation of the remaining agents and significant performance degradation in the running jobs. In addition to the immediate improvements shared in the previous incident summary, we immediately initiated regular recycles of all agents in this area while we continue to address the issues in both the jobs themselves and the resiliency of the agents. We also continue to improve our detection to ensure we automatically detect these delays.
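The report doesn't describe the recycling mechanism, but the underlying idea, reclaiming capacity from hung jobs instead of letting them hold an agent indefinitely, can be sketched with a simple watchdog (the function and timeout below are assumptions, not the actual agent implementation):

```python
# Hypothetical watchdog for background provisioning jobs; job_fn and the
# timeout value are illustrative. A hung job is terminated rather than
# being allowed to consume agent capacity indefinitely.
import multiprocessing

def run_with_watchdog(job_fn, args=(), timeout_s=300):
    proc = multiprocessing.Process(target=job_fn, args=args)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()  # reclaim the agent slot from a hung job
        proc.join()
        raise TimeoutError(f"job exceeded {timeout_s}s and was recycled")
    return proc.exitcode
```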
Between Dec 1 12:20 UTC and Dec 2 01:05 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. Users would see workflows queued waiting for a runner. On average, 8% of all workflows requiring large runners over the incident time were affected, peaking at 37.5% of requests. There were also lower levels of intermittent queuing on Dec 1 beginning around 3:00 UTC. Standard and Mac runners were not affected. The job failures were caused by timeouts to a dependent service in the VM provisioning flow and by gaps in the jobs’ resilience to those timeouts. The incident was mitigated by circumventing the dependency, as it was not in the critical path of VM provisioning. There are a few immediate improvements we are making in response to this. We are addressing the causes of the failed calls to improve the availability of that backend service. Even with that impact, the critical flow of large VM provisioning should not have been affected, so we are improving the client behavior to fail fast and circuit-break non-critical calls. Finally, the alerting for this service was not adequate in this scenario to ensure a fast response from our team. We are improving our automated detection to reduce our time to detection and mitigation of issues like this one in the future.
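As a rough illustration of the fail-fast and circuit-break behavior described above (the class, thresholds, and names are assumptions, not the actual client):

```python
# Minimal circuit-breaker sketch for a non-critical dependency in a
# provisioning flow; thresholds and names are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=60):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        probe = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Fail fast: don't wait on a dependency known to be unhealthy.
                raise RuntimeError("circuit open: skipping non-critical call")
            probe = True  # half-open: allow one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if probe or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

The critical provisioning path can then treat an open circuit the same as any other failed non-critical call and continue, so a degraded dependency cannot stall VM creation.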
Between 13:30 and 15:00 UTC, repository searches were timing out for most users. The ongoing work from a similar incident last week helped us uncover the main contributing factors. We have deployed short-term mitigations and identified longer-term work to proactively identify and limit resource-intensive searches.
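The mitigations aren't detailed here, but a typical short-term guardrail is a per-query time budget, so that one expensive search cannot exhaust shared capacity. A minimal sketch, with the client and its parameters assumed:

```python
# Hypothetical guardrail for resource-intensive searches; the client object
# and its interface are assumptions for illustration.
def run_search(client, query, timeout_s=5):
    try:
        return client.search(query, timeout=timeout_s)
    except TimeoutError:
        # Surface a clear "query too expensive" error instead of letting the
        # search hold resources until it times out cluster-wide.
        raise RuntimeError("search exceeded its time budget; please narrow it")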
On November 25th, 2024, between 10:38 UTC and 12:00 UTC, the Claude model for GitHub Copilot Chat experienced degraded performance. During the impact, all requests to Claude resulted in an immediate error to the user. This was due to upstream errors with one of our infrastructure providers, which have since been mitigated. We are working with our infrastructure providers to reduce time to detection and to implement additional failover options to mitigate issues like this one in the future.
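The failover options under consideration aren't specified, but client-side failover across upstream providers typically looks something like this sketch (the provider objects and their interface are assumptions, not the actual Copilot architecture):

```python
# Hypothetical failover between model providers; endpoint objects and the
# complete() method are illustrative assumptions.
def chat_completion(request, providers):
    last_error = None
    for provider in providers:  # ordered by preference
        try:
            return provider.complete(request)
        except Exception as err:  # upstream error: try the next provider
            last_error = err
    raise RuntimeError("all model providers failed") from last_error
```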
Between 2024-11-06 11:14 UTC and 2024-11-08 18:15 UTC, pull requests added to merge queues in some repositories were not processed. This was caused by a bug in a new version of the merge queue code and was mitigated by rolling back a feature flag. Around 1% of enqueued PRs were affected, with around 7% of repositories that use a merge queue impacted at some point during the incident.
Queues were impacted if their target branch had the “require status checks” setting enabled, but did not have any individual required checks configured. Our monitoring strategy only covered PRs automatically removed from the queue, which was insufficient to detect this issue.
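Purely as an illustration of how this edge case can slip through (this is a guess at the shape of the bug, not the actual merge queue code): a check written with any() silently fails when the required-checks list is empty, while all() treats an empty requirement set as satisfied.

```python
# Hypothetical illustration of the edge case: "require status checks" is
# enabled, but the list of individual required checks is empty. All names
# are assumptions, not the merge queue's actual code.
def checks_satisfied(protection, passed_checks):
    if not protection.require_status_checks:
        return True
    required = protection.required_checks  # may be an empty list
    # Buggy form: any() over an empty list is False, so PRs never advance:
    #     return any(c in passed_checks for c in required)
    # Correct form: an empty requirement set is trivially satisfied:
    return all(c in passed_checks for c in required)
```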
We are improving our monitors to cover anomalous manual queue entry removal rates, which will allow us to detect this class of issue much sooner.
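A simple version of such a monitor compares the recent manual-removal rate against a baseline; the metric source, window, and threshold below are assumptions for illustration:

```python
# Hypothetical monitor for anomalous manual removals from a merge queue;
# the window size and z-score threshold are illustrative assumptions.
def manual_removal_alert(removal_counts, baseline_mean, baseline_std, z=3.0):
    """Alert when the recent manual-removal rate is an outlier vs. baseline."""
    window = removal_counts[-10:]  # e.g. the last 10 reporting intervals
    recent = sum(window) / len(window)
    return recent > baseline_mean + z * baseline_std
```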