On July 18, 2024, from 22:37 UTC to 04:47 UTC, a service from one of our providers experienced degradation, causing errors in Codespaces, particularly when starting the VSCode server and installing extensions. The error rate reached nearly 100%, resulting in a global outage of Codespaces: users worldwide were unable to connect to VSCode, although clients that do not rely on the VSCode server, such as the GitHub CLI, remained functional.

We are actively working to enhance our detection and mitigation processes to improve our response time to similar issues in the future. Additionally, we are exploring ways to keep Codespaces operating in a degraded state when one of our providers encounters issues, to prevent a complete outage.
Beginning on July 18, 2024 at 22:38 UTC, network issues within an upstream provider led to degraded experiences across the Actions, Copilot, and Pages services.

Up to 50% of Actions workflow jobs were stuck in the queuing state, including Pages deployments, and users were unable to enable Actions or register self-hosted runners. This was caused by an unreachable backend resource in the Central US region. That resource is configured for geo-replication, but the replication configuration prevented resiliency when one region was unavailable. Updating the replication configuration mitigated the impact by allowing requests to succeed while one region was unavailable. By July 19 at 00:12 UTC, users saw some improvement in Actions jobs and full recovery of Pages. Standard hosted runners and self-hosted Actions workflows were healthy by 02:10 UTC, and large hosted runners fully recovered at 02:38 UTC.

Copilot requests were also impacted, with up to 2% of Copilot Chat requests and 0.5% of Copilot Completions requests resulting in errors. Chat requests were routed to other regions after 20 minutes, while Completions requests took 45 minutes to reroute.

We have identified improvements to detection to reduce the time to engage all impacted on-call teams, as well as improvements to our replication configuration and failover workflows to be more resilient to unhealthy dependencies and reduce our time to fail over and mitigate customer impact.
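The replication change itself is internal to our infrastructure, but the failover behavior it restored can be illustrated with a small sketch. Assuming a hypothetical backend exposed through per-region endpoints (the endpoint names and helper below are ours, not GitHub's actual configuration), a read path that falls back to a geo-replicated secondary keeps requests succeeding while one region is unreachable:

```python
# Hypothetical sketch: read path that falls back to a geo-replicated secondary
# when the primary region is unreachable. Endpoint names are illustrative.
import urllib.request
import urllib.error

REPLICA_ENDPOINTS = [
    "https://backend.centralus.example.com",  # primary region (assumed)
    "https://backend.eastus2.example.com",    # geo-replicated secondary (assumed)
]

def fetch_with_regional_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each regional endpoint in order; fail only if all are unreachable."""
    last_error = None
    for endpoint in REPLICA_ENDPOINTS:
        try:
            with urllib.request.urlopen(f"{endpoint}{path}", timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc  # region unavailable; try the next replica
    raise RuntimeError("all replicas unreachable") from last_error
```

The key property is that the client, not an operator, decides to move on to the next replica, which is the kind of resiliency the replication configuration was blocking during this incident.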
On July 17, 2024, between 17:56 and 18:13 UTC, the Codespaces service was degraded and 5% of codespaces failed to start after creation. Analysis showed that every failing codespace had reached the one-hour timeout allocated for starting, and that the root cause of the timeouts was the larger incident earlier in the day caused by updating GitHub's network hardware. This incident had therefore already been resolved by the mitigation of the earlier incident.

We are working to improve our incident response process to better recognize connections between incidents in the future and avoid unnecessary incident noise for customers.
On July 17, 2024, between 16:15:31 UTC and 17:06:53 UTC, various GitHub services were degraded, including Login, the GraphQL API, Issues, Pages, and Packages. On average, the error rate was 0.3% for requests to github.com and the API, and 3.0% for requests to Packages. This incident was triggered by two unrelated events:

- A planned testing event of an internal feature caused heavy load on our databases, disrupting services across GitHub.
- A network configuration change was deployed to support capacity expansion in a GitHub data center.

We partially resolved the incident by aborting the testing event at 16:17 UTC and fully resolved it by rolling back the network configuration change at 16:49 UTC. We have paused all planned capacity expansion activity within GitHub data centers until we have addressed the root cause of this incident. In addition, we are reexamining our load testing practices so they can be carried out in a safer environment, and we are making architectural changes to the feature that caused issues.
On July 16, 2024, between 00:30 UTC and 03:07 UTC, Copilot Chat was degraded and rejected all requests. The error rate was close to 100% during this period, and customers would have received errors when attempting to use Copilot Chat. The incident was triggered by routine maintenance at a service provider: when GitHub services were disconnected, they overwhelmed the dependent service during reconnection. To mitigate this kind of issue in the future, we are improving our reconnection and circuit-breaking logic toward dependent services so that we can recover from such an event seamlessly, without overwhelming the other service.
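As a rough illustration of the reconnection and circuit-breaking behavior described above (the connect() callable, thresholds, and backoff values are assumptions, not the actual service code), the sketch below pairs a simple circuit breaker with jittered exponential backoff so a fleet of reconnecting clients does not hammer the dependent service in lockstep:

```python
# Minimal sketch: reconnection with a circuit breaker and jittered backoff.
# Thresholds and the connect() callable are illustrative assumptions.
import random
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow(self) -> bool:
        # Closed, or open long enough that we probe again (half-open).
        if self.opened_at is None:
            return True
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

def reconnect(connect, breaker: CircuitBreaker, max_attempts: int = 10):
    delay = 1.0
    for _ in range(max_attempts):
        if not breaker.allow():
            time.sleep(breaker.reset_timeout)  # circuit open: back off entirely
        try:
            conn = connect()
            breaker.record_success()
            return conn
        except OSError:
            breaker.record_failure()
            # Full jitter keeps many clients from reconnecting at the same moment.
            time.sleep(random.uniform(0, delay))
            delay = min(delay * 2, 60.0)
    raise RuntimeError("could not reconnect")
```

The combination matters: backoff spreads reconnects out over time, while the breaker stops a client from retrying at all while the dependent service is known to be unhealthy.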
On July 13, 2024, between 00:01 and 19:27 UTC, the Copilot service was degraded. During this period, the Copilot code completions error rate peaked at 1.16% and the Copilot Chat error rate peaked at 63%. Between 01:00 and 02:00 UTC we were able to reroute Chat traffic to bring error rates below 6%. During the time of impact, customers would have seen delayed responses, errors, or timeouts during requests. GitHub code scanning autofix jobs were also delayed during this incident.

A resource cleanup job was scheduled by the Azure OpenAI (AOAI) service early on July 13, targeting a resource group thought to contain only unused resources. This resource group unintentionally contained critical, still-in-use resources, which were then removed. The cleanup job was halted before removing all resources in the resource group, and enough resources remained that GitHub was able to mitigate the impact while the removed resources were reconstructed.

We are working with AOAI to ensure mitigations are in place to prevent future impact. In addition, we will improve our traffic rerouting processes to reduce time to mitigate in the future.
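As a hedged sketch of what faster traffic rerouting could look like (the router, region names, sample size, and threshold below are illustrative, not GitHub's actual tooling), a weighted router can shed traffic from a region automatically once its observed error rate crosses a threshold, rather than waiting on a manual change:

```python
# Illustrative sketch: weighted routing that sheds traffic from a region once
# its observed error rate crosses a threshold. Names and numbers are assumptions.
import random

class WeightedRouter:
    def __init__(self, regions: dict[str, float], error_threshold: float = 0.05):
        self.weights = dict(regions)  # region -> routing weight
        self.error_threshold = error_threshold
        self.stats = {r: {"ok": 0, "err": 0} for r in regions}

    def pick(self) -> str:
        regions = list(self.weights)
        weights = [self.weights[r] for r in regions]
        if sum(weights) == 0:  # every region shed: fall back to uniform routing
            weights = [1.0] * len(regions)
        return random.choices(regions, weights=weights)[0]

    def record(self, region: str, ok: bool) -> None:
        self.stats[region]["ok" if ok else "err"] += 1
        total = sum(self.stats[region].values())
        if total >= 100:  # only judge a region on a reasonable sample
            error_rate = self.stats[region]["err"] / total
            if error_rate > self.error_threshold:
                self.weights[region] = 0.0  # shed traffic from the unhealthy region
            self.stats[region] = {"ok": 0, "err": 0}

# Usage example with hypothetical regions:
router = WeightedRouter({"eastus": 0.5, "westus": 0.5})
target = router.pick()
router.record(target, ok=True)
```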
On July 11, 2024, between 10:20 UTC and 14:00 UTC, Copilot Chat was degraded and experienced intermittent timeouts. This only impacted requests routed to one of our service region providers. The error rate peaked at 10% of all requests, affecting 9% of users. The cause was host upgrades in an upstream service provider. While this was a planned event, processes and tooling were not in place to anticipate and mitigate the resulting downtime. We are working to improve our processes and tooling for future planned events, as well as our escalation paths with our upstream providers.
On July 8, 2024, between 18:18 UTC and 19:11 UTC, various services relying on static assets were degraded, including user-uploaded content on github.com, access to docs.github.com and Pages sites, and downloads of Release assets and Packages. The outage primarily affected users in the vicinity of New York City, USA, due to a local CDN disruption. Service was restored without our intervention.

We are working to improve our external monitoring, which failed to detect the issue, and we will evaluate a backup mechanism to keep critical services available, such as loading assets on github.com, in the event of an outage with our CDN.
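To illustrate the kind of external monitoring improvement described above (the probe URLs, interval, failure threshold, and alert hook are assumptions rather than GitHub's actual monitoring stack), a synthetic probe running outside the serving infrastructure can detect a regional CDN disruption by fetching a few known assets and alerting on consecutive failures:

```python
# Illustrative external probe: fetch known static assets and alert when
# consecutive failures suggest a CDN outage. URLs and thresholds are assumptions.
import time
import urllib.request
import urllib.error

PROBE_URLS = [
    "https://docs.github.com/",  # docs site served behind the CDN
    "https://pages.example.com/health.txt",  # hypothetical Pages asset
]

def probe_once(url: str, timeout: float = 5.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def run_probe_loop(alert, interval: float = 60.0, failure_threshold: int = 3) -> None:
    """alert is any callable taking a message, e.g. a pager or chat hook."""
    consecutive_failures = {url: 0 for url in PROBE_URLS}
    while True:
        for url in PROBE_URLS:
            if probe_once(url):
                consecutive_failures[url] = 0
            else:
                consecutive_failures[url] += 1
                if consecutive_failures[url] >= failure_threshold:
                    alert(f"static asset probe failing: {url}")
        time.sleep(interval)
```

Running probes from several geographic locations is what makes a localized disruption like this one visible, since checks from within the healthy regions would continue to pass.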
On July 5, 2024, between 16:31 UTC and 18:08 UTC, the Webhooks service was degraded, delaying all webhook deliveries. On average, deliveries were delayed by 24 minutes, with a maximum of 71 minutes. This was caused by a configuration change to the Webhooks service, which led to unauthenticated requests being sent to the background job cluster. The configuration error was repaired and re-deploying the service resolved it. However, the redeploy created a thundering herd that overloaded the background job queue cluster, pushing its API layer to maximum capacity and resulting in timeouts for other job clients, which surfaced as increased latency for API calls.

Shortly after resolving the authentication misconfiguration, a separate issue in the background job processing service caused health probes to fail, reducing capacity in the background job API layer and magnifying the effects of the thundering herd. From 18:21 UTC to 21:14 UTC, Actions runs on PRs experienced delays of approximately 2 minutes on average and 12 minutes at maximum. A deployment of the background job processing service remediated the issue.

To reduce our time to detection, we have streamlined our dashboards and added alerting for this specific runtime behavior. Additionally, we are working to reduce the blast radius of background job incidents through better workload isolation.
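As a minimal sketch of the workload isolation mentioned above (a bulkhead pattern; the pool names and sizes are hypothetical, not the background job service's actual design), giving each workload class its own bounded worker pool keeps a surge from one client from exhausting the shared API capacity that other clients depend on:

```python
# Bulkhead sketch: per-workload worker pools so one client's backlog cannot
# starve the others. Pool names and sizes are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

class BulkheadDispatcher:
    def __init__(self, pool_sizes: dict[str, int]):
        # Each workload class gets its own bounded pool; "other" is the default.
        self.pools = {name: ThreadPoolExecutor(max_workers=n)
                      for name, n in pool_sizes.items()}

    def submit(self, workload: str, job, *args):
        # Jobs queue behind their own pool; a backlog in "webhooks" leaves
        # "actions" capacity untouched.
        pool = self.pools.get(workload, self.pools["other"])
        return pool.submit(job, *args)

# Usage example with hypothetical workload classes:
dispatcher = BulkheadDispatcher({"webhooks": 16, "actions": 16, "other": 8})
future = dispatcher.submit("webhooks", print, "deliver payload")
future.result()
```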
On July 3, 2024, between 13:34 UTC and 16:42 UTC, the GitHub documentation site was degraded and returned 500 errors for non-cached pages. On average, the error rate was 2-5% of requests to the service, peaking at 5%. This was due to an observability misconfiguration. We mitigated the incident by updating the observability configuration and redeploying the service. We are working to reduce our time to detection and mitigation of issues like this one in the future.