Incident History

[Retroactive] Incident with some GitHub services

A component that imports external git repositories into GitHub experienced an incident caused by an improper internal configuration of a gem. We have since rolled back to a stable version, and all migrations are able to resume.

February 3, 2025 19:37 UTC (Resolved)

Incident with Pull Requests and Issues

On January 30th, 2025, from 14:22 UTC to 14:48 UTC, web requests to GitHub.com experienced failures (at peak the error rate was 44%), with the average successful request taking over 3 seconds to complete.

This outage was caused by a hardware failure in the caching layer that supports rate limiting. The impact was prolonged by a lack of automated failover for the caching layer. Following recovery, a manual failover of the primary to trusted hardware was performed to ensure the issue would not recur under similar circumstances.

As a result of this incident, we will move to a high availability cache configuration and add resilience to cache failures at this layer so that requests can still be handled should similar circumstances happen in the future.
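For illustration, here is a minimal Go sketch of the "resilience to cache failures" idea described above: a rate limiter that fails open when its backing cache is unreachable, so a cache outage does not become a site outage. The CacheClient interface, flag behavior, and names are assumptions for the sketch, not GitHub's actual implementation.

// A minimal sketch, assuming a hypothetical CacheClient backing the rate limiter.
package main

import (
	"errors"
	"fmt"
	"time"
)

// CacheClient is a hypothetical interface over the caching layer that stores
// per-client request counters.
type CacheClient interface {
	// Incr atomically increments the counter for key within the given window
	// and returns the new count.
	Incr(key string, window time.Duration) (int64, error)
}

// Limiter wraps the cache and applies a per-window request budget.
type Limiter struct {
	cache CacheClient
	limit int64
}

// Allow reports whether a request from clientID should be admitted.
// If the caching layer errors (for example, after a hardware failure),
// the limiter fails open rather than rejecting every request.
func (l *Limiter) Allow(clientID string) bool {
	count, err := l.cache.Incr("rl:"+clientID, time.Minute)
	if err != nil {
		// Cache unavailable: admit the request and surface the error to
		// monitoring instead of failing the request.
		fmt.Println("rate-limit cache unavailable, failing open:", err)
		return true
	}
	return count <= l.limit
}

// flakyCache simulates a cache whose backend has failed.
type flakyCache struct{}

func (flakyCache) Incr(string, time.Duration) (int64, error) {
	return 0, errors.New("connection refused")
}

func main() {
	l := &Limiter{cache: flakyCache{}, limit: 100}
	fmt.Println("request admitted:", l.Allow("client-123")) // true: fails open
}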

January 30, 2025 14:29 UTC - January 30, 2025 15:39 UTC (Resolved)

Disruption with some GitHub services

On January 29, 2025, between 14:00 UTC and 16:28 UTC, Copilot Chat on github.com was degraded: chat messages that included chat skills failed to save to our datastore due to a change in client-side generated identifiers. We mitigated the incident by rolling back the client-side changes. Based on this incident, we are working on better monitoring to reduce our detection time and fixing gaps in testing to prevent a repeat of incidents such as this one in the future.
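For illustration only, a minimal Go sketch of this class of failure: a save path that validates client-generated message identifiers, so a client rollout that changes the identifier format makes saves start failing until the client change is rolled back. The validation rule, function names, and identifier formats are hypothetical; the actual Copilot Chat datastore logic is not described here.

// A minimal sketch, assuming client-generated message IDs are validated server-side.
package main

import (
	"errors"
	"fmt"
	"regexp"
)

// uuidV4 is the identifier shape this hypothetical save path expects.
var uuidV4 = regexp.MustCompile(
	`^[0-9a-f]{8}-[0-9a-f]{4}-4[0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}$`)

// saveChatMessage persists a chat message keyed by a client-generated ID.
func saveChatMessage(clientID, body string) error {
	if !uuidV4.MatchString(clientID) {
		// A client change to the ID format trips this check and every
		// affected save fails until the client change is rolled back.
		return errors.New("rejected message: unexpected identifier format")
	}
	fmt.Println("saved message", clientID)
	return nil
}

func main() {
	// Old client format: accepted.
	fmt.Println(saveChatMessage("3f2b8c1e-9d4a-4e6b-8c2d-1a2b3c4d5e6f", "hi"))
	// New client format (hypothetical): rejected, mirroring the failure mode.
	fmt.Println(saveChatMessage("msg_01HXYZ", "hi"))
}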

January 29, 2025 14:52 UTC - January 29, 2025 16:30 UTC (Resolved)

Incident With Migration Service

Between Sunday 20:50 UTC and Monday 15:20 UTC, the Migrations service was unable to process migrations. This was due to an invalid infrastructure credential.

We mitigated the issue by updating the credential internally.

We will implement mechanisms and automation to detect and prevent this issue from recurring in the future.

January 28, 2025 14:43 UTC (Resolved)

Disruption with some GitHub services

On January 27th, 2025, between 23:32 UTC and 23:41 UTC, the Audit Log Streaming service experienced an approximately 9-minute delay in delivering Audit Log events. Our systems maintained data continuity and we experienced no data loss. There was no impact to the Audit Log API or the Audit Log user interface. All configured Audit Log Streaming endpoints received the relevant Audit Log events, though delayed, and normal service was restored after the incident's resolution.

January 27, 2025 23:32 UTC - January 27, 2025 23:41 UTC (Resolved)

Incident with Actions

On January 23, 2025, between 9:49 and 17:00 UTC, the available capacity of large hosted runners was degraded. On average, 26% of jobs requiring large runners had a delay of more than 5 minutes getting a runner assigned.

This was caused by the rollback of a configuration change combined with a latent bug in event processing, which was triggered by the mixed data shape that resulted from the rollback. The processing would reprocess the same events unnecessarily, causing the background job that manages large runner creation and deletion to run out of resources. The job would automatically restart and continue processing, but it was unable to keep up with production traffic. We mitigated the impact by using a feature flag to bypass the problematic event processing logic. While these changes had been rolling out in stages over the last few months and had been safely rolled back previously, an unrelated change had prevented rollback from triggering this problem in earlier stages.

We are reviewing and updating the feature flags in this event processing workflow to ensure that we have high confidence in rollback at all rollout stages. We are also improving observability of the event processing to reduce the time to diagnose and mitigate similar issues going forward. A sketch of the feature-flag mitigation pattern follows below.
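The following minimal Go sketch shows the mitigation pattern described above: guarding an expensive event-processing step behind a feature flag so operators can bypass it quickly, and skipping duplicate events rather than reprocessing them. The flag name, types, and dedup logic are illustrative assumptions, not GitHub's code.

// A minimal sketch, assuming a hypothetical feature-flag lookup and event shape.
package main

import "fmt"

// Flags is a hypothetical feature-flag lookup.
type Flags interface {
	Enabled(name string) bool
}

type staticFlags map[string]bool

func (f staticFlags) Enabled(name string) bool { return f[name] }

// Event is a simplified runner lifecycle event.
type Event struct {
	ID   string
	Kind string // e.g. "runner_created", "runner_deleted"
}

// processEvents drives runner creation/deletion bookkeeping. The expensive
// reconciliation step is guarded by a flag so it can be bypassed if it starts
// reprocessing the same events and exhausting the job's resources.
func processEvents(events []Event, flags Flags) {
	seen := make(map[string]bool)
	for _, e := range events {
		if seen[e.ID] {
			continue // skip duplicate deliveries instead of reprocessing them
		}
		seen[e.ID] = true

		if flags.Enabled("bypass_event_reconciliation") {
			fmt.Println("fast path:", e.ID, e.Kind)
			continue
		}
		fmt.Println("full reconciliation:", e.ID, e.Kind)
	}
}

func main() {
	events := []Event{
		{ID: "1", Kind: "runner_created"},
		{ID: "1", Kind: "runner_created"}, // duplicate delivery
		{ID: "2", Kind: "runner_deleted"},
	}
	// Flag on: the problematic reconciliation path is skipped, mirroring the
	// feature-flag mitigation used during the incident.
	processEvents(events, staticFlags{"bypass_event_reconciliation": true})
}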

January 23, 2025 10:25 UTC - January 23, 2025 17:27 UTC (Resolved)

Incident with Pull Request Rebase Merges

On January 16, 2025, between 00:45 UTC and 09:40 UTC, the Pull Requests service was degraded and failed to generate rebase merge commits. This was due to a configuration change that introduced disagreements between replicas. These disagreements caused a secondary job to run, triggering timeouts while computing rebase merge commits. We mitigated the incident by rolling back the configuration change. We are working on improving our monitoring and deployment practices to reduce our time to detection and mitigation of issues like this one in the future.

January 16, 2025 06:22 UTC - January 16, 2025 09:40 UTC (Resolved)

Disruption connecting to Codespaces

On January 14, 2025, between 19:13 UTC and 21:21 UTC, the Codespaces service was degraded, leading to connection failures with running codespaces; 7.6% of connections failed during the degradation. Users with bad connections could not use the impacted codespaces until they were stopped and restarted. This was caused by stale connections that were left behind after a deployment of an upstream dependency and that the Codespaces service continued to hand out to clients. The incident self-mitigated as new connections replaced the stale ones. We are coordinating to ensure connection stability during future deployments of this nature.

January 14, 2025 20:55 UTC - January 14, 2025 21:20 UTC (Resolved)

Incident with Git Operations

On January 13, 2025, between 23:35 UTC and 00:24 UTC, all Git operations were unavailable due to a configuration change causing our internal load balancer to drop requests between services that Git relies upon. We mitigated the incident by rolling back the configuration change. We are improving our monitoring and deployment practices to reduce our time to detection and automated mitigation for issues like this in the future.

January 13, 2025 23:44 UTC - January 14, 2025 00:28 UTC (Resolved)

Issues with VNet Injected Larger Hosted Runners in East US 2

On January 9, 2025, larger hosted runners configured with Azure private networking in East US 2 were degraded, causing delayed job starts for ~2,300 jobs between 16:00 and 20:00 UTC. There was also an earlier period of impact, from 22:00 UTC on January 8 to 04:10 UTC on January 9, affecting 488 jobs. Both delays were caused by an incident in East US 2 that impacted provisioning and network connectivity of Azure resources. More details on that incident are available at https://azure.status.microsoft/en-us/status/history (Tracking ID: PLP3-1W8). Because these runners rely on private networking to networks in the East US 2 region, there were no immediate mitigations available other than restoring network connectivity. Going forward, we will continue evaluating options to provide better resilience to third-party regional outages that affect private networking customers.

January 9, 2025 17:12 UTC - January 9, 2025 20:00 UTC (Resolved)