Between Thursday 13th, 2025 19:30 UTC and Friday 14th, 2025 08:02 UTC, the Migrations service experienced intermittent migration failures for some customers. This was caused by a code change containing an edge case that erroneously failed some migrations.
We mitigated the incident by rolling back the code change.
We are working on improving our monitoring and deployment practices to reduce our time to detect and mitigate issues like this one in the future.
On February 16th, 2025 from 11:30 UTC to 12:44 UTC, API requests to GitHub.com experienced increased latency and failures. Around 1% of API requests failed at the peak of this incident. This outage was caused by an experimental feature that malfunctioned and generated excessive database latency. In response to this incident, the feature has been redesigned to avoid database load, which should prevent similar issues going forward.
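As a rough illustration of the mitigation described above, the sketch below shows one common way to keep an experimental feature from adding per-request database load: gate it behind a flag and serve its results from a short-lived in-process cache. The function names, flag name, and TTL are assumptions for illustration, not GitHub's actual implementation.

```python
import time

_CACHE: dict[str, tuple[float, dict]] = {}   # key -> (expires_at, value)
_TTL_SECONDS = 60                            # assumption: a minute of staleness is acceptable

def feature_enabled(flag: str) -> bool:
    """Stand-in for a feature-flag check; the flag name is hypothetical."""
    return flag == "experimental-summary"

def fetch_summary_from_db(key: str) -> dict:
    """Stand-in for the expensive database query the feature originally ran per request."""
    return {"key": key, "generated_at": time.time()}

def get_summary(key: str) -> dict | None:
    if not feature_enabled("experimental-summary"):
        return None                           # feature off: no extra load at all
    now = time.time()
    cached = _CACHE.get(key)
    if cached and cached[0] > now:
        return cached[1]                      # cache hit: no database call
    value = fetch_summary_from_db(key)        # cache miss: at most one query per TTL window
    _CACHE[key] = (now + _TTL_SECONDS, value)
    return value
```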
On February 15, 2025, between 18:35 UTC and 04:15 UTC the following day, the Codespaces service was degraded and users in various regions experienced intermittent connection failures. On average, the error rate was 50%, peaking at 65% of requests to the service. This was due to a service deployment. We mitigated the incident by completing the deployment to the impacted regions. The completion of this deployment should prevent future deployments of the service from negatively impacting Codespace connectivity.
On February 12th, 2025, between 21:30 UTC and 23:10 UTC, the Copilot service was degraded and all requests to Claude 3.5 Sonnet were failing. No other models were impacted. This was due to an issue with our upstream provider, which we detected within 12 minutes and escalated to the provider to remediate. GitHub is working with our provider to improve the resiliency of the service.
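The 12-minute detection mentioned above depends on per-model monitoring. Below is a minimal, hypothetical sketch of that kind of alerting: record recent request outcomes per model and alert when a single model's failure rate crosses a threshold within a sliding window. The window length, threshold, and model name here are assumptions, not GitHub's real configuration.

```python
import time
from collections import deque

WINDOW_SECONDS = 300          # assumption: look at the last 5 minutes of requests
ERROR_RATE_THRESHOLD = 0.5    # assumption: alert when half of recent requests fail

class ModelErrorMonitor:
    def __init__(self) -> None:
        self._events: dict[str, deque] = {}   # model -> deque of (timestamp, ok)

    def record(self, model: str, ok: bool) -> None:
        q = self._events.setdefault(model, deque())
        now = time.time()
        q.append((now, ok))
        while q and q[0][0] < now - WINDOW_SECONDS:
            q.popleft()                        # drop outcomes outside the window

    def should_alert(self, model: str) -> bool:
        q = self._events.get(model)
        if not q:
            return False
        failures = sum(1 for _, ok in q if not ok)
        return failures / len(q) >= ERROR_RATE_THRESHOLD

monitor = ModelErrorMonitor()
monitor.record("claude-3.5-sonnet", ok=False)
monitor.record("claude-3.5-sonnet", ok=False)
print(monitor.should_alert("claude-3.5-sonnet"))   # True -> page the on-call
```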
On February 6, 2025, between 08:40 UTC and 11:13 UTC, the GitHub REST API was degraded following the rollout of a new feature. The feature resulted in an increase in requests that saturated a cache and led to cascading failures in unrelated services. The error rate peaked at 100% of requests to the service. The incident was mitigated by increasing the memory allocated to the cache and rolling back the feature that led to the cache saturation. To prevent future incidents, we are working to reduce the time to detect similar issues and to optimize the overall calls to the cache.
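One general pattern for keeping a saturated shared cache from cascading into unrelated services is to wrap cache lookups in a circuit breaker so callers fall back to the origin instead of queueing on an unhealthy cache. The sketch below illustrates only that pattern; the thresholds and the cache/origin callables are assumptions, and this is not GitHub's implementation.

```python
import time

class CacheCircuitBreaker:
    """Skip the cache for a cooldown period after repeated cache errors."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.open_until = 0.0                  # while "open", bypass the cache entirely

    def get(self, key, cache_get, origin_get):
        if time.time() < self.open_until:
            return origin_get(key)             # breaker open: go straight to the origin
        try:
            value = cache_get(key)
            self.failures = 0                  # a healthy response resets the failure count
            if value is not None:
                return value
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open_until = time.time() + self.cooldown
        return origin_get(key)                 # miss or error: fall back to the origin
```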
Between February 5, 2025 00:34 UTC and 11:16 UTC, up to 7% of organizations using GitHub-hosted larger runners with public IP addresses had jobs fail to start during the impact window. The issue was caused by a backend migration in the public IP management system, which left certain public IP runners in a non-functioning state. We have improved the rollback steps for this migration to reduce the time to mitigate any future recurrences, are working to improve automated detection of this error state, and are improving the resiliency of runners so they can handle this error state without customer impact.
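To make the planned automated detection concrete, here is a purely hypothetical sketch of scanning a runner fleet for the non-functioning public-IP state and recycling affected runners before customers notice. The RunnerState fields and the recycle step are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunnerState:
    runner_id: str
    has_public_ip_assigned: bool   # runner is configured to use a public IP
    ip_backend_healthy: bool       # the IP management backend reports a usable address

def find_nonfunctional(runners: list[RunnerState]) -> list[RunnerState]:
    """Runners that expect a public IP but whose IP backend is in a bad state."""
    return [r for r in runners if r.has_public_ip_assigned and not r.ip_backend_healthy]

def recycle(runner: RunnerState) -> None:
    print(f"recycling runner {runner.runner_id}")   # stand-in for real remediation

if __name__ == "__main__":
    fleet = [
        RunnerState("runner-a", True, True),
        RunnerState("runner-b", True, False),   # stuck: jobs assigned here would fail to start
    ]
    for r in find_nonfunctional(fleet):
        recycle(r)
```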
A component that imports external git repositories into GitHub experienced an incident caused by an improper internal configuration of a gem. We have since rolled back to a stable version, and all migrations are able to resume.
On January 30th, 2025 from 14:22 UTC to 14:48 UTC, web requests to GitHub.com experienced failures (at peak the error rate was 44%), with the average successful request taking over 3 seconds to complete. This outage was caused by a hardware failure in the caching layer that supports rate limiting. In addition, the impact was prolonged due to a lack of automated failover for the caching layer. A manual failover of the primary to trusted hardware was performed following recovery to ensure that the issue would not recur under similar circumstances. As a result of this incident, we will be moving to a high availability cache configuration and adding resilience to cache failures at this layer to ensure requests can still be handled should similar circumstances happen in the future.
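A minimal sketch, assuming a Redis-style counter behind a callable, of what "resilience to cache failures at this layer" can look like: if the rate-limit cache is unreachable, fail open and serve the request rather than returning errors. This is illustrative only and not GitHub's actual rate limiter.

```python
def allow_request(client_id: str, counter_incr, limit: int = 5000) -> bool:
    """Return True if the request should be served.

    counter_incr is assumed to increment and return the client's request
    counter (e.g. an INCR against the rate-limit cache) and to raise
    ConnectionError when the cache node is unreachable.
    """
    try:
        count = counter_incr(client_id)
    except ConnectionError:
        return True                # cache down: fail open instead of failing the request
    return count <= limit
```

With a healthy cache this enforces the limit as usual; with an unreachable cache, requests keep flowing while the cache is failed over.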
On January 29, 2025, between 14:00 UTC and 16:28 UTC, Copilot Chat on github.com was degraded: chat messages that included chat skills failed to save to our datastore due to a change in client-side generated identifiers. We mitigated the incident by rolling back the client-side changes. Based on this incident, we are working on better monitoring to reduce our detection time and fixing gaps in testing to prevent a repeat of incidents such as this one in the future.
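Purely as an illustration of defending the save path against client-side identifier changes, the sketch below validates a client-generated ID and falls back to a server-generated one instead of failing the write. The UUID format is an assumption; the summary above does not state the actual identifier scheme.

```python
import uuid

def normalize_message_id(client_id: str | None) -> str:
    """Accept a well-formed client-generated ID, otherwise mint one server-side."""
    try:
        return str(uuid.UUID(client_id))
    except (TypeError, ValueError):
        return str(uuid.uuid4())       # malformed or missing ID: do not fail the save

print(normalize_message_id("not-a-uuid"))   # falls back to a fresh server-side UUID
```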
Between Sunday 20:50 UTC and Monday 15:20 UTC the Migrations service was unable to process migrations. This was due to an invalid infrastructure credential.
We mitigated the issue by updating the credential internally.
We will implement mechanisms and automation to detect and prevent this issue from recurring in the future.
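As one example of the kind of automation described above, the hedged sketch below flags infrastructure credentials that are expired or close to expiring so they can be rotated before a dependent service stops working. The credential names and 14-day lead time are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def credentials_needing_rotation(credentials: dict[str, datetime],
                                 lead_time: timedelta = timedelta(days=14)) -> list[str]:
    """Return credential names that expire within lead_time (or are already expired)."""
    now = datetime.now(timezone.utc)
    return [name for name, expires_at in credentials.items() if expires_at <= now + lead_time]

creds = {
    "migrations-storage-token": datetime.now(timezone.utc) + timedelta(days=3),   # hypothetical name
    "healthy-credential": datetime.now(timezone.utc) + timedelta(days=90),
}
print(credentials_needing_rotation(creds))   # ['migrations-storage-token']
```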