Incident History

Disruption with some GitHub services

On July 16th, 2024, between 00:30 UTC and 03:07 UTC, Copilot Chat was degraded and rejected all requests. The error rate was close to 100% during this period, and customers would have received errors when attempting to use Copilot Chat. The incident was triggered by routine maintenance at a service provider: GitHub services were disconnected and then overwhelmed the dependent service during reconnection. To mitigate this class of issue in the future, we are improving our reconnection and circuit-breaking logic to dependent services so that we can recover from this kind of event seamlessly, without overwhelming the other service.
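The circuit-breaking idea referenced above can be sketched as follows. This is a minimal illustration with hypothetical thresholds, not GitHub's actual implementation: after repeated failures the breaker stops calling the dependency for a cool-down window, giving it room to recover instead of being hammered by reconnection attempts.

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch (illustrative thresholds).

    After `failure_threshold` consecutive failures, the circuit opens and
    calls are suppressed for `reset_timeout` seconds so the dependency can
    recover instead of being overwhelmed during reconnection."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: dependency call suppressed")
            # Cool-down elapsed: allow one trial request ("half-open").
            self.opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

A real deployment would typically add per-endpoint breakers and metrics on open/close transitions, but the state machine above is the core of the pattern.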

1721091188 - 1721099268 Resolved

Incident with Copilot

On July 13, 2024, between 00:01 and 19:27 UTC, the Copilot service was degraded. During this period, the error rate for Copilot code completions peaked at 1.16%, and the Copilot Chat error rate peaked at 63%. Between 01:00 and 02:00 UTC we were able to reroute Chat traffic to bring error rates below 6%. During the time of impact, customers would have seen delayed responses, errors, or timeouts during requests. GitHub code scanning autofix jobs were also delayed during this incident. A resource cleanup job was scheduled by the Azure OpenAI (AOAI) service early on July 13th, targeting a resource group thought to contain only unused resources. That resource group unintentionally contained critical, still-in-use resources, which were then removed. The cleanup job was halted before removing all resources in the resource group, and enough resources remained that GitHub was able to mitigate while resources were reconstructed. We are working with AOAI to ensure mitigation is in place to prevent future impact. In addition, we will improve traffic rerouting processes to reduce time to mitigation in the future.

1720829903 - 1720898823 Resolved

Incident with Copilot

On July 11, 2024, between 10:20 UTC and 14:00 UTC, Copilot Chat was degraded and experienced intermittent timeouts. This only impacted requests routed to one of our service region providers. The error rate peaked at 10% of all requests, affecting 9% of users. This was due to host upgrades in an upstream service provider. While this was a planned event, processes and tooling were not in place to anticipate and mitigate the downtime. We are working to improve our processes and tooling for future planned events, as well as escalation paths with our upstream providers.

1720702930 - 1720711277 Resolved

Incident with Issues and Pages

On July 8th, 2024, between 18:18 UTC and 19:11 UTC, various services relying on static assets were degraded, including user-uploaded content on github.com, access to docs.github.com and Pages sites, and downloads of Release assets and Packages. The outage primarily affected users in the vicinity of New York City, USA, due to a local CDN disruption. Service was restored without our intervention. We are working to improve our external monitoring, which failed to detect the issue, and will evaluate a backup mechanism to keep critical services available, such as loading assets on GitHub.com, in the event of an outage with our CDN.

1720465291 - 1720467919 Resolved

Incident with Webhooks and Actions

On July 5, 2024, between 16:31 UTC and 18:08 UTC, the Webhooks service was degraded, delaying all webhook deliveries. On average, deliveries were delayed 24 minutes, with a maximum delay of 71 minutes. This was caused by a configuration change to the Webhooks service, which led to unauthenticated requests being sent to the background job cluster. Repairing the configuration error and re-deploying the service resolved the authentication failures, but the redeploy created a thundering herd that overloaded the background job queue cluster, pushing its API layer to maximum capacity. This caused timeouts for other job clients, which presented as increased latency for API calls. Shortly after resolving the authentication misconfiguration, a separate issue in the background job processing service caused health probes to fail, reducing capacity in the background job API layer and magnifying the effects of the thundering herd. From 18:21 UTC to 21:14 UTC, Actions runs on PRs experienced delays of approximately 2 minutes on average and 12 minutes at maximum. A deployment of the background job processing service remediated the issue. To reduce our time to detection, we have streamlined our dashboards and added alerting for this specific runtime behavior. Additionally, we are working to reduce the blast radius of background job incidents through better workload isolation.
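A standard way to blunt the thundering-herd effect described above is exponential backoff with full jitter: spreading retries over a random window so that a fleet of clients recovering from a shared outage does not retry in lockstep. A minimal sketch (the function name and parameter values are illustrative, not GitHub's):

```python
import random

def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff with full jitter (illustrative parameters).

    Returns a delay in seconds drawn uniformly from [0, min(cap, base * 2^attempt)].
    The randomness decorrelates retries across clients, so a mass
    reconnection event does not hit the dependency as one synchronized wave."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Without the jitter (i.e., a fixed `base * 2 ** attempt` delay), every client that failed at the same moment retries at the same moment, recreating the original overload on each retry cycle.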

1720199089 - 1720213053 Resolved

Disruption with GitHub Docs

On July 3, 2024, between 13:34 UTC and 16:42 UTC, the GitHub documentation was degraded and returned HTTP 500 errors on non-cached pages. The error rate averaged 2-5% of requests to the service and peaked at 5%. This was due to an observability misconfiguration. We mitigated the incident by updating the observability configuration and redeploying. We are working to reduce our time to detection and mitigation of issues like this one in the future.

1720020269 - 1720024803 Resolved

Disruption with GitHub services

On July 2, 2024, between 18:21 UTC and 19:24 UTC, the code search service was degraded and returned elevated rates of HTTP 500 responses, averaging 38% of code search requests. This was due to a bad deployment that caused some users' rate limit calculations to error while processing code search requests, impacting approximately 2,000 users. We mitigated the incident by rolling back the bad deployment and resetting rate limits for all users. We have identified and implemented updates to the testing of rate limit calculations to prevent this problem from happening again, and clarified deployment processes for verification before a full production rollout to minimize impact in the future.
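One defensive pattern for the failure mode above, where an erroring rate-limit calculation takes the whole request down with it, is to isolate the calculation and fail open. A minimal sketch, assuming a hypothetical `compute_remaining` helper; this illustrates the general pattern, not GitHub's implementation:

```python
import logging

def check_rate_limit(user_id, compute_remaining):
    """Defensive wrapper around a rate-limit calculation (sketch).

    `compute_remaining` is a hypothetical callable returning the user's
    remaining request budget. If the calculation itself raises, we log the
    failure and fail open (allow the request) rather than turning every
    request into an HTTP 500."""
    try:
        return compute_remaining(user_id) > 0
    except Exception:
        logging.exception("rate limit calculation failed for user %s", user_id)
        return True  # fail open: prefer availability over strict limiting
```

Whether to fail open or fail closed is a policy choice: failing open preserves availability at the cost of briefly unenforced limits, which is usually the right trade for a read-only service like search.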

1719945923 - 1719948289 Resolved

Incident with Git Operations

At approximately 19:20 UTC on July 1st, 2024, one of GitHub’s peering links to a public cloud provider began experiencing 5-20% packet loss. This resulted in intermittent network timeouts when running Git operations for customers who run their own environments with that specific provider. Investigation pointed to an issue with the physical link. At 01:14 UTC we rerouted traffic away from the problematic link to other connections to resolve the incident.

1719874792 - 1719882884 Resolved

Delays in changes to organization membership

On June 28th, 2024, at 16:06 UTC, a backend update by GitHub triggered a significant number of long-running Organization membership update jobs in our job processing system. The job queue depth rose as these update jobs consumed most of our job worker capacity, resulting in delays for other jobs across services such as Pull Requests and PR-related Actions workflows. We mitigated the impact to Pull Requests and Actions at 19:32 UTC by pausing all Organization membership update jobs. We deployed a code change at 22:30 UTC to skip over the jobs queued by the backend change and re-enabled Organization membership update jobs. We restored the Organization membership update functionality at 22:52 UTC, including all membership changes queued during the incident. During the incident, about 15% of Actions workflow runs experienced a delay of more than five minutes. In addition, Pull Requests had delays in determining merge eligibility and starting associated Actions workflows for the duration of the incident. Organization membership updates saw delays of upwards of five hours. To prevent a similar event from impacting our users in the future, we are working to improve our job management system to better manage job worker capacity, add more precise monitoring for job delays, and strengthen our testing practices to prevent recurrences.
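One common way to keep long-running jobs from starving worker capacity, as happened above, is to route known-slow job classes to a dedicated queue served by its own worker pool. A minimal sketch of such a routing rule; the job class and queue names are hypothetical, not GitHub's:

```python
# Hypothetical routing rule: known long-running job classes go to a
# separate "bulk" queue with its own worker pool, so they cannot consume
# the workers that latency-sensitive jobs (e.g. merge-eligibility checks
# for Pull Requests) depend on.
LONG_RUNNING_JOBS = {"OrgMembershipUpdateJob"}

def queue_for(job_class):
    """Pick a queue name for a job class based on its expected runtime."""
    return "bulk" if job_class in LONG_RUNNING_JOBS else "default"
```

The isolation means a backlog of slow jobs only delays other slow jobs; the default queue's workers stay free for fast, latency-sensitive work.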

1719596047 - 1719615100 Resolved

Incident with Codespaces

On June 27th, 2024, between 22:38 UTC and 23:44 UTC, some Codespaces customers in the West US region were unable to create and/or resume their Codespaces. This was due to a configuration change that affected customers with a large number of Codespace secrets defined. We mitigated the incident by reverting the change. We are working to improve monitoring and testing processes to reduce our time to detection and mitigation of issues like this one in the future.

1719531288 - 1719531861 Resolved