Incident History

Incident with Packages

On February 25, 2025, between 00:17 UTC and 01:08 UTC, GitHub Packages experienced a service degradation, leading to failures uploading and downloading packages, along with increased latency for all requests to the GitHub Packages registry. At peak impact, about 14% of uploads and downloads failed, and all Packages requests were delayed by an average of 7 seconds. The incident was caused by the rollout of a database configuration change that degraded database performance. We mitigated the incident by rolling back the contributing change and failing over the database. In response to this incident, we are tuning database configurations and resolving a source of deadlocks. We are also redistributing certain workloads to read replicas to reduce latency and enhance overall database performance.

Claude 3.5 Sonnet model is unavailable in Copilot

On February 24, 2025, between 21:42 UTC and 22:14 UTC, the Claude 3.5 Sonnet model for GitHub Copilot Chat experienced degraded performance. During the impact window, all requests to Claude 3.5 Sonnet resulted in an immediate error to the user. This was due to a misconfiguration within one of our infrastructure providers that has since been mitigated. We are working to prevent this error from occurring in the future by implementing additional failover options. Additionally, we are updating our playbooks and alerting to reduce time to detection.

Incident with Issues

On February 24, 2025, between 15:17 UTC and 17:08 UTC, the GitHub Issues and Pull Requests services were degraded, showing stale results on search-powered pages such as /issues and /pulls, meaning the displayed results may not have included the most recent updates. Additional features that depend on search functionality may have served stale results during this incident. There was no increase in latency for any of the services depending on search. We mitigated the incident by increasing the replica count for the workers that process background jobs related to search indexing. We are working on identifying the root cause to avoid similar incidents in the future.

Disruption with some GitHub services

Between February 21, 2025, 12:00 UTC and February 24, 2025, 18:31 UTC, the Copilot Metrics API failed to ingest daily metrics aggregations for all customers, resulting in failure to populate new metrics from 2025-02-21 to 2025-02-24. This failure was triggered by the metrics ingestion process timing out when querying across the event dataset. The API remained functional for retrieving historical metrics prior to 2025-02-21. On Monday, February 24, 2025, at 15:00 UTC, customer support was notified of the issue, and the team deployed a fix to resolve query timeouts and ran backfills for the data from 2025-02-21 to 2025-02-23. We are working to prevent further outages by adding more alerting on timeouts, and we have further optimized all our queries that aggregate data.

[Retroactive] Incident with Migrations service

Between Thursday, February 13, 2025, 19:30 UTC and Friday, February 14, 2025, 08:02 UTC, the Migrations service experienced intermittent migration failures for some customers. This was caused by a code change that contained an edge case that erroneously failed some migrations.

We mitigated the incident by rolling back the code change.

We are working on improving our monitoring and deployment practices to reduce our time to detection and mitigation of issues like this one in the future.

Disruption with some GitHub services

On February 16, 2025, from 11:30 UTC to 12:44 UTC, API requests to GitHub.com experienced increased latency and failures. Around 1% of API requests failed at the peak of this incident. This outage was caused by an experimental feature that malfunctioned and generated excessive database latency. In response to this incident, the feature has been redesigned to avoid database load, which should prevent similar issues going forward.

Disruption with some GitHub services

On February 15, 2025, between 18:35 UTC and 04:15 UTC, the Codespaces service was degraded and users in various regions experienced intermittent connection failures. On average, the error rate was 50%, and it peaked at 65% of requests to the service. This was due to a service deployment. We mitigated the incident by completing the deployment to the impacted regions. The completion of this deployment should prevent future deployments of the service from negatively impacting Codespaces connectivity.

Claude Sonnet unavailable in GitHub Copilot

On February 12, 2025, between 21:30 UTC and 23:10 UTC, the Copilot service was degraded and all requests to Claude 3.5 Sonnet were failing. No other models were impacted. This was due to an issue with our upstream provider, which was detected within 12 minutes, at which point we raised the issue to our provider to remediate. GitHub is working with our provider to improve the resiliency of the service.

Incident with Git LFS and Other Requests

On February 6, 2025, between 08:40 UTC and 11:13 UTC, the GitHub REST API was degraded following the rollout of a new feature. The feature resulted in an increase in requests that saturated a cache and led to cascading failures in unrelated services. The error rate peaked at 100% of requests to the service. The incident was mitigated by increasing the memory allocated to the cache and rolling back the feature that led to the cache saturation. To prevent future incidents, we are working to reduce the time to detect a similar issue and to optimize the overall calls to the cache.

Actions Larger Runners Provisioning Delays

Between February 5, 2025, 00:34 UTC and 11:16 UTC, up to 7% of organizations using GitHub-hosted larger runners with public IP addresses had those jobs fail to start. The issue was caused by a backend migration in the public IP management system, which caused certain public IP address runners to be placed in a non-functioning state. We have improved the rollback steps for this migration to reduce the time to mitigate any future recurrences, are working to improve automated detection of this error state, and are improving the resiliency of runners to handle this error state without customer impact.
