Incident History

Incident with Issues, Git Operations and API Requests

On March 3, 2025, between 04:07 UTC and 09:36 UTC, various GitHub services were degraded, with an average error rate of 0.03% and a peak error rate of 9%. This issue impacted web requests, API requests, and Git operations. The incident was triggered when a network node in one of GitHub's datacenter sites partially failed, resulting in silent packet drops for traffic served by that site. At 09:22 UTC, we identified the failing network node, and at 09:36 UTC we addressed the issue by removing the faulty node from production. In response to this incident, we are improving our monitoring capabilities to identify and respond to similar silent errors more effectively in the future.

1740975624 - 1740979879 Resolved

Elevated Request Latency for Write operations on github.com and api.github.com

On February 28th, 2025, between 05:49 UTC and 06:55 UTC, a newly deployed background job caused increased load on GitHub’s primary database hosts, resulting in connection pool exhaustion. This led to degraded performance, manifesting as increased latency for write operations and elevated request timeout rates across multiple services. The incident was mitigated by halting execution of the problematic background job and disabling the feature flag controlling the job execution. To prevent similar incidents in the future, we are collaborating on a plan to improve our production signals to better detect and respond to query performance issues.

1740723135 - 1740725757 Resolved

Disruption with some GitHub services

On February 27, 2025, between 11:30 UTC and 12:22 UTC, Actions experienced degraded performance, leading to delays in workflow runs. On average, 5% of Actions workflow runs were delayed by 31 minutes. The delays were caused by updates in a dependent service that led to failures in Redis connectivity in one region. We mitigated the incident by failing over the impacted service and re-routing the service’s traffic out of that region. We are working to improve our monitoring and failover processes to reduce our time to detection and mitigation of issues like this one in the future.

1740655700 - 1740658954 Resolved

Incident with Actions and Packages

On February 26, 2025, between 14:51 UTC and 17:19 UTC, GitHub Packages experienced a service degradation, leading to billing-related failures when uploading and downloading Packages. During this period, the billing usage and budget pages were also inaccessible. Initially, we reported that GitHub Actions was affected, but we later determined that the impact was limited to jobs interacting with Packages services, while jobs that did not upload or download Packages remained unaffected. The incident occurred due to an error in newly introduced code, which caused containers to get into a bad state, ultimately leading to billing API calls failing with 503 errors. We mitigated the issue by rolling back the contributing change. In response to this incident, we are enhancing error handling, improving the resiliency of our billing API calls to minimize customer impact, and improving change rollout practices to catch these potential issues prior to deployment.

1740585067 - 1740590349 Resolved

Disruption with some GitHub services

On February 25th, 2025, between 14:25 UTC and 16:44 UTC, email and web notifications experienced delivery delays. At the peak of the incident, the delay resulted in ~10% of all notifications taking over 10 minutes to be delivered, with the remaining ~90% being delivered within 5-10 minutes. This was due to insufficient capacity in worker pools as a result of increased load during peak hours. We also encountered delivery delays of up to 2.5 minutes for a small number of webhooks. We mitigated the incident by scaling out the service to meet the demand. The increase in capacity gives us extra headroom, and we are working to improve our capacity planning to prevent issues like this from occurring in the future.

1740496345 - 1740502214 Resolved

Claude 3.7 Sonnet Partially Unavailable

On February 25, 2025, between 13:40 UTC and 15:45 UTC, the Claude 3.7 Sonnet model for GitHub Copilot Chat experienced degraded performance. During the impact, occasional requests to Claude would result in an immediate error to the user. This was due to upstream errors with one of our infrastructure providers, which have since been mitigated. We are working with our infrastructure providers to reduce time to detection and implement additional failover options, to mitigate issues like this one in the future.

1740494404 - 1740498344 Resolved

Incident with Packages

On February 25, 2025, between 00:17 UTC and 01:08 UTC, GitHub Packages experienced a service degradation, leading to failures uploading and downloading packages, along with increased latency for all requests to the GitHub Packages registry. At peak impact, about 14% of uploads and downloads failed, and all Packages requests were delayed by an average of 7 seconds. The incident was caused by the rollout of a database configuration change that resulted in a degradation in database performance. We mitigated the incident by rolling back the contributing change and failing over the database. In response to this incident, we are tuning database configurations and resolving a source of deadlocks. We are also redistributing certain workloads to read replicas to reduce latency and enhance overall database performance.

1740442629 - 1740445723 Resolved

Claude 3.5 Sonnet model is unavailable in Copilot

On February 24, 2025, between 21:42 UTC and 22:14 UTC, the Claude 3.5 Sonnet model for GitHub Copilot Chat experienced degraded performance. During the impact, all requests to Claude 3.5 Sonnet would result in an immediate error to the user. This was due to a misconfiguration within one of our infrastructure providers that has since been mitigated. We are working to prevent this error from occurring in the future by implementing additional failover options. Additionally, we are updating our playbooks and alerting to reduce time to detection.

1740434787 - 1740435248 Resolved

Incident with Issues

On February 24, 2025, between 15:17 UTC and 17:08 UTC, the GitHub Issues and Pull Requests services were degraded, showing stale results on search-powered pages such as /issues and /pulls, meaning the displayed results may not have included the most recent updates. Additional features that depend on search functionality may have served stale results during this incident. There was no increase in latency for any of the services depending on search. We mitigated the incident by increasing the replica count for the workers that process background jobs related to search indexing. We are working on identifying the root cause to avoid similar incidents in the future.

1740413327 - 1740416986 Resolved

Disruption with some GitHub services

Between February 21, 2025, 12:00 UTC and February 24, 2025, 18:31 UTC, the Copilot Metrics API failed to ingest daily metrics aggregations for all customers, resulting in a failure to populate new metrics from 2025-02-21 to 2025-02-24. This failure was triggered by the metrics ingestion process timing out when querying across the event dataset. The API was functional for retrieving historical metrics prior to 2025-02-21. On Monday, February 24, 2025, at 15:00 UTC, customer support was notified of the issue, and the team deployed a fix to resolve query timeouts and ran backfills for the data from 2025-02-21 to 2025-02-23. We are working to prevent further outages by adding more alerting on timeouts, and we have further optimized all of our queries that aggregate data.

1740410249 - 1740421899 Resolved