Between March 17, 2025, 18:05 UTC and March 18, 2025, 09:50 UTC, GitHub.com experienced intermittent failures in web and API requests. These issues affected a small percentage of users (mostly related to pull requests and issues), with a peak error rate of 0.165% across all requests. We identified a framework upgrade that caused kernel panics in our Kubernetes infrastructure as the root cause. We mitigated the incident by downgrading until we were able to disable a problematic feature. In response, we have investigated why the upgrade caused the unexpected issue, have taken steps to temporarily prevent it, and are working on longer-term patch plans while improving our observability to ensure we can quickly react to similar classes of problems in the future.
On March 12, 2025, between 13:28 UTC and 14:07 UTC, the Actions service experienced degradation leading to run start delays. During the incident, about 0.6% of workflow runs failed to start, 0.8% of workflow runs were delayed by an average of one hour, and 0.1% of runs ultimately ended with an infrastructure failure. The issue stemmed from connectivity problems between the Actions services and certain nodes within one of our Redis clusters. The service began recovering once connectivity to the Redis cluster was restored at 13:41 UTC. These connectivity issues are typically not a concern because we can fail over to healthier replicas. However, due to an unrelated issue, there was a replication delay at the time of the incident, and failing over would have caused a greater impact on our customers. We are working to improve the resiliency and automation of this infrastructure so we can diagnose and resolve similar issues more quickly in the future.
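The trade-off described here, preferring a degraded primary over failing over to a lagging replica, can be expressed as a pre-failover health check. The snippet below is a minimal sketch using the redis-py client, not GitHub's actual tooling; the host name and lag threshold are illustrative assumptions.

```python
# Illustrative sketch: only treat failover as safe when every replica is close
# to the primary's replication offset. Host name and threshold are hypothetical.
import redis

MAX_REPLICA_LAG_BYTES = 1024  # assumed threshold for "healthy enough" replicas


def failover_is_safe(primary_host: str) -> bool:
    """Return True only if all connected replicas have nearly caught up."""
    info = redis.Redis(host=primary_host, port=6379).info("replication")
    primary_offset = info["master_repl_offset"]
    replicas = [v for k, v in info.items()
                if k.startswith("slave") and isinstance(v, dict)]
    if not replicas:
        return False  # no replica to promote
    return all(primary_offset - rep["offset"] <= MAX_REPLICA_LAG_BYTES
               for rep in replicas)


if __name__ == "__main__":
    if failover_is_safe("redis-primary.internal"):  # hypothetical host
        print("Replicas are caught up; failover would be low impact.")
    else:
        print("Replication is delayed; failing over could cause greater impact.")
```

In an incident like the one above, a check of this kind would surface the replication delay and support the decision to wait for connectivity to recover rather than fail over.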
On March 8, 2025, between 17:16 UTC and 18:02 UTC, GitHub Actions and Pages services experienced degraded performance leading to delays in workflow runs and Pages deployments. During this time, 34% of Actions workflow runs experienced delays, and a small percentage of runs using GitHub-hosted runners failed to start. Additionally, Pages deployments for sites without a custom Actions workflow (93% of them) did not run, preventing new changes from being deployed. An unexpected data shape led to crashes in some of our pods. We mitigated the incident by excluding the affected pods and correcting the data that led to the crashes. We’ve fixed the source of the unexpected data shape and have improved the overall resilience of our service against such occurrences.
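As a general illustration of hardening a worker against an unexpected data shape (not GitHub's actual code; the payload fields and job structure are hypothetical), a deployment worker can validate each record up front and skip malformed ones instead of crashing:

```python
# Hypothetical sketch: validate a deployment payload's shape before processing,
# so a malformed record is logged and skipped rather than crashing the pod.
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("pages-deploy-worker")


@dataclass
class DeploymentJob:
    site: str          # assumed field names, for illustration only
    artifact_url: str


def parse_job(payload: dict) -> Optional[DeploymentJob]:
    """Return a validated job, or None if the payload has an unexpected shape."""
    site = payload.get("site")
    artifact_url = payload.get("artifact_url")
    if not isinstance(site, str) or not isinstance(artifact_url, str):
        logger.warning("Skipping malformed deployment payload: %r", payload)
        return None  # degrade gracefully instead of raising and crashing
    return DeploymentJob(site=site, artifact_url=artifact_url)
```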
On March 7, 2025, from 09:30 UTC to 11:07 UTC, we experienced a networking event that disrupted connectivity to our search infrastructure, impacting about 25% of search queries and indexing attempts. Searches for PRs, Issues, Actions workflow runs, Packages, Releases, and other products were affected, resulting in failed requests or stale data. The connectivity issue self-resolved after 90 minutes. The backlog of indexing jobs was fully processed soon after, and queries to all indexes immediately returned to normal throughput. We are working with our cloud provider to identify the root cause and are researching additional layers of redundancy to reduce customer impact from issues like this one in the future. We are also exploring mitigation strategies for faster resolution.
On March 3, 2025, between 04:07 UTC and 09:36 UTC, various GitHub services were degraded, with an average error rate of 0.03% and a peak error rate of 9%. This issue impacted web requests, API requests, and git operations. This incident was triggered because a network node in one of GitHub's datacenter sites partially failed, resulting in silent packet drops for traffic served by that site. At 09:22 UTC, we identified the failing network node, and at 09:36 UTC we addressed the issue by removing the faulty network node from production. In response to this incident, we are improving our monitoring capabilities to identify and respond to similar silent errors more effectively in the future.
On February 28, 2025, between 05:49 UTC and 06:55 UTC, a newly deployed background job caused increased load on GitHub’s primary database hosts, resulting in connection pool exhaustion. This led to degraded performance, manifesting as increased latency for write operations and elevated request timeout rates across multiple services. The incident was mitigated by halting execution of the problematic background job and disabling the feature flag controlling the job execution. To prevent similar incidents in the future, we are collaborating on a plan to improve our production signals to better detect and respond to query performance issues.
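The kill-switch pattern mentioned above, a background job gated by a feature flag so it can be halted without a deploy, looks roughly like the sketch below; the flag lookup, flag name, and job body are hypothetical stand-ins rather than GitHub's implementation.

```python
# Hypothetical sketch of a feature-flag kill switch for a background job.
import os


def feature_enabled(flag_name: str) -> bool:
    """Stand-in for a real feature-flag service; here just an environment variable."""
    return os.environ.get(flag_name, "off") == "on"


def process_record(record_id: int) -> None:
    """Placeholder for the per-record work the real job would perform."""
    print(f"processed record {record_id}")


def run_backfill_batch(batch: list[int]) -> None:
    # The flag is checked on every invocation, so disabling it halts new work
    # immediately and releases database connections without a rollback.
    if not feature_enabled("BACKFILL_JOB_ENABLED"):
        return
    for record_id in batch:
        process_record(record_id)
```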
On February 27, 2025, between 11:30 UTC and 12:22 UTC, Actions experienced degraded performance, leading to delays in workflow runs. On average, 5% of Actions workflow runs were delayed by 31 minutes. The delays were caused by updates in a dependent service that led to failures in Redis connectivity in one region. We mitigated the incident by failing over the impacted service and re-routing its traffic out of that region. We are working to improve our failover monitoring and processes to reduce time to detection and mitigation for issues like this one in the future.
On February 26, 2025, between 14:51 UTC and 17:19 UTC, GitHub Packages experienced a service degradation, leading to billing-related failures when uploading and downloading Packages. During this period, the billing usage and budget pages were also inaccessible. Initially, we reported that GitHub Actions was affected, but we later determined that the impact was limited to jobs interacting with Packages services, while jobs that did not upload or download Packages remained unaffected. The incident occurred due to an error in newly introduced code, which caused containers to get into a bad state, ultimately leading to billing API calls failing with 503 errors. We mitigated the issue by rolling back the contributing change. In response to this incident, we are enhancing error handling, improving the resiliency of our billing API calls to minimize customer impact, and improving change rollout practices to catch these potential issues prior to deployment.
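A common way to make clients more resilient to transient 503s like these is to retry with backoff rather than fail immediately. The sketch below shows that pattern with the requests library; the endpoint URL and retry budget are illustrative assumptions, not GitHub's actual billing API.

```python
# Illustrative sketch: retry transient 503s from a billing endpoint with backoff.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def billing_session() -> requests.Session:
    """Session that retries 503 responses a few times with exponential backoff."""
    retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[503],
                  allowed_methods=["GET", "POST"])
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


if __name__ == "__main__":
    # Hypothetical endpoint; a brief 503 blip is absorbed by the retries above.
    response = billing_session().get("https://billing.internal.example/usage")
    response.raise_for_status()
```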
On February 25, 2025, between 14:25 UTC and 16:44 UTC, email and web notifications experienced delivery delays. At the peak of the incident, ~10% of all notifications took over 10 minutes to be delivered, with the remaining ~90% delivered within 5-10 minutes. This was due to insufficient capacity in worker pools as a result of increased load during peak hours. We also encountered delivery delays for a small number of webhooks, with deliveries delayed by up to 2.5 minutes. We mitigated the incident by scaling out the service to meet demand. The increase in capacity gives us extra headroom, and we are working to improve our capacity planning to prevent issues like this from occurring in the future.
On February 25, 2025, between 13:40 UTC and 15:45 UTC, the Claude 3.7 Sonnet model for GitHub Copilot Chat experienced degraded performance. During the impact, occasional requests to Claude would result in an immediate error to the user. This was due to upstream errors with one of our infrastructure providers, which have since been mitigated. We are working with our infrastructure providers to reduce time to detection and implement additional failover options, to mitigate issues like this one in the future.