On December 4, 2024, between 18:52 UTC and 19:11 UTC, several GitHub services were degraded with an average error rate of 8%. The incident was caused by a change to a centralized authorization service that contained an unoptimized database query. This increased the overall load on a shared database cluster, with a cascading effect on multiple services, most notably repository access authorization checks. We mitigated the incident by rolling back the change at 19:07 UTC, fully recovering within 4 minutes. While this incident was caught and remedied quickly, we are implementing process improvements around recognizing and reducing the risk of changes that involve high-volume authorization checks. We are also investing in broad improvements to our safe rollout process, such as improving early detection mechanisms.
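The unoptimized query itself isn't included in this summary, but the failure mode is a familiar one: an authorization check that issues one query per resource multiplies load on a shared cluster as call volume grows. A minimal, hypothetical sketch (the schema, function names, and database handle are assumptions, not GitHub's actual code):

```python
# Hypothetical sketch of the "unoptimized query" failure mode; table,
# column, and function names are illustrative, not GitHub's actual schema.

def authorized_repo_ids_slow(db, user_id, repo_ids):
    # N+1 pattern: one round trip per repository. On a high-volume
    # authorization path, load on the shared cluster scales with len(repo_ids).
    allowed = []
    for repo_id in repo_ids:
        row = db.execute(
            "SELECT 1 FROM repo_permissions WHERE user_id = %s AND repo_id = %s",
            (user_id, repo_id),
        ).fetchone()
        if row:
            allowed.append(repo_id)
    return allowed

def authorized_repo_ids_fast(db, user_id, repo_ids):
    # Batched pattern: a single indexed query, a constant number of round
    # trips regardless of how many repositories are being checked.
    rows = db.execute(
        "SELECT repo_id FROM repo_permissions"
        " WHERE user_id = %s AND repo_id = ANY(%s)",
        (user_id, list(repo_ids)),
    ).fetchall()
    return [r[0] for r in rows]
```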
On December 3rd, between 23:29 and 23:43 UTC, Pull Requests experienced a brief outage; teams have confirmed the issue is resolved. Due to the brevity of the incident, it was not publicly statused at the time; however, an RCA will be conducted and shared in due course.
On December 3, 2024, between 19:35 UTC and 20:05 UTC, API requests, Actions, Pull Requests, and Issues were degraded. Web and API requests for Pull Requests experienced a 3.5% error rate, and Issues had a 1.2% error rate. The highest impact was for users who experienced errors while creating and commenting on Pull Requests and Issues. Actions had a 3.3% error rate in jobs and delays on some updates during this time. This was due to an erroneous database credential change impacting write access to Issues and Pull Requests data. We mitigated the incident by reverting the credential change at 19:52 UTC. We continued to monitor service recovery before resolving the incident at 20:05 UTC. There are a few improvements we are making in response to this. We are investing in safeguards to the change management process in order to prevent erroneous database credential changes. Additionally, the initial rollback attempt was unsuccessful, which led to a longer time to mitigate. We were able to revert through an alternative method and are updating our playbooks to document this mitigation strategy.
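The summary doesn't spell out the safeguards, but one common pattern is a pre-promotion check that exercises a new credential's read and write paths before it reaches production. A minimal sketch, assuming a Postgres backend; the canary table and helper are illustrative, not the actual change management tooling:

```python
# Hypothetical pre-promotion check for a database credential change; the
# canary table and DSN handling are assumptions for illustration.
import psycopg2

def validate_credentials(dsn):
    """Fail the rollout early if the new credential cannot read and write."""
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1")  # read path
            cur.execute(
                "INSERT INTO credential_canary (checked_at) VALUES (now())"
            )  # write path
    finally:
        conn.close()

# In a deploy pipeline, this would run before the credential is promoted:
#     validate_credentials(new_dsn)  # raises on failure, blocking the rollout
```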
Between Dec 3 03:35 UTC and 04:35 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. This was a shorter recurrence of the issue that occurred the previous day. Users would see workflows queued waiting for a large runner. On average, 13.5% of all workflows requiring large runners over the incident time were affected, peaking at 46% of requests. Standard and Mac runners were not affected. Following the Dec 1 incident, we had disabled non-critical paths in the provisioning job and believed that would eliminate any impact while we understood and addressed the timeouts. Unfortunately, the timeouts were a symptom of broader job health issues, so those changes did not prevent this second occurrence the following day. We now understand that other jobs on these agents had issues that caused them to hang and consume available job agent capacity. The reduced capacity led to saturation of the remaining agents and significant performance degradation in the running jobs. In addition to the immediate improvements shared in the previous incident summary, we immediately initiated regular recycles of all agents in this area while we continue to address the issues in both the jobs themselves and the resiliency of the agents. We also continue to improve our detection to ensure we automatically detect these delays.
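The report doesn't describe the recycling mechanism, but the underlying idea, reclaiming capacity from hung jobs instead of letting them hold an agent indefinitely, can be sketched with a simple watchdog (the function and timeout below are assumptions, not the actual agent implementation):

```python
# Hypothetical watchdog for background provisioning jobs; job_fn and the
# timeout value are illustrative. A hung job is terminated rather than
# being allowed to consume agent capacity indefinitely.
import multiprocessing

def run_with_watchdog(job_fn, args=(), timeout_s=300):
    proc = multiprocessing.Process(target=job_fn, args=args)
    proc.start()
    proc.join(timeout_s)
    if proc.is_alive():
        proc.terminate()  # reclaim the agent slot from a hung job
        proc.join()
        raise TimeoutError(f"job exceeded {timeout_s}s and was recycled")
    return proc.exitcode
```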
Between Dec 1 12:20 UTC and Dec 2 01:05 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. Users would see workflows queued waiting for a runner. On average, 8% of all workflows requiring large runners over the incident time were affected, peaking at 37.5% of requests. There were also lower levels of intermittent queuing on Dec 1 beginning around 3:00 UTC. Standard and Mac runners were not affected. The job failures were caused by timeouts to a dependent service in the VM provisioning flow and by gaps in the jobs’ resilience to those timeouts. The incident was mitigated by circumventing the dependency, as it was not in the critical path of VM provisioning. There are a few immediate improvements we are making in response to this. We are addressing the causes of the failed calls to improve the availability of that backend service. Even with that impact, the critical flow of large VM provisioning should not have been affected, so we are improving the client behavior to fail fast and circuit-break non-critical calls. Finally, the alerting for this service was not adequate in this scenario to ensure a fast response from our team. We are improving our automated detection to reduce our time to detection and mitigation of issues like this one in the future.
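As a rough illustration of the fail-fast and circuit-break behavior described above (the class, thresholds, and names are assumptions, not the actual client):

```python
# Minimal circuit-breaker sketch for a non-critical dependency in a
# provisioning flow; thresholds and names are illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=60):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        probe = False
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                # Fail fast: don't wait on a dependency known to be unhealthy.
                raise RuntimeError("circuit open: skipping non-critical call")
            probe = True  # half-open: allow one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if probe or self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result
```

The critical provisioning path can then treat an open circuit the same as any other failed non-critical call and continue, so a degraded dependency cannot stall VM creation.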
Between 13:30 and 15:00 UTC, repository searches were timing out for most users. The ongoing work from a similar incident last week helped us uncover the main contributing factors. We have deployed short-term mitigations and identified longer-term work to proactively identify and limit resource-intensive searches.
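The mitigations aren't detailed here, but a typical short-term guardrail is a per-query time budget, so that one expensive search cannot exhaust shared capacity. A minimal sketch, with the client and its parameters assumed:

```python
# Hypothetical guardrail for resource-intensive searches; the client object
# and its interface are assumptions for illustration.
def run_search(client, query, timeout_s=5):
    try:
        return client.search(query, timeout=timeout_s)
    except TimeoutError:
        # Surface a clear "query too expensive" error instead of letting the
        # search hold resources until it times out cluster-wide.
        raise RuntimeError("search exceeded its time budget; please narrow it")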
On November 25th, 2024, between 10:38 UTC and 12:00 UTC, the Claude model for GitHub Copilot Chat experienced degraded performance. During the impact, all requests to Claude resulted in an immediate error to the user. This was due to upstream errors with one of our infrastructure providers, which have since been mitigated. We are working with our infrastructure providers to reduce time to detection and to implement additional failover options to mitigate issues like this one in the future.
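The failover options under consideration aren't specified, but client-side failover across upstream providers typically looks something like this sketch (the provider objects and their interface are assumptions, not the actual Copilot architecture):

```python
# Hypothetical failover between model providers; endpoint objects and the
# complete() method are illustrative assumptions.
def chat_completion(request, providers):
    last_error = None
    for provider in providers:  # ordered by preference
        try:
            return provider.complete(request)
        except Exception as err:  # upstream error: try the next provider
            last_error = err
    raise RuntimeError("all model providers failed") from last_error
```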
Between 2024-11-06 11:14 UTC and 2024-11-08 18:15 UTC, pull requests added to merge queues in some repositories were not processed. This was caused by a bug in a new version of the merge queue code and was mitigated by rolling back a feature flag. Around 1% of enqueued PRs were affected, with around 7% of repositories that use a merge queue impacted at some point during the incident.
Queues were impacted if their target branch had the “require status checks” setting enabled, but did not have any individual required checks configured. Our monitoring strategy only covered PRs automatically removed from the queue, which was insufficient to detect this issue.
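Purely as an illustration of how this edge case can slip through (this is a guess at the shape of the bug, not the actual merge queue code): a check written with any() silently fails when the required-checks list is empty, while all() treats an empty requirement set as satisfied.

```python
# Hypothetical illustration of the edge case: "require status checks" is
# enabled, but the list of individual required checks is empty. All names
# are assumptions, not the merge queue's actual code.
def checks_satisfied(protection, passed_checks):
    if not protection.require_status_checks:
        return True
    required = protection.required_checks  # may be an empty list
    # Buggy form: any() over an empty list is False, so PRs never advance:
    #     return any(c in passed_checks for c in required)
    # Correct form: an empty requirement set is trivially satisfied:
    return all(c in passed_checks for c in required)
```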
We are improving our monitors to cover anomalous manual queue entry removal rates, which will allow us to detect this class of issue much sooner.
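A simple version of such a monitor compares the recent manual-removal rate against a baseline; the metric source, window, and threshold below are assumptions for illustration:

```python
# Hypothetical monitor for anomalous manual removals from a merge queue;
# the window size and z-score threshold are illustrative assumptions.
def manual_removal_alert(removal_counts, baseline_mean, baseline_std, z=3.0):
    """Alert when the recent manual-removal rate is an outlier vs. baseline."""
    window = removal_counts[-10:]  # e.g. the last 10 reporting intervals
    recent = sum(window) / len(window)
    return recent > baseline_mean + z * baseline_std
```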