Incident History

[Retroactive] Incident with Migrations Submitted Via GitHub UI

Between 2024-03-23 18:10 UTC and 2024-03-24 16:10 UTC, migration jobs submitted through the GitHub UI experienced processing delays and increased failure rates. This issue only affected migrations initiated via the web interface. Migrations started through the API or the command line tool continued to function normally. We are sorry for the delayed post on githubstatus.com.

Disruption with some GitHub services

On March 21st, 2025, between 11:45 UTC and 13:20 UTC, users were unable to interact with GitHub Copilot Chat in GitHub. The issue was caused by a recently deployed Ruby change that unintentionally overwrote a global value. This led to GitHub Copilot Chat in GitHub being misconfigured with an invalid URL, preventing it from connecting to our chat server. Other Copilot clients were not affected.We mitigated the incident by identifying the source of the problematic query and rolling back the deployment.We are reviewing our deployment tooling to reduce the time to mitigate similar incidents in the future. In parallel, we are also improving our test coverage for this category of error to prevent them from being deployed to production.

Intermittent GitHub Actions workflow failures

On March 21st, 2025, between 05:43 UTC and 08:49 UTC, the Actions service experienced degradation, leading to workflow run failures. During the incident, approximately 2.45% of workflow runs failed due to an infrastructure failure. This incident was caused by intermittent failures in communicating with an underlying service provider. We are working to improve our resilience to downtime in this service provider and to reduce the time to mitigate in any future recurrences.

Incident with Codespaces

On March 21, 2025 between 01:00 UTC and 02:45 UTC, the Codespaces service was degraded and users in various regions experienced intermittent connection failures. The peak error error was 30% of connection attempts across 38% of Codespaces. This was due to a service deployment.The incident was mitigated by completing the deployment to the impacted regions. We are working with the service team to identify the cause of the connection losses and perform necessary repairs to avoid future occurrences.

Incident with Pages

On March 20, 2025, between 19:24 UTC and 20:42 UTC the GitHub Pages experience was degraded and returned 503s for some customers. We saw an error rate of roughly 2% for Pages views, and new page builds were unable to complete successfully before timing out. This was due to replication failure at the database layer between a write destination and read destination. We mitigated the incident by redirecting reads to the same destination as writes. The error with replication occurred while in this transitory phase, as we are in the process of migrating the underlying data for Pages to new database infrastructure. Additionally our monitors failed to detect the error.We are addressing the underlying cause of the failed replication and telemetry.

Incident with Actions: Queue Run Failures

On March 18th, 2025, between 23:20 UTC and March 19th, 2025 00:15 UTC, the Actions service experienced degradation, leading to run start delays. During the incident, about 0.3% of all workflow runs queued during the time failed to start, about 0.67% of all workflow runs were delayed by an average of 10 minutes, and about 0.16% of all workflow runs ultimately ended with an infrastructure failure. This was due to a networking issue with an underlying service provider. At 00:15 UTC the service provider mitigated their issue, and service was restored immediately for Actions. We are working to improve our resilience to downtime in this service provider to reduce the time to mitigate in any future recurrences.

Disruption with some GitHub services

On March 18th, 2025, between 13:35 UTC and 17:45 UTC, some users of GitHub Copilot Chat in GitHub experienced intermittent failures when reading or writing messages in a thread, resulting in a degraded experience. The error rate peaked at 3% of requests to the service. This was due to an availability incident with a database provider. Around 16:15 UTC the upstream service provider mitigated their availability incident, and service was restored in the following hour.We are working to improve our failover strategy for this database to reduce the time to mitigate similar incidents in the future.

macos-15-arm64 hosted runner queue delays

On March 18, between 13:04 and 16:55 UTC, Actions workflows relying on hosted runners using the beta MacOS 15 image experienced increased queue time waiting for available runners. An image update was pushed the previous day that included a performance reduction. The slower performance caused longer average runtimes, exhausting our available Mac capacity for this image. This was mitigated by rolling back the image update. We have updated our capacity allocation to the beta and other Mac images and are improving monitoring in our canary environments to catch this potential issue before it impacts customers.

Incident with Issues

Between March 17, 2025, 18:05 UTC and March 18, 2025, 09:50 UTC, GitHub.com experienced intermittent failures in web and API requests. These issues affected a small percentage of users (mostly related to pull requests and issues), with a peak error rate of 0.165% across all requests.We identified a framework upgrade that caused kernel panics in our Kubernetes infrastructure as the root cause. We mitigated the incident by downgrading until we were able to disable a problematic feature. In response, we have investigated why the upgrade caused the unexpected issue, have taken steps to temporarily prevent it, and are working on longer term patch plans while improving our observability to ensure we can quickly react to similar classes of problems in the future.

Some Actions users are seeing their workflow jobs failing to start

On March 12, 2025, between 13:28 UTC and 14:07 UTC, the Actions service experienced degradation leading to run start delays. During the incident, about 0.6% of workflow runs failed to start, 0.8% of workflow runs were delayed by an average of one hour, and 0.1% of runs ultimately ended with an infrastructure failure. The issue stemmed from connectivity problems between the Actions services and certain nodes within one of our Redis clusters. The service began recovering once connectivity to the Redis cluster was restored at 13:41 UTC. These connectivity issues are typically not a concern because we can fail over to healthier replicas. However, due to an unrelated issue, there was a replication delay at the time of the incident, and failing over would have caused a greater impact on our customers. We are working on improving our resiliency and automation processes for this infrastructure to improve the speed of diagnosing and resolving similar issues in the future.

⮜ Previous Next ⮞