Incident History

Intermittent GitHub Actions workflow failures

On March 21st, 2025, between 05:43 UTC and 08:49 UTC, the Actions service experienced degradation, leading to workflow run failures. During the incident, approximately 2.45% of workflow runs failed due to an infrastructure failure. This incident was caused by intermittent failures in communicating with an underlying service provider. We are working to improve our resilience to downtime in this service provider and to reduce the time to mitigate in any future recurrences.

1742538087 - 1742549663 Resolved

Incident with Codespaces

On March 21, 2025 between 01:00 UTC and 02:45 UTC, the Codespaces service was degraded and users in various regions experienced intermittent connection failures. The peak error error was 30% of connection attempts across 38% of Codespaces. This was due to a service deployment.The incident was mitigated by completing the deployment to the impacted regions. We are working with the service team to identify the cause of the connection losses and perform necessary repairs to avoid future occurrences.

1742523132 - 1742526525 Resolved

Incident with Pages

On March 20, 2025, between 19:24 UTC and 20:42 UTC the GitHub Pages experience was degraded and returned 503s for some customers. We saw an error rate of roughly 2% for Pages views, and new page builds were unable to complete successfully before timing out. This was due to replication failure at the database layer between a write destination and read destination. We mitigated the incident by redirecting reads to the same destination as writes. The error with replication occurred while in this transitory phase, as we are in the process of migrating the underlying data for Pages to new database infrastructure. Additionally our monitors failed to detect the error.We are addressing the underlying cause of the failed replication and telemetry.

1742501055 - 1742504047 Resolved

Incident with Actions: Queue Run Failures

On March 18th, 2025, between 23:20 UTC and March 19th, 2025 00:15 UTC, the Actions service experienced degradation, leading to run start delays. During the incident, about 0.3% of all workflow runs queued during the time failed to start, about 0.67% of all workflow runs were delayed by an average of 10 minutes, and about 0.16% of all workflow runs ultimately ended with an infrastructure failure. This was due to a networking issue with an underlying service provider. At 00:15 UTC the service provider mitigated their issue, and service was restored immediately for Actions. We are working to improve our resilience to downtime in this service provider to reduce the time to mitigate in any future recurrences.

1742341509 - 1742345747 Resolved

Disruption with some GitHub services

On March 18th, 2025, between 13:35 UTC and 17:45 UTC, some users of GitHub Copilot Chat in GitHub experienced intermittent failures when reading or writing messages in a thread, resulting in a degraded experience. The error rate peaked at 3% of requests to the service. This was due to an availability incident with a database provider. Around 16:15 UTC the upstream service provider mitigated their availability incident, and service was restored in the following hour.We are working to improve our failover strategy for this database to reduce the time to mitigate similar incidents in the future.

1742313535 - 1742323532 Resolved

macos-15-arm64 hosted runner queue delays

On March 18, between 13:04 and 16:55 UTC, Actions workflows relying on hosted runners using the beta MacOS 15 image experienced increased queue time waiting for available runners. An image update was pushed the previous day that included a performance reduction. The slower performance caused longer average runtimes, exhausting our available Mac capacity for this image. This was mitigated by rolling back the image update. We have updated our capacity allocation to the beta and other Mac images and are improving monitoring in our canary environments to catch this potential issue before it impacts customers.

1742310303 - 1742318101 Resolved

Incident with Issues

Between March 17, 2025, 18:05 UTC and March 18, 2025, 09:50 UTC, GitHub.com experienced intermittent failures in web and API requests. These issues affected a small percentage of users (mostly related to pull requests and issues), with a peak error rate of 0.165% across all requests.We identified a framework upgrade that caused kernel panics in our Kubernetes infrastructure as the root cause. We mitigated the incident by downgrading until we were able to disable a problematic feature. In response, we have investigated why the upgrade caused the unexpected issue, have taken steps to temporarily prevent it, and are working on longer term patch plans while improving our observability to ensure we can quickly react to similar classes of problems in the future.

1742236797 - 1742252553 Resolved

Some Actions users are seeing their workflow jobs failing to start

On March 12, 2025, between 13:28 UTC and 14:07 UTC, the Actions service experienced degradation leading to run start delays. During the incident, about 0.6% of workflow runs failed to start, 0.8% of workflow runs were delayed by an average of one hour, and 0.1% of runs ultimately ended with an infrastructure failure. The issue stemmed from connectivity problems between the Actions services and certain nodes within one of our Redis clusters. The service began recovering once connectivity to the Redis cluster was restored at 13:41 UTC. These connectivity issues are typically not a concern because we can fail over to healthier replicas. However, due to an unrelated issue, there was a replication delay at the time of the incident, and failing over would have caused a greater impact on our customers. We are working on improving our resiliency and automation processes for this infrastructure to improve the speed of diagnosing and resolving similar issues in the future.

1741786131 - 1741788432 Resolved

Incident with Actions and Pages

On March 8, 2025, between 17:16 UTC and 18:02 UTC, GitHub Actions and Pages services experienced degraded performance leading to delays in workflow runs and Pages deployments. During this time, 34% of Actions workflow runs experienced delays, and a small percentage of runs using GitHub-hosted runners failed to start. Additionally, Pages deployments for sites without a custom Actions workflow (93% of them) did not run, preventing new changes from being deployed. An unexpected data shape led to crashes in some of our pods. We mitigated the incident by excluding the affected pods and correcting the data that led to the crashes. We’ve fixed the source of the unexpected data shape and have improved the overall resilience of our service against such occurrences.

1741455910 - 1741457516 Resolved

Disruption with some GitHub services

On March 7, 2025, from 09:30 UTC to 11:07 UTC, we experienced a networking event that disrupted connectivity to our search infrastructure, impacting about 25% of search queries and indexing attempts. Searches for PRs, Issues, Actions workflow runs, Packages, Releases, and other products were impacted, resulting in failed requests or stale data. The connectivity issue self-resolved after 90 minutes. The backlog of indexing jobs was fully processed and saw recovery soon after, and queries to all indexes also saw an immediate return to normal throughput.We are working with our cloud provider to identify the root cause and are researching additional layers of redundancy to reduce customer impact in the future for issues like this one. We are also exploring mitigation strategies for faster resolution.

1741341838 - 1741346646 Resolved
⮜ Previous Next ⮞