Incident History

Disruption with some GitHub services

This incident was opened by mistake. Public services are currently functional.

1743184387 - 1743185696 Resolved

Disruption with Pull Request Ref Updates

Between March 27, 2025, 23:45 UTC and March 28, 2025, 01:40 UTC, the Pull Requests service was degraded and failed to update refs for repositories with higher traffic activity. This was due to a large repository migration that enqueued a larger than usual number of jobs while simultaneously impacting the git fileservers where the problematic repository was hosted. Failures to process those jobs triggered retries, increasing queue depth and delaying jobs not related to the migration.

We declared an incident once we confirmed that the issue was not isolated to the problematic migration and that other repositories were also failing to process ref updates. We mitigated the issue by stopping the migration and short-circuiting the remaining jobs. Additionally, we increased the worker pool for these jobs to reduce the time required to recover. As a result of this incident, we are revisiting our repository migration process and working to isolate potentially problematic migration workloads from non-migration workloads.

1743119350 - 1743126038 Resolved

[Retroactive] Incident with Migrations Submitted Via GitHub UI

Between 2025-03-23 18:10 UTC and 2025-03-24 16:10 UTC, migration jobs submitted through the GitHub UI experienced processing delays and increased failure rates. This issue only affected migrations initiated via the web interface. Migrations started through the API or the command line tool continued to function normally. We are sorry for the delayed post on githubstatus.com.

1742837211 - 1742837211 Resolved

Disruption with some GitHub services

On March 21st, 2025, between 11:45 UTC and 13:20 UTC, users were unable to interact with GitHub Copilot Chat in GitHub. The issue was caused by a recently deployed Ruby change that unintentionally overwrote a global value. This led to GitHub Copilot Chat in GitHub being misconfigured with an invalid URL, preventing it from connecting to our chat server. Other Copilot clients were not affected.

We mitigated the incident by identifying the source of the problematic query and rolling back the deployment.

We are reviewing our deployment tooling to reduce the time to mitigate similar incidents in the future. In parallel, we are improving our test coverage for this category of error to prevent similar issues from being deployed to production.

1742560814 - 1742564656 Resolved

Intermittent GitHub Actions workflow failures

On March 21st, 2025, between 05:43 UTC and 08:49 UTC, the Actions service experienced degradation, leading to workflow run failures. During the incident, approximately 2.45% of workflow runs failed due to an infrastructure failure. This incident was caused by intermittent failures in communicating with an underlying service provider. We are working to improve our resilience to downtime in this service provider and to reduce the time to mitigate in any future recurrences.

1742538087 - 1742549663 Resolved

Incident with Codespaces

On March 21, 2025, between 01:00 UTC and 02:45 UTC, the Codespaces service was degraded and users in various regions experienced intermittent connection failures. The peak error rate was 30% of connection attempts across 38% of Codespaces. This was due to a service deployment.

The incident was mitigated by completing the deployment to the impacted regions. We are working with the service team to identify the cause of the connection losses and perform the necessary repairs to avoid future occurrences.

1742523132 - 1742526525 Resolved

Incident with Pages

On March 20, 2025, between 19:24 UTC and 20:42 UTC, the GitHub Pages experience was degraded and returned 503s for some customers. We saw an error rate of roughly 2% for Pages views, and new page builds were unable to complete successfully before timing out. This was due to a replication failure at the database layer between a write destination and a read destination. We mitigated the incident by redirecting reads to the same destination as writes. The replication error occurred during a transitional phase, as we are in the process of migrating the underlying data for Pages to new database infrastructure. Additionally, our monitors failed to detect the error.

We are addressing the underlying causes of both the failed replication and the missed detection.

1742501055 - 1742504047 Resolved

Incident with Actions: Queue Run Failures

On March 18th, 2025, between 23:20 UTC and March 19th, 2025, 00:15 UTC, the Actions service experienced degradation, leading to run start delays. During the incident, about 0.3% of all workflow runs queued during that window failed to start, about 0.67% were delayed by an average of 10 minutes, and about 0.16% ultimately ended with an infrastructure failure. This was due to a networking issue with an underlying service provider. At 00:15 UTC the service provider mitigated their issue, and service for Actions was restored immediately. We are working to improve our resilience to downtime in this service provider to reduce the time to mitigate any future recurrences.

1742341509 - 1742345747 Resolved

Disruption with some GitHub services

On March 18th, 2025, between 13:35 UTC and 17:45 UTC, some users of GitHub Copilot Chat in GitHub experienced intermittent failures when reading or writing messages in a thread, resulting in a degraded experience. The error rate peaked at 3% of requests to the service. This was due to an availability incident with a database provider. Around 16:15 UTC the upstream service provider mitigated their availability incident, and service was restored in the following hour.

We are working to improve our failover strategy for this database to reduce the time to mitigate similar incidents in the future.

1742313535 - 1742323532 Resolved

macos-15-arm64 hosted runner queue delays

On March 18, 2025, between 13:04 and 16:55 UTC, Actions workflows relying on hosted runners using the beta macOS 15 image experienced increased queue times waiting for available runners. An image update pushed the previous day introduced a performance regression. The slower performance caused longer average runtimes, exhausting our available Mac capacity for this image. This was mitigated by rolling back the image update. We have updated our capacity allocation for the beta and other Mac images and are improving monitoring in our canary environments to catch this kind of issue before it impacts customers.

1742310303 - 1742318101 Resolved