Disruption with some GitHub services
This incident has been resolved.
Beginning at 21:24 UTC on March 28 and lasting until 21:50 UTC, some github.com customers saw pull request tracking refs fail to update due to processing delays and increased failure rates. We did not update our public status until after we completed the rollback; the incident has been resolved. We are sorry for the delayed post on githubstatus.com.
This incident was opened by mistake. Public services are currently functional.
Between March 27, 2025, 23:45 UTC and March 28, 2025, 01:40 UTC the Pull Requests service was degraded and failed to update refs for repositories with higher traffic activity. This was caused by a large repository migration that enqueued a much larger than usual number of jobs while also impacting the git fileservers where the problematic repository was hosted. Retries of the failing jobs increased queue depth and delayed non-migration jobs. We declared an incident once we confirmed that the issue was not isolated to the problematic migration and that other repositories were also failing to process ref updates. We mitigated the issue by stopping the migration and short-circuiting the remaining jobs, and we increased the worker pool for this job type to reduce the time required to recover. As a result of this incident, we are revisiting our repository migration process and working to isolate potentially problematic migration workloads from non-migration workloads.
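For illustration only, the sketch below shows one way a job worker can short-circuit migration-sourced jobs so that their retries stop adding to queue depth and delaying other work. It is a minimal sketch assuming a simple in-process queue; the job fields, flag name, and pool size are hypothetical and not GitHub's actual implementation.

```python
import queue
import threading

MIGRATION_SHORT_CIRCUIT = True  # set once the problematic migration is stopped
MAX_ATTEMPTS = 3

def update_refs(job):
    """Placeholder for the actual ref-update work."""
    pass

def worker(jobs: queue.Queue):
    while True:
        job = jobs.get()  # job is a dict like {"repo": ..., "source": ...}
        try:
            # Short-circuit migration-sourced jobs so their retries do not
            # crowd out ref updates triggered by normal repository activity.
            if MIGRATION_SHORT_CIRCUIT and job["source"] == "migration":
                continue
            for attempt in range(MAX_ATTEMPTS):
                try:
                    update_refs(job)
                    break
                except Exception:
                    # Bounded retries: give up after MAX_ATTEMPTS instead of
                    # re-enqueueing indefinitely and growing queue depth.
                    if attempt == MAX_ATTEMPTS - 1:
                        pass
        finally:
            jobs.task_done()

def start_pool(jobs: queue.Queue, size: int = 8):
    # A larger worker pool drains the backlog faster during recovery.
    for _ in range(size):
        threading.Thread(target=worker, args=(jobs,), daemon=True).start()
```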
Between March 23, 2025, 18:10 UTC and March 24, 2025, 16:10 UTC, migration jobs submitted through the GitHub UI experienced processing delays and increased failure rates. This issue only affected migrations initiated via the web interface; migrations started through the API or the command-line tool continued to function normally. We are sorry for the delayed post on githubstatus.com.
On March 21, 2025, between 11:45 UTC and 13:20 UTC, users were unable to interact with GitHub Copilot Chat in GitHub. The issue was caused by a recently deployed Ruby change that unintentionally overwrote a global value, which misconfigured GitHub Copilot Chat in GitHub with an invalid URL and prevented it from connecting to our chat server. Other Copilot clients were not affected. We mitigated the incident by identifying the source of the problematic change and rolling back the deployment. We are reviewing our deployment tooling to reduce the time to mitigate similar incidents in the future. In parallel, we are improving our test coverage for this category of error to prevent it from being deployed to production.
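The failure mode described above, a shared global value being overwritten by an unrelated change, can be pictured with a short sketch. It is written in Python rather than Ruby purely for illustration, and the variable and function names are hypothetical, not GitHub's code.

```python
# Hypothetical illustration of the failure mode: a module-level configuration
# value that one code path accidentally reassigns, misconfiguring a client
# that reads it later.

CHAT_SERVER_URL = "https://chat.example.invalid"  # placeholder endpoint

def build_export_url(base_url: str) -> str:
    # Bug: writes to the shared global instead of using a local variable.
    global CHAT_SERVER_URL
    CHAT_SERVER_URL = base_url.rstrip("/") + "/export"
    return CHAT_SERVER_URL

def connect_chat() -> str:
    # After the buggy call above runs, this points at an invalid URL and the
    # chat client can no longer reach its server.
    return f"connecting to {CHAT_SERVER_URL}"

# A unit test asserting that CHAT_SERVER_URL is unchanged after unrelated
# code runs is the kind of coverage that catches this class of error.
```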
On March 21st, 2025, between 05:43 UTC and 08:49 UTC, the Actions service experienced degradation, leading to workflow run failures. During the incident, approximately 2.45% of workflow runs failed due to an infrastructure failure. This incident was caused by intermittent failures in communicating with an underlying service provider. We are working to improve our resilience to downtime in this service provider and to reduce the time to mitigate in any future recurrences.
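As a sketch of one common resilience pattern for intermittent upstream failures (an assumption about the general approach, not GitHub's specific remediation), a call to the provider can be retried with jittered exponential backoff before the workflow run is failed. The `call_provider` function and the thresholds below are placeholders.

```python
import random
import time

class ProviderUnavailable(Exception):
    """Raised when the underlying service provider cannot be reached."""

def call_provider():
    # Placeholder for the real call to the underlying service provider.
    raise ProviderUnavailable("intermittent failure")

def call_with_backoff(max_attempts: int = 4, base_delay: float = 0.5):
    # Retry intermittent failures with jittered exponential backoff so a
    # brief provider outage does not immediately fail the workflow run.
    for attempt in range(max_attempts):
        try:
            return call_provider()
        except ProviderUnavailable:
            if attempt == max_attempts - 1:
                raise  # surface the failure only after retries are exhausted
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
```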
On March 21, 2025, between 01:00 UTC and 02:45 UTC, the Codespaces service was degraded and users in various regions experienced intermittent connection failures. At peak, 30% of connection attempts failed, affecting 38% of Codespaces. This was due to a service deployment. The incident was mitigated by completing the deployment to the impacted regions. We are working with the service team to identify the cause of the connection losses and make the necessary repairs to avoid future occurrences.
On March 20, 2025, between 19:24 UTC and 20:42 UTC, the GitHub Pages experience was degraded and returned 503s for some customers. We saw an error rate of roughly 2% for Pages views, and new Pages builds were unable to complete before timing out. This was due to a replication failure at the database layer between a write destination and a read destination. We mitigated the incident by redirecting reads to the same destination as writes. The replication error occurred during a transitional phase, as we are in the process of migrating the underlying data for Pages to new database infrastructure. Additionally, our monitors failed to detect the error. We are addressing the underlying causes of both the failed replication and the missed detection.
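The read-redirection mitigation can be pictured with a small routing sketch. The database names, the health check, and the query below are placeholders assumed for illustration; they do not describe the actual Pages infrastructure or telemetry.

```python
# Hypothetical read-routing sketch: while replication from the write
# destination (primary) to the read destination (replica) is broken,
# route reads to the primary so Pages views are served from healthy data.

class Database:
    def __init__(self, name: str):
        self.name = name

    def query(self, sql: str) -> str:
        return f"{self.name}: {sql}"  # placeholder result

PRIMARY = Database("primary")   # write destination
REPLICA = Database("replica")   # read destination

def replication_healthy() -> bool:
    # Placeholder check, e.g. replica lag under a threshold. Wiring an alert
    # to this signal addresses the monitoring gap noted above.
    return False

def read(sql: str) -> str:
    target = REPLICA if replication_healthy() else PRIMARY
    return target.query(sql)

print(read("SELECT content FROM pages WHERE site = 'example'"))
```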