On August 15, 2024, between 13:14 UTC and 13:43 UTC, the Actions service was degraded, resulting in failures to start new workflow runs for customers of github.com. On average, 10% of Actions workflow runs failed to start, with the failure rate peaking at 15%. This was due to an infrastructure change that enabled a network proxy for requests between the Actions service and an internal API, which caused those requests to fail. We mitigated the incident by rolling back the change. We are working to improve our monitoring and deployment practices to reduce our time to detection and mitigation of issues like this one in the future.
On August 14, 2024, between 23:02 UTC and 23:38 UTC, all GitHub services on GitHub.com were inaccessible for all users. This was due to a configuration change that impacted traffic routing within our database infrastructure, resulting in critical services unexpectedly losing database connectivity. There was no data loss or corruption during this incident.

At 22:59 UTC, an erroneous configuration change rolled out to all GitHub.com databases that impacted the ability of the databases to respond to health check pings from the routing service. As a result, the routing service could not detect healthy databases to route application traffic to, which led to widespread impact on GitHub.com starting at 23:02 UTC.

We mitigated the incident by reverting the change and confirming restored connectivity to our databases. At 23:38 UTC, traffic resumed and all services recovered to full health. Out of an abundance of caution, we continued to monitor before resolving the incident at 00:30 UTC on August 15th, 2024.

To prevent recurrence, we are implementing additional guardrails in our database change management process. We are also prioritizing several repair items, such as faster rollback functionality and more resilience to dependency failures. Given the severity of this incident, these follow-up items are the highest-priority work for our teams at this time.
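To illustrate the failure mode described above, the sketch below shows a minimal health-check-based routing loop. It is illustrative only and uses hypothetical names (`healthCheck`, `pickHealthy`, the `db*.internal` addresses); it is not GitHub's actual routing service. The point it demonstrates is that when every database fails its health check at once, the router has no eligible backends even though the databases themselves may be healthy.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// backend is a hypothetical database endpoint the routing layer can choose from.
type backend struct {
	addr string
}

// healthCheck stands in for the routing service's ping; here it simply
// attempts a TCP connection with a short timeout.
func healthCheck(b backend) bool {
	conn, err := net.DialTimeout("tcp", b.addr, 500*time.Millisecond)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// pickHealthy returns the first backend that passes the health check.
// If a configuration change breaks the ping on every backend at once,
// this returns an error even though the databases are still serving data,
// which is the failure mode described in the incident above.
func pickHealthy(backends []backend) (backend, error) {
	for _, b := range backends {
		if healthCheck(b) {
			return b, nil
		}
	}
	return backend{}, errors.New("no healthy database backends")
}

func main() {
	backends := []backend{{addr: "db1.internal:3306"}, {addr: "db2.internal:3306"}}
	b, err := pickHealthy(backends)
	if err != nil {
		fmt.Println("routing failed:", err)
		return
	}
	fmt.Println("routing traffic to", b.addr)
}
```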
On August 13, 2024, between 13:00 UTC and 13:23 UTC, the Copilot service and some parts of the GitHub UI were degraded. This impacted about 25% of GitHub.com users. This was due to a partial rollout of a caching layer for Copilot licensing checks. During the rollout, connections to the caching layer were overwhelmed, causing the licensing checks to time out. Many pages were impacted by this failure due to a lack of resiliency to the timeouts. We mitigated the incident by reverting the rollout of the caching layer 11 minutes after initial detection, which immediately restored functionality for affected users. We are working to gracefully degrade experiences during these types of failures and to reduce dependencies across services that may cause similar failures in the future.
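As a rough illustration of the graceful degradation mentioned above, the sketch below bounds how long a request waits on a cache-backed licensing check and falls back to a conservative default on timeout instead of failing the page. All names (`licenseFromCache`, `hasCopilotLicense`) are hypothetical; this is not Copilot's actual implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// licenseFromCache stands in for a call to a caching layer; here it simulates
// an overwhelmed cache that does not answer within the request deadline.
func licenseFromCache(ctx context.Context, user string) (bool, error) {
	select {
	case <-time.After(2 * time.Second): // pretend the cache is overloaded
		return true, nil
	case <-ctx.Done():
		return false, ctx.Err()
	}
}

// hasCopilotLicense limits how long a page render waits on the cache.
// On timeout it degrades gracefully: the caller gets a safe default
// rather than an error that takes the whole page down.
func hasCopilotLicense(user string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 300*time.Millisecond)
	defer cancel()

	licensed, err := licenseFromCache(ctx, user)
	if errors.Is(err, context.DeadlineExceeded) {
		// Degraded path: hide the Copilot experience for this request only.
		return false
	}
	if err != nil {
		return false
	}
	return licensed
}

func main() {
	fmt.Println("show Copilot UI:", hasCopilotLicense("octocat"))
}
```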
On August 12, 2024, from 13:39 UTC to 14:28 UTC, some users experienced an elevated error rate of up to 0.45% from the GitHub API. Less than 5% of webhook interactions failed and less than 0.5% of Actions runs were delayed. This impact was caused by internal networking instances being insufficiently scaled. We mitigated the incident by provisioning additional instances. We are working to enhance the sizing strategy for the relevant infrastructure to prevent similar issues, and to improve monitoring and processes to reduce our time to detection and mitigation of issues like this one in the future.
Beginning at 6:52 PM UTC on August 6, 2024 and lasting until 6:59 PM UTC, some customers of github.com saw errors when navigating to a Pull Request. The error rate peaked at ~5% for logged-in users. This was due to a change that was rolled back after alerts fired during the deployment.
We did not post a public status update before the rollback was complete, and the incident is now resolved. We are sorry for the delayed post on githubstatus.com.
Beginning at 8:38 PM UTC on July 31, 2024 and lasting until 9:28 PM UTC on July 31, 2024, some customers of github.com saw errors when navigating to the Billing pages and/or when updating their payment method. These errors were caused by a degradation in one of our partner services. The partner service deployed a fix, and the Billing pages are functional again. To improve detection of such issues in the future, we will work with the partner service to identify levers we can use to get an earlier indication of issues.
On July 31st, 2024, between 00:31 UTC and 03:37 UTC, the Codespaces service was degraded and connections through non-web flows such as VS Code or the GitHub CLI were unavailable. Using Codespaces via the web portal was not impacted. This was due to a code change that resulted in authentication failures from the Codespaces public API. We mitigated the incident by reverting the change upon discovering the cause of the issue. We are working to improve testing, monitoring, and the rollout of new features to reduce our time to detection and mitigation of issues like this one in the future.
On July 30th, 2024, between 13:25 UTC and 18:15 UTC, customers using Larger Hosted Runners may have experienced extended queue times for jobs that depended on a Runner with VNet Injection enabled in a virtual network within the East US 2 region. Runners without VNet Injection, or those with VNet Injection in other regions, were not affected. The issue was caused by an outage in a third-party provider that blocked a large percentage of VM allocations in the East US 2 region. Once the underlying issue with the third-party provider was resolved, job queue times returned to normal. We are exploring support for customers to define VNet Injection Runners with VNets across multiple regions to minimize the impact of outages in a single region.