On March 16, 2026, between 14:16 UTC and 15:18 UTC, Codespaces users encountered a download failure error message when starting newly created or resumed codespaces. At peak, 96% of created or resumed codespaces were impacted. Active codespaces with a running VS Code environment were not affected. The error was the result of an API deployment issue in our VS Code remote experience dependency and was resolved by rolling back that deployment. We are working with our partners to reduce our incident engagement time, improve early detection of such issues before they impact our customers, and ensure safe rollout of similar changes in the future.
On March 13, 2026, between 13:35 UTC and 16:02 UTC, a configuration change to an internal authorization service reduced its processing capacity below what was needed during peak traffic. This caused intermittent timeouts when other GitHub services checked user permissions, resulting in four to five waves of errors over roughly two hours and forty minutes. In total, 0.4% of users were denied access to actions they were authorized to perform. The root cause was a resource right-sizing change deployed to the authorization service the previous day. It reduced CPU allocation below what was required at peak, causing the service's network gateway to throttle under load. Because the change was deployed after peak traffic on March 12, the reduced capacity wasn't surfaced until the next day's peak. The incident was mitigated by manually scaling up the authorization service and reverting the configuration change. To prevent recurrence, we are adding further resource utilization monitors across our entire stack to detect throttling and improving error handling so transient infrastructure timeouts are distinguished from authorization failures, enabling quicker detection of the root issue.
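One of the follow-ups above is distinguishing transient infrastructure timeouts from genuine authorization denials in calling services. The sketch below shows roughly what that distinction can look like in Go; it is illustrative only, and the error values and the checkPermission call are hypothetical stand-ins rather than GitHub's actual internals.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrDenied represents an explicit "not authorized" answer from the
// authorization service; anything else is treated as infrastructure trouble.
var ErrDenied = errors.New("authorization denied")

// checkPermission is a stand-in for an RPC to the authorization service.
func checkPermission(ctx context.Context, user, action string) error {
	// ... real call would go here ...
	return nil
}

// authorize separates a real denial from a transient timeout so callers can
// retry (or apply their failure policy) instead of reporting a permissions
// error to a user who is actually authorized.
func authorize(user, action string) error {
	ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
	defer cancel()

	err := checkPermission(ctx, user, action)
	switch {
	case err == nil:
		return nil
	case errors.Is(err, ErrDenied):
		return fmt.Errorf("user %s may not %s: %w", user, action, err)
	case errors.Is(err, context.DeadlineExceeded):
		// Transient infrastructure timeout: surface as a retryable error,
		// not as an authorization failure.
		return fmt.Errorf("authorization check timed out, retry: %w", err)
	default:
		return fmt.Errorf("authorization check failed: %w", err)
	}
}

func main() {
	fmt.Println(authorize("octocat", "push"))
}
```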
On March 12, 2026, between 01:00 UTC and 18:53 UTC, users saw failures downloading extensions within created or resumed codespaces. Users would see an error when attempting to use an extension within VS Code. Active codespaces with extensions already downloaded were not impacted. The extension download failures were the result of a change introduced in our extension dependency and were resolved by updating the configuration of how those changes affect requests from Codespaces. We are enhancing observability and alerting of critical issues within regular codespace operations to better detect and mitigate similar issues in the future.
On March 12, 2026, between 02:30 UTC and 06:02 UTC, some GitHub Apps were unable to mint server-to-server tokens, resulting in 401 Unauthorized errors. During the outage window, approximately 1.3% of requests incorrectly resulted in 401 errors. This manifested as GitHub Actions jobs failing to download tarballs, as well as failures minting fine-grained tokens. During this period, approximately 5% of Actions jobs were impacted. The root cause was a failure in the authentication service's token cache layer, a newly created secondary cache layer backed by Redis. Kubernetes control plane instability left the service unable to read certain tokens from that layer, which resulted in 401 errors. The mitigation was to fall back reads to the primary cache layer, which is backed by MySQL. As permanent mitigations, we have changed how we deploy Redis so that it does not rely on the Kubernetes control plane and maintains service availability during similar failure modes. We have also improved alerting to reduce overall impact time from similar failures.
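The mitigation described above, serving reads from the MySQL-backed primary when the Redis-backed secondary cannot be read, is a common fallback-read pattern. Below is a rough sketch of that pattern under stated assumptions: the TokenCache interface, error values, and stub implementations are hypothetical illustrations, not the actual authentication service code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// ErrUnavailable signals that a cache layer could not be reached at all,
// as opposed to a key simply being absent.
var ErrUnavailable = errors.New("cache layer unavailable")

// TokenCache is a minimal read interface; both cache layers implement it.
type TokenCache interface {
	Get(ctx context.Context, key string) (string, error)
}

// fallbackCache reads from the secondary (Redis-backed) layer first and
// falls back to the primary (MySQL-backed) layer when the secondary is
// unreachable, so instability in one layer does not surface as 401s.
type fallbackCache struct {
	secondary TokenCache // Redis-backed
	primary   TokenCache // MySQL-backed
}

func (c *fallbackCache) Get(ctx context.Context, key string) (string, error) {
	token, err := c.secondary.Get(ctx, key)
	if err == nil {
		return token, nil
	}
	if errors.Is(err, ErrUnavailable) {
		// Serve the read from the primary instead of failing the request.
		return c.primary.Get(ctx, key)
	}
	return "", fmt.Errorf("token lookup failed: %w", err)
}

// brokenCache simulates a secondary layer taken out by control plane issues.
type brokenCache struct{}

func (brokenCache) Get(context.Context, string) (string, error) { return "", ErrUnavailable }

// staticCache simulates the primary layer with fixed example contents.
type staticCache map[string]string

func (s staticCache) Get(_ context.Context, key string) (string, error) { return s[key], nil }

func main() {
	cache := &fallbackCache{
		secondary: brokenCache{},
		primary:   staticCache{"app-123": "example-token"},
	}
	fmt.Println(cache.Get(context.Background(), "app-123"))
}
```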
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
On March 11, 2026, between 13:00 UTC and 15:23 UTC, the Copilot Code Review service was degraded and experienced longer than average review times. On average, Copilot Code Review requests took 4 minutes, peaking at just under 8 minutes. This was due to hitting worker capacity limits and CPU throttling. We mitigated the incident by increasing partitions, and we are improving our resource monitoring to identify potential issues sooner.
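The CPU throttling mentioned above is visible directly in cgroup accounting, which is one signal improved resource monitoring can watch. The sketch below reads throttling counters from cgroup v2; it assumes Linux with cgroup v2 mounted, and the path and any alert threshold are illustrative rather than what our monitoring actually uses.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// throttleStats holds the throttling counters exposed by cgroup v2 in cpu.stat.
type throttleStats struct {
	nrPeriods   int64 // scheduling periods observed
	nrThrottled int64 // periods in which the cgroup hit its CPU quota
}

// readCPUStat parses a cgroup v2 cpu.stat file ("key value" pairs, one per line).
func readCPUStat(path string) (throttleStats, error) {
	var s throttleStats
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		v, _ := strconv.ParseInt(fields[1], 10, 64)
		switch fields[0] {
		case "nr_periods":
			s.nrPeriods = v
		case "nr_throttled":
			s.nrThrottled = v
		}
	}
	return s, nil
}

func main() {
	s, err := readCPUStat("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cpu.stat not readable:", err)
		return
	}
	if s.nrPeriods > 0 {
		// A sustained high throttled-period ratio is an early sign that a
		// service is undersized before request latency visibly degrades.
		ratio := float64(s.nrThrottled) / float64(s.nrPeriods)
		fmt.Printf("throttled in %.1f%% of CPU periods\n", 100*ratio)
	}
}
```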
On March 9, 2026, between 15:03 and 20:52 UTC, the Webhooks API was degraded, resulting in higher average request latency and, in some cases, error responses. Approximately 0.6% of total requests exceeded the normal latency threshold of 3 seconds, while 0.4% of requests resulted in 500 errors. At peak, 2.0% of requests experienced latency greater than 3 seconds and 2.8% of requests returned 500 errors. The issue was caused by a noisy actor that led to resource contention on the Webhooks API service. We mitigated the issue initially by increasing CPU resources for the Webhooks API service, and ultimately applied lower rate limiting thresholds to the noisy actor to prevent further impact to other users. We are working to improve monitoring to more quickly identify noisy traffic and will continue to improve our rate-limiting mechanisms to help prevent similar issues in the future.
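Per-actor rate limiting of the kind applied to the noisy actor can be approximated with an independent token bucket per client. The following is a minimal sketch assuming Go and the golang.org/x/time/rate package; the limits shown are made-up numbers, not the thresholds applied during the incident.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

// actorLimiter keeps an independent token bucket per actor so one noisy
// client cannot consume capacity needed by everyone else.
type actorLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	limit    rate.Limit // steady-state requests per second per actor
	burst    int        // short-term burst allowance per actor
}

func newActorLimiter(perSecond float64, burst int) *actorLimiter {
	return &actorLimiter{
		limiters: make(map[string]*rate.Limiter),
		limit:    rate.Limit(perSecond),
		burst:    burst,
	}
}

// Allow reports whether the given actor may make a request right now.
func (a *actorLimiter) Allow(actor string) bool {
	a.mu.Lock()
	l, ok := a.limiters[actor]
	if !ok {
		l = rate.NewLimiter(a.limit, a.burst)
		a.limiters[actor] = l
	}
	a.mu.Unlock()
	return l.Allow()
}

func main() {
	// Illustrative thresholds only: 10 requests/second, burst of 20.
	limiter := newActorLimiter(10, 20)
	for i := 0; i < 25; i++ {
		if !limiter.Allow("noisy-actor") {
			fmt.Println("request", i, "throttled")
		}
	}
}
```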
On March 9, 2026, between 01:23 UTC and 03:25 UTC, users attempting to create or resume codespaces in the Australia East region experienced elevated failures, peaking at a 100% failure rate for this region. Codespaces in other regions were not affected. The create and resume failures were caused by degraded network connectivity between our control plane services and the VMs hosting the codespaces. This was resolved by redirecting traffic to an alternate site within the region. While we are addressing the core network infrastructure issue, we have also improved our observability of components in this area to speed up detection. This will also enable our existing automated failovers to cover this failure mode. Together, these changes will prevent similar incidents or significantly reduce the duration of their user impact.
On March 6, 2026, between 16:16 UTC and 23:28 UTC, the Webhooks service was degraded and some users experienced intermittent errors when accessing webhook delivery histories, retrying webhook deliveries, and listing webhooks via the UI and API. On average, the error rate was 0.57%, peaking at approximately 2.73% of requests to the service. This was due to unhealthy infrastructure affecting a portion of webhook API traffic. We mitigated the incident by redeploying the affected services, after which service health returned to normal. We are working to improve detection of unhealthy infrastructure and strengthen service safeguards to reduce time to detection and mitigation for issues like this one in the future.