Errors starting and connecting to Codespaces
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
On March 12, 2026, between 01:00 UTC and 18:53 UTC, users saw failures downloading extensions within created or resumed codespaces. Users would see an error when attempting to use an extension within VS Code. Active codespaces with extensions already downloaded were not impacted. The extension download failures resulted from a change introduced in our extension dependency and were resolved by updating the configuration of how those changes affect requests from Codespaces. We are enhancing observability and alerting for critical issues within regular codespace operations to better detect and mitigate similar issues in the future.
On March 12, 2026, between 02:30 UTC and 06:02 UTC, some GitHub Apps were unable to mint server-to-server tokens, resulting in 401 Unauthorized errors. During the outage window, approximately 1.3% of requests incorrectly resulted in 401 errors. This manifested as GitHub Actions jobs failing to download tarballs and failing to mint fine-grained tokens. During this period, approximately 5% of Actions jobs were impacted. The root cause was a failure in the authentication service’s token cache layer, a newly created secondary cache layer backed by Redis. Kubernetes control plane instability caused this layer to fail, leaving the service unable to read certain tokens, which resulted in 401 errors. The mitigation was to fall back reads to the primary cache layer backed by MySQL. As permanent mitigations, we have changed how we deploy Redis so that it does not rely on the Kubernetes control plane and maintains service availability during similar failure modes. We also improved alerting to reduce overall impact time from similar failures.
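The fallback-read mitigation described above can be illustrated with a minimal sketch. This is not GitHub's implementation: the function name `read_token`, the `CacheUnavailable` exception, and the two lookup callables are hypothetical names chosen for illustration, and the sketch assumes that both an unavailable secondary layer and a miss in it should defer to the primary layer.

```python
from typing import Callable, Optional


class CacheUnavailable(Exception):
    """Hypothetical error raised when a cache layer cannot serve reads."""


def read_token(
    token_id: str,
    secondary_get: Callable[[str], Optional[str]],
    primary_get: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Look up a token in the secondary layer, falling back to the primary.

    Assumption: a secondary-layer failure or miss is never treated as
    "token invalid"; only the primary layer's answer is authoritative.
    """
    try:
        value = secondary_get(token_id)
        if value is not None:
            return value  # served from the secondary cache layer
    except CacheUnavailable:
        pass  # secondary layer is down; defer to the primary layer
    return primary_get(token_id)


if __name__ == "__main__":
    # Stand-ins for the two layers: a dict as the primary store and a
    # secondary lookup that always fails, as during an outage.
    primary_store = {"tok-1": "valid"}

    def failing_secondary(token_id: str) -> Optional[str]:
        raise CacheUnavailable("secondary cache layer unreachable")

    print(read_token("tok-1", failing_secondary, primary_store.get))
```

With this shape, an outage in the secondary layer costs extra latency on the primary lookup rather than producing an incorrect authorization failure.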
On March 11, 2026, between 13:00 UTC and 15:23 UTC, the Copilot Code Review service was degraded and experienced longer-than-average review times. On average, Copilot Code Review requests took 4 minutes, peaking at just under 8 minutes. This was due to hitting worker capacity limits and CPU throttling. We mitigated the incident by increasing partitions, and we are improving our resource monitoring to identify potential issues sooner.
On March 9, 2026, between 01:23 UTC and 03:25 UTC, users attempting to create or resume codespaces in the Australia East region experienced elevated failures, peaking at a 100% failure rate for this region. Codespaces in other regions were not affected. The create and resume failures were caused by degraded network connectivity between our control plane services and the VMs hosting the codespaces. This was resolved by redirecting traffic to an alternate site within the region. While we are addressing the core network infrastructure issue, we have also improved our observability of components in this area to improve detection. This will also enable our existing automated failovers to cover this failure mode. These changes will prevent similar incidents or significantly reduce how long they impact users.
On March 6, 2026, between 16:16 UTC and 23:28 UTC, the Webhooks service was degraded and some users experienced intermittent errors when accessing webhook delivery histories, retrying webhook deliveries, and listing webhooks via the UI and API. On average, the error rate was 0.57%, peaking at approximately 2.73% of requests to the service. This was due to unhealthy infrastructure affecting a portion of webhook API traffic. We mitigated the incident by redeploying affected services, after which service health returned to normal. We are working to improve detection of unhealthy infrastructure and strengthen service safeguards to reduce time to detection and mitigation of issues like this one in the future.