CCR and CCA failing to start for PR comments
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
On May 6, 2026, between 11:02 UTC and 11:13 UTC, users were unable to start or view Copilot Cloud Agent or remote sessions. During this time, requests to the session API returned errors, preventing users from creating new sessions or viewing existing ones. The issue was caused by a configuration change to the service's network routing that inadvertently removed the ingress path for the service. The team reverted the change at 11:13 UTC, which restored service. The incident remained open until 11:59 UTC while the team verified full recovery. We are taking steps to improve our deployment validation process to prevent similar configuration changes from impacting production traffic in the future.
On April 28, 2026, at approximately 14:07 UTC, GitHub received reports that pull requests were missing from search results across global and repository /pulls pages. The issue was caused by a manually invoked repair job intended for a single repository, which was executed without the required safety flags. During execution of the repair job, the database query remained correctly scoped to the repo’s PR IDs. However, the Elasticsearch reconciliation logic did not apply the same scope. It interpreted the min and max PR IDs as a continuous range, causing unrelated PR documents across other repos to be marked for deletion. This resulted in the removal of 1,789,756,838 PR documents from the search index, approximately 49% of indexed PR documents.

Customer impact was limited to PR search and list discoverability. Primary storage was unaffected, and there was no impact to opening, updating, or merging PRs. The issue was identified ~10 minutes after initial customer reports. Because it affected search index completeness rather than service availability, it was not caught by existing monitoring.

The root cause was a flaw in the search document repair framework: it allowed a scoped reconciliation to run without enforcing a matching Elasticsearch query scope. This created a destructive mismatch between the source-of-truth and the index. The issue was compounded by the ability to trigger the job from the production console without safety defaults. Prior testing focused only on safe backfill scenarios and did not cover this reconciliation path. Additionally, there was no automated detection for large-volume deletions in Elasticsearch.
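The scope mismatch described above can be illustrated with a minimal sketch. This is not GitHub's actual reconciliation code; the function names and data shapes are hypothetical, and it only shows how interpreting a repo's min and max PR IDs as a continuous range sweeps in unrelated documents:

```python
def ids_to_delete_buggy(repo_pr_ids, indexed_ids):
    """Buggy behavior: treats the repo's min/max PR IDs as a continuous
    range, so indexed documents belonging to other repos that happen to
    fall inside the range are marked for deletion."""
    lo, hi = min(repo_pr_ids), max(repo_pr_ids)
    return {i for i in indexed_ids if lo <= i <= hi and i not in repo_pr_ids}


def ids_to_delete_scoped(repo_pr_ids, indexed_repo_ids):
    """Fixed behavior: the index-side query is scoped to the same repo as
    the source-of-truth query, so only that repo's documents are ever
    deletion candidates."""
    return {i for i in indexed_repo_ids if i not in repo_pr_ids}


# The target repo owns PR IDs 100 and 500; IDs 200-400 belong to other repos.
repo_ids = {100, 500}
index = {100, 200, 300, 400, 500}

print(ids_to_delete_buggy(repo_ids, index))        # unrelated docs swept in
print(ids_to_delete_scoped(repo_ids, {100, 500}))  # nothing to delete
```

The fix is to derive the index-side query from the same scope as the database query, rather than from aggregate properties (min/max) of its results.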
We mitigated the incident through three parallel actions:

(1) Deployed a MySQL-backed search fallback for the most active repos by traffic to restore PR visibility for highly impacted users.
(2) Initiated a snapshot restore and reindex process to repopulate missing pull request documents in Elasticsearch.
(3) Added a degradation notice on PR pages to inform users of incomplete search results while recovery was in progress.

The incident was resolved on May 1, 2026 at 04:15 UTC, following completion and validation of the reindex process. To prevent recurrence, we are prioritizing improvements to the repair framework and safeguards. These include enforcing scoped query alignment between primary storage and Elasticsearch, preventing destructive operations without explicit opt-in, strengthening guardrails for manual repair jobs, and evaluating restrictions on production console access. In parallel, we are expanding automated test coverage for reconciliation safety invariants and introducing detection for anomalous deletion patterns in Elasticsearch so similar issues can be identified or blocked earlier. We are committed to improving the safety and reliability of our repair systems and ensuring that operational workflows are resilient to both software defects and manual invocation risks.
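One of the planned safeguards, detection of anomalous deletion volume, can be sketched as a simple pre-flight check. This is an illustrative assumption about how such a guard might work, not GitHub's implementation; the threshold value is hypothetical:

```python
def deletion_exceeds_budget(total_docs, docs_to_delete, max_fraction=0.01):
    """Return True when a reconciliation run would delete more than
    max_fraction of the index. Such runs would be blocked unless an
    operator explicitly opts in."""
    if total_docs == 0:
        return False
    return docs_to_delete / total_docs > max_fraction


# The incident deleted ~49% of roughly 3.65B indexed PR documents; any
# reasonable budget would have tripped long before that point.
print(deletion_exceeds_budget(3_650_000_000, 1_789_756_838))  # True
print(deletion_exceeds_budget(3_650_000_000, 1_000))          # False
```

Pairing a check like this with explicit opt-in for destructive operations converts a silent mass deletion into a hard stop that an operator must consciously override.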
On April 28, 2026, from approximately 12:41 UTC to 17:09 UTC, GitHub Actions jobs using Standard Ubuntu 22 and Ubuntu 24 hosted runners experienced run start delays. Approximately 8% of hosted runner jobs using Ubuntu 22 and Ubuntu 24 experienced delays greater than 5 minutes, or failures. Larger and self-hosted runners were not impacted. This was caused by a performance regression introduced in the VM reimage process. That reimage delay lowered the overall capacity of runners available to pick up new jobs. This was mitigated with a rollback to a known good image version. We are addressing the core issue with reimage performance and improving the granularity of reimage telemetry across our services and our compute provider to more quickly diagnose similar issues in the future. Finally, we are evaluating other rollout changes to automatically detect similar regressions.
On April 22, 2026, from 18:49 to 19:32 UTC, the Copilot Cloud Agent service began failing during session execution for users running the Agent HQ Codex agent. Codex agent sessions failed to start for all entry points (issue assignment, @copilot comment mentions). 0.5% of total Copilot Cloud Agent jobs were impacted (~2,000 failed jobs). Copilot and other agent sessions were unaffected. This was caused by a model resolution mismatch in Codex agent sessions, resulting in an incompatible model being used at runtime. A mitigation was deployed to select a stable default model for Codex agent sessions. We are working to harden the underlying model-resolution path so it correctly scopes to the requesting agent's supported models, to prevent similar failure modes in the future.
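The hardening described above, scoping model resolution to the requesting agent's supported models with a stable default fallback, can be sketched as follows. All agent names and model identifiers here are hypothetical placeholders, not GitHub's actual model catalog:

```python
# Hypothetical per-agent model catalog; the first entry is the stable default.
SUPPORTED_MODELS = {
    "codex": ["codex-stable"],
    "copilot": ["copilot-stable", "copilot-preview"],
}


def resolve_model(agent, requested=None):
    """Resolve a model strictly within the requesting agent's scope.

    If the requested model is not in the agent's supported list (e.g. it
    was resolved from another agent's scope), fall back to the agent's
    stable default rather than running with an incompatible model."""
    models = SUPPORTED_MODELS.get(agent, [])
    if requested in models:
        return requested
    return models[0] if models else None


# A Codex session asking for a Copilot-scoped model gets Codex's default.
print(resolve_model("codex", "copilot-preview"))  # codex-stable
```

The key property is that resolution can never return a model outside the requesting agent's list, so a mismatch degrades to the stable default instead of a runtime failure.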