Incident History

Incomplete pull request results in repositories

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

1777385854 - 1777608909 Resolved

Disruption with some GitHub services

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

1777384756 - 1777396142 Resolved

Disruption with some GitHub services

On April 22, 2026, from 18:49 to 19:32 UTC, the Copilot Cloud Agent service began failing during session execution for users running the Agent HQ Codex agent. Codex agent sessions failed to start for all entry points (issue assignment, @copilot comment mentions). 0.5% of total Copilot Cloud Agent jobs were impacted (~2,000 failed jobs). Copilot and other agent sessions were unaffected.

This was caused by a model resolution mismatch in Codex agent sessions, resulting in an incompatible model being used at runtime. A mitigation was deployed to select a stable default model for Codex agent sessions. We are working to harden the underlying model-resolution path so it correctly scopes to the requesting agent's supported models, preventing similar failure modes in the future.
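For illustration only, here is a minimal sketch of the hardening described above: resolving a model only within the requesting agent's supported set and falling back to that agent's stable default. All names (AGENT_SUPPORTED_MODELS, resolve_model, the model identifiers) are hypothetical and not GitHub's actual code.

```python
# Hypothetical sketch: resolve a model only within the requesting agent's
# supported set, falling back to that agent's stable default.
AGENT_SUPPORTED_MODELS = {
    "codex": ["codex-default"],                      # assumed stable default
    "copilot": ["copilot-default", "copilot-fast"],  # illustrative entries
}

def resolve_model(agent: str, requested: str | None) -> str:
    """Return a model the requesting agent can actually run.

    A resolver that ignores the agent can hand back a model the agent does
    not support, which is the kind of mismatch described in this incident.
    """
    supported = AGENT_SUPPORTED_MODELS.get(agent)
    if not supported:
        raise ValueError(f"unknown agent: {agent}")
    if requested in supported:
        return requested
    return supported[0]  # stable default scoped to this agent

print(resolve_model("codex", "some-unsupported-model"))  # -> "codex-default"
```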

1777308498 - 1777316579 Resolved

GitHub search is degraded

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

1777307467 - 1777330002 Resolved

Disruption with some GitHub services

We are investigating reports of degraded performance for Actions

1777307266 Ongoing

Delays with Actions Jobs for Larger Runners using VNet Injection in the East US region

On April 24, 2026, from approximately 11:39 UTC to April 25, 2026 at 00:15 UTC, GitHub Actions experienced delays and timeouts for Larger Hosted Runner jobs using VNet injection in the East US region without a failover region configured. Standard and self-hosted runners were not impacted. This was caused by backend failures in our compute provider’s provisioning, scaling, and update operations for VMs in the East US region and mitigated by a rollback across all affected Availability Zones. More detail is available at https://azure.status.microsoft/en-us/status/history/?trackingId=5GP8-W0G.

We are working to improve the reliability of our annotations for jobs impacted by regional issues and are adding system log notifications as an additional customer communication channel alongside annotations. VNet Failover is also now in public preview, allowing customers to evacuate Larger Hosted Runners using VNet injection in cases like this.

1777057359 - 1777077364 Resolved

Incident with Pull Requests

On April 23, 2026, between 16:05 UTC and 20:43 UTC, the Pull Requests service experienced a regression affecting merge queue operations. PRs merged via merge queue using the squash merge method produced incorrect merge commits when the merge group contained more than one PR. In affected cases, changes from previously merged PRs and prior commits were inadvertently reverted by subsequent merges.

During the impact window, 2,092 pull requests were affected. The issue did not affect pull requests merged outside of merge queue, nor merge queue groups using the merge or rebase methods.

It took approximately 3 hours and 33 minutes to identify the issue. The change completed deployment at approximately 16:05 UTC, and we became aware at 19:38 UTC following an increase in customer support inquiries. Because the issue affected merge commit correctness rather than availability, it was not detected by existing automated monitoring and was identified through customer reports.

The regression was introduced by a new code path that adjusted merge base computation for merge queue ref updates. This code path was intended to be gated behind a feature flag for an unreleased feature, but the gating was incomplete. As a result, the new behavior was inadvertently applied to squash merge groups, producing an incorrect three-way merge. This caused subsequent squash merges to revert changes from earlier pull requests and, in some cases, changes between their starting points.

We mitigated the incident by reverting the code change and force-deploying the fix across all environments. After resolution, we identified affected repositories and sent targeted remediation instructions to repository administrators with step-by-step recovery guidance.

The regression was not identified during internal validation. Existing test coverage primarily exercised single-PR merge queue groups, which did not exhibit the faulty base-reference calculation. Because automated checks did not validate merge correctness for multi-PR squash groups, the defect surfaced only in production.

To prevent recurrence, GitHub is expanding test coverage for merge correctness validation. We are broadening automated coverage for merge queue operations, including regression checks that validate resulting Git contents across supported configurations, so issues affecting merge correctness are caught before reaching production. We are committed to ensuring the correctness and reliability of merge queue operations. These actions will reduce the risk of similar regressions and improve confidence in future changes to the Pull Requests service.
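As a hedged illustration of the gating gap described above (not GitHub's actual implementation; all names are invented), the sketch below contrasts an incomplete gate, which reaches the new merge-base path for every squash group, with a complete gate that checks the feature flag first.

```python
# Hypothetical sketch of feature-flag gating for a new merge-base code path.
# adjusted_merge_base, classic_merge_base, and the flag name are illustrative.
from dataclasses import dataclass

@dataclass
class MergeGroup:
    method: str      # "merge", "squash", or "rebase"
    target_tip: str  # commit SHA of the target branch

def classic_merge_base(group: MergeGroup) -> str:
    return group.target_tip  # previous, correct behavior (stubbed)

def adjusted_merge_base(group: MergeGroup) -> str:
    return "adjusted-base"   # new path for an unreleased feature (stubbed)

def compute_merge_base(group: MergeGroup, enabled_flags: set[str]) -> str:
    # Incomplete gating reaches the new path for every squash group:
    #     if group.method == "squash":
    #         return adjusted_merge_base(group)
    # Complete gating checks the feature flag before taking the new path:
    if "adjusted_merge_base" in enabled_flags and group.method == "squash":
        return adjusted_merge_base(group)   # unreleased behavior, opt-in only
    return classic_merge_base(group)        # default, previously validated path

group = MergeGroup(method="squash", target_tip="abc123")
print(compute_merge_base(group, enabled_flags=set()))  # -> "abc123"
```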

1776973851 - 1776980606 Resolved

Disruption with users unable to start Claude and Codex agent tasks from the web

Between 18:45 and 19:42 UTC on April 23, users were unable to start new agent tasks using either Claude or Codex agent on github.com. This was caused by a code change to how Copilot mission control routes task creation requests. Ongoing agent tasks and other Copilot agent features were not affected. We mitigated the impact by reverting the breaking change. We are adding extra monitoring and integration test coverage for the task creation path to prevent future recurrence.

1776972523 - 1776973345 Resolved

Incident with multiple GitHub services

On April 23, 2026, between 16:03 UTC and 17:27 UTC, multiple GitHub services experienced elevated error rates and degraded performance due to DNS resolution failures originating from our DNS infrastructure in our VA3 datacenter. Approximately 5–7% of overall traffic was affected during the impact window:

- Webhooks: ~0.35% of API requests returned 5xx (peak ~0.39%). ~0.88% of requests exceeded 3s latency; at peak, >3s responses represented ~10% of Webhooks API traffic.
- Copilot Metrics: ~9% of Copilot Insights dashboard requests returned 5xx.
- Copilot cloud agents: ~10% of cloud agent sessions failed.
- Octoshift: 0.88% of active repo migrations failed and 79% saw elevated durations (avg. 5.2 min) during this period.
- Git Operations: errors averaged 1.25% over the duration of the incident, with a peak of 2.07%.
- Actions: workflow run status updates experienced delays of up to ~8s over the duration of the incident window.

Our DNS infrastructure in VA3 entered a degraded state and began intermittently returning NXDOMAIN responses and timing out on lookups for both internal service discovery and external endpoints. This caused a cascading impact across the dependent services listed above. We identified a specific load pattern under which our DNS resolvers began failing. The evidence points to a recently introduced traffic-balancing mechanism, rolled out progressively to support our growth, as the root cause. We have since reverted this change.

We are immediately prioritizing investments in a more controlled rollout and validation process, including a dedicated environment to safely shadow production DNS traffic and detect these failure modes before they can affect production.
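As a rough illustration of the failure signature (not GitHub's tooling), the probe below uses the dnspython library to classify lookups so that intermittent NXDOMAIN answers from a degraded resolver stand out from ordinary timeouts. The probed name and timeout are placeholders.

```python
# Hypothetical resolver probe: classify a lookup as ok / nxdomain / timeout
# so a known-good name that suddenly resolves as nxdomain is a distinct alert.
import dns.exception
import dns.resolver

def probe(name: str, lifetime: float = 2.0) -> str:
    try:
        dns.resolver.resolve(name, "A", lifetime=lifetime)
        return "ok"
    except dns.resolver.NXDOMAIN:
        return "nxdomain"   # resolver claims the name does not exist
    except dns.exception.Timeout:
        return "timeout"    # lookup exceeded the allowed lifetime
    except dns.exception.DNSException:
        return "error"      # any other resolution failure

# A known-good internal name returning "nxdomain" intermittently points at
# degraded resolvers rather than at the target service itself.
print(probe("example.com"))
```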

1776960767 - 1776965449 Resolved

Investigating errors on GitHub

On April 23, 2026, between 14:30 UTC and 15:18 UTC, multiple services were degraded on github.com. During this time approximately 1.5% of all web requests resulted in a 5xx status and unicorn pages for github.com users. We also saw elevated error rates across Actions workflow runs, Copilot, Codespaces, and Packages, leading to degraded experiences during this timeframe. Codespaces impact peaked at 45% failures for create requests and 65% failures for resume requests. Packages impact was mainly Maven related, with 50% failure rates in downloads and 70% failure rates in uploads. Actions experienced a peak of 8% of failed jobs and up to 85% of jobs impacted by run start delays of more than 5 minutes.

This was due to a configuration change to an internal billing service that led to a cache being overwhelmed and causing requests to time out. These timeouts cascaded across multiple services and eventually caused requests to queue up and exhaust web request workers. This configuration change was reverted at 14:42 UTC, and following this, all services began to see recovery immediately.

To prevent this situation in the future, we are taking steps to ensure that failures and timeouts in the billing service don’t cascade to other services causing impact. This includes implementing more aggressive timeouts on callers of these billing services, adding circuit breaker configurations for cache timeouts, and using more resilient cache options. We have also decreased max request timeouts within the billing service that caused impact and added more capacity to our cache to prevent traffic spikes from having the same impact.
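As a minimal sketch of the circuit-breaker idea mentioned above, assuming a generic callable billing client: the thresholds, names, and timeout values are illustrative, not GitHub's configuration.

```python
# Hypothetical circuit breaker around a billing-service call, so repeated
# failures fail fast instead of queueing up and exhausting request workers.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        # While the circuit is open, reject immediately rather than waiting.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                raise RuntimeError("billing circuit open; skipping call")
            self.opened_at = None   # half-open: allow one trial call
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0
        return result

# Callers would also pass an aggressive timeout so a slow cache behind the
# billing service surfaces as a fast, contained failure, e.g.:
#     breaker.call(requests.get, billing_url, timeout=0.5)
```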

1776955218 - 1776957521 Resolved