Incident with multiple GitHub services


Incident resolved in 1h18m2s

Resolved

On April 23, 2026, between 16:03 UTC and 17:27 UTC, multiple GitHub services experienced elevated error rates and degraded performance due to DNS resolution failures originating from our DNS infrastructure in our VA3 datacenter. Approximately 5–7% of overall traffic was affected during the impact window:

- Webhooks: ~0.35% of API requests returned 5xx (peak ~0.39%). ~0.88% of requests exceeded 3s latency; at peak, >3s responses represented ~10% of Webhooks API traffic.
- Copilot Metrics: ~9% of Copilot Insights dashboard requests returned 5xx.
- Copilot cloud agents: ~10% of cloud agent sessions were affected and failing.
- Octoshift: 0.88% of active repo migrations failed and 79% saw elevated durations (avg. 5.2 min) during this period.
- Git Operations: averaged 1.25% errors over the duration of the incident, with a peak of 2.07% errors.
- Actions: Workflow run status updates experienced delays of up to ~8s over the duration of the incident window.

Our DNS infrastructure in VA3 entered a degraded state and began intermittently returning NXDOMAIN responses and timing out on lookups for both internal service discovery and external endpoints. This caused a cascading impact across the dependent services listed above.

We identified a specific load pattern under which our DNS resolvers began failing. The evidence points to a recently introduced traffic-balancing mechanism, rolled out progressively to support our growth, as the root cause. We have since reverted this change.

We are immediately prioritizing investments in a more controlled rollout and validation process, including a dedicated environment to safely shadow production DNS traffic and detect these failure modes before they can affect production.

1776965449

Investigating

Webhooks is operating normally.

1776964229

Investigating

Many services are mitigated, and we are validating the remaining services.

1776963870

Investigating

The degradation affecting Actions and Copilot has been mitigated. We are monitoring to ensure stability.

1776963780

Investigating

We have identified the root problem and are working on mitigation.

1776963170

Investigating

Actions is experiencing degraded performance. We are continuing to investigate.

1776962059

Investigating

We are investigating multiple unavailable services.

1776961173

Investigating

We are investigating reports of degraded availability for Copilot and Webhooks.

1776960767
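
The numbers under each update above are Unix epoch seconds (UTC). As a minimal sketch, assuming Python and using only the first and last values from this log, the following converts them to UTC and recomputes the duration shown at the top of the page. The helper names `to_utc` and `format_duration` are illustrative, not part of any GitHub tooling.

```python
from datetime import datetime, timezone

# Epoch values copied from the first ("Investigating") and last ("Resolved")
# status updates in this log.
FIRST_UPDATE = 1776960767
RESOLVED = 1776965449


def to_utc(epoch_seconds: int) -> datetime:
    """Interpret an epoch value as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def format_duration(seconds: int) -> str:
    """Render a duration in the same h/m/s style the status page uses."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h{minutes}m{secs}s"


print(to_utc(FIRST_UPDATE).isoformat())          # 2026-04-23T16:12:47+00:00
print(to_utc(RESOLVED).isoformat())              # 2026-04-23T17:30:49+00:00
print(format_duration(RESOLVED - FIRST_UPDATE))  # 1h18m2s
```

The 1h18m2s figure spans the first and last status updates. The impact window in the summary (16:03 to 17:27 UTC) is measured separately and begins before the first update was posted.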