Incident History

Incident with Pages

On Sunday April 13th, 2026, between 18:53 UTC and 20:30 UTC, the GitHub Pages service experienced elevated error rates. On average, the error rate was 10.58% and peaked at 12.77% of requests to the service, resulting in approximately 17.5 million failed requests returning HTTP 500 errors. This was due to an automated DNS management tool (octodns) erroneously deleting a DNS record for a Pages backend storage host after its upstream data source intermittently failed to return the record, causing the tool to treat it as stale and remove it.We mitigated the incident by re-creating the deleted DNS record. To prevent future incidents, we are implementing availability-zone-tolerant routing in the Pages frontend so that an unresolvable backend host triggers failover to healthy hosts rather than returning errors, adding safeguards to prevent automated deletion of DNS records owned by other systems, and improving logging and alerting for DNS resolution failures in the Pages serving path.

1776110164 - 1776112537 Resolved

Disruption with some GitHub services

On April 13, 2026, between 14:41 UTC and 17:29 UTC, the Copilot service experienced degraded performance. All Copilot users were impacted by increased latency, and approximately 20% experienced request failures when interacting with Copilot Cloud Agent (CCA). On average, request latency increased to approximately 950ms. The GitHub User Dashboard also displayed intermittent errors loading Copilot quota information. CCA and the User Dashboard were impacted for approximately 2 hours and 56 minutes. This was due to an infrastructure change that reduced the available compute capacity for a backend service responsible for Copilot rate limiting and quota management. The reduced capacity caused resource exhaustion under normal traffic load, leading to cascading failures in downstream request processing. We mitigated the incident by increasing compute resources allocated to the affected service and scaling out the number of service instances to distribute load more effectively. We are working to improve proactive capacity monitoring to detect resource degradation before it impacts users, reviewing retry and timeout configurations across dependent services to reduce amplification during degraded states, and evaluating connection management strategies to improve resilience under constrained resources.

1776098487 - 1776102009 Resolved

Problems with third-party Claude and Codex Agent sessions not being listed in the agents tab dashboard

On April 9, 2026, between 22:59 UTC and April 10, 2026, 13:24 UTC, the Copilot Mission Control service was degraded and did not display Claude and Codex Cloud Agent sessions in the agents tab dashboard. Customers were unable to see, list, or manage their third party agent sessions during this period. The underlying agent sessions continued to function normally. This was a visibility and management issue only, and no HTTP errors were generated. The API returned successful responses with incomplete results, with an average error rate of 0% and a maximum error rate of 0%. This was due to a code change that introduced a filter which inadvertently excluded third party agent sessions.We mitigated the incident by reverting the problematic code change and deploying the fix to production.We are working to add automated monitoring for dashboard content visibility and improve integration test coverage for third party agent session listing to reduce our time to detection and mitigation of issues like this one in the future.

1775826441 - 1775827703 Resolved

Disruption with some GitHub services

On April 9, 2026, between 16:05 UTC and 20:36 UTC, the Copilot cloud agent service was degraded, causing new agent sessions to be delayed or fail to start. Users who attempted to start Copilot cloud agent sessions during this period experienced jobs getting stuck in the queue, with wait times peaking at 54 minutes compared to the normal 15–40 seconds. On average, approximately 84% of requests to start agent sessions failed, peaking at 97.5% during the worst period.This was due to an internal service exceeding API rate limits, compounded by a caching bug that persisted the rate-limited state beyond the actual rate limit window, causing recurring outage waves rather than a single recovery.We mitigated the incident by deploying a configuration change to bypass the affected cache and shifting API traffic to an alternative authentication path that reduced rate limit exposure. We have since added automated monitoring and alerting for this failure mode, deployed per-endpoint rate limit controls, and added caching for high-traffic API calls to reduce overall load. We are also working on longer-term improvements to rate limit isolation and traffic management to prevent similar issues in the future.This incident shared the same underlying root causes with an incident declared in the time frame https://www.githubstatus.com/incidents/zn1t56bfxdzg

1775751603 - 1775767012 Resolved

Disruption with some GitHub services

On April 9, 2026, between 09:05 UTC and 19:05 UTC, the Copilot coding agent service was degraded and users experienced significant delays starting new agent sessions. Approximately 84% of new agent session requests were delayed across four separate outage waves, with queue wait times peaking at 54 minutes compared to a normal baseline of 15–40 seconds. On average, the error rate was 83.9% and peaked at 97.5% of requests to the service. Approximately 22,700 workflow creations were delayed or failed during the incident.This was due to a bug in our rate limiting logic that incorrectly applied a rate limit globally across all users, rather than scoping it to the individual installation that triggered the limit. A contributing factor was a surge in API traffic from a client update that increased requests to an internal endpoint by 3–4x, which accelerated rate limit exhaustion.We mitigated the incident by disabling the faulty rate limit caching mechanism via feature flag and updating our service to use per-installation credentials for API calls, ensuring rate limits are correctly scoped to individual installations.We have since added automated monitoring and alerting to detect this failure mode proactively, deployed fixes to reduce unnecessary API traffic through caching improvements, and are continuing work to further isolate rate limit scoping across client types to prevent similar issues in the future.This incident shared the same underlying root causes with an incident declared in the time frame https://www.githubstatus.com/incidents/2rqwxl8y7m0j

1775728221 - 1775729737 Resolved

Disruption with GitHub notifications

On April 9, 2026, between 03:22 UTC and 04:49 UTC, GitHub Notifications experienced degraded availability. During this time, approximately 45% of requests to the notifications service returned errors, with a peak error rate of approximately 54%, preventing affected users from successfully viewing or interacting with their notifications service. The issue was identified and resolved, restoring the service to full availability.We are working to improve our metrics to reduce time to detection and mitigation for similar issues in the future.

1775709726 - 1775710637 Resolved

Disruption with some GitHub services

Between 15:20 and 20:18 UTC on Thursday April 2, Copilot Cloud Agent entered a period of reduced performance. Due to an internal feature being developed for Copilot Code Review, the Copilot Cloud Agent infrastructure started to receive an increased number of jobs. This load eventually caused us to hit an internal rate limit, causing all work to suspend for an hour. During this hour, some new jobs would time out, while others would resume once rate limiting ended. Roughly 40% of jobs in this period were affected.Once the cause of this rate limiting was identified, we were able to disable the new CCR feature via a feature flag. Once the jobs that were already in the queue were able to clear, we didn't see additional instances of rate limiting afterwards.

1775152176 - 1775166523 Resolved

Copilot Coding Agent failing to start some jobs

Between 15:20 and 20:18 UTC on Thursday April 2, Copilot Cloud Agent entered a period of reduced performance. Due to an internal feature being developed for Copilot Code Review, the Copilot Cloud Agent infrastructure started to receive an increased number of jobs. This load eventually caused us to hit an internal rate limit, causing all work to suspend for an hour. During this hour, some new jobs would time out, while others would resume once rate limiting ended. Roughly 40% of jobs in this period were affected.Once the cause of this rate limiting was identified, we were able to disable the new CCR feature via a feature flag. Once the jobs that were already in the queue were able to clear, we didn't see additional instances of rate limiting afterwards.This was the same incident declared in https://www.githubstatus.com/incidents/d96l71t3h63k

1775146684 - 1775147405 Resolved

GitHub audit logs are unavailable

On April 1, 2026, between 15:34 UTC and 16:02 UTC, our audit log service lost connectivity to its backing data store due to a failed credential rotation. During this 28-minute window, audit log history was unavailable via both the API and web UI. This resulted in 5xx errors for 4,297 API actors and 127 github.com users. Additionally, events created during this window were delayed by up to 29 minutes in github.com and event streaming. No audit log events were lost; all audit log events were ultimately written and streamed successfully. Customers using GitHub Enterprise Cloud with data residency were not impacted by this incident. We were alerted to the infrastructure failure at 15:40 UTC — six minutes after onset — and resolved the issue by recycling the affected environment, restoring full service by 16:02 UTC. We are conducting a thorough review of our credential rotation process to strengthen its resiliency and prevent recurrence. In parallel, we are strengthening our monitoring capabilities to ensure faster detection and earlier visibility into similar issues going forward.

1775059571 - 1775059817 Resolved

Disruption with GitHub's code search

On April 1st, 2026 between 14:40 and 17:00 UTC the GitHub code search service had an outage which resulted in users being unable to perform searches.The issue was initially caused by an upgrade to the code search Kafka cluster ZooKeeper instances which caused a loss of quorum. This resulted in application-level data inconsistencies which required the index to be reset to a point in time before the loss of quorum occurred. Meanwhile, an accidental deploy resulted in query services losing their shard-to-host mappings, which are typically propagated by Kafka.We remediated the problem by performing rolling restarts in the Kafka cluster, allowing quorum to be reestablished. From there we were able to reset our index to a point in time before the inconsistencies occurred.The team is working on ways to improve our time to respond and mitigate issues relating to Kafka in the future.

1775055774 - 1775087145 Resolved
⮜ Previous Next ⮞