Incident History

Disruption with some GitHub services

On April 9, 2026, between 16:05 UTC and 20:36 UTC, the Copilot cloud agent service was degraded, causing new agent sessions to be delayed or fail to start. Users who attempted to start Copilot cloud agent sessions during this period saw jobs stuck in the queue, with wait times peaking at 54 minutes compared to the normal 15–40 seconds. On average, approximately 84% of requests to start agent sessions failed, peaking at 97.5% during the worst period.

This was due to an internal service exceeding API rate limits, compounded by a caching bug that persisted the rate-limited state beyond the actual rate limit window, causing recurring outage waves rather than a single recovery.

We mitigated the incident by deploying a configuration change to bypass the affected cache and by shifting API traffic to an alternative authentication path that reduced rate limit exposure. We have since added automated monitoring and alerting for this failure mode, deployed per-endpoint rate limit controls, and added caching for high-traffic API calls to reduce overall load. We are also working on longer-term improvements to rate limit isolation and traffic management to prevent similar issues in the future.

This incident shared the same underlying root causes as the incident declared at https://www.githubstatus.com/incidents/zn1t56bfxdzg
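To illustrate the caching failure mode described above (all names below are hypothetical; the actual implementation is not public), the Go sketch shows how caching a rate-limited state with a fixed TTL that outlives the upstream rate limit window keeps rejecting work after the limit has actually reset, which is what produces recurring waves instead of a single recovery. The fix is to expire the cached state no later than the window itself:

```go
package main

import (
	"fmt"
	"time"
)

// cachedLimitState is a hypothetical cache entry recording that an upstream
// API returned 429 (rate limited).
type cachedLimitState struct {
	limited   bool
	expiresAt time.Time
}

type limitCache struct {
	state cachedLimitState
}

// Buggy: cache the rate-limited flag for a fixed TTL, regardless of when the
// upstream window actually resets. If ttl exceeds the reset time, callers
// keep failing fast after the limit has lifted, and the backlog that builds
// up re-triggers the limit as soon as the cache expires.
func (c *limitCache) recordLimitedBuggy(ttl time.Duration) {
	c.state = cachedLimitState{limited: true, expiresAt: time.Now().Add(ttl)}
}

// Fixed: expire the cached state exactly when the upstream says the window
// resets (e.g. a Retry-After hint), never later.
func (c *limitCache) recordLimitedFixed(retryAfter time.Duration) {
	c.state = cachedLimitState{limited: true, expiresAt: time.Now().Add(retryAfter)}
}

func (c *limitCache) isLimited() bool {
	return c.state.limited && time.Now().Before(c.state.expiresAt)
}

func main() {
	retryAfter := 50 * time.Millisecond // upstream window resets here
	cacheTTL := 200 * time.Millisecond  // fixed cache TTL (the bug)

	var c limitCache
	c.recordLimitedBuggy(cacheTTL)
	time.Sleep(retryAfter + 10*time.Millisecond)
	// The upstream limit has reset, but the cache still reports "limited",
	// so new agent sessions keep getting rejected.
	fmt.Println("after reset, still limited (buggy):", c.isLimited())

	c.recordLimitedFixed(retryAfter)
	time.Sleep(retryAfter + 10*time.Millisecond)
	fmt.Println("after reset, still limited (fixed):", c.isLimited())
}
```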

April 9, 2026, 16:20 UTC - 20:36 UTC Resolved

Disruption with some GitHub services

On April 9, 2026, between 09:05 UTC and 19:05 UTC, the Copilot coding agent service was degraded and users experienced significant delays starting new agent sessions. Approximately 84% of new agent session requests were delayed across four separate outage waves, with queue wait times peaking at 54 minutes compared to a normal baseline of 15–40 seconds. On average, the error rate was 83.9%, peaking at 97.5% of requests to the service. Approximately 22,700 workflow creations were delayed or failed during the incident.

This was due to a bug in our rate limiting logic that incorrectly applied a rate limit globally across all users, rather than scoping it to the individual installation that triggered the limit. A contributing factor was a surge in API traffic from a client update that increased requests to an internal endpoint by 3–4x, which accelerated rate limit exhaustion.

We mitigated the incident by disabling the faulty rate limit caching mechanism via feature flag and updating our service to use per-installation credentials for API calls, ensuring rate limits are correctly scoped to individual installations.

We have since added automated monitoring and alerting to detect this failure mode proactively, deployed fixes to reduce unnecessary API traffic through caching improvements, and are continuing work to further isolate rate limit scoping across client types to prevent similar issues in the future.

This incident shared the same underlying root causes as the incident declared at https://www.githubstatus.com/incidents/2rqwxl8y7m0j
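As a rough sketch of the scoping bug (the limiter, key names, and limits below are invented for illustration), the difference comes down to whether the rate limiter's key includes the installation. With a global key, one noisy installation exhausts the budget for everyone; with a per-installation key, the limit is contained:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a minimal fixed-window counter, keyed by string. Purely
// illustrative; the real service's limiter is not public.
type limiter struct {
	mu     sync.Mutex
	window time.Duration
	max    int
	counts map[string]int
	resets map[string]time.Time
}

func newLimiter(max int, window time.Duration) *limiter {
	return &limiter{
		window: window,
		max:    max,
		counts: make(map[string]int),
		resets: make(map[string]time.Time),
	}
}

// allow increments the counter for key, resetting it when its window lapses.
func (l *limiter) allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if now.After(l.resets[key]) {
		l.counts[key] = 0
		l.resets[key] = now.Add(l.window)
	}
	if l.counts[key] >= l.max {
		return false
	}
	l.counts[key]++
	return true
}

// Buggy scoping: every caller shares one key, so a single noisy installation
// exhausts the budget for all users.
func keyBuggy() string { return "agent-sessions" }

// Fixed scoping: the key includes the installation, so one installation
// hitting its limit cannot block the rest.
func keyFixed(installationID int64) string {
	return fmt.Sprintf("agent-sessions:install-%d", installationID)
}

func main() {
	l := newLimiter(3, time.Minute)

	// Installation 111 is noisy and burns through the shared budget.
	for i := 0; i < 4; i++ {
		l.allow(keyBuggy())
	}
	// Installation 222 sends a single request.
	fmt.Println("222 allowed (global key):", l.allow(keyBuggy()))    // false
	fmt.Println("222 allowed (scoped key):", l.allow(keyFixed(222))) // true
}
```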

April 9, 2026, 09:50 UTC - 10:15 UTC Resolved

Disruption with GitHub notifications

On April 9, 2026, between 03:22 UTC and 04:49 UTC, GitHub Notifications experienced degraded availability. During this time, approximately 45% of requests to the notifications service returned errors, with a peak error rate of approximately 54%, preventing affected users from viewing or interacting with their notifications. The issue was identified and resolved, restoring the service to full availability.

We are working to improve our metrics to reduce time to detection and mitigation for similar issues in the future.

April 9, 2026, 04:42 UTC - 04:57 UTC Resolved

Disruption with some GitHub services

Between 15:20 and 20:18 UTC on Thursday, April 2, 2026, Copilot Cloud Agent entered a period of reduced performance. Due to an internal feature being developed for Copilot Code Review (CCR), the Copilot Cloud Agent infrastructure began receiving an increased number of jobs. This load eventually caused us to hit an internal rate limit, suspending all work for an hour. During this hour, some new jobs timed out, while others resumed once the rate limiting ended. Roughly 40% of jobs in this period were affected.

Once the cause of the rate limiting was identified, we disabled the new CCR feature via a feature flag. After the jobs already in the queue cleared, we saw no further instances of rate limiting.
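A minimal sketch of the kill-switch pattern behind this mitigation, assuming a generic feature-flag client (the flag name and interface below are hypothetical, not GitHub's actual flag system):

```go
package main

import (
	"errors"
	"fmt"
)

// flagClient is a stand-in for whatever feature-flag service is in use.
type flagClient interface {
	Enabled(flag string) bool
}

type staticFlags map[string]bool

func (s staticFlags) Enabled(flag string) bool { return s[flag] }

var errFeatureDisabled = errors.New("job source disabled by feature flag")

// enqueueReviewJob gates a traffic source behind a kill switch so operators
// can stop the extra load (and let the queue drain) without shipping code.
func enqueueReviewJob(flags flagClient, job string) error {
	if !flags.Enabled("copilot_code_review_agent_jobs") {
		return errFeatureDisabled
	}
	fmt.Println("enqueued:", job)
	return nil
}

func main() {
	// Flag flipped off mid-incident; new CCR jobs are rejected at intake.
	flags := staticFlags{"copilot_code_review_agent_jobs": false}
	if err := enqueueReviewJob(flags, "review-pr-123"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```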

April 2, 2026, 17:49 UTC - 21:48 UTC Resolved

Copilot Coding Agent failing to start some jobs

Between 15:20 and 20:18 UTC on Thursday, April 2, 2026, Copilot Cloud Agent entered a period of reduced performance. Due to an internal feature being developed for Copilot Code Review (CCR), the Copilot Cloud Agent infrastructure began receiving an increased number of jobs. This load eventually caused us to hit an internal rate limit, suspending all work for an hour. During this hour, some new jobs timed out, while others resumed once the rate limiting ended. Roughly 40% of jobs in this period were affected.

Once the cause of the rate limiting was identified, we disabled the new CCR feature via a feature flag. After the jobs already in the queue cleared, we saw no further instances of rate limiting.

This was the same incident declared in https://www.githubstatus.com/incidents/d96l71t3h63k

April 2, 2026, 16:18 UTC - 16:30 UTC Resolved

GitHub audit logs are unavailable

On April 1, 2026, between 15:34 UTC and 16:02 UTC, our audit log service lost connectivity to its backing data store due to a failed credential rotation. During this 28-minute window, audit log history was unavailable via both the API and the web UI, resulting in 5xx errors for 4,297 API actors and 127 github.com users. Additionally, events created during this window were delayed by up to 29 minutes in github.com and event streaming. No audit log events were lost; all audit log events were ultimately written and streamed successfully. Customers using GitHub Enterprise Cloud with data residency were not impacted by this incident.

We were alerted to the infrastructure failure at 15:40 UTC, six minutes after onset, and resolved the issue by recycling the affected environment, restoring full service by 16:02 UTC. We are conducting a thorough review of our credential rotation process to strengthen its resiliency and prevent recurrence. In parallel, we are strengthening our monitoring to detect and surface similar issues faster.
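One common way to make credential rotation degrade safely, sketched below with a hypothetical data store interface (GitHub's actual rotation tooling is not public), is to validate the new credential against the store before cutting over, so a bad rotation keeps the old credential rather than severing connectivity:

```go
package main

import (
	"errors"
	"fmt"
)

// store is a stand-in for the audit log's backing data store; everything
// here only illustrates the validate-before-cutover pattern.
type store interface {
	Ping(credential string) error
}

type fakeStore struct{ valid map[string]bool }

func (s fakeStore) Ping(cred string) error {
	if !s.valid[cred] {
		return errors.New("authentication failed")
	}
	return nil
}

// rotate swaps in a new credential only after proving it can reach the data
// store, so a failed rotation degrades to "keep using the old credential"
// rather than "lose connectivity".
func rotate(s store, current, next string) (string, error) {
	if err := s.Ping(next); err != nil {
		return current, fmt.Errorf("keeping current credential, new one failed validation: %w", err)
	}
	return next, nil
}

func main() {
	// "cred-v2" was minted incorrectly, so rotation refuses the cutover.
	s := fakeStore{valid: map[string]bool{"cred-v1": true}}
	active, err := rotate(s, "cred-v1", "cred-v2")
	fmt.Println("active:", active, "err:", err)
}
```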

April 1, 2026, 16:06 UTC - 16:10 UTC Resolved

Disruption with GitHub's code search

On April 1, 2026, between 14:40 and 17:00 UTC, the GitHub code search service had an outage which left users unable to perform searches.

The issue was initially caused by an upgrade of the ZooKeeper instances backing the code search Kafka cluster, which caused a loss of quorum. This resulted in application-level data inconsistencies that required the index to be reset to a point in time before the loss of quorum occurred. Meanwhile, an accidental deploy resulted in query services losing their shard-to-host mappings, which are normally propagated via Kafka.

We remediated the problem by performing rolling restarts in the Kafka cluster, allowing quorum to be reestablished. From there we were able to reset our index to a point in time before the inconsistencies occurred.

The team is working on ways to improve our time to respond to and mitigate Kafka-related issues in the future.
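GitHub has not published how the index reset works; as a general sketch of a "reset to a point in time" operation over a Kafka-style log (the consumer interface below is hypothetical, though real clients expose equivalent offset-for-timestamp and seek operations), each partition is rewound to its offset at the last known-consistent moment and index state is rebuilt by replaying from there:

```go
package main

import (
	"fmt"
	"time"
)

// logConsumer is a hypothetical stand-in for a Kafka-style consumer.
type logConsumer interface {
	OffsetForTime(partition int, t time.Time) (int64, error)
	Seek(partition int, offset int64) error
}

// resetToPointInTime rewinds every partition of the indexing log to the last
// known-consistent moment, so the index can be rebuilt by replaying events
// recorded after that point.
func resetToPointInTime(c logConsumer, partitions []int, t time.Time) error {
	for _, p := range partitions {
		off, err := c.OffsetForTime(p, t)
		if err != nil {
			return fmt.Errorf("partition %d: %w", p, err)
		}
		if err := c.Seek(p, off); err != nil {
			return fmt.Errorf("partition %d: %w", p, err)
		}
	}
	return nil
}

type fakeConsumer struct{}

func (fakeConsumer) OffsetForTime(p int, t time.Time) (int64, error) {
	return t.Unix() % 1000, nil // fake: derive an offset from the timestamp
}

func (fakeConsumer) Seek(p int, off int64) error {
	fmt.Printf("partition %d: seeking to offset %d\n", p, off)
	return nil
}

func main() {
	lastConsistent := time.Date(2026, 4, 1, 14, 30, 0, 0, time.UTC)
	_ = resetToPointInTime(fakeConsumer{}, []int{0, 1, 2}, lastConsistent)
}
```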

April 1, 2026, 15:02 UTC - 23:45 UTC Resolved

Incident with Copilot

On April 1, 2026, between 07:29 and 12:41 UTC, some customers experienced elevated 5xx errors and increased latency when using GitHub Copilot features that rely on the /agents/sessions endpoints (including creating or viewing agent sessions). The issue was caused by resource exhaustion in one of the Copilot backend services handling these requests, which in turn caused timeouts and failed requests. We mitigated the incident by increasing the service's available compute resources and tuning its runtime concurrency settings. Service health returned to normal and the incident was fully resolved by 12:41 UTC.
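The postmortem does not detail the concurrency tuning; one common shape of such a fix, sketched here as an assumption rather than the actual change, is to bound in-flight requests with a semaphore so overload is shed as fast 503s instead of exhausting the service's memory and file descriptors:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// limitConcurrency wraps a handler with a semaphore: at most max requests
// are served at once, and the rest are rejected with 503 instead of piling
// up until the process exhausts its resources.
func limitConcurrency(max int, next http.Handler) http.Handler {
	sem := make(chan struct{}, max)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			http.Error(w, "server busy", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	sessions := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "session created")
	})
	// The endpoint path matches the incident report; the limit is invented.
	http.Handle("/agents/sessions", limitConcurrency(128, sessions))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```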

April 1, 2026, 09:58 UTC - 12:41 UTC Resolved

Incident with Pull Requests: High percentage of 500s

On March 31, 2026, between 13:53 UTC and 21:23 UTC, the Pull Requests service experienced elevated latency and failures. On average, the error rate was 0.15%, peaking at 0.28% of requests to the service. This was due to a change in garbage collection (GC) settings for a Go-based internal service that provides access to Git repository data. The change caused more frequent GC activity and elevated CPU consumption on a subset of storage nodes, increasing latency and failure rates for some internal API operations.

We mitigated the incident by reverting the GC changes. To prevent future incidents and improve time to detection and mitigation, we are instrumenting additional metrics and alerting for GC-related behavior, improving our visibility into other signals that could cause degraded impact of this type, and updating our best practices and standards for garbage collection in Go-based services.
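A minimal version of the GC instrumentation described can be built from the Go runtime's own counters; the sketch below logs GC activity periodically and flags excessive GC CPU usage (the 5% threshold and 30-second interval are invented examples, and the tuning values shown are the defaults, not the change that caused the incident):

```go
package main

import (
	"log"
	"runtime"
	"runtime/debug"
	"time"
)

func main() {
	// GC tuning knobs, equivalent to the GOGC and GOMEMLIMIT environment
	// variables. Aggressive values (e.g. a low GC percent) are exactly the
	// kind of change that can trade memory headroom for a large CPU cost.
	debug.SetGCPercent(100)       // default; lower => more frequent GC
	debug.SetMemoryLimit(4 << 30) // 4 GiB soft limit (Go 1.19+)

	const gcCPUAlertThreshold = 0.05 // hypothetical: alert above 5% GC CPU

	var last runtime.MemStats
	for range time.Tick(30 * time.Second) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("gc: cycles=%d (+%d) pause_total=%s cpu_fraction=%.4f",
			m.NumGC, m.NumGC-last.NumGC,
			time.Duration(m.PauseTotalNs), m.GCCPUFraction)
		if m.GCCPUFraction > gcCPUAlertThreshold {
			log.Printf("ALERT: GC consuming %.1f%% of CPU", m.GCCPUFraction*100)
		}
		last = m
	}
}
```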

March 31, 2026, 15:05 UTC - 21:23 UTC Resolved

Issues with metered billing report generation

On March 31, 2026, between 06:15 UTC and 15:30 UTC, the GitHub billing usage reports feature was degraded due to reduced server capacity. Customers requesting billing usage reports and loading the top usage by organization and repository on the billing overview and usage pages were impacted. The average error rate for usage report requests was 15%, peaking at 98% over an eight-minute window. For the billing pages, an average of 56% of requests failed to load the top usage cards. The root cause was an increase in billing usage report requests with large datasets, which exhausted the capacity of the nodes responsible for reporting data. There was no impact on billing charges. We mitigated the incident by adjusting our auto-scaling thresholds to better meet our capacity needs. We are working to improve our metrics to reduce time to detection and mitigation for similar issues in the future.

March 31, 2026, 13:47 UTC - 15:10 UTC Resolved