Incident History

Incident with CodeQL

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

May 13, 2026 14:41 - 16:03 UTC Resolved

Incident with CodeQL, Webhooks, Notifications, and Slack Integration

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

May 12, 2026 14:38 - 17:43 UTC Resolved

Incident with high errors on Git Operations

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

May 11, 2026 14:25 - 14:33 UTC Resolved

CCR and CCA failing to start for PR comments

On May 7, 2026, between 04:12 UTC and 06:13 UTC, Copilot Cloud Agent and Copilot Code Review Agent sessions for pull requests were delayed or failed to start. The issue was caused by follow-up recovery work from a separate Pull Requests incident (https://www.githubstatus.com/incidents/f5pb5d5mr9yh). As part of that recovery, we ran a large database migration, which caused replication delays on several replica hosts. Although those replicas were not serving user traffic, our safeguards correctly treated the elevated replication lag as a signal to slow down writes to the affected database cluster. As a result, some pull request background processing was temporarily delayed. That processing is responsible for sending the internal events that Copilot agents use to begin work, so affected agents did not start until the database replicas caught up. The system recovered once replication lag returned to normal and pull request processing resumed. We are reviewing how this safeguard interacts with recovery migrations so we can reduce the chance of similar secondary impact during future incident recovery work.
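The sketch below is a minimal model of the kind of safeguard described above, assuming a lag probe that reports per-replica replication lag in seconds; the threshold, polling interval, and names are illustrative assumptions, not GitHub's actual implementation.

```python
# Hypothetical replication-lag write throttle (illustrative only).
import time
from typing import Callable, Iterable

MAX_LAG_SECONDS = 5.0  # assumed lag ceiling before writes are paused


def throttled_write(lag_probe: Callable[[], Iterable[float]],
                    write: Callable[[], None],
                    poll_interval: float = 1.0) -> None:
    """Defer the write while any monitored replica lags too far behind."""
    # The failure mode from this incident: if the probe includes replicas
    # that serve no user traffic, migration-induced lag on those hosts
    # still pauses writes (and the downstream event delivery that Copilot
    # agents depend on) until replication catches up.
    while max(lag_probe(), default=0.0) > MAX_LAG_SECONDS:
        time.sleep(poll_interval)
    write()
```

Pausing writes on elevated lag is normally the desired behavior; the review mentioned above concerns which replicas should count toward that signal while a recovery migration is running.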

May 7, 2026 05:02 - 06:56 UTC Resolved

Incident with Pull Requests

On May 6, 2026, between 15:12 and 19:02 UTC, creation of new pull request review threads on GitHub.com failed. This included new line comments and file comments on pull requests. Existing PRs and previously created comments were unaffected. This incident was caused by a 32-bit integer key reaching its maximum value in a Vitess lookup table used during PR thread creation. The primary table had been migrated to a 64-bit integer key, but the Vitess lookup table remained 32-bit. Once the values in the primary table passed the available 32-bit ID space in the lookup table, attempts to create new review threads began failing, resulting in a near-100% failure rate for new thread creation requests. We mitigated the issue by updating the impacted lookup table definitions across all shards to use 64-bit integer column types, increasing the available ID range and restoring normal operation. Service was fully restored once the schema changes completed globally. To help prevent similar incidents, we are expanding existing monitoring of database columns to include Vitess lookup tables, so we are notified in advance of any table that is approaching a column size limit. This work is intended to provide earlier detection of columns approaching size limits before customer impact occurs.
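For context, a signed 32-bit integer column tops out at 2,147,483,647 IDs, while a signed 64-bit column allows roughly 9.2 quintillion. The sketch below shows the kind of capacity check the expanded monitoring could perform; the table names, threshold, and interface are assumptions, not GitHub's actual tooling.

```python
# Hypothetical ID-space monitor for integer key columns (illustrative).
INT32_MAX = 2**31 - 1   # 2,147,483,647: ceiling of a signed 32-bit INT
INT64_MAX = 2**63 - 1   # ceiling of a signed 64-bit BIGINT

ALERT_THRESHOLD = 0.80  # assumed: warn at 80% of the ID space consumed


def check_id_space(table: str, current_max_id: int, column_max: int) -> None:
    """Warn well before a key column runs out of assignable IDs."""
    used = current_max_id / column_max
    if used >= ALERT_THRESHOLD:
        print(f"WARNING: {table} has consumed {used:.1%} of its ID space")


# A lookup table still keyed on 32-bit INT alerts long before the
# primary table that was already migrated to BIGINT would:
check_id_space("pr_thread_lookup", 1_900_000_000, INT32_MAX)  # ~88%: warns
check_id_space("pr_threads", 1_900_000_000, INT64_MAX)        # silent
```

Monitoring lookup tables alongside primary tables matters precisely because the two can be migrated at different times, as happened here.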

May 6, 2026 15:25 - 19:04 UTC Resolved

Disruption with some GitHub services

On May 6, 2026, between 11:02 UTC and 11:13 UTC, users were unable to start or view Copilot Cloud Agent or remote sessions. During this time, requests to the session API returned errors, preventing users from creating new sessions or viewing existing ones. The issue was caused by a configuration change to the service's network routing that inadvertently removed the ingress path for the service. The team reverted the change at 11:13 UTC, which restored service. The incident remained open until 11:59 UTC while the team verified full recovery. We are taking steps to improve our deployment validation process to prevent similar configuration changes from impacting production traffic in the future.

May 6, 2026 11:21 - 11:59 UTC Resolved

Incident with Actions: degraded availability

On May 6, 2026, from approximately 06:45 UTC to 09:15 UTC, GitHub Actions Standard Ubuntu hosted runners were degraded, and 17.1% of jobs requesting a standard runner failed. This was caused by an unexpected data shape in the allocation configuration data for standard runners. That data was introduced as part of post-incident remediation work for an incident the previous day and caused new allocations to be blocked as load ramped up for the day. Removing that data at 08:51 UTC allowed allocations to proceed and hosted runner pools to scale up and recover. We are updating the filter logic for this allocation data to be resilient to abnormal data shapes, and improving monitoring to alert when allocations are blocked, allowing the team to respond before customer impact starts.
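The sketch below illustrates one way filter logic can be hardened against abnormal data shapes, in the spirit of the remediation described above; the record schema and field names are hypothetical stand-ins for the runner allocation configuration.

```python
# Hypothetical shape-tolerant filter for allocation records (illustrative).
from typing import Any


def usable_pools(records: list[Any]) -> list[dict]:
    """Keep records with the expected shape; skip, rather than fail on,
    malformed entries so one bad record cannot block all allocations."""
    good = []
    for rec in records:
        if (isinstance(rec, dict)
                and isinstance(rec.get("pool"), str)
                and isinstance(rec.get("capacity"), int)
                and rec["capacity"] > 0):
            good.append(rec)
        else:
            print(f"skipping malformed allocation record: {rec!r}")
    return good


# The well-formed record survives; the abnormal shapes are skipped:
print(usable_pools([{"pool": "ubuntu-std", "capacity": 500},
                    {"pool": "ubuntu-std", "capacity": "500"},  # wrong type
                    ["not", "a", "record"]]))                   # wrong shape
```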

May 6, 2026 07:19 - 09:44 UTC Resolved

Increased Latency and Failures for SSH Git Operations

Between approximately 14:00 and 16:10 UTC on May 5, 2026, SSH-based Git operations experienced elevated latency and intermittent failures. On average, the error rate was 0.46%, peaking at 0.6% of SSH write requests. HTTP-based Git operations, including the web UI and HTTPS clones, were not affected. The impact was caused by reduced SSH capacity at one of our data center sites. During a period of high traffic, the remaining hosts became overloaded, leading to connection exhaustion and some failures for SSH-based operations. Additional hosts were provisioned to expand SSH capacity and resolve the incident; the expanded capacity was fully online by 18:18 UTC. To reduce the likelihood of similar incidents, we will implement faster scaling for SSH infrastructure and improved alerting for host availability and capacity thresholds.

May 5, 2026 16:49 - 18:35 UTC Resolved

Incident with Actions

On May 5, 2026, from approximately 13:22 UTC to 17:05 UTC, GitHub Actions hosted runners in the East US region were degraded. 13.5% of jobs requesting a standard runner failed, and ~16% of requested Larger Runners with private networking pinned to East US failed or were delayed by more than 5 minutes. Copilot Code Review requests were also impacted: approximately 8,500 code review requests timed out during this window. Affected users saw an error comment on their pull requests and were able to retry by re-requesting a review. Most runner requests were picked up by other regions automatically, but a portion of requests still routing to East US were impacted. This was triggered by a scale-up operation for hosted runner VMs in the East US region. This is a regular operation, but the VM create load hit an internal rate limit when VM creates pull images from storage. Existing backoff logic was not triggered because of the response code returned in this case. The rate limiting and VM creation failures were mitigated by reducing load to allow for recovery and allowing queued work to be processed. By 15:34 UTC, queued and failed job assignments were mostly mitigated, with less than 0.5% of runner assignments impacted between 15:34 and full recovery at 17:05. We are improving our system's throttling behavior when limits occur, improving our controls to more quickly mitigate similar situations in the future, and reviewing all limits end-to-end for similar operations. We also immediately paused all scale and similar operations until these changes are in place and validated.
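The backoff gap described above can be illustrated with a small sketch: retry logic that backs off only on an expected set of throttling codes will retry at full speed when the dependency throttles with a code outside that set. The codes, timings, and names below are illustrative assumptions, not GitHub's client.

```python
# Hypothetical retry helper showing the incomplete-code-set gap (illustrative).
import random
import time
from typing import Callable

# If the storage layer throttles with a code missing from this set,
# the caller keeps retrying without slowing down, as in this incident.
THROTTLE_CODES = {429}


def call_with_backoff(request: Callable[[], int], max_attempts: int = 5) -> int:
    for attempt in range(max_attempts):
        status = request()
        if status < 400:
            return status
        if status in THROTTLE_CODES:
            # Exponential backoff with jitter before the next attempt.
            time.sleep(min(2 ** attempt, 30) + random.random())
        # else: an unrecognized error code falls through and is retried
        # immediately, adding load to an already rate-limited dependency.
    raise RuntimeError("exhausted retries")
```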

May 5, 2026 13:37 - 17:26 UTC Resolved

Incident with Issues and Webhooks

On May 4, 2026 at 15:37 UTC, we detected increased latency on Issues, resulting in timeouts, as well as elevated 500 errors on Webhooks. A scheduled workload drove high utilization on the primary host of a critical datastore, saturating its connection pool. We paused the job at 16:40 UTC to mitigate the problem and have implemented measures to prevent recurrence.
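A minimal sketch of the saturation mode described above, assuming a fixed-size connection pool shared by a scheduled batch job and interactive traffic; the pool size and timeout are assumptions for illustration.

```python
# Hypothetical fixed-size connection pool under batch load (illustrative).
import threading

POOL_SIZE = 10  # assumed pool capacity on the primary host
pool = threading.BoundedSemaphore(POOL_SIZE)


def handle_request(label: str, timeout: float = 2.0) -> bool:
    """Serve a request if a pooled connection frees up in time."""
    if not pool.acquire(timeout=timeout):
        print(f"{label}: connection pool exhausted, request times out")
        return False  # surfaces to users as latency, then a 500
    try:
        return True   # ... query the primary host here ...
    finally:
        pool.release()


# A scheduled workload that holds every connection starves other traffic:
for _ in range(POOL_SIZE):
    pool.acquire()                   # batch job checks out all connections
handle_request("webhook delivery")   # times out until the job is paused
```

Pausing the scheduled job, as described above, releases its connections and lets interactive traffic drain the backlog.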

May 4, 2026 15:45 - 16:40 UTC Resolved