Intermittent errors with app installation token authentication
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
On May 20, 2026, between 16:00 UTC and 17:45 UTC, GitHub Actions customers experienced run start delays exceeding 5 minutes. Approximately 4.5% of all runs were delayed during the impact window, with scale set jobs disproportionately affected. 30% of scale set jobs were delayed and 4% failed to start entirely. The incident was caused by a misconfigured health check on an internal service that assigns jobs to runners. A brief latency spike in an upstream dependency triggered health check failures across several pods, removing them from service and concentrating load on the remaining capacity. The added load drove memory pressure that escalated into a cascading failure in one regional cluster, leaving it unable to self-recover. Responders mitigated the incident by scaling capacity in the healthy regional clusters and draining traffic away from the impaired one, after which run start latency recovered. To prevent recurrence, we are strengthening our health check configuration to avoid cascading failure scenarios and evaluating automated mitigations to rebalance traffic when a region is degraded.
On May 15, 2026, from approximately 07:43 UTC to 08:48 UTC, GitHub Actions experienced a degradation that caused workflow runs to fail or experience delayed starts for a subset of customers. The incident was triggered by a planned failover of supporting infrastructure used by GitHub Actions. During that operation, an automated service discovery update did not propagate correctly, which caused traffic to be routed incorrectly and increased request timeouts in a core dependency for workflow orchestration. At peak impact, 42% of Actions runs failed. Downstream services that depend on Actions workflow execution were also impacted, including GitHub Pages and Copilot cloud services. At 08:12 UTC, responders manually corrected the service discovery routing issue. Timeout and failure rates recovered shortly after, and we continued monitoring until full stabilization was confirmed across all affected services. The incident was marked resolved at 08:48 UTC. To prevent recurrence, we are implementing failover guardrails that validate service discovery state before completing failover operations, strengthening pre-flight and post-flight verification checks, and improving dependency resilience to reduce timeout cascades during infrastructure events.
Beginning at 02:49 UTC on May 15 2026 and lasting until 03:04 UTC, GitHub.com was unavailable for a subset of customers. This impact has been mitigated and normal service resumed.
The issue was rooted in a sudden spike in traffic, with intermittent impact. We've identified the source of the traffic and prevented further disruption.
On May 13, 2026, between 14:31 and 16:03 UTC, the Code Scanning service experienced processing delays and 12% of check runs took over 15 minutes to complete. The delays were caused by replication lag due to an internal database migration, resulting in insufficient worker capacity for our high rate of job enqueues. We mitigated the impact by scaling our processing workers by 34%. Code Scanning results returned to normal processing times after the mitigation was applied. The capacity increases are permanent, and we are looking into more ways to decrease the load on our workers to help prevent this in the future.
On May 12, 2026, between 13:41 and 17:43 UTC, some services experienced delays in processing. For the Code Scanning service, 53% of check runs took over 15 minutes to complete. Additionally, notifications took an average of 22 minutes to be delivered and Slack integration webhooks took an average of 20 minutes to be delivered. The delays were caused by replication lag due to an internal database migration, resulting in insufficient worker capacity for our high rate of job enqueues. We mitigated the impact by scaling our processing workers to handle the increased load. All services returned to normal processing times after the mitigation was applied. We are working to create dedicated worker pools for some of our high usage shared queues to help prevent this in the future.
On May 11th, 2026, between 14:00 UTC and 14:33 UTC, HTTP-based Git read operations were degraded. On average, the error rate was 2.8% and peaked at 7.5% of requests to the service. This was due to resource exhaustion in a networking gateway between GitHub.com’s frontend service for Git operations and a dependency service that performs authentication and authorization. Following the initial spike, the frontend service became stuck in a degraded state in one of our data centers, increasing time to mitigation. We mitigated the incident by scaling the networking gateway and re-deploying the frontend service. To reduce our time to detection and mitigation in the future, we are adding auto-scaling to the networking gateway, and resolving a bug which caused the frontend service to remain degraded.
On May 7, 2026, between 04:12 UTC and 06:13 UTC, Copilot Cloud Agent and Copilot Code Review Agent sessions for pull requests were delayed or failed to start.The issue was caused by follow-up recovery work from a separate Pull Requests incident (https://www.githubstatus.com/incidents/f5pb5d5mr9yh). As part of that recovery, we ran a large database migration, which caused replication delays on several replica hosts.Although those replicas were not serving user traffic, our safeguards correctly treated the elevated replication lag as a signal to slow down writes to the affected database cluster. As a result, some pull request background processing was temporarily delayed. That processing is responsible for sending the internal events that Copilot agents use to begin work, so affected agents did not start until the database replicas caught up.The system recovered once replication lag returned to normal and pull request processing resumed. We are reviewing how this safeguard interacts with recovery migrations so we can reduce the chance of similar secondary impact during future incident recovery work.
On May 6, 2026 between 15:12 and 19:02 UTC creation of new pull request review threads on GitHub.com failed. This included new line comments and file comments on pull requests. Existing PRs and previously created comments were unaffected. This incident was caused by a 32-bit integer key reaching its maximum value in a Vitess lookup table used during PR thread creation. The primary table had been migrated to a 64-bit integer key but the Vitesse lookup table remained 32-bit. Once the values in the primary table passed the available 32-bit ID space in the lookup table, attempts to create new review threads began failing, resulting in near 100% failure rate for new thread creation requests. We mitigated the issue by updating the impacted lookup table definitions across all shards to use 64-bit integer column types, increasing the available ID range and restoring normal operation. Service was fully restored once the schema changes competed globally. To help prevent similar incidents, we are expanding existing monitoring of database columns to include Vitess lookup tables to notify in advance of any tables that is approaching a column size limit. This work is intended to provide earlier detection of columns approaching size limits before customer impact occurs.
On May 6, 2026 between 11:02 UTC and 11:13 UTC, users were unable to start or view Copilot Cloud Agent or remote sessions. During this time, requests to the session API returned errors, preventing users from creating new sessions or viewing existing ones. The issue was caused by a configuration change to the service's network routing that inadvertently removed the ingress path for the service. The team reverted the change at 11:13 UTC which restored service. The incident remained open until 11:59 UTC while the team verified full recovery. We are taking steps to improve our deployment validation process to prevent similar configuration changes from impacting production traffic in the future.