On Dec 15th, 2025, between 14:00 UTC and 15:45 UTC the Copilot service was degraded for Grok Code Fast 1 model. On average, 4% of the requests to this model failed due to an issue with our upstream provider. No other models were impacted.The issue was resolved after the upstream provider fixed the problem that caused the disruption. GitHub will continue to enhance our monitoring and alerting systems to reduce the time it takes to detect and mitigate similar issues in the future.
On December 3, 2025, between 22:21 UTC and 23:44 UTC, the Webhooks service experienced a degradation that delayed writes of webhook delivery records to our database. During this period, many webhook deliveries were not visible in the webhook delivery UI or API for more than an hour after they were sent. As a result, customers were temporarily unable to request redeliveries for those delayed records. The underlying cause was throttling of database writes due to high replication lag.
We mitigated the incident by temporarily disabling delivery history for a small number of very high‑volume webhook owners to reduce write pressure and stabilize the service. We are contacting the affected customers directly with more details.
We are improving our webhook delivery storage architecture so it can scale with current and future webhook traffic, reducing the likelihood and impact of similar issues.
Between 13:25 UTC and 18:35 UTC on Dec 11th, GitHub experienced an increase in scraper activity on public parts of our website. This scraper activity caused a low priority web request pool to increase and eventually exceed total capacity resulting in users experiencing 500 errors. In particular, this affected Login, Logout, and Signup routes, along with less than 1% requests from within Actions jobs. At the peak of the incident, 7.6% of login requests were impacted, which was the most significant impact of this scraping attack.Our mitigation strategy identified the scraping activity and blocked it. We also increased the pool of web requests that were impacted to have more capacity, and lastly we upgraded key user login routes to higher priority queues. In future, we’re working to more proactively identify this particular scraper activity and have faster mitigation times.
Between 13:25 UTC and 18:35 UTC on December 11th, GitHub experienced elevated traffic to portions of GitHub.com that exceeded previously provisioned capacity for specific request types. As a result, users encountered intermittent 500 errors. Impact was most pronounced on Login, Logout, and Signup pages, peaking at 7.6% of login requests. Additionally, fewer than 1% of requests originating from GitHub Actions jobs were affected. This incident was driven by the same underlying factors as the previously reported disruption to Login and Signup flowsOur immediate response focused on identifying and mitigating the source of the traffic increase. We increased available capacity for web request handling to relieve pressure on constrained pools. To reduce recurrence risk, we also re-routed critical authentication endpoints to a different traffic pool, ensuring sufficient isolation and headroom for login related traffic.In future, we’re working to more proactively identify these large changes in traffic volume and improve our time to mitigation.
Between December 9th, 2025 21:07 UTC and December 10th, 2025 14:52 UTC, 177 macos-14-large jobs were run on an Ubuntu larger runner VM instead of MacOS runner VMs. The impacted jobs were routed to a larger runner with incorrect metadata. We mitigated this by deleting the runner.The routing configuration is not something controlled externally. A manual override was done previously for internal testing, but left incorrect metadata for a large runner instance. An infrastructure migration caused this misconfigured runner to come online which started the incorrect assignments. We are removing the ability to manually override this configuration entirely, and are adding alerting to identify possible OS mismatches for hosted runner jobs.As a reminder, hosted runner VMs are secure and ephemeral, with every VM reimaged after every single job. All jobs impacted here were originally targeted at a GitHub-owned VM image and were run on a GitHub-owned VM image.
On December 10, 2025 between 08:50 UTC and 11:00 UTC, some GitHub Actions workflow runs experienced longer-than-normal wait times for jobs starting or completing. All jobs successfully completed despite the delays. At peak impact, approximately 8% of workflow runs were affected.During this incident, some nodes received a spike in workflow events that led to queuing of event processing. Because runs are pinned to nodes, runs being processed by these nodes saw delays in starting or showing as completed. The team was alerted to this at 8:58 UTC. Impacted nodes were disabled from processing new jobs to allow queues to drain.We have increased overall processing capacity and are implementing safeguards to better balance load across all nodes when spikes occur. This is important to ensure our available capacity can always be fully utilized.
On December 8, 2025, between 21:15 and 22:24 UTC, Copilot code completions experienced a significant service degradation. During this period, up to 65% of code completion requests failed.The root cause was an internal feature flag that caused the primary model supporting Copilot code completions to appear unavailable to the backend service. The issue was resolved once the flag was disabled.To prevent recurrence, we expanded test coverage for Copilot code completion models and are strengthening our detection mechanisms to better identify and respond to traffic anomalies.
On November 26th, 2025, between approximately 02:24 UTC and December 8th, 2025 at 20:26 UTC, enterprise administrators experienced a disruption when viewing agent session activities in the Enterprise AI Controls page. During this period, users were unable to list agent session activity in the AI Controls view. This did not impact viewing agent session activity in audit logs or directly navigating to individual agent session logs, or otherwise managing AI Agents.The issue was caused by a misconfiguration in a change deployed on November 25th that unintentionally prevented data from being published to an internal Kafka topic responsible for feeding the AI Controls page with agent session activity information.The problem was identified and mitigated on December 8th by correcting the configuration issue. GitHub is improving monitoring for data pipeline dependencies and enhancing pre-deployment validation to catch configuration issues before they reach production.
On December 5th, 2025, between 12:00 pm UTC and 9:00 pm UTC, our Team Synchronization service experienced a significant degradation, preventing over 209,000 organization teams from syncing their identity provider (IdP) groups. The incident was triggered by a buildup of synchronization requests, resulting in elevated Redis key usage and high CPU consumption on the underlying Redis cluster.To mitigate further impact, we proactively paused all team synchronization requests between 3:00 pm UTC and 8:15 pm UTC, allowing us to stabilize the Redis cluster. Our engineering team also resolved the issue by flushing the affected Redis keys and queues, which promptly stopped runaway growth and restored service health. Additionally, we scaled up our infrastructure resources to improve our ability to process the high volume of synchronization requests. All pending team synchronizations were successfully processed following service restoration.We are working to strengthen the Team Synchronization service by implementing a killswitch, adding throttling to prevent excessive enqueueing of synchronization requests, and improving the scheduler to avoid duplicate job requests. Additionally, we’re investing in better observability to alert when job drops occur. These efforts are focused on preventing similar incidents and improving overall reliability going forward.
On November 28th, 2025, between approximately 05:51 and 08:04 UTC, Copilot experienced an outage affecting the Claude Sonnet 4.5 model. Users attempting to use this model received an HTTP 400 error, resulting in 4.6% of total chat requests during this timeframe failing. Other models were not impacted.The issue was caused by a misconfiguration deployed to an internal service which made Claude Sonnet 4.5 unavailable. The problem was identified and mitigated by reverting the change. GitHub is working to improve cross-service deploy safeguards and monitoring to prevent similar incidents in the future.