On July 23rd, 2025, from approximately 14:30 to 16:30 UTC, GitHub Actions experienced delayed job starts for workflows in private repos using Ubuntu-24 standard hosted runners. This was due to resource provisioning failures in one of our datacenter regions. During this period, approximately 2% of Ubuntu-24 hosted runner jobs on private repos were delayed. Other hosted runners, self-hosted runners, and public repo workflows were unaffected. To mitigate the issue, additional worker capacity was added from a different datacenter region at 15:35 UTC and further increased at 16:00 UTC. By 16:30 UTC, job queues were healthy and service was operating normally. Since the incident, we have deployed changes to improve how regional health is accounted for when allocating new runners, and we are investigating further improvements to our automated capacity scaling logic and manual overrides to prevent a recurrence.
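As a rough illustration of the regional health weighting described above (the names, threshold, and structure here are hypothetical assumptions, not GitHub's implementation), an allocator might skip regions whose recent provisioning success rate has dropped:

```python
# Hypothetical sketch: choose a region for new hosted-runner capacity,
# skipping regions whose recent provisioning success rate is unhealthy.
from dataclasses import dataclass
import random

@dataclass
class Region:
    name: str
    provisioning_success_rate: float  # rolling success rate, 0.0-1.0
    available_capacity: int           # runners we could still provision here

HEALTH_THRESHOLD = 0.95  # assumed cutoff; below this a region is excluded

def pick_region(regions: list[Region]) -> Region:
    """Prefer healthy regions, weighted by spare capacity; fall back to the
    least-unhealthy region so provisioning never stalls entirely."""
    healthy = [r for r in regions
               if r.provisioning_success_rate >= HEALTH_THRESHOLD
               and r.available_capacity > 0]
    if healthy:
        weights = [r.available_capacity for r in healthy]
        return random.choices(healthy, weights=weights, k=1)[0]
    # Every region is degraded: fall back to the one failing least often.
    return max(regions, key=lambda r: r.provisioning_success_rate)
```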
On July 22nd, 2025, between 17:58 and 18:35 UTC, the Copilot service experienced degraded availability for Claude Sonnet 4 requests. 4.7% of Claude Sonnet 4 requests failed during this time. No other models were impacted. The issue was caused by an upstream problem affecting our ability to serve requests. We mitigated by rerouting capacity and monitoring recovery. We are improving detection and mitigation to reduce future impact.
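As a loose sketch of what "rerouting capacity" can look like in practice (the endpoint URLs and the requests-based client below are purely illustrative assumptions, not the Copilot service's actual architecture), a caller might fail over across upstream model endpoints:

```python
# Hypothetical sketch of rerouting: try the primary upstream model endpoint
# and fall back to alternates when it errors out. Names are illustrative.
import requests

ENDPOINTS = [
    "https://primary.upstream.example/v1/complete",
    "https://secondary.upstream.example/v1/complete",
]

def complete(prompt: str, timeout: float = 10.0) -> dict:
    last_error: Exception | None = None
    for url in ENDPOINTS:
        try:
            resp = requests.post(url, json={"prompt": prompt}, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            last_error = exc  # record and try the next upstream
    raise RuntimeError("all upstream endpoints failed") from last_error
```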
On July 21, 2025, between 07:20 UTC and 08:00 UTC, the Copilot service experienced degraded availability for Claude 4 requests. 2% of Claude 4 requests failed during this time. The issue was caused by an upstream problem affecting our ability to serve requests. We mitigated by rerouting capacity and monitoring recovery. We are improving detection and mitigation to reduce future impact.
On July 21st, 2025, between 07:00 UTC and 09:45 UTC, the API, Codespaces, Copilot, Issues, Package Registry, Pull Requests and Webhook services were degraded and experienced dropped requests and increased latency. During the worst of the impact (a two-minute window around 07:00 UTC), error rates peaked at 11% before falling shortly after, and average web request load times rose to 1 second. After this period, traffic gradually recovered, but error rates and latency remained slightly elevated until the end of the incident. This incident was triggered by a kernel bug that caused some of our load balancers to crash during a scheduled process following a kernel upgrade. To mitigate the incident, we halted the rollout of our upgrades and rolled back the impacted instances. We are working to make sure the kernel version is fully removed from our fleet. As a precaution, we temporarily paused the scheduled process to prevent it from triggering the bug on instances still running the affected kernel. We also tuned our alerting so we can more quickly detect and mitigate future incidents where we experience a sudden drop in load-balancing capacity.
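As an illustration of alerting on a sudden drop in load-balancing capacity (the window size, threshold, and function names below are assumptions, not our production alerting rules), a simple check might compare the current healthy load-balancer count against a recent baseline:

```python
# Hypothetical sketch of the kind of alerting tuning described above:
# fire when the healthy load-balancer count drops sharply versus a
# recent baseline. Thresholds and names are illustrative only.
from collections import deque

WINDOW = 12           # number of recent samples used as the baseline
DROP_THRESHOLD = 0.3  # alert if capacity falls 30%+ below the baseline

recent_counts: deque[int] = deque(maxlen=WINDOW)

def check_capacity(healthy_lb_count: int) -> bool:
    """Return True if an alert should fire for this sample."""
    alert = False
    if len(recent_counts) == WINDOW:
        baseline = sum(recent_counts) / WINDOW
        alert = baseline > 0 and healthy_lb_count < baseline * (1 - DROP_THRESHOLD)
    recent_counts.append(healthy_lb_count)
    return alert
```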
On July 15, 2025, between 19:55 and 19:58 UTC, requests to GitHub had a high failure rate, while successful requests suffered up to 10x the expected latency.
Browser-based requests saw a failure rate of up to 20%, GraphQL requests had up to a 9% failure rate, and 2% of REST API requests failed. Any downstream service dependent on GitHub APIs was also affected during this short window.
The failure stemmed from a database query change, which our deployment tooling automatically detected and rolled back. We will continue to invest in automated detection and rollback with the goal of minimizing time to recovery.
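For readers unfamiliar with automated rollback, here is a minimal sketch of the general pattern, assuming a hypothetical `deploy rollback` command and an error-rate threshold that are not GitHub's actual tooling:

```python
# Hypothetical sketch of automated detect-and-roll-back behaviour: compare
# the newly deployed revision's error rate against the previous one and
# revert if it regresses past a threshold. All names are illustrative.
import subprocess

MAX_ERROR_RATE_INCREASE = 0.02  # assumed: tolerate up to 2 points of regression

def maybe_rollback(new_error_rate: float, baseline_error_rate: float,
                   previous_revision: str) -> bool:
    """Roll back to the previous revision if the new one regresses."""
    if new_error_rate - baseline_error_rate > MAX_ERROR_RATE_INCREASE:
        # Placeholder command; real tooling would be service-specific.
        subprocess.run(["deploy", "rollback", "--to", previous_revision],
                       check=True)
        return True
    return False
```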
On July 16, 2025, between 05:20 UTC and 08:30 UTC, the Copilot service experienced degraded availability for Claude 3.7 requests. Around 10% of Claude 3.7 requests failed during this time. The issue was caused by an upstream problem affecting our ability to serve requests. We mitigated by rerouting capacity and monitoring recovery. We are improving detection and mitigation to reduce future impact.
On July 8, 2025, between 14:20 UTC and 16:30 UTC, the GitHub Actions service experienced degraded performance, leading to delays in updates to Actions workflow runs, including missing logs and delayed run statuses. During this period, 1.07% of Actions workflow runs experienced delayed updates, and 0.34% of runs completed but were shown as canceled. The incident was caused by imbalanced load in our underlying service infrastructure and was mitigated by scaling up our services and tuning resource thresholds. We are working to improve our resilience to load spikes and our capacity planning to prevent similar issues, and we are implementing more robust monitoring to reduce detection and mitigation time for similar incidents in the future.
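As a simplified sketch of the kind of threshold-based scaling mentioned above (the queue metric, target backlog, and limits are assumptions, not our production values), a worker pool processing run updates might be sized like this:

```python
# Hypothetical sketch of threshold-based scale-up for a pool of workers
# processing workflow-run updates. Numbers and names are illustrative.
import math

TARGET_BACKLOG_PER_WORKER = 50  # assumed tuning knob
MAX_WORKERS = 200

def desired_worker_count(queued_updates: int, current_workers: int) -> int:
    """Scale the pool so the per-worker backlog stays under the target."""
    needed = math.ceil(queued_updates / TARGET_BACKLOG_PER_WORKER)
    return min(MAX_WORKERS, max(current_workers, needed))
```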
On July 7, 2025, between 21:14 UTC and 22:34 UTC, Copilot coding agent was degraded and non-responsive to issue assignment. Impact was limited to internal GitHub staff because the feature flag gating a newly released feature was enabled only in internal development environments, not in GitHub's global production environment.
The incident was mitigated by disabling the feature flag for all users.
While our existing safeguards worked as intended—the feature flag allowed for immediate mitigation and the limited scope prevented broader impact—we are enhancing our monitoring to better detect issues that affect smaller user segments and reviewing our internal testing processes to identify similar edge cases before they reach production.
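As a minimal sketch of the feature-flag gating pattern described in this incident (the flag name and audience rules here are hypothetical, not GitHub's flagging system), disabling the flag acts as an immediate kill switch while the audience scope limits the blast radius:

```python
# Hypothetical sketch of feature-flag gating: a new capability is enabled
# only for an internal audience, and flipping the flag off mitigates for
# everyone at once. Flag names and structure are illustrative.
FLAGS = {
    "copilot_coding_agent_new_feature": {"enabled": True, "audience": "internal"},
}

def feature_enabled(flag: str, actor_is_internal: bool) -> bool:
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False  # disabling the flag is an immediate kill switch
    if cfg["audience"] == "internal":
        return actor_is_internal  # scope limits impact to internal users
    return True
```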
On July 7th, 2025, between 18:20 UTC and 22:10 UTC, the Actions service was degraded for GitHub Larger Hosted and scale set runners. During this time window, 9% of GitHub Larger Hosted Runner and scale set jobs saw a delay of at least 5 minutes before being assigned to a runner. Impact was more apparent to customers who didn't have pre-scaled runner pools or who queued jobs infrequently during the incident window. This was due to a change that unintentionally decreased the rate at which we notified our backend that new scale set runners were coming online, and it was mitigated by reverting that change. To reduce the likelihood and impact time of a similar issue in the future, we are improving our detection of this failure mode so we catch it in earlier stages of development and rollout.
On July 3, 2025, between 03:22 and 07:12 UTC, customers were prevented from SSO-authorizing Personal Access Tokens and SSH keys via the GitHub UI. Approximately 1,300 users were impacted. A code change modified the content type of the response returned by the server, causing a lazily loaded dropdown to fail to render and preventing users from proceeding with authorization. No authorization systems were impacted during the incident, only the UI component. We mitigated the incident by reverting the code change that introduced the problem. We are making improvements to our release process and test coverage to catch this class of error earlier in our deployment pipeline. Further, we are improving monitoring to reduce our time to detection and mitigation of issues like this one in the future.
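As an example of the kind of test coverage that can catch this class of error (the route and test client below are hypothetical, not GitHub's actual endpoints), a regression test could assert the content type of the response that backs the lazily loaded dropdown:

```python
# Hypothetical regression test for the bug class described above: the
# endpoint backing a lazily loaded dropdown must keep returning HTML,
# or the client-side loader cannot render it. The path and the `client`
# fixture are illustrative assumptions.
def test_sso_authorize_dropdown_returns_html(client):
    resp = client.get("/settings/tokens/authorize_dropdown")
    assert resp.status_code == 200
    assert resp.headers["Content-Type"].startswith("text/html")
```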