Incident History

Disruption with some GitHub services

On September 15th between 17:55 and 18:20 UTC, Copilot experienced degraded availability for all features. This was due a partial deployment of a feature flag to a global rate limiter. The flag triggered behavior that unintentionally rate limited all requests, resulting in 100% of them returning 403 errors. The issue was resolved by reverting the feature flag which resulted in immediate recovery.The root cause of the incident was from an undetected edge case in our rate limiting logic. The flag was meant to scale down rate limiting for a subset of users, but unintentionally put our rate limiting configuration into an invalid state.To prevent this from happening again, we have addressed the bug with our rate limiting. We are also adding additional monitors to detect anomalies in our traffic patterns, which will allow us to identify similar issues during future deployments. Furthermore, we are exploring ways to test our rate limit scaling in our internal environment to enhance our pre-production validation process.

Repository search is degraded

At around 18:45 UTC on Friday, September 12, 2025, a change was deployed that unintentionally affected search index management. As a result, approximately 25% of repositories were temporarily missing from search results.By 12:45 UTC on Saturday, September 14, most missing repositories were restored from an earlier search index snapshot, and repositories updated between the snapshot and the restoration were reindexed. This backfill was completed at 21:25 UTC.After these repairs, about 98.5% of repositories were once again searchable. We are performing a full reconciliation of the search index and customers can expect to see records being updated and content becoming searchable for all repos again between now and Sept 25.NOTE: Users who notice missing or outdated repositories in search results can force reindexing by starring or un-starring the repository. Other repository actions such as adding topics, or updating the repository description, will also result in reindexing. In general, changes to searchable artifacts in GitHub will also update their respective search index in near-real time.User impact has been mitigated with the exception of the 1.5% of repos that are missing from the search index. The change responsible for the search issue has been reverted, and full reconciliation of the search index is underway, expected to complete by September 23. We have added additional checks to our indexing model to ensure this failure does not happen again. We are also investigating faster repair alternatives.To avoid resource contention and possible further issues we are currently not repairing repositories or organizations individually at this time. No repository data was lost, and other search types were not affected.

Incident with Actions

On September 10, 2025 between 13:00 and 14:15 UTC, Actions users experienced failed jobs and run start delays for Ubuntu 24 and Ubuntu 22 jobs on standard runners in private repositories. Additionally, larger runner customers experienced run start delays for runner groups with private networking configured in the eastus2 region. This was due to an outage in an underlying compute service provider in eastus2. 1.06% of Ubuntu 24 jobs and 0.16% of Ubuntu 22 jobs failed during this period. Jobs for larger runners using private networking in the eastus2 region were unable to start for the duration of the incident.We have identified and are working on improvements in our resilience to single partner region outages for standard runners so impact is reduced in similar scenarios in the future.

Degraded REST API success rates for some customers

On September 4, 2025 between 15:30 UTC and 20:00 UTC the REST API endpoints git/refs, git/refs/, and git/matching-refs/ were degraded and returned elevated errors for repositories with reference counts over 22k. On average, the request error rate to these specific endpoints was 0.5%. Overall REST API availability remained 99.9999%. This was due to the introduction of a code change that added latency to reference evaluations and overly affected repositories with many branches, tags, or other references.We mitigated the incident by reverting the new code.We are working to improve performance testing and to reduce our time to detection and mitigation of issues like this one in the future.

Loading avatars might fail for a 0.5% of total users and 100% users around the Arabian Peninsula. We are investigating.

Between August 21, 2025 at 15:00 UTC, and September 2, 2025 at 15:22 UTC the avatars.githubusercontent.com image service was degraded and failed to display user avatars for users in the Middle East. During this time, avatar images appeared broken on github.com for affected users. On average, this impacted about 82% of users routed through one of our Middle East-based points-of-presence, which represents about 0.14% of global users.This was due to a configuration change within GitHub's edge infrastructure in the affected region, causing HTTP requests to fail. As a result, image requests returned HTTP 503 errors. The failure to detect the issues was the result of an alerting threshold set too low.We mitigated the incident by removing the affected site from service, which restored avatar serving for impacted users.To prevent this from recurring, we have tuned configuration defaults for graceful degradation. We also added new health checks to automatically shift traffic from impacted sites. We are updating our monitoring to prevent undetected errors like this in the future.

Disruption with some GitHub services

On August 27, 2025 between 20:35 and 21:17 UTC, Copilot, Web and REST API traffic experienced degraded performance. Copilot saw an average of 36% of requests fail with a peak failure rate of 77%. Approximately 2% of all non-Copilot Web and REST API traffic requests failed.

This incident occurred after we initiated a production database migration to drop a column from a table backing copilot functionality. While the column was no longer in direct use, our ORM continued to reference the dropped column. This led to a large number of 5xx responses and was similar to the incident on August 5th. At 21:15 UTC, we applied a fix to the production schema and by 21:17 UTC, all services had fully recovered.

While repairs were in progress to avoid this situation, they were not completed quickly enough to prevent a second incident. We have now implemented a temporary block for all drop column operations as an immediate solution while we add more safeguards to prevent similar issues from occurring in the future. We are also implementing graceful degradation so that Copilot issues will not impact other features of our product.

Incident with Actions

On August 21, 2025, from approximately 15:37 UTC to 18:10 UTC, customers experienced increased delays and failures when starting jobs on GitHub Actions using standard hosted runners. This was caused by connectivity issues in our East US region, which prevented runners from retrieving jobs and sending progress updates. As a result, capacity was significantly reduced, especially for busier configurations, leading to queuing and service interruptions. Approximately 8.05% of jobs on public standard Ubuntu24 runners and 3.4% of jobs on private standard Ubuntu24 runners did not start as expected.By 18:10 UTC, we had mitigated the issue by provisioning additional resources in the affected region and burning down the backlog of queued runner assignments. By the end of that day, we deployed changes to improve runner connectivity resilience and graceful degradation in similar situations. We are also taking further steps to improve system resiliency by enhancing observability of network connection health with runners and improving load distribution and failover handling to help prevent similar issues in the future.

Incident with Issues and Git Operations

On August 21st, 2025, between 6:15am UTC and 6:25am UTC Git and Web operations were degraded and saw intermittent errors. On average, the error rate was 1% for API and Web requests. This was due to database infrastructure automated maintenance reducing capacity below our tolerated threshold.The incident was resolved when the impacted infrastructure self-healed and returned to normal operating capacity.We are adding guardrails to reduce the impact of this type of maintenance in the future.

Disruption with some GitHub services

Between 15:49 and 16:37 UTC on 20 Aug 2025, creating a new GitHub account via the web signup page consistently returned server errors, and users were unable to complete signup during this 48-minute window. We detected the issue at 16:04 UTC and restored normal signup functionality by 16:37 UTC. A recent change to signup flow logic caused all attempts to error. The change was rolled back to restore service. This exposed a gap in our test coverage that we are fixing.

Disruption with some GitHub services

On August 19, 2025, between 13:35 UTC and 14:33 UTC, GitHub search was in a degraded state. When searching for pull requests, issues, and workflow runs, users would have seen some slow, empty or incomplete results. In some cases, pull requests failed to load.The incident was triggered by intermittent connectivity issues between our load balancers and search hosts. While retry logic initially masked these problems, retry queues eventually overwhelmed the load balancers, causing failure. The incident was mitigated at 14:33 UTC by throttling our search index pipeline. Our automated alerting and internal retries reduced the impact of this event significantly. As a result of this incident we believe we have identified a faster way to mitigate it in the future. We are also working on multiple solutions to resolve the underlying connectivity issues.

⮜ Previous Next ⮞