Incident History

Incident with Copilot

On July 11, 2024, between 10:20 UTC and 14:00 UTC, Copilot Chat was degraded and experienced intermittent timeouts. This only impacted requests routed to one of our service region providers. The error rate peaked at 10% of all requests, affecting 9% of users. This was due to host upgrades in an upstream service provider. While this was a planned event, processes and tooling were not in place to anticipate and mitigate this downtime. We are working to improve our processes and tooling for future planned events, as well as escalation paths with our upstream providers.

1720702930 - 1720711277 Resolved

Incident with Issues and Pages

On July 8th, 2024, between 18:18 UTC and 19:11 UTC, various services relying on static assets were degraded, including user-uploaded content on github.com, access to docs.github.com and Pages sites, and downloads of Release assets and Packages. The outage primarily affected users in the vicinity of New York City, USA, due to a local CDN disruption. Service was restored without our intervention. We are working to improve our external monitoring, which failed to detect the issue, and will be evaluating a backup mechanism to keep critical services available, such as being able to load assets on GitHub.com, in the event of an outage with our CDN.

1720465291 - 1720467919 Resolved

Incident with Webhooks and Actions

On July 5, 2024, between 16:31 UTC and 18:08 UTC, the Webhooks service was degraded, causing delays to all webhook deliveries. Deliveries were delayed by 24 minutes on average, with a maximum of 71 minutes. This was caused by a configuration change to the Webhooks service, which led to unauthenticated requests being sent to the background job cluster. The configuration error was repaired and re-deploying the service resolved it. However, this created a thundering herd effect that overloaded the background job queue cluster and pushed its API layer to maximum capacity, resulting in timeouts for other job clients, which presented as increased latency for API calls.

Shortly after resolving the authentication misconfiguration, we had a separate issue in the background job processing service where health probes were failing, leading to reduced capacity in the background job API layer, which magnified the effects of the thundering herd. From 18:21 UTC to 21:14 UTC, Actions runs on PRs experienced delays of approximately 2 minutes on average, with a maximum of 12 minutes. A deployment of the background job processing service remediated the issue.

To reduce our time to detection, we have streamlined our dashboards and added alerting for this specific runtime behavior. Additionally, we are working to reduce the blast radius of background job incidents through better workload isolation.
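As an illustration only, and not a description of GitHub's internal systems: a thundering herd like the one described above is commonly softened by retrying deliveries with exponential backoff and random jitter, so that a recovering backend is not hit by every client at once. The `send` callable and the parameter values in this sketch are hypothetical.

```python
import random
import time

def deliver_with_backoff(send, payload, max_attempts=6, base_delay=0.5, max_delay=60.0):
    """Retry a delivery with exponential backoff and full jitter.

    Spreading retries out randomly avoids the thundering-herd pattern in which
    many clients retry simultaneously and saturate a backend that is still
    recovering. `send` is a hypothetical callable that raises on failure.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```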

1720199089 - 1720213053 Resolved

Disruption with GitHub Docs

On July 3, 2024, between 13:34 UTC and 16:42 UTC, the GitHub documentation site was degraded and returned 500 errors on non-cached pages. The error rate ranged from 2-5% of requests to the service, peaking at 5%. This was due to an observability misconfiguration. We mitigated the incident by updating the observability configuration and redeploying. We are working to reduce our time to detection and mitigation of issues like this one in the future.

1720020269 - 1720024803 Resolved

Disruption with GitHub services

On July 2, 2024, between 18:21 UTC and 19:24 UTC, the code search service was degraded and returned elevated 500 HTTP status responses. On average, the error rate was 38% of code search requests. This was due to a bad deployment that caused some users' rate limit calculations to error while processing code search requests, impacting approximately 2,000 users. We mitigated the incident by rolling back the bad deployment and resetting rate limits for all users. We have identified and implemented updates to the testing of rate limit calculations to prevent this problem from happening again, and clarified deployment processes for verification before a full production rollout to minimize impact in the future.
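For illustration only, and not GitHub's actual code: one way to keep a bug in rate-limit calculation from surfacing as 500 errors is to treat a limiter failure as an explicit fail-open (or fail-closed) decision rather than an unhandled exception. The `compute_remaining_quota` callable below is hypothetical.

```python
import logging

logger = logging.getLogger("code_search")

def is_allowed(user_id, compute_remaining_quota):
    """Decide whether to serve a code search request for `user_id`.

    `compute_remaining_quota` is a hypothetical callable that may raise if the
    rate-limit calculation itself is broken. Failing open keeps a limiter bug
    from becoming user-facing 500s, at the cost of temporarily weaker limiting.
    """
    try:
        return compute_remaining_quota(user_id) > 0
    except Exception:
        logger.exception("rate limit calculation failed for user %s; failing open", user_id)
        return True
```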

1719945923 - 1719948289 Resolved

Incident with Git Operations

At approximately 19:20 UTC on July 1st, 2024, one of GitHub's peering links to a public cloud provider began experiencing 5-20% packet loss. This resulted in intermittent network timeouts for Git operations from customers who run their own environments with that specific provider. Investigation pointed to an issue with the physical link. At 01:14 UTC we rerouted traffic away from the problematic link to other connections, resolving the incident.

1719874792 - 1719882884 Resolved

Delays in changes to organization membership

On June 28th, 2024, at 16:06 UTC, a backend update by GitHub triggered a significant number of long-running Organization membership update jobs in our job processing system. The job queue depth rose as these update jobs consumed most of our job worker capacity, which delayed other jobs across services such as Pull Requests and PR-related Actions workflows. We mitigated the impact to Pull Requests and Actions at 19:32 UTC by pausing all Organization membership update jobs. We deployed a code change at 22:30 UTC to skip over the jobs queued by the backend change and re-enabled Organization membership update jobs. We restored Organization membership update functionality at 22:52 UTC, including all membership changes queued during the incident.

During the incident, about 15% of Actions workflow runs experienced a delay of more than five minutes. In addition, Pull Requests had delays in determining merge eligibility and in starting associated Actions workflows for the duration of the incident. Organization membership updates saw delays of upwards of five hours.

To prevent a similar event from impacting our users in the future, we are working to: improve our job management system to better manage our job worker capacity; add more precise monitoring for job delays; and strengthen our testing practices to prevent future recurrences.
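As a hedged illustration of the workload-isolation idea mentioned above, and not GitHub's job system: giving bulk, long-running jobs their own bounded worker pool keeps them from starving latency-sensitive queues such as Pull Request merge-eligibility checks. The queue names and pool sizes below are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker pools: a small, dedicated pool for bulk membership
# updates prevents a flood of long-running jobs from consuming the capacity
# needed by latency-sensitive Pull Request and Actions work.
POOLS = {
    "org_membership": ThreadPoolExecutor(max_workers=4),
    "pull_requests": ThreadPoolExecutor(max_workers=32),
}

def enqueue(queue_name, job, *args):
    """Route a job to the worker pool reserved for its queue."""
    return POOLS[queue_name].submit(job, *args)
```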

1719596047 - 1719615100 Resolved

Incident with Codespaces

On June 27th, 2024, between 22:38 UTC and 23:44 UTC, some Codespaces customers in the West US region were unable to create and/or resume their Codespaces. This was due to a configuration change that affected customers with a large number of Codespace secrets defined. We mitigated the incident by reverting the change. We are working to improve monitoring and testing processes to reduce our time to detection and mitigation of issues like this one in the future.

1719531288 - 1719531861 Resolved

Disruption with GitHub services

Between June 27th, 2024 at 20:39 UTC and 21:37 UTC, the Migrations service was unable to process migrations. This was due to an invalid infrastructure credential. We mitigated the issue by updating the credential and redeploying the service. Mechanisms and automation will be implemented to detect and prevent this issue from recurring.

1719522968 - 1719524534 Resolved

Incident with Copilot Pull Request Summaries

Between June 18th, 2024 at 21:34 UTC and June 19th, 2024 at 12:53 UTC, the Copilot Pull Request Summaries service was unavailable. This was due to an internal change in the access approach from the Copilot Pull Request service to the Copilot API. We mitigated the incident by reverting the change in access, which immediately resolved the errors. We are working to improve our monitoring in this area and reduce our time to detection so we can address issues like this one more quickly in the future.

1718798297 - 1718801613 Resolved