Incident History

Incident with search on GitHub: increased failure rates

On August 12, 2025, between 13:30 UTC and 17:14 UTC, GitHub search was in a degraded state. Users experienced inaccurate or incomplete results, failures to load certain pages (such as Issues, Pull Requests, Projects, and Deployments), and broken components such as Actions workflow and label filters.

Most user impact occurred between 14:00 UTC and 15:30 UTC, when up to 75% of search queries failed and updates to search results were delayed by up to 100 minutes. The incident was triggered by intermittent connectivity issues between our load balancers and search hosts. While retry logic initially masked these problems, retry queues eventually overwhelmed the load balancers, causing failures. The query failures were mitigated at 15:30 UTC after we throttled our search indexing pipeline to reduce load and stabilize retries. The connectivity failures were resolved at 17:14 UTC after the automated reboot of a search host, which allowed the rest of the system to recover. We have improved internal monitors and playbooks, and tuned our search cluster load balancer to reduce the likelihood of this failure mode recurring. We continue to invest in resolving the underlying connectivity issues.
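The retry behavior described above, where retries first hide a fault and then amplify load, is commonly bounded with a retry budget plus jittered backoff. The following is a minimal sketch of that general technique, not GitHub's implementation; `RetryBudget`, `backoff_delay`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for illustration.

```python
import random
import time


class RetryBudget:
    """Permit retries only while they stay under a fraction of total requests.

    Hypothetical helper: `ratio` and `min_retries` are illustrative defaults.
    """

    def __init__(self, ratio=0.1, min_retries=3):
        self.ratio = ratio
        self.min_retries = min_retries
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def try_spend(self):
        # Allowed retries grow with request volume, so a widespread outage
        # cannot multiply traffic by the per-request attempt count.
        allowed = self.min_retries + self.ratio * self.requests
        if self.retries < allowed:
            self.retries += 1
            return True
        return False


def backoff_delay(attempt, base=0.05, cap=2.0, rng=random.random):
    """Full-jitter exponential backoff: rng() scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(send, budget, max_attempts=3, sleep=time.sleep):
    """Call send(); retry on ConnectionError while attempts and budget allow."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return send()
        except ConnectionError:
            is_last = attempt == max_attempts - 1
            if is_last or not budget.try_spend():
                raise  # fail fast instead of queueing more retries
            sleep(backoff_delay(attempt))
```

With a shared budget, a fault that affects many requests at once exhausts the allowance quickly, so callers surface errors early rather than building up a retry queue.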

Disruption with some GitHub services

On August 11, 2025, from 18:41 to 18:57 UTC, GitHub customers experienced errors and increased latency when loading GitHub’s web interface. During this time, a configuration change intended to improve our UI deployment system caused a surge in requests to a backend datastore. The resulting spike in connection attempts saturated the datastore's connection backlog, causing intermittent failures to serve required UI content and elevated error rates for frontend requests.

The incident was mitigated by reverting the configuration, which restored normal service.

Following mitigation, we are evaluating improvements to our alerting thresholds and exploring architectural changes to reduce load on this datastore and improve the resilience of our UI delivery pipeline.

Incident with Pull Requests

At 15:33 UTC on August 5, 2025, we initiated a production database migration to drop a column from a table backing pull request functionality. While the column was no longer in direct use, our ORM continued to reference the dropped column in a subset of pull request queries. As a result, there were elevated error rates across pushes, webhooks, notifications, and pull requests, with impact peaking at approximately 4% of all web and REST API traffic. We mitigated the issue by deploying a change that instructed the ORM to ignore the removed column. Most affected services recovered by 16:13 UTC. However, that fix was applied only to our largest production environment. An update to some of our custom and canary environments did not pick up the fix, and this triggered a secondary incident affecting ~0.1% of pull request traffic, which was fully resolved by 19:45 UTC.

While migrations have protections such as progressive rollout that first targets validation environments and acknowledgment gates, this incident identified a gap in application monitoring; closing that gap would have halted the rollout once impact was observed. We will add automation and safeguards to prevent future incidents without requiring human intervention. We are also working on a way to streamline some types of changes across environments, which would have prevented the second incident.
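The failure mode above (a data layer that still names a column after it has been dropped) and the mitigation (telling the data layer to ignore that column) can be reproduced in miniature. This is a sketch under stated assumptions, not GitHub's ORM or schema: `PullRequestModel`, the `pull_requests` table, and the `legacy_flag` column are all hypothetical, and SQLite stands in for the production database.

```python
import sqlite3


class PullRequestModel:
    """Toy model whose column list is cached at startup (hypothetical names)."""

    cached_columns = ["id", "title", "legacy_flag"]
    ignored_columns = set()

    @classmethod
    def select_sql(cls):
        # Only columns that are not ignored are named in the query.
        cols = [c for c in cls.cached_columns if c not in cls.ignored_columns]
        return "SELECT " + ", ".join(cols) + " FROM pull_requests"

    @classmethod
    def all(cls, conn):
        return conn.execute(cls.select_sql()).fetchall()


def create_schema(conn):
    conn.execute(
        "CREATE TABLE pull_requests "
        "(id INTEGER PRIMARY KEY, title TEXT, legacy_flag INTEGER)"
    )
    conn.execute("INSERT INTO pull_requests VALUES (1, 'Fix typo', 0)")
    conn.commit()


def drop_legacy_column(conn):
    """Simulate the migration by rebuilding the table without legacy_flag."""
    conn.executescript(
        """
        CREATE TABLE pull_requests_new (id INTEGER PRIMARY KEY, title TEXT);
        INSERT INTO pull_requests_new SELECT id, title FROM pull_requests;
        DROP TABLE pull_requests;
        ALTER TABLE pull_requests_new RENAME TO pull_requests;
        """
    )
```

In this sketch, calling `PullRequestModel.all(conn)` after `drop_legacy_column(conn)` raises `sqlite3.OperationalError` until `legacy_flag` is added to `ignored_columns`. The safer ordering is to ship the ignore to every environment first and drop the column afterwards.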

Incident with pull requests

Disruption with some GitHub services

Between 06:04 UTC and 10:55 UTC on August 1, 2025, 100% of users attempting to sign up with an email and password experienced errors. Social signup was not affected. Once the problem became clear, the offending code was identified and a change was deployed to resolve the issue. We are adding monitoring to our sign-up process to improve our time to detection.

Increase in 429s for Git Operations

Between July 28, 2025, 16:31 UTC and July 29, 2025, 12:05 UTC, users saw degraded Git Operations for raw file downloads. On average, the error rate was 0.005%, with a peak error rate of 3.89%. This was due to a sustained increase in unauthenticated repository traffic.

We mitigated the incident by applying regional rate limiting, rolling back a service that was unable to scale with the additional traffic, and addressing a bug that impacted the caching of raw requests. Additionally, we horizontally scaled several dependencies of the service to handle the increase in traffic.

We are working on improving our time to detection and have implemented controls to prevent similar incidents in the future.
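Regional rate limiting of the kind mentioned above is often built from one token bucket per region, with requests over the limit answered with HTTP 429. The sketch below illustrates that general technique only; the summary does not describe the actual mechanism, and `TokenBucket`, `RegionalLimiter`, the region names, and the limits are hypothetical.

```python
class TokenBucket:
    """Refill `rate` tokens per second, up to `capacity`; one token per request."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Refill based on elapsed time, never exceeding capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RegionalLimiter:
    """Apply a separate bucket per region; regions without a limit pass through."""

    def __init__(self, limits):
        # limits maps region name -> (tokens per second, burst capacity)
        self.buckets = {
            region: TokenBucket(rate, capacity)
            for region, (rate, capacity) in limits.items()
        }

    def check(self, region, now):
        """Return the HTTP status a request from `region` would receive."""
        bucket = self.buckets.get(region)
        if bucket is None or bucket.allow(now):
            return 200
        return 429
```

Passing the clock in as `now` keeps the limiter deterministic; a caller would typically supply `time.monotonic()`.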

Incident with Issues, API Requests and Pull Requests

Between July 28, 2025, 22:23:00 UTC and July 29, 2025, 02:06:00 UTC, GitHub experienced degraded performance across multiple services including API, Issues, GraphQL and Pull Requests. During this time, approximately 4% of Web and API requests resulted in 500 errors. This incident was caused by DNS resolution failures while decommissioning infrastructure hosts. We resolved the incident by removing references to the stale hosts.

We are working to improve our host replacement process by correcting our automatic host ejection behavior and by ensuring configuration is updated before hosts are decommissioned. This will prevent similar issues in the future.

GitHub Enterprise Importer migrations are stalled

Between approximately 21:41 UTC on July 28th and 03:15 UTC on July 29th, GitHub Enterprise Importer (GEI) operated in a degraded state where migrations could not be processed. Our investigation found that a component of the GEI infrastructure had been improperly taken out of service and could not be restored to its previous configuration. This necessitated the provisioning of new resources to resolve the incident.

As a result, customers will need to add our new IP ranges to the following IP allow lists, if enabled:

- The IP allow list on your destination GitHub.com organization or enterprise
- If you're running migrations from GitHub.com, the IP allow list on your source GitHub.com organization or enterprise
- If you're running migrations from a GitHub Enterprise Server, Bitbucket Server or Bitbucket Data Center instance, the allow list on your configured Azure Blob Storage or Amazon S3 storage account
- If you're running migrations from Azure DevOps, the allow list on your Azure DevOps organization

The new GEI IP ranges for inclusion in applicable IP allow lists are:

- 20.99.172.64/28
- 135.234.59.224/28

The following IP ranges are no longer used by GEI and can be removed from all applicable IP allow lists:

- 40.71.233.224/28
- 20.125.12.8/29

Users who have run migrations using GitHub Enterprise Importer in the past 90 days will receive email alerts about this change.
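For teams that keep their allow lists as CIDR strings, the change above can be audited with Python's standard `ipaddress` module. This is a minimal sketch: the four ranges are copied from the notice, while `audit_allow_list` and the sample entries in the usage note are hypothetical and not part of any GitHub tooling.

```python
import ipaddress

# Ranges copied from the notice above.
NEW_GEI_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("20.99.172.64/28", "135.234.59.224/28")
]
RETIRED_GEI_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("40.71.233.224/28", "20.125.12.8/29")
]


def audit_allow_list(entries):
    """Return (missing, stale) for an allow list given as CIDR strings.

    missing: new GEI ranges not covered by any entry.
    stale:   entries that lie entirely within a retired GEI range.
    """
    nets = [ipaddress.ip_network(entry, strict=False) for entry in entries]

    def covered(target, candidates):
        # subnet_of raises TypeError across IP versions, so compare versions first.
        return any(
            c.version == target.version and target.subnet_of(c)
            for c in candidates
        )

    missing = [str(n) for n in NEW_GEI_RANGES if not covered(n, nets)]
    stale = [str(n) for n in nets if covered(n, RETIRED_GEI_RANGES)]
    return missing, stale
```

For example, `audit_allow_list(["40.71.233.224/28", "20.99.172.64/28"])` reports `135.234.59.224/28` as missing and `40.71.233.224/28` as stale. Entries broader than a retired range are not flagged, since they may exist for unrelated reasons.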

Disruption with some GitHub services

Between July 28, 2025, 16:31 UTC and July 29, 2025, 12:05 UTC, users saw degraded Git Operations for raw file downloads. On average, the error rate was 0.005%, with a peak error rate of 3.89%. This was due to a sustained increase in unauthenticated repository traffic.

We mitigated the incident by applying regional rate limiting, rolling back a service that was unable to scale with the additional traffic, and addressing a bug that impacted the caching of raw requests. Additionally, we horizontally scaled several dependencies of the service to handle the increase in traffic.

We are working on improving our time to detection and have implemented controls to prevent similar incidents in the future.

Incident with Actions Hosted Runners

On July 23rd, 2025, from approximately 14:30 to 16:30 UTC, GitHub Actions experienced delayed job starts for workflows in private repos using Ubuntu-24 standard hosted runners. This was due to resource provisioning failures in one of our datacenter regions. During this period, approximately 2% of Ubuntu-24 hosted runner jobs on private repos were delayed. Other hosted runners, self-hosted runners, and public repo workflows were unaffected.

To mitigate the issue, additional worker capacity was added from a different datacenter region at 15:35 UTC and further increased at 16:00 UTC. By 16:30 UTC, job queues were healthy and service was operating normally. Since the incident, we have deployed changes to improve how regional health is accounted for when allocating new runners, and we are investigating further improvements to our automated capacity scaling logic and manual overrides to prevent a recurrence.
