On August 21, 2025, from approximately 15:37 UTC to 18:10 UTC, customers experienced increased delays and failures when starting jobs on GitHub Actions using standard hosted runners. This was caused by connectivity issues in our East US region, which prevented runners from retrieving jobs and sending progress updates. As a result, capacity was significantly reduced, especially for busier configurations, leading to queuing and service interruptions. Approximately 8.05% of jobs on public standard Ubuntu24 runners and 3.4% of jobs on private standard Ubuntu24 runners did not start as expected. By 18:10 UTC, we had mitigated the issue by provisioning additional resources in the affected region and burning down the backlog of queued runner assignments. By the end of that day, we deployed changes to improve runner connectivity resilience and graceful degradation in similar situations. We are also taking further steps to improve system resiliency by enhancing observability of network connection health with runners and improving load distribution and failover handling to help prevent similar issues in the future.
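The connection-health observability mentioned above lends itself to a small illustration. Below is a minimal sketch (all names, types, and thresholds are hypothetical, not GitHub's implementation) of tracking the last heartbeat from each runner connection and excluding quiet connections from job assignment:

```python
# Hypothetical sketch: track runner connection heartbeats and stop assigning
# jobs to runners whose connections have gone quiet.
import time
from dataclasses import dataclass, field


@dataclass
class RunnerConnection:
    runner_id: str
    last_heartbeat: float = field(default_factory=time.monotonic)

    def record_heartbeat(self) -> None:
        self.last_heartbeat = time.monotonic()

    def is_healthy(self, timeout_s: float = 30.0) -> bool:
        return time.monotonic() - self.last_heartbeat < timeout_s


def assignable_runners(connections: list[RunnerConnection]) -> list[RunnerConnection]:
    """Only hand jobs to runners whose connection is still reporting in."""
    healthy = [c for c in connections if c.is_healthy()]
    # Emitting the unhealthy count is the observability piece: it surfaces a
    # regional connectivity problem before capacity quietly collapses.
    print(f"healthy={len(healthy)} unhealthy={len(connections) - len(healthy)}")
    return healthy
```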
On August 21, 2025, between 6:15am UTC and 6:25am UTC, Git and Web operations were degraded and saw intermittent errors. On average, the error rate was 1% for API and Web requests. This was due to automated maintenance on our database infrastructure reducing capacity below our tolerated threshold. The incident was resolved when the impacted infrastructure self-healed and returned to normal operating capacity. We are adding guardrails to reduce the impact of this type of maintenance in the future.
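As a rough illustration of the kind of guardrail described (names and thresholds are hypothetical, not GitHub's), a minimal sketch of a capacity check that automated maintenance could consult before taking a database host out of rotation:

```python
# Hypothetical guardrail sketch: refuse to start maintenance if it would drop
# healthy database capacity below a tolerated floor.
def can_start_maintenance(healthy_hosts: int, hosts_to_remove: int,
                          minimum_healthy: int) -> bool:
    """Return True only if maintenance leaves enough capacity in service."""
    return healthy_hosts - hosts_to_remove >= minimum_healthy


# Example: with 6 healthy replicas and a floor of 5, removing 2 is rejected.
assert can_start_maintenance(6, 1, 5) is True
assert can_start_maintenance(6, 2, 5) is False
```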
Between 15:49 and 16:37 UTC on August 20, 2025, creating a new GitHub account via the web signup page consistently returned server errors, and users were unable to complete signup during this 48-minute window. We detected the issue at 16:04 UTC and restored normal signup functionality by 16:37 UTC. A recent change to signup flow logic caused all signup attempts to error; rolling back the change restored service. The incident exposed a gap in our test coverage that we are fixing.
On August 19, 2025, between 13:35 UTC and 14:33 UTC, GitHub search was in a degraded state. When searching for pull requests, issues, and workflow runs, users may have seen slow, empty, or incomplete results, and in some cases pull requests failed to load. The incident was triggered by intermittent connectivity issues between our load balancers and search hosts. While retry logic initially masked these problems, retry queues eventually overwhelmed the load balancers, causing failures. The incident was mitigated at 14:33 UTC by throttling our search indexing pipeline. Our automated alerting and internal retries significantly reduced the impact of this event. As a result of this incident, we have identified a faster way to mitigate similar failures in the future, and we are working on multiple solutions to resolve the underlying connectivity issues.
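To illustrate why bounded retries matter here, below is a minimal sketch (not GitHub's implementation; names and values are illustrative) of retrying a flaky connection a capped number of times with exponential backoff and jitter, so transient failures are absorbed without building an unbounded retry queue:

```python
# Sketch: a capped number of retry attempts with exponential backoff and
# jitter, so retries spread out over time instead of piling up.
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_bounded_retries(fn: Callable[[], T], max_attempts: int = 3,
                              base_delay_s: float = 0.1) -> T:
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # surface the failure instead of queuing retries forever
            # Exponential backoff with jitter spreads retries out over time.
            time.sleep(base_delay_s * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))
    raise RuntimeError("unreachable")
```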
On August 14, 2025, between 17:50 UTC and 18:08 UTC, the Packages NPM Registry service was degraded. During this period, NPM package uploads were unavailable and approximately 50% of download requests failed. We identified the root cause as a sudden spike in Packages publishing activity that exceeded our service capacity limits. We are implementing better guardrails to protect the service against unexpected traffic surges and improving our incident response runbooks to ensure faster mitigation of similar issues.
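One common form of the guardrail described above is load shedding. A minimal sketch, assuming a hypothetical publish handler, of capping concurrent publish requests and rejecting the excess with a retryable error rather than letting a traffic spike exhaust capacity:

```python
# Hypothetical sketch: cap in-flight publish requests and shed excess load
# with an explicit "retry later" response.
import threading


class PublishLimiter:
    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def try_acquire(self) -> bool:
        # Non-blocking: if every slot is busy, refuse instead of queueing.
        return self._slots.acquire(blocking=False)

    def release(self) -> None:
        self._slots.release()


limiter = PublishLimiter(max_in_flight=100)


def handle_publish(request) -> tuple[int, str]:
    if not limiter.try_acquire():
        return 429, "publish rate exceeded, retry later"
    try:
        return 200, "published"  # real upload work would happen here
    finally:
        limiter.release()
```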
On August 14, 2025, between 02:30 UTC and 06:14 UTC, GitHub Actions was degraded. On average, 3% of workflow runs were delayed by at least 5 minutes. The incident was caused by an outage in a downstream dependency that led to failures in backend service connectivity in one region. At 03:59 UTC, we evacuated a majority of services in the impacted region, but some users may have seen ongoing impact until all services were fully evacuated at 06:14 UTC. We are working to improve our failover monitoring and processes to reduce time to detection and mitigation of issues like this one in the future.
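A minimal sketch of the failover pattern involved, with hypothetical callables standing in for regional backends: attempt the primary region and, on connection failure, route the call to a secondary region rather than letting work queue behind the outage:

```python
# Sketch: regional failover for a backend call. The callables are stand-ins
# for clients bound to specific regions.
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_regional_failover(primary: Callable[[], T],
                                secondary: Callable[[], T]) -> T:
    try:
        return primary()
    except ConnectionError:
        # Evacuating to the healthy region trades some latency for
        # availability while the impacted region recovers.
        return secondary()
```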
On August 12, 2025, between 13:30 UTC and 17:14 UTC, GitHub search was in a degraded state. Users experienced inaccurate or incomplete results, failures to load certain pages (like Issues, Pull Requests, Projects, and Deployments), and broken components like Actions workflow and label filters. Most user impact occurred between 14:00 UTC and 15:30 UTC, when up to 75% of search queries failed and updates to search results were delayed by up to 100 minutes. The incident was triggered by intermittent connectivity issues between our load balancers and search hosts. While retry logic initially masked these problems, retry queues eventually overwhelmed the load balancers, causing failures. The query failures were mitigated at 15:30 UTC after throttling our search indexing pipeline to reduce load and stabilize retries. The connectivity failures were resolved at 17:14 UTC after the automated reboot of a search host, which allowed the rest of the system to recover. We have improved internal monitors and playbooks and tuned our search cluster load balancer to further mitigate recurrence of this failure mode. We continue to invest in resolving the underlying connectivity issues.
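The throttling mitigation can be illustrated with a token bucket. A minimal sketch (parameter values are illustrative only, not GitHub's): index updates are only sent when a token is available, capping the load that indexing places on the search hosts and leaving headroom for query traffic and retries:

```python
# Sketch: token-bucket throttle for an indexing pipeline.
import time


class IndexingThrottle:
    def __init__(self, updates_per_second: float, burst: int) -> None:
        self.rate = updates_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens according to elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller defers this index update instead of sending it


throttle = IndexingThrottle(updates_per_second=50, burst=100)
```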
On August 11, 2025, from 18:41 to 18:57 UTC, GitHub customers experienced errors and increased latency when loading GitHub’s web interface. During this time, a configuration change intended to improve our UI deployment system caused an unexpected surge in connection attempts to a backend datastore, saturating its connection backlog and resulting in intermittent failures to serve required UI content and elevated error rates for frontend requests. The incident was mitigated by reverting the configuration change, which restored normal service. Following mitigation, we are evaluating improvements to our alerting thresholds and exploring architectural changes to reduce load on this datastore and improve the resilience of our UI delivery pipeline.
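One common way to reduce this kind of load is connection reuse. A minimal sketch, assuming a hypothetical connect callable, of a fixed-size connection pool so a surge in frontend requests does not translate directly into a surge of new connection attempts against the datastore:

```python
# Sketch: fixed-size connection pool. Callers reuse persistent connections
# instead of opening a new one per request.
import queue


class ConnectionPool:
    def __init__(self, connect, size: int) -> None:
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(connect())

    def checkout(self, timeout_s: float = 1.0):
        # Blocks briefly for a free connection; raises queue.Empty rather than
        # opening yet another connection and growing the datastore's backlog.
        return self._pool.get(timeout=timeout_s)

    def checkin(self, conn) -> None:
        self._pool.put(conn)
```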
At 15:33 UTC on August 5, 2025, we initiated a production database migration to drop a column from a table backing pull request functionality. While the column was no longer in direct use, our ORM continued to reference the dropped column in a subset of pull request queries. As a result, there were elevated error rates across pushes, webhooks, notifications, and pull requests, with impact peaking at approximately 4% of all web and REST API traffic. We mitigated the issue by deploying a change that instructed the ORM to ignore the removed column. Most affected services recovered by 16:13 UTC. However, that fix was applied only to our largest production environment. An update to some of our custom and canary environments did not include the fix, which triggered a secondary incident affecting ~0.1% of pull request traffic; it was fully resolved by 19:45 UTC. While migrations have protections such as progressive rollout that first targets validation environments, along with acknowledgement gates, this incident identified a gap in our application monitoring that, had it been closed, would have stopped the rollout from continuing once impact was observed. We will add additional automation and safeguards to prevent future incidents without requiring human intervention. We are also already working on a way to streamline some types of changes across environments, which would have prevented the secondary incident from occurring.
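For illustration only, the "instruct the ORM to ignore the removed column" fix can be sketched in Python with SQLAlchemy (not the stack or change involved here). Because the ORM names every mapped column in its SELECTs, a model that still maps a dropped column makes queries fail; removing (or otherwise ignoring) the mapping restores them:

```python
# Sketch: an ORM model whose mapping no longer includes the dropped column,
# so generated queries stop referencing it.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()


class PullRequest(Base):
    __tablename__ = "pull_requests"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    # legacy_state = Column(String)  # dropped by the migration; removing the
    #                                # mapping stops the ORM from selecting it


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(PullRequest(title="example"))
    session.commit()
    # The generated SELECT names only columns that still exist, so it succeeds.
    print(session.query(PullRequest).first().title)
```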