On August 11, 2025, from 18:41 to 18:57 UTC, GitHub customers experienced errors and increased latency when loading GitHub’s web interface. During this time, a configuration change intended to improve our UI deployment system caused a surge in requests to a backend datastore. The resulting spike in connection attempts saturated the datastore's connection backlog, which led to intermittent failures to serve required UI content and elevated error rates for frontend requests. The incident was mitigated by reverting the configuration change, which restored normal service. Following mitigation, we are evaluating improvements to our alerting thresholds and exploring architectural changes to reduce load on this datastore and improve the resilience of our UI delivery pipeline.
At 15:33 UTC on August 5, 2025, we initiated a production database migration to drop a column from a table backing pull request functionality. While the column was no longer in direct use, our ORM continued to reference the dropped column in a subset of pull request queries. As a result, there were elevated error rates across pushes, webhooks, notifications, and pull requests, with impact peaking at approximately 4% of all web and REST API traffic. We mitigated the issue by deploying a change that instructed the ORM to ignore the removed column. Most affected services recovered by 16:13 UTC. However, that fix was applied only to our largest production environment. An update to some of our custom and canary environments did not pick up the fix, and this triggered a secondary incident affecting ~0.1% of pull request traffic, which was fully resolved by 19:45 UTC. While migrations have protections such as progressive roll-out that first targets validation environments and acknowledge gates, this incident exposed an application monitoring gap: with that monitoring in place, the rollout would have been halted once impact was observed. We will add automation and safeguards to prevent future incidents without requiring human intervention. We are also working on a way to streamline some types of changes across environments, which would have prevented the second incident from occurring.
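The failure mode and the fix described above can be illustrated with a minimal sketch. This is not GitHub's ORM or code: `ModelSketch`, its fields, and the column name `legacy_flag` are all hypothetical, and the sketch only assumes an ORM that builds explicit column lists from a schema it cached before the migration ran.

```python
from dataclasses import dataclass, field


@dataclass
class ModelSketch:
    """Hypothetical stand-in for an ORM model; not any real library's API."""

    table: str
    cached_columns: list[str]  # column names the ORM cached before the migration
    ignored_columns: set[str] = field(default_factory=set)

    def select_sql(self) -> str:
        # Build an explicit column list, skipping anything marked as ignored.
        columns = [c for c in self.cached_columns if c not in self.ignored_columns]
        return f"SELECT {', '.join(columns)} FROM {self.table}"


# Table and column names are illustrative assumptions.
pull_requests = ModelSketch("pull_requests", ["id", "title", "legacy_flag"])

# Before the fix: the generated query still names the column the migration dropped,
# so the database would reject it.
stale_query = pull_requests.select_sql()

# After the fix: the ORM is told to ignore the column, so the query no longer names it.
pull_requests.ignored_columns.add("legacy_flag")
safe_query = pull_requests.select_sql()
```

Under these assumptions, the ordering matters: the ignore instruction has to be live in every environment that runs queries against the table before the column is dropped, which is the gap the secondary incident exposed.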
Between 06:04 UTC and 10:55 UTC on August 1, 2025, 100% of users attempting to sign up with an email and password experienced errors. Social signup was not affected. Once the problem became clear, the offending code was identified and a change was deployed to resolve the issue. We are adding monitoring to our sign-up process to improve our time to detection.
Between July 28, 2025 16:31 UTC and July 29, 2025 12:05 UTC, users saw degraded Git Operations for raw file downloads. On average, the error rate was 0.005%, with a peak error rate of 3.89%. This was due to a sustained increase in unauthenticated repository traffic. We mitigated the incident by applying regional rate limiting, rolling back a service that was unable to scale with the additional traffic, and addressing a bug that impacted the caching of raw requests. Additionally, we horizontally scaled several dependencies of the service to appropriately handle the increase in traffic. We are working on improving our time to detection and have implemented controls to prevent similar incidents in the future.
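The summary above does not say how the regional rate limiting was implemented. As one common approach, the sketch below uses a token bucket per region. The class names, region labels, and limits are assumptions made for illustration, not details from the incident.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Token bucket: refills at `rate_per_sec`, holds at most `burst` tokens."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RegionalLimiter:
    """Hypothetical limiter holding one bucket per region; unlisted regions pass."""

    def __init__(self, limits: Dict[str, Tuple[float, float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.buckets = {
            region: TokenBucket(rate, burst, clock)
            for region, (rate, burst) in limits.items()
        }

    def allow(self, region: str) -> bool:
        bucket = self.buckets.get(region)
        return True if bucket is None else bucket.allow()
```

A caller would construct something like `RegionalLimiter({"region-a": (100.0, 200.0)})` and reject or defer a request whenever `allow("region-a")` returns `False`. The clock is injectable so the behaviour can be tested without sleeping.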
Between July 28, 2025, 22:23:00 UTC and July 29, 2025, 02:06:00 UTC, GitHub experienced degraded performance across multiple services including API, Issues, GraphQL and Pull Requests. During this time, approximately 4% of Web and API requests resulted in 500 errors. This incident was caused by DNS resolution failures for infrastructure hosts that were being decommissioned. We resolved the incident by removing references to the stale hosts. We are working to improve our host replacement process by correcting our automatic host ejection behavior and by ensuring configuration is updated before hosts are decommissioned. This will prevent similar issues in the future.
Between approximately 21:41 UTC July 28th and 03:15 UTC July 29th, GitHub Enterprise Importer (GEI) operated in a degraded state where migrations could not be processed. Our investigation found that a component of the GEI infrastructure had been improperly taken out of service and could not be restored to its previous configuration. This necessitated the provisioning of new resources to resolve the incident.
As a result, customers will need to add our new IP range to the following IP allow lists, if enabled:
- The IP allow list on your destination GitHub.com organization or enterprise
- If you're running migrations from GitHub.com, the IP allow list on your source GitHub.com organization or enterprise
- If you're running migrations from a GitHub Enterprise Server, Bitbucket Server or Bitbucket Data Center instance, the allow list on your configured Azure Blob Storage or Amazon S3 storage account
- If you're running migrations from Azure DevOps, the allow list on your Azure DevOps organization
The new GEI IP ranges for inclusion in applicable IP allow lists are:
- 20.99.172.64/28
- 135.234.59.224/28
The following IP ranges are no longer used by GEI and can be removed from all applicable IP allow lists:
- 40.71.233.224/28
- 20.125.12.8/29
Users who have run migrations using GitHub Enterprise Importer in the past 90 days will receive email alerts about this change.
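For teams that manage allow lists as plain CIDR strings, the change above can be checked with a short script. The sketch below uses only Python's standard `ipaddress` module. The function name `audit_allow_list` and the input format are assumptions, and it does not call any GitHub, Azure, or AWS API.

```python
import ipaddress

# Ranges taken from the notice above.
NEW_GEI_RANGES = [ipaddress.ip_network(c) for c in ("20.99.172.64/28", "135.234.59.224/28")]
RETIRED_GEI_RANGES = [ipaddress.ip_network(c) for c in ("40.71.233.224/28", "20.125.12.8/29")]


def audit_allow_list(entries):
    """Return (missing, stale) for an allow list given as CIDR strings.

    missing: new GEI ranges not fully covered by any entry.
    stale:   entries that overlap a retired GEI range and may deserve review.
    """
    allowed = [ipaddress.ip_network(e, strict=False) for e in entries]
    missing = [
        str(new) for new in NEW_GEI_RANGES
        # Compare only networks of the same IP version; subnet_of raises otherwise.
        if not any(a.version == new.version and new.subnet_of(a) for a in allowed)
    ]
    stale = [
        str(a) for a in allowed
        if any(a.overlaps(retired) for retired in RETIRED_GEI_RANGES)
    ]
    return missing, stale


# Example: an allow list that has one new range and still carries one retired range.
missing, stale = audit_allow_list(["20.99.172.64/28", "40.71.233.224/28"])
```

An entry is reported as stale when it overlaps a retired range at all, so a broader entry such as a /24 that happens to contain a retired range is flagged for review, not marked as safe to delete.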
On July 23rd, 2025, from approximately 14:30 to 16:30 UTC, GitHub Actions experienced delayed job starts for workflows in private repos using Ubuntu-24 standard hosted runners. This was due to resource provisioning failures in one of our datacenter regions. During this period, approximately 2% of Ubuntu-24 hosted runner jobs on private repos were delayed. Other hosted runners, self-hosted runners, and public repo workflows were unaffected. To mitigate the issue, additional worker capacity was added from a different datacenter region at 15:35 UTC and further increased at 16:00 UTC. By 16:30 UTC, job queues were healthy and service was operating normally. Since the incident, we have deployed changes to improve how regional health is accounted for when allocating new runners, and we are investigating further improvements to our automated capacity scaling logic and manual overrides to prevent a recurrence.
On July 22nd, 2025, between 17:58 and 18:35 UTC, the Copilot service experienced degraded availability for Claude Sonnet 4 requests. During this time, 4.7% of Claude Sonnet 4 requests failed. No other models were impacted. The issue was caused by an upstream problem affecting our ability to serve requests. We mitigated by rerouting capacity and monitoring recovery. We are improving detection and mitigation to reduce future impact.