On May 26, 2025, between 06:20 UTC and 09:45 UTC, GitHub experienced broad failures across a variety of services (API, Issues, Git, and others). These services were degraded at times, and failure rates peaked at 100% for some operations during this window.

On May 23, a new feature was added to the Copilot APIs and monitored during rollout, but it was not tested at peak load. At 06:20 UTC on May 26, load increased on the code path in question and began to degrade a Copilot API, because both the caching for this endpoint and the circuit breakers for high load were misconfigured. In addition, the traffic limiting meant to protect wider swaths of the GitHub API from queuing did not yet cover this endpoint, so it was able to overwhelm the capacity to serve traffic and cause request queuing.

We mitigated the incident by turning off the endpoint until the change could be reverted. We are already rolling out a quality of service strategy for API endpoints like this one that will limit the impact of a broad incident. We are also addressing the specific caching and circuit breaker misconfigurations for this endpoint, which would have reduced both the time to mitigate this particular incident and its blast radius.
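To illustrate the kind of protection involved (this is a generic sketch with made-up thresholds, not our production configuration), a per-endpoint circuit breaker trips after a burst of failures and rejects further calls until a cooldown passes, so one overloaded endpoint cannot fill the shared request queue:

    import time

    class CircuitBreaker:
        """Illustrative circuit breaker; threshold and cooldown values are hypothetical."""

        def __init__(self, failure_threshold=50, cooldown_seconds=30):
            self.failure_threshold = failure_threshold
            self.cooldown_seconds = cooldown_seconds
            self.failures = 0
            self.opened_at = None

        def allow(self):
            if self.opened_at is None:
                return True
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                # Half-open: let one request probe the backend again.
                self.opened_at = None
                self.failures = 0
                return True
            return False  # Breaker is open: fail fast instead of queuing.

        def record_failure(self):
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

        def record_success(self):
            self.failures = 0

With the breaker misconfigured, every request for the degraded endpoint keeps contending for shared serving capacity, which is how the impact spread beyond Copilot.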
On May 23, 2025, between 17:40 UTC and 18:30 UTC, public API and UI requests to read and write Git repository content were degraded and returned user-facing 500 responses. On average, 61% of requests to the service failed, peaking at 88%. This was due to the introduction of an uncaught fatal error in an internal service. A manual rollback was required, which increased the time to remediate the incident.

We are working to automatically detect and revert a change like this based on alerting, reducing our time to detection and mitigation. In addition, we are adding relevant test coverage to prevent errors of this type from reaching production.
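The alert-driven revert can be thought of as a simple post-deploy check; the sketch below is illustrative only, and the threshold, window, and rollback hook are assumptions rather than our actual tooling:

    def should_auto_rollback(error_rates, threshold=0.05, window=3):
        """Trigger a rollback if the error rate stays above the alert
        threshold for `window` consecutive post-deploy samples."""
        recent = error_rates[-window:]
        return len(recent) == window and all(r > threshold for r in recent)

    # Example: error rate jumps from ~1% to ~60% after a deploy.
    samples = [0.010, 0.012, 0.61, 0.63, 0.58]
    if should_auto_rollback(samples):
        print("revert latest deploy")  # placeholder for the actual rollback step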
On May 22, 2025, between 07:06 UTC and 09:10 UTC, the Actions service experienced degradation, leading to run start delays. During the incident, about 11% of all workflow runs were delayed by an average of 44 minutes. A recently deployed change contained a defect that caused improper request routing between internal services, resulting in security rejections at the receiving endpoint. We resolved this by reverting the problematic change and are implementing enhanced testing procedures to catch similar issues before they reach production environments.
Between May 20, 2025 20:40 UTC and May 21, 2025 12:55 UTC, a change to the webhooks UI removed the ability to add new webhooks. Existing webhooks, as well as adding webhooks via the API, were unaffected. The issue has been fixed.
On May 20, 2025, between 18:18 UTC and 19:53 UTC, Copilot Code Completions were degraded in the Americas. On average, 50% of requests to the service in the affected region failed. This was due to a misconfiguration in load distribution parameters after a scale-down operation.

We mitigated the incident by correcting the misconfiguration. We are working to improve our automated failover and load balancing mechanisms to reduce our time to detection and mitigation of issues like this one in the future.
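As a rough illustration of why scale-down needs a matching configuration change (the region names and capacities below are invented for the example, not our topology), traffic weights have to be recomputed from the capacity that remains:

    def rebalance_weights(capacities):
        """Recompute traffic shares from remaining capacity after a scale-down."""
        total = sum(capacities.values())
        return {region: cap / total for region, cap in capacities.items()}

    # Before scale-down: three hypothetical pools share traffic evenly.
    print(rebalance_weights({"pool-a": 100, "pool-b": 100, "pool-c": 100}))
    # After scale-down: if weights are not recomputed, the removed pool would
    # still be assigned a third of the traffic with none of the capacity.
    print(rebalance_weights({"pool-a": 100, "pool-b": 100}))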
On May 20, 2025, between 12:09 UTC and 16:07 UTC, the GitHub Copilot service experienced degraded availability, specifically for the Claude Sonnet 3.7 model. During this period, the success rate for Claude Sonnet 3.7 requests was highly variable, dropping to approximately 94% during the most severe spikes. Other models remained available and worked as expected throughout the incident.

The issue was caused by capacity constraints in our model processing infrastructure that affected our ability to handle the large volume of Claude Sonnet 3.7 requests. We mitigated the incident by rebalancing traffic across our infrastructure, adjusting rate limits, and working with our infrastructure teams to resolve the underlying capacity issues. We are working to improve our infrastructure redundancy and implementing more robust monitoring to reduce detection and mitigation time for similar incidents in the future.
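Adjusting rate limits in this situation typically means capping admissions per model so a surge for one model cannot exhaust shared capacity. A minimal token-bucket sketch, with rate and burst values that are assumptions for the example rather than our real limits:

    import time

    class TokenBucket:
        """Illustrative per-model request limiter."""

        def __init__(self, rate_per_sec=100.0, burst=200.0):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = burst
            self.updated = time.monotonic()

        def try_acquire(self):
            now = time.monotonic()
            # Refill tokens based on elapsed time, up to the burst capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False  # Over the limit: shed or defer this request.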
Between May 16, 2025 13:21 UTC and May 17, 2025 02:26 UTC, the GitHub Enterprise Importer service was degraded and processed customer migrations slowly. Customers may have seen extended wait times for migrations to start or complete.

This incident was initially observed as a slowdown in migration processing. During our investigation, we identified that a recent change aimed at improving API query performance caused an increase in load signals, which triggered migration throttling. As a result, migration performance was negatively impacted and overall migration duration increased. In parallel, we identified a race condition that caused a specific migration to be repeatedly re-queued, further straining system resources and contributing to a backlog of migration jobs and accumulated delays. No data was lost, and all migrations were ultimately processed successfully.

We have reverted the feature flag associated with the query change and are working to improve system safeguards to help prevent similar race condition issues from occurring in the future.
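The safeguard for this class of race condition is usually an atomic check-and-enqueue, so two workers that both observe a migration as pending cannot each re-queue it. A minimal sketch of that general pattern (not the importer's actual code):

    import threading

    class MigrationQueue:
        """Illustrative guard against double-enqueueing the same migration."""

        def __init__(self):
            self._lock = threading.Lock()
            self._queued = set()
            self.jobs = []

        def enqueue_once(self, migration_id):
            with self._lock:                  # check-and-insert must be atomic
                if migration_id in self._queued:
                    return False              # already queued; skip the duplicate
                self._queued.add(migration_id)
                self.jobs.append(migration_id)
                return True

        def mark_done(self, migration_id):
            with self._lock:
                self._queued.discard(migration_id)

Without that guard, the same job keeps cycling through the queue and starves the rest of the backlog, which matches the accumulated delays described above.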
On May 16, 2025, between 08:42 UTC and 12:26 UTC, the data store powering the Audit Log API service experienced elevated latency, resulting in higher error rates due to timeouts. About 3.8% of Audit Log API queries for Git events timed out. The data store team deployed mitigating actions, which resulted in a full recovery of the data store's availability.
On May 15, 2025, between 10:10 UTC and 22:58 UTC, the Copilot service was degraded and returned a high volume of internal server errors for requests targeting Gemini 2.5 Pro, a public preview model. This was due to a high volume of rate limiting by the upstream model provider, similar in volume to the internal server errors seen during the previous day.

We mitigated the incident by temporarily disabling Gemini 2.5 Pro for all Copilot Chat experiences, and then worked with the model provider to ensure model health had sufficiently improved before re-enabling it. We are working with the model provider to move to more resilient infrastructure to mitigate issues like this one in the future.
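Temporarily disabling a single model is essentially a kill-switch in the request router: affected requests get a clear, fast error instead of failing as internal server errors. A sketch of that pattern, where the flag name and model identifier are assumptions for the example:

    # Toggled operationally, e.g. via a feature flag, while the upstream
    # provider recovers.
    DISABLED_MODELS = {"gemini-2.5-pro"}

    def route_chat_request(model, handler):
        if model in DISABLED_MODELS:
            # Fail fast with an explicit status rather than a generic 500.
            return {"status": 503, "error": f"{model} is temporarily unavailable"}
        return handler(model)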
On May 15, 2025, between 00:08 UTC and 10:21 UTC, customers were unable to create fine-grained Personal Access Tokens (PATs) on github.com. The incident was triggered by a recent code change to our front end that unintentionally affected the way certain pages loaded and prevented the PAT creation process from completing.

We mitigated the incident by reverting the problematic change. To reduce the likelihood of similar issues in the future, we are improving our monitoring for page load anomalies and PAT creation failures, and improving our safe deployment practices.