Incident History

Incident with Copilot

On May 20, 2025, between 18:18 UTC and 19:53 UTC, Copilot Code Completions were degraded in the Americas. On average, the error rate was 50% of requests to the service in the affected region. This was due to a misconfiguration in load distribution parameters after a scale-down operation. We mitigated the incident by addressing the misconfiguration. We are working to improve our automated failover and load balancing mechanisms to reduce our time to detection and mitigation of issues like this one in the future.

1747769827 - 1747771321 Resolved
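The numeric pair under each incident is not labeled on this page. A minimal sketch, assuming the two numbers are start and end times in Unix epoch seconds, shows how such a line could be decoded. The function names `to_utc` and `parse_status_line` are hypothetical helpers introduced only for illustration.

```python
from datetime import datetime, timezone


def to_utc(epoch_seconds: int) -> str:
    """Format Unix epoch seconds as a UTC timestamp string."""
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S UTC")


def parse_status_line(line: str) -> tuple[int, int, str]:
    """Split a line such as '1747769827 - 1747771321 Resolved'.

    Assumes the layout '<start> - <end> <status>' used on this page.
    """
    start, _, rest = line.partition(" - ")
    end, _, status = rest.partition(" ")
    return int(start), int(end), status


start, end, status = parse_status_line("1747769827 - 1747771321 Resolved")
print(to_utc(start), "->", to_utc(end), status, f"({end - start} s)")
```

Under this reading, the pair above decodes to 19:37:07 to 20:02:01 UTC on May 20, 2025. That does not match the 18:18 to 19:53 UTC impact window in the summary, so the pair may record when the status entry was opened and closed rather than the impact period itself.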

Elevated error rates for Claude Sonnet 3.7

On May 20, 2025, between 12:09 PM UTC and 4:07 PM UTC, the GitHub Copilot service experienced degraded availability, specifically for the Claude Sonnet 3.7 model. During this period, the success rate for Claude Sonnet 3.7 requests was highly variable, down to approximately 94% during the most severe spikes. Other models remained available and working as expected throughout the incident. The issue was caused by capacity constraints in our model processing infrastructure that affected our ability to handle the large volume of Claude Sonnet 3.7 requests. We mitigated the incident by rebalancing traffic across our infrastructure, adjusting rate limits, and working with our infrastructure teams to resolve the underlying capacity issues. We are working to improve our infrastructure redundancy and implementing more robust monitoring to reduce detection and mitigation time for similar incidents in the future.

1747746614 - 1747757316 Resolved

GitHub Enterprise Importer (GEI) is experiencing degraded throughput

Between May 16, 2025, 1:21 PM UTC and May 17, 2025, 2:26 AM UTC, the GitHub Enterprise Importer service was degraded and experienced slow processing of customer migrations. Customers may have seen extended wait times for migrations to start or complete. This incident was initially observed as a slowdown in migration processing. During our investigation, we identified that a recent change aimed at improving API query performance caused an increase in load signals, which triggered migration throttling. As a result, the performance of migrations was negatively impacted, and overall migration duration increased. In parallel, we identified a race condition that caused a specific migration to be repeatedly re-queued, further straining system resources and contributing to a backlog of migration jobs, resulting in accumulated delays. No data was lost, and all migrations were ultimately processed successfully. We have reverted the feature flag associated with the query change and are working to improve system safeguards to help prevent similar race condition issues from occurring in the future.

1747403173 - 1747448821 Resolved

Disruption with some GitHub services

On May 16, 2025, between 08:42:00 UTC and 12:26:00 UTC, the data store powering the Audit Log API service experienced elevated latency, resulting in higher error rates due to timeouts. About 3.8% of Audit Log API queries for Git events experienced timeouts. The data store team deployed mitigating actions, which resulted in a full recovery of the data store’s availability.

1747387379 - 1747409087 Resolved

Disruption with Gemini 2.5 Pro

Between May 15, 2025 10:10 UTC and May 15, 2025 22:58 UTC, the Copilot service was degraded and returned a high volume of internal server errors for requests targeting Gemini 2.5 Pro, a public preview model. This was due to a high volume of rate limiting by the upstream model provider, similar in volume to the internal server errors during the previous day. We mitigated the incident by temporarily disabling Gemini 2.5 Pro for all Copilot Chat experiences, and then worked with the model provider to ensure model health was sufficiently improved before re-enabling. We are working with the model provider to move to more resilient infrastructure to mitigate issues like this one in the future.

1747312894 - 1747349885 Resolved

Disruption with some GitHub services

On May 15, 2025, between 00:08 UTC and 10:21 UTC, customers were unable to create fine-grained Personal Access Tokens (PATs) on github.com. This incident was triggered by a recent code change to our front end that unintentionally affected the way certain pages loaded and prevented the PAT creation process from completing. We mitigated the incident by reverting the problematic change. To reduce the likelihood of similar issues in the future, we are improving our monitoring for page load anomalies and PAT creation failures and improving our safe deployment practices.

1747292429 - 1747305507 Resolved

Disruption with Gemini 2.5 Pro model

Between May 14, 2025 14:16 UTC and May 15, 2025 01:02 UTC, the Copilot service was degraded and returned a high volume of internal server errors for requests targeting Gemini 2.5 Pro, a public preview model. On average, the error rate for Gemini 2.5 Pro was 19.6% and peaked at 41%. This was due to a high volume of internal server errors and rate limiting by the upstream model provider. We mitigated the incident by temporarily disabling Gemini 2.5 Pro for all Copilot Chat experiences, and then worked with the model provider to ensure model health was sufficiently improved before re-enabling. We are working with partners to improve communication speed and are planning to move to more resilient infrastructure to mitigate issues like this one in the future.

1747233588 - 1747270974 Resolved

Disruption with some GitHub services

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

1747061639 - 1747062362 Resolved

Incident with Git Operations

On May 8, 2025, between 14:40 UTC and 16:27 UTC, the Git Operations service was degraded, causing some pushes and merges to fail. On average, the error rate was 1.4%, with a peak error rate of 2.24%. This was due to a configuration change that unexpectedly led a critical service to shut down on a subset of hosts that store repository data. We mitigated the incident by re-deploying the affected service to restore its functionality. To prevent similar incidents from happening again, we identified the cause that triggered this behavior and mitigated it for future deployments. Additionally, to reduce time to detection, we will improve monitoring of the impacted service.

1746717631 - 1746721676 Resolved

Issue Attachments Failing to Upload

On May 1, 2025, from 22:09 UTC to 23:13 UTC, the Issues service was degraded and users weren't able to upload attachments. The root cause was a new feature that added a custom header to all client-side HTTP requests, causing CORS errors when uploading attachments to our provider. We mitigated the incident by rolling back the feature flag that added the new header at 22:56 UTC. To prevent this from happening again, we are adding new metrics to monitor and ensure the safe rollout of changes to client-side requests.
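The failure mode described above can be sketched in general terms. When a browser sends a cross-origin request carrying a header outside the CORS safelist, it first sends a preflight request, and the upload proceeds only if the server's Access-Control-Allow-Headers response covers that header. The sketch below is a simplified model, not the service's actual code: the header name `X-Example-Feature` and both function names are hypothetical, and wildcard handling and the value restrictions on safelisted headers are deliberately left out.

```python
# Request header names on the CORS safelist in the Fetch standard.
# Simplification: the standard also restricts the values these headers
# may carry (for example, the allowed Content-Type values); ignored here.
SAFELISTED_HEADERS = {"accept", "accept-language", "content-language", "content-type"}


def headers_needing_preflight(request_headers: list[str]) -> list[str]:
    """Return the lowercased header names that fall outside the safelist."""
    return sorted(
        name.lower()
        for name in request_headers
        if name.lower() not in SAFELISTED_HEADERS
    )


def preflight_allows(request_headers: list[str], allow_headers: str) -> bool:
    """Check a request against an Access-Control-Allow-Headers value.

    Simplification: the '*' wildcard is not modeled.
    """
    allowed = {h.strip().lower() for h in allow_headers.split(",") if h.strip()}
    return all(name in allowed for name in headers_needing_preflight(request_headers))


# Before the change: only safelisted headers, so nothing needs approval.
print(preflight_allows(["Content-Type"], ""))
# After the change: a hypothetical custom header the provider does not list.
print(preflight_allows(["Content-Type", "X-Example-Feature"], "content-type"))
```

In this model, the second call fails because the upload provider's allow list was never updated for the new header, which is consistent with removing the header (rolling back the flag) restoring uploads.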

1746138511 - 1746141198 Resolved