Incident History

GitHub Enterprise Importer (GEI) is experiencing degraded throughput

Between May 16, 2025, 1:21 PM UTC and May 17, 2025, 2:26 AM UTC, the GitHub Enterprise Importer service was degraded and experienced slow processing of customer migrations. Customers may have seen extended wait times for migrations to start or complete.This incident was initially observed as a slowdown in migration processing. During our investigation, we identified that a recent change aimed at improving API query performance caused an increase in load signals, which triggered migration throttling. As a result, the performance of migrations was negatively impacted, and overall migration duration increased. In parallel, we identified a race condition that caused a specific migration to be repeatedly re-queued, further straining system resources and contributing to a backlog of migration jobs, resulting in accumulated delays. No data was lost, and all migrations were ultimately processed successfully.We have reverted the feature flag associated with a query change and are working to improve system safeguards to help prevent similar race condition issues from occurring in the future.

1747403173 - 1747448821 Resolved

Disruption with some GitHub services

On May 16th, 2025, between 08:42:00 UTC and 12:26:00 UTC, the data store powering the Audit Log API service experienced elevated latency resulting in higher error rates due to timeouts. About 3.8% of Audit Log API queries for Git events experienced timeouts. The data store team deployed mitigating actions which resulted in a full recovery of the data store’s availability.

1747387379 - 1747409087 Resolved

Disruption with Gemini 2.5 Pro

Between May 15, 2025 10:10 UTC and May 15, 2025 22:58 UTC the Copilot service was degraded and returned a high volume of internal server errors for requests targeting Gemini 2.5 Pro, a public preview model. This was due to a high volume of rate limiting by the upstream model provider, similar in volume to the internal server errors during the previous day.We mitigated the incident by temporarily disabling Gemini 2.5 Pro for all Copilot Chat experiences, and then worked with the model provider to ensure model health was sufficiently improved before re-enabling.We are working with the model provider to move to more resilient infrastructure to mitigate issues like this one in the future.

1747312894 - 1747349885 Resolved

Disruption with some GitHub services

On May 15, 2025, between 00:08 AM UTC and 10:21 AM UTC, customers were unable to create fine-grained Personal Access Tokens (PATs) on github.com. This incident was triggered by a recent code change to our front end that unintentionally affected the way certain pages loaded and prevented the PAT creation process from completing.We mitigated the incident by reverting the problematic change. To reduce the likelihood of similar issues in the future, we are improving our monitoring for page load anomalies and PAT creation failures and improving our safe deployment practices.

1747292429 - 1747305507 Resolved

Disruption with Gemini 2.5 Pro model

Between May 14, 2025 14:16 UTC and May 15, 2025 01:02 UTC the Copilot service was degraded and returned a high volume of internal server errors for requests targeting Gemini 2.5 Pro, a public preview model. On average, the error rate for Gemini 2.5 Pro was 19.6% and peaked at 41%. This was due to a high volume of internal server errors and rate limiting by the upstream model provider.We mitigated the incident by temporarily disabling Gemini 2.5 Pro for all Copilot Chat experiences, and then worked with the model provider to ensure model health was sufficiently improved before re-enabling.We are working with partners to improve communication speed and are planning to move to more resilient infrastructure to mitigate issues like this one in the future.

1747233588 - 1747270974 Resolved

Disruption with some GitHub services

This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.

1747061639 - 1747062362 Resolved

Incident with Git Operations

On May 8, 2025, between 14:40 UTC and 16:27 UTC the Git Operations service was degraded causing some pushes and merges to fail. On average, the error rate was 1.4% with a peak error rate of 2.24%. This was due to a configuration change which unexpectedly led a critical service to shut down on a subset of hosts that store repository data.We mitigated the incident by re-deploying the affected service to restore its functionality.In order to prevent similar incidents from happening again, we identified the cause that triggered this behavior and mitigated it for future deployments. Additionally, to reduce time to detection we will improve monitoring of the impacted service.

1746717631 - 1746721676 Resolved

Issue Attachments Failing to Upload

On May 1, 2025 from 22:09 UTC to 23:13 UTC, the Issues service was degraded and users weren't able to upload attachments. The root cause was identified to be a new feature which added a custom header to all client-side HTTP requests, causing a CORS errors when uploading attachments to our provider.We mitigated the incident by rolling back the feature flag that added the new header at 22:56 UTC. In order to prevent this from happening again, we are adding new metrics to monitor and ensure the safe rollout of changes to client-side requests.

1746138511 - 1746141198 Resolved

Disruption with Pull Request Ref Updates

On April 30, 2025, between 8:02 UTC and 9:05 UTC, the Pull Requests service was degraded and failed to update refs for repositories with higher traffic. This was due to a repository migration creating a larger than usual number of enqueued jobs. This resulted in an increase in job failures, delays for non-migration sourced jobs, and delays to tracking refs.We declared an incident once we confirmed that this issue was not isolated to the migrating repository and other repositories were also failing to process ref updates.We mitigated the incident by shifting the migration jobs to a different job queue.To avoid problems like this in the future, we are revisiting our repository migration process and are working to isolate potentially problematic migration workloads from non-migration workloads.

1746046316 - 1746047136 Resolved

Delays for web and email notification delivery

On April 29th, 2025, between 8:40am UTC and 12:50pm UTC the notifications service was degraded and stopped delivering most web and email notifications as well as some mobile push notifications. This was due to a large and faulty schema migration that rendered a set of database primaries unhealthy, affecting the notification delivery pipelines, causing delays in the most of the web and email notification deliveries.We mitigated the incident by stopping the migration and promoting replicas to replace the unhealthy primaries.In order to prevent similar incidents in the future, we are addressing the underlying issues in the online schema tooling and improving the way we interact with the database to not be disruptive to production workloads.

1745921143 - 1745931148 Resolved
⮜ Previous Next ⮞