Incident History

Disruption with GitHub Search

On November 25, 2024, between 13:30 and 15:00 UTC, repository searches were timing out for most users. Ongoing work from a similar incident last week helped us uncover the main contributing factors. We have deployed short-term mitigations and identified longer-term work to proactively identify and limit resource-intensive searches.

Nov 25, 2024, 13:57 UTC - 15:25 UTC · Resolved

Disruption with some GitHub services

On November 25, 2024, between 10:38 UTC and 12:00 UTC, the Claude model for GitHub Copilot Chat experienced degraded performance. During the impact, all requests to Claude returned an immediate error to the user. This was due to upstream errors with one of our infrastructure providers, which have since been mitigated.

We are working with our infrastructure providers to reduce time to detection and implement additional failover options to mitigate issues like this one in the future.

Nov 25, 2024, 10:51 UTC - 12:17 UTC · Resolved

[Retroactive] Merge Queues not processing queued Pull Requests in some repositories

Between 2024-11-06 11:14 UTC and 2024-11-08 18:15 UTC, pull requests added to merge queues in some repositories were not processed. This was caused by a bug in a new version of the merge queue code, and was mitigated by rolling back a feature flag. Around 1% of enqueued PRs were affected, and around 7% of repositories that use a merge queue were impacted at some point during the incident.

Queues were impacted if their target branch had the “require status checks” setting enabled, but did not have any individual required checks configured. Our monitoring strategy only covered PRs automatically removed from the queue, which was insufficient to detect this issue.
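For illustration only, the sketch below shows a hypothetical branch protection setup matching the affected condition: required status checks are enabled, but the list of individual required checks is empty. Repository, branch, and token values are placeholders, and this is not any affected repository's actual configuration.

```python
# Hypothetical sketch of the triggering configuration: "require status checks"
# is enabled (required_status_checks is present) but no individual required
# checks are listed. Owner, repo, branch, and token are placeholders.
import requests

OWNER, REPO, BRANCH = "example-org", "example-repo", "main"
TOKEN = "<personal-access-token>"

protection = {
    "required_status_checks": {
        "strict": True,   # the setting is enabled...
        "contexts": [],   # ...but no individual required checks are configured
    },
    "enforce_admins": False,
    "required_pull_request_reviews": None,
    "restrictions": None,
}

resp = requests.put(
    f"https://api.github.com/repos/{OWNER}/{REPO}/branches/{BRANCH}/protection",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {TOKEN}",
    },
    json=protection,
)
resp.raise_for_status()
print("Branch protection updated:", resp.status_code)
```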

We are improving our monitors to cover anomalous manual queue entry removal rates, which will allow us to detect this class of issue much sooner.

Nov 22, 2024, 16:25 UTC · Resolved

Repository searches not working for some users

On November 21, 2024, between 14:30 UTC and 15:53 UTC, search services at GitHub were degraded and CPU load on some nodes reached 100%. The error rate averaged 22 requests per second and peaked at 83 requests per second. During this incident, Enterprise Profile pages were slow to load and searches may have returned low-quality results.

The CPU load was mitigated by redeploying portions of our web infrastructure. We are still working to identify the cause of the increased CPU usage and are improving our observability tooling to better expose the cause of incidents like this in the future.

Nov 21, 2024, 15:30 UTC - 16:48 UTC · Resolved

Disruption with some GitHub services

On November 19, 2024, between 10:56 UTC and 12:03 UTC, the notifications service was degraded and stopped sending notifications. On average, notification delivery was delayed by about one hour. This was due to a database host coming out of a regular maintenance process in read-only mode.

We mitigated the incident by making the host writable again. After that, notification delivery recovered and any delivery job that had failed during the incident was successfully retried.

We are working to improve our observability across database clusters to reduce our time to detection and mitigation of issues like this one in the future.
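For context, a minimal sketch of the kind of check an operator might run is shown below. It assumes a MySQL-compatible host reachable with PyMySQL; the host name and credentials are placeholders, and this is not GitHub's internal tooling.

```python
# Hypothetical operator sketch (not GitHub's tooling): check whether a
# database host was left in read-only mode after maintenance and, if so,
# make it writable again. Host and credentials are placeholders.
import pymysql

conn = pymysql.connect(host="db-host.example", user="ops",
                       password="<secret>", autocommit=True)
try:
    with conn.cursor() as cur:
        cur.execute("SELECT @@global.read_only, @@global.super_read_only")
        read_only, super_read_only = cur.fetchone()
        if read_only or super_read_only:
            # Clearing read_only also clears super_read_only on modern MySQL.
            cur.execute("SET GLOBAL read_only = OFF")
finally:
    conn.close()
```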

Nov 19, 2024, 11:36 UTC - 12:03 UTC · Resolved

Incident with Actions

On October 30, 2024, between 5:45 and 9:42 UTC, the Actions service was degraded, causing run delays. On average, Actions workflow run, job, and step updates were delayed by as much as one hour. The delays were caused by updates in a dependent service that led to failures in Redis connectivity. Delays recovered once Redis cluster connectivity was restored at 8:16 UTC, and the incident was fully mitigated once the job queue backlog had been processed, at 9:24 UTC. This incident followed an earlier, short period of impact on hosted runners due to a similar issue, which was mitigated by failing over to a healthy cluster.

We are working to improve our observability across Redis clusters to reduce our time to detection and mitigation of issues like this one, where multiple clusters and services were impacted. We will also work to reduce the time to mitigate and to improve general resilience to failures of this dependency.

Oct 30, 2024, 07:25 UTC - 09:42 UTC · Resolved

Incident with GitHub Community Discussions

On October 24, 2024 at 06:55 UTC, a syntactically correct but invalid discussion template YAML config file was committed to the community/community repository. As a result, all users of that repository who tried to access a discussion template or attempted to create a discussion received a 500 error response.

We mitigated the incident by manually reverting the invalid template changes. We are adding support to detect and prevent invalid discussion template YAML from causing user-facing errors in the future.
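As a rough illustration of the failure mode (a sketch under assumptions, not GitHub's validation code): a discussion template can be valid YAML yet structurally invalid, for example when its body key is not a list of form elements. A check like the one below would flag such a file before it reaches production.

```python
# Minimal sketch (not GitHub's implementation): parse a discussion template
# that is syntactically valid YAML but structurally invalid, and surface the
# problem as a validation error instead of a 500 at render time.
import yaml  # PyYAML

TEMPLATE = """
title: "[Question] "
labels: [question]
body:                 # valid YAML, but 'body' must be a list of form elements
  type: textarea
  attributes:
    label: Your question
"""

def validate_discussion_template(text: str) -> list[str]:
    errors = []
    data = yaml.safe_load(text)  # passes: the YAML is syntactically correct
    body = data.get("body") if isinstance(data, dict) else None
    if not isinstance(body, list):
        errors.append("'body' must be a list of form elements")
        return errors
    for i, element in enumerate(body):
        if not isinstance(element, dict) or "type" not in element:
            errors.append(f"body[{i}] must be a mapping with a 'type' key")
    return errors

print(validate_discussion_template(TEMPLATE))
# ["'body' must be a list of form elements"]
```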

Oct 24, 2024, 06:12 UTC - 06:55 UTC · Resolved

Disruption with some GitHub services

On October 11, 2024, starting at 05:59 UTC, DNS infrastructure in one of our sites started to fail to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the DNS systems for that site. The team worked to restore the infrastructure and there was no customer impact until 17:31 UTC. During the incident, impact to the following services could be observed:

- Copilot: Degraded IDE code completions for 4% of active users from 17:31 UTC to 21:45 UTC.
- Actions: Workflow run delays (25% of runs delayed by over 5 minutes) and errors (1%) between 20:28 UTC and 21:30 UTC, plus errors while creating Artifact Attestations.
- Customer migrations: From 18:16 UTC to 23:12 UTC, running migrations stopped and new ones were not able to start.
- Support: support.github.com was unavailable from 19:28 UTC to 22:14 UTC.
- Code search: 100% of queries failed between 2024-10-11 20:16 UTC and 2024-10-12 00:46 UTC.

Starting at 18:05 UTC, engineering attempted to repoint the degraded site's DNS to a different site to restore DNS functionality. At 18:26 UTC the test system had validated this approach, and a progressive rollout to the affected hosts proceeded over the next hour. While this mitigation was effective at restoring connectivity within the site, it caused issues with connectivity from healthy sites back to the degraded site, and the team proceeded to plan a different remediation effort.

At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by deploying temporary DNS resolution capabilities to the degraded site. At 21:46 UTC, DNS resolution in the degraded site began to recover and was fully healthy at 22:16 UTC. Lingering issues with code search were resolved at 01:11 UTC on October 12.

The team continued to restore the original functionality within the site after public service functionality was restored. GitHub is working to harden our resiliency and automation processes around this infrastructure to make diagnosing and resolving issues like this faster in the future.

Oct 11, 2024, 17:53 UTC - Oct 12, 2024, 01:11 UTC · Resolved

Isolated Codespaces creation failures in the West Europe region

This incident has been resolved.

Oct 8, 2024, 17:02 UTC - 23:32 UTC · Resolved

Disruption with some GitHub services

On September 27, 2024, between 15:26 UTC and 15:34 UTC, the Repositories Releases service was degraded. During this time, 9% of requests to list releases via the API or the webpage received a 500 Internal Server Error.

This was due to a bug in our software rollout strategy. The rollout was reverted starting at 15:30 UTC, which began to restore functionality; the rollback was completed at 15:34 UTC.

We are continuing to improve our testing infrastructure to ensure that bugs such as this one can be detected before they make their way into production.

Oct 3, 2024, 17:37 UTC · Resolved