Incident History

Some GitHub Actions may not run

On January 9, 2025, between 06:26 and 07:49 UTC, Actions experienced degraded performance, leading to failures in about 1% of workflow runs across ~10k repositories. The failures occurred due to an outage in a dependent service, which disrupted Redis connectivity in the East US 2 region. We mitigated the incident by re-routing Redis traffic out of that region at 07:49 UTC. We continued to monitor service recovery before resolving the incident at 08:30 UTC. We are working to improve our monitoring to reduce our time to detection and mitigation of issues like this one in the future.

Incident with Webhooks

On January 9, 2025, between 01:26 UTC and 01:56 UTC, GitHub experienced widespread disruption to many services, with users receiving 500 responses when trying to access various functionality. This was due to a deployment which introduced a query that saturated a primary database server. The error rate averaged 6% and peaked at 6.85% of update requests. We mitigated the incident by identifying the source of the problematic query and rolling back the deployment. We are investigating methods to detect problematic queries prior to deployment, both to prevent issues like this one and to reduce our time to detection and mitigation in the future.

Incident with Actions resulting in degraded performance

On January 7th, 2025, between 11:54 and 16:39 UTC, degraded performance was observed in Actions, Webhooks, and Issues, caused by an internal Certificate Authority configuration change that disrupted our event infrastructure. The configuration issue was promptly identified and resolved by rolling the change back on impacted hosts and re-issuing certificates. We have identified which services need updates to support the current PKI architecture and are working on implementing those changes to prevent a future recurrence.

Incident with Actions

On January 2, 2025, between 16:00:00 and 22:27:30 UTC, a bug in feature-flagged code that cleans up pull requests after they are closed or merged caused the merge commit SHA to be incorrectly cleared for ~139,000 pull requests. During the incident, Actions workflows triggered by the on: pull_request trigger for the closed type were not queued successfully because of these missing merge commit SHAs. Approximately 45,000 repositories experienced these missing workflow triggers in either of two possible scenarios: pull requests which were closed, but not merged; and pull requests which were merged. We mitigated the impact by rolling back the feature flag. Merged pull requests that were affected have had their merge commit SHAs restored. Closed pull requests have not had their merge commit SHA restored; however, customers can re-open and close them again to recalculate this SHA. We are investigating methods to improve detection of these kinds of errors in the future.

Disruption with some GitHub services

On December 20th, 2024, between 15:57 UTC and 16:39 UTC, some of our marketing pages became inaccessible, and users attempting to access the pages received 500 errors. There was no impact to any operational product or service area. This issue was due to a partial outage with one of our service providers. At 16:39 UTC the service provider resolved the outage, restoring access to the affected pages. We are investigating methods to improve error handling and gracefully degrade these pages in case of future outages.

Live updates on pages not loading reliably

On December 17th, 2024, between 14:33 UTC and 14:50 UTC, many users experienced intermittent errors and timeouts when accessing github.com. The error rate was 8.5% on average and peaked at 44.3% of requests. The increased error rate caused a broad impact across our services, such as the inability to log in, view a repository, open a pull request, and comment on issues. The errors were caused by our web servers being overloaded as a result of planned maintenance that unintentionally caused our live updates service to fail to start. As a result of the live updates service being down, clients reconnected aggressively and overloaded our servers. We only marked Issues as affected during this incident despite the broad impact. This oversight was due to a gap in our alerting while our web servers were overloaded. The engineering team's focus on restoring functionality led us to not identify the broad scope of the impact to customers until the incident had already been mitigated. We mitigated the incident by rolling back the changes from the planned maintenance to the live updates service and scaling up the service to handle the influx of traffic from WebSocket clients. We are working to reduce the impact of the live updates service's availability on github.com to prevent issues like this one in the future. We are also working to improve our alerting to better detect the scope of impact from incidents like this.
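
The aggressive client reconnection described above is a common failure pattern when a long-lived connection service goes down. One widely used way to soften it is exponential backoff with random jitter, so that clients spread their retries out instead of retrying in lockstep. The sketch below is illustrative only and is not taken from the incident report: the function name reconnect_delays and its parameters (base, cap, attempts, rng) are hypothetical, and nothing here describes how GitHub's clients or live updates service are actually implemented.

```python
import random


def reconnect_delays(base=1.0, cap=60.0, attempts=6, rng=None):
    """Yield one randomized wait (in seconds) per reconnect attempt.

    Hypothetical helper: the upper bound doubles on each attempt, is
    limited to `cap`, and the actual wait is drawn uniformly from
    [0, upper bound] so that many clients do not retry simultaneously.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        yield rng.uniform(0, ceiling)


if __name__ == "__main__":
    # Print the waits a single client would use; values differ per run.
    for attempt, delay in enumerate(reconnect_delays(), start=1):
        print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

With these assumed defaults, the upper bound on the wait grows from 1 second to 32 seconds over six attempts, and the randomization means two clients that disconnect at the same moment are unlikely to reconnect at the same moment.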

Disruption with some GitHub services

Upon further investigation, the degradation in migrations in the EU was caused by an internal configuration issue, which was promptly identified and resolved. No customer migrations were impacted during this time. The issue only affected GitHub Enterprise Cloud - EU and had no impact on GitHub.com. The service is now fully operational. We are following up by improving our processes for these internal configuration changes to prevent a recurrence, and by ensuring that incidents affecting GitHub Enterprise Cloud - EU are reported on https://eu.githubstatus.com/.

Disruption with some GitHub services

On December 4th, 2024, between 18:52 UTC and 19:11 UTC, several GitHub services were degraded with an average error rate of 8%. The incident was caused by a change to a centralized authorization service that contained an unoptimized database query. This led to an increase in overall load on a shared database cluster, resulting in a cascading effect on multiple services and specifically affecting repository access authorization checks. We mitigated the incident by rolling back the change at 19:07 UTC, fully recovering within 4 minutes. While this incident was caught and remedied quickly, we are implementing process improvements around recognizing and reducing the risk of changes involving high-volume authorization checks. We are investing in broad improvements to our safe rollout process, such as improving early detection mechanisms.

[Retroactive] Incident with Pull Requests

On December 3rd, between 23:29 and 23:43 UTC, Pull Requests experienced a brief outage, and teams have confirmed the issue to be resolved. Because the incident was brief, it was not posted publicly at the time; however, a root cause analysis (RCA) will be conducted and shared in due course.

Incident with Pull Requests and API Requests

On December 3, 2024, between 19:35 UTC and 20:05 UTC, API requests, Actions, Pull Requests, and Issues were degraded. Web and API requests for Pull Requests experienced a 3.5% error rate and Issues had a 1.2% error rate. The highest impact was for users who experienced errors while creating and commenting on Pull Requests and Issues. Actions had a 3.3% error rate in jobs and delays on some updates during this time. This was due to an erroneous database credential change impacting write access to Issues and Pull Requests data. We mitigated the incident by reverting the credential change at 19:52 UTC. We continued to monitor service recovery before resolving the incident at 20:05 UTC. We are making a few improvements in response. We are investing in safeguards to the change management process in order to prevent erroneous database credential changes. Additionally, the initial rollback attempt was unsuccessful, which led to a longer time to mitigate. We were able to revert through an alternative method and are updating our playbooks to document this mitigation strategy.
