On July 24, 2024, DigitalOcean experienced downtime from near-simultaneous crashes affecting multiple hypervisors (ref: https://docs.digitalocean.com/glossary/hypervisor/) in several regions. In total, fourteen hypervisors crashed, the majority of them in the FRA1 and AMS3 regions and the remainder in LON1, SGP1, and NYC1. A routine kernel fix intended to improve platform stability was being deployed to a subset of hypervisors across the fleet. That fix had an unexpected conflict with a separate automated maintenance routine, causing the affected hypervisors to experience kernel panics and become unresponsive. This interrupted service for customer Droplets and other Droplet-based services until the affected hypervisors were rebooted and restored to a functional state.
Incident Details
Root Cause: A kernel fix being rolled out to some hypervisors through an incremental process conflicted with a periodic maintenance operation which was in progress on a subset of those hypervisors.
Impact: The affected hypervisors crashed, causing Droplets (including other Droplet-based services) running on them to become unresponsive. Customers were unable to reach those Droplets over the network, process power events such as power off/on, or view monitoring data.
Response: After gathering diagnostic information and determining the root cause, we rebooted the affected hypervisors in order to safely restore service. Manual remediation was performed on hypervisors that had received the kernel fix to ensure it was applied only while the maintenance operation was not in progress.
Timeline of Events (UTC)
July 24 22:55 - Rollout of the kernel fix begins.
July 24 23:10 - First hypervisor crash occurs and the Operations team begins investigating.
July 24 23:55 - Rollout of the kernel fix ends.
July 25 00:14 - Internal incident response begins after additional crash alerts fire.
July 25 00:35 - Diagnostic tests are run on impacted hypervisors to gather information.
July 25 00:47 - Kernel panic messages are observed on impacted hypervisors. Additional Engineering teams are paged for investigation.
July 25 01:42 - Operations team begins coordinated effort to reboot all impacted hypervisors to restore customer services.
July 25 01:50 - Root cause for the crashes is determined to be the conflict between the kernel fix and maintenance operation.
July 25 03:22 - Reboots of all impacted hypervisors complete, and all services are restored to normal operation.
Remediation Actions
The continued rollout of this specific kernel fix, as well as future rollouts of this type of fix, will not be performed on hypervisors while the maintenance operation is in progress, in order to avoid any possible conflicts (see the illustrative sketch below).
Further investigation will be conducted to understand how the kernel fix and the maintenance operation conflicted to cause a kernel panic, in order to help avoid similar problems in the future.
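For illustration only, the first remediation item above amounts to a pre-flight check in the rollout tooling: skip any hypervisor whose maintenance operation is currently in progress and retry it in a later pass. The sketch below is a hypothetical approximation in Python; the function names, fleet inventory, and maintenance-status check are assumptions and do not describe DigitalOcean's internal systems.

```python
# Hypothetical sketch of a rollout guard (not DigitalOcean's actual tooling):
# apply the kernel fix only to hypervisors that are not in the middle of the
# periodic maintenance operation, and defer the rest to a later pass.

from typing import Callable, Iterable, List, Tuple


def roll_out_kernel_fix(
    hypervisors: Iterable[str],
    maintenance_in_progress: Callable[[str], bool],
    apply_fix: Callable[[str], None],
) -> Tuple[List[str], List[str]]:
    """Return (patched, deferred) hypervisor lists."""
    patched, deferred = [], []
    for hv in hypervisors:
        if maintenance_in_progress(hv):
            # Defer instead of risking the fix/maintenance conflict
            # that caused the kernel panics.
            deferred.append(hv)
        else:
            apply_fix(hv)
            patched.append(hv)
    return patched, deferred


if __name__ == "__main__":
    # Toy data standing in for a real fleet inventory and maintenance scheduler.
    busy = {"fra1-hv-02", "ams3-hv-07"}
    fleet = ["fra1-hv-01", "fra1-hv-02", "ams3-hv-07", "lon1-hv-03"]
    patched, deferred = roll_out_kernel_fix(
        fleet,
        maintenance_in_progress=lambda hv: hv in busy,
        apply_fix=lambda hv: print(f"applying kernel fix to {hv}"),
    )
    print("patched:", patched)
    print("deferred to next pass:", deferred)
```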
From 17:22 UTC to 17:27 UTC, we experienced an issue with requests to the Cloud Control Panel and API.
During that timeframe, users may have experienced an increase in 5xx errors for Cloud Control Panel and API requests. The issue self-resolved quickly, and our Engineering team is continuing to investigate the root cause to ensure it does not occur again.
Thank you for your patience, and we apologize for any inconvenience. If you continue to experience any issues, please open a Support ticket for further analysis.
From 19:33 to 21:02 UTC on July 18th, App Platform users may have experienced delays when deploying new Apps or when deploying updates to existing Apps in SFO3.
Our engineering team has deployed a fix for this issue. The impact has been resolved and users should no longer see any issues with the impacted services.
If you continue to experience problems, please open a ticket with our support team from your Cloud Control Panel. Thank you for your patience, and we apologize for any inconvenience.
Our Engineering team has confirmed the full resolution of the issue impacting the ability to create and manage Functions through the Cloud Control Panel and API in our TOR1 region. We appreciate your patience throughout the process.
If you continue to see errors please open a ticket with our Support team and we will be glad to assist you further.
Our Engineering team has resolved the issue with processing payments via PayPal on our platform. Services should now be operating normally.
If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Our Engineering team has confirmed the issue with SMS delivery report delays when sending messages has been fully resolved.
We appreciate your patience throughout this process and if you continue to experience problems, please open a ticket with our support team for further review.
Our Engineering team identified and resolved an issue with IPv6 networking in the BLR1 region.
From 12:15 UTC to 15:32 UTC, users may have experienced issues connecting to Droplets and Droplet-based services in the BLR1 region using IPv6 addresses.
Our Engineering team quickly identified the root cause of the incident to be related to a recent maintenance in that region and implemented a fix.
We apologize for the disruption. If you continue to experience any issues, please open a Support ticket from within your account.
Our Engineering team identified and resolved an issue with creation of Snapshots and Backups in the NYC3 region.
From 20:11 UTC to 21:04 UTC, users may have experienced errors while taking Snapshots of Droplets in NYC3. Backup creation was also failing; however, failed Backups will be retried automatically.
Our Engineering team quickly identified the root cause of the incident to be related to capacity on internal storage clusters and was able to rebalance that capacity, allowing Snapshot and Backup creation to succeed.
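If a Snapshot attempt failed during this window, it can simply be retried now that capacity has been rebalanced. As a minimal sketch, assuming a valid personal access token exported as DIGITALOCEAN_TOKEN and the numeric ID of the Droplet in question, a new snapshot can be requested through the public API's Droplet actions endpoint (the snapshot name below is just an example):

```python
# Minimal sketch: retry a Droplet snapshot via the public DigitalOcean API v2.
# Assumes DIGITALOCEAN_TOKEN holds a valid personal access token and
# DROPLET_ID is the numeric ID of the Droplet to snapshot.

import os
import requests

DROPLET_ID = 123456789  # replace with your Droplet's ID
token = os.environ["DIGITALOCEAN_TOKEN"]

resp = requests.post(
    f"https://api.digitalocean.com/v2/droplets/{DROPLET_ID}/actions",
    headers={"Authorization": f"Bearer {token}"},
    json={"type": "snapshot", "name": "retry-after-nyc3-incident"},
    timeout=30,
)
resp.raise_for_status()
action = resp.json()["action"]
print(f"Snapshot action {action['id']} is {action['status']}")
```

The same action can also be triggered with doctl or from the Cloud Control Panel.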
We apologize for the disruption. If you continue to experience any issues, please open a Support ticket from within your account.
Beginning July 2nd, 20:55 UTC, team account owners may have seen an issue with removing other users from their team accounts. As of July 3rd 20:47 UTC, a fix was deployed and our Engineering team has confirmed full resolution of the issue. Team owners should be able to remove other users from their teams without issue.
Thank you for your patience. If you continue to experience any problems, please open a support ticket from within your account.