From 10:30 to 11:35 UTC, our Engineering team observed an issue with Container Registry and App Platform builds in the AMS3 and FRA1 regions.
During this time, users may have experienced delays while building their Apps, and some builds may have failed with timeout errors as a result. Additionally, a subset of customers may have experienced latency while interacting with Container Registry.
Our Engineering team found that a backing component of the Container Registry was experiencing high memory usage. They were able to remediate that component at 11:35 UTC, which resolved the issue.
We apologize for the inconvenience. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.
As of 10:45 UTC, our Engineering team has resolved the issue with networking in the SFO2 and SFO3 regions, and networking in both regions should now be operating normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Our Engineering team has resolved the issue with Droplet creates and Snapshots. As of 05:30 UTC, users should be able to create Droplets and Snapshots, and events should be processing normally. Droplet-based services should also be operating normally.
If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
From 23:08 August 12 to 01:26 August 13 UTC, customers may have experienced failures with Droplet creation, power on events, and restore events in the NYC3 region.
Our Engineering team has confirmed resolution of this issue. Thank you for your patience.
If you continue to experience any problems, please open a support ticket from within your account.
From 09:52 UTC to 19:32 UTC, customers may have experienced failures with Droplet rebuild and restore events in all regions. Our Engineering team has confirmed full resolution of this issue.
Thank you for your patience. If you continue to experience any problems, please open a support ticket from within your account.
On August 05, 2024 at 16:30 UTC, DigitalOcean experienced a disruption to internal service discovery. Customers experienced full disruption of creates, event processing, and management of other DigitalOcean products globally. Due to an error in a replication configuration that propagated globally, internal services were unable to correctly discover other services they depended on. This did not affect the availability of existing customer resources.
Incident Details
Root Cause: An incorrect replication configuration was deployed against the datastore that powers the internal service discovery service at DigitalOcean. The incorrect configuration assigned ownership of 100% of the keys in the datastore to a new datacenter that held zero keys. This had an immediate global impact on the data storage layer and disrupted the quorum of datastore nodes across all regions. Clients of the service were unable to read from or write to the datastore during this time, which had a cascading effect.
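As an illustration of the failure mode described in the root cause, the following is a minimal sketch of a pre-deployment check that would flag a configuration giving key ownership to a datacenter that currently holds no keys. It is not DigitalOcean's tooling: the function name, the shape of the `ownership` and `key_counts` mappings, and the datacenter names are all hypothetical, and a real datastore would expose this information in its own way.

```python
from typing import Dict, List


def find_ownership_problems(
    ownership: Dict[str, float],
    key_counts: Dict[str, int],
) -> List[str]:
    """Return human-readable problems found in a proposed ownership map.

    ownership  -- hypothetical: datacenter name -> fraction of keys it would own
    key_counts -- hypothetical: datacenter name -> number of keys it holds today
    """
    problems = []
    total = sum(ownership.values())
    if abs(total - 1.0) > 1e-9:
        problems.append(f"ownership fractions sum to {total}, expected 1.0")
    for dc, fraction in ownership.items():
        # A datacenter with no data cannot serve the keys it is said to own.
        if fraction > 0 and key_counts.get(dc, 0) == 0:
            problems.append(
                f"{dc} holds no keys but would own {fraction:.0%} of them"
            )
    return problems


# Hypothetical example mirroring the incident: an empty datacenter owns everything.
proposed = {"new-dc": 1.0, "existing-dc": 0.0}
current_keys = {"new-dc": 0, "existing-dc": 120_000}
problems = find_ownership_problems(proposed, current_keys)
```

A deployment pipeline could refuse to proceed whenever the returned list is non-empty. Whether an empty datacenter should ever be granted ownership before data has been copied to it depends on the datastore, so the rule here is an assumption rather than a general requirement.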
Impact:
The first observable impact was a complete disruption to the I/O layer of the backing datastore.
These events are consumed by a wide variety of backing services that compose the DigitalOcean Cloud platform. This incident impacted:
Droplet Creates
Droplet Updates
Network Creates
Login / Authentication services
Block Storage Volumes Snapshot creation
Spaces/CDN Creates
Spaces Updates
Managed Kubernetes cluster creates
Managed Databases creates
Other services across DigitalOcean, outside of the eventing flow, also rely on service discovery to talk to each other, so customers may have seen additional impact when attempting to manage assorted services through the Cloud Control Panel or via the API.
Response: After gathering diagnostic information and determining the root cause, a corrected replication configuration was deployed. Some regions ingested the new replication configuration and started to recover. Teams identified additional regions that took longer to ingest the updated configuration, manually applied the change directly on the nodes in each of those regions, and then ran local repairs on the data to ensure alignment before moving to the next region.
Engineering teams cleaned up any remaining failed events and processed pending events that had not yet timed out. At the conclusion of that cleanup effort, the incident was declared resolved, and the cloud platform stabilized.
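The region-by-region recovery described in the Response can be sketched as a simple sequential loop: skip regions that already ingested the corrected configuration, and for the rest apply it and finish the repair before moving on. This is only an outline under stated assumptions. The three callables (`has_new_config`, `apply_config`, `run_repair`) and the region names are hypothetical stand-ins for whatever tooling the teams actually used.

```python
from typing import Callable, Iterable, List


def recover_regions(
    regions: Iterable[str],
    has_new_config: Callable[[str], bool],
    apply_config: Callable[[str], None],
    run_repair: Callable[[str], None],
) -> List[str]:
    """Apply the corrected config where it is missing, one region at a time.

    Returns the regions that needed manual intervention, in processing order.
    """
    touched = []
    for region in regions:
        if has_new_config(region):
            continue  # region ingested the configuration on its own
        apply_config(region)
        run_repair(region)  # finish the local repair before the next region
        touched.append(region)
    return touched


# Hypothetical dry run that records the order of operations.
ingested = {"region-a"}
log = []
touched = recover_regions(
    ["region-a", "region-b", "region-c"],
    has_new_config=lambda r: r in ingested,
    apply_config=lambda r: log.append(("apply", r)),
    run_repair=lambda r: log.append(("repair", r)),
)
```

Processing one region at a time keeps the blast radius of a bad step small, at the cost of a longer overall recovery.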
Timeline of Events (UTC)
Aug 05 16:30 - Rollout of the new datastore cluster begins.
Aug 05 16:35 - First report of service discovery unavailability is raised internally.
Aug 05 16:42 - Lack of quorum and datastore ownership is identified as the blocking issue.
Aug 05 17:00 - The replication configuration change, adding the new datacenter, is identified as the root cause behind the ownership change.
Aug 05 17:16 - The replication configuration change is reverted, and the reverted configuration is applied to the region that had become the datastore owner. Some events start to fail faster at this point, changing the error from a distinct timeout to a failure to find endpoints.
Aug 05 18:25 - Regions that have not detected or applied the reverted configuration are identified, and engineers start manually applying the configuration and running repairs on the datastore for those regions.
Aug 05 19:10 - Remaining failure events resolve, and the platform stabilizes.
Remediation Actions
The replication configuration deployment happened outside of a normal maintenance window. Moving forward, this type of extension maintenance will be performed inside a declared maintenance window, with any potential for customer impact communicated via a maintenance notice posted on the status page.
The process documentation for this type of deployment will be updated to reflect the current requirements and clearly outline the steps and expectations for each stage of a new deployment. Additionally, the manual processes that occurred will be automated to help reduce the potential for human error.
Multiple teams are also evaluating whether the current topology of the internal datastore is appropriate, and whether there are any regionalizations or multi-layered approaches DigitalOcean can take to help ensure our internal service discovery remains as available as possible.
From 23:47 UTC until 01:11 UTC, users may have experienced errors when attempting to create Spaces Access Keys in the Cloud Control Panel.
Our Engineering team has identified and resolved the issue, and users should now be able to create Spaces Access Keys.
We apologize for any inconvenience this may have caused. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.
As of 05:05 UTC, our Engineering team has confirmed the full resolution of the issue impacting Snapshot and Backup Images in the TOR1 region. We have verified that the Snapshot and Backup events in the region are processing without any failures.
Users should also be able to create Droplets from Snapshot and Backup images in this region without any issues.
Thank you for your patience and understanding. If you encounter any further issues, please open a ticket with our Support team.
On July 24, 2024, DigitalOcean experienced downtime from near-simultaneous crashes affecting multiple hypervisors (ref: https://docs.digitalocean.com/glossary/hypervisor/) in several regions. In total, fourteen hypervisors crashed, the majority of which were in the FRA1 and AMS3 regions, with the remainder in LON1, SGP1, and NYC1. A routine kernel fix to improve platform stability was being deployed to a subset of hypervisors across the fleet, and that kernel fix had an unexpected conflict with a separate automated maintenance routine, causing those hypervisors to experience kernel panics and become unresponsive. This led to an interruption in service for customer Droplets and other Droplet-based services until the affected hypervisors were rebooted and restored to a functional state.
Incident Details
Root Cause: A kernel fix being rolled out to some hypervisors through an incremental process conflicted with a periodic maintenance operation which was in progress on a subset of those hypervisors.
Impact: The affected hypervisors crashed, causing Droplets (and other Droplet-based services) running on these hypervisors to become unresponsive. Customers were unable to reach them over the network, process events such as power off/on, or view monitoring data.
Response: After gathering diagnostic information and determining the root cause, we rebooted the affected hypervisors in order to safely restore service. Manual remediation was done on hypervisors that received the kernel fix to ensure it was applied while the maintenance operation was not in progress.
Timeline of Events (UTC)
July 24 22:55 - Rollout of the kernel fix begins.
July 24 23:10 - First hypervisor crash occurs and the Operations team begins investigating.
July 24 23:55 - Rollout of the kernel fix ends.
July 25 00:14 - Internal incident response begins, following further crash alerts firing.
July 25 00:35 - Diagnostic tests are run on impacted hypervisors to gather information.
July 25 00:47 - Kernel panic messages are observed on impacted hypervisors. Additional Engineering teams are paged for investigation.
July 25 01:42 - Operations team begins coordinated effort to reboot all impacted hypervisors to restore customer services.
July 25 01:50 - Root cause for the crashes is determined to be the conflict between the kernel fix and maintenance operation.
July 25 03:22 - Reboots of all impacted hypervisors complete, and all services are restored to normal operation.
Remediation Actions
The continued rollout of this specific kernel fix, as well as future rollouts of this type of fix, will not be done on hypervisors while the maintenance operation is in progress, to avoid any possible conflicts.
Further investigation will be conducted to understand how the kernel fix and the maintenance operation conflicted and caused a kernel crash, to help avoid similar problems in the future.
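The first remediation action, not applying the kernel fix while the maintenance operation is in progress, can be pictured as a guard in the rollout planner. The sketch below is a hypothetical outline and not DigitalOcean's rollout system: the `Hypervisor` type, its `maintenance_in_progress` flag, and the host names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Hypervisor:
    name: str
    maintenance_in_progress: bool  # hypothetical flag reported by the host


def plan_rollout(hosts: Iterable[Hypervisor]) -> Tuple[List[str], List[str]]:
    """Split hosts into (patch_now, deferred) based on maintenance state."""
    patch_now: List[str] = []
    deferred: List[str] = []
    for host in hosts:
        if host.maintenance_in_progress:
            deferred.append(host.name)  # retry after maintenance finishes
        else:
            patch_now.append(host.name)
    return patch_now, deferred


# Hypothetical fleet with one host in the middle of maintenance.
fleet = [
    Hypervisor("hv-1", maintenance_in_progress=False),
    Hypervisor("hv-2", maintenance_in_progress=True),
    Hypervisor("hv-3", maintenance_in_progress=False),
]
patch_now, deferred = plan_rollout(fleet)
```

A check like this only reflects the state at planning time. If maintenance can start between the check and the patch, a real rollout would also need to re-check on the host or hold a lock that keeps the two operations from overlapping.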