Incident History

Container Registry and App Platform in AMS3/FRA1

From 10:30 to 11:35 UTC, our Engineering team observed an issue with Container Registry and App Platform builds in the AMS3 and FRA1 regions.

During this time, users may have experienced delays while building their Apps and, as a result, could have seen timeout errors in builds. Additionally, a subset of customers may have experienced latency while interacting with the Container Registries.

Our Engineering team found that a backing component of the Container Registry was experiencing high memory usage. They were able to remediate that component at 11:35 UTC, which resolved the issue.

We apologize for the inconvenience. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.

2024-08-22 15:04 UTC Resolved

Networking in SFO2 and SFO3 Regions

As of 10:45 UTC, our Engineering team has resolved the issue with networking in the SFO2 and SFO3 regions, and networking in both regions should now be operating normally. If you continue to experience problems, please open a ticket with our Support team. We apologize for any inconvenience.

2024-08-15 10:02 - 11:48 UTC Resolved

Event processing and Droplet Creates in TOR1

Our Engineering team has resolved the issue with Droplet creates and Snapshots. As of 05:30 UTC, users should be able to create Droplets and Snapshots and to process events. Droplet-backed services should also be operating normally.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

2024-08-15 04:29 - 05:40 UTC Resolved

Elevated Failure Rate for Droplet Creates

From 23:08 August 12 to 01:26 August 13 UTC, customers may have experienced failures with Droplet creation, power on events, and restore events in the NYC3 region.

Our Engineering team has confirmed resolution of this issue. Thank you for your patience.

If you continue to experience any problems, please open a support ticket from within your account.

2024-08-13 00:27 - 03:10 UTC Resolved

Droplet Rebuild and Restore Event Processing

From 09:52 UTC to 19:32 UTC, customers may have experienced failures with Droplet rebuild and restore events in all regions. Our Engineering team has confirmed full resolution of this issue.

Thank you for your patience. If you continue to experience any problems, please open a support ticket from within your account.

2024-08-06 19:19 - 20:40 UTC Resolved

Global Impact to Events, Cloud Control Panel, and API

Incident Summary

On August 05, 2024 at 16:30 UTC, DigitalOcean experienced a disruption to internal service discovery. Customers experienced full disruption of creates, event processing, and management of other DigitalOcean products globally. Due to an error in a replication configuration that propagated globally, internal services were unable to correctly discover other services they depended on. This did not affect the availability of existing customer resources.

Incident Details

Root Cause: An incorrect replication configuration was deployed against the datastore that powers the internal service discovery service at DigitalOcean. The incorrect configuration specified a new datacenter, which held zero keys, as the owner of 100% of the keys in the datastore. This had an immediate global impact on the data storage layer and disrupted the quorum of datastore nodes across all regions. Clients of the service were unable to read from or write to the datastore during this time, which had a cascading effect.
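The report does not name the datastore or its configuration format, so the following is only a rough, hypothetical sketch of the failure mode: it models a Cassandra-style replicated keystore in which a replication map assigns key ownership per datacenter, and shows how listing a new, empty datacenter as the sole owner leaves quorum reads and writes with no replicas that actually hold the data (Python, illustrative names only).

    # Hypothetical sketch only: the datastore and its configuration format are
    # not named in this report. This models a Cassandra-style replicated
    # keystore where a replication map assigns key ownership per datacenter,
    # and quorum operations need a majority of replicas that hold the data.
    from dataclasses import dataclass

    @dataclass
    class Datacenter:
        name: str
        nodes: int
        keys_held: int  # keys this datacenter has actually ingested

    def quorum_available(replication: dict, dcs: dict) -> bool:
        """True if a majority of the owning replicas can serve the data."""
        owners = [dcs[name] for name, rf in replication.items() if rf > 0]
        total_replicas = sum(dc.nodes for dc in owners)
        replicas_with_data = sum(dc.nodes for dc in owners if dc.keys_held > 0)
        return total_replicas > 0 and replicas_with_data > total_replicas // 2

    dcs = {
        "dc1": Datacenter("dc1", nodes=3, keys_held=1_000_000),
        "dc2": Datacenter("dc2", nodes=3, keys_held=1_000_000),
        "dc_new": Datacenter("dc_new", nodes=3, keys_held=0),  # new and empty
    }

    # Intended change: add the new datacenter alongside the existing owners.
    print(quorum_available({"dc1": 3, "dc2": 3, "dc_new": 3}, dcs))  # True

    # Deployed change (the failure mode above): the new, empty datacenter is
    # recorded as the owner of 100% of the keys.
    print(quorum_available({"dc_new": 3}, dcs))  # False -> quorum lost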

Impact:

The first observable impact was a complete disruption to the I/O layer of the backing datastore.

Platform events, such as resource creates and other management actions, are consumed by a wide variety of backing services that compose the DigitalOcean Cloud platform, so event processing for those services was disrupted for the duration of the incident.

Other services across DigitalOcean, outside of the eventing flow, also rely on service discovery to talk to each other, so customers may have seen additional impact when attempting to manage assorted services through the Cloud Control Panel or via the API.

Response: After gathering diagnostic information and determining the root cause, engineers deployed a corrected replication configuration. Some regions ingested the new configuration and started to recover. Teams identified additional regions that were slower to ingest the updated configuration, manually invoked the change directly on the nodes in those regions, and then ran local repairs on the data to ensure alignment before moving on to the next region.

Engineering teams cleaned up any remaining failed events and processed pending events that had not yet timed out. At the conclusion of that cleanup effort, the incident was declared resolved, and the cloud platform stabilized.
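As a purely illustrative sketch of the region-by-region recovery pattern described above, the loop below applies the corrected configuration directly to regions that have not ingested it on their own and runs a local repair before moving on to the next one. The helper names are hypothetical and do not correspond to actual DigitalOcean tooling.

    # Illustrative only: helper names are hypothetical, not DigitalOcean tooling.
    from typing import Callable, List

    def remediate_lagging_regions(
        regions: List[str],
        has_corrected_config: Callable[[str], bool],
        apply_config_on_nodes: Callable[[str], None],
        run_local_repair: Callable[[str], None],
    ) -> List[str]:
        """Manually fix regions that did not ingest the corrected config."""
        manually_fixed = []
        for region in regions:
            if has_corrected_config(region):
                continue  # region already picked up the corrected config
            apply_config_on_nodes(region)  # invoke the change directly on nodes
            run_local_repair(region)       # align data before the next region
            manually_fixed.append(region)
        return manually_fixed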

Timeline of Events (UTC)

Aug 05 16:30 - Rollout of the new datastore cluster begins.

Aug 05 16:35 - First report of service discovery unavailability is raised internally.

Aug 05 16:42 - Lack of quorum and datastore ownership is identified as the blocking issue.

Aug 05 17:00 - The replication configuration change, adding the new datacenter, is identified as the root cause behind the ownership change.

Aug 05 17:16 - The replication configuration change is reverted, and run against the region that had become the datastore owner. Some events start to fail faster at this point, changing the error from a distinct timeout to a failure to find endpoints.

Aug 05 18:25 - Regions that have not detected or applied the reverted configuration are identified, and engineers start manually applying the configuration and running repairs on the datastore for those regions.

Aug 05 19:10 - Remaining failure events resolve, and the platform stabilizes.

Remediation Actions

The replication configuration deployment happened outside of a normal maintenance window. Moving forward, maintenance of this type will be performed inside a declared maintenance window, with any potential for customer impact communicated via a maintenance notice posted on the status page.

The process documentation for this type of deployment will be updated to reflect the current requirements and clearly outline the steps and expectations for each stage of a new deployment. Additionally, the manual processes that occurred will be automated to help reduce the potential for human error.

Multiple teams are also evaluating whether the current topology of the internal datastore is appropriate, and whether there are regionalization or multi-layered approaches DigitalOcean can take to help ensure internal service discovery remains as available as possible.

2024-08-05 17:05 - 19:46 UTC Resolved

App Platform and Container Registry in NYC

Our Engineering team has confirmed the full resolution of the issue with the DigitalOcean App Platform and Container Registry in our NYC regions.

Users should no longer experience any issues while pushing to Container Registries and working with App Platform builds.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

2024-08-05 03:21 - 05:48 UTC Resolved

Spaces Access Key

From 23:47 UTC until 01:11 UTC, users may have experienced errors when attempting to create Spaces Access Keys in the Cloud Control Panel.

Our Engineering team has identified and resolved the issue, and users should now be able to create Spaces Access Keys.

We apologize for any inconvenience this may have caused. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.

2024-07-31 01:29 UTC Resolved

Snapshots and Backups in TOR1

As of 05:05 UTC, our Engineering team has confirmed the full resolution of the issue impacting Snapshot and Backup Images in the TOR1 region. We have verified that the Snapshot and Backup events in the region are processing without any failures.

Users should also be able to create Droplets from Snapshot and Backup images in this region without any issues.

Thank you for your patience and understanding. If you encounter any further issues, please open a ticket with our Support team.

2024-07-29 04:27 - 05:52 UTC Resolved

Networking in Multiple Regions

Incident Summary

On July 24, 2024, DigitalOcean experienced downtime from near-simultaneous crashes affecting multiple hypervisors (ref: https://docs.digitalocean.com/glossary/hypervisor/) in several regions. In total, fourteen hypervisors crashed; the majority were in the FRA1 and AMS3 regions, with the remainder in LON1, SGP1, and NYC1. A routine kernel fix intended to improve platform stability was being deployed to a subset of hypervisors across the fleet, and that fix had an unexpected conflict with a separate automated maintenance routine, causing those hypervisors to experience kernel panics and become unresponsive. This led to an interruption in service for customer Droplets and other Droplet-based services until the affected hypervisors were rebooted and restored to a functional state.

Incident Details

Timeline of Events (UTC)

July 24 22:55 - Rollout of the kernel fix begins. 

July 24 23:10 - First hypervisor crash occurs and the Operations team begins investigating.

July 24 23:55 - Rollout of the kernel fix ends. 

July 25 00:14 - Internal incident response begins, following further crash alerts firing. 

July 25 00:35 - Diagnostic tests are run on impacted hypervisors to gather information.

July 25 00:47 - Kernel panic messages are observed on impacted hypervisors. Additional Engineering teams are paged for investigation.

July 25 01:42 - Operations team begins coordinated effort to reboot all impacted hypervisors to restore customer services.

July 25 01:50 - Root cause for the crashes is determined to be the conflict between the kernel fix and maintenance operation. 

July 25 03:22 - Reboots of all impacted hypervisors complete, all services are restored to normal operation.

Remediation Actions

2024-07-25 00:33 - 05:48 UTC Resolved