Incident History

Networking in SFO3 Region

Our Engineering team identified and resolved an issue affecting networking in our SFO3 region. Impact occurred in three windows: 08:48 to 08:52 UTC, 09:08 to 09:12 UTC, and 10:51 to 10:55 UTC. During these periods, users may have experienced timeouts on network connections to and from the SFO3 region.

Our Engineering team was able to take quick action to mitigate the impact and resolve the issue. All services are now functioning as expected. Thank you for your patience, and we apologize for any inconvenience. If you continue to experience any issues, please open a Support ticket for further analysis.

1725020382 - 1725020382 Resolved

Creates/Resizes in SGP & SYD Regions

Our Engineering team has identified the root cause of the issue with creating and resizing Droplets and Droplet-based resources in our SGP1 and SYD1 regions.

No further user impact has occurred since our last update.

In order to fully remediate this issue, our Engineering team has scheduled emergency maintenance, which will take place from 14:00 - 22:00 UTC on August 29th.

For more details, please see the maintenance notice: https://status.digitalocean.com/incidents/np0zw6m04jm1

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

1724857562 - 1724888721 Resolved

Droplet connectivity and Event processing in NYC1

Our Engineering team has confirmed that the issue impacting our Droplet-based services in the NYC1 region has been completely mitigated. Users should no longer see issues with their Droplets and Droplet-related services.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

1724660008 - 1724668855 Resolved

Spaces Availability in NYC3

From 17:18 UTC until 17:58 UTC, users may have experienced issues when attempting to access Spaces in the NYC3 region.

Our Engineering team has confirmed full resolution of the issue, and users should now be able to access Spaces normally.

Thank you for your patience. If you continue to experience any problems, please open a ticket with our support team for further review.

1724349343 - 1724366621 Resolved

Container Registry and App Platform in AMS3/FRA1

From 10:30 to 11:35 UTC, our Engineering team observed an issue with Container Registry and App Platform builds in the AMS3 and FRA1 regions.

During this time, users may have experienced delays while building their Apps and could have potentially experienced timeout errors in builds as a result. Additionally, a subset of customers may have experienced latency while interacting with the Container Registries.

Our Engineering team found that a backing component of the Container Registry was experiencing high memory usage. They were able to remediate that component at 11:35 UTC, which resolved the issue.

We apologize for the inconvenience. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.

1724339076 - 1724339076 Resolved

Networking in SFO2 and SFO3 regions

As of 10:45 UTC, our Engineering team has resolved the issue with networking in the SFO2 and SFO3 regions, and networking in both regions should now be operating normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

1723716149 - 1723722527 Resolved

Event processing and Droplet Creates in TOR1

Our Engineering team has resolved the issue with Droplet creates and Snapshots. As of 05:30 UTC, users should be able to create Droplets and Snapshots, and events should process normally. Droplet-backed services should also be operating normally.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

1723696146 - 1723700422 Resolved

Elevated Failure Rate for Droplet Creates

From 23:08 August 12 to 01:26 August 13 UTC, customers may have experienced failures with Droplet creation, power on events, and restore events in the NYC3 region.

Our Engineering team has confirmed resolution of this issue. Thank you for your patience.

If you continue to experience any problems, please open a support ticket from within your account.

1723508866 - 1723518614 Resolved

Droplet Rebuild and Restore Event Processing

From 09:52 UTC to 19:32 UTC, customers may have experienced failures with Droplet rebuild and restore events in all regions. Our Engineering team has confirmed full resolution of this issue.

Thank you for your patience. If you continue to experience any problems, please open a support ticket from within your account.

1722971943 - 1722976819 Resolved

Global Impact to Events, Cloud Control Panel, and API

Incident Summary

On August 05, 2024 at 16:30 UTC, DigitalOcean experienced a disruption to internal service discovery. Customers experienced full disruption of creates, event processing, and management of other DigitalOcean products globally. Due to an error in a replication configuration that propagated globally, internal services were unable to correctly discover other services they depended on. This did not affect the availability of existing customer resources.

Incident Details

Root Cause: An incorrect replication configuration was deployed against the datastore that powers the internal service discovery service at DigitalOcean. The incorrect configuration assigned 100% ownership of all keys in the datastore to a new datacenter that held zero keys. This had an immediate global impact on the data storage layer and disrupted the quorum of datastore nodes across all regions. Clients of the service were unable to read from or write to the datastore during this time, which had a cascading effect.
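The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not DigitalOcean's datastore or its configuration format: the `Datacenter` type, the `can_serve` function, the majority-quorum rule, and the datacenter names and replica counts are all assumptions made for illustration. It shows why handing all key ownership to a datacenter that holds no data leaves no replica able to answer a quorum request.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Datacenter:
    name: str                 # hypothetical name, for illustration only
    replicas_with_data: int   # nodes that actually hold copies of the keys
    replication_factor: int   # copies the configuration says this datacenter owns


def quorum(total_replicas: int) -> int:
    """Majority quorum over the configured number of replicas (an assumed rule)."""
    return total_replicas // 2 + 1


def can_serve(datacenters: List[Datacenter]) -> bool:
    """Return True if enough configured owners actually hold data to reach quorum."""
    total_rf = sum(dc.replication_factor for dc in datacenters)
    if total_rf == 0:
        return False
    # A replica can answer only if the configuration names it an owner
    # and it really has the data.
    able = sum(min(dc.replicas_with_data, dc.replication_factor) for dc in datacenters)
    return able >= quorum(total_rf)


# Before the change: two populated datacenters own the keys.
healthy = [
    Datacenter("dc-a", replicas_with_data=3, replication_factor=3),
    Datacenter("dc-b", replicas_with_data=3, replication_factor=3),
]

# Incorrect configuration: the new, empty datacenter owns 100% of the keys.
misconfigured = [
    Datacenter("dc-a", replicas_with_data=3, replication_factor=0),
    Datacenter("dc-b", replicas_with_data=3, replication_factor=0),
    Datacenter("dc-new", replicas_with_data=0, replication_factor=3),
]

# Corrected configuration: existing datacenters keep ownership while the new one joins.
corrected = [
    Datacenter("dc-a", replicas_with_data=3, replication_factor=3),
    Datacenter("dc-b", replicas_with_data=3, replication_factor=3),
    Datacenter("dc-new", replicas_with_data=0, replication_factor=3),
]

print(can_serve(healthy), can_serve(misconfigured), can_serve(corrected))
```

In this model the corrected configuration still reaches quorum before the new datacenter has been populated, because the existing owners outnumber the empty replicas.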

Impact:

The first observable impact was a complete disruption to the I/O layer of the backing datastore.

These events are consumed by a wide variety of backing services that compose the DigitalOcean Cloud platform. This incident impacted:

Other services across DigitalOcean, outside of the eventing flow, also rely on service discovery to talk to each other, so customers may have seen additional impact when attempting to manage assorted services through the Cloud Control Panel or via the API.

Response: After gathering diagnostic information and determining the root cause, a corrected replication configuration was deployed. Some regions ingested the new replication configuration and started to recover. Teams identified additional regions that took longer to ingest the updated configuration, manually invoked the change directly on the nodes in each of those regions, and then ran local repairs on the data to ensure alignment before moving to the next region.
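The region-by-region recovery described above could be automated along the following lines. This is a hedged sketch only: `roll_out`, `apply_config`, `repair`, and `is_healthy` are hypothetical names, and the explicit health check before moving on is an assumption, not a step the incident report describes. The point it illustrates is the ordering: apply the configuration, repair the data, and only then proceed to the next region.

```python
from typing import Callable, Iterable, List


def roll_out(
    regions: Iterable[str],
    apply_config: Callable[[str], None],
    repair: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Apply the configuration and repair data one region at a time.

    Stops at the first region that fails its health check so that a bad
    change cannot propagate to the regions that have not been touched yet.
    """
    completed: List[str] = []
    for region in regions:
        apply_config(region)   # push the corrected configuration to the region's nodes
        repair(region)         # run a local repair so the data is aligned
        if not is_healthy(region):  # assumed gate, not taken from the report
            raise RuntimeError(f"region {region} unhealthy after repair; halting rollout")
        completed.append(region)
    return completed
```

Processing regions strictly in sequence trades speed for a smaller blast radius, which matches the report's emphasis on finishing repairs in one region before starting the next.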

Engineering teams cleaned up any remaining failed events and processed pending events that had not yet timed out. At the conclusion of that cleanup effort, the incident was declared resolved, and the cloud platform stabilized.

Timeline of Events (UTC)

Aug 05 16:30 - Rollout of the new datastore cluster begins.

Aug 05 16:35 - First report of service discovery unavailability is raised internally.

Aug 05 16:42 - Lack of quorum and datastore ownership is identified as the blocking issue.

Aug 05 17:00 - The replication configuration change, adding the new datacenter, is identified as the root cause behind the ownership change.

Aug 05 17:16 - The replication configuration change is reverted, and the reverted configuration is run against the region that had become the datastore owner. Some events start to fail faster at this point, with the error changing from a distinct timeout to a failure to find endpoints.

Aug 05 18:25 - Regions that have not detected or applied the reverted configuration are identified, and engineers start manually applying the configuration and running repairs on the datastore for those regions.

Aug 05 19:10 - Remaining failure events resolve, and the platform stabilizes.

Remediation Actions

The replication configuration deployment happened outside of a normal maintenance window. Moving forward, this type of extension maintenance will be performed inside a declared maintenance window, with any potential for customer impact communicated via a maintenance notice posted on the status page.

The process documentation for this type of deployment will be updated to reflect the current requirements and clearly outline the steps and expectations for each stage of a new deployment. Additionally, the manual processes that occurred will be automated to help reduce the potential for human error.

Multiple teams are also evaluating whether the current topology of the internal datastore is appropriate, and whether there are regionalization or multi-layered approaches DigitalOcean can take to help ensure that internal service discovery remains as available as possible.

1722877519 - 1722887170 Resolved