Global Impact to Events, Cloud Control Panel, and API
Update
Incident Summary
On August 05, 2024 at 16:30 UTC, DigitalOcean experienced a disruption to internal service discovery. Customers experienced full disruption of creates, event processing, and management of other DigitalOcean products globally. Due to an error in a replication configuration that propagated globally, internal services were unable to correctly discover other services they depended on. This did not affect the availability of existing customer resources.
Incident Details
Root Cause: An incorrect replication configuration was deployed against the datastore which powers the internal service discovery service at DigitalOcean. The incorrect configuration specified a new datacenter with zero keys as 100% ownership of all keys in the datastore. This had an immediate global impact against the data storage layer and disrupted the quorum of datastore nodes across all regions. Clients of the service were unable to read/write to the datastore during this time, which had a cascading effect.
Impact:
The first observable impact was a complete disruption to the I/O layer of the backing datastore.
These events are consumed by a wide variety of backing services that compose the DigitalOcean Cloud platform. This incident impacted:
- Droplet Creates
- Droplet Updates
- Network Creates
- Login / Authentication services
- Block Storage Volumes Snapshot creation
- Spaces/CDN Creates
- Spaces Updates
- Managed Kubernetes cluster creates
- Managed Databases creates
Other services across DigitalOcean, outside of the eventing flow, also rely on service discovery to talk to each other, so customers may have seen additional impact when attempting to manage assorted services through the Cloud Control Panel or via the API.
Response: After gathering diagnostic information and determining the root cause, an updated / correct replication configuration was deployed. Some regions ingested the new replication configuration and started to recover. Teams identified additional regions that took longer to ingest the updated configuration and manually invoked the change directly on the nodes, and then ran local repairs on the data to ensure alignment before moving to the next region.
Engineering teams cleaned up any remaining failed events and processed pending events that had not yet timed out. At the conclusion of that cleanup effort, the incident was declared resolved, and the cloud platform stabilized.
Timeline of Events (UTC)
Aug 05 16:30 - Rollout of the new datastore cluster begins.
Aug 05 16:35 - First report of service discovery unavailability is raised internally.
Aug 05 16:42 - Lack of quorum and datastore ownership is identified as the blocking issue.
Aug 05 17:00 - The replication configuration change, adding the new datacenter, is identified as the root cause behind the ownership change.
Aug 05 17:16 - The replication configuration change is reverted, and run against the region that had become the datastore owner. Some events start to fail faster at this point, changing the error from a distinct timeout to a failure to find endpoints.
Aug 05 18:25 - Regions that have not detected or applied the reverted configuration are identified, and engineers start manually applying the configuration and running repairs on the datastore for those regions.
Aug 05 19:10 - Remaining failure events resolve, and the platform stabilizes.
Remediation Actions
The replication configuration deployment happened outside of a normal maintenance window. Moving forward, these types of extension maintenances will be performed inside a declared maintenance window, with any potential for customer impact communicated via a maintenance notice posted on the status page.
The process documentation for this type of deployment will be updated to reflect the current requirements and clearly outline the steps and expectations for each stage of a new deployment. Additionally, the manual processes that occurred will be automated to help reduce the potential for human error.
Multiple teams are also evaluating if our current topology of the internal datastore is appropriate, and if there are any regionalizations or multi-layered approaches DigitalOcean can take to help ensure our internal service discovery remains as available as possible.
Resolved
Our engineering team has resolved the issue impacting the management of multiple services via the Cloud Control Panel and the API, including event processing. All services should now be functioning normally. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
Update
Our Engineering teams have identified a remediation strategy and are working to roll it out to all regions. We are starting to see services recover as a result of this fix being deployed.
Customers may begin to see some management functionality for services return and we will continue to provide updates as we work to return full functionality.
Update
Our Engineering teams have now identified the cause of the issue impacting multiple services. They continue to work on resolving the impact to customers.
During this time, users may experience errors when attempting to manage services through the UI or the API, including accessing Droplet information or the processing of Droplet events. Existing customer resources, such as Droplets, remain online and accessible.
We apologize for the inconvenience and will share an update once we have more information.
Investigating
Multiple Engineering teams are continuing to investigate the issue. Customers may also encounter errors when managing other services and using the API at this time.
We will post another update soon.
Investigating
As of 16:30 UTC, Our Engineering team is investigating an issue with processing of events in all regions. During this time users may experience failed events on Droplets. We apologize for the inconvenience and will share an update once we have more information.