On Thursday, November 28, 2024, our domain registrar, Network Solutions, made an update to our digitaloceanspaces.com domain. The change, made at 22:38 UTC (5:38:40 PM ET), added an Extensible Provisioning Protocol (EPP) clientHold status code to our domain.
The code prevents DNS resolution until a customer contacts Network Solutions to clear the hold. As a result of the hold, DigitalOcean customers were unable to sign up or log in via the Cloud Control Panel at the beginning of the incident, and experienced errors with Spaces buckets and dependent services, such as DigitalOcean Container Registry, App Platform, Functions, Load Balancers, and Mongo Managed Database Backups, for the duration of the incident.
Incident Summary
Root Cause: Domain registrar erroneously placed a clientHold on the digitaloceanspaces.com domain, impacting DNS resolution and causing an outage.
Impact: Traffic to DigitalOcean products (e.g. Cloud Control Panel, Spaces, App Platform, etc.) saw intermittent failures between 11/28 22:38 UTC and 11/29 04:03 UTC, with the worst failures happening between 03:00 and 04:03 UTC as cached DNS records reached their TTLs and expired.
Response: DigitalOcean escalated to the domain registrar, as well as Verisign (as the authoritative domain registry for .com domains), to have the clientHold removed.
Timeline of Events
November 28, 2024 (UTC)
22:38 - Incident declared based on monitoring and customer reports of our Cloud Control Panel not loading.
22:44 - Issue detected by an internal alert.
22:57 - First customer contact regarding Cloud Control Panel impact.
23:57 - We noticed that our whois information for the domain had been updated on November 28 22:38 UTC.
November 29, 2024 (UTC)
01:06 - We identified the clientHold status in the EPP section of our whois for digitaloceanspaces.com and confirmed in our Network Solutions account that the domain had been marked inactive.
01:08 - We initiated contact with Network Solutions to have the clientHold removed.
01:30 - Reached out to Verisign executives for assistance.
01:42 - Cloud Control Panel functionality restored.
02:40 - Conference call with a Verisign executive to brief them on the issues we had encountered while attempting to resolve the clientHold status.
02:44 - Email chain established with Verisign executives. We sent a recap of all work done, the problems encountered, and a clear request for resolution.
02:44 - 04:03 - Multiple escalations between Verisign executives leading to clientHold being removed.
03:00 - As DNS caches and recursive DNS server TTLs began to expire, customers experienced the most severe service disruptions, which lasted until resolution.
04:03 - clientHold removed from domain
04:03 - 05:38:40 - Monitoring all infrastructure to ensure healthy and full recovery. DigitalOcean functionality fully restored.
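The gradual onset described in the timeline, where the worst failures came hours after the hold was applied, can be illustrated with a small model. This is a sketch only and is not DigitalOcean tooling: the resolver fetch times and TTL values below are hypothetical, chosen to show how the share of recursive resolvers still answering from cache shrinks once a held domain can no longer be refreshed.

```python
from datetime import datetime, timedelta, timezone

# Time the clientHold was applied, taken from the timeline above.
HOLD_AT = datetime(2024, 11, 28, 22, 38, tzinfo=timezone.utc)

# HYPOTHETICAL resolver caches: (time the record was last fetched, TTL in seconds).
# These values are invented for illustration and are not measured data.
CACHE_ENTRIES = [
    (HOLD_AT - timedelta(minutes=30), 3600),
    (HOLD_AT - timedelta(minutes=5), 14400),
    (HOLD_AT - timedelta(minutes=1), 21600),
    (HOLD_AT - timedelta(hours=1), 7200),
]


def still_cached(fetched_at, ttl_seconds, now):
    """Return True while a resolver can still answer from its cache."""
    return now < fetched_at + timedelta(seconds=ttl_seconds)


def fraction_resolving(cache_entries, now):
    """Share of resolvers that still hold an unexpired cached answer at `now`."""
    if not cache_entries:
        return 0.0
    alive = sum(
        1 for fetched_at, ttl in cache_entries if still_cached(fetched_at, ttl, now)
    )
    return alive / len(cache_entries)


# Print the modeled share of resolvers still answering at a few points in time.
for hours_after in (0, 2, 4, 6):
    moment = HOLD_AT + timedelta(hours=hours_after)
    share = fraction_resolving(CACHE_ENTRIES, moment)
    print(f"{moment:%Y-%m-%d %H:%M} UTC: {share:.0%} of modeled resolvers still cached")
```

In this model, a resolver whose cached entry has expired cannot refresh it while the hold is in place, so failures accumulate over time rather than appearing all at once.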
Remediation Actions
DigitalOcean teams are working on multiple types of remediation to prevent a similar incident from happening again.
DigitalOcean is working with Network Solutions to understand what happened on their end that resulted in the clientHold being applied to our domain incorrectly.
In addition, we are reviewing other domain registrars as possible new homes for our domains.
Teams are also reviewing our monitoring and alerting to reduce our time to detect incidents related to registrar-imposed changes and/or DNS resolution.
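As one illustration of what a check for registrar-imposed changes might look like, the sketch below scans whois text for EPP status codes and flags the ones that stop a domain from resolving. It is a minimal example and not our actual monitoring: the function names and the sample whois text are hypothetical, and a real check would also need to fetch the whois (or RDAP) record on a schedule and raise an alert.

```python
import re

# EPP hold statuses: a domain carrying either of these is not published in
# the zone, so it stops resolving.
HOLD_STATUSES = {"clienthold", "serverhold"}

# Whois output for .com domains lists each status on its own line, e.g.
# "Domain Status: clientHold https://icann.org/epp#clientHold".
_STATUS_LINE = re.compile(r"^[ \t]*Domain Status:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)


def epp_statuses(whois_text):
    """Return every EPP status code listed in the whois text, in order."""
    return [match.group(1) for match in _STATUS_LINE.finditer(whois_text)]


def hold_statuses(whois_text):
    """Return only the statuses that take the domain out of DNS."""
    return [s for s in epp_statuses(whois_text) if s.lower() in HOLD_STATUSES]


# HYPOTHETICAL sample text for illustration; not real whois output for any domain.
SAMPLE_WHOIS = """
   Domain Name: EXAMPLE.COM
   Domain Status: clientHold https://icann.org/epp#clientHold
   Domain Status: clientTransferProhibited https://icann.org/epp#clientTransferProhibited
"""

found = hold_statuses(SAMPLE_WHOIS)
if found:
    print("ALERT: domain is on hold:", ", ".join(found))
else:
    print("No hold status found.")
```

Matching on the status code itself, rather than on the record's updated date, means the check stays quiet for routine registrar updates and fires only when the domain is actually held.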
Our Engineering team identified and resolved the networking issue in our SFO3 region.
From 10:02 UTC to 11:44 UTC, users may have experienced connectivity issues, latency, and timeout errors while interacting with Droplet-based services and App Platform.
The impact has been mitigated and services should be working normally at this time.
If you continue to experience problems, please open a ticket with our support team. Thank you for your patience and we apologize for any inconvenience.
Our Engineering team is investigating an issue related to our ongoing SFO3 maintenance here: https://status.digitalocean.com/incidents/4kj7krrpyg3k
From 20:23 - 20:25 UTC, some services were impacted by a drop in networking. During that time, some Managed Kubernetes clusters experienced errors from the Kubernetes API and/or an increase in 5xx errors. Communication between other services and Block Storage Volumes may have been impacted as well.
The impact has been mitigated and services should be working normally at this time.
If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
From 09:05 to 09:50 UTC (November 21), our Engineering team observed an issue with Droplet creation.
During this time, users may have experienced intermittent errors while creating Droplets via the Cloud Panel and API. Users should no longer be experiencing these issues.
We apologize for the inconvenience. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.
Given the absence of outages for the Support Portal, we will now resolve this incident. Our Engineering team will continue to work with our vendor to ensure continued stability of this service. If we observe further outages, we will communicate those to our users via new updates on our status page.
We sincerely apologize and thank you for your patience as we worked through this issue. In case of any questions or concerns, please open a ticket with our Support team.
Our Engineering team has confirmed complete resolution of the issue that was impacting Spaces Object Storage, DigitalOcean Container Registry, and App Platform, across all regions.
From 16:52 UTC to 18:41 UTC, users may have experienced increased error rates when accessing Spaces objects, interacting with the DigitalOcean Container Registry, and creating or deploying Apps with App Platform. Functionality is completely restored and all operations are succeeding normally.
If you continue to experience any issues with these services please submit a ticket to our customer support team for assistance. Thank you for your patience.
Our Engineering team has confirmed that the issues with Authoritative DNS resolution across multiple regions have been fully resolved. DNS queries should now be resolving normally.
If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
From 19:06 - 19:14 UTC, our Engineering team observed an issue impacting Authoritative DNS Resolution globally. During this time, users might have experienced latency or resolution issues while querying DNS records hosted on our authoritative DNS infrastructure.
The Engineering team swiftly identified and resolved the issue, and as of 19:14 UTC, all DNS queries should now be resolving normally.
We apologize for the inconvenience. If you are still experiencing issues or have any additional questions, please open a support ticket from within your account.