Incident History

Reconciling Kubernetes Clusters

As of 18:55 UTC, our Engineering team has confirmed full resolution of the issue that impacted Managed Kubernetes Clusters in all of our regions. All services should now be operating normally.

If you continue to experience problems, please open a ticket with our Support team from within your Cloud Control Panel.

Thank you for your patience, and we apologize for any inconvenience.

Jan 11, 2025, 18:29 - 21:12 UTC (Resolved)

Retro - Multiple Services and API

From 15:28 to 15:32 UTC, an internal service disruption may have caused users to experience errors when using the Cloud Panel or API to manage Spaces Buckets, Apps, Managed Database Clusters, Load Balancers, and other resources, due to impacted downstream services.

If you continue to experience problems, please open a ticket with our support team.

Thank you and we apologize for any inconvenience.

Jan 9, 2025, 18:01 UTC (Resolved)

Network Connectivity in SFO3 Region

Postmortem

We regularly post updates on incidents to cover the details of our findings, learnings, and corresponding actions to prevent recurrence. This update covers a recent incident related to network maintenance. Maintenance, through scheduled upgrades, patches, and more, is an essential part of ensuring service stability at DigitalOcean. We plan maintenance activities to avoid interruptions, and we recognize the impact that downtime can have on our customers. We apologize for the disruption that occurred and have identified action items to ensure that maintenance activities are planned with minimal downtime and that we have plans to minimize impact even when unforeseen errors occur. We go into detail below.

Incident Summary

On Jan. 7, 2025, at 12:10 UTC, DigitalOcean experienced a loss of network connectivity for Regional External Load Balancers, DOKS, and AppPlatform products in the SFO3 region. The impact was an unexpected side effect of scheduled maintenance work to upgrade our infrastructure and enhance network performance. The network change was designed to be seamless, with the worst expected case being dropped packets. The resolution rollout began Jan. 7 at 13:15 UTC and was fully completed, with all services reported operational, by 14:30 UTC.

Incident Details

The DigitalOcean Networking team began scheduled maintenance to enhance network performance at 10:00 UTC. As part of this maintenance, a routing change was rolled out at 12:10 UTC to redirect traffic to a new path in the datacenter. However, a stale routing configuration on the core switches was not updated, which resulted in traffic being dropped for products relying on Regional External Load Balancers (e.g. Droplets, DOKS, AppPlatform). As soon as we detected the drop in traffic, we reverted the change to mitigate the impact.
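To illustrate the class of check that can catch a stale entry like this before traffic is flipped, here is a minimal, hypothetical pre-cutover validation sketch in Python. It is not DigitalOcean's actual tooling: the switch names, the route representation, and the fetch_routes stub are assumptions for illustration. The idea is to compare what each core switch actually reports against the intended post-change state, and to block the cutover if any intended route is missing or any stale route remains.

    # Hypothetical pre-cutover validation sketch. Switch names, the route
    # format, and fetch_routes() are illustrative assumptions; a real check
    # would query devices (e.g. via NETCONF/gNMI) rather than use a stub.

    INTENDED = {("10.0.0.0/8", "new-spine-path")}   # required after the change
    STALE = {("10.0.0.0/8", "old-core-path")}       # leftover config that must be gone

    def fetch_routes(switch):
        # Stub: simulate one core switch that still holds the old configuration.
        if switch == "core-sw-2":
            return {("10.0.0.0/8", "old-core-path")}
        return {("10.0.0.0/8", "new-spine-path")}

    def preflight(switches):
        """Return True only if every switch matches the intended state."""
        ok = True
        for sw in switches:
            routes = fetch_routes(sw)
            for route in INTENDED - routes:
                print(f"BLOCK {sw}: missing intended route {route}")
                ok = False
            for route in STALE & routes:
                print(f"BLOCK {sw}: stale route still present {route}")
                ok = False
        return ok

    # core-sw-2 fails both checks, so the cutover would be blocked.
    print("safe to cut over:", preflight(["core-sw-1", "core-sw-2"]))

Run against the stubbed data above, the simulated core-sw-2 is flagged on both conditions, which is precisely the state that caused the drop in this incident.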

Timeline of Events

12:12 UTC - Flip to the new workflow was started and the impact began 

12:40 UTC - Alert fired for Load Balancers

13:00 UTC - First report from customers was received

13:13 UTC - Revert of the new code was started

13:21 UTC - Incident was spun up

14:19 UTC - Routing configuration was updated on Core Switches 

14:30 UTC - Code revert was completed and service functionality was restored

Remediation Actions

Following a full internal postmortem, DigitalOcean engineers identified several learnings to prevent similar incidents. These include updating maintenance processes, increasing monitoring and alerting, and improving the observability of network services.

Key Learnings:

  1. Learning: We identified gaps in our automated validation of the network configuration set up for this enhancement's rollout. While our new, upgraded network results in a simpler state, we have inherited interim complexity as we transition from old to new.

  2. Learning: We identified gaps in our incremental rollout process for network datapath changes, both pre- and post-deployment. The process allowed quicker rollouts for small changes that passed validation, but we recognize that "small changes" can have a high impact (a simplified sketch of such a deployment gate follows this list).
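To make the second learning concrete, here is a minimal, hypothetical sketch of a pre-/post-deployment gate for incremental network rollouts. It is not DigitalOcean's real process; the packet-loss probe, the apply/revert helpers, the device names, and the threshold are all illustrative assumptions. Each device's change is judged against a before/after health measurement, and any regression halts the rollout before the blast radius grows.

    # Hypothetical rollout-gate sketch. The probe, apply/revert helpers, and
    # threshold are illustrative assumptions, not DigitalOcean's actual tooling.
    import random

    DROP_THRESHOLD = 0.01  # halt if packet loss rises by more than one point

    def packet_loss(device):
        # Stub probe: a real gate would measure loss against live traffic.
        return random.uniform(0.0, 0.005)

    def apply_change(device):
        print(f"applying change on {device}")

    def revert_change(device):
        print(f"reverting change on {device}")

    def rollout(devices):
        for device in devices:
            baseline = packet_loss(device)   # pre-deployment check
            apply_change(device)
            after = packet_loss(device)      # post-deployment check
            if after - baseline > DROP_THRESHOLD:
                revert_change(device)        # stop before impact spreads
                raise RuntimeError(f"regression on {device}; rollout halted")
            print(f"{device}: healthy (loss {after:.4f}), continuing")

    rollout(["core-sw-1", "core-sw-2", "core-sw-3"])

Gating each step on a measured signal, rather than on the perceived size of the change, addresses the point that "small changes" can have a high impact.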

Jan 7, 2025, 13:35 - 16:43 UTC (Resolved)

Error adding card details

Our engineering team has resolved the issue preventing some users from adding new credit cards from the control panel. If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

Jan 7, 2025, 13:24 - 14:14 UTC (Resolved)

Droplet Event Processing in Multiple Regions

From 1:23 to 3:04 UTC, users may have experienced events being stuck or delayed, such as powering Droplets on and off and resizing Droplets, in the NYC3, AMS2, BLR1, SGP1, and SYD1 regions. Additionally, Managed Database cluster creation was delayed in all regions.

Our Engineering team has confirmed full resolution of the issue; delayed and new events should now complete as normal.

Thank you for your patience through this issue. If you continue to experience any issues, please open a support ticket from within your account.

Dec 19, 2024, 01:45 - 05:03 UTC (Resolved)

Retro - VPC Connectivity Issue in NYC3

From 13:25 to 14:45 UTC, our Engineering team observed a networking issue in our NYC3 region. During this time, users may have experienced Droplet and VPC connectivity issues. Users should no longer be experiencing these issues. We apologize for the inconvenience. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.

Dec 13, 2024, 15:39 UTC (Resolved)

Spaces CDN

Our Engineering team has confirmed that the issues with Spaces CDN functionality have been fully resolved. Users should now be able to use CDN functionality normally.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

Dec 13, 2024, 09:55 - 11:01 UTC (Resolved)

DNS Resolution in NYC1

Our Engineering team has confirmed that the issues with Authoritative DNS resolution in NYC1 have been fully resolved. DNS queries should now be resolving normally.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

Dec 12, 2024, 23:20 UTC - Dec 13, 2024, 01:24 UTC (Resolved)

Managed Database Operations

Our Engineering team has confirmed the full resolution of the issue impacting Managed Database Operations, and all systems are now operating normally. Users may safely resume operations, including upgrades, resizes, forking, and ad-hoc maintenance patches.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

Dec 12, 2024, 18:34 UTC - Dec 13, 2024, 04:05 UTC (Resolved)

Cloud Control Panel Logins for a Subset of Users

Our Engineering team has resolved the issue prohibiting a subset of users from logging in, and the login flow is now operating normally.

If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.

Dec 11, 2024, 19:33 - 23:33 UTC (Resolved)