From 23:00 UTC on October 10th until 13:28 UTC on October 11th, users may have encountered intermittent errors when accessing Spaces endpoints or using the Container Registry in the SYD1 region. Our Engineering team has resolved the issue affecting Spaces and the Container Registry.
All services have been fully restored and are now functioning normally. If you continue to experience any problems, please open a ticket with our support team. We apologize for any inconvenience caused.
As of 17:05 UTC, our Engineering team has resolved the issue affecting new account sign-ups. Users should no longer experience errors and are now able to complete the sign-up process successfully.
If you continue to experience problems, please open a ticket with our support team. We apologize for any inconvenience.
From 18:22 UTC to 19:13 UTC, users may have experienced issues or errors when attempting to create or modify DigitalOcean services deployed in the SYD1 region, as well as when attempting to create or manage Volumes globally.
Our Engineering team has confirmed the full resolution of this issue. If you continue to experience problems, please open a ticket with our support team. Thank you for your patience, and we apologize for any inconvenience.
On September 25, 2024 at 22:25 UTC, DigitalOcean experienced a reduction of datacenter capacity in SFO3 that impacted the availability of select DigitalOcean services. A majority of the line cards on one of our core routers in SFO3 rebooted at the same time, interrupting inter-regional traffic and dropping traffic to the network backbone. This issue impacted users of any DigitalOcean services in the SFO3 region, with a longer impact on select Managed Kubernetes Clusters (DOKS).
Incident Details
Networking
Root Cause: Several line cards rebooted at the same time on one of the core routers, due to hardware errors on the network device.
Impact: Datacenter traffic capacity to/from the backbone was reduced by half during this incident. Network connectivity to some DigitalOcean services was affected.
Response: All of the crashed line cards came online quickly, allowing network traffic to begin flowing again and the core router to become operational.
Specific Impact on DOKS
Root Cause: Due to the network issues from line card reboots, a number of DOKS fleet machines became unhealthy as guest networking failed to recover from the hardware fault.
Impact: Some customer clusters experienced connectivity issues and difficulty in accessing the K8s control plane until networking for the underlying nodes was restored.
Response: All affected nodes across the SFO3 region in the DOKS fleet were recycled.
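For customers who want to confirm that their cluster's nodes have recovered after an event like this, the sketch below lists nodes whose Ready condition is not True, using the Kubernetes Python client. It is illustrative only and assumes a kubeconfig for the cluster has already been downloaded; it is not the tooling our Engineering team used to recycle the affected nodes.

```python
# Minimal sketch: list nodes whose Ready condition is not "True".
# Assumes a kubeconfig for the DOKS cluster is already downloaded
# (for example via the control panel); nothing here is specific to
# the affected SFO3 clusters.
from kubernetes import client, config

config.load_kube_config()  # uses the current kubeconfig context
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    if ready != "True":
        print(f"{node.metadata.name}: Ready={ready}")
```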
Timeline of Events
Sep 25 22:21 - A large majority of line cards rebooted on the core router.
Sep 25 22:24 - Line cards came back online.
Sep 25 22:25 - Network protocols started session establishment process.
Sep 25 22:30 - Traffic on the affected core router was restored.
Sep 25 22:50 - SFO3 control plane systems all reconnected and recovered.
Sep 25 23:07 - DOKS API servers degraded.
Sep 25 23:59 - Some DOKS clusters in the SFO3 region could not be scraped. Several nodes were discovered to be in a “not ready” state.
Sep 26 01:40 - All impacted DOKS nodes recycled and clusters are operational.
Remediation Actions
DigitalOcean teams are working on multiple types of remediation to help prevent a similar incident from happening in the future.
DigitalOcean is working with the vendor support team for the devices to determine the root cause of the line card crash, and is upgrading software on the core routers in the SFO3 region.
During the incident, engineers had to manually remediate affected nodes across the entire SFO3 DOKS fleet to restore service. Teams are exploring ways to reduce the need for manual action in the future, such as increasing the thresholds for automated remediation actions, so that service is restored as quickly as possible.
Our Engineering team has identified and resolved an issue that impacted the ability to resize Droplets via both the API and UI from 18:15 until 21:55 UTC. During this time, users might have experienced errors when attempting to resize their Droplets through the API or the UI.
Additionally, while working to resolve the resize issue, a secondary issue affected all event processing and some API calls for Droplets and related services from 21:50 until 22:00 UTC.
Our Engineering team took swift action to restore full functionality, and everything is now operating normally.
We apologize for any inconvenience this may have caused. If you have any questions or continue to experience issues, please reach out via a Support ticket on your account.
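For reference, resizes of the kind affected during this window are normally issued through the Droplet actions endpoint of the public API. The sketch below is illustrative only; the Droplet ID and size slug are placeholders, and DIGITALOCEAN_TOKEN is assumed to hold a valid API token with write access.

```python
# Illustrative sketch of a Droplet resize request against the public API.
# The Droplet ID and size slug are placeholders; DIGITALOCEAN_TOKEN must
# hold a valid API token with write access.
import os
import requests

DROPLET_ID = 123456789          # placeholder Droplet ID
TARGET_SIZE = "s-2vcpu-4gb"     # placeholder size slug

resp = requests.post(
    f"https://api.digitalocean.com/v2/droplets/{DROPLET_ID}/actions",
    headers={"Authorization": f"Bearer {os.environ['DIGITALOCEAN_TOKEN']}"},
    json={"type": "resize", "size": TARGET_SIZE},
)
resp.raise_for_status()
print(resp.json()["action"]["status"])  # e.g. "in-progress"
```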