Service Disruption Caused by ElastiCache Automated Certificate Rotation
Summary
Based on our initial analysis with AWS, this incident was caused by an automated certificate update for the ElastiCache middleware.
Timeline
At 02:58 AM on October 29 (Beijing Time), AWS initiated an automated certificate update for our ElastiCache instances. During this process, the primary and replica nodes of the ElastiCache cluster experienced issues, preventing backend services from accessing the component.
Time of Impact: 02:59 - 03:22, 03:42 - 03:48 (Beijing Time)
Scope of Impact: Cloud Printing and MakerWorld
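As a quick check of the total downtime, the two impact windows above can be added up. This is a minimal sketch that assumes both windows fall on the same day and are given in the same time zone; the names IMPACT_WINDOWS and window_minutes are illustrative only.

```python
from datetime import datetime

# Impact windows as reported above (Beijing Time, same day).
IMPACT_WINDOWS = [("02:59", "03:22"), ("03:42", "03:48")]


def window_minutes(start: str, end: str) -> int:
    """Return the length of one HH:MM to HH:MM window in whole minutes."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# 23 minutes for the first window plus 6 minutes for the second.
total_minutes = sum(window_minutes(s, e) for s, e in IMPACT_WINDOWS)
print(total_minutes)  # 29
```

Under these assumptions, the two windows add up to 29 minutes of impact.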
Next Steps & Action Items
We have raised two critical issues with AWS Support:
The automated update occurred outside of our designated maintenance window.
The primary and replica nodes became unstable during the update process.
AWS Support has escalated these issues to their internal engineering team for a detailed root cause analysis. We will provide further updates as soon as we receive more information from AWS.
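To illustrate the first issue, the sketch below checks whether an event time falls inside a weekly maintenance window written in the ddd:hh24:mi-ddd:hh24:mi UTC format that ElastiCache uses for its preferred maintenance window. The window strings and the year in the example are placeholders, since this report states neither; in_maintenance_window is a hypothetical helper, not an AWS API.

```python
from datetime import datetime, timedelta, timezone

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
MINUTES_PER_DAY = 24 * 60


def _minute_of_week(day: str, hhmm: str) -> int:
    """Convert a day name and HH:MM string to minutes since Monday 00:00."""
    hours, minutes = hhmm.split(":")
    return DAYS.index(day) * MINUTES_PER_DAY + int(hours) * 60 + int(minutes)


def in_maintenance_window(event: datetime, window: str) -> bool:
    """True if a timezone-aware event falls inside a weekly UTC window."""
    start_s, end_s = window.lower().split("-")
    start = _minute_of_week(start_s[:3], start_s[4:])
    end = _minute_of_week(end_s[:3], end_s[4:])
    utc = event.astimezone(timezone.utc)
    at = utc.weekday() * MINUTES_PER_DAY + utc.hour * 60 + utc.minute
    if start <= end:
        return start <= at < end
    # The window wraps past the end of the week (for example sun:23:00-mon:01:30).
    return at >= start or at < end


beijing = timezone(timedelta(hours=8))
# 02:58 on October 29, Beijing Time; the year is a placeholder.
update_started = datetime(2024, 10, 29, 2, 58, tzinfo=beijing)
# Hypothetical window; the real one is not given in this report.
print(in_maintenance_window(update_started, "sun:18:00-sun:19:00"))
```

The comparison is done in UTC because the window format is defined in UTC, so the Beijing timestamp is converted before the check.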
Between 08:02 and 11:52 UTC+8, some users experienced intermittent issues accessing our cloud services.
After a joint investigation with our cloud provider, AWS, we have confirmed the root cause was network instability from the carrier, Cogent. Access requests routed through the Cogent network were subject to timeouts and packet loss.
Due to several recent incidents involving this provider, AWS has proactively rerouted traffic away from Cogent to alternative network paths. This action significantly mitigates the risk of similar disruptions in the future.
Due to a surge in traffic on makerworld.com at 17:55 UTC, users experienced slow loading times, and some pages did not display correctly. The issue lasted approximately 20 minutes.