Azure Network Availability Issues
Atlas is unaffected by the Azure network availability issue.
This incident has been resolved.
Issues have been reported at Amazon Web Services (AWS) in the US-EAST-1 region. These problems are impacting multiple services that depend on AWS infrastructure. We are monitoring the situation. Check the AWS Status page for the latest updates: https://health.aws.amazon.com/health/status
This incident has been resolved.
Incident Date/Time: October 6–7, 2023
Duration: Approximately 2 days (partial impact observed intermittently over this period)
Impact:
Root Cause: Increased load on an internal backing database serving the Atlas and Cloud Manager monitoring systems. The load resulted from a combination of unsharded high-traffic collections concentrated on a single shard, inefficient query patterns, and spikes in resource consumption that coincided with a software rollout.
Status: Resolved
On October 6 and 7, MongoDB Atlas and Cloud Manager encountered temporary delays in metrics ingestion and backend disruptions affecting certain operational workflows. The primary contributors were elevated resource consumption and localized data distribution challenges in an internal database cluster supporting critical monitoring and operational systems.
Initial investigations pointed to high resource usage in one shard of the backing database cluster. However, further review revealed systemic inefficiencies, including:
Although the functionality and availability of customer clusters remained unaffected, customers experienced degraded monitoring performance and less timely Atlas dashboard updates. MongoDB implemented mitigation measures to stabilize the system and then addressed the long-term root causes to restore operational workflows.
Affected Services:
Customer Impact:
The incident resulted from several contributing factors:
Combined, these factors overwhelmed the affected shard and contributed to backend delays affecting metrics ingestion and operational requests.
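To illustrate why unsharded high-traffic collections can overwhelm one shard, the following is a minimal sketch, not MongoDB's actual sharding implementation. It assumes a hypothetical four-shard cluster and compares the request count per shard when every request for a collection lands on one shard with the count when requests are spread by a hash of the key. The names `shard_for`, `load_per_shard`, and the `host-N` keys are illustrative assumptions, not taken from the incident.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(keys, num_shards, sharded, primary_shard=0):
    """Count requests per shard for a single collection.

    Unsharded: every request goes to the one shard holding the collection.
    Sharded: requests are distributed by the hash of the key.
    """
    load = Counter({shard: 0 for shard in range(num_shards)})
    for key in keys:
        target = shard_for(key, num_shards) if sharded else primary_shard
        load[target] += 1
    return load


# Hypothetical workload: 10,000 requests, one per monitored host.
keys = [f"host-{i}" for i in range(10_000)]
unsharded = load_per_shard(keys, num_shards=4, sharded=False)
hashed = load_per_shard(keys, num_shards=4, sharded=True)

print("unsharded:", dict(unsharded))
print("hashed:   ", dict(hashed))
```

In the unsharded case a single shard carries the whole workload, so any additional pressure, such as inefficient queries or a rollout-related spike, lands on that same shard. With hash distribution each shard carries roughly a quarter of the requests in this sketch.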
MongoDB has identified several lasting improvements and implemented strategic fixes to prevent recurrence:
We apologize for the impact of this event on our customers. We are aware that this outage affected our customers' operations. MongoDB's highest priorities are security, durability, availability, and performance. We are committed to learning from this event and to updating our internal processes to prevent similar scenarios in the future.
This incident has been resolved.
This incident has been resolved. Upgrades should complete normally. Any affected clusters should have returned to a healthy state.