Elevated Azure Capacity Errors
This incident has been resolved.
Incident Date/Time: October 6–7, 2023
Duration: Approximately 2 days (partial impact observed intermittently over this period)
Impact:
Root Cause: Increased load on an internal backing database servicing Atlas and Cloud Manager monitoring systems due to a combination of unsharded high-traffic collections concentrated on a single shard, inefficient query patterns, and spikes in resource consumption coinciding with a software rollout.
Status: Resolved
On October 6 and 7, MongoDB Atlas and Cloud Manager encountered temporary delays in metrics ingestion and backend disruptions affecting certain operational workflows. The primary contributors were elevated resource consumption and localized data distribution challenges in an internal database cluster supporting critical monitoring and operational systems.
Initial investigations pointed to high resource usage in one shard of the backing database cluster. However, further review revealed systemic inefficiencies, including:
Although the functionality and availability of customer clusters remained unaffected, customers experienced degraded monitoring performance and less timely Atlas dashboard updates. MongoDB first implemented mitigation measures to stabilize the system, then addressed the underlying root causes to restore operational workflows.
Affected Services:
Customer Impact:
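The incident resulted from several contributing factors: unsharded high-traffic collections concentrated on a single shard, inefficient query patterns, and spikes in resource consumption coinciding with a software rollout.
Combined, these factors overwhelmed the affected shard and contributed to backend delays in metrics ingestion and operational requests.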
MongoDB has identified several lasting improvements and implemented strategic fixes to prevent recurrence:
We apologize for the impact of this event on our customers. We are aware that this outage affected our customers' operations. MongoDB's highest priorities are security, durability, availability, and performance. We are committed to learning from this event and to updating our internal processes to prevent similar scenarios in the future.
This incident has been resolved. Upgrades should now complete normally, and any affected clusters should have returned to a healthy state.
From approximately 07:00 to 08:45 UTC on 2025-09-17, metrics for Atlas Stream Processing were not processed. This affected viewing the metrics in the Atlas UI, querying them via the Admin API, and viewing them in Datadog. Query Shape metrics were also unavailable during this window. Affected users will see gaps in their metric data for this period. We will not be backfilling this metric data.
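Because the missing data will not be backfilled, consumers of these metrics may want to mark the window explicitly rather than treat it as zero activity. The following is a minimal sketch, assuming you already hold your metric timestamps as timezone-aware Python datetimes. The names `GAP_START`, `GAP_END`, `in_metrics_gap`, and `split_by_gap` are hypothetical and not part of any Atlas API, and the window boundaries are the approximate times stated above.

```python
from datetime import datetime, timezone

# Approximate window from the incident notice (UTC). These constants are
# illustrative assumptions, not values exposed by any Atlas API.
GAP_START = datetime(2025, 9, 17, 7, 0, tzinfo=timezone.utc)
GAP_END = datetime(2025, 9, 17, 8, 45, tzinfo=timezone.utc)


def in_metrics_gap(ts: datetime) -> bool:
    """Return True if a timezone-aware timestamp falls inside the gap window."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return GAP_START <= ts <= GAP_END


def split_by_gap(timestamps):
    """Partition timestamps into (inside_gap, outside_gap) lists."""
    inside = [t for t in timestamps if in_metrics_gap(t)]
    outside = [t for t in timestamps if not in_metrics_gap(t)]
    return inside, outside
```

Timestamps flagged this way can be annotated or excluded in downstream dashboards and alerts so that the gap is not mistaken for a real drop in activity.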