Incident History

Metrics ingestion delays and delayed cluster operations

Executive Summary

Incident Date/Time: October 6–7, 2025
Duration: Approximately 2 days (partial impact observed intermittently over this period)
Impact:

Root Cause: Increased load on an internal backing database that serves the Atlas and Cloud Manager monitoring systems. The load resulted from a combination of unsharded high-traffic collections concentrated on a single shard, inefficient query patterns, and spikes in resource consumption that coincided with a software rollout.

Status: Resolved

What Happened

On October 6 and 7, MongoDB Atlas and Cloud Manager encountered temporary delays in metrics ingestion and backend disruptions affecting certain operational workflows. The primary contributors were elevated resource consumption and localized data distribution challenges in an internal database cluster supporting critical monitoring and operational systems.

Initial investigations pointed to high resource usage in one shard of the backing database cluster. However, further review revealed systemic inefficiencies, including:

Although the functionality and availability of customer clusters remained unaffected, customers experienced degraded monitoring performance and less timely Atlas dashboard updates. MongoDB implemented mitigation measures to stabilize the system and then addressed the long-term root causes to restore operational workflows.

Impact Assessment

Affected Services:

Customer Impact:

Root Cause Analysis

The incident resulted from several contributing factors:

Combined, these factors overwhelmed the affected shard and contributed to backend delays in metrics ingestion and operational requests.

Prevention

MongoDB has identified several lasting improvements and implemented strategic fixes to prevent recurrence:

  1. Collections experiencing concentrated load will be sharded to distribute traffic more evenly across multiple nodes, alleviating pressure on single shards.
  2. Inefficient queries are being optimized to improve resource utilization and reduce latency during routine operations.
  3. Additional infrastructure capacity has been provisioned to better handle elevated traffic volumes. Capacity planning processes are also being refined to anticipate future spikes in load.
  4. Processes for deploying updated versions of software are being redesigned to account for predictable increases in system resource demands during rollouts, ensuring smoother deployment.
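
The effect of the first item can be illustrated with a small simulation. This is a minimal sketch, not MongoDB's implementation: the shard count, the `host-N` keys, and the helper names are hypothetical, and the modulo-of-a-hash placement is a simplified stand-in for a hashed shard key (MongoDB assigns ranges of hashed values to shards rather than taking a modulo). It only shows why an unsharded collection concentrates all of its load on one shard while a hashed key spreads the load out.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def place_unsharded(_doc_id: str) -> int:
    """An unsharded collection lives entirely on one shard (shard 0 here)."""
    return 0


def place_hashed(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from a hash of the key (simplified stand-in for hashed sharding)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def load_per_shard(doc_ids, placement) -> Counter:
    """Count how many documents (and so how much traffic) land on each shard."""
    return Counter(placement(doc_id) for doc_id in doc_ids)


# Hypothetical keys standing in for documents in a high-traffic collection.
doc_ids = [f"host-{i}" for i in range(10_000)]

unsharded = load_per_shard(doc_ids, place_unsharded)
hashed = load_per_shard(doc_ids, place_hashed)

print("unsharded:", dict(unsharded))
print("hashed:   ", dict(sorted(hashed.items())))
```

In the unsharded case every document maps to the same shard, which mirrors the concentrated load described in the root cause. With the hashed placement, each shard receives roughly a quarter of the documents.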

Next Steps

Conclusion

We apologize for the impact of this event on our customers. We are aware that this outage affected our customers' operations. MongoDB's highest priorities are security, durability, availability, and performance. We are committed to learning from this event and to updating our internal processes to prevent similar scenarios in the future.

1759844568 - 1759867402 Resolved

Some Atlas Cluster Operations Delayed

This incident has been resolved.

1759779204 - 1759792742 Resolved

Flex to Dedicated upgrades are currently failing and resulting in the Dedicated cluster having no monitoring

This incident has been resolved. Upgrades should complete normally. Any affected clusters should return to a healthy state.

1759428795 - 1759441567 Resolved

Atlas UI Project Overview Page Infinite Loading

This incident has been resolved.

1759342341 - 1759355511 Resolved

MongoDB Atlas

This incident has been resolved.

1758647665 - 1758656192 Resolved

MongoDB Atlas Stream Processing and Query Shape metrics gap

From approximately 07:00 to 08:45 UTC on 2025-09-17, metrics for Atlas Stream Processing were not being processed. This affected viewing the metrics in the Atlas UI, querying them via the Admin API, or viewing them in Datadog. Additionally, Query Shape metrics were not provided during this time. Affected users will see gaps in their metric data during those times. We will not be backfilling this metric data.

1758289720 - 1758289720 Resolved

Delayed cluster modifications in Azure East US 2

This incident has been resolved.

1757507687 - 1757535839 Resolved

Elevated Azure Capacity Errors

This incident has been resolved.

1757007371 - 1757018579 Resolved

Excessive Project Maintenance Notifications in some Atlas projects

This incident has been resolved.

1756601732 - 1756631130 Resolved

Atlas for Government Partial Outage

This incident has been resolved.

1756426386 - 1756428039 Resolved