Incident History

MongoDB Cloud: SSO logins fail for new users

This incident has been resolved.

1750178983 - 1750187926 Resolved

Intermittent Restore Failures on Atlas

While some restores from snapshots stored in Azure or GCP may have taken longer to start than usual, no complete failures were observed. Cluster health was unaffected during this time.

1749896502 - 1749920851 Resolved

Elevated failures from GCP

This incident has been resolved.

1749753534 - 1749771698 Resolved

Overly frequent maintenance notifications

We have remediated the issue causing overly frequent maintenance notifications.

1749673330 - 1749678372 Resolved

MongoDB Support for Atlas for Government: New users not able to create cases

This incident has been resolved.

1749231844 - 1749242075 Resolved

MongoDB Atlas: Cluster operation delays and UI timeouts

Summary of Atlas Outage on June 4th, 2025

This document describes an Atlas control plane service disruption that occurred on June 4th, 2025.

Issue Summary

Between 17:05 GMT and 22:54 GMT on June 4th, the MongoDB Atlas control plane experienced a service disruption due to a DNS misconfiguration. Customers were unable to make configuration changes across a range of MongoDB services, including Atlas databases, App Services and Device Sync, Atlas Data Federation, Stream Processing, Backup/Restore, and Atlas Search. The core data plane remained operational, and customer workloads continued uninterrupted. However, during the time of the disruption, customer clusters could not be managed via the UI, Admin API, or the auto-scaling system. Similarly, customers were unable to change network configuration, modify projects, or add/remove database users during this time. The Atlas Web UI was unavailable during a portion of this outage. 

This incident was the result of a DNS configuration change that impacted communication within Atlas's internal metadata servers. These servers rely on recursive DNS resolution through name servers on the public Internet. An authorized operator executed a planned update to a DNS nameserver record that was believed to be unused; however, that belief was based on an incorrect internal configuration source. The resulting loss of communication between our metadata servers in turn disrupted most operations against the Atlas control plane.

The operator detected the misconfiguration within minutes, and we immediately rolled back the offending change. However, recovery was delayed for several reasons. Because the top-level DNS records have a Time To Live (TTL) of two days, rolling back the misconfiguration alone did not resolve the problem: resolvers continued serving the stale record from cache. We attempted multiple mitigations, including flushing local DNS caches and redirecting to an alternate resolver. After these proved unsuccessful, we asked our upstream DNS provider to flush the offending DNS records from its long-term cache, which fixed the immediate connectivity problem. Partial recovery was immediate; it took roughly another 60 minutes for all services to resume normal operations as queued work was processed.
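Purely as an illustration of why the rollback alone could not take effect quickly, the sketch below shows how the TTL advertised for a zone's NS records can be inspected before a change; until that TTL expires or caches are flushed, resolvers keep serving the old record. It uses the third-party dnspython package and a placeholder zone name, neither of which reflects MongoDB's internal tooling.

```python
# Hypothetical sketch, not MongoDB's internal tooling: inspect the TTL that
# resolvers will cache for a zone's NS records. A long TTL (e.g. two days)
# means a bad record can persist in resolver caches well after a rollback.
import dns.resolver  # third-party package: dnspython

def ns_record_ttl(zone: str) -> tuple[list[str], int]:
    """Return the zone's NS targets and the TTL attached to that record set."""
    answer = dns.resolver.resolve(zone, "NS")
    nameservers = [str(rr.target) for rr in answer]
    return nameservers, answer.rrset.ttl  # the TTL applies to the whole RRset

if __name__ == "__main__":
    servers, ttl = ns_record_ttl("example.com")  # placeholder zone
    print(f"NS records: {servers}")
    print(f"TTL: {ttl}s (~{ttl / 3600:.1f} hours)")
```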

We are taking a set of corrective actions based on this event. First, we will modify our operational tooling to enforce additional safety checks, especially for changes that modify a DNS top-level domain. Second, we are enhancing our existing internal review process for DNS configuration changes. These reviews will include additional testing of such changes in a controlled environment and gating mechanisms to reduce blast radius.
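As one example of the kind of safety check described above, the following sketch refuses a deletion if live DNS still shows the zone delegating to a nameserver that an internal inventory marks as unused. It again uses dnspython with hypothetical zone and nameserver names; it is not MongoDB's actual review or gating mechanism.

```python
# Hypothetical guard rail, not MongoDB's actual tooling: before removing a
# nameserver record that an internal inventory marks as unused, confirm
# against live DNS that the zone no longer delegates to it.
import dns.resolver  # third-party package: dnspython

def safe_to_remove_nameserver(zone: str, nameserver: str) -> bool:
    """Return True only if live DNS shows no delegation from zone to nameserver."""
    answer = dns.resolver.resolve(zone, "NS")
    live = {str(rr.target).rstrip(".").lower() for rr in answer}
    return nameserver.rstrip(".").lower() not in live

if __name__ == "__main__":
    zone, ns = "example.com", "ns1.example-dns.net"  # placeholder names
    if safe_to_remove_nameserver(zone, ns):
        print(f"{ns} no longer serves {zone}; change may proceed to review.")
    else:
        print(f"Refusing change: {ns} still appears in {zone}'s live NS records.")
```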

Conclusion

We apologize for the impact of this event on our customers. We are aware that this outage had an impact on our customers' operations. MongoDB's highest priorities are security, durability, availability, and performance. We are committed to learning from this event and to updating our internal processes to prevent similar scenarios in the future.

1749058249 - 1749072513 Resolved

MongoDB Atlas: Datadog metrics failed to import to EU region

From approximately 13:40 to 17:20 UTC on 3 June 2025, MongoDB Atlas was unable to send metrics to Datadog's EU region for some Atlas Projects. Affected users who use Datadog's EU servers may find gaps in fewer than half of their metrics during this time. These gaps will not be backfilled.

Cluster health was unaffected; however, alerts may have fired within Datadog, which we do not have visibility into.

1748975793 - 1748975793 Resolved

Accounts Locked Out

Incident Summary

Between 16:00 UTC and 19:36 UTC on May 28th, 2025, emails were sent to users in affected organizations notifying them that their organization had been moved into a locked status, and users in all affected organizations were restricted from performing any actions on their organization.

Root Cause

The root cause of this incident was an outdated internal process that was inadvertently re-enabled during a database migration.

MongoDB Actions

The MongoDB Atlas Billing team has deleted the outdated process and increased alerting on changes in dunning statuses.

Recommended Customer Actions

This issue was fully addressed by the MongoDB Atlas Billing team and does not require customer action.

1748455862 - 1748469015 Resolved

Delays in Atlas cluster creation, modification, and scheduled backups

This incident has been resolved.

1747679560 - 1747691703 Resolved

MongoDB Atlas: False host down alerts were sent

Between 18:37 and 21:17 UTC on 16 May 2025, MongoDB Atlas sent false Host Down alerts to a small portion of our users. Cluster health was unaffected during this time, and we sincerely apologize for the confusion.

1747422959 - 1747422959 Resolved