Incident History

Elevated Azure Capacity Errors

This incident has been resolved.

September 4, 2025, 17:36 - 20:42 UTC Resolved

Excessive Project Maintenance Notifications in some Atlas projects

This incident has been resolved.

August 31, 2025, 00:55 - 09:05 UTC Resolved

Atlas for Government Partial Outage

This incident has been resolved.

August 29, 2025, 00:13 - 00:40 UTC Resolved

MongoDB Atlas: Cluster metrics page not visible to Read Only users

This incident has been resolved.

August 28, 2025, 14:50 - 17:04 UTC Resolved

Capacity Constrained Azure regions eastus, eastus2, germanywestcentral and more

This incident has been resolved.

August 27, 2025, 22:23 UTC - August 29, 2025, 19:43 UTC Resolved

MongoDB Atlas: Atlas Streams Metrics to Datadog delayed

We have been monitoring the incident since 21:43 UTC and see no evidence that it will recur. At this point, the metrics delay from MongoDB Atlas Streams to Datadog is within tolerance, so we are treating this incident as Resolved.

August 26, 2025, 18:50 UTC - August 27, 2025, 00:27 UTC Resolved

MongoDB Atlas: Signing up for AWS Marketplace failing

This incident has been resolved.

August 22, 2025, 15:06 - 16:44 UTC Resolved

Atlas Azure cluster new snapshot creation failure (EUROPE_WEST)

This incident has been resolved.

August 21, 2025, 00:03 - 00:35 UTC Resolved

Some AWS us-east-1 free tier (M0 instance size) clusters in degraded state

This incident has been resolved.

August 18, 2025, 07:27 UTC - August 19, 2025, 00:25 UTC Resolved

Issue with Azure clusters with KeyVault enabled

Executive Summary

What Happened

On August 11, 2025, at 14:05 UTC, we implemented a planned infrastructure change to add additional IP addresses to our control plane NAT gateways. Although we communicated this change in advance on June 30, 2025, we acknowledge that our communication and the resulting customer preparations were not sufficient to prevent service disruptions for customers with IP access restrictions. Upon detecting the customer impact, we initiated a rollback of the infrastructure change to restore service and prevent further disruptions. The rollback was completed by 16:19 UTC, returning all affected services to their previous operational state.

Impact Assessment

Root Cause Analysis

The addition of new IP addresses to our control plane caused network access failures for customers who had configured IP allowlists that did not include the new addresses. (Note that this is not the same as the cluster IP allowlist, which you would use to control how clients connect to your cluster.)

The primary issues were:

  1. BYOK Encryption Validation: Our key validation process (which runs every 15 minutes) failed on operations originating from the new IPs. Due to a flaw in our error-handling logic, the system misinterpreted these network failures as an intentional revocation of access to the encryption key and automatically shut down the affected clusters (see the sketch after this list).
  2. Identity Provider: The new IP addresses were not allowlisted in our identity provider, which degraded registrations and logins until those IPs were added.
  3. Service Authentication: App Services experienced partial service failures because requests to the Atlas API originating from Triggers' new IP addresses were blocked by an outdated internal IP allowlist.
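
To make the error-handling flaw in item 1 concrete, the following is a minimal sketch of a periodic key-validation check that distinguishes an explicit key revocation from a network failure (such as a request from a source IP that is not allowlisted). All names here (check_key_access, KeyAccessDenied, NetworkError) are hypothetical illustrations, not Atlas internals.

    # Hypothetical sketch of a safer BYOK key-validation loop. All names are
    # illustrative assumptions, not actual MongoDB Atlas internals.

    class KeyAccessDenied(Exception):
        """The key-management service explicitly refused access to the key."""

    class NetworkError(Exception):
        """The key-management service could not be reached at all."""

    def validate_customer_key(check_key_access, max_retries=3):
        """Return True if the key is usable, False only on an explicit denial.

        A network failure (for example, a request from a source IP that is not
        on the key vault's allowlist) is retried and then escalated to
        operators, rather than being treated as an intentional revocation that
        would shut the cluster down.
        """
        for _ in range(max_retries):
            try:
                check_key_access()
                return True
            except KeyAccessDenied:
                # Explicit revocation: shutting down the cluster is intended.
                return False
            except NetworkError:
                # Ambiguous failure: do not assume revocation; retry instead.
                continue
        # Still inconclusive after retries: alert a human instead of acting.
        raise RuntimeError("key validation inconclusive; escalating to operators")

The point of the sketch is that only an explicit denial from the key-management service is treated as a revocation; an unreachable network never triggers an automatic cluster shutdown on its own.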

Prevention

Immediate fixes (Already implemented):

Next steps

Conclusion

We acknowledge the significant disruption this incident caused and its impact on your applications and business operations. We are committed to preventing similar issues in the future. Although we communicated the upcoming IP changes in advance, we take full responsibility for the conditions that led to these failures. We are implementing the improvements outlined in this postmortem and will continue to invest in more resilient infrastructure change processes to ensure the reliability and stability of our services.

August 11, 2025, 15:53 - 17:22 UTC Resolved