Elevated Azure Capacity Errors
This incident has been resolved.
We have been monitoring the incident since 21:43 UTC and see no evidence that it will recur. The metrics delay from MongoDB Atlas Streams to Datadog is now within tolerances, so we are treating this incident as Resolved.
On August 11, 2025, at 14:05 UTC, we implemented a planned infrastructure change to add additional IP addresses to our control plane NAT gateways. Although we communicated this change in advance on June 30, 2025, we acknowledge that our communication and the resulting customer preparations were not sufficient to prevent service disruptions for customers with IP access restrictions. Upon detecting the customer impact, we initiated a rollback of the infrastructure change to restore service and prevent further disruption. The rollback was completed by 16:19 UTC, returning all affected services to their previous operational state.
Affected Services: Atlas Clusters with BYOK encryption, Atlas login/signup, App Services, MongoDB Charts
Geographic Scope: Global
Customer Impact:
Peak Impact Period: 14:05-16:19 UTC
The addition of new IP addresses to our control plane caused network access failures for customers whose configured IP allowlists did not include the new addresses. (Note that this is not the same as the cluster IP allowlist, which you use to control connections to your cluster.)
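As an illustration of how such a gap can be checked ahead of a change window, the short Python sketch below compares a set of allowlisted CIDR blocks against published control plane address ranges and flags any range that is not covered. All addresses in the sketch are placeholders, not the actual Atlas control plane IPs; the current address list should always be taken from MongoDB's official documentation.

import ipaddress

# Placeholder values only -- not the real Atlas control plane ranges.
published_control_plane_ranges = [
    "198.51.100.0/24",
    "203.0.113.32/27",
]

# Placeholder customer allowlist (e.g., corporate egress firewall rules).
customer_allowlist = [
    "198.51.100.0/24",
    "192.0.2.0/24",
]

allowed = [ipaddress.ip_network(cidr) for cidr in customer_allowlist]

def is_covered(published_range: str) -> bool:
    # True if every address in the published range falls inside an allowed rule.
    net = ipaddress.ip_network(published_range)
    return any(net.subnet_of(rule) for rule in allowed)

for cidr in published_control_plane_ranges:
    status = "covered" if is_covered(cidr) else "MISSING from allowlist"
    print(f"{cidr}: {status}")

Running a check like this against announced address additions before the change takes effect would surface exactly which firewall or allowlist rules need to be updated.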
The primary issues were:
Immediate fixes (Already implemented):
Next steps:
We acknowledge the significant disruption this incident caused and its impact on your applications and business operations. We are committed to preventing similar issues in the future. Although we communicated the upcoming IP changes in advance, we take full responsibility for the conditions that led to these failures. We are implementing the improvements outlined in this postmortem and will continue to invest in more resilient infrastructure change processes to ensure the reliability and stability of our services.