Incident Postmortem - Some customers are unable to interact with the admin console
Date of Incident: 2025-09-24
Time of Incident (UTC): 02:27 - 17:16
Service(s) Affected: Admin console, Sign in
Impact Duration: 36:49
Summary
Some customers with certain account configurations were placed on a blocklist and presented with a 403 error page after accessing the admin console.
Impact on Customers
Admin console: Affected customers were presented with a 403 error page whenever they tried to interact with any of the admin console pages.
Sign in: Affected customers were also unable to sign in to the application.
Number of Affected Customers (approximate): 515
Geographic Regions Affected (if applicable): All regions
What Happened?
Timeline of Events (UTC):
Sep 24th 2:27am: A spike in application monitoring alerted engineers to increased rates of IP blocking.
Sep 24th 3:00am: Cause identified as a change to requests in the application, which had been partially rolled out via a feature flag.
Sep 24th 4:03am: The feature flag was enabled for all customers, which reduced the spike, but IP blocks continued throughout the day.
Sep 24th 10:03pm: Merged an application change that reverted the original change to prevent the issue from recurring.
Sep 25th 5:03pm: The change was deployed with the scheduled application release, and the error rate dropped off shortly after.
Root Cause Analysis: The issue was caused by GET requests to the Users API exceeding the URL length limit. A recent change, intended to resolve customer-reported performance issues, appended a list of UUIDs to the request parameters.
Contributing factors:
Requests had been switched from GET to POST to keep them under the URL length limit. However, an issue with the feature flag configuration caused UUIDs to still be sent to the GET endpoint.
An underlying issue caused the feature flag not to resolve as expected in the application.
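The GET versus POST choice described above can be illustrated with a length-based guard. This is a minimal sketch under stated assumptions, not the change 1Password shipped: the function name, the `uuids` parameter name, the example URL, and the 2048-character limit are all hypothetical, and real URL limits vary by browser, proxy, and server.

```python
from urllib.parse import urlencode

# Hypothetical limit, for illustration only; real limits vary.
MAX_URL_LENGTH = 2048


def build_users_request(base_url, uuids):
    """Return (method, url, body) for a hypothetical user lookup.

    Uses GET while the full URL stays within MAX_URL_LENGTH and
    falls back to POST with the UUIDs in the body otherwise.
    """
    query = urlencode({"uuids": ",".join(uuids)})
    get_url = f"{base_url}?{query}"
    if len(get_url) <= MAX_URL_LENGTH:
        return "GET", get_url, None
    # Too long for a URL: send the list in the request body instead.
    return "POST", base_url, {"uuids": list(uuids)}


if __name__ == "__main__":
    many = ["a" * 26] * 200  # placeholder identifiers, not real UUIDs
    method, url, body = build_users_request("https://example.invalid/api/users", many)
    print(method, len(url))
```

A guard like this does not depend on a feature flag resolving correctly, which is the design point it is meant to illustrate.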
How Was It Resolved?
Mitigation Steps: Customers were manually removed from the blocklist at multiple points in time as we evaluated the root cause and worked to patch the root issue.
Resolution Steps: The issue was mitigated by removing UUIDs at the API level if a GET request is used. Additional logging has been added to identify the root cause of the feature flag configuration issue.
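One way to picture "removing UUIDs at the API level if a GET request is used" is a small helper that drops the UUID parameter from a GET URL and leaves other methods untouched. This is only a sketch: the helper name, the `uuids` parameter name, and the example URL are assumptions, and the source does not say at which layer the real change was made.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_uuid_params(method, url, param="uuids"):
    """Drop the (hypothetical) UUID query parameter from GET URLs only."""
    if method.upper() != "GET":
        return url
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    print(strip_uuid_params("GET", "https://example.invalid/api/users?page=2&uuids=a%2Cb"))
```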
Verification of Resolution: We monitored our server logs to ensure that we did not observe any additional GET requests to the affected URL.
What We Are Doing to Prevent Future Incidents
Audit additional admin console API requests: We’re performing a sweep of admin console API requests to ensure that requests carrying many parameters use POST rather than long, highly parameterized GET URLs.
Remove the feature flag misconfiguration: We’re correcting the way the feature flag is configured to ensure consistent outcomes.
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Date of Incident: 2025-09-23
Time of Incident (UTC): 17:18 - 00:46
Service(s) Affected: Sign Up
Impact Duration: 7h 28m
Summary
For 7 hours and 28 minutes, 1Password Provisioning invites could not be accepted, presenting to the user as an invite expiry. Invites could not be accepted due to a web browser routing defect that was not caught during development, review, or release. First identified by customer reports approximately two and a half hours after release, the issue was escalated to development teams and an incident was immediately called. The root cause was identified as a defect introduced by a web client modification, and a fix was created, tested, and released. By 00:46 UTC, the fix was deployed to all environments and service was fully restored.
Impact on Customers
Sign-up: Provisioning invites could not be accepted.
Number of Affected Customers (approximate): 100% of provisioning invites could not be accepted
Customer-facing impact: Users clicking their invite links encountered a misleading ‘Invite Expired’ message.
Geographic Regions Affected: 1Password USA/Canada/EU/Enterprise
What Happened?
A change to the web client contained a router defect that incorrectly rendered provisioning invites as expired. Users were presented with an error message that erroneously stated the invite was expired. The change responsible for introducing the defect was able to be released because it was not captured under automatic change notification rules, was lacking automated test coverage, and was not included in the set of manual tests.
Timeline of Events (UTC):
17:18: 1Password release containing the defect
19:53 (2 hours, 35 minutes later): First customer report
20:49 (56 minutes later): Escalation to developer teams
21:01 (12 minutes later): Incident called
22:02 (1 hour, 1 minute later): Root cause identified
22:29 (27 minutes later): Fix created and testing initiated
23:49 (1 hour, 20 minutes later): Fix merged
00:46 (57 minutes later): Fix released and service fully restored
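The gaps between timeline entries, and the total impact duration, can be recomputed from the timestamps alone. The sketch below uses only the times listed in the timeline and assumes each entry falls within 24 hours of the previous one, so a negative gap means the clock rolled past midnight.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline above (UTC, HH:MM).
TIMES = ["17:18", "19:53", "20:49", "21:01", "22:02", "22:29", "23:49", "00:46"]


def elapsed_between(times):
    """Return the gaps between consecutive HH:MM times, rolling over midnight."""
    parsed = [datetime.strptime(t, "%H:%M") for t in times]
    gaps = []
    for earlier, later in zip(parsed, parsed[1:]):
        gap = later - earlier
        if gap < timedelta(0):
            gap += timedelta(days=1)  # the later entry is on the next day
        gaps.append(gap)
    return gaps


gaps = elapsed_between(TIMES)
total = sum(gaps, timedelta(0))

if __name__ == "__main__":
    for stamp, gap in zip(TIMES[1:], gaps):
        print(stamp, gap)
    print("total", total)
```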
Root Cause Analysis: A change modified the order in which key provisioning web routes were rendered. As a result, the route handling provisioning invitations failed to use the correct query parameters and the invite rendered as expired.
Contributing Factors: Automated tests on this endpoint do not exist. Manual testing missed testing the Provisioning routes. The modified code was not covered by automatic change notification rules to notify the Provisioning team. An existing bug that can fail the resending of invites was an initial red herring during the investigation.
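Why route order can matter is easiest to see with a toy first-match router. The source does not describe the web client's router or its matching rules, so everything here is an assumption for illustration: the route names, the paths, and the first-match behavior are hypothetical.

```python
import re


def match_route(routes, path):
    """Return the name of the first route whose pattern fully matches path."""
    for name, pattern in routes:
        if re.fullmatch(pattern, path):
            return name
    return None


# Hypothetical route tables; names and paths are illustrative only.
SPECIFIC_FIRST = [
    ("provisioning_invite", r"/invite/provision"),
    ("generic_invite", r"/invite/.*"),
]
# Same routes, opposite order: the broad pattern now shadows the specific one.
GENERIC_FIRST = list(reversed(SPECIFIC_FIRST))

if __name__ == "__main__":
    print(match_route(SPECIFIC_FIRST, "/invite/provision"))
    print(match_route(GENERIC_FIRST, "/invite/provision"))
```

An automated test that pins the expected handler for each provisioning path would catch a reordering like this before release.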
How Was It Resolved?
Resolution Steps: The defect in the 1Password web client was corrected so provisioning invites would render correctly.
Verification of Resolution: 1Password engineering tested the changes and validated that the functionality was restored, as well as verifying that requests for the affected endpoints were successful after the fix was deployed.
What We Are Doing to Prevent Future Incidents
Improve automated tests: We are enhancing our automated tests for the Provisioning Invite routes.
Expand automatic change notifications: We are expanding coverage of automatic change notification rules for areas of code owned by the Provisioning team.
Next Steps and Communication
No action is required from our customers at this time. Existing invites do not need to be resent and may be accepted.
If you are still experiencing issues, please contact our support team at support@1password.com.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Date of Incident: 2025-09-03
Time of Incident (UTC): 11:06 - 12:07
Service(s) Affected: All APIs
Impact Duration: 61 minutes
Summary
For 61 minutes on the morning of September 3rd, 2025, all 1Password APIs in the US/Global environment had degraded performance or returned an error for approximately 20% of requests. 92% of the impact was mitigated within 13 minutes, at 11:19, by automation scaling up infrastructure. By 12:07, a manual restart of the remaining infrastructure completed mitigation. A permanent fix was implemented and deployed to prevent the issue from recurring.
Impact on Customers
APIs: High latency, or a 500 Internal Server Error.
Number of Affected Customers: 20% of all requests returned errors for 13 minutes, 1% thereafter.
Geographic Regions Affected (if applicable): 1Password USA/Global
What Happened?
Timeline of Events (UTC):
11:05: A customer started a stream of an unusually high volume of requests to an API with sub-optimal performance.
11:06: Some servers started consuming abnormally high memory, causing slow response times and high error rates.
11:19: Automation scaled up infrastructure to service additional load.
11:51: Engineers declared an incident and alerted response teams.
12:02: Response team begins restarting affected servers.
12:07: All servers completed restarts, and error rates returned to normal levels
Root Cause Analysis: A poorly performing cache operation was triggered repeatedly in a short period of time across multiple servers, leading directly to greatly delayed responses.
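A common general technique for limiting repeated triggering of an expensive cache fill is request coalescing, sometimes called single-flight: concurrent callers asking for the same key share one execution. This is not the fix described in this postmortem (the query itself was refactored), and the class below is a hypothetical in-process sketch; it does not coordinate across multiple servers.

```python
import threading


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, result holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()  # only the leader runs the expensive call
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()  # wake every caller waiting on this key
        else:
            done.wait()
        if "error" in holder:
            raise holder["error"]
        return holder["value"]
```

Results are shared only among callers that overlap in time; a later call runs the function again, so this complements a cache rather than replacing one.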
How Was It Resolved?
Mitigation Steps: Automatic instance scaling restored over 98% of operational capacity after 13 minutes. Full capacity was restored through manual intervention.
Resolution Steps: We refactored the poorly performing query.
Verification of Resolution: We tested the affected API to confirm that the refactored query produced the desired performance improvement. We deployed the fix and monitored it for 24 hours to confirm the issue was resolved.
What We Are Doing to Prevent Future Incidents
We are auditing services for sub-optimal query performance.
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Service(s) Affected: Sign-in, Web Application, Command Line Interface (CLI), Single Sign On (SSO), APIs
Impact Duration: 1 hour
Summary
On August 5, 2025, 1Password experienced a service degradation that impacted customers' ability to sign in and access the web application. The incident was triggered during a planned architectural improvement when a misconfigured rollback attempt caused an overload of traffic and a subsequent database connection bottleneck. The issue was resolved by correcting the misconfiguration and restarting the web application servers, fully restoring service.
Impact on Customers
During the service disruption, some customers experienced degraded performance when accessing their 1Password vaults and signing in.
Sign-in Issues: Customers may have experienced sign-in slowness or timeouts.
Error Messages: Customers may have seen error messages when attempting to sign in, such as "Can't sign in", "Failed to determine sign in methods for email", or "Upstream connect error".
Vault Access: Some customers experienced degraded performance when accessing their 1Password vaults.
Geographic Regions Affected: USA/Global
What Happened?
The incident was part of ongoing improvements to the 1Password infrastructure and was not the result of a security incident. Customer data was not affected.
Timeline of Events (UTC):
17:12: A planned, phased rollout of an architectural improvement to authentication systems begins.
18:20: Engineers monitoring the rollout begin to observe higher latency during sign-in for a small subset of accounts.
20:17: A rollback of the change is initiated. An error in the rollback configuration sends a high volume of traffic to the new code path, causing a database connection bottleneck. Engineers observing the deployment immediately observe service impact.
20:21: A corrective action is deployed to revert the system to its previous state before the misconfigured rollback.
20:39: While impact is still being observed, a failover from the primary database to a secondary database is initiated. This action has no effect.
20:58: A restart of the service that manages incoming traffic to our services is initiated to reset connections.
21:13: A rolling restart of the web application servers is initiated.
21:20: Service is fully restored for all customers.
Root Cause Analysis: The root cause was an error in the configuration of an attempted rollback. This misconfiguration incorrectly routed a high volume of sign-in traffic through a new, slower code path, which created a bottleneck of connections to our primary database and made the web application unresponsive.
How Was It Resolved?
Resolution Steps: The issue was fully resolved through two key actions:
The rollback misconfiguration was identified and corrected, which stopped traffic from flowing to the problematic new code.
A rolling restart of the web application servers was performed to clear the backlog of stuck database connections.
Verification of Resolution: Monitoring systems were closely observed for 30 minutes to ensure error rates returned to normal.
What We Are Doing to Prevent Future Incidents
We are working to implement the following improvements:
Improve configuration testing: We will improve testing procedures of configuration updates and their rollbacks prior to being pushed to production.
Improve our deployment tooling: We will add additional validation to our traffic management tools to prevent similar configuration errors.
Review our incident response procedures: We have updated the runbook used to respond to this type of incident with guidance that will enable faster recovery.
Enhance our monitors: We will add more specific alerts that will help us more quickly distinguish between different application tiers, allowing for faster diagnosis and time to resolution.
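The kind of validation mentioned under deployment tooling could look something like the check below, which rejects a rollback configuration that still sends traffic to the path being rolled back. This is a sketch under assumptions: the source does not describe the real traffic management tools or their configuration format, so the function, the percentage-weight model, and the path names are hypothetical.

```python
def validate_traffic_split(weights, rollback_of=None):
    """Return a list of problems found in a traffic-split config.

    weights maps a code path name to the percentage of traffic it receives.
    rollback_of, when set, names the path that a rollback must drain to 0%.
    """
    problems = []
    for path, pct in weights.items():
        if not 0 <= pct <= 100:
            problems.append(f"{path}: weight {pct} is outside 0-100")
    total = sum(weights.values())
    if total != 100:
        problems.append(f"weights sum to {total}, expected 100")
    if rollback_of is not None and weights.get(rollback_of, 0) != 0:
        problems.append(f"rollback must send 0% to {rollback_of}")
    return problems


if __name__ == "__main__":
    # Hypothetical path names; a rollback that accidentally inverts the split.
    print(validate_traffic_split({"legacy": 0, "new": 100}, rollback_of="new"))
```

Running a check like this before a configuration is pushed turns an inverted rollback into a rejected change rather than a production event.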
Next Steps and Communication
No action is required from our customers. 1Password applications are designed to be resilient, with local copies of vault data always available on customer devices, even without a connection to the 1Password service.
If you are still experiencing issues, please contact our support team.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Date of Incident: 2025-07-22
Time of Incident (UTC): 10:40 - 06:30
Services Affected: Admin Console Activity log and User Details page
Impact Duration: 43 hrs
Summary
A change to our internal data model that removed an unused type definition led to multiple failures in the Reports system.
Impact on Customers
Events API and Reporting: The activity log widget failed to display activities.
Admin console: User details page failed to display user details
What Happened?
A software update introduced an API change that caused a mismatch between client and server. The Audit log page in the admin console was still collecting data but was not able to render it.
Timeline of Events (UTC):
10:40: Detected by 1Password personnel and root cause identified
10:42: Issue identified
13:38: Fix created and testing initiated
06:30: Fix released, service fully restored
Root Cause Analysis: A User State was removed from the 1Password.com server but not the client. This was a breaking API change.
How Was It Resolved?
Resolution Steps: The user state in the 1Password client was removed to get the server and client back into parity.
Verification of Resolution: 1Password engineering tested the changes and validated that full system functionality was restored.
What We Are Doing to Prevent Future Incidents
User State change process: 1Password is investigating how to catch breaking API changes by implementing additional end-to-end tests in the CI pipeline.
Audit log architecture change: Engineering is investigating a rework of the audit log page to change how it aggregates data so that it is not reliant on an endpoint. This would mitigate future occurrences of the kind of failure seen in this incident.
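A client/server mismatch of the kind described in the root cause can be caught by a simple parity check run in CI. The sketch below is an assumption about what such a check might look like, not 1Password's actual test: the function name and the state names are hypothetical, and a real check would load both sets from the client and server definitions.

```python
def states_unknown_to_server(client_states, server_states):
    """Return states the client still uses that the server no longer defines."""
    return sorted(set(client_states) - set(server_states))


# Hypothetical state names, for illustration only.
SERVER_STATES = {"active", "suspended"}
CLIENT_STATES = {"active", "suspended", "legacy_state"}

if __name__ == "__main__":
    stale = states_unknown_to_server(CLIENT_STATES, SERVER_STATES)
    if stale:
        raise SystemExit(f"client references states removed from server: {stale}")
```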
Next Steps and Communication
No action is required from our customers at this time.
If you are still experiencing issues, please contact our support team at support@1password.com.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.