Incident Postmortem - Intermittent 500 errors for US customers.
Date of Incident: 2025-06-16
Time of Incident (UTC): 14:39 UTC - 17:47 UTC
Service(s) Affected: Web Interface, Sign in, Sign up, Admin console, Item Sync, SSO (Single Sign On), Command Line Interface (CLI) for US-based users.
Impact Duration: 68 minutes
Summary
At 14:39 UTC, users in the US region experienced intermittent errors while accessing 1Password. The issue stemmed from resource constraints within our infrastructure, specifically affecting the networking services. This was resolved by scaling up the affected services.
Impact on Customers
During the incident:
Web interface, Admin Console: Customers were able to log in but saw intermittent 500 errors, including “Failed to get Integrations” on the web interface.
SSO (Single Sign On), Command Line Interface (CLI), Item Sync: There was degraded performance for authentication and API requests.
Sign in, Sign up: Some customers experienced intermittent failures when signing in or signing up.
Number of Affected Customers (approximate): All users accessing the service in the US region were affected.
Geographic Regions Affected (if applicable): US
What Happened?
The incident began when our internal services started returning errors after deploying the latest version of the 1Password service. As part of the initial investigation, we restarted a supporting network service within our infrastructure, which resulted in an initial recovery of the affected service.
Timeline of Events (UTC):
2025-06-16 15:21 UTC: Networking updates are rolled out
2025-06-16 15:23 UTC: Initial service recovery observed
2025-06-16 15:38 UTC: Root cause identified: Networking applications ran out of allocated resources.
2025-06-16 15:42 UTC: Additional capacity added to networking applications
2025-06-16 17:47 UTC: The spike in server errors stopped, and internal monitoring showed that system health had returned to normal.
2025-06-16 17:53 UTC: Incident resolved
Root Cause Analysis:
An internal service that directs network traffic became resource-constrained, which degraded its performance. We first stabilized the system by adding more capacity, and we have since deployed a permanent fix by increasing the resources allocated to the service to prevent a recurrence.
How Was It Resolved?
Mitigation Steps: As an immediate mitigation, we scaled up the number of replicas for the affected deployment.
Resolution Steps: A more permanent fix was later applied by increasing the allocated resources for the networking applications (a sketch of both steps follows this list).
Verification of Resolution: Around 15:25 UTC, we observed that the spike in 500 errors from the server had completely stopped. The team continued monitoring and confirmed at 17:53 UTC that resource consumption had remained stable.
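For illustration, here is a minimal sketch of what these two steps can look like, assuming the networking service runs as a Kubernetes Deployment managed with the official Python client; the deployment name, namespace, replica count, and resource values are hypothetical, not our actual configuration.

```python
# Hypothetical sketch of the mitigation and fix described above, assuming a
# Kubernetes Deployment. All names and values here are illustrative.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when run in-cluster
apps = client.AppsV1Api()

# Immediate mitigation: scale up the number of replicas to absorb load.
apps.patch_namespaced_deployment_scale(
    name="traffic-router",           # hypothetical deployment name
    namespace="networking",          # hypothetical namespace
    body={"spec": {"replicas": 6}},
)

# Permanent fix: increase the resources allocated to the containers.
apps.patch_namespaced_deployment(
    name="traffic-router",
    namespace="networking",
    body={"spec": {"template": {"spec": {"containers": [{
        "name": "traffic-router",
        "resources": {
            "requests": {"cpu": "1", "memory": "2Gi"},
            "limits": {"cpu": "2", "memory": "4Gi"},
        },
    }]}}}},
)
```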
What We Are Doing to Prevent Future Incidents
Scale existing resources: We have scaled resources and raised resource limits to handle the additional load, and we will implement monitoring to alert us before we approach critical limits (an illustrative check follows this list).
Review and expand existing monitors: We will review our critical service monitors to improve alerting and catch future incidents earlier, before they have customer impact.
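As an illustration of the kind of monitor described above (not our production alerting), a periodic check could compare each pod's memory usage against its configured limit using the Kubernetes metrics API; the namespace and the 80% threshold are assumptions.

```python
# Illustrative monitor: page before a workload hits its memory limit.
# The namespace and the 80% threshold are assumptions for this sketch.
from kubernetes import client, config

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}

def to_bytes(quantity: str) -> float:
    """Convert a Kubernetes quantity such as '512Mi' into bytes."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return float(quantity[:-len(suffix)]) * factor
    return float(quantity)

def check_memory(namespace: str = "networking", threshold: float = 0.80) -> None:
    config.load_kube_config()
    usage = client.CustomObjectsApi().list_namespaced_custom_object(
        "metrics.k8s.io", "v1beta1", namespace, "pods"
    )
    core = client.CoreV1Api()
    for pod_metrics in usage["items"]:
        pod = core.read_namespaced_pod(pod_metrics["metadata"]["name"], namespace)
        limits = {
            c.name: to_bytes(c.resources.limits["memory"])
            for c in pod.spec.containers
            if c.resources and c.resources.limits and "memory" in c.resources.limits
        }
        for container in pod_metrics["containers"]:
            limit = limits.get(container["name"])
            if limit and to_bytes(container["usage"]["memory"]) / limit > threshold:
                print(f"ALERT: {pod.metadata.name}/{container['name']} is above "
                      f"{threshold:.0%} of its memory limit")

if __name__ == "__main__":
    check_memory()
```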
Next Steps and Communication
No action is needed from customers.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Incident Postmortem - Elevated latency and errors for 1Password.com (US/Global) customers.
Date of Incident: 2025-05-21
We have resolved an issue where account administrators were unable to load the "People" view when logging in to their account on 1Password.com, 1Password.ca, and 1Password.eu.
Service(s) Affected: USA/Global 1Password.com website, Sign in, Sign up, Admin console, SSO (Single Sign On), Command Line Interface (CLI).
Impact Duration: 41 minutes
Summary
On May 21, 2025, 1Password's web interface, APIs, browser extension, and CLI tools experienced significant latency and errors. These problems stemmed from a code change that triggered a spike in server requests, leading to increased memory usage and system load. As a result, customers were unable to access their vaults or sign in via SSO.
This was not the result of a security incident, and customer data was not affected.
Impact on Customers
During the incident:
Web interface, Administration: Customers experienced significant delays when accessing the 1Password web interface. Administrators could not access or use any administration tools.
Single Sign-on (SSO), Multi-factor Authentication (MFA): Users with SSO or MFA enabled could not sign in and received an "An unexpected error occurred" message. Customers may also have been required to re-authenticate to access 1Password once the issue was mitigated.
Command Line Interface (CLI): CLI users faced increased latency and timeouts when attempting to access our web APIs.
Browser Extension: Users requiring web interface authentication were unable to unlock their vaults.
Number of Affected Users (approximate): All users accessing the service in the US/Global (1password.com) region were affected.
Geographic Regions Affected (if applicable): 1password.com (US/Global)
What Happened?
We deployed code changes that increased the number of queries to our Redis clusters. The increase in queries caused a spike in memory usage which in turn caused latency and errors across all endpoints.
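For context on why a query spike shows up as memory pressure: Redis holds its entire dataset in memory, so a sustained burst of writes consumes headroom directly, and as usage approaches the configured maximum the server must evict keys or slow down. The following is a minimal, illustrative way to watch for that condition; the connection details and the 90% threshold are assumptions, not our production tooling.

```python
# Illustrative check of a Redis node's memory headroom and eviction activity.
# Host, port, and the 90% threshold are assumptions for this sketch.
import redis

r = redis.Redis(host="localhost", port=6379)

mem = r.info("memory")
stats = r.info("stats")

used, cap = mem["used_memory"], mem.get("maxmemory", 0)
if cap and used / cap > 0.90:
    print(f"WARNING: Redis is at {used / cap:.0%} of maxmemory")
print(f"Keys evicted since restart: {stats['evicted_keys']}")
policy = r.config_get("maxmemory-policy")["maxmemory-policy"]
print(f"Eviction policy: {policy}")
```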
Timeline of Events (UTC):
2025-05-21 15:52 UTC: Deployment started
2025-05-21 15:57 UTC: Deployment complete
2025-05-21 16:00 UTC: Automated monitoring detects increased errors and latency
2025-05-21 16:01 UTC: Automation pages the incident response team
2025-05-21 16:06 UTC: The team activates our incident protocol and begins investigation
2025-05-21 16:21 UTC: The team initiates a rollback to a previous version
2025-05-21 16:23 UTC: Code change causing the issue identified
2025-05-21 16:48 UTC: Incident mitigated—rollback completed and we see a significant improvement in error rates and latency. The team continues to monitor the system.
2025-05-21 17:23 UTC: Incident resolved
Root Cause Analysis:
We released a code change that caused a significant increase in data writes to our session store cluster.
All operations, even those with a pre-established session, depend on the session store for authenticating requests.
The resulting resource contention led to increased latency and timeouts.
The unplanned high volume of writes to this specific datastore also caused a portion of sessions to be prematurely evicted, requiring customers to re-authenticate earlier than anticipated (the sketch below illustrates this failure mode).
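To make that dependency concrete, here is a minimal sketch of a Redis-backed session check of the kind described above; the key naming, TTL, and connection details are assumptions, not our actual implementation.

```python
# Illustrative Redis-backed session check: every request, even one from an
# established session, reads the session store. Under memory pressure an
# eviction policy such as allkeys-lru can drop a session key before its TTL
# expires, which the customer experiences as a forced sign-out.
# Key names and the TTL are assumptions for this sketch.
import redis

store = redis.Redis(host="localhost", port=6379)
SESSION_TTL = 30 * 60  # 30-minute sessions (illustrative)

def create_session(session_id: str, user_id: str) -> None:
    store.set(f"session:{session_id}", user_id, ex=SESSION_TTL)

def authenticate(session_id: str) -> str:
    user_id = store.get(f"session:{session_id}")
    if user_id is None:
        # Reached on normal expiry *or* premature eviction; both force the
        # customer to re-authenticate.
        raise PermissionError("session not found; re-authentication required")
    return user_id.decode()
```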
How Was It Resolved?
Our monitoring systems detected the issue and alerted the response team immediately after the release. The team quickly identified the problem and initiated a rollback.
Resolution Steps: The team identified the problematic code change and reverted to a previous version. As the rollback deployed, server functionality returned to normal.
Verification of Resolution: The team closely watched our monitoring systems for two hours after the rollback to ensure latency and errors were fully resolved.
What We Are Doing to Prevent Future Incidents
Our team will implement longer testing periods in lower-traffic environments to improve monitoring and issue detection for similarly high-risk changes.
Our team is working to improve our deployment process with more incremental rollouts, which will allow us to detect system issues earlier and contain their impact (a sketch of such a canary gate follows this list).
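For illustration, a canary gate of the kind described above compares the new version's error rate against the current baseline before the rollout widens. This is a minimal sketch, not our actual tooling; the deployment names, sample metric values, and thresholds are assumptions.

```python
# Illustrative canary gate for an incremental deployment. fetch_error_rate()
# stands in for a real metrics query; its values and both thresholds are
# assumptions for this sketch.

def fetch_error_rate(deployment: str) -> float:
    """Stand-in for a monitoring-system query; returns errors per request."""
    sample = {"app-stable": 0.002, "app-canary": 0.011}  # fabricated demo data
    return sample.get(deployment, 0.0)

def canary_healthy(baseline: str = "app-stable",
                   canary: str = "app-canary",
                   max_ratio: float = 1.5,
                   floor: float = 0.001) -> bool:
    """Allow promotion only if the canary's error rate is not meaningfully
    worse than the baseline's."""
    base, cand = fetch_error_rate(baseline), fetch_error_rate(canary)
    return cand <= max(base * max_ratio, floor)

if __name__ == "__main__":
    action = "promote" if canary_healthy() else "roll back"
    print(f"Canary gate decision: {action}")
```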
Next Steps and Communication
Some customers may need to re-authenticate in order to access 1Password.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.