Service(s) Affected: SSO, sign in, sign up, CLI, web interface, access to vault content and other items, admin console, MFA
Impact Duration: ~60 mins
Summary
On November 18, 2025, at 5:03 PM UTC, 1Password cloud services for customers in the US region became degraded and, for a period, unavailable. The issue was caused by database resource exhaustion, which led to failed operations and rejected connections. This was not a security incident and no customer data was impacted. The issue was resolved by resizing the database instance to restore normal performance and provide additional capacity for future growth.
Impact on Customers
Single Sign-on (SSO), Multi-factor Authentication (MFA): Users with SSO or MFA enabled experienced delays, and in some cases failures to log in.
Browser Extension: Users who needed to authenticate via the web interface were unable to unlock their vaults.
Web Interface, Administration: Customers were unable to log in, sign-ups failed, syncing between devices was not functioning, access to vaults and other items was unavailable, and the admin console was not reachable.
API Access: CLI users and API requests received timeout errors and slow responses.
Number of Affected Customers (approximate): All customers utilizing cloud interfaces and APIs in the affected region for the duration of the incident.
Geographic Regions Affected (if applicable): US/Global.
Timeline of Events (UTC):
2025-11-18 5:56pm: Services slowly started to scale up
2025-11-18 5:57pm: Services started to come back as the database instance resize completed
2025-11-18 6:05pm: Incident marked as Identified
2025-11-18 6:05pm: Team continues to monitor, performance has returned to normal levels
2025-11-18 6:23pm: Incident marked as Monitoring and services Operational
2025-11-18 7:16pm: Incident marked as resolved
Root Cause Analysis: A refactor of a feature increased the impact of a poorly performing query that had previously gone undetected. The result was an exponential increase in resource consumption on the main database. Once resources were fully exhausted, the service rejected connections and all requests failed.
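For illustration only, the sketch below shows how a query that is cheap in isolation can exhaust a database once a refactor moves it onto a hot path; the schema and per-vault loop are invented for this example and are not 1Password's code.

```python
# Hypothetical illustration of how a refactor can amplify a slow query.
# Table names and schema are invented for this example; they are not 1Password's.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, vault_id INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO items (vault_id, data) VALUES (?, ?)",
    [(i % 10, f"item-{i}") for i in range(1_000)],
)

def items_per_vault_n_plus_1(vault_ids):
    # One round trip per vault: cheap in isolation, expensive when a refactor
    # causes it to run for every request or every vault in an account.
    return {v: conn.execute(
        "SELECT id, data FROM items WHERE vault_id = ?", (v,)
    ).fetchall() for v in vault_ids}

def items_per_vault_batched(vault_ids):
    # A single query fetches all rows at once, keeping database load flat
    # as the number of vaults grows.
    placeholders = ",".join("?" * len(vault_ids))
    rows = conn.execute(
        f"SELECT vault_id, id, data FROM items WHERE vault_id IN ({placeholders})",
        list(vault_ids),
    ).fetchall()
    result = {v: [] for v in vault_ids}
    for vault_id, item_id, data in rows:
        result[vault_id].append((item_id, data))
    return result
```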
Contributing Factors (if any):
Non-performant queries
Database under-provisioned
How Was It Resolved?
Mitigation Steps:
Background services were halted to reduce load on the database.
Application servers were scaled down to further reduce load.
Resolution Steps: Increasing the database instance size resolved the issue.
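As a rough sketch of what such a resize can look like, the example below assumes an AWS-managed (RDS) database; the instance identifier and target class are placeholders, and the actual database service is not named in this report.

```python
# Sketch of resizing a managed database instance, assuming AWS RDS.
# The identifier and instance class below are placeholders, not 1Password's.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="example-primary-db",  # hypothetical identifier
    DBInstanceClass="db.r6g.4xlarge",           # hypothetical larger size
    ApplyImmediately=True,                      # apply now, not at the next maintenance window
)

# Resizing a managed instance typically involves a brief restart or failover,
# so it pairs with the load-shedding steps above (halting background services
# and scaling down application servers).
```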
Verification of Resolution: Monitoring metrics were closely observed to ensure error rates returned to normal and database performance had stabilized.
What We Are Doing to Prevent Future Incidents
Improving monitoring: We are updating our monitoring systems to better detect database issues like this before they impact customers (see the sketch after this list).
Improving database performance: We are refactoring the responsible query to improve performance and reduce load, and tuning the background service to prevent resource contention.
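A minimal sketch of the kind of proactive alerting described above, assuming AWS CloudWatch metrics for a managed database; the alarm name, instance identifier, threshold, and notification topic are illustrative placeholders.

```python
# Sketch of a proactive database alarm, assuming AWS CloudWatch/RDS metrics.
# Names, thresholds, and the SNS topic are illustrative placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="example-db-cpu-approaching-limit",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "example-primary-db"}],
    Statistic="Average",
    Period=60,                 # evaluate one-minute averages
    EvaluationPeriods=5,       # sustained for five minutes
    Threshold=80.0,            # alert well before CPU is exhausted
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-oncall-topic"],
)
```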
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Service(s) Affected: 1Password.com website, Sign in, Access to passwords and other items
Impact Duration: 13 hours, 29 minutes
Summary
On October 20, 2025, at 07:26:00 UTC, 1Password.com experienced intermittent latency, authentication failures, and degraded service availability due to a major outage at AWS in the us-east-1 region. This was not a security incident and no customer data was affected.
As a result, the 1Password server-side application experienced degradation or intermittent failures, affecting up to 50% of traffic in the US region. Complete service restoration occurred in conjunction with AWS’s final mitigations around 18:30 UTC.
Impact on Customers
All US customers accessing 1Password cloud services experienced intermittent latency, authentication failures, and degraded availability on 1Password.com.
File Share: Sharing of passwords via links could intermittently fail
Login: Users logging into vaults experienced timeout errors and slow responses
Web Access: Users accessing their vault through the web interface experienced timeout errors and slow responses
API Access: CLI users and API requests received timeout errors and slow responses
What Happened?
At 07:11:00 UTC, AWS began experiencing DNS resolution failures in the us-east-1 region, initially affecting DynamoDB and rapidly cascading to multiple AWS services. 1Password detected the impact at 07:26:00 UTC when monitoring alerts fired for an inability to scale up clusters, and an incident was declared.
1Password immediately deployed mitigations inside our infrastructure to ensure there was adequate compute capacity to serve our US-based users, which included pausing deployments and scaling down any services not critical to key functionality for our users.
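As an illustration of the scale-down step (not 1Password's actual tooling), the sketch below uses the Kubernetes Python client to set a hypothetical non-critical deployment to zero replicas, freeing its capacity for user-facing traffic.

```python
# Illustrative sketch: freeing compute by scaling a non-critical deployment to zero.
# Assumes a Kubernetes cluster; the deployment and namespace names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
apps = client.AppsV1Api()

apps.patch_namespaced_deployment_scale(
    name="example-background-worker",   # hypothetical non-critical service
    namespace="example-namespace",
    body={"spec": {"replicas": 0}},     # release its capacity for user-facing traffic
)
```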
Timeline of Events (UTC):
06:55:05 - 1Password monitoring triggers warning for unavailable Pods in Deployment (caused by inability to obtain AWS IAM credentials)
07:03:06 - 1Password monitoring alerts for 5xx errors on auth start endpoint (caused by inability to obtain AWS IAM credentials) - pages authentication team, but alert recovers within minutes
07:26:00 - 1Password monitoring alerts for inability to scale clusters, engineers begin investigating, Incident declared
07:26:41 - AWS confirms elevated error rates across multiple services
07:49:06 - 1Password monitoring alerts for 5xx errors on auth start endpoint (caused by inability to obtain AWS IAM credentials)
07:51:09 - AWS identifies DNS as the root cause, begins mitigation
08:02:13 - 1Password suspends auto-scaling tooling to retain existing capacity
09:27:33 - AWS reports significant recovery signs
10:35:37 - AWS declares DNS issue fully mitigated, services recovering
14:14:00-15:43:00 - AWS announces full recovery across all services; throttles EC2 launches
16:42:49 - 1Password tooling and users start reporting 503s and an inability to log in due to the volume of traffic
16:50:00 - 1Password services restarted to reset and flush connections, prioritizing post-recovery traffic.
20:53:00 - AWS resolves their incident
20:55:00 - 1Password engineers overscale deployments for stability and overnight observation
Oct 21, 2025 - Incident resolved after confirmation of complete upstream recovery
How Was It Resolved?
Mitigation Steps: 1Password paused deployments and auto-management of cluster capacity to ensure enough capacity was available to serve users through peak access times. As demand outstripped available capacity, 1Password engineering reset the circuit breaker to allow additional connections to the service.
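For readers unfamiliar with the pattern, the sketch below shows a minimal circuit breaker: it rejects requests after repeated failures and can be manually reset to admit traffic again once a dependency recovers. It is illustrative only and not 1Password's implementation.

```python
# Minimal circuit-breaker sketch (illustrative only, not 1Password's implementation).
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None          # None means the breaker is closed (traffic flows)

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request rejected")
            self.reset()               # half-open: allow a trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

    def reset(self):
        # Manual reset, as described in the mitigation: admit connections again
        # once the upstream dependency has recovered.
        self.failures = 0
        self.opened_at = None
```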
Resolution Steps: AWS announced system restoration and a reduction in throttling of EC2 API calls. To ensure sufficient capacity for peak traffic, 1Password engineers updated the required number of pods for core services the following business day, resumed auto-management of cluster capacity tooling, and verified the health of systems, deployments, and auto-scaling of the services.
Verification of Resolution: Engineers observed monitoring systems and cluster management tooling logs to ensure system health.
Root Cause Analysis
The failures in AWS's internal network affected multiple AWS product APIs. This disruption directly impacted 1Password's ability to scale up infrastructure, deploy applications, and retrieve configuration data.
Contributing Factors (if any):
Third-party incident response services and paging services were affected by the AWS incident, which complicated communications.
Upstream customer IDPs were affected by the AWS outage, and returned errors that resulted in authentication failures.
What We Are Doing to Prevent Future Incidents
Improve Incident Response: Create additional backup protocols for when our incident response tooling is unavailable.
Improve multi-service outage response: Create strong break-glass runbooks in the event of a multi-service cloud provider outage.
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.
Incident Postmortem - Degraded performance when accessing 1Password
Date of Incident: 2025-09-26
Time of Incident: 4:20pm UTC - 5:39pm UTC
Service(s) Affected: SSO, Web Sign In, Sign Up, Web Interface, CLI
Impact Duration: ~60 minutes
Summary
On September 26, 2025, at 4:20 PM UTC, 1Password's web interface and APIs experienced degraded performance for all customers in the US region. This was not the result of a security incident and customer data was not affected.
Impact on Customers
For the duration of the incident:
Web interface, Administration: Customers experienced delays when accessing the 1Password web interface.
Single Sign-on (SSO), Multi-factor Authentication (MFA): Users with SSO or MFA enabled experienced delays, and in some cases failures to log in.
Command Line Interface (CLI): CLI users faced increased latency and timeouts when attempting to access our web APIs.
Browser Extension: Users requiring web interface authentication experienced delays or failures.
Number of Affected Customers (approximate): ~30%
Geographic Regions Affected: 1Password.com (US/Global)
What Happened?
At 4:20 PM UTC and again at 5:00 PM UTC, bursts of customer traffic placed extra load on one of our caches. The cache was under-provisioned for that spike in activity and exhausted its available CPU, which caused cascading errors and latency that manifested as slow and failed requests.
Timeline of Events (UTC):
2025-09-26 4:20pm: Spike in customer traffic began
2025-09-26 4:29pm: Automated monitoring detects increased errors and latency
2025-09-26 4:35pm: The team activates our incident protocol and begins investigation
2025-09-26 4:58pm: The team decides to restart application servers
2025-09-26 5:00pm: The servers have been restarted; service is still degraded as a second traffic burst begins
2025-09-26 5:18pm: Service starts to improve
2025-09-26 5:25pm: The team detects increased load for the second time
2025-09-26 5:33pm: The team restarts application servers again
2025-09-26 5:39pm: Service is back to normal, team continues to investigate
2025-09-26 7:26pm: Team has found the issue, and proceeds to upgrade cache instance size
2025-09-26 7:50pm: Team continues to monitor, performance has returned to nominal levels
2025-09-26 8:24pm: Incident is marked as resolved
Root Cause Analysis:
A code library installed in July introduced latency issues for cache connections. Authentication operations weren't properly rate-limited, allowing large traffic influxes. During peak traffic periods, the cache infrastructure was operating near maximum CPU capacity. The incident occurred when a burst of authentication traffic pushed the cache CPU utilization to 100%. The increased latency and CPU usage together directly caused the incident.
Contributing Factors:
Latency increase due to cache library version upgrade
Inadequate rate limiting allowed traffic bursts to go unchecked
Cache instance was under-provisioned
How Was It Resolved?
Mitigation Steps: Restarting application servers temporarily mitigated the latency and errors, but the problems returned when traffic spiked again.
Resolution Steps: Increasing the instance size for the cache resolved the issue.
Verification of Resolution: The incident team tested the upgrade in a staging deployment before executing it in production. They then monitored metrics to confirm the system returned to normal levels.
What We Are Doing to Prevent Future Incidents
Improve capacity planning for cache: We will ensure our internal infrastructure is properly sized to handle current traffic volumes and accommodate future growth. We'll implement regular resource evaluations to maintain adequate capacity as our traffic increases. We will also implement proactive alerting systems that notify our teams when resource utilization approaches critical thresholds.
Update library to a more performant version: We will upgrade our caching library to the latest stable version to eliminate the current latency issues.
Improve rate limiting for operations that triggered the traffic burst: Enhancing our rate limiting system will significantly improve our ability to handle future traffic bursts; a brief sketch of this approach follows below.
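As a sketch of the general approach (illustrative only, not 1Password's design), a token-bucket limiter allows short bursts while capping the sustained rate of expensive operations such as authentication starts.

```python
# Token-bucket rate limiter sketch (illustrative; not 1Password's implementation).
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec       # sustained requests allowed per second
        self.capacity = burst          # extra headroom for short bursts
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on elapsed time, up to the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Hypothetical usage: one bucket per client, checked before starting
# an expensive authentication operation.
bucket = TokenBucket(rate_per_sec=10, burst=20)
if not bucket.allow():
    pass  # e.g. return HTTP 429 and ask the client to retry later
```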
Timeline for Implementation: Observability improvements have already been implemented, and we will complete the remaining work by the end of Q1 2026.
Next Steps and Communication
No action is required from our customers at this time.
We are committed to providing a reliable and stable service, and we are taking the necessary steps to learn from this event and prevent it from happening again. Thank you for your understanding.