Incident History

Login failures

This incident has been resolved.

May 14, 2022 (UTC) - Resolved

Missing Metrics for a Subset of Status Pages

This incident has been resolved.

April 25 to April 29, 2022 (UTC) - Resolved

Multiple sites showing down/under maintenance

Earlier this month, several hundred Atlassian customers were impacted by a site outage. We have published a Post-Incident Review that includes a technical deep dive into what happened, details on how we restored customers' sites, and the immediate actions we've taken to improve our operations and approach to incident management.

https://www.atlassian.com/engineering/post-incident-review-april-2022-outage

April 5 to April 17, 2022 (UTC) - Resolved

Delay in System Metrics

This incident has been resolved.

March 30 to April 1, 2022 (UTC) - Resolved

Issues With Login

SUMMARY

On March 14, 2022, between 01:05 PM and 01:47 PM UTC, some Atlassian customers were unable to log in to our products, including Trello and Statuspage, and could not access some services, including the ability to create support tickets. The underlying cause was a newly introduced configuration data store that did not scale up properly due to a misconfiguration of autoscaling.

The incident was detected by Atlassian's automated monitoring system and mitigated by disabling the new configuration data store, which put our systems back into a known good state. The total time to resolution was approximately 42 minutes.

IMPACT

The impact window was March 14, 2022, 01:05 PM UTC to 01:47 PM UTC and spanned seven products and services. The misconfiguration impacted several key dependent services, which resulted in an outage for end users and failed logins across those products and services.

ROOT CAUSE

The issue was caused by an underlying configuration data store, based on AWS DynamoDB, failing to scale up. During post-setup fine-tuning, we identified that the initial values for read capacity units (RCUs) and write capacity units (WCUs) were over-provisioned. As a result, a decision was made to decrease them; however, the resulting values proved insufficient to handle the increased traffic in our system.
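
As an illustration of the class of configuration involved (not Atlassian's actual setup), the sketch below shows how a provisioned DynamoDB table can be put under target-tracking autoscaling with boto3, so that read and write capacity grow with traffic instead of staying pinned to a hand-picked value. The table name, capacity bounds, and target utilization are hypothetical.

    import boto3

    # Hypothetical example only: enable target-tracking autoscaling on a
    # provisioned DynamoDB table. Names and numbers are illustrative, not
    # Atlassian's actual configuration.
    autoscaling = boto3.client("application-autoscaling")

    TABLE = "config-store"  # hypothetical table name

    for dimension, metric in [
        ("dynamodb:table:ReadCapacityUnits", "DynamoDBReadCapacityUtilization"),
        ("dynamodb:table:WriteCapacityUnits", "DynamoDBWriteCapacityUtilization"),
    ]:
        # Register the table's capacity as a scalable target, with a floor
        # high enough for normal traffic and generous headroom above it.
        autoscaling.register_scalable_target(
            ServiceNamespace="dynamodb",
            ResourceId=f"table/{TABLE}",
            ScalableDimension=dimension,
            MinCapacity=100,     # illustrative floor
            MaxCapacity=10000,   # illustrative ceiling
        )

        # Track ~70% consumed capacity: DynamoDB adds capacity as utilization
        # rises toward the target and removes it as traffic falls.
        autoscaling.put_scaling_policy(
            ServiceNamespace="dynamodb",
            ResourceId=f"table/{TABLE}",
            ScalableDimension=dimension,
            PolicyName=f"{TABLE}-{dimension.split(':')[-1]}-target-tracking",
            PolicyType="TargetTrackingScaling",
            TargetTrackingScalingPolicyConfiguration={
                "TargetValue": 70.0,
                "PredefinedMetricSpecification": {"PredefinedMetricType": metric},
            },
        )

With a policy like this, lowering the baseline provisioned values is less risky, because the scaling target and ceiling, rather than the initial hand-set capacity, determine how much traffic the table can absorb.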

REMEDIAL ACTIONS PLAN & NEXT STEPS

We're prioritizing a set of improvement actions to avoid repeating this type of incident.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

Thanks,

Atlassian Customer Support

March 14, 2022 (UTC) - Resolved

Statuspage Cache Invalidation Delayed

This incident has been resolved.

March 9, 2022 (UTC) - Resolved

Public API experiencing increased errors

Increased load caused the API service to experience errors from 10:03 to 10:15 AM PST.

February 7, 2022 (UTC) - Resolved

Intermittent errors accessing public pages due to elevated traffic

Due to elevated traffic, we experienced intermittent timeouts and errors in serving public pages between 5:57 and 5:58 AM PST. We have made updates to our services to prevent similar problems from recurring.

January 26, 2022 (UTC) - Resolved

Server errors from the Manage Portal

Between 5:20 and 5:24 AM PST, a slow-performing API query resulted in an elevated number of 500 errors on the Manage Portal. The root cause of the degraded performance has been identified and a fix has been deployed. The Manage Portal is now operating normally.

January 11, 2022 (UTC) - Resolved

Decrease in site availability due to errant database migration

This incident has been resolved.

January 5 to January 6, 2022 (UTC) - Resolved