Incident History

Hosted Grafana Outage (prod-us-west-0)

Engineering has released a fix, and as of 22:20 UTC customers should no longer experience any issues. At this time, we are considering this incident resolved. No further updates.

2025-07-03 22:17 - 22:28 UTC Resolved

Read and write path outage in Hosted Logs ap-south-1 region cells.

The incident was caused by multiple ingesters being unavailable at the same time while ingester pods were being moved between nodes. This is a routine operation, but in this case one ingester took an unexpectedly long time to restart, which overlapped with another ingester restarting at the same time and caused the outage.
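A minimal sketch of why two concurrently unavailable ingesters break the write path, assuming the Dynamo-style replication used by Loki/Cortex-family ingesters (replication factor 3, write quorum of floor(rf/2) + 1 = 2); the function name and numbers are illustrative, not from the incident report:

```python
def write_succeeds(replication_factor: int, unavailable: int) -> bool:
    """Return True if enough ingester replicas are up to reach write quorum.

    Assumes Dynamo-style quorum replication: a write must be accepted by
    quorum = floor(replication_factor / 2) + 1 replicas to succeed.
    """
    quorum = replication_factor // 2 + 1
    return replication_factor - unavailable >= quorum


# One ingester restarting (the normal rolling-move case): writes still succeed.
assert write_succeeds(replication_factor=3, unavailable=1)

# Two ingesters restarting at the same time: quorum is lost, writes fail.
assert not write_succeeds(replication_factor=3, unavailable=2)
```

This is why a pod move is normally safe: rolling operations keep at most one replica down, and the outage only occurs when a slow restart overlaps with a second restart.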

2025-07-03 09:23 - 14:31 UTC Resolved

cortex-prod-05 cell partial read-path outage

This incident has been resolved.

2025-07-03 09:11 - 21:24 UTC Resolved

Degraded Performance in Mimir Ingestion Path

This incident has been resolved.

2025-06-30 18:27 - 20:52 UTC Resolved

Read & Write Outage in Prod-GB-South-1

We observed a brief read & write outage in the prod-gb-south-1 region. This lasted from approximately 11:51-12:14 UTC.

2025-06-27 15:41 UTC Resolved

Synthetic Monitoring: Spain public probe failing intermittently.

The Spain public probe experienced intermittent failures between June 21st 20:00 UTC and June 22nd 08:40 UTC; synthetic monitoring checks using this probe failed intermittently during that window. The issue is now resolved and the probe is stable.

2025-06-22 09:46 UTC Resolved

Degraded Write Performance on Tempo Metrics Generator

This incident has been resolved.

2025-06-20 16:26 - 22:27 UTC Resolved

High query latency for us-prod-east-0 hosted datasources

This incident has been resolved. No further issues have been seen since the backend configuration was adjusted on Friday, June 20th at 22:22 UTC.

The root cause has been identified as node CPU saturation, causing high latency on ingesters.

2025-06-20 15:48 UTC - 2025-06-24 16:35 UTC Resolved

Cloudwatch/Athena Integrations - Partial Outage

We have observed a sustained period of recovery. At this time, we are considering this issue resolved. No further updates.

2025-06-19 17:50 - 18:47 UTC Resolved

Slow user queries exceed threshold

There was a read outage impacting Loki tenants in the prod-us-east-0 cluster. This issue has been resolved.

2025-06-18 17:54 UTC Resolved