We were affected by a Kubernetes bug (reported in https://github.com/kubernetes/kubernetes/issues/127370) that prevents service endpoints from being updated when pods are stopped or started if more than 1,000 pods match the service.
This caused a temporary outage in Mimir's gossip services, which in turn led to short-lived failures to ingest and query metrics.
This issue has been resolved.
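For operators who want to check whether one of their own services is near this threshold, the sketch below counts endpoints across a Service's EndpointSlices using client-go. It is a minimal illustration only: it assumes a cluster reachable via the default kubeconfig, and the `mimir` namespace and `gossip-ring` service name are hypothetical placeholders, not the actual objects involved in this incident.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); sketch only.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical namespace and service name, for illustration.
	const namespace, service = "mimir", "gossip-ring"

	// EndpointSlices belonging to a Service are labeled with
	// kubernetes.io/service-name=<service>.
	slices, err := clientset.DiscoveryV1().EndpointSlices(namespace).List(
		context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=" + service},
	)
	if err != nil {
		log.Fatal(err)
	}

	// Sum endpoints across all slices for the service.
	total := 0
	for _, s := range slices.Items {
		total += len(s.Endpoints)
	}
	fmt.Printf("%s/%s: %d endpoints across %d EndpointSlices\n",
		namespace, service, total, len(slices.Items))
	if total > 1000 {
		fmt.Println("warning: >1000 endpoints; see kubernetes/kubernetes#127370")
	}
}
```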
Confluent Cloud has resolved the issue with their Metrics API, which was causing gaps in our metric data. As a result, our service is now fully restored, and data flow is back to normal. Thank you for your patience.
The incident has been resolved. We applied an update to all Grafana Cloud instances that inadvertently restarted instances regardless of whether they were active. This put heavy load on our control plane, which led to slower startup times.