Incident History

Write path outage in us-central1 region

Due to the bug reported in https://github.com/kubernetes/kubernetes/issues/127370, Kubernetes service endpoints were not updated when pods were stopped or started for services matching more than 1,000 pods. This caused a temporary outage in Mimir gossiping services, which in turn resulted in failures to ingest and query metrics for a short time. This issue has been resolved.

1732036353 - 1732036353 Resolved

Issues with new stack creation

This incident has been resolved.

1731973594 - 1731982110 Resolved

Adaptive Metrics Degraded Performance

We have observed a sustained period of recovery. At this time, we consider this issue resolved. No further updates.

1731945686 - 1731970691 Resolved

Grafana Cloud Portal Accessibility Issues

This incident has been resolved.

1731926863 - 1731928352 Resolved

Degraded dashboard performance due to an erroneous security policy

The rollback was completed as of 17:20 UTC. At this time, we consider this issue resolved. No further updates.

1731673295 - 1731691382 Resolved

Tempo Ingestion Disruption

This incident has been resolved.

1731593509 - 1731595453 Resolved

k6 browser tests aborted by system

This incident has been resolved.

1731517136 - 1731525172 Resolved

Grafana Cloud Prometheus - Unhealthy Ingesters

We have observed a sustained period of recovery. At this time, we consider this issue resolved. No further updates.

1731009251 - 1731009810 Resolved

Confluent Cloud Latency and High Error Rates

Confluent Cloud has resolved the issue with their Metrics API, which was causing gaps in our metric data. As a result, our service is now fully restored, and data flow is back to normal. Thank you for your patience.

1730744600 - 1730798959 Resolved

New and recently unpaused/unarchived Grafana Cloud instances unable to start

The incident has been resolved. We applied an update to all Grafana Cloud instances, which inadvertently restarted instances regardless of whether they were active. This placed heavy load on our control plane and slowed instance startup times.

1730286710 - 1730302135 Resolved