We were affected by a Kubernetes bug (reported in https://github.com/kubernetes/kubernetes/issues/127370) that prevents service endpoints from being updated when pods are stopped or started if more than 1,000 pods match the service.
This caused a temporary outage in Mimir's gossip services, which in turn led to short-lived failures to ingest and query metrics.
This issue has been resolved.
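For operators who want to check whether one of their own services is near this threshold, the sketch below counts endpoints across a Service's EndpointSlices using client-go. It is a minimal illustration only: it assumes a cluster reachable via the default kubeconfig, and the `mimir` namespace and `gossip-ring` service name are hypothetical placeholders, not the actual objects involved in this incident.

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig (~/.kube/config); sketch only.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical namespace and service name, for illustration.
	const namespace, service = "mimir", "gossip-ring"

	// EndpointSlices belonging to a Service are labeled with
	// kubernetes.io/service-name=<service>.
	slices, err := clientset.DiscoveryV1().EndpointSlices(namespace).List(
		context.TODO(),
		metav1.ListOptions{LabelSelector: "kubernetes.io/service-name=" + service},
	)
	if err != nil {
		log.Fatal(err)
	}

	// Sum endpoints across all slices for the service.
	total := 0
	for _, s := range slices.Items {
		total += len(s.Endpoints)
	}
	fmt.Printf("%s/%s: %d endpoints across %d EndpointSlices\n",
		namespace, service, total, len(slices.Items))
	if total > 1000 {
		fmt.Println("warning: >1000 endpoints; see kubernetes/kubernetes#127370")
	}
}
```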
Confluent Cloud has resolved the issue with their Metrics API, which was causing gaps in our metric data. As a result, our service is now fully restored, and data flow is back to normal. Thank you for your patience.
The incident has been resolved. We applied an update to all Grafana Cloud instances that inadvertently restarted instances regardless of whether they were active. This put heavy load on our control plane, which led to slower startup times.