At approximately 13:50 UTC, a new deployment pointed pinned instances to a non-existent plugin version. This caused new pods for those instances to fail on startup. Because older pods were still running, users should not have experienced any disruption.
Hosted Grafana experienced degraded performance on the email notification service today, 15 July, from 11:45 UTC to 13:00 UTC. This affected the sending and receiving of emails used by modules such as alerting and reporting. Service was restored shortly after the issue was identified.
Engineering has released a fix, and as of 22:20 UTC customers should no longer experience any issues. At this time, we consider this issue resolved. No further updates.
The incident was caused by multiple ingesters being unavailable at the same time while ingester pods were being moved between nodes. This is a regular operation, but in this case one ingester took an unexpectedly long time to restart, and its restart overlapped with that of another ingester, leaving both unavailable at once.