The Tempo service on cluster EU west experienced a traffic increase over the weekend, which caused an elevated error rate in Tempo's write path (ingestion). Our engineering team identified the root cause and implemented measures to mitigate and resolve the problem.
Trace ingestion problems may have occurred from 15:30 UTC on the 13th until 19:30 UTC on the 15th.
At approximately 12:00 UTC, a feature toggle was rolled out that negatively impacted instances on the slow release channel. Users on this release channel began to receive an "AlertStatesDataLayer" error. A workaround was quickly identified and applied for the users who reported the issue. The feature toggle in question was fully reverted by 18:00 UTC.
Due to scheduled maintenance (https://status.grafana.com/incidents/rz7nt6cs4prb), some users were unable to log in to their Grafana Cloud stacks. The issue affected only users who:
had no session already open in the Grafana Cloud stack;
or were located geographically close to Europe while their stack was closer to the US (or vice versa).
The issue was caused by an incorrect configuration introduced by the maintenance, which was fixed shortly after being discovered. Login is fully operational and stable now.