From approximately 16:30 to 8:15 UTC, a configuration change inadvertently removed a required headless service for hosted traces in one of our production regions. This caused elevated error rates and increased service-level objective (SLO) burn on the trace ingestion path. The underlying issue was a mismatch in internal configuration references following a prior migration. Re-enabling the headless service restored normal operation.
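For readers unfamiliar with the term, a headless service is a Kubernetes Service with no cluster IP: clients resolve it directly to the backing pod addresses through DNS, so removing it silently breaks discovery for anything that depends on it. The sketch below shows, using the official Kubernetes Python client, the kind of post-rollout check that would catch an accidentally removed headless service; the service and namespace names are hypothetical and not taken from this incident.

```python
from kubernetes import client, config


def headless_service_present(name: str, namespace: str) -> bool:
    """Return True if the named Service exists and is headless (clusterIP: None)."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    v1 = client.CoreV1Api()
    try:
        svc = v1.read_namespaced_service(name=name, namespace=namespace)
    except client.exceptions.ApiException as exc:
        if exc.status == 404:
            return False  # the service is gone, e.g. removed by a bad config change
        raise
    # Headless services report their cluster IP as the literal string "None".
    return svc.spec.cluster_ip == "None"


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    if not headless_service_present("trace-ingest-headless", "tracing"):
        raise SystemExit("required headless service is missing or not headless")
```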
We now consider the incident resolved. Regarding the cause: a slow physical partition of the backend database used by the control plane of a critical component caused increased latency and occasional overload, which in turn led to failures in the write path. Once writes were switched to a different partition, latency dropped and the error rate returned to normal.
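As a simplified illustration of the remediation described above, the sketch below shows one generic way a write path can route around a partition whose latency has degraded. All names, thresholds, and the partition abstraction are hypothetical and do not describe the actual system, which relies on its own control-plane failover.

```python
import time


class PartitionRouter:
    """Illustrative sketch: move writes off a partition whose write latency is too high."""

    def __init__(self, partitions, latency_threshold_s=0.5):
        self.partitions = partitions            # callables, each performing a write against one partition
        self.latency_threshold_s = latency_threshold_s
        self.active = 0                         # index of the partition currently taking writes

    def write(self, record):
        start = time.monotonic()
        try:
            self.partitions[self.active](record)
        finally:
            elapsed = time.monotonic() - start
            if elapsed > self.latency_threshold_s:
                # Fail over: send subsequent writes to the next partition.
                self.active = (self.active + 1) % len(self.partitions)
```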