Outage due to DNS problems on AWS


Incident resolved in 4h44m10s

Resolved

This incident has been resolved.

1739887612

Update

Our engineering teams have applied fixes to tackle the issues, and all services on the AWS clusters should be operational.

We will continue monitoring them.

1739884811

Update

Update in affected components: Prometheus Metrics on us-east-0 and the services that depend on it are operational again.

1739884019

Update

Update in affected components: Synthetics Monitoring components are operational again; services depending on Prometheus Metrics on us-east-0 are still under investigation.

We are continuing to monitor all services across the AWS clusters. Some services may still show degraded performance until the infrastructure has fully stabilized.

1739882454

Update

We are continuing to work on a fix for this issue.

1739881384

Update

Update in affected components: IRM components have partially recovered, Oncall services are fully operational, and Incident services are recovering. Prometheus services are almost fully operational (monitoring recovery on us-east-0).

1739881313

Update

Update in affected components: Tempo services and Asserts services have been restored; alerting services have been partially restored.

We are currently monitoring all operational services.

1739880711

Update

We have identified the issue and are restoring most services to an operational state, including Loki services, Pyroscope services, and AI/ML services. We are monitoring these services.

1739879591

Investigating

Update in affected components: OTLP Endpoint and Graphite proxy for querying and ingesting are fully operational.

1739877880

Investigating

We are continuing to investigate this issue and are working to reestablish the service.

1739876277

Investigating

Update in affected components: Hosted Grafana instances (stacks) are operational.

1739873676

Investigating

Update in affected components: Grafana Cloud k6 (and legacy app.k6.io) are fully operational.

1739873207

Investigating

We are continuing to investigate this issue and determining the full impact.

1739872905

Investigating

Update in component scope: all our services running on AWS may be affected.

1739871853

Investigating

Update in component scope: all our services running on AWS may be affected.

1739871780

Investigating

We are continuing to investigate this issue.

1739871199

Investigating

We are currently experiencing an outage on our instances located on AWS due to DNS problems. We are actively working to reestablish the service and quantify the full impact of the issue. All our services running on this provider may be affected.

1739870562