Incident with Webhooks and Actions


Incident resolved in 3h52m44s

Resolved

On July 5, 2024, between 16:31 UTC and 18:08 UTC, the Webhooks service was degraded, delaying all webhook deliveries. Delivery delays averaged 24 minutes, with a maximum of 71 minutes. The cause was a configuration change to the Webhooks service that led to unauthenticated requests being sent to the background job cluster. We repaired the configuration error, and redeploying the service resolved the issue. However, this recovery created a thundering herd effect that overloaded the background job queue cluster and pushed its API layer to maximum capacity. That caused timeouts for other job clients, which presented as increased latency for API calls.

Shortly after resolving the authentication misconfiguration, we had a separate issue in the background job processing service where health probes were failing. This reduced capacity in the background job API layer and magnified the effects of the thundering herd. From 18:21 UTC to 21:14 UTC, Actions runs on PRs were delayed by approximately 2 minutes, with a maximum delay of 12 minutes. A deployment of the background job processing service remediated the issue.

To reduce our time to detection, we have streamlined our dashboards and added alerting for this specific runtime behavior. Additionally, we are working to reduce the blast radius of background job incidents through better workload isolation.

1720213053

Investigating

We are seeing recovery in Actions start times and are observing for any further impact.

1720212245

Investigating

We are still seeing about 5% of Actions runs taking longer than 5 minutes to start. We are scaling and shifting resources to speed up recovery.

1720211541

Investigating

We are still seeing about 5% of Actions runs taking longer than 5 minutes to start. We are evaluating mitigations that increase capacity and reduce latency.

1720209504

Investigating

We are seeing about 5% of Actions runs not starting within 5 minutes. We are continuing investigation.

1720207146

Investigating

We have seen recovery of Actions run delays. Keeping the incident open to monitor for full recovery.

1720204811

Investigating

Webhooks is operating normally.

1720203028

Investigating

We are seeing delays in Actions runs due to the ongoing recovery of webhook deliveries. We expect this to resolve as webhooks recover.

1720202943

Investigating

Actions is experiencing degraded performance. We are continuing to investigate.

1720202826

Investigating

We are seeing recovery as webhooks are being delivered again. We are burning down our queue of events. No events have been lost. New webhook deliveries will be delayed while this process recovers.

1720202238

Investigating

Webhooks is experiencing degraded performance. We are continuing to investigate.

1720202114

Investigating

We are reverting a configuration change that is suspected of contributing to the problem with webhook deliveries.

1720201341

Investigating

Our telemetry shows that most webhooks are failing to be delivered. We are queueing all undelivered webhooks and are working to remediate the problem.

1720200031

Investigating

Webhooks is experiencing degraded availability. We are continuing to investigate.

1720199870

Investigating

We are investigating reports of degraded performance for Webhooks.

1720199089
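
The bare numbers after each update appear to be Unix epoch timestamps in seconds. The report does not label them, so that reading is an assumption, though it is consistent with the stated resolution time. The following minimal sketch converts the first and last values to UTC and recomputes the incident duration. The function and constant names are illustrative and not part of the original report.

```python
from datetime import datetime, timezone

# Values copied from this report; treated as Unix epoch seconds (an assumption).
FIRST_UPDATE = 1720199089  # earliest "Investigating" update
RESOLVED = 1720213053      # "Resolved" update


def to_utc(epoch_seconds: int) -> str:
    """Format a Unix timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def format_duration(seconds: int) -> str:
    """Render a number of seconds in the h/m/s style used by the report header."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h{minutes}m{secs}s"


print(to_utc(FIRST_UPDATE))
print(to_utc(RESOLVED))
print(format_duration(RESOLVED - FIRST_UPDATE))
```

Under that assumption, the difference between the two values matches the "Incident resolved in 3h52m44s" line at the top of the report.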