Incident with Actions


Incident resolved in 7h1m31s

Resolved

On January 23, 2025, between 9:49 and 17:00 UTC, the available capacity of large hosted runners was degraded. On average, 26% of jobs requiring large runners saw a delay of more than 5 minutes before a runner was assigned. This was caused by the rollback of a configuration change combined with a latent bug in event processing, which was triggered by the mixed data shape that resulted from the rollback. The event processing would reprocess the same events unnecessarily, causing the background job that manages large runner creation and deletion to run out of resources. The job would automatically restart and continue processing, but it was not able to keep up with production traffic. We mitigated the impact by using a feature flag to bypass the problematic event processing logic. While these changes had been rolling out in stages over the last few months and had been safely rolled back before, an unrelated change had prevented rollback from triggering this problem at earlier stages.

We are reviewing and updating the feature flags in this event processing workflow to ensure that we have high confidence in rollback at all rollout stages. We are also improving observability of the event processing to reduce the time to diagnose and mitigate similar issues going forward.
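
For context on the feature-flag mitigation described above, the sketch below illustrates the general pattern of guarding an event-processing path behind a runtime flag so it can be bypassed without a deploy. It is a minimal illustration under assumed names only: `FeatureFlags`, `reconcile_large_runners`, `bypass_runner_event_processing`, and the event shape are all hypothetical and do not reflect GitHub's actual implementation.

```python
from typing import Iterable


class FeatureFlags:
    """Tiny in-memory stand-in for a dynamic feature-flag service (hypothetical)."""

    def __init__(self, flags: dict[str, bool]):
        self._flags = flags

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)


def apply_desired_state(event: dict) -> None:
    # Simplified fallback action: act directly on the event's desired state.
    print(f"applying desired state for runner group {event['group']}")


def process_runner_event(event: dict) -> None:
    # Stand-in for the normal (flagged) event-processing path.
    print(f"processing event {event['id']} for runner group {event['group']}")


def reconcile_large_runners(events: Iterable[dict], flags: FeatureFlags) -> int:
    """Background-job step that manages runner creation/deletion.

    When the bypass flag is on, skip the problematic event-processing path
    and fall back to a simpler pass over the desired state.
    """
    handled = 0
    for event in events:
        if flags.is_enabled("bypass_runner_event_processing"):
            apply_desired_state(event)
        else:
            process_runner_event(event)
        handled += 1
    return handled


if __name__ == "__main__":
    # Usage example: operators flip the flag at runtime to route around the bad path.
    flags = FeatureFlags({"bypass_runner_event_processing": True})
    events = [{"id": 1, "group": "large-8core"}, {"id": 2, "group": "large-16core"}]
    reconcile_large_runners(events, flags)
```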

Jan 23, 2025 - 17:27 UTC

Investigating

We are seeing recovery with the latest mitigation. Queue times for a very small percentage of larger runner jobs are still longer than expected, so we are monitoring those for full recovery before going green.

Jan 23, 2025 - 17:03 UTC

Investigating

We are actively applying mitigations to help improve larger runner start times. We are currently seeing delays in starting about 25% of larger runner jobs.

Jan 23, 2025 - 16:25 UTC

Investigating

We are still actively investigating a slowdown in larger runner assignment and are working to apply additional mitigations.

Jan 23, 2025 - 15:33 UTC

Investigating

We're still applying mitigations to unblock queueing of Actions jobs on large runners. We are monitoring for full recovery.

Jan 23, 2025 - 14:53 UTC

Investigating

We are applying further mitigations to fix the delayed queueing of Actions jobs on large runners. We continue to monitor for full recovery.

Jan 23, 2025 - 14:17 UTC

Investigating

We are investigating further mitigations for delays queueing Actions jobs on large runners. We continue to watch telemetry and are monitoring for full recovery.

Jan 23, 2025 - 13:42 UTC

Investigating

We've applied a mitigation to fix the issues with queueing and running Actions jobs. We are seeing improvements in telemetry and are monitoring for full recovery.

Jan 23, 2025 - 13:09 UTC

Investigating

The team continues to apply mitigations for delays in enqueueing some Actions jobs for larger runners, seen by a small number of customers. We will continue providing updates on the progress towards full mitigation.

Jan 23, 2025 - 12:36 UTC

Investigating

The team continues to apply mitigations for delays in enqueueing some Actions jobs for larger runners. We will continue providing updates on the progress towards full mitigation.

Jan 23, 2025 - 12:03 UTC

Investigating

The team continues to investigate delays in enqueueing some Actions jobs for larger runners. We will continue providing updates on the progress towards mitigation.

Jan 23, 2025 - 11:31 UTC

Investigating

The team continues to investigate delays in queueing some Actions jobs for larger runners. We will continue providing updates on the progress towards mitigation.

Jan 23, 2025 - 10:58 UTC

Investigating

We are investigating reports of degraded performance for Actions.

Jan 23, 2025 - 10:25 UTC