Incident with Pages and Actions


Incident resolved in 37m37s

Resolved

On September 16, 2024, between 21:11 UTC and 22:20 UTC, Actions and Pages services were degraded. Customers who deploy Pages from a source branch experienced delayed runs. Approximately 1,100 runs were delayed long enough to get marked as abandoned. The runs that weren't abandoned completed successfully after we recovered from the incident. Actions jobs experienced average delays of 23 minutes, with some jobs experiencing delays as high as 45 minutes. During the course of the incident, 17% of runs were delayed by more than 5 minutes. At peak, as many as 80% of runs experienced delays exceeding 5 minutes. The root cause was a misconfiguration in the service that manages runner connections, which caused CPU throttling and led to a performance degradation in that service.We mitigated the incident by diverting runner connections away from the misconfigured nodes. We are working to improve our internal monitoring and alerting to reduce our time to detection and mitigation of issues like this one in the future.

1726524519

Investigating

Actions is experiencing degraded performance. We are continuing to investigate.

1726523728

Investigating

The team is investigating issues with some Actions jobs being queued for a long time and a percentage of jobs failing. A mitigation has been applied and jobs are starting to recover.

1726523628

Investigating

Pages is operating normally.

1726523564

Investigating

Actions is experiencing degraded availability. We are continuing to investigate.

1726522646

Investigating

We are investigating reports of degraded performance for Actions and Pages

1726522262