Issues enabling actions and running jobs on GitHub


Incident resolved in 3h50m46s

Resolved

Beginning on July 18, 2024 at 22:38 UTC, network issues within an upstream provider led to degraded experiences across Actions, Copilot, and Pages services.

Up to 50% of Actions workflow jobs were stuck in the queuing state, including Pages deployments. Users were also not able to enable Actions or register self-hosted runners. This was caused by an unreachable backend resource in the Central US region. That resource is configured for geo-replication, but the replication configuration prevented resiliency when one region was unavailable. Updating the replication configuration mitigated the impact by allowing successful requests while one region was unavailable. By July 19 at 00:12 UTC, users saw some improvement in Actions jobs and full recovery of Pages. Standard hosted runners and self-hosted Actions workflows were healthy by 02:10 UTC, and large hosted runners fully recovered at 02:38 UTC.

Copilot requests were also impacted, with up to 2% of Copilot Chat requests and 0.5% of Copilot Completions requests resulting in errors. Chat requests were routed to other regions after 20 minutes, while Completions requests took 45 minutes to reroute.

We have identified improvements to detection to reduce the time to engage all impacted on-call teams, and improvements to our replication configuration and failover workflows to make them more resilient to unhealthy dependencies, reduce our time to failover, and mitigate customer impact.

1721356690

Investigating

Actions is operating normally.

1721356688

Investigating

We have continued to apply mitigations to work around the outage. Customers may still experience run start delays for larger runners.

1721355956

Investigating

We've applied a mitigation to work around the outage. Customers may still experience run start delays.

1721353821

Investigating

We are making progress failing over to a different region to mitigate an outage.

1721351088

Investigating

We continue to mitigate an outage by failing over to a different region.

1721349051

Investigating

Pages is operating normally.

1721348660

Investigating

We are working to mitigate an outage by failing over to a different region.

1721347028

Investigating

Pages is experiencing degraded performance. We are continuing to investigate.

1721344989

Investigating

Some Actions customers may experience delays or failures in their runs. We are continuing to investigate.

1721344925

Investigating

We are investigating reports of degraded performance for Actions.

1721342844
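
The bare numbers under each update appear to be Unix epoch timestamps in seconds. Assuming that reading is correct, the short Python sketch below converts the oldest and the final timestamps in this log to UTC and derives the elapsed time between them. The constant names and the `to_utc` helper are illustrative only and are not part of any GitHub tooling.

```python
from datetime import datetime, timezone

# Values copied from this log; assumed to be Unix epoch seconds.
FIRST_UPDATE = 1721342844  # oldest update ("We are investigating reports...")
RESOLVED = 1721356690      # timestamp under the "Resolved" summary


def to_utc(epoch_seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


start = to_utc(FIRST_UPDATE)
end = to_utc(RESOLVED)
duration = end - start

print(start.isoformat())  # 2024-07-18T22:47:24+00:00
print(end.isoformat())    # 2024-07-19T02:38:10+00:00
print(duration)           # 3:50:46
```

Under that assumption, the computed span matches the "Incident resolved in 3h50m46s" line at the top of the log, and the final timestamp lines up with the 02:38 UTC recovery of large hosted runners.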