Issues enabling actions and running jobs on GitHub
Resolved
Beginning on July 18, 2024 at 22:38 UTC, network issues within an upstream provider led to degraded experiences across Actions, Copilot, and Pages services.

Up to 50% of Actions workflow jobs were stuck in the queuing state, including Pages deployments. Users were also not able to enable Actions or register self-hosted runners. This was caused by an unreachable backend resource in the Central US region. That resource is configured for geo-replication, but the replication configuration prevented resiliency when one region was unavailable. Updating the replication configuration mitigated the impact by allowing successful requests while one region was unavailable. By July 19 at 00:12 UTC, users saw some improvement in Actions jobs and full recovery of Pages. Standard hosted runners and self-hosted Actions workflows were healthy by 02:10 UTC, and large hosted runners fully recovered at 02:38 UTC.

Copilot requests were also impacted, with up to 2% of Copilot Chat requests and 0.5% of Copilot Completions requests resulting in errors. Chat requests were routed to other regions after 20 minutes, while Completions requests took 45 minutes to reroute. We have identified improvements to detection that will reduce the time to engage all impacted on-call teams, as well as improvements to our replication configuration and failover workflows that will make them more resilient to unhealthy dependencies and reduce our time to fail over and mitigate customer impact.
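The resiliency goal described above, where requests keep succeeding while one region is unavailable, can be illustrated with a minimal sketch. This is not GitHub's implementation: the function `read_with_failover`, the `RegionUnavailable` exception, the simulated `fake_fetch` backend, and the region identifiers (`centralus` standing in for the Central US region, `eastus` for a hypothetical healthy replica) are all assumptions made for illustration.

```python
from typing import Callable, Sequence


class RegionUnavailable(Exception):
    """Raised by a hypothetical regional backend that cannot be reached."""


def read_with_failover(regions: Sequence[str], fetch: Callable[[str], str]) -> str:
    """Try each region in order and return the first successful response."""
    errors = []
    for region in regions:
        try:
            return fetch(region)
        except RegionUnavailable as exc:
            # Record the failure and fall through to the next replica.
            errors.append(f"{region}: {exc}")
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


def fake_fetch(region: str) -> str:
    # Simulated backend (assumption): one region is unreachable,
    # the replica in the other region is healthy.
    if region == "centralus":
        raise RegionUnavailable("unreachable")
    return f"ok from {region}"


result = read_with_failover(["centralus", "eastus"], fake_fetch)
print(result)
```

In this sketch the request succeeds from the second region even though the first is unreachable. A configuration that only ever consulted the first region would fail every request for the duration of the outage.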
Investigating
Actions is operating normally.
Investigating
We have continued to apply mitigations to work around the outage. Customers may still experience run start delays for larger runners.
Investigating
We've applied a mitigation to work around the outage. Customers may still experience run start delays.
Investigating
We are making progress failing over to a different region to mitigate an outage.
Investigating
We continue to mitigate an outage by failing over to a different region.
Investigating
Pages is operating normally.
Investigating
We are working to mitigate an outage by failing over to a different region.
Investigating
Pages is experiencing degraded performance. We are continuing to investigate.
Investigating
Some Actions customers may experience delays or failures in their runs. We are continuing to investigate.
Investigating
We are investigating reports of degraded performance for Actions.