On October 24, 2024 at 06:55 UTC, a syntactically correct but invalid discussion template YAML config file was committed in the community/community repository. This caused all users of that repository who tried to access a discussion template or attempted to create a discussion to receive a 500 error response. We mitigated the incident by manually reverting the invalid template changes. We are adding support to detect and prevent invalid discussion template YAML from causing user-facing errors in the future.
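For illustration, a pre-merge check along the lines below can catch this class of problem before it reaches production. This is a sketch, not the validation we are building, and the required keys are assumptions for the example rather than the exact discussion template schema.

```python
# Hypothetical pre-merge validator for a discussion template YAML file.
# The required keys below are assumptions for illustration, not GitHub's exact schema.
import sys
import yaml  # PyYAML

REQUIRED_TOP_LEVEL_KEYS = {"title", "body"}

def validate_template(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks valid."""
    try:
        with open(path) as f:
            data = yaml.safe_load(f)
    except yaml.YAMLError as exc:
        return [f"not parseable as YAML: {exc}"]

    if not isinstance(data, dict):
        return ["top level must be a mapping"]

    problems = []
    missing = REQUIRED_TOP_LEVEL_KEYS - data.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if not isinstance(data.get("body"), list):
        problems.append("'body' must be a list of form elements")
    return problems

if __name__ == "__main__":
    issues = validate_template(sys.argv[1])
    if issues:
        print("\n".join(issues))
        sys.exit(1)  # fail the check so the invalid template never reaches production
```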
On October 11, 2024, starting at 05:59 UTC, DNS infrastructure in one of our sites started to fail to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the DNS systems for that site. The team worked to restore the infrastructure, and there was no customer impact until 17:31 UTC. During the incident, impact to the following services could be observed:

- Copilot: Degradation in IDE code completions for 4% of active users from 17:31 UTC to 21:45 UTC.
- Actions: Workflow run delays (25% of runs delayed by over 5 minutes) and errors (1%) between 20:28 UTC and 21:30 UTC, as well as errors while creating Artifact Attestations.
- Customer migrations: From 18:16 UTC to 23:12 UTC, running migrations stopped and new ones were not able to start.
- Support: support.github.com was unavailable from 19:28 UTC to 22:14 UTC.
- Code search: 100% of queries failed between 2024-10-11 20:16 UTC and 2024-10-12 00:46 UTC.

Starting at 18:05 UTC, engineering attempted to repoint the degraded site's DNS to a different site to restore DNS functionality. At 18:26 UTC the test system had validated this approach, and a progressive rollout to the affected hosts proceeded over the next hour. While this mitigation was effective at restoring connectivity within the site, it caused issues with connectivity from healthy sites back to the degraded site, so the team proceeded to plan a different remediation effort.

At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by deploying temporary DNS resolution capabilities to the degraded site. At 21:46 UTC, DNS resolution in the degraded site began to recover, and it was fully healthy at 22:16 UTC. Lingering issues with code search were resolved at 01:11 UTC on October 12.

The team continued to restore the original functionality within the site after public service functionality was restored. GitHub is working to harden our resiliency and automation processes around this infrastructure to make diagnosing and resolving issues like this faster in the future.
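As an illustration of the kind of automation that speeds up diagnosing resolver failures like this, below is a minimal per-resolver health-probe sketch; the resolver addresses and probe names are placeholders, and it uses the third-party dnspython package rather than any internal tooling.

```python
# Minimal per-resolver health probe (sketch). Resolver IPs and probe names are placeholders.
import dns.exception
import dns.resolver  # third-party: dnspython

RESOLVERS = ["10.0.0.53", "10.0.1.53"]             # hypothetical site resolvers
PROBE_NAMES = ["github.com", "example.internal"]   # hypothetical probe targets

def resolver_healthy(address: str, timeout: float = 2.0) -> bool:
    """Return True only if this resolver answers all probe lookups within the timeout."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [address]
    r.lifetime = timeout
    try:
        for name in PROBE_NAMES:
            r.resolve(name, "A")
        return True
    except (dns.exception.Timeout, dns.resolver.NXDOMAIN,
            dns.resolver.NoNameservers, dns.resolver.NoAnswer):
        return False

if __name__ == "__main__":
    for addr in RESOLVERS:
        status = "ok" if resolver_healthy(addr) else "FAILING"
        print(f"{addr}: {status}")
```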
Between 15:26 UTC and 15:34 UTC on September 27, 2024, the Repositories Releases service was degraded. During this time, 9% of requests to list releases via the API or the webpage received a 500 Internal Server Error.
This was due to a bug in our software rollout strategy. The rollout was reverted starting at 15:30 UTC, which began to restore functionality, and the rollback was completed at 15:34 UTC.
We are continuing to improve our testing infrastructure to ensure that bugs such as this one can be detected before they make their way into production.
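As one illustration of pre-production detection, a canary stage can gate a rollout on the observed server error rate and trigger a rollback automatically. The sketch below is hypothetical: the metrics hook is a placeholder, and this is not our actual deployment tooling.

```python
# Illustrative canary gate: abort a rollout if the canary's 5xx rate exceeds a threshold.
# get_error_rate() is a hypothetical hook into whatever metrics store the pipeline uses.
import time

ERROR_RATE_THRESHOLD = 0.01     # 1% server errors
OBSERVATION_WINDOW_SECS = 300   # watch the canary for five minutes before promoting

def get_error_rate(deployment: str) -> float:
    """Placeholder: return the fraction of requests returning 5xx for this deployment."""
    raise NotImplementedError

def canary_gate(deployment: str) -> bool:
    """Return True if the canary stays healthy for the full observation window."""
    deadline = time.monotonic() + OBSERVATION_WINDOW_SECS
    while time.monotonic() < deadline:
        if get_error_rate(deployment) > ERROR_RATE_THRESHOLD:
            return False   # roll back instead of promoting
        time.sleep(30)
    return True
```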
On September 30th, 2024 from 10:43 UTC to 11:26 UTC, Codespaces customers in the Central India region were unable to create new Codespaces. Resumes were not impacted. Additionally, there was no impact to customers in other regions. The cause was traced to storage capacity constraints in the region and was mitigated by temporarily redirecting create requests to other regions. Afterwards, additional storage capacity was added to the region and traffic was routed back. A bug was also identified that caused some available capacity to not be utilized, artificially constraining capacity and halting creations in the region prematurely. We have since fixed this bug as well, so that available capacity scales as expected according to our capacity planning projections.
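As a sketch of the redirect behavior described above, create-request placement can fall back to other regions when the home region reports no usable capacity. The region names and the capacity lookup below are hypothetical, not our production placement logic.

```python
# Sketch of region fallback for create requests. Region names and capacity data are hypothetical.
FALLBACK_ORDER = {
    "central-india": ["south-india", "southeast-asia"],
}

def available_capacity(region: str) -> int:
    """Placeholder: return how many new codespaces the region can currently host.
    The bug described above amounted to this value under-reporting real capacity."""
    raise NotImplementedError

def pick_region(home_region: str) -> str:
    """Prefer the home region, then fall back in order until a region has capacity."""
    for region in [home_region, *FALLBACK_ORDER.get(home_region, [])]:
        if available_capacity(region) > 0:
            return region
    raise RuntimeError("no region with available capacity")
```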
Between September 25, 2024, 22:20 UTC and September 26, 2024, 05:00 UTC, the Copilot service was degraded. During this time Copilot chat requests failed at an average rate of 15%. This was due to a faulty deployment in a service provider that caused server errors from multiple regions. Traffic was routed away from those regions at 22:28 UTC and 23:39 UTC, which partially restored functionality, while the upstream service provider rolled back their change. The rollback was completed at 04:41 UTC. We are continuing to improve our ability to respond more quickly to similar issues through faster regional redirection and working with our upstream provider on improved monitoring.
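The faster regional redirection mentioned above can be approximated by continuously removing regions from the routing pool when their upstream error rate crosses a threshold. The sketch below uses a placeholder metrics hook and illustrative values; it is not the routing logic we run.

```python
# Sketch: drop regions from the routing pool when their upstream error rate is too high.
# The metrics hook and the threshold are placeholders for illustration.
ERROR_RATE_LIMIT = 0.05  # 5%; chat errors during the incident averaged well above this

def upstream_error_rate(region: str) -> float:
    """Placeholder for a metrics query against the provider-facing service in this region."""
    raise NotImplementedError

def healthy_regions(regions: list[str]) -> list[str]:
    pool = [r for r in regions if upstream_error_rate(r) <= ERROR_RATE_LIMIT]
    # Never empty the pool entirely; a degraded region is still better than serving nothing.
    return pool or regions
```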
On September 25th, 2024 from 18:32 UTC to 19:13 UTC, the Actions service experienced a degradation during a production deployment, leading to actions failing to be downloaded at the start of a job. On average, 21% of Actions workflow runs failed to start during the course of the incident. The issue was traced back to a bug in an internal service responsible for generating the URLs used by the Actions runner to download actions. To mitigate the impact, we rolled back the offending deployment. We are implementing new monitors to improve our detection and response time for this class of issues in the future.
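One way to catch this class of failure quickly is a synthetic monitor that fetches an action archive the same way a runner would and alerts on non-success responses. The probe URL below is a placeholder, not the internal URL-generation service, and the sketch is illustrative only.

```python
# Synthetic probe (sketch): verify that an action archive URL is still downloadable.
# The probe URL is a placeholder; a real check would exercise the URL-generation path end to end.
import urllib.error
import urllib.request

PROBE_URL = "https://example.com/actions/checkout/v4.tar.gz"  # hypothetical

def probe_download_url(url: str, timeout: float = 10.0) -> bool:
    """Issue a HEAD request and report whether the URL answers with a 2xx status."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    if not probe_download_url(PROBE_URL):
        print("ALERT: action download probe failed")  # page the on-call in a real monitor
```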
On September 25, 2024 from 14:31 UTC to 15:06 UTC, the Git Operations service experienced a degradation, leading to 1,381,993 failed git operations. The overall error rate during this period was 4.2%, with a peak error rate of 12.5%. The root cause was traced to a bug in a build script for a component that runs on the file servers that host git repository data. The build script incurred an error that did not cause the overall build process to fail, resulting in a faulty set of artifacts being deployed to production. To mitigate the impact, we rolled back the offending deployment. To prevent recurrences, we will address the underlying cause of the ignored build failure and improve metrics and alerting for the resulting production failure scenarios.
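The failure mode here, a build sub-step erroring without failing the overall build, is commonly avoided by making every step's exit status fatal. A minimal sketch, assuming a build driver written in Python with illustrative step commands (the actual build script's language and steps are not described above):

```python
# Sketch of a build driver where any failing step aborts the whole build,
# so partial or faulty artifacts can never be published. Step commands are illustrative.
import subprocess
import sys

BUILD_STEPS = [
    ["make", "component"],
    ["make", "package"],
]

def run_build() -> None:
    for step in BUILD_STEPS:
        # check=True raises CalledProcessError on a non-zero exit instead of silently continuing.
        subprocess.run(step, check=True)

if __name__ == "__main__":
    try:
        run_build()
    except subprocess.CalledProcessError as exc:
        print(f"build step failed: {exc}", file=sys.stderr)
        sys.exit(1)  # fail the overall build so nothing gets deployed
```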
On September 24th, 2024 from 08:20 UTC to 09:04 UTC, the Codespaces service experienced an interruption in network connectivity, leading to 175 codespaces failing to be created or resumed. The overall error rate during this period was 25%. The cause was traced to SNAT port exhaustion following a deployment, which caused individual Codespaces to lose their connection to the service. To mitigate the impact, we increased port allocations to give enough buffer for increased outbound connections shortly after deployments. We will also be scaling up our outbound connectivity in the near future and adding improved monitoring of network capacity to prevent future regressions.
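A rough way to monitor for this kind of exhaustion is to track outbound connections per host against the allocated SNAT port budget and alert before the buffer runs out. The allocation size and the connection-count hook below are assumptions for illustration.

```python
# Sketch: alert when outbound connection count approaches the SNAT port allocation.
# The allocation size and the way connections are counted are assumptions for illustration.
SNAT_PORTS_ALLOCATED = 1024
ALERT_UTILIZATION = 0.8  # warn at 80% so there is buffer for post-deploy connection spikes

def count_outbound_connections() -> int:
    """Placeholder: e.g. count established outbound flows from `ss -tan` or a platform metric."""
    raise NotImplementedError

def snat_pressure_alert() -> bool:
    """Return True when port utilization crosses the alert threshold."""
    used = count_outbound_connections()
    return used / SNAT_PORTS_ALLOCATED >= ALERT_UTILIZATION
```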
On September 16, 2024, between 21:11 UTC and 22:20 UTC, Actions and Pages services were degraded. Customers who deploy Pages from a source branch experienced delayed runs. Approximately 1,100 runs were delayed long enough to get marked as abandoned. The runs that weren't abandoned completed successfully after we recovered from the incident. Actions jobs experienced average delays of 23 minutes, with some jobs experiencing delays as high as 45 minutes. During the course of the incident, 17% of runs were delayed by more than 5 minutes. At peak, as many as 80% of runs experienced delays exceeding 5 minutes. The root cause was a misconfiguration in the service that manages runner connections, which caused CPU throttling and led to a performance degradation in that service. We mitigated the incident by diverting runner connections away from the misconfigured nodes. We are working to improve our internal monitoring and alerting to reduce our time to detection and mitigation of issues like this one in the future.
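CPU throttling of the kind described above is visible in the kernel's cgroup accounting, which a monitor can sample directly. A minimal sketch assuming cgroup v2; the file path can differ per host or container, and the alerting logic around it is illustrative.

```python
# Sketch: detect CPU throttling from cgroup v2 accounting (path may differ per host/container).
CPU_STAT_PATH = "/sys/fs/cgroup/cpu.stat"  # cgroup v2; an assumption about the host's layout

def read_throttled_usec(path: str = CPU_STAT_PATH) -> int:
    """Return cumulative microseconds the cgroup has spent throttled."""
    with open(path) as f:
        stats = dict(line.split() for line in f)
    return int(stats.get("throttled_usec", 0))

def throttling_increased(previous_usec: int) -> tuple[bool, int]:
    """Compare against the last sample; a growing counter means the service is being throttled."""
    current = read_throttled_usec()
    return current > previous_usec, current
```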