On January 27, 2025, between 23:32 UTC and 23:41 UTC, the Audit Log Streaming service experienced an approximately nine-minute delay in the delivery of Audit Log Events. Our systems maintained data continuity and we experienced no data loss. There was no impact to the Audit Log API or the Audit Log user interface. All configured Audit Log Streaming endpoints received the relevant Audit Log Events, albeit delayed, and normal service resumed once the incident was resolved.
On January 23, 2025, between 9:49 and 17:00 UTC, the available capacity of large hosted runners was degraded. On average, 26% of jobs requiring large runners waited more than five minutes for a runner to be assigned. This was caused by the rollback of a configuration change combined with a latent bug in event processing, which was triggered by the mixed data shape that resulted from the rollback. The event processing repeatedly reprocessed the same events unnecessarily, causing the background job that manages large runner creation and deletion to run out of resources. The job would automatically restart and continue processing, but it was unable to keep up with production traffic. We mitigated the impact by using a feature flag to bypass the problematic event processing logic. While these changes had been rolling out in stages over the last few months and had been safely rolled back previously, an unrelated change had prevented rollbacks in earlier stages from triggering this problem. We are reviewing and updating the feature flags in this event processing workflow to ensure that we have high confidence in rollback at all rollout stages. We are also improving observability of the event processing to reduce the time to diagnose and mitigate similar issues going forward.
On January 16, 2025, between 00:45 UTC and 09:40 UTC, the Pull Requests service was degraded and failed to generate rebase merge commits. This was due to a configuration change that introduced disagreements between replicas. These disagreements caused a secondary job to run, triggering timeouts while computing rebase merge commits. We mitigated the incident by rolling back the configuration change. We are working on improving our monitoring and deployment practices to reduce our time to detection and mitigation of issues like this one in the future.
On January 14, 2025, between 19:13 UTC and 21:21 UTC, the Codespaces service was degraded, leading to connection failures with running codespaces; 7.6% of connections failed during the degradation. Users with bad connections could not use impacted codespaces until they were stopped and restarted. This was caused by stale connections to an upstream dependency, left behind after a deployment of that dependency, which the Codespaces service continued to provide to clients. The incident self-mitigated as new connections replaced the stale ones. We are coordinating to ensure connection stability with future deployments of this nature.
On January 13, 2025, between 23:35 UTC and 00:24 UTC, all Git operations were unavailable due to a configuration change that caused our internal load balancer to drop requests between services that Git relies upon. We mitigated the incident by rolling back the configuration change. We are improving our monitoring and deployment practices to reduce our time to detection and automated mitigation of issues like this in the future.
On January 9, 2025, larger hosted runners configured with Azure private networking in East US 2 were degraded, causing delayed job starts for ~2,300 jobs between 16:00 and 20:00 UTC. There was also an earlier period of impact, from 2025-01-08 22:00 UTC to 2025-01-09 04:10 UTC, affecting 488 jobs. Both delays were caused by an Azure incident in East US 2 that impacted provisioning and network connectivity of Azure resources. More details on that incident are available at https://azure.status.microsoft/en-us/status/history (Tracking ID: PLP3-1W8). Because these runners rely on private networking with networks in the East US 2 region, there were no immediate mitigations available other than restoring network connectivity. Going forward, we will continue evaluating options to provide better resilience to third-party regional outages that affect private networking customers.
On January 9, 2025, between 06:26 and 07:49 UTC, Actions experienced degraded performance, leading to failures in about 1% of workflow runs across ~10k repositories. The failures occurred due to an outage in a dependent service, which disrupted Redis connectivity in the East US 2 region. We mitigated the incident by re-routing Redis traffic out of that region at 07:49 UTC. We continued to monitor service recovery before resolving the incident at 08:30 UTC. We are working to improve our monitoring to reduce our time to detection and mitigation of issues like this one in the future.
On January 9, 2025, between 01:26 UTC and 01:56 UTC, GitHub experienced widespread disruption across many services, with users receiving 500 responses when trying to access various functionality. This was due to a deployment that introduced a query which saturated a primary database server. On average, the error rate was 6%, peaking at 6.85% of update requests. We mitigated the incident by identifying the source of the problematic query and rolling back the deployment. We are investigating methods to detect problematic queries prior to deployment in order to prevent issues like this one, and to reduce our time to detection and mitigation when they do occur.
On January 7, 2025, between 11:54 and 16:39 UTC, degraded performance was observed in Actions, Webhooks, and Issues, caused by an internal Certificate Authority configuration change that disrupted our event infrastructure. The configuration issue was promptly identified and resolved by rolling back the change on impacted hosts and re-issuing certificates. We have identified which services need updates to support the current PKI architecture and are working on implementing those changes to prevent a future recurrence.
On January 2, 2025, between 16:00:00 and 22:27:30 UTC, a bug in feature-flagged code that cleans up pull requests after they are closed or merged caused the merge commit SHA to be incorrectly cleared for ~139,000 pull requests. During the incident, Actions workflows triggered by the on: pull_request trigger for the closed activity type were not queued successfully because of these missing merge commit SHAs. Approximately 45,000 repositories experienced these missing workflow triggers in one of two scenarios: pull requests that were closed but not merged, and pull requests that were merged. Impact was mitigated by rolling back the aforementioned feature flag. Merged pull requests that were affected have had their merge commit SHAs restored. Closed pull requests have not had their merge commit SHAs restored; however, customers can re-open and close them again to recalculate this SHA. We are investigating methods to improve detection of these kinds of errors in the future.
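For context, the sketch below is a minimal, hypothetical example of a workflow that uses the affected trigger; workflows of this shape were the ones that failed to queue while the merge commit SHA was missing. The workflow name and step are illustrative and not taken from the report.

```yaml
# Hypothetical workflow using the trigger affected by this incident:
# it runs when a pull request is closed, whether merged or not.
name: pr-closed-cleanup
on:
  pull_request:
    types: [closed]

jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      # merge_commit_sha is the payload field that was incorrectly cleared
      # for affected pull requests during the incident.
      - run: echo "Merge commit SHA: ${{ github.event.pull_request.merge_commit_sha }}"
```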