Incident with Webhooks
We are experiencing latency on the API and UI endpoints. We are working to resolve the issue.
We are experiencing latency on the API and UI endpoints. We are working to resolve the issue.
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
On Mar 5, 2026, between 16:24 UTC and 19:30 UTC, Actions was degraded. During this time, 95% of workflow runs failed to start within 5 minutes with an average delay of 30 minutes and 10% workflow runs failed with an infrastructure error. This was due to Redis infrastructure updates that were being rolled out to production to improve our resiliency. These changes introduced a set of incorrect configuration change into our Redis load balancer causing internal traffic to be routed to an incorrect host leading to two incidents. We mitigated this incident by correcting the misconfigured load balancer. Actions jobs were running successfully starting at 17:24 UTC. The remaining time until we closed the incident was burning through the queue of jobs. We immediately rolled back the updates that were a contributing factor and have frozen all changes in this area until we have completed follow-up work from this. We are working to improve our automation to ensure incorrect configuration changes are not able to propagate through our infrastructure. We are also working on improved alerting to catch misconfigured load balancers before it becomes an incident. Additionally, we are updating the Redis client configuration in Actions to improve resiliency to brief cache interruptions.
On March 5, 2026, between 12:53 UTC and 13:35 UTC, the Copilot mission control service was degraded. This resulted in empty responses returned for users' agent session lists across GitHub web surfaces. Impacted users were unable to see their lists of current and previous agent sessions in GitHub web surfaces. This was caused by an incorrect database query that falsely excluded records that have an absent field.We mitigated the incident by rolling back the database query change. There were no data alterations nor deletions during the incident.To prevent similar issues in the future, we're improving our monitoring depth to more easily detect degradation before changes are fully rolled out.
This incident has been resolved. Thank you for your patience and understanding as we addressed this issue. A detailed root cause analysis will be shared as soon as it is available.
On March 3, 2026, between 19:44 UTC and 21:05 UTC, some GitHub Copilot users reported that the Claude Opus 4.6 Fast model was no longer available in their IDE model selection. After investigation, we confirmed that this was caused by enterprise administrators adjusting their organization's model policies, which correctly removed the model for users in those organizations. No users outside the affected organizations lost access.We confirmed that the Copilot settings were functioning as designed, and all expected users retained access to the model. The incident was resolved once we verified that the change was intentional and no platform regression had occurred.
On March 3, 2026, between 18:46 UTC and 20:09 UTC, GitHub experienced a period of degraded availability impacting GitHub.com, the GitHub API, GitHub Actions, Git operations, GitHub Copilot, and other dependent services. At the peak of the incident, GitHub.com request failures reached approximately 40%. During the same period, approximately 43% of GitHub API requests failed. Git operations over HTTP had an error rate of approximately 6%, while SSH was not impacted. GitHub Copilot requests had an error rate of approximately 21%. GitHub Actions experienced less than 1% impact. This incident shared the same underlying cause as an incident in early February where we saw a large volume of writes to the user settings caching mechanism. While deploying a change to reduce the burden of these writes, a bug caused every user’s cache to expire, get recalculated, and get rewritten. The increased load caused replication delays that cascaded down to all affected services. We mitigated this issue by immediately rolling back the faulty deployment. We understand these incidents disrupted the workflows of developers. While we have made substantial, long-term investments in how GitHub is built and operated to improve resilience, we acknowledge we have more work to do. Getting there requires deep architectural work that is already underway, as well as urgent, targeted improvements. We are taking the following immediate steps: - We have added a killswitch and improved monitoring to the caching mechanism to ensure we are notified before there is user impact and can respond swiftly. - We are moving the cache mechanism to a dedicated host, ensuring that any future issues will solely affect services that rely on it.
Between March 2, 21:42 UTC and March 3, 05:54 UTC project board updates, including adding new issues, PRs, and draft items to boards, were delayed from 30 minutes to over 2 hours, as a large backlog of messages accumulated in the Projects data denormalization pipeline.The incident was caused by an anomalously large event that required longer processing time than expected. Processing this message exceeded the Kafka consumer heartbeat timeout, triggering repeated consumer group rebalances. As a result, the consumer group was unable to make forward progress, creating head-of-line blocking that delayed processing of subsequent project board updates.We mitigated the issue by deploying a targeted fix that safely bypassed the offending message and allowed normal message consumption to resume. Consumer group stability recovered at 04:10 UTC, after which the backlog began draining. All queued messages were fully processed by 05:53 UTC, returning project board updates to normal processing latency.We have identified several follow-up improvements to reduce the likelihood and impact of similar incidents in the future, including improved monitoring and alerting, as well as introducing limits for unusually large project events.