On March 5, 2026, between approximately 00:26 and 00:44 UTC, the Copilot service experienced degraded availability of the GPT 3.5 Codex model due to an issue with our upstream provider. Users encountered elevated error rates when using GPT 3.5 Codex, impacting approximately 30% of requests. No other models were impacted. The issue was resolved by a mitigation put in place by our provider.
On March 3, 2026, between 19:44 UTC and 21:05 UTC, some GitHub Copilot users reported that the Claude Opus 4.6 Fast model was no longer available in their IDE model selection. After investigation, we confirmed that this was caused by enterprise administrators adjusting their organization's model policies, which correctly removed the model for users in those organizations. No users outside the affected organizations lost access. We confirmed that Copilot settings were functioning as designed and that all expected users retained access to the model. The incident was resolved once we verified that the change was intentional and no platform regression had occurred.
On March 3, 2026, between 18:46 UTC and 20:09 UTC, GitHub experienced a period of degraded availability impacting GitHub.com, the GitHub API, GitHub Actions, Git operations, GitHub Copilot, and other dependent services. At the peak of the incident, GitHub.com request failures reached approximately 40%, and approximately 43% of GitHub API requests failed. Git operations over HTTP had an error rate of approximately 6%, while SSH was not impacted. GitHub Copilot requests had an error rate of approximately 21%, and GitHub Actions experienced less than 1% impact.

This incident shared the same underlying cause as an incident in early February, where we saw a large volume of writes to the user settings caching mechanism. While deploying a change to reduce the burden of these writes, a bug caused every user's cache to expire, be recalculated, and be rewritten. The increased load caused replication delays that cascaded down to all affected services. We mitigated this issue by immediately rolling back the faulty deployment.

We understand these incidents disrupted the workflows of developers. While we have made substantial, long-term investments in how GitHub is built and operated to improve resilience, we acknowledge we have more work to do. Getting there requires deep architectural work that is already underway, as well as urgent, targeted improvements. We are taking the following immediate steps:

- We have added a killswitch and improved monitoring to the caching mechanism to ensure we are notified before there is user impact and can respond swiftly.
- We are moving the cache mechanism to a dedicated host, ensuring that any future issues will solely affect services that rely on it.
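A killswitch of the kind described above is typically a runtime flag that lets operators shed cache writes without a deploy. The sketch below is a minimal illustration under assumed names (`FlagStore`, `settings_cache_writes`); GitHub's internal flagging mechanism is not public.

```python
# Minimal killswitch sketch: cache writes pass through a runtime flag
# that operators can flip without redeploying. All names are hypothetical.
class FlagStore:
    def __init__(self):
        self._flags = {"settings_cache_writes": True}

    def enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

    def disable(self, name: str) -> None:
        self._flags[name] = False

def write_settings_cache(flags: FlagStore, cache: dict,
                         user_id: int, settings: dict) -> bool:
    """Write to the cache only while the killswitch flag is enabled."""
    if not flags.enabled("settings_cache_writes"):
        return False  # writes shed; readers fall back to the database
    cache[user_id] = settings
    return True

flags, cache = FlagStore(), {}
write_settings_cache(flags, cache, 1, {"theme": "dark"})   # accepted
flags.disable("settings_cache_writes")                     # operator flips the switch
write_settings_cache(flags, cache, 2, {"theme": "light"})  # shed; cache unchanged
```

The value of the pattern is that disabling writes is a single flag flip rather than a rollback, which shortens time to mitigation when the cache itself is the source of load.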
Between March 2, 21:42 UTC and March 3, 05:54 UTC, project board updates, including adding new issues, PRs, and draft items to boards, were delayed by 30 minutes to over 2 hours as a large backlog of messages accumulated in the Projects data denormalization pipeline. The incident was caused by an anomalously large event that required longer processing time than expected. Processing this message exceeded the Kafka consumer heartbeat timeout, triggering repeated consumer group rebalances. As a result, the consumer group was unable to make forward progress, creating head-of-line blocking that delayed processing of subsequent project board updates. We mitigated the issue by deploying a targeted fix that safely bypassed the offending message and allowed normal message consumption to resume. Consumer group stability recovered at 04:10 UTC, after which the backlog began draining. All queued messages were fully processed by 05:53 UTC, returning project board updates to normal processing latency. We have identified several follow-up improvements to reduce the likelihood and impact of similar incidents in the future, including improved monitoring and alerting, as well as limits for unusually large project events.
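The head-of-line blocking described above, and the bypass that mitigated it, can be sketched without a real Kafka broker. This is an illustrative simulation only: the size threshold and message shape are assumptions, not the pipeline's actual fix.

```python
from collections import deque

# Hypothetical size limit; the real pipeline's thresholds are not public.
MAX_EVENT_BYTES = 1_000_000

def drain(queue: deque, processed: list, skipped: list) -> None:
    """Consume messages in order, parking any event too large to process
    within the consumer's heartbeat budget. Without the size guard, the
    oversized event would stall every message queued behind it."""
    while queue:
        msg = queue.popleft()
        if len(msg["payload"]) > MAX_EVENT_BYTES:
            skipped.append(msg["offset"])  # set aside for offline handling
            continue
        processed.append(msg["offset"])

queue = deque([
    {"offset": 1, "payload": b"x" * 100},
    {"offset": 2, "payload": b"x" * 2_000_000},  # the anomalous event
    {"offset": 3, "payload": b"x" * 100},
])
processed, skipped = [], []
drain(queue, processed, skipped)
# processed == [1, 3]; skipped == [2]
```

In a real Kafka consumer the equivalent mitigation is to commit past the offending offset (or raise `max.poll.interval.ms`) so the group stops rebalancing and the backlog can drain in order.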
On March 2, 2026, between 07:10 UTC and 22:04 UTC, the pull requests service was degraded. Users navigating between tabs on the pull requests dashboard were met with 404 errors or blank pages. This was due to a configuration change deployed on February 27 at 23:03 UTC. We mitigated the incident by reverting the change. We are working to improve monitoring for the page to automatically detect and alert us to routing failures.
On February 27, 2026, between 22:53 UTC and 23:46 UTC, the Copilot coding agent service experienced elevated errors and degraded functionality for agent sessions. Approximately 87% of attempts to start or interact with agent sessions encountered errors during this period. This was due to an expired authentication credential for an internal service component, which prevented Copilot agent session operations from completing successfully. We mitigated the incident by rotating the expired credential and deploying the updated configuration to production. Services began recovering within minutes of the fix being deployed. We are working to improve automated credential rotation coverage across all Copilot service components, add proactive alerting for credentials approaching expiration, and validate configuration consistency to reduce our time to detection and mitigation of issues like this one in the future.
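Proactive alerting for credentials approaching expiration, as described above, amounts to comparing each credential's expiry against a warning window. A minimal sketch follows; the credential names and the 30-day window are illustrative assumptions, not GitHub's actual policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical warning window before a credential expires.
WARN_WINDOW = timedelta(days=30)

def expiring_soon(credentials: dict, now: datetime) -> list:
    """Return names of credentials that expire within WARN_WINDOW,
    so they can be rotated before any service impact."""
    return sorted(name for name, expires_at in credentials.items()
                  if expires_at - now <= WARN_WINDOW)

now = datetime(2026, 2, 27, tzinfo=timezone.utc)
creds = {
    "agent-session-signer": datetime(2026, 3, 10, tzinfo=timezone.utc),   # 11 days out
    "telemetry-token": datetime(2026, 9, 1, tzinfo=timezone.utc),         # months out
}
# expiring_soon(creds, now) == ["agent-session-signer"]
```

Run on a schedule, a check like this turns a hard expiry outage into a routine rotation task.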
Between February 26, 2026 at 22:10 UTC and February 27 at 05:50 UTC, the repository browsing UI was degraded and users were unable to load pages for files and directories with non-ASCII names (including Japanese, Chinese, and other non-Latin scripts). On average, the error rate was 0.014% of requests to the service, peaking at 0.06%. Affected users saw 404 errors when navigating to repository directories and files with non-ASCII names. This was due to a code change that altered how file and directory names were processed, which caused incorrectly formatted data to be stored in an application cache. We mitigated the incident by deploying a fix that invalidated the affected cache entries and progressively rolling it out across all production environments. We are working to improve our pre-production testing to cover non-ASCII character handling, establish better cache invalidation mechanisms, and enhance our monitoring to detect this type of failure mode earlier, to reduce our time to detection and mitigation of issues like this one in the future.
Between February 26 and February 27, 2026 (UTC), customers calling the webhooks delivery API may have experienced higher latency or failed requests. During the impact window, 0.82% of requests took longer than 3s and 0.004% resulted in a 500 error response. Our monitors caught the impact on the individual backing data source, and we were able to attribute the degradation to a noisy neighbor effect: requests to a specific webhook generated excessive load on the API. The incident was mitigated once traffic from the specific hook decreased. We have since added a rate limiter for this webhooks API to prevent similar spikes in usage from impacting others, and will further refine the rate limits for other webhook API routes to help prevent similar occurrences in the future.
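A common way to build the kind of rate limiter mentioned above is a token bucket, which absorbs short bursts while capping sustained throughput. The sketch below is a minimal single-threaded illustration; the capacity and refill numbers are made up, since the actual limits applied to the webhooks API are not public.

```python
class TokenBucket:
    """Token-bucket rate limiter sketch: each request spends one token,
    and tokens refill at a fixed rate up to a burst capacity."""

    def __init__(self, capacity: int, refill_per_sec: float):
        self.capacity = capacity
        self.tokens = float(capacity)
        self.refill_per_sec = refill_per_sec
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=5, refill_per_sec=1.0)
# A burst of 10 requests at t=0: only the first 5 pass.
results = [bucket.allow(0.0) for _ in range(10)]
# sum(results) == 5
```

Applied per webhook (or per caller), a bucket like this confines a noisy neighbor's burst to its own quota instead of letting it degrade the shared API.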
On February 26, 2026, between 09:27 UTC and 10:36 UTC, the GitHub Copilot service was degraded and users experienced errors when using Copilot features including Copilot Chat, Copilot Coding Agent, and Copilot Code Review. During this time, 5-15% of affected requests to the service returned errors. The incident was resolved by rebalancing infrastructure capacity. We are improving observability to detect capacity imbalances earlier and enhancing our infrastructure to better handle traffic spikes.
On February 25, 2026, between 15:05 UTC and 16:34 UTC, the Copilot coding agent service was degraded, resulting in errors for 5% of all requests and impacting users starting or interacting with agent sessions. This was due to an internal service dependency running out of allocated resources (memory and CPU). We mitigated the incident by adjusting the resource allocation for the affected service, which restored normal operations for the coding agent service. We are working to implement proactive monitoring for resource exhaustion across our services, review and update resource allocations, and improve our alerting capabilities to reduce our time to detection and mitigation of similar issues in the future.