Between March 2, 21:42 UTC and March 3, 05:54 UTC project board updates, including adding new issues, PRs, and draft items to boards, were delayed from 30 minutes to over 2 hours, as a large backlog of messages accumulated in the Projects data denormalization pipeline.The incident was caused by an anomalously large event that required longer processing time than expected. Processing this message exceeded the Kafka consumer heartbeat timeout, triggering repeated consumer group rebalances. As a result, the consumer group was unable to make forward progress, creating head-of-line blocking that delayed processing of subsequent project board updates.We mitigated the issue by deploying a targeted fix that safely bypassed the offending message and allowed normal message consumption to resume. Consumer group stability recovered at 04:10 UTC, after which the backlog began draining. All queued messages were fully processed by 05:53 UTC, returning project board updates to normal processing latency.We have identified several follow-up improvements to reduce the likelihood and impact of similar incidents in the future, including improved monitoring and alerting, as well as introducing limits for unusually large project events.
On March 2nd, 2026, between 7:10 UTC and 22:04 UTC the pull requests service was degraded. Users navigating between tabs on the pull requests dashboard were met with 404 errors or blank pages.This was due to a configuration change deployed on February 27th at 11:03 PM UTC. We mitigated the incident by reverting the change.We’re working to improve monitoring for the page to automatically detect and alert us to routing failures.
On February 27, 2026, between 22:53 UTC and 23:46 UTC, the Copilot coding agent service experienced elevated errors and degraded functionality for agent sessions. Approximately 87% of attempts to start or interact with agent sessions encountered errors during this period.This was due to an expired authentication credential for an internal service component, which prevented Copilot agent session operations from completing successfully.We mitigated the incident by rotating the expired credential and deploying the updated configuration to production. Services began recovering within minutes of the fix being deployed.We are working to improve automated credential rotation coverage across all Copilot service components, add proactive alerting for credentials approaching expiration, and validate configuration consistency to reduce our time to detection and mitigation of issues like this one in the future.
Starting February 26, 2026 at 22:10 UTC through February 27, 05:50 UTC, the repository browsing UI was degraded and users were unable to load pages for files and directories with non-ASCII characters (including Japanese, Chinese, and other non-Latin scripts). On average, the error rate was 0.014% and peaked at 0.06% of requests to the service. Affected users saw 404 errors when navigating to repository directories and files with non-ASCII names. This was due to a code change that altered how file and directory names were processed, which caused incorrectly formatted data to be stored in an application cache.We mitigated the incident by deploying a fix that invalidated the affected cache entries and progressively rolling it out across all production environments.We are working to improve our pre-production testing to cover non-ASCII character handling, establish better cache invalidation mechanisms, and enhance our monitoring to detect this type of failure mode earlier, to reduce our time to detection and mitigation of issues like this one in the future.
Between February 26, 2026 UTC and February 27, 2026 UTC, customers hitting the webhooks delivery API may have experienced higher latency or failed requests. During the impact window, 0.82% of requests took longer than 3s and 0.004% resulted in a 500 error response.Our monitors caught the impact on the individual backing data source, and we were able to attribute the degradation to a noisy neighbor effect due requests to a specific webhook generating excessive load on the API. The incident was mitigated once traffic from the specific hook decreased.We have since added a rate limiter for this webhooks API to prevent similar spikes in usage impacting others and will further refine the rate limits for other webhook API routes to help prevent similar occurrences in the future.
On February 26, 2026, between 09:27 UTC and 10:36 UTC, the GitHub Copilot service was degraded and users experienced errors when using Copilot features including Copilot Chat, Copilot Coding Agent and Copilot Code Review. During this time, 5-15% of affected requests to the service returned errors.The incident was resolved by infrastructure rebalancing.We are improving observability to detect capacity imbalances earlier and enhancing our infrastructure to better handle traffic spikes.
On February 25, 2026, between 15:05 UTC and 16:34 UTC, the Copilot coding agent service was degraded, resulting in errors for 5% of all requests and impacting users starting or interacting with agent sessions. This was due to an internal service dependency running out of allocated resources (memory and CPU). We mitigated the incident by adjusting the resource allocation for the affected service, which restored normal operations for the coding agent service.We are working to implement proactive monitoring for resource exhaustion across our services, review and update resource allocations, and improve our alerting capabilities to reduce our time to detection and mitigation of similar issues in the future.
On February 23, 2026, between 21:01 UTC and 21:30 UTC the Search service experienced degraded performance, resulting in an average of 3.5% of search requests for Issues and Pull Requests being rejected. During this period, updates to Issues and Pull Requests may not have been immediately reflected in search results. During a routine migration, we observed a spike in internal traffic due to a configuration change in our search index. We were alerted to the increase in traffic as well as the increase in error rates and rolled back to the previous stable index. We are working to enable more controlled traffic shifting when promoting a new index to allow us to detect potential limitations earlier and ensure these operations succeed in a more controlled manner.
Between 2026-02-23 19:10 and 2026-02-24 00:46 UTC, all lexical code search queries in GitHub.com and the code search API were significantly slowed, and during this incident, between 5 and 10% of search queries timed out. This was caused by a single customer who had created a network of hundreds of orchestrated accounts which searched with a uniquely expensive search query. This search query concentrated load on a single hot shard within the search index, slowing down all queries. After we identified the source of the load and stopped the traffic, latency returned to normal.To avoid this situation occurring again in the future, we are making a number of improvements to our systems, including: improved rate limiting that accounts for highly skewed load on hot shards, improved system resilience for when a small number of shards time out, improved tooling to recognize abusive actors, and capabilities that will allow us to shed load on a single shard in emergencies.
On February 23, 2026, between 15:00 UTC and 17:00 UTC, GitHub Actions experienced degraded performance. During the time, 1.8% of Actions workflow runs experienced delayed starts with an average delay of 15 minutes. The issue was caused by a connection rebalancing event in our internal load balancing layer, which temporarily created uneven traffic distribution across sites and led to request throttling. To prevent recurrence, we are tuning connection rebalancing behavior to spread client reconnections more gradually during load balancer reloads. We are also evaluating improvements to site-level traffic affinity to eliminate the uneven distribution at its source. We have overprovisioned critical paths to prevent any impact if a similar event occurs before those workstreams finish. Finally, we are enhancing our monitoring to detect capacity imbalances proactively.