On October 9, 2025, between 14:35 UTC and 15:21 UTC, a network device that was in maintenance mode for repairs was brought back into production before those repairs were completed. Network traffic traversing this device experienced significant packet loss. Authenticated users of the github.com UI experienced increased latency during the first 5 minutes of the incident. API users experienced error rates of up to 7.3%, which then stabilized at about 0.05% until the issue was mitigated. The Actions service had 24% of runs delayed by an average of 13 minutes. Large File Storage (LFS) requests saw a minimally increased error rate, with 0.038% of requests failing. To prevent similar issues, we are enhancing the validation process for device repairs of this category.
Between 13:39 UTC and 13:42 UTC on October 9, 2025, around 2.3% of REST API calls and 0.4% of web traffic were impacted due to the partial rollout of a new feature that had more impact on one of our primary databases than anticipated. The partially rolled-out feature performed an excessive number of writes per request, which caused excessive write latency for other API and web endpoints and resulted in 5xx errors to customers. The issue was identified by our automatic alerting and mitigated by turning down the percentage of traffic routed to the new feature, which allowed the database cluster and dependent services to recover. We are improving the way we roll out features like this and moving the writes involved in this incident to a storage solution better suited to this type of activity. We have also optimized this particular feature so that its rollout cannot impact other areas of the site, and we are investigating how to identify issues like this more quickly.
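For readers unfamiliar with percentage-based rollouts, the sketch below shows how gating a feature by a rollout percentage makes this kind of mitigation a one-line change. The flag name, flag store, and bucketing scheme are hypothetical illustrations, not GitHub's actual implementation.

```python
# Minimal sketch of percentage-based feature gating. Dialing the percentage
# down to 0 immediately stops the feature's extra writes for all requests.
import hashlib

# Hypothetical flag and rollout percentage; set to 0 to mitigate an incident.
ROLLOUT_PERCENT = {"bulk_metadata_writes": 5}

def bucket(actor_id: str, flag: str) -> int:
    """Deterministically map an actor to a bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag}:{actor_id}".encode()).hexdigest()
    return int(digest, 16) % 100

def feature_enabled(actor_id: str, flag: str) -> bool:
    """Enable the feature only for actors below the current rollout percentage."""
    return bucket(actor_id, flag) < ROLLOUT_PERCENT.get(flag, 0)
```

Because bucketing is deterministic, the same small slice of traffic stays in the rollout until the percentage changes, which keeps the blast radius both bounded and easy to revoke.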
On October 7, 2025, between 19:48 UTC and 00:05 UTC the following day (approximately 4 hours and 17 minutes), the audit log service was degraded, creating a backlog and delaying the availability of new audit log events. The issue originated in a third-party dependency. We mitigated the incident by working with the vendor to identify and resolve the issue. Write operations recovered first, followed by processing of the accumulated backlog of audit log events. We are improving our monitoring and alerting for audit log ingestion delays and strengthening our incident response procedures to reduce our time to detect and mitigate issues like this in the future.
Between 01:00 UTC on October 1, 2025 and 22:33 UTC on October 2, 2025, the Copilot service experienced a degradation of the Gemini 2.5 Pro model due to an issue with our upstream provider. Before 15:53 UTC on October 1, users experienced higher error rates on large-context requests to Gemini 2.5 Pro. From 15:53 UTC until 22:33 UTC on October 2, requests were restricted to smaller context windows when using Gemini 2.5 Pro. No other models were impacted. The issue was resolved by a mitigation put in place by our provider. GitHub is collaborating with the provider to enhance communication and improve our ability to reproduce issues, with the aim of reducing resolution time.
On October 1, 2025 between 07:00 UTC and 17:20 UTC, Mac hosted runner capacity for Actions was degraded, leading to timed-out jobs and long queue times. On average, the error rate was 46%, peaking at 96% of requests to the service. XL and Intel runners recovered by 10:10 UTC, with the other runner types taking longer to recover. The degraded capacity was triggered by a scheduled event at 07:00 UTC that caused a permission failure on Mac runner hosts, blocking reimage operations. The permission issue was resolved by 09:41 UTC, but the recovery of available runners took longer than expected due to a combination of backoff logic slowing backend operations and some hosts needing state resets. We deployed changes immediately following the incident to address the scheduled event and ensure that similar failures will not block critical operations in the future. We are also working to reduce the end-to-end time for self-healing of offline hosts so that capacity recovers more quickly after future capacity or host events.
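The slow recovery illustrates a common retry tradeoff: backoff protects backend systems during a failure, but uncapped delays can keep a fleet offline long after the root cause is fixed. The sketch below shows capped exponential backoff with jitter; the parameter names and values are illustrative assumptions, not the actual Actions settings.

```python
# Sketch of capped exponential backoff with full jitter for host reimage retries.
# The cap bounds how stale a host's next retry can be once the underlying
# permission issue is resolved, so fleet-wide recovery does not stall.
import random

BASE_DELAY_S = 5
MAX_DELAY_S = 120  # without a cap, repeated failures can push retries out for hours

def retry_delay(attempt: int) -> float:
    """Return a randomized delay, bounded by MAX_DELAY_S, for the given attempt."""
    delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
    return random.uniform(0, delay)
```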
On September 29, 2025, between 17:53 and 18:42 UTC, the Copilot service experienced a degradation of the Gemini 2.5 model due to an issue with our upstream provider. Approximately 24% of requests failed, affecting 56% of users during this period. No other models were impacted. GitHub notified the upstream provider of the problem as soon as it was detected, and the issue was resolved after the provider rolled back a recent change that caused the disruption. We will continue to enhance our monitoring and alerting systems to reduce the time it takes to detect and mitigate similar issues in the future.
On September 29, 2025 between 16:26 UTC and 17:33 UTC, the Copilot API experienced a partial degradation causing intermittent, erroneous 404 responses for an average of 0.2% of GitHub MCP server requests, peaking at around 2% of requests. The issue stemmed from an upgrade of an internal dependency that exposed a misconfiguration in the service. We resolved the incident by rolling back the upgrade, and we have since fixed the configuration issue. We will also improve our documentation and rollout processes to prevent similar issues.
On September 26, 2025 between 16:22 UTC and 18:32 UTC, raw file access was degraded for four repositories. On average, the raw file access error rate was 0.01%, peaking at 0.16% of requests. The degradation was caused by a caching bug exposed by excessive traffic to a handful of repositories. We mitigated the incident by resetting the state of the raw file access cache, and we are improving cache usage and testing to prevent issues like this in the future.
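One common way to "reset the state" of a cache without deleting entries individually is to version the keys, as sketched below. The key layout and in-memory dictionary are stand-ins for whatever store is actually used; this is illustrative, not the raw file cache's real design.

```python
# Sketch of generation-based cache invalidation: prefixing keys with a
# generation number means bumping the generation effectively resets the
# whole cache, since old entries are never looked up again.
cache: dict[str, bytes] = {}
generation = 1

def cache_key(repo: str, ref: str, path: str) -> str:
    """Build a versioned key for a raw file lookup."""
    return f"raw:v{generation}:{repo}:{ref}:{path}"

def reset_cache() -> None:
    """Invalidate all cached entries by advancing the generation."""
    global generation
    generation += 1
```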
On September 23, 2025, between 15:29 UTC and 17:38 UTC, and again on September 24, 2025 between 15:02 UTC and 15:12 UTC, email deliveries were delayed by up to 50 minutes, which resulted in significant delays for most types of email notifications. This occurred due to an unusually high volume of traffic that caused resource contention on some of our outbound email servers. We have updated our configuration to better allocate capacity during periods of high traffic, and we are updating our monitors so we can detect this type of issue before it becomes a customer-impacting incident.
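An early-warning monitor for this class of issue typically tracks delivery lag rather than error rate, since delayed mail produces no failures. The sketch below alerts on the age of the oldest queued message; the threshold and data shape are illustrative assumptions, not the monitors actually in use.

```python
# Sketch of a delivery-lag monitor: page when the oldest undelivered message
# on an outbound mail server has been queued longer than the threshold,
# well before delays reach the 50-minute level seen in this incident.
import time

ALERT_THRESHOLD_S = 5 * 60  # alert once delivery lag exceeds 5 minutes

def oldest_message_age(enqueue_times: list[float]) -> float:
    """Return the age in seconds of the oldest message still in the queue."""
    return time.time() - min(enqueue_times) if enqueue_times else 0.0

def should_alert(enqueue_times: list[float]) -> bool:
    return oldest_message_age(enqueue_times) > ALERT_THRESHOLD_S
```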