On October 20th, 2025, between 14:10 UTC and 16:40 UTC, the Copilot service experienced degradation due to an infrastructure issue that impacted the Grok Code Fast 1 model, leading to a spike in errors affecting 30% of users. No other models were impacted. The incident was caused by an outage at an upstream provider.
On October 20, 2025, between 08:05 UTC and 10:50 UTC, the Codespaces service was degraded, with users experiencing failures creating new codespaces and resuming existing ones. The error rate for codespace creation averaged 39.5% and peaked at 71% of requests during the incident window; resume operations averaged a 23.4% error rate with a peak of 46%. This was due to a cascading failure triggered by an outage in a third-party dependency required to build devcontainer images. The impact was mitigated when the third-party dependency recovered. We are investigating how to remove this dependency from the critical path of our container build process, and we are working to improve our monitoring and alerting systems to reduce our time to detection for issues like this in the future.
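One common way to take a third-party dependency off the critical path is to treat it as best-effort and fall back to a cached artifact when it is unavailable. The sketch below illustrates that general pattern; the function names and cache layer are hypothetical assumptions and are not a description of the actual Codespaces build pipeline.

```ts
// Hypothetical sketch: fall back to a locally cached base image when the
// external source that normally serves it is unavailable, so the build can
// proceed in degraded mode instead of failing outright.
async function resolveBaseImage(
  fetchFromRegistry: (ref: string) => Promise<string>, // resolves a ref to an image digest
  fetchFromCache: (ref: string) => Promise<string | null>,
  ref: string,
): Promise<string> {
  try {
    // Preferred path: the upstream dependency is healthy.
    return await fetchFromRegistry(ref);
  } catch (err) {
    // Degraded path: serve the last known-good digest from a local cache
    // rather than treating the third party as a hard dependency.
    const cached = await fetchFromCache(ref);
    if (cached !== null) {
      console.warn(`upstream unavailable for ${ref}; using cached digest`, err);
      return cached;
    }
    // No cached copy either: the failure is genuinely unrecoverable.
    throw err;
  }
}
```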
On October 17th, 2025, between 12:51 UTC and 14:01 UTC, mobile push notifications failed to be delivered, for a total duration of 70 minutes. This affected github.com and GitHub Enterprise Cloud in all regions. The disruption was caused by an erroneous configuration change to cloud resources used for mobile push notification delivery. We are reviewing our procedures and management of these cloud resources to prevent similar incidents in the future.
On October 14th, 2025, between 18:26 UTC and 18:57 UTC, a subset of unauthenticated requests to the commit endpoint for certain repositories received 503 errors. During the event, the average error rate was 3%, peaking at 3.5% of total requests. The event was triggered by a recent configuration change combined with shifts in traffic patterns on the service. We were alerted to the issue immediately and adjusted the configuration to mitigate the problem. We are working on automatic mitigation and better traffic handling to prevent issues like this in the future.
On October 14th, 2025, between 13:34 UTC and 16:00 UTC, the Copilot service was degraded for the GPT-5 mini model. On average, 18% of requests to GPT-5 mini failed due to an issue with our upstream provider. We notified the upstream provider of the problem as soon as it was detected and mitigated the issue by failing over to other providers. The upstream provider has since resolved the issue. We are working to improve our failover logic to mitigate similar upstream failures more quickly in the future.
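As a rough illustration of the failover pattern described above, the sketch below tries a list of interchangeable providers in order and moves to the next one when a request fails. The types and names are illustrative assumptions, not GitHub's actual implementation.

```ts
// Hypothetical sketch: try each model provider in order and fail over to the
// next one when a request errors, so a single upstream outage does not take
// the whole model endpoint down.
type CompletionFn = (prompt: string) => Promise<string>;

async function completeWithFailover(
  providers: { name: string; complete: CompletionFn }[],
  prompt: string,
): Promise<string> {
  let lastError: unknown;
  for (const provider of providers) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      // Record the failure and move on to the next provider in the list.
      console.warn(`provider ${provider.name} failed, trying the next one`, err);
      lastError = err;
    }
  }
  // Every provider failed; surface the last error to the caller.
  throw lastError;
}
```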
On October 9th, 2025, between 14:35 UTC and 15:21 UTC, a network device in maintenance mode that was undergoing repairs was brought back into production before repairs were completed. Network traffic traversing this device experienced significant packet loss. Authenticated users of the github.com UI experienced increased latency during the first 5 minutes of the incident. API users experienced error rates of up to 7.3%, after which the error rate stabilized at about 0.05% until mitigation. The Actions service saw 24% of runs delayed, by an average of 13 minutes. Large File Storage (LFS) requests saw a minimally increased error rate, with 0.038% of requests failing. To prevent similar issues, we are enhancing the validation process for device repairs of this category.
Between 13:39 UTC and 13:42 UTC on October 9th, 2025, around 2.3% of REST API calls and 0.4% of web traffic were impacted by the partial rollout of a new feature that placed more load on one of our primary databases than anticipated. While partially rolled out, the feature performed an excessive number of writes per request, which caused elevated write latency for other API and web endpoints and resulted in 5xx errors to customers. The issue was identified by our automatic alerting and reverted by turning down the percentage of traffic routed to the new feature, which allowed the database cluster and dependent services to recover. We are working to improve how we roll out features like this and to move the specific writes from this incident to a storage solution better suited to this type of activity. We have also optimized this particular feature so that future rollouts do not impact other areas of the site, and we are investigating how to identify issues like this even more quickly.
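For readers unfamiliar with percentage-based rollouts, the sketch below shows the general shape of such a gate: each actor is hashed into a stable bucket, and mitigation is a matter of dialing the rollout percentage down rather than shipping new code. The function and parameter names are hypothetical and do not reflect GitHub's feature-flag system.

```ts
import { createHash } from "node:crypto";

// Hypothetical sketch: gate a feature behind a rollout percentage so that
// mitigation during an incident is a configuration change, not a deploy.
function isFeatureEnabled(feature: string, actorId: string, rolloutPercent: number): boolean {
  // Hash the feature/actor pair so each actor lands in a stable bucket from 0 to 99.
  const digest = createHash("sha256").update(`${feature}:${actorId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  // Only actors whose bucket falls below the rollout percentage see the feature.
  return bucket < rolloutPercent;
}

// Lowering rolloutPercent (e.g. from 10 to 0) immediately stops the feature's
// extra database writes for all actors, which is the mitigation used above.
```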
On October 7, 2025, between 19:48 UTC and 00:05 UTC on October 8 (approximately 4 hours and 17 minutes), the audit log service was degraded, creating a backlog and delaying the availability of new audit log events. The issue originated in a third-party dependency. We mitigated the incident by working with the vendor to identify and resolve the issue. Write operations recovered first, followed by the processing of the accumulated backlog of audit log events. We are working to improve our monitoring and alerting for audit log ingestion delays and to strengthen our incident response procedures to reduce our time to detection and mitigation of issues like this one in the future.
Between October 1st, 2025 at 01:00 UTC and October 2nd, 2025 at 22:33 UTC, the Copilot service experienced a degradation of the Gemini 2.5 Pro model due to an issue with our upstream provider. Before 15:53 UTC on October 1st, users experienced higher error rates on large-context requests to Gemini 2.5 Pro. From 15:53 UTC until 22:33 UTC on October 2nd, requests to Gemini 2.5 Pro were restricted to smaller context windows. No other models were impacted. The issue was resolved by a mitigation put in place by our provider. GitHub is collaborating with the provider to enhance communication and improve our ability to reproduce issues, with the aim of reducing resolution time.
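The interim mitigation described here amounts to capping the context size accepted per request while the provider is degraded. A minimal sketch of that kind of guard is shown below; the request shape and threshold handling are assumptions for illustration, not the provider's or GitHub's actual mechanism.

```ts
// Hypothetical sketch: while an upstream model cannot reliably handle large
// context windows, cap the prompt tokens accepted per request instead of
// letting those requests fail with errors.
interface ChatRequest {
  model: string;
  promptTokens: number;
}

function checkContextLimit(req: ChatRequest, degradedMaxTokens: number | null): void {
  // degradedMaxTokens is null when the provider is healthy and no cap applies.
  if (degradedMaxTokens !== null && req.promptTokens > degradedMaxTokens) {
    throw new Error(
      `${req.model} is temporarily limited to ${degradedMaxTokens} prompt tokens; ` +
        `this request used ${req.promptTokens}`,
    );
  }
}
```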