Multiple services critical to GitHub's attestation infrastructure experienced an outage that prevented Fulcio from issuing signing certificates. During the outage, GitHub customers using the "actions/attest-build-provenance" action in public repositories were unable to generate attestations.
On June 6, 2025, an update to mitigate a previous incident triggered automated scaling of the database infrastructure used by Copilot Coding Agent, which added a partition. The service's clients were not built to handle the additional partition automatically, so they were unable to retrieve data across partitions, resulting in unexpected 404 errors.
As a result, approximately 17% of coding sessions displayed an incorrect final state - such as sessions appearing in progress when they had actually completed. Additionally, some Copilot-authored pull requests were missing timeline events indicating task completion. Importantly, this did not affect Copilot Coding Agent’s ability to finish code tasks and submit pull requests.
To prevent similar issues in the future, we are taking steps to improve our systems and monitoring.
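For illustration only, since the incident description suggests clients assumed a single partition: a minimal sketch of a read path that fans out across all known partitions instead of querying only the original one. The partition layout, session IDs, and function names below are hypothetical, not GitHub's actual code.

```python
# Hypothetical sketch: a read path that fans out across all known partitions
# instead of assuming a single, fixed partition. Names and storage layout are
# illustrative only.
from typing import Optional

# Pretend each partition is a simple dict of session_id -> session state.
PARTITIONS = [
    {"session-1": "completed"},    # original partition
    {"session-2": "in_progress"},  # partition added by automated scaling
]

def get_session_state(session_id: str) -> Optional[str]:
    """Look up a session in every partition rather than only the first one.

    A client hard-coded to read a single partition would return None (a 404)
    for sessions stored on the newly added partition.
    """
    for partition in PARTITIONS:
        state = partition.get(session_id)
        if state is not None:
            return state
    return None  # truly not found

if __name__ == "__main__":
    print(get_session_state("session-2"))  # found even though it lives on the new partition
```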
Between 2025-06-10 12:25 UTC and 2025-06-11 01:51 UTC, GitHub Enterprise Cloud (GHEC) customers with approximately 10,000 or more users saw performance degradation and 5xx errors when loading the Enterprise Settings’ People management page. Less than 2% of page requests resulted in an error. The issue was caused by a database change that replaced an index required by the page load query. The issue was resolved by reverting the database change. To prevent similar incidents, we are improving the testing and validation process for replacing database indexes.
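As context for why index replacement is risky, a common pattern is to build the replacement index before dropping the old one, so the query is never left without a usable index. The sketch below is a hypothetical Postgres migration using psycopg2 with made-up table and index names; GitHub's actual database and tooling may differ.

```python
# Hypothetical migration sketch: create the replacement index before dropping
# the old one, so queries always have an index to use. Table, column, and index
# names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=example")
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction
cur = conn.cursor()

# 1. Build the new index without blocking writes.
cur.execute("CREATE INDEX CONCURRENTLY idx_members_org_role_new ON members (org_id, role)")

# 2. Only after the new index is valid and in use, drop the old one.
cur.execute("DROP INDEX CONCURRENTLY idx_members_org_id_old")

cur.close()
conn.close()
```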
On June 10, 2025, between 12:15 UTC and 19:04 UTC, Codespaces billing data processing experienced delays due to capacity issues in our worker pool. Approximately 57% of codespaces were affected during this incident, during which some customers may have observed incomplete or delayed billing usage information in their dashboards and usage reports, and may not have received timely notifications about approaching usage or spending limits. The incident was caused by an increase in the number of jobs in our worker pool without a corresponding increase in capacity, resulting in a backlog of unprocessed Codespaces billing jobs. We mitigated the issue by scaling up worker capacity, allowing the backlog to clear and billing data to catch up. We started seeing recovery at 17:40 UTC and were fully caught up by 19:04 UTC. To prevent recurrence, we are moving critical billing jobs into a dedicated worker pool monitored by the Codespaces team, and are reviewing alerting thresholds to ensure more rapid detection and mitigation of delays in the future.
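To make the detection gap concrete, here is a minimal sketch of alerting when a billing job queue's backlog grows beyond what the current worker capacity can drain in a target window. The metric names, throughput numbers, and threshold are hypothetical, not GitHub's monitoring configuration.

```python
# Hypothetical backlog check: alert when queued billing jobs exceed what the
# current worker capacity can drain within a target window. All numbers and
# names are illustrative.
from dataclasses import dataclass

@dataclass
class PoolStats:
    queued_jobs: int            # jobs waiting to be processed
    workers: int                # active workers in the pool
    jobs_per_worker_hour: int   # observed throughput per worker

def backlog_hours(stats: PoolStats) -> float:
    """Estimate how long the current backlog takes to drain."""
    capacity_per_hour = max(stats.workers * stats.jobs_per_worker_hour, 1)
    return stats.queued_jobs / capacity_per_hour

def should_alert(stats: PoolStats, max_hours: float = 1.0) -> bool:
    return backlog_hours(stats) > max_hours

if __name__ == "__main__":
    stats = PoolStats(queued_jobs=120_000, workers=20, jobs_per_worker_hour=1_000)
    print(backlog_hours(stats), should_alert(stats))  # 6.0 hours -> alert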
On June 10, 2025, between 14:28 UTC and 14:45 UTC, the pull request service experienced a period of degraded performance, resulting in merge error rates exceeding 1%. The root cause was an overloaded host in our Git infrastructure. We mitigated the incident by removing this host from the set of valid replicas until it was healthy again. We are working to improve the protective mechanisms in our existing infrastructure and will be revisiting why they did not protect us as expected in this particular scenario.
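As a rough illustration of the mitigation described above, and not the actual Git infrastructure code, the sketch below takes an unhealthy replica out of rotation until its health checks pass again. Host names and the health probe are hypothetical.

```python
# Hypothetical sketch of taking an overloaded replica out of rotation until it
# reports healthy again. Host names and the health check are illustrative.
import random

replicas = {"git-replica-1", "git-replica-2", "git-replica-3"}
unhealthy: set[str] = set()

def is_healthy(host: str) -> bool:
    # Stand-in for a real health probe (load, latency, error rate, ...).
    return host not in {"git-replica-2"}

def refresh_health() -> None:
    for host in list(replicas):
        if is_healthy(host):
            unhealthy.discard(host)
        else:
            unhealthy.add(host)

def pick_replica() -> str:
    candidates = replicas - unhealthy
    if not candidates:
        raise RuntimeError("no healthy replicas available")
    return random.choice(sorted(candidates))

if __name__ == "__main__":
    refresh_health()
    print(pick_replica())  # never returns the unhealthy host
```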
On June 6, 2025, between 00:21 UTC and 12:40 UTC, the Copilot service was degraded and a subset of Copilot Free users were unable to sign up for or use the Copilot Free service on github.com. This was due to a change in licensing code that caused some users to lose access despite being eligible for Copilot Free. We mitigated this by rolling back the offending change at 11:39 UTC, restoring Copilot Free access for affected users. As a result of this incident, we have improved monitoring of Copilot changes during rollout and are working to reduce our time to detect and mitigate issues like this one in the future.
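One common way to make a change like this quick to mitigate is to guard the new eligibility logic behind a flag so it can be disabled without a full rollback deploy. The sketch below is a generic feature-flag pattern with hypothetical names, not a description of GitHub's licensing code.

```python
# Hypothetical feature-flag guard around a risky licensing-logic change, so the
# old behavior can be restored instantly by flipping the flag. All names are
# illustrative.
FLAGS = {"new_copilot_free_eligibility": False}  # disabled / rolled back

def eligible_old(user: dict) -> bool:
    return user.get("plan") == "free" and not user.get("has_paid_copilot", False)

def eligible_new(user: dict) -> bool:
    # The new logic being rolled out; a bug here only affects flagged traffic.
    return user.get("plan") == "free" and user.get("region") != "unsupported"

def copilot_free_eligible(user: dict) -> bool:
    if FLAGS["new_copilot_free_eligibility"]:
        return eligible_new(user)
    return eligible_old(user)

if __name__ == "__main__":
    print(copilot_free_eligible({"plan": "free"}))  # True via the old, known-good path
```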
On June 5, 2025, between 17:47 UTC and 19:20 UTC, the Actions service was degraded, leading to run start delays and intermittent job failures. During this period, 47.2% of runs had delayed starts, and 21.0% of runs failed. The impact extended beyond Actions itself - 60% of Copilot Coding Agent sessions were cancelled, and all Pages sites using branch-based builds failed to deploy (though Pages serving remained unaffected). The issue was caused by a spike in load between internal Actions services, which exposed a misconfiguration that throttled requests in the critical path of run starts. We mitigated the incident by correcting the service configuration to prevent throttling and have updated our deployment process to ensure the correct configuration is preserved moving forward.
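A minimal sketch of the kind of pre-deployment check that can catch this class of problem is shown below. The configuration keys, limits, and validation rules are hypothetical; the actual Actions service configuration is not public.

```python
# Hypothetical pre-deploy validation: fail the deployment if throttling limits
# for critical-path service-to-service calls fall below a required floor.
# Keys and numbers are illustrative.
REQUIRED_MINIMUMS = {
    "run_start.requests_per_second": 500,  # critical path of run starts
    "run_start.burst": 1000,
}

def validate_throttle_config(config: dict) -> list[str]:
    errors = []
    for key, minimum in REQUIRED_MINIMUMS.items():
        value = config.get(key)
        if value is None:
            errors.append(f"missing throttle setting: {key}")
        elif value < minimum:
            errors.append(f"{key}={value} is below required minimum {minimum}")
    return errors

if __name__ == "__main__":
    bad_config = {"run_start.requests_per_second": 50}  # misconfigured limit
    problems = validate_throttle_config(bad_config)
    if problems:
        raise SystemExit("deploy blocked: " + "; ".join(problems))
```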
On June 4, 2025, between 14:35 UTC and 15:50 UTC, the Actions service experienced degradation, leading to run start delays. During the incident, about 15.4% of all workflow runs were delayed by an average of 16 minutes. An unexpected load pattern revealed a scaling issue in our backend infrastructure. We mitigated the incident by blocking the requests that triggered this pattern. We are improving our rate limiting mechanisms to better handle unexpected load patterns while maintaining service availability. We are also strengthening our incident response procedures to reduce the time to mitigate for similar issues in the future.
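For readers unfamiliar with the mechanism mentioned above, a classic token-bucket rate limiter smooths or rejects bursts that exceed sustainable throughput. This is a generic sketch with illustrative parameters, not GitHub's implementation.

```python
# Generic token-bucket rate limiter sketch: requests consume tokens that refill
# at a fixed rate, so sudden load spikes are smoothed or rejected rather than
# overwhelming the backend. Parameters are illustrative.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

if __name__ == "__main__":
    bucket = TokenBucket(rate_per_sec=100, capacity=200)
    allowed = sum(bucket.allow() for _ in range(1000))
    print(f"{allowed} of 1000 burst requests allowed")  # roughly the bucket capacity
```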
On May 30, 2025, between 08:10 UTC and 16:00 UTC, the Microsoft Teams GitHub integration service experienced a complete service outage. During this period, the service was unable to deliver notifications or process user requests, resulting in a 100% error rate for all integration functionality except link previews. This outage was due to an authentication issue with our downstream provider. We mitigated the incident by working with our provider to restore service functionality, and we are migrating to more durable authentication methods to reduce the risk of similar issues in the future.
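As an illustration of the "more durable authentication" direction, caching a short-lived token and refreshing it before expiry avoids hard failures when a single credential lapses. The token endpoint, fields, and refresh margin below are hypothetical; the actual provider integration is not public.

```python
# Hypothetical token cache that refreshes credentials before they expire, so a
# single expired credential does not take down the whole integration. The
# fetch_token function is a stand-in for the provider's real token endpoint.
import time
from dataclasses import dataclass

@dataclass
class Token:
    value: str
    expires_at: float

def fetch_token() -> Token:
    # Stand-in for an OAuth client-credentials request to the provider.
    return Token(value="example-token", expires_at=time.time() + 3600)

class TokenCache:
    def __init__(self, refresh_margin_sec: float = 300):
        self._token: Token | None = None
        self._margin = refresh_margin_sec

    def get(self) -> str:
        if self._token is None or time.time() >= self._token.expires_at - self._margin:
            self._token = fetch_token()  # refresh well before expiry
        return self._token.value

if __name__ == "__main__":
    cache = TokenCache()
    print(cache.get())  # fetches once, then reuses until near expiry
```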