On October 24, 2024 at 06:55 UTC, a syntactically correct but invalid discussion template YAML config file was committed in the community/community repository. This caused all users of that repository who tried to access a discussion template or attempted to create a discussion to receive a 500 error response. We mitigated the incident by manually reverting the invalid template changes. We are adding support to detect and prevent invalid discussion template YAML from causing user-facing errors in the future.
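For illustration, a pre-merge check along the lines below can catch this class of problem before it reaches production. This is a sketch, not the validation we are building, and the required keys are assumptions for the example rather than the exact discussion template schema.

```python
# Hypothetical pre-merge validator for a discussion template YAML file.
# The required keys below are assumptions for illustration, not GitHub's exact schema.
import sys
import yaml  # PyYAML

REQUIRED_TOP_LEVEL_KEYS = {"title", "body"}

def validate_template(path: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means the file looks valid."""
    try:
        with open(path) as f:
            data = yaml.safe_load(f)
    except yaml.YAMLError as exc:
        return [f"not parseable as YAML: {exc}"]

    if not isinstance(data, dict):
        return ["top level must be a mapping"]

    problems = []
    missing = REQUIRED_TOP_LEVEL_KEYS - data.keys()
    if missing:
        problems.append(f"missing required keys: {sorted(missing)}")
    if not isinstance(data.get("body"), list):
        problems.append("'body' must be a list of form elements")
    return problems

if __name__ == "__main__":
    issues = validate_template(sys.argv[1])
    if issues:
        print("\n".join(issues))
        sys.exit(1)  # fail the check so the invalid template never reaches production
```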
On October 11, 2024, starting at 05:59 UTC, DNS infrastructure in one of our sites started to fail to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the DNS systems for that site. The team worked to restore the infrastructure, and there was no customer impact until 17:31 UTC. During the incident, impact to the following services could be observed:

- Copilot: Degradation in IDE code completions for 4% of active users from 17:31 UTC to 21:45 UTC.
- Actions: Workflow run delays (25% of runs delayed by over 5 minutes) and errors (1%) between 20:28 UTC and 21:30 UTC, as well as errors while creating Artifact Attestations.
- Customer migrations: From 18:16 UTC to 23:12 UTC, running migrations stopped and new ones were not able to start.
- Support: support.github.com was unavailable from 19:28 UTC to 22:14 UTC.
- Code search: 100% of queries failed between 2024-10-11 20:16 UTC and 2024-10-12 00:46 UTC.

Starting at 18:05 UTC, engineering attempted to repoint the degraded site's DNS to a different site to restore DNS functionality. At 18:26 UTC the test system had validated this approach, and a progressive rollout to the affected hosts proceeded over the next hour. While this mitigation was effective at restoring connectivity within the site, it caused issues with connectivity from healthy sites back to the degraded site, so the team proceeded to plan a different remediation effort.

At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by deploying temporary DNS resolution capabilities to the degraded site. At 21:46 UTC, DNS resolution in the degraded site began to recover, and it was fully healthy at 22:16 UTC. Lingering issues with code search were resolved at 01:11 UTC on October 12.

The team continued to restore the original functionality within the site after public service functionality was restored. GitHub is working to harden our resiliency and automation processes around this infrastructure to make diagnosing and resolving issues like this faster in the future.
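As an illustration of the kind of automation that speeds up diagnosing resolver failures like this, below is a minimal per-resolver health-probe sketch; the resolver addresses and probe names are placeholders, and it uses the third-party dnspython package rather than any internal tooling.

```python
# Minimal per-resolver health probe (sketch). Resolver IPs and probe names are placeholders.
import dns.exception
import dns.resolver  # third-party: dnspython

RESOLVERS = ["10.0.0.53", "10.0.1.53"]             # hypothetical site resolvers
PROBE_NAMES = ["github.com", "example.internal"]   # hypothetical probe targets

def resolver_healthy(address: str, timeout: float = 2.0) -> bool:
    """Return True only if this resolver answers all probe lookups within the timeout."""
    r = dns.resolver.Resolver(configure=False)
    r.nameservers = [address]
    r.lifetime = timeout
    try:
        for name in PROBE_NAMES:
            r.resolve(name, "A")
        return True
    except (dns.exception.Timeout, dns.resolver.NXDOMAIN,
            dns.resolver.NoNameservers, dns.resolver.NoAnswer):
        return False

if __name__ == "__main__":
    for addr in RESOLVERS:
        status = "ok" if resolver_healthy(addr) else "FAILING"
        print(f"{addr}: {status}")
```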
Between 15:26 UTC and 15:34 UTC on September 27, 2024, the Repositories Releases service was degraded. During this time, 9% of requests to list releases via the API or the webpage received a 500 Internal Server Error.
This was due to a bug in our software rollout strategy. The rollout was reverted starting at 15:30 UTC, which began to restore functionality, and the rollback was completed at 15:34 UTC.
We are continuing to improve our testing infrastructure to ensure that bugs such as this one can be detected before they make their way into production.
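As one illustration of pre-production detection, a canary stage can gate a rollout on the observed server error rate and trigger a rollback automatically. The sketch below is hypothetical: the metrics hook is a placeholder, and this is not our actual deployment tooling.

```python
# Illustrative canary gate: abort a rollout if the canary's 5xx rate exceeds a threshold.
# get_error_rate() is a hypothetical hook into whatever metrics store the pipeline uses.
import time

ERROR_RATE_THRESHOLD = 0.01     # 1% server errors
OBSERVATION_WINDOW_SECS = 300   # watch the canary for five minutes before promoting

def get_error_rate(deployment: str) -> float:
    """Placeholder: return the fraction of requests returning 5xx for this deployment."""
    raise NotImplementedError

def canary_gate(deployment: str) -> bool:
    """Return True if the canary stays healthy for the full observation window."""
    deadline = time.monotonic() + OBSERVATION_WINDOW_SECS
    while time.monotonic() < deadline:
        if get_error_rate(deployment) > ERROR_RATE_THRESHOLD:
            return False   # roll back instead of promoting
        time.sleep(30)
    return True
```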
On September 30th, 2024 from 10:43 UTC to 11:26 UTC, Codespaces customers in the Central India region were unable to create new Codespaces. Resumes were not impacted. Additionally, there was no impact to customers in other regions. The cause was traced to storage capacity constraints in the region and was mitigated by temporarily redirecting create requests to other regions. Afterwards, additional storage capacity was added to the region and traffic was routed back. A bug was also identified that caused some available capacity to not be utilized, artificially constraining capacity and halting creations in the region prematurely. We have since fixed this bug as well, so that available capacity scales as expected according to our capacity planning projections.
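As a sketch of the redirect behavior described above, create-request placement can fall back to other regions when the home region reports no usable capacity. The region names and the capacity lookup below are hypothetical, not our production placement logic.

```python
# Sketch of region fallback for create requests. Region names and capacity data are hypothetical.
FALLBACK_ORDER = {
    "central-india": ["south-india", "southeast-asia"],
}

def available_capacity(region: str) -> int:
    """Placeholder: return how many new codespaces the region can currently host.
    The bug described above amounted to this value under-reporting real capacity."""
    raise NotImplementedError

def pick_region(home_region: str) -> str:
    """Prefer the home region, then fall back in order until a region has capacity."""
    for region in [home_region, *FALLBACK_ORDER.get(home_region, [])]:
        if available_capacity(region) > 0:
            return region
    raise RuntimeError("no region with available capacity")
```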
Between September 25, 2024, 22:20 UTC and September 26, 2024, 05:00 UTC, the Copilot service was degraded. During this time Copilot chat requests failed at an average rate of 15%. This was due to a faulty deployment in a service provider that caused server errors from multiple regions. Traffic was routed away from those regions at 22:28 UTC and 23:39 UTC, which partially restored functionality, while the upstream service provider rolled back their change. The rollback was completed at 04:41 UTC. We are continuing to improve our ability to respond more quickly to similar issues through faster regional redirection and working with our upstream provider on improved monitoring.
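The faster regional redirection mentioned above can be approximated by continuously removing regions from the routing pool when their upstream error rate crosses a threshold. The sketch below uses a placeholder metrics hook and illustrative values; it is not the routing logic we run.

```python
# Sketch: drop regions from the routing pool when their upstream error rate is too high.
# The metrics hook and the threshold are placeholders for illustration.
ERROR_RATE_LIMIT = 0.05  # 5%; chat errors during the incident averaged well above this

def upstream_error_rate(region: str) -> float:
    """Placeholder for a metrics query against the provider-facing service in this region."""
    raise NotImplementedError

def healthy_regions(regions: list[str]) -> list[str]:
    pool = [r for r in regions if upstream_error_rate(r) <= ERROR_RATE_LIMIT]
    # Never empty the pool entirely; a degraded region is still better than serving nothing.
    return pool or regions
```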
On September 25th, 2024 from 18:32 UTC to 19:13 UTC, the Actions service experienced a degradation during a production deployment, leading to actions failing to be downloaded at the start of a job. On average, 21% of Actions workflow runs failed to start during the course of the incident. The issue was traced back to a bug in an internal service responsible for generating the URLs used by the Actions runner to download actions. To mitigate the impact, we rolled back the offending deployment. We are implementing new monitors to improve our detection and response time for this class of issues in the future.
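One way to catch this class of failure quickly is a synthetic monitor that fetches an action archive the same way a runner would and alerts on non-success responses. The probe URL below is a placeholder, not the internal URL-generation service, and the sketch is illustrative only.

```python
# Synthetic probe (sketch): verify that an action archive URL is still downloadable.
# The probe URL is a placeholder; a real check would exercise the URL-generation path end to end.
import urllib.error
import urllib.request

PROBE_URL = "https://example.com/actions/checkout/v4.tar.gz"  # hypothetical

def probe_download_url(url: str, timeout: float = 10.0) -> bool:
    """Issue a HEAD request and report whether the URL answers with a 2xx status."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

if __name__ == "__main__":
    if not probe_download_url(PROBE_URL):
        print("ALERT: action download probe failed")  # page the on-call in a real monitor
```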
On September 25, 2024 from 14:31 UTC to 15:06 UTC, the Git Operations service experienced a degradation, leading to 1,381,993 failed git operations. The overall error rate during this period was 4.2%, with a peak error rate of 12.5%. The root cause was traced to a bug in a build script for a component that runs on the file servers that host git repository data. The build script incurred an error that did not cause the overall build process to fail, resulting in a faulty set of artifacts being deployed to production. To mitigate the impact, we rolled back the offending deployment. To prevent recurrences, we will address the underlying cause of the ignored build failure and improve metrics and alerting for the resulting production failure scenarios.
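The failure mode here, a build sub-step erroring without failing the overall build, is commonly avoided by making every step's exit status fatal. A minimal sketch, assuming a build driver written in Python with illustrative step commands (the actual build script's language and steps are not described above):

```python
# Sketch of a build driver where any failing step aborts the whole build,
# so partial or faulty artifacts can never be published. Step commands are illustrative.
import subprocess
import sys

BUILD_STEPS = [
    ["make", "component"],
    ["make", "package"],
]

def run_build() -> None:
    for step in BUILD_STEPS:
        # check=True raises CalledProcessError on a non-zero exit instead of silently continuing.
        subprocess.run(step, check=True)

if __name__ == "__main__":
    try:
        run_build()
    except subprocess.CalledProcessError as exc:
        print(f"build step failed: {exc}", file=sys.stderr)
        sys.exit(1)  # fail the overall build so nothing gets deployed
```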
On September 24th, 2024 from 08:20 UTC to 09:04 UTC, the Codespaces service experienced an interruption in network connectivity, leading to 175 codespaces failing to be created or resumed. The overall error rate during this period was 25%. The cause was traced to SNAT port exhaustion following a deployment, which caused individual Codespaces to lose their connection to the service. To mitigate the impact, we increased port allocations to give enough buffer for increased outbound connections shortly after deployments. We will also be scaling up our outbound connectivity in the near future and adding improved monitoring of network capacity to prevent future regressions.
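A rough way to monitor for this kind of exhaustion is to track outbound connections per host against the allocated SNAT port budget and alert before the buffer runs out. The allocation size and the connection-count hook below are assumptions for illustration.

```python
# Sketch: alert when outbound connection count approaches the SNAT port allocation.
# The allocation size and the way connections are counted are assumptions for illustration.
SNAT_PORTS_ALLOCATED = 1024
ALERT_UTILIZATION = 0.8  # warn at 80% so there is buffer for post-deploy connection spikes

def count_outbound_connections() -> int:
    """Placeholder: e.g. count established outbound flows from `ss -tan` or a platform metric."""
    raise NotImplementedError

def snat_pressure_alert() -> bool:
    """Return True when port utilization crosses the alert threshold."""
    used = count_outbound_connections()
    return used / SNAT_PORTS_ALLOCATED >= ALERT_UTILIZATION
```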
On September 16, 2024, between 21:11 UTC and 22:20 UTC, Actions and Pages services were degraded. Customers who deploy Pages from a source branch experienced delayed runs. Approximately 1,100 runs were delayed long enough to get marked as abandoned. The runs that weren't abandoned completed successfully after we recovered from the incident. Actions jobs experienced average delays of 23 minutes, with some jobs experiencing delays as high as 45 minutes. During the course of the incident, 17% of runs were delayed by more than 5 minutes. At peak, as many as 80% of runs experienced delays exceeding 5 minutes. The root cause was a misconfiguration in the service that manages runner connections, which caused CPU throttling and led to a performance degradation in that service. We mitigated the incident by diverting runner connections away from the misconfigured nodes. We are working to improve our internal monitoring and alerting to reduce our time to detection and mitigation of issues like this one in the future.
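CPU throttling of the kind described above is visible in the kernel's cgroup accounting, which a monitor can sample directly. A minimal sketch assuming cgroup v2; the file path can differ per host or container, and the alerting logic around it is illustrative.

```python
# Sketch: detect CPU throttling from cgroup v2 accounting (path may differ per host/container).
CPU_STAT_PATH = "/sys/fs/cgroup/cpu.stat"  # cgroup v2; an assumption about the host's layout

def read_throttled_usec(path: str = CPU_STAT_PATH) -> int:
    """Return cumulative microseconds the cgroup has spent throttled."""
    with open(path) as f:
        stats = dict(line.split() for line in f)
    return int(stats.get("throttled_usec", 0))

def throttling_increased(previous_usec: int) -> tuple[bool, int]:
    """Compare against the last sample; a growing counter means the service is being throttled."""
    current = read_throttled_usec()
    return current > previous_usec, current
```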