Disruption with some GitHub services


Incident resolved in 7h18m0s

Resolved

On October 11, 2024, starting at 05:59 UTC, DNS infrastructure in one of our sites started to fail to resolve lookups following a database migration. Attempts to recover the database led to cascading failures that impacted the DNS systems for that site. The team worked to restore the infrastructure and there was no customer impact until 17:31 UTC.

During the incident, impact to the following services could be observed:

- Copilot: Degradation in IDE code completions for 4% of active users during the incident from 17:31 UTC to 21:45 UTC.
- Actions: Workflow runs delay (25% of runs delayed by over 5 minutes) and errors (1%) between 20:28 UTC and 21:30 UTC. Errors while creating Artifact Attestations.
- Customer migrations: From 18:16 UTC to 23:12 UTC running migrations stopped and new ones were not able to start.
- Support: support.github.com was unavailable from 19:28 UTC to 22:14 UTC.
- Code search: 100% of queries failed between 2024-10-11 20:16 UTC and 2024-10-12 00:46 UTC.

Starting at 18:05 UTC, engineering attempted to repoint the degraded site DNS to a different site to restore DNS functionality. At 18:26 UTC the test system had validated this approach and a progressive rollout to the affected hosts proceeded over the next hour. While this mitigation was effective at restoring connectivity within the site, it caused issues with connectivity from healthy sites back to the degraded site, and the team proceeded to plan out a different remediation effort.

At 20:52 UTC, the team finalized a remediation plan and began the next phase of mitigation by deploying temporary DNS resolution capabilities to the degraded site. At 21:46 UTC, DNS resolution in the degraded site began to recover and was fully healthy at 22:16 UTC. Lingering issues with code search were resolved at 01:11 UTC on October 12.

The team continued to restore the original functionality within the site after public service functionality was restored.
GitHub is working to harden our resiliency and automation processes around this infrastructure to make diagnosing and resolving issues like this faster in the future.

1728695488

Investigating

We’re continuing to work towards recovery of code search service.

1728694016

Investigating

We’ve identified the issue with code search and are working towards recovery of service.

1728692094

Investigating

We’re continuing to investigate issues with code search.

1728689463

Investigating

We’re continuing to investigate issues with code search. Copilot and Actions services are recovered and operating normally.

1728687454

Investigating

Copilot is operating normally.

1728685010

Investigating

We are rolling out a fix to address the network connectivity issues. Copilot is seeing recovery. support.github.com is recovered.

1728684847

Investigating

Actions is operating normally.

1728683205

Investigating

We continue to work on mitigations. Actions is starting to see recovery.

1728682110

Investigating

The mitigation attempt did not resolve the issue and we are working on a different resolution path. In addition to the previously listed impacts, some Actions runs will see delays in starting.

1728679928

Investigating

Actions is experiencing degraded performance. We are continuing to investigate.

1728679695

Investigating

We continue to work on mitigations. In addition to previously listed impact, code search is also unavailable.

1728677756

Investigating

A mitigation for the network connectivity issues is being tested.

1728677141

Investigating

We continue to work on mitigations to restore network connectivity. In addition to the previously listed impact, access to support.github.com is also impacted.

1728674889

Investigating

We have identified the problem and are working on mitigations. In addition to previously listed impact, new Artifact Attestations cannot be created.

1728673536

Investigating

We have identified that the problem is related to maintenance performed in our networking infrastructure. We are working to restore connectivity. Copilot users in organizations or enterprises that have opted into the Content Exclusions feature will experience disabled completions in their editors. Customer migrations remain paused as well.

1728672073

Investigating

We are investigating network connectivity issues. Some Copilot customers will see errors on API calls and experiences. We have also paused the remaining customer migration queue while we investigate due to an increase in errors.

1728671118

Investigating

We are investigating reports of issues with service(s): Copilot. We will continue to keep users updated on progress towards mitigation.

1728669535

Investigating

Copilot is experiencing degraded availability. We are continuing to investigate.

1728669414

Investigating

We are currently investigating this issue.

1728669208
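
The numbers under each update in the timeline above appear to be Unix epoch timestamps in seconds. Assuming that reading is correct, the short sketch below converts the first and last of them to UTC and checks that the gap between them matches the stated resolution time of 7h18m0s. The helper names `to_utc` and `format_duration` are illustrative and not part of any GitHub tooling.

```python
from datetime import datetime, timezone

# First and last update timestamps copied from the timeline above,
# assumed to be Unix epoch seconds.
FIRST_UPDATE = 1728669208
LAST_UPDATE = 1728695488


def to_utc(epoch_seconds: int) -> datetime:
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def format_duration(seconds: int) -> str:
    """Format a number of seconds in the same style as '7h18m0s'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h{minutes}m{secs}s"


opened = to_utc(FIRST_UPDATE)
resolved = to_utc(LAST_UPDATE)
duration = format_duration(LAST_UPDATE - FIRST_UPDATE)

print(opened.isoformat())    # 2024-10-11T17:53:28+00:00
print(resolved.isoformat())  # 2024-10-12T01:11:28+00:00
print(duration)              # 7h18m0s
```

The resolved time of 01:11 UTC on October 12 lines up with the summary, which reports that lingering code search issues were resolved at that time.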