Live updates on pages not loading reliably


Incident resolved in 1h8m42s

Resolved

On December 17th, 2024, between 14:33 UTC and 14:50 UTC, many users experienced intermittent errors and timeouts when accessing github.com. The error rate was 8.5% on average and peaked at 44.3% of requests. The increased error rate caused a broad impact across our services, such as the inability to log in, view a repository, open a pull request, and comment on issues. The errors were caused by our web servers being overloaded as a result of planned maintenance that unintentionally caused our live updates service to fail to start. As a result of the live updates service being down, clients reconnected aggressively and overloaded our servers.We only marked Issues as affected during this incident despite the broad impact. This oversight was due to a gap in our alerting while our web servers were overloaded. The engineering team's focus on restoring functionality led us to not identify the broad scope of the impact to customers until the incident had already been mitigated.We mitigated the incident by rolling back the changes from the planned maintenance to the live updates service and scaling up the service to handle the influx of traffic from WebSocket clients.We are working to reduce the impact of the live updates service's availability on github.com to prevent issues like this one in the future. We are also working to improve our alerting to better detect the scope of impact from incidents like this.

1734451210

Investigating

Issues is operating normally.

1734449561

Investigating

We have taken some mitigation steps and are continuing to investigate the issue. There was a period of wider impact on many GitHub services such as user logins and page loads which should now be mitigated.

1734449399

Investigating

Issues is experiencing degraded availability. We are continuing to investigate.

1734447958

Investigating

We are currently seeing live updates on some pages not working. This can impact features such as status checks and the merge button for PRs.Current mitigation is to refresh pages manually to see latest details.We are working to mitigate this and will continue to provide updates as the team makes progress.

1734447182

Investigating

We are investigating reports of degraded performance for Issues

1734447088