Resolved -
On December 17th, 2024, between 14:33 UTC and 14:50 UTC, many users experienced intermittent errors and timeouts when accessing github.com. The error rate was 8.5% on average and peaked at 44.3% of requests. The increased error rate caused a broad impact across our services, such as the inability to log in, view a repository, open a pull request, and comment on issues. The errors were caused by our web servers being overloaded as a result of planned maintenance that unintentionally caused our live updates service to fail to start. As a result of the live updates service being down, clients reconnected aggressively and overloaded our servers.
We only marked Issues as affected during this incident despite the broad impact. This oversight was due to a gap in our alerting while our web servers were overloaded. The engineering team's focus on restoring functionality led us to not identify the broad scope of the impact to customers until the incident had already been mitigated.
We mitigated the incident by rolling back the changes from the planned maintenance to the live updates service and scaling up the service to handle the influx of traffic from WebSocket clients.
We are working to reduce the impact of the live updates service's availability on github.com to prevent issues like this one in the future. We are also working to improve our alerting to better detect the scope of impact from incidents like this.
Dec 17, 16:00 UTC
Update -
Issues is operating normally.
Dec 17, 15:32 UTC
Update -
We have taken some mitigation steps and are continuing to investigate the issue. There was a period of wider impact on many GitHub services such as user logins and page loads which should now be mitigated.
Dec 17, 15:29 UTC
Update -
Issues is experiencing degraded availability. We are continuing to investigate.
Dec 17, 15:05 UTC
Update -
We are currently seeing live updates on some pages not working. This can impact features such as status checks and the merge button for PRs.
Current mitigation is to refresh pages manually to see latest details.
We are working to mitigate this and will continue to provide updates as the team makes progress.
Dec 17, 14:53 UTC
Investigating -
We are investigating reports of degraded performance for Issues
Dec 17, 14:51 UTC