Resolved -
Between Dec 1 12:20 UTC and Dec 2 1:05 UTC, availability of large hosted runners for Actions was degraded due to failures in background VM provisioning jobs. Users would see workflows queued, waiting for a runner. On average, 8% of workflows requiring large runners were affected during the incident window, peaking at 37.5% of requests. There were also lower levels of intermittent queuing on Dec 1 beginning around 3:00 UTC. Standard and Mac runners were not affected.
The job failures were caused by timeouts calling a dependent service in the VM provisioning flow and by gaps in the jobs' resilience to those timeouts. The incident was mitigated by bypassing the dependency, as it was not in the critical path of VM provisioning.
We are making a few immediate improvements in response to this incident. First, we are addressing the causes of the failed calls to improve the availability of that backend service. Second, even with those failures, the critical flow of large VM provisioning should not have been impacted, so we are improving the client behavior to fail fast and circuit-break non-critical calls. Finally, the alerting for this service was not adequate in this scenario to ensure a fast response by our team, so we are improving our automated detection to reduce our time to detection and mitigation of issues like this one in the future.
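For illustration, a minimal sketch of the fail-fast and circuit-breaking behavior described above is shown below, written in Go. This is not GitHub's actual provisioning code: annotateVM, the failure threshold, the cooldown, and the timeout are hypothetical stand-ins for a non-critical dependency call that should not block VM provisioning when it times out.

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// circuitBreaker skips calls to a non-critical dependency after a run of
// consecutive failures, until a cooldown period elapses.
type circuitBreaker struct {
	mu        sync.Mutex
	failures  int
	threshold int
	cooldown  time.Duration
	openUntil time.Time
}

var errCircuitOpen = errors.New("circuit open: skipping non-critical call")

func (cb *circuitBreaker) call(ctx context.Context, fn func(context.Context) error) error {
	cb.mu.Lock()
	if time.Now().Before(cb.openUntil) {
		cb.mu.Unlock()
		return errCircuitOpen // fail fast without touching the dependency
	}
	cb.mu.Unlock()

	err := fn(ctx)

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		if cb.failures >= cb.threshold {
			cb.openUntil = time.Now().Add(cb.cooldown)
			cb.failures = 0
		}
		return err
	}
	cb.failures = 0
	return nil
}

func main() {
	cb := &circuitBreaker{threshold: 3, cooldown: 30 * time.Second}

	// annotateVM stands in for the hypothetical non-critical dependency that
	// was timing out; a short client-side timeout keeps each attempt bounded.
	annotateVM := func(ctx context.Context) error {
		ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
		defer cancel()
		<-ctx.Done() // simulate the dependency hanging until the timeout fires
		return ctx.Err()
	}

	for i := 0; i < 5; i++ {
		if err := cb.call(context.Background(), annotateVM); err != nil {
			fmt.Println("non-critical call skipped or failed:", err)
		}
		// The critical provisioning work would continue here regardless of
		// whether the non-critical call succeeded.
	}
}

After three consecutive timeouts the breaker opens, so later attempts return immediately instead of adding latency or failures to the provisioning path.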
Dec 2, 01:05 UTC
Update -
We've applied a mitigation to fix the issues with large runner jobs processing. We are seeing improvements in telemetry and are monitoring for full recovery.
Dec 2, 00:57 UTC
Update -
We continue to investigate large hosted runners not picking up jobs.
Dec 2, 00:14 UTC
Update -
We continue to investigate issues with large runners.
Dec 1, 23:43 UTC
Update -
We're seeing issues related to large runners not picking up jobs and are investigating.
Dec 1, 23:24 UTC
Investigating -
We are currently investigating this issue.
Dec 1, 23:18 UTC