Resolved -
On January 23, 2025, between 09:49 and 17:00 UTC, the available capacity of large hosted runners was degraded. On average, 26% of jobs requiring large runners waited more than 5 minutes for a runner to be assigned. This was caused by the rollback of a configuration change combined with a latent bug in event processing that was triggered by the mixed data shapes resulting from the rollback. The event processing reprocessed the same events unnecessarily, causing the background job that manages large runner creation and deletion to run out of resources. The job would automatically restart and continue processing, but it could not keep up with production traffic. We mitigated the impact by using a feature flag to bypass the problematic event processing logic. While these changes had been rolling out in stages over the last few months and had been safely rolled back previously, an unrelated change had prevented earlier rollbacks from triggering this problem.
We are reviewing and updating the feature flags in this event processing workflow to ensure that we have high confidence in rollback at every rollout stage. We are also improving observability of the event processing to reduce the time to diagnose and mitigate similar issues going forward.
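For illustration only, the sketch below shows the general shape of the mitigation described above: a runtime feature flag that lets operators route events around a heavier processing path. This is a minimal, hypothetical example and not GitHub's actual code; the flag name, flag store, and handler functions are all assumptions.

```python
# Minimal illustrative sketch (hypothetical, not GitHub's implementation):
# a feature-flag guard that bypasses a problematic event-processing path.

from typing import Callable, Iterable


class FeatureFlags:
    """Tiny in-memory flag store standing in for a real feature-flag service."""

    def __init__(self, flags: dict[str, bool]):
        self._flags = flags

    def enabled(self, name: str) -> bool:
        return self._flags.get(name, False)


def process_events(
    events: Iterable[dict],
    flags: FeatureFlags,
    reconcile_full: Callable[[dict], None],
    reconcile_simple: Callable[[dict], None],
) -> None:
    for event in events:
        # When the bypass flag is on, skip the heavier reconciliation path
        # (the kind of logic that, per the summary, reprocessed events and
        # exhausted the background job's resources) and take a simpler path.
        if flags.enabled("bypass-event-reprocessing"):
            reconcile_simple(event)
        else:
            reconcile_full(event)


if __name__ == "__main__":
    # Example: with the flag enabled, all events take the simple path.
    flags = FeatureFlags({"bypass-event-reprocessing": True})
    process_events(
        [{"id": 1}, {"id": 2}],
        flags,
        reconcile_full=lambda e: print("full reconcile", e),
        reconcile_simple=lambda e: print("simple path", e),
    )
```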
Jan 23, 17:27 UTC
Update -
We are seeing recovery with the latest mitigation. Queue times for a very small percentage of larger runner jobs are still longer than expected, so we are monitoring those for full recovery before going green.
Jan 23, 17:03 UTC
Update -
We are actively applying mitigations to improve larger runner start times. We are currently seeing delays in starting about 25% of larger runner jobs.
Jan 23, 16:25 UTC
Update -
We are still actively investigating a slowdown in larger runner assignment and are working to apply additional mitigations.
Jan 23, 15:33 UTC
Update -
We're still applying mitigations to unblock queueing of Actions jobs on larger runners. We are monitoring for full recovery.
Jan 23, 14:53 UTC
Update -
We are applying further mitigations to fix the delayed queueing of Actions jobs on larger runners. We continue to monitor for full recovery.
Jan 23, 14:17 UTC
Update -
We are investigating further mitigations for the delays in queueing Actions jobs on larger runners. We continue to watch telemetry and are monitoring for full recovery.
Jan 23, 13:42 UTC
Update -
We've applied a mitigation to fix the issues with queuing and running Actions jobs. We are seeing improvements in telemetry and are monitoring for full recovery.
Jan 23, 13:09 UTC
Update -
The team continues to apply mitigations for issues, seen by a small number of customers, where some Actions jobs are delayed in being enqueued for larger runners. We will continue providing updates on the progress towards full mitigation.
Jan 23, 12:36 UTC
Update -
The team continues to apply mitigations for issues where some Actions jobs are delayed in being enqueued for larger runners. We will continue providing updates on the progress towards full mitigation.
Jan 23, 12:03 UTC
Update -
The team continues to investigate issues where some Actions jobs are delayed in being enqueued for larger runners. We will continue providing updates on the progress towards mitigation.
Jan 23, 11:31 UTC
Update -
The team continues to investigate issues where some Actions jobs are delayed in being queued for larger runners. We will continue providing updates on the progress towards mitigation.
Jan 23, 10:58 UTC
Investigating -
We are investigating reports of degraded performance for Actions.
Jan 23, 10:25 UTC