[Bug]: Self-hosted runners at the Enterprise level fail to detect queued jobs

Hi! I believe I’ve discovered a bug in how the Actions runner currently handles routing/queuing of active workflows to available runners.

I opened an issue in the ‘actions/runner’ repository here (Runner.Listener failing to detect queued workflows/jobs for Enterprise runners · Issue #1059 · actions/runner · GitHub) and was encouraged to follow up in this forum instead.

There is also a lot of additional context on the subject in the issue I originally posted against the controller (enterprise autoscaling issues, indefinitely queued jobs within workflows · Issue #470 · summerwind/actions-runner-controller · GitHub).

To boil it down, though, the disconnect seems to stem from a failure in the job message queue to retry allocating queued workflows to newly available, and otherwise compatible, runners. This becomes an issue when attempting to autoscale Actions runners as a service.
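
For context, the autoscaling piece is handled by the summerwind/actions-runner-controller mentioned above. A rough sketch of the kind of manifests involved is below - the names, enterprise slug, replica counts, and thresholds are illustrative placeholders rather than our exact configuration:

```yaml
# Illustrative actions-runner-controller manifests (all names/values are placeholders).
# A RunnerDeployment registers runners at the enterprise level...
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: enterprise-runners            # hypothetical name
spec:
  template:
    spec:
      enterprise: my-enterprise       # hypothetical enterprise slug
      labels:
        - self-hosted
---
# ...and a HorizontalRunnerAutoscaler scales that deployment up and down
# based on the fraction of registered runners that are currently busy.
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: enterprise-runners-autoscaler # hypothetical name
spec:
  scaleTargetRef:
    name: enterprise-runners
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: '0.75'
      scaleDownThreshold: '0.25'
      scaleUpFactor: '2'
      scaleDownFactor: '0.5'
```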

Expected Behavior

  • 1 self-hosted enterprise worker is up
  • 1 workflow is triggered containing 4 jobs
  • The 1 available runner accepts 1 of the jobs, 3 jobs are now queued
  • A scale-up is triggered for the enterprise runners - now 4 runners are available (this autoscaling currently works via the summerwind/actions-runner-controller Kubernetes controller we are using)
  • After the scale-up, the newly available runners detect there are 3 queued jobs and run them

Actual Behavior

  • 1 self-hosted enterprise worker is up
  • 1 workflow is triggered containing 4 jobs
  • The 1 available runner accepts 1 of the jobs, 3 jobs are now queued
  • Scale-up brings the available runners to 4 - the GitHub API accurately reflects that the 3 additional runners are online and not busy
  • The 3 queued jobs never run and indefinitely hang in a sort of queue purgatory, despite the new runners indicating in the logs that they are listening for jobs. If a new workflow is triggered, these runners detect those jobs without issue, but the queued jobs are never picked up.

This is the case whether it’s the remaining jobs of the same workflow or an entirely separate workflow that is triggered and never picked up, so it doesn’t appear to be a bug specific to jobs within a single workflow; any sort of queued task is affected.
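
For concreteness, the workflow used to reproduce this is nothing exotic - roughly a handful of independent jobs targeting the self-hosted labels, along the lines of the sketch below (the workflow name, labels, and steps are placeholders):

```yaml
# Minimal illustrative reproduction: four independent jobs targeting self-hosted runners.
# With only one runner online at trigger time, one job starts and three sit queued.
name: queue-repro                  # hypothetical workflow name
on: workflow_dispatch

jobs:
  job-1:
    runs-on: [self-hosted, linux, x64]
    steps:
      - run: sleep 300             # keep the single runner busy long enough for the scale-up
  job-2:
    runs-on: [self-hosted, linux, x64]
    steps:
      - run: echo "job 2"
  job-3:
    runs-on: [self-hosted, linux, x64]
    steps:
      - run: echo "job 3"
  job-4:
    runs-on: [self-hosted, linux, x64]
    steps:
      - run: echo "job 4"
```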

The self-hosted runner documentation states that the order of operations is to first search for available runners at the repository level, then at the organization level, and then for the first available runner enterprise-wide that satisfies the label/group/OS constraints of that particular workflow. This does not appear to be the case when only enterprise-level runners are involved, as they will only detect jobs that can be assigned immediately, i.e. if in the example above I had 4 enterprise runners up from the start, the workflow would have run without a problem.

So, I tried adding an ‘organization’-scoped runner in addition to the enterprise runner. In this case there is a difference in behavior - the organization-scoped runners scale up (as the enterprise runners do), but unlike the enterprise runners, the organization-scoped runners do detect jobs that are actively waiting in the queue.

There is an important caveat to this, though: it assumes at least one of those jobs was picked up by an available org-level runner at the time the workflow was triggered. If none are available at that moment, the org-scoped runners never detect the queued jobs either, mirroring the enterprise-level failure.

Again, for all of the above cases the GitHub API accurately reflects that these self-hosted runners are online and available. There are no errors in the runner logs - neither stdout nor /runner/_diag/ - that speak to the issue; everything reflects the expected state there as well.

The only way to handle these queued jobs is to cancel the entire workflow manually.

Please let me know if I can provide any more information on my end!


:wave: I’m the software manager for the team that owns this area of the product…

  1. Thanks for such a detailed write-up of the issue. Much appreciated.
  2. I am following the thread over here: enterprise autoscaling issues, indefinitely queued jobs within workflows · Issue #470 · summerwind/actions-runner-controller · GitHub on the controller as well.

Yes, this is a bug in the assignment logic for the service (or at least, undocumented behavior that doesn’t do what you’d expect). We are planning to ship a fix for this behavior in the next few weeks. Given that it is job assignment logic, though, it might take us some time to test and verify the behavior as we roll it out (breaking job assignment would be pretty bad :smile:).

Also I appreciate that you followed up in the forums even though you had to write things up twice. We are trying to clean up the runner repo and raise things in the appropriate places if we can.

If you have any further questions in this area please feel free to reach out directly to me via email if you like (handle at github).

We are investing in making queuing/assignment/runs better this quarter so I would love feedback.

Thanks!


Great! Sent you an e-mail - I’ll close this out in the meantime as there are plans to resolve these issues in the near future. :+1: