[Bug]: Self-hosted runners at the Enterprise level fail to detect queued jobs #27119
-
Hi! I believe I’ve discovered a bug in how the actions runner is currently handling routing/queuing active workflows to available runners. I opened an issue in the ‘actions/runner’ repository here (Runner.Listener failing to detect queued workflows/jobs for Enterprise runners · Issue #1059 · actions/runner · GitHub) and was encouraged to follow up in this forum instead. There is a lot more additional context in the issue I originally posted here (enterprise autoscaling issues, indefinitely queued jobs within workflows · Issue #470 · summerwind/actions-runner-controller · GitHub) on the subject as well. To boil it down, though, the disconnect seems to be stemming from a failure in the job message queue to retry allocating queued workflows to newly available, and otherwise compatible, runners. This becomes in issue when attempting to autoscale actions as a service. Expected Behavior
Actual Behavior
This is the case whether it’s jobs of the same workflow, or if a separate workflow entirely is triggered and not picked up, so it doesn’t appear to be a bug specific to jobs themselves, rather it’s any sort of task. The self-hosted runner documentation states that the order of operations is to first search for available runners at the repository level, then the organization level, then for the first available runner enterprise-wide that satisfies the constraints set in the labels/groups/os, for that particular workflow. This does not appear to be the case for enterprise-level runners alone as they will only detect jobs that are immediately assigned, i.e. if in the example above I had 4 enterprise runners up it would have run without a problem. So, I tried adding in an ‘organization’-scoped runner in addition to the enterprise runner. In this case there is a difference in behavior - the organization-scoped runners scale-up (as the enterprise runners do), but unlike the enterprise runners the organization-scoped runners do detect jobs that are actively waiting in the queue. There is an important caveat to this, though, as it assumes at least one of those jobs was captured by an available org-level runner at the time the workflow was triggered. If, for instance, none are available at the time the workflow is triggered, similar to the enterprise failure to detect queued jobs, those jobs are never detected by the org-runners. Again, for all of the above instances the Github API accurately reflects that these self-hosted runners are online and available. There are no errors in the runner logs - The only way to handle these queued jobs is to cancel the entire workflow manually. Please let me know if I can provide any more information on my end! |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 1 reply
-
👋 Software manager for the team that owns this area of the product…
Yes, this is a bug in the assignment logic for the service (or at least, undocumented behavior that doesn’t do what you’d expect). We are planning to ship a fix for this behavior in the next few weeks. Given that it is job assignment logic, though, it might take us some time to test and verify the behavior as we roll it out (breaking job assignment would be pretty bad 😄. Also I appreciate that you followed up in the forums even though you had to write things up twice. We are trying to clean up the runner repo and raise things in the appropriate places if we can. If you have any further questions in this area please feel free to reach out directly to me via email if you like (handle at github). We are investing in making queuing/assignment/runs better this quarter so I would love feedback. Thanks! |
Beta Was this translation helpful? Give feedback.
-
Great! Sent you an e-mail - I’ll close this out in the meantime as there are plans to resolve these issues in the near-future. 👍 |
Beta Was this translation helpful? Give feedback.
👋 Software manager for the team that owns this area of the product…
Yes, this is a bug in the assignment logic for the service (or at least, undocumented behavior that doesn’t do what you’d expect). We are planning to ship a fix for this behavior in the next few weeks. Given that it is job assignment logic, though, it might take us some time to test and verify the behavior as we roll it out (breaking job assignment would be p…