Problems with runner scheduler priority

@willabides kindly linked to the following question which was asked in a recent AMA with @chrispat:

Independently of seeing this question I asked the #github channel on Gophers Slack a related question.

We have a workflow that looks like this from a high level:

start -> test x n (matrix) -> delete -> report
  • The start job is very quick.
  • The test matrix is dependent on the start job.
  • The delete job is dependent on all of the entries in the test matrix completing. Each test job is relatively time consuming.
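
For concreteness, the shape of the workflow is roughly the following. This is only a sketch of the dependency chain: the job names match the description above, while the trigger, the matrix variable `shard`, its values, and the `echo` steps are placeholders rather than our real configuration.

```yaml
# Sketch only: trigger, matrix values and run commands are placeholders.
name: example
on: push

jobs:
  start:
    runs-on: ubuntu-latest
    steps:
      - run: echo "start"

  test:
    needs: start               # the matrix depends on the start job
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]    # hypothetical; the real matrix has n entries
    steps:
      - run: echo "test shard ${{ matrix.shard }}"

  delete:
    needs: test                # waits for every matrix entry to finish
    runs-on: ubuntu-latest
    steps:
      - run: echo "delete"

  report:
    needs: delete
    runs-on: ubuntu-latest
    steps:
      - run: echo "report"
```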

The problem is that if a number of instances are kicked off simultaneously, jobs in those other instances appear to jump the queue ahead of the delete job from an earlier instance.

Consider the following scenario:

  • 4 instances of this workflow are kicked off ~simultaneously. Let’s refer to these by number. Furthermore, let’s refer to the test jobs of each by n.m. e.g. 1.1 refers to the first test job of instance number 1.
  • Let’s assume that the start jobs for each instance start and complete in sequence (although this is not guaranteed), such that the test jobs are scheduled in sequence. Specifically, all of 1.1..1.n are scheduled before 2.1..2.n, and so on.
  • Let’s also assume that the number of jobs in the test matrix, n, is greater than the max number of concurrent jobs that can run.

The issue occurs when all the instance 1 test jobs are complete. At this point, the delete job from instance 1 gets scheduled behind all the test jobs from the other instances. This means that instance 1 can only complete once all the test jobs from the other instances have completed.

IMHO what should happen here is that dependent jobs from earlier scheduled workflow instances should jump the queue ahead of jobs from later workflow instances.

But I’d welcome insight and thoughts from others.

Thanks


GitHub added a feature in April that lets you control the concurrency of workflows and jobs:

This could potentially solve your problem. With a concurrency restriction at workflow level, you could ensure that only one instance of the workflow runs at a time, which should prevent any jobs from instance 2 from starting before instance 1 is fully done. The downside would be that instance 2 won’t initiate the start job until instance 1's delete and report jobs are completed.
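
A minimal sketch of what the workflow-level restriction might look like. The group name here is made up; any string works as long as every instance resolves to the same one. I haven't tested this against your setup.

```yaml
# Top of the workflow file. The group name is arbitrary (hypothetical here);
# runs that resolve to the same group are not allowed to run side by side.
name: example
on: push

concurrency:
  group: serial-${{ github.workflow }}
  cancel-in-progress: false    # the default: let the running instance finish

# jobs: start, test, delete and report stay unchanged
```

One caveat, if I read the docs correctly: a concurrency group keeps at most one run pending, and a newly queued run cancels the previously pending one. With 4 instances kicked off at once, that could mean instances 2 and 3 get cancelled instead of queued, so this only helps if skipping intermediate runs is acceptable.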

You could also try a job-level concurrency restriction so that instance 2's test jobs start as soon as all instance 1 test jobs are done, but I can’t think of a concurrency key that would allow for that off the top of my head. It would need to be unique for each test job I suppose (so that the matrix doesn’t just run one job at a time), but the same between instances. Not sure if that is possible at all.
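
Thinking about it a bit more, a key built from the matrix value would have that shape, since the docs list the `matrix` context as available in a job-level `concurrency` expression. An untested sketch, where `shard` is a made-up matrix variable standing in for whatever your matrix is keyed on:

```yaml
jobs:
  test:
    needs: start
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]    # hypothetical matrix variable and values
    concurrency:
      # Different for each matrix entry, identical across workflow instances,
      # so shard 1 of a later instance waits for shard 1 of an earlier one.
      # A group holds only one pending job, so with several instances in
      # flight, older pending jobs may be cancelled rather than queued.
      group: test-${{ matrix.shard }}
    steps:
      - run: echo "test shard ${{ matrix.shard }}"
```

Note that this serialises each matrix entry separately rather than the test stage as a whole, so it approximates "instance 2's tests wait for instance 1's tests" but does not by itself move the delete job ahead in the queue.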