Independently of seeing this question, I asked a related question in the
#github channel on Gophers Slack.
We have a workflow that looks like this from a high level:
start -> test x n (matrix) -> delete -> report
The start job is very quick.
The test matrix is dependent on the start job.
The delete job is dependent on all of the entries in the test matrix completing.
Each test job is relatively time consuming.
The problem is that if a number of instances are kicked off simultaneously, jobs in the later instances appear to jump the queue ahead of the delete job from an earlier instance.
Consider the following scenario:
- 4 instances of this workflow are kicked off ~simultaneously. Let’s refer to these by number. Furthermore, let’s refer to the test jobs of each by instance number and job number, so that 1.1 refers to the first test job of instance number 1.
- Let’s assume that the start jobs for each start and complete in sequence (although this is not guaranteed), such that the test jobs are scheduled in sequence. Specifically, all of 1.1..1.n are scheduled before 2.1..2.n.
- Let’s also assume that the number of jobs in the test matrix, n, is greater than the max number of concurrent jobs that can run.
The issue occurs when all the instance 1 test jobs are complete. At this point, the delete job from instance 1 gets scheduled after all the test jobs in the other instances. This means that instance 1 can only complete once all the test jobs from the other instances complete.
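The scenario above can be reproduced with a toy simulation. This is only a sketch under stated assumptions: it models a single strict FIFO queue shared by all workflow instances, which matches the observed behaviour but is not necessarily how the real scheduler works. The names simulateFIFO and job, and the durations used, are hypothetical and chosen for illustration.

```go
package main

import "fmt"

// job is a hypothetical queue entry: a test or delete job of one workflow instance.
type job struct {
	instance int
	kind     string // "test" or "delete"
	duration int
}

// simulateFIFO models a single FIFO queue shared by all instances (an assumption).
// It returns, per instance, the time at which that instance's delete job starts.
// runners must be at least 1.
func simulateFIFO(instances, n, runners, testDur, deleteDur int) map[int]int {
	var queue []job
	remaining := map[int]int{} // unfinished test jobs per instance
	for i := 1; i <= instances; i++ {
		remaining[i] = n
		for j := 0; j < n; j++ {
			queue = append(queue, job{i, "test", testDur})
		}
	}

	type running struct {
		job
		end int
	}
	var active []running
	deleteStart := map[int]int{}
	now := 0

	for len(queue) > 0 || len(active) > 0 {
		// Hand jobs from the head of the queue to free runners.
		for len(active) < runners && len(queue) > 0 {
			j := queue[0]
			queue = queue[1:]
			if j.kind == "delete" {
				deleteStart[j.instance] = now
			}
			active = append(active, running{job: j, end: now + j.duration})
		}
		// Advance the clock to the next completion.
		now = active[0].end
		for _, r := range active {
			if r.end < now {
				now = r.end
			}
		}
		// Retire finished jobs; the delete job goes to the tail of the queue
		// once the last test job of its instance has completed.
		var still []running
		for _, r := range active {
			if r.end > now {
				still = append(still, r)
				continue
			}
			if r.kind == "test" {
				remaining[r.instance]--
				if remaining[r.instance] == 0 {
					queue = append(queue, job{r.instance, "delete", deleteDur})
				}
			}
		}
		active = still
	}
	return deleteStart
}

func main() {
	// 4 instances, 4 test jobs each, 2 concurrent runners,
	// test jobs take 10 time units, delete jobs take 1.
	starts := simulateFIFO(4, 4, 2, 10, 1)
	for i := 1; i <= 4; i++ {
		fmt.Printf("instance %d: delete job starts at t=%d\n", i, starts[i])
	}
}
```

With these made-up numbers, the test jobs of instance 1 finish at t=20, but its delete job does not start until t=80, after every test job of instances 2 to 4 has run.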
IMHO what should happen here is that dependent jobs from earlier scheduled workflow instances should jump the queue ahead of jobs from later workflow instances.
But I’d welcome insight and thoughts from others.
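To make the suggested ordering concrete, here is a minimal sketch of a queue that orders pending jobs by workflow instance first and by arrival order second, using Go's container/heap. It illustrates the idea only and is not a description of any existing scheduler; the pending type, the byInstance queue and the drainOrder helper are hypothetical names.

```go
package main

import (
	"container/heap"
	"fmt"
)

// pending is a hypothetical queued job.
type pending struct {
	instance int // workflow instance number; lower means kicked off earlier
	seq      int // arrival order in the queue
	name     string
}

// byInstance is a priority queue: earlier instances first, then arrival order.
type byInstance []pending

func (q byInstance) Len() int { return len(q) }
func (q byInstance) Less(a, b int) bool {
	if q[a].instance != q[b].instance {
		return q[a].instance < q[b].instance
	}
	return q[a].seq < q[b].seq
}
func (q byInstance) Swap(a, b int) { q[a], q[b] = q[b], q[a] }
func (q *byInstance) Push(x any)   { *q = append(*q, x.(pending)) }
func (q *byInstance) Pop() any {
	old := *q
	last := old[len(old)-1]
	*q = old[:len(old)-1]
	return last
}

// drainOrder enqueues test jobs of instances 2..4 first, then the delete job
// of instance 1, and returns the order in which the queue releases them.
func drainOrder() []string {
	q := &byInstance{}
	seq := 0
	push := func(instance int, name string) {
		heap.Push(q, pending{instance: instance, seq: seq, name: name})
		seq++
	}
	for i := 2; i <= 4; i++ {
		for j := 1; j <= 2; j++ {
			push(i, fmt.Sprintf("test %d.%d", i, j))
		}
	}
	push(1, "delete 1") // arrives last, but belongs to the earliest instance

	var order []string
	for q.Len() > 0 {
		order = append(order, heap.Pop(q).(pending).name)
	}
	return order
}

func main() {
	for _, name := range drainOrder() {
		fmt.Println(name)
	}
}
```

Even though the delete job of instance 1 is enqueued last, it is released first, followed by the test jobs of instances 2, 3 and 4 in their original order. A trade-off of this ordering is that a steady stream of work from earlier instances could delay later ones, so a real scheduler might need additional fairness rules.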