We plan to migrate our CI and are evaluating GitHub Actions as a candidate. We are already a GitHub customer, so we hope Actions will work for us, since we expect it to offer the best integration.
We require self-hosted runners since many tests depend on USB-connected hardware. We are a camera company, and much of our testing requires cameras to be connected.
Because of this, we have lots of weak PCs (think 4 cores) that will be added to the runner pool so that all the hardware can be connected. They are not great individually, but together they give us some compute power.
However, for building we have workstations with up to 128 logical cores, fast memory, and IO. These machines build our C++ code and run other heavy tasks up to 20 times faster than the weak “camera PCs”.
Because of this highly heterogeneous setup, it is critical for CI latency (and throughput) that each job is sent to the machine that will finish it the quickest at that moment. As an example: the first 3 jobs probably go to a 128-core WS, then two go to a 64-core WS, then one more to the 128-core WS, then maybe a 16-core WS is chosen, and so on. Only when all the performant WSs are taken should jobs be sent to the 4-core PCs, since they are so slow. However, we do not want to exclude them from compilation entirely, since in peak periods they provide lots of value in compute minutes, and they can take some load off the 128-core WSs and reduce queues.
Today we use a load balancing plugin in Jenkins to deal with this. You give each PC an initial score (set manually when the runner is set up and maintained by hand afterwards). Then a penalty is applied for each active job on a runner. When a new job arrives, it is given to the runner with the highest current score.
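To make the scoring scheme concrete, here is a minimal Python sketch of how I understand it. This is not the plugin's actual code: the names `Runner`, `assign`, the example scores, and the penalty value are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    name: str
    initial_score: float  # set by hand when the runner is added (hypothetical values below)
    active_jobs: int = 0

    def score(self, penalty: float) -> float:
        # Current score: the manual initial score minus a fixed penalty per active job.
        return self.initial_score - penalty * self.active_jobs


def assign(runners: list[Runner], penalty: float) -> Runner:
    # A new job goes to the runner with the highest current score.
    best = max(runners, key=lambda r: r.score(penalty))
    best.active_jobs += 1
    return best


# Hypothetical pool: two workstations and one weak camera PC.
pool = [Runner("ws128", 100), Runner("ws64", 60), Runner("pc4", 5)]
order = [assign(pool, penalty=25).name for _ in range(8)]
print(order)
```

With these made-up numbers the camera PC only receives a job once both workstations have dropped below its score, which is the behaviour we want. The catch is that the initial scores and the penalty are exactly the values we have to keep tuning by hand.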
This system gets the job done, but it is a pain to manually update all the initial scores when the runner pool changes or when the nature of the jobs changes significantly. Also, since it relies on manual heuristics, the load balancing is far from perfect. But it works…
Ideally, the load balancer should learn the requirements (CPU, IO) of each job and the capabilities of each runner, update these dynamically, and schedule jobs on a best-effort basis. When runners or jobs are added, or jobs change their behaviour, the balancer should pick this up.
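As a rough illustration of what "learning" could mean here, the sketch below keeps a smoothed estimate of how long each kind of job takes on each runner and sends a new job to the runner with the lowest expected duration. Everything in it is an assumption for the sake of the example: the class name `LearningBalancer`, the exponential smoothing, the default duration for unseen combinations, and especially the crude model that a job slows down linearly with the number of jobs already running on the same machine.

```python
class LearningBalancer:
    def __init__(self, runners, alpha=0.3, default_seconds=600.0):
        self.alpha = alpha                      # weight of the newest measurement
        self.default_seconds = default_seconds  # guess for unseen (job, runner) pairs
        self.active = {name: 0 for name in runners}
        self.estimate = {}                      # (job_kind, runner) -> smoothed seconds

    def expected_seconds(self, job_kind, runner):
        base = self.estimate.get((job_kind, runner), self.default_seconds)
        # Assumed load model: duration grows linearly with concurrent jobs.
        return base * (1 + self.active[runner])

    def assign(self, job_kind):
        # Pick the runner expected to finish this kind of job soonest.
        runner = min(self.active, key=lambda r: self.expected_seconds(job_kind, r))
        self.active[runner] += 1
        return runner

    def finish(self, job_kind, runner, seconds):
        # Report a measured duration and fold it into the running estimate.
        self.active[runner] -= 1
        key = (job_kind, runner)
        old = self.estimate.get(key)
        if old is None:
            self.estimate[key] = seconds
        else:
            self.estimate[key] = (1 - self.alpha) * old + self.alpha * seconds


balancer = LearningBalancer(["ws128", "pc4"])
runner = balancer.assign("build")
balancer.finish("build", runner, seconds=60.0)
```

A new runner or a new job type simply starts from the default guess and is corrected by the first few measurements, so nobody has to maintain scores by hand. A real scheduler would also need to measure under load and occasionally try runners it has no data for.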
I hope I have given enough context; if not, just ask. The questions are:
- How does Actions handle load balancing of self-hosted runners?
- Does it cover our use case?