GHA internal error: An image label with the label <name> does not exist

Hi,

GitHub Actions error report incoming :slight_smile:

I am setting up a build system with self-hosted GitHub Actions runners in Google Cloud. Development on the build system involves a lot of creation/destruction of these runners.

Since about a day ago, some jobs have begun failing with internal errors. The failing jobs run specifically on the self-hosted runners (example run that is failing). The internal errors occur in both proof-of-concept repos that I have set up so far.

The problem is not that the appropriate runners are unregistered or unlabelled; they can be seen under the project’s settings page. Nor is the problem that the runners are offline; if they were, GitHub Actions would wait until they came online.

The error messages that are shown in the GitHub Actions UI are:

An image label with the label engine_build_agent does not exist.

and

The remote provider was unable to process the request.

These are not GitHub Actions error messages… these are Azure Pipelines error messages. My current guess is that something is wrong behind the scenes: GitHub Actions thinks that the runners are available, but Azure Pipelines does not.
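For context, the failing jobs target the runners via labels in the workflow file, along these lines (a minimal sketch; the workflow name, job name, and build step are illustrative, and only the `runs-on` label comes from the actual error):

```yaml
name: engine-build
on: [push]

jobs:
  build-engine:
    # "engine_build_agent" must match a label on a registered
    # self-hosted runner; the "image label ... does not exist"
    # error refers to this label lookup.
    runs-on: [self-hosted, engine_build_agent]
    steps:
      - uses: actions/checkout@v2
      - run: ./build.sh   # placeholder for the actual build step
```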

Destroying and recreating the VMs that contain the runners, and removing/re-registering the GHA runners, has not helped. The problem is intermittent; I have seen the UE4-GHA-Game repo’s build process succeed, then fail, then succeed, then fail again because of this.

Has anyone else experienced this too? Does anyone know of a workaround?


@Kalmalyzer,

I can only reproduce the same errors when a self-hosted runner with the specified label is not found in the repository or in the organization.
So, please check again whether you have added the label “engine_build_agent” to the self-hosted runner you want to use.

If you have added this label to the specified self-hosted runner but the errors still occur, please provide screenshots showing all the self-hosted runners you have added in your repository and in your organization (Settings > Actions > Self-hosted runners). I will then help you report the issue to the appropriate engineering team.

Thanks for following up on this. The runners do exist, and they have the appropriate labels.

I have done some more investigation, and I think I know what has changed:

Up until recently, if GHA was about to schedule a job, and runners with appropriate labels existed but were offline, GHA would queue that job. Since ~2 days ago, GHA instead immediately fails/cancels the job (message: “This check was cancelled”) under those circumstances.

I was relying on this (undocumented) behaviour: I would start/stop runner VMs in response to build jobs being queued up against runners that were offline. What I’m doing is quite niche [setting up a build system with dedicated VMs that scale down to zero when not in use]. I will try a different solution, where I spin up the VMs explicitly before proceeding to the jobs that need those runners.
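The replacement approach will look roughly like this (a sketch, not the actual workflow; the GCP project, zone, instance name, and build script are placeholders, and the gcloud authentication steps are omitted):

```yaml
name: engine-build
on: [push]

jobs:
  start-runner:
    # Runs on a GitHub-hosted runner, so it can start the VM
    # before any job is scheduled onto the self-hosted runner.
    runs-on: ubuntu-latest
    steps:
      - name: Start runner VM
        # Assumes gcloud is already authenticated for the project.
        run: |
          gcloud compute instances start engine-build-agent-1 \
            --project my-gcp-project --zone europe-west1-b

  build:
    # Scheduled only once the VM (and the runner on it) is up.
    needs: start-runner
    runs-on: [self-hosted, engine_build_agent]
    steps:
      - run: ./build.sh   # placeholder build step
```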

The change has some broader implications, though: if I restart a runner machine (for example, to apply security updates to it), then with the new GHA logic, any build jobs triggered while the runner machine is being restarted will fail. With the old logic, those build jobs would wait in the queue while the runner machine was down for maintenance. Since there is no way to temporarily pause pushes/PRs/etc. from triggering builds (other than editing the YAML in the repo), it is now difficult to perform runner machine updates without seeing some spurious non-green builds in GHA.
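For what it’s worth, the “editing the YAML” workaround amounts to pushing a commit that temporarily short-circuits the affected jobs, something like this sketch (reusing the illustrative job from my earlier example), and reverting it once maintenance is done:

```yaml
jobs:
  build:
    # Temporarily disabled while the runner VM is down for
    # maintenance; revert this commit afterwards.
    if: ${{ false }}
    runs-on: [self-hosted, engine_build_agent]
    steps:
      - run: ./build.sh
```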

We have had a very similar issue as of 2 days ago. We have tried to submit a ticket to GitHub about it.

The only difference between your issue and the issue we are having is that our self-hosted runner is active (not offline and not idle).

Before, it would queue all jobs that were waiting (since the runner can only run 1 job at a time) and execute them one by one. Now it queues 1 job and auto-cancels the rest with “The remote provider was unable to process the request.”

All but 1 of the jobs here use the same runner label, ec2xl-only.

This has been working for the past several months and does not seem to be intended behavior, unless we misread this part of the docs:
Job queue time - Each job for self-hosted runners can be queued for a maximum of 24 hours. If a self-hosted runner does not start executing the job within this limit, the job is terminated and fails to complete. This limit does not apply to GitHub-hosted runners.
From https://docs.github.com/en/actions/getting-started-with-github-actions/about-github-actions


That’s interesting. I can reproduce the sort of problem that you are mentioning: https://github.com/Kalmalyzer/SelfHostedRunnerTest/actions/runs/162530982 shows the result of running a minimal test repo. It contains 4 jobs, and all four jobs reference the same label. I have 1 runner for that label. Notice how one out of the four jobs has a failed check.
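The workflow in that test repo is, in essence, the following (a sketch; the label name and sleep durations are illustrative): four identical jobs, all targeting the same single runner, so with correct queueing they should simply run one after another.

```yaml
name: self-hosted-runner-test
on: [push]

jobs:
  job1:
    runs-on: [self-hosted, test_runner]
    steps:
      - run: sleep 30
  job2:
    runs-on: [self-hosted, test_runner]
    steps:
      - run: sleep 30
  job3:
    runs-on: [self-hosted, test_runner]
    steps:
      - run: sleep 30
  job4:
    runs-on: [self-hosted, test_runner]
    steps:
      - run: sleep 30
```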

It seems like the new logic does not support oversubscription of self-hosted runners at all, not even across concurrent runs. In fact, just by pushing minimal changes to the same repo multiple times, I manage to get all the jobs in a single run to fail (because the runner is busy with a job from a previous run)!

[Screenshots: Build 1, Build 2, Build 3]

I find it highly unlikely that this is intended behaviour of the GitHub Actions platform.

@Kalmalyzer, @SirSaunders,

I can also reproduce the same issue.
I have created an internal ticket to report this issue to the appropriate engineering team for further investigation and evaluation. If they make any progress, I will notify you here in time; the appropriate engineers may also reply to you directly in this thread.


I have this issue as well. I thought it was due to the recent runner update, but it looks like it’s more of a server-side issue. Here is the issue that I opened: https://github.com/actions/runner/issues/579

Ok, with the fix live, I am no longer seeing the errors; jobs are always getting queued up. Nice!

I am experiencing situations where the runner isn’t picking up a job (the job is queued and waiting for action, and the runner is idle, for several minutes), but it’s not occurring 100% of the time: 2 out of 3 attempts so far. I haven’t been able to reproduce the “job not being picked up by runner” problem in a minimal test setting either. There is a chance that the problems I am seeing now are my own doing.

@Kalmalyzer,

Yes, according to the latest update from the appropriate engineering team, they have deployed the fix.
I have tested on my side, and the error is gone.

@SirSaunders, @timharris777,
Please also try on your side to see if the problem is gone.

(Confirmation: the problems I am seeing with runners not picking up jobs are unrelated to the topic of this thread. https://github.com/actions/runner/issues/590 covers that instead.)

@Kalmalyzer,

Thanks for your confirmation.