GitHub Actions Offline Runner removed?

Hello! I’m using philips-labs/terraform-aws-github-runner (a Terraform module for scalable GitHub Actions runners on AWS) to create self-hosted Runners inside AWS. The module uses EC2 Spot instances and can scale the set of active Runners up and then back down to zero (so there is no cost when no Runners are active). To scale down to zero, yet still have GitHub able to trigger the webhook on the AWS API Gateway, an ‘offline runner’ must be created.

What I’ve seen is that sometimes this ‘offline runner’ mysteriously disappears from the GitHub Organization I create it in. Is there a timeout that is triggered to prune ‘offline runners’ from a Repository or Organization? Is there a log of this activity somewhere? Is there any way to ensure that an ‘offline runner’ is never purged from an Organization’s Runner list?

Thanks in advance! This is a head-scratcher! :wink:

Cheers,
Jesse

According to the documentation in Removing self-hosted runners the inactivity expiry is 30 days:

A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days.

The simplest workaround for this would be to create a Workflow that runs on a weekly[1] schedule, e.g.:

on:
  schedule:
    - cron:  '0 0 * * 0'

jobs:
  wake-runner:
    runs-on: offline-runner
    steps:
      - run: echo "Awaken, My Love!"

[1] Weekly because every 29 days is a little trickier to represent in cron format, so I think the simplicity is worth the additional cost of 3 more runs. However, if the additional cost is an issue, you could run it every 29 days instead!
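
If weekly runs are more than you want, one way to stay inside the 30-day window without expressing "every 29 days" in cron is to run twice a month on fixed days. This is only a sketch of the schedule block: the days 1 and 15 are an arbitrary choice, and offline-runner is the same example label as above, not a name GitHub provides.

```yaml
on:
  schedule:
    # 00:00 UTC on the 1st and 15th of every month;
    # the longest gap between two runs is 17 days.
    - cron: '0 0 1,15 * *'

jobs:
  wake-runner:
    runs-on: offline-runner   # example label, replace with your own
    steps:
      - run: echo "Awaken, My Love!"
```

GitHub evaluates scheduled workflows in UTC and uses the five-field cron format (minute, hour, day of month, month, day of week).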


Thanks for this, @shrink ! Just one other question - should ‘runs-on:’ refer to ‘self-hosted’ or ‘offline-runner’? We couldn’t find a runner that matched ‘offline-runner’. Cheers!

Sorry for the confusion: offline-runner is just an arbitrary label I used as an example, assuming you’d like to identify the runner that is expected to be offline, so you can guarantee the job doesn’t run on any other runners, in case your organisation introduces runners that are online all the time. The “Using self-hosted runners in a workflow” documentation describes how you can use labels, and how they are prioritised.
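
To make that concrete, here is a minimal sketch of a job that targets a runner by more than one label. It assumes you have assigned a custom label called offline-runner (a hypothetical name) to the placeholder runner when registering it. A job with several labels is only routed to a runner that carries all of them.

```yaml
jobs:
  wake-runner:
    # The runner must have BOTH labels: the built-in "self-hosted"
    # label and the custom (hypothetical) "offline-runner" label.
    runs-on: [self-hosted, offline-runner]
    steps:
      - run: echo "Awaken, My Love!"
```

With only `runs-on: self-hosted`, the job could be picked up by any self-hosted runner available to the repository, which is what the custom label is meant to prevent.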


We implemented what you suggested, but still the runner gets deleted, sadly.

Thinking about this more, I don’t think this would actually work. The offline runner is purged when it hasn’t contacted GitHub’s servers within 30 days, and any job targeted at it never runs on it successfully. It seems to be the runner agent’s failure to contact GitHub that makes GitHub delete the runner, not whether jobs are scheduled for that runner.

The root of the problem is that the set of runners goes to zero by design. (And because we are still prototyping the scenarios that actually use the runners, they aren’t getting used as much as they would be in production.) This is kind of a catch-22: we are trying to justify internally the use of GitHub Actions Runners in AWS (vs. Jenkins and other CI pipeline systems), but until we do, they will be underutilized. :frowning:

If folks have any other suggestions for a solution, I’m all ears! :slight_smile: And I hope that GitHub themselves figure out a path to allow for going to ‘zero’ running Runners. Happy to beta test approaches!

Cheers.

I’m sorry to hear that it isn’t working :frowning:

The principle behind the suggestion is that when a Runner is activated (i.e. it runs a job), it communicates with GitHub. So, in theory, as long as a Runner performs a job at least once every 30 days, it will remain active indefinitely.

I’m going to guess that what you’ve discovered is that there are 2 ways a Runner can “communicate with GitHub”: one that GitHub uses to understand that the Runner is active, and one that reports the status of a job it has processed. I made an assumption in my previous comment, that those methods of communication were one and the same because it seems likely – but, I guess not!

If that isn’t true, we need to identify the method that GitHub uses to identify if a Runner has “communicated with GitHub”. The Runner source code is public so I think we might be able to track it down, and somehow convince your offline runner to communicate with GitHub.

My first guess is that perhaps “communicated with GitHub” is a mischaracterization in the documentation, and maybe the actual behaviour is something like “has behaved like a normal Runner (i.e. listened for and consumed messages) in the last 30 days”. I think this might make sense because otherwise a normal, online Runner that has not handled a job in 30 days would disappear, and that doesn’t appear to be the case.

So what is the difference between a normal runner and one of these offline runners? I think the key difference is that a normal runner is created on a machine, registered, and then turned off and on with state persisting between runs, whereas these offline runners are destroyed and recreated without any state, except the credentials, which are inserted into (and retrieved from) SSM.

Keeping all that in mind, and returning to the README of the terraform setup you’re using, this stands out:

Create this runner manually by following the GitHub instructions for adding a new runner on your local machine. If you stop the process after the step of running the config.sh script the runner will remain offline. This offline runner ensures that builds will not fail immediately and stay queued until there is an EC2 runner to pick it up.

If I look at install-config-runner.sh this line stands out:

CONFIG=$(aws ssm get-parameters --names ${environment}-$INSTANCE_ID --with-decryption --region $REGION | jq -r ".Parameters | .[0] | .Value")
# ...
./config.sh --unattended --name $INSTANCE_ID --work "_work" $CONFIG

So we’re launching the runner with a complete set of configuration values, rather than configuring it. According to the Runner:

// command is not required, if no command it just starts if configured

Then, further down, we can see:

if (command.Configure)
{
// ...
        await configManager.ConfigureAsync(command);
// ...
}

If we look in the ConfigManager, we can see that this is where pretty much everything happens, including…

publicKey = rsa.ExportParameters(false);
// ...
agent = UpdateExistingAgent(agent, publicKey, userLabels);
// ...
await _runnerServer.ConnectAsync(new Uri(runnerSettings.ServerUrl), credential);

bingpot! That’s it, I think. My best guess conclusion: launching the runner on a new machine without configuring it means that the Runner doesn’t “communicate with GitHub” in the right way. The Runner has the appropriate credentials to perform a job, and report that status back to GitHub – because they’re in config – but crucially, the Runner isn’t registered as an “agent” (identified by a Public Key stored on the machine?) and so the Runner doesn’t really exist.

I guess this is a bug in GitHub’s Runner? Even if this doesn’t change, though, I think there are a few options. The simplest would be to always configure the Runner using the stored credentials so that the full configuration lifecycle for the Runner is completed on each machine.

Anyway, that’s the best guess I have based on an hour of poking around.

So, the Runner docs do just flat out state:

“A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days.”

So, if that’s true, then it’s merely that the runner didn’t connect at least once in the previous 30 days. It’s not about jobs assigned. It’s probably an assumption on the part of the GH Engineers that there’d be no scenario where you’d be required to have an offline runner exist for an indeterminate length of time. And the reason this scenario requires this offline runner is because of another design decision that there would never be zero connecting online runners (that’s not a condition GitHub’s internal runners would ever reach as long as GitHub offers the service! :smiley: ) and that a runner would be spun up if and only if there is a request.

I guess another approach would be to just always schedule an idle instance, and the Terraform module does have this as a feature, but then some of the cost-savings and lower maintenance of having a ‘lights out if no work is needed’ setup disappears.

Now how I handle this with GitHub is a question - is this a bug? Or by design? And if it is by design, is there a case to be made for a feature that is likely to be picked up by the GitHub Actions/Runner devs?