Observability of runners

We provide our teams with runners on AWS using philips-labs/terraform-aws-github-runner, a Terraform module for scalable GitHub Actions runners on AWS.

This makes the teams independent: they can simply request a runner with

runs-on: [aws]

Cool, no?

What is not cool, though, is that (until now) we have zero transparency or observability. It is a question of when, not if, the first teams will hog the runners.

Yes, they autoscale, but we are not ready to burn money just because someone thinks they need to install modules 7 times, codes an endless loop, or (worst case) mines some 🪙.
Also, teams should break down their tests into smaller, fast-feedback bites.

Now, what we lack, and what I cannot find anywhere out there, is some way to observe the runners. My requirements go in this direction:

  • Which job is executed the most?
  • Which job fails the most?
  • Which job takes the longest?
  • Which step from which job takes the longest or fails the most often?
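
As an illustration of what such a report could compute, here is a minimal sketch in Python. It is an assumption, not an existing tool: it takes job records shaped like the job objects the GitHub REST API returns for a workflow run (fields name, conclusion, started_at, completed_at, and steps), and the function name summarize and the SAMPLE_JOBS records are hypothetical, made up purely for demonstration. How the records are collected for a given set of runners is left open.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # e.g. 2023-01-01T10:00:00Z


def _seconds(start: str, end: str) -> float:
    """Duration in seconds between two UTC timestamps."""
    delta = datetime.strptime(end, TIME_FORMAT) - datetime.strptime(start, TIME_FORMAT)
    return delta.total_seconds()


def _top(counter: Counter):
    """Key with the highest count, or None if nothing was counted."""
    return counter.most_common(1)[0][0] if counter else None


def _slowest(durations: dict):
    """Key with the highest mean duration, or None if there is no data."""
    if not durations:
        return None
    return max(durations, key=lambda key: mean(durations[key]))


def summarize(jobs: list) -> dict:
    """Answer the four questions above for a list of job records."""
    runs, failures, step_failures = Counter(), Counter(), Counter()
    job_durations, step_durations = defaultdict(list), defaultdict(list)

    for job in jobs:
        name = job["name"]
        runs[name] += 1
        if job["conclusion"] == "failure":
            failures[name] += 1
        job_durations[name].append(_seconds(job["started_at"], job["completed_at"]))

        for step in job.get("steps", []):
            key = (name, step["name"])  # a step is identified per job
            step_durations[key].append(
                _seconds(step["started_at"], step["completed_at"])
            )
            if step["conclusion"] == "failure":
                step_failures[key] += 1

    return {
        "most_executed_job": _top(runs),
        "most_failed_job": _top(failures),
        "longest_job": _slowest(job_durations),
        "longest_step": _slowest(step_durations),
        "most_failed_step": _top(step_failures),
    }


# Hypothetical sample records, invented for this sketch.
SAMPLE_JOBS = [
    {"name": "build", "conclusion": "success",
     "started_at": "2023-01-01T10:00:00Z", "completed_at": "2023-01-01T10:05:00Z",
     "steps": [
         {"name": "install", "conclusion": "success",
          "started_at": "2023-01-01T10:00:00Z", "completed_at": "2023-01-01T10:04:00Z"},
         {"name": "compile", "conclusion": "success",
          "started_at": "2023-01-01T10:04:00Z", "completed_at": "2023-01-01T10:05:00Z"},
     ]},
    {"name": "build", "conclusion": "failure",
     "started_at": "2023-01-01T11:00:00Z", "completed_at": "2023-01-01T11:03:00Z",
     "steps": [
         {"name": "install", "conclusion": "failure",
          "started_at": "2023-01-01T11:00:00Z", "completed_at": "2023-01-01T11:03:00Z"},
     ]},
    {"name": "e2e", "conclusion": "success",
     "started_at": "2023-01-01T12:00:00Z", "completed_at": "2023-01-01T12:30:00Z",
     "steps": [
         {"name": "run tests", "conclusion": "success",
          "started_at": "2023-01-01T12:00:00Z", "completed_at": "2023-01-01T12:30:00Z"},
     ]},
]

if __name__ == "__main__":
    print(summarize(SAMPLE_JOBS))
```

The sketch ranks "longest" by mean duration per job or step name, so a single slow outlier does not dominate; ranking by total time would instead show where the runners spend the most money overall.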

And so on. Let’s just say I want to observe the runners on a per-runner basis. I know there are settings such as timeout-minutes on steps, but that is the wrong way around.

I want to observe which teams “violate” our guidelines and mentor them into the pattern. Of course, a “hard limit” for jobs is an option, but then again, that takes away all freedom for special cases.

What “runner observability” exists out there?

Thank you 🤗

That is a great requirement and a common one.

We are thinking about adding more visibility here, and we have some things planned for the future.

The actions/runner repository (The Runner for GitHub Actions) might be a good place to kick-start discussions and conversations, so you can get more buy-in on what is needed the most.

Appreciate the feedback! Thank you!

I triggered a discussion here.