Jobs failing randomly, re-run helps but other jobs then fail

In our project we have multiple workflows, one of which has 28 jobs (Linux, testing different compilers). Some of those 28 jobs randomly fail. For example: Cleanup · taocpp/PEGTL@663a9f4 · GitHub

As you can see from the example, the compiler itself crashes. Re-running the jobs might help for one job, but chances are that another one or two of those 28 jobs then fail due to the same problem. I have re-run the same jobs multiple times; each time, one or two of them fail.

Since it is the compiler crashing, it is not an instability in our code. And since those jobs use different compilers, it is not a single compiler's fault either.

I would guess it is some kind of problem with the runners themselves, perhaps they are over-committed and running out of resources? But I can't really debug that from the outside.
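The closest I can get is logging what the runner actually provides right before the build, with an extra step in the job's `steps:` (a minimal sketch; the step name is arbitrary and not part of our workflow):

```yaml
# Hypothetical diagnostic step for a Linux job; prints what the runner provides.
- name: Show runner resources
  run: |
    nproc        # number of CPU cores the runner exposes
    free -h      # total and available memory
    df -h .      # free disk space in the workspace
```

If memory turns out to be the problem, lowering build parallelism (e.g. `make -j2` instead of `make -j$(nproc)`) would presumably reduce the peak usage, at the cost of slower builds.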

So, first of all: can anyone from GitHub have a look at why the runners are not stable? This is super annoying and I don't see anything I can do to fix it. Or where do I need to report this?

Secondly: it would be immensely helpful if I could re-run just a single job. Why can I only re-run all of the jobs or none? This makes zero sense.


Forgot to mention: those random failures also happen in our Windows and macOS builds. And again, not while running any of our code/tests; they happen during the build process.


Hello,
I had a similar problem and in my case it was memory.
I still haven't been able to solve it completely; only watching a resource monitor during the build helps to prevent a total failure.
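For example, something like this keeps a coarse memory log running in the background during the build (just a sketch; the step name, the 10-second interval, and the `make -j2` command are placeholders for whatever your job actually runs):

```yaml
# Hypothetical build step with a background memory monitor.
- name: Build with memory monitoring
  run: |
    # Print memory usage every 10 seconds while the build runs.
    ( while true; do free -m | grep '^Mem'; sleep 10; done ) &
    MONITOR_PID=$!
    make -j2            # placeholder for the real build command
    kill "$MONITOR_PID"
```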

Regards