Actions randomly work fine or get stuck and canceled after 6 hours

I've found that Actions jobs randomly either complete fine or get stuck under the same preconditions (same code). When I re-run the job, it sometimes completes in about 4 to 5 minutes and sometimes gets stuck in preparation and is eventually canceled at the 6-hour limit.
Since I can’t access the logs from old runs, here are two commits with identical code: one run worked and the other did not.
I don’t have much experience with GitHub Actions, so I’m not sure whether the error might be on my side, but it seems there have been somewhat similar issues recently. I had to remove https from the link since I’m only allowed to post two links.

Would be great if someone could explain that inconsistent behavior to me and show me how to fix that.

I’ve experienced the very same lockup issue and reported it here.

Let me break down the issues I’ve experienced into several parts:

  1. unreliability - things need to work reliably because our business depends on them working as expected.
  2. determinism - are the errors caused by me or by the systems I’m using?
  3. financial - repeatedly running a Windows or Mac instance for 6 hours definitely breaks my startup budget.

You expressed frustration at the first point and would like to know about the second.
The solution expressed by others is to set a timeout-minutes to a known upper limit.
This addresses my primary complaint, point 3. As for points 1 and 2, consider them known shortcomings of the systems you integrate with, and design for failure.
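
A minimal sketch of that timeout setting, assuming a workflow with a single job. The workflow name `CI`, the job id `build`, the runner, the build step and the 10-minute value are placeholders, not taken from the workflow discussed here:

```yaml
name: CI                      # placeholder workflow name
on: push

jobs:
  build:                      # placeholder job id
    runs-on: ubuntu-latest    # placeholder runner
    timeout-minutes: 10       # cancel the job after 10 minutes instead of the default 360
    steps:
      - uses: actions/checkout@v4
      - run: ./build.sh       # placeholder for the real build step
```

As far as I know, the limit only starts counting once a runner has picked up the job, so time spent waiting in the queue is probably not covered by it.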

As the great Leslie Lamport once stated:

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”

Sorry for the late response @andy-brainome and thanks for your reply. I’m rather new to all this GitHub Actions stuff. timeout-minutes might be a workaround if there were a way to automatically trigger the job again.

I know the job will never take more than 5 minutes if it runs the way it should. So is it possible to configure the action like this: set a 10-minute timeout; if the timeout is reached, automatically re-run the job; repeat this at most 10 times (in case something else breaks, I don’t want to cause unnecessary load on the server)? That would at least be a somewhat usable workaround.
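
One possible shape for such a bounded re-run, as an untested sketch under assumptions: a second workflow listens for the first one to finish and re-runs it with the `gh` CLI while the attempt counter is still below 10. The watched workflow name `CI` and the file name are placeholders. It also assumes the watched job already sets timeout-minutes, since the watcher only fires once a run has completed, and that the watcher file lives on the default branch.

```yaml
# .github/workflows/retry.yml (hypothetical file name)
name: Retry CI
on:
  workflow_run:
    workflows: ["CI"]        # placeholder: name of the workflow to watch
    types: [completed]

jobs:
  rerun:
    # Re-run only if the watched run did not succeed and it was attempt 1 to 9,
    # so the run is attempted at most 10 times in total.
    if: ${{ github.event.workflow_run.conclusion != 'success' && github.event.workflow_run.run_attempt < 10 }}
    runs-on: ubuntu-latest
    permissions:
      actions: write         # needed to request a re-run
    steps:
      - name: Re-run the watched workflow run
        env:
          GH_TOKEN: ${{ github.token }}
          GH_REPO: ${{ github.repository }}
        run: gh run rerun ${{ github.event.workflow_run.id }}
```

Note that this condition re-runs on any non-successful conclusion, so a genuinely failing build or a manually canceled run would also be retried until the attempt limit is reached.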

Btw: in our self-hosted GitHub instance the jobs never got stuck, just at