We are trying to use our on-premises servers (with GPUs attached) to run a training workflow that takes about 7 days. The last 3-4 runs have all failed with:
Tegaki++ Scheduled Training : .github#L1 GitHub Actions has encountered an internal error when running your job.
after around 6 days, 11 hours, 50 minutes (it always fails at about this point). We have the timeout specified as
timeout-minutes: 21600 (15 days), so the job should not be reaching our configured timeout, yet it still fails at about the same point every time. Are we perhaps hitting an internal maximum time limit?
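For context, here is a minimal sketch of where the timeout sits in the workflow. Only the timeout-minutes value is taken from our real setup; the job name, the "gpu" runner label, and the training command are hypothetical placeholders for illustration.

```yaml
# Illustrative sketch only: job name, runner labels, and the training command
# are hypothetical placeholders; only timeout-minutes matches the real setting.
jobs:
  train:
    runs-on: [self-hosted, gpu]   # "gpu" is an assumed custom runner label
    timeout-minutes: 21600        # 21600 min / 1440 min per day = 15 days
    steps:
      - uses: actions/checkout@v4
      - name: Run training
        run: ./train.sh           # hypothetical entry point
```

The failure at 6 days, 11 hours, 50 minutes is roughly 9350 minutes into the run, well short of the 21600 minutes configured here.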
A more helpful error message would be great :-). Thanks.