Long-Running Self-Hosted Runner (~7 days)

We are trying to use our on-premises servers (with GPUs attached) to run a training workflow that takes about 7 days. The last 3-4 runs have all failed with:

Tegaki++ Scheduled Training : .github#L1
GitHub Actions has encountered an internal error when running your job.

The failure occurs after about 6 days, 11 hours, 50 minutes (it always fails around this point). We have the timeout set to timeout-minutes: 21600 (15 days), yet the job still fails at about the same point each time. Perhaps we are hitting an internal maximum time limit?
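
For reference, here is a quick check of the numbers above, as a minimal sketch (the variable names are only for illustration). It suggests the job dies well before the configured timeout, so the timeout setting itself is probably not what is stopping it:

```python
from datetime import timedelta

# Configured limit from the workflow: timeout-minutes: 21600
configured_timeout = timedelta(minutes=21600)

# Approximate runtime at which the job has been failing
observed_failure = timedelta(days=6, hours=11, minutes=50)

# Express the observed failure point in minutes for a direct comparison
failure_minutes = int(observed_failure.total_seconds() // 60)

print(f"configured timeout: {configured_timeout.days} days")
print(f"observed failure:   {failure_minutes} minutes")
print(f"fails before timeout: {observed_failure < configured_timeout}")
```

The failure point is roughly 9350 minutes, less than half of the 21600 minutes we configured.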

A more helpful error message would be great :-). Thanks.