Actions that normally work fine are getting stuck for 6 hours

I have a very simple Action defined that builds a React.js app and pushes it to an AWS S3 bucket. I have the same Action in several repositories. I haven’t changed this Action in 6+ months and use it several times a day. This action takes ~2 minutes to execute.

Every once in a while it hangs for 6 hours and then times out. Most of the time when this happens, I go over to githubstatus.com and there’s some sort of issue going on. I wait a while, retry the Action, and it works fine.

For the last 24 hours, however, I haven’t been able to get it to run successfully. The GitHub Status page says everything is operating normally, but I don’t think that’s actually true, and now I’m getting billed for the 6 hours the Action hangs due to what I believe is an internal issue with the service.

There are no error messages that I can use to track down a problem. All I see is:

The job running on runner Hosted Agent has exceeded the maximum execution time of 360 minutes.

and

The operation was canceled.

So I guess I have two questions:

  1. What is going on? Why does the same action work fine most of the time and then occasionally just hang forever?

  2. Is there a way to set a 5 or 10 minute timeout so I don’t get billed for 6 hours for a 2 minute Action?

@ryanvanderpol ,

About the first question, I have reported this problem to the appropriate engineering team for further investigation and evaluation. If they make any progress, I will let you know here, and sometimes the engineers may reply to you here directly.

About the second question, you can use the syntax “jobs.<job_id>.timeout-minutes” to set a timeout (in minutes) for an entire job, and the syntax “jobs.<job_id>.steps[*].timeout-minutes” to set a timeout for an individual step within a job.
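
For example, a minimal sketch (the build job name and the deploy.sh step are placeholders, and the 10/5-minute values are arbitrary):

jobs:
  build:
    runs-on: ubuntu-latest
    # cancel the whole job after 10 minutes instead of the default 360
    timeout-minutes: 10
    steps:
    - uses: actions/checkout@v1
    # cancel just this step if it runs longer than 5 minutes
    - run: ./deploy.sh
      timeout-minutes: 5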

Hi @ryanvanderpol 

Engineer working on Actions here! Could you let me know the name of your private repo along with the problematic workflow file? Something like konradpabjan/testing and ci.yml. If you’re not okay with publicly sharing that information, you can email me (my GitHub username @github.com). We can take a look to see if there is a problem on our end.

One thing you could do to get some more information is to turn on step debugging: https://github.com/actions/toolkit/blob/master/docs/action-debugging.md#how-to-access-step-debug-logs

Is there a specific pool that is problematic (Mac, Windows or Ubuntu in particular)? That could maybe help narrow down the issue.

When the timeout happens, what do you usually see? Is there a particular step that it always hangs on? Any messages, or is it just spinning and blank followed by the canceled message?

Per the suggestion of @brightran you can set timeouts to limit how long your workflows run. Sorry that you’re running into this issue :confused:

Thanks, @konradpabjan. I will shoot you an email with the repo names and a bunch of other info.

The workflows are all for Ubuntu.  Here is the workflow file from one of the ones that fails, but they’re all the same:

name: Build and Deploy

on:
  push:
    branches:
    - master

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v1
    - uses: chrislennon/action-aws-cli@v1.1
    - uses: actions/setup-node@v1
      with:
        node-version: '10.x'
    - run: make deploy
      env:
        AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
        AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        AWS_REGION: ${{ secrets.AWS_REGION }}
        AWS_S3_BUCKET_NAME: ${{ secrets.AWS_S3_BUCKET_NAME }}
        REACT_APP_API_URI: ${{ secrets.REACT_APP_API_URI }}

and the make deploy command is also pretty simple. It’s just an npm install followed by an aws command to push to an S3 bucket.

Most of the time I notice it’s been 2-3 hours and my changes aren’t live yet, so I catch it and manually cancel it before it hits 6 hours, but when I walk away from my desk or forget about it, I see messages like this:

Run make deploy

...

> react-scripts build

Creating an optimized production build...
Failed to compile.

./src/components/main.jsx
Cannot find file './Hero' in './src/components'.


##[error]The operation was canceled.

or

Run make deploy
  make deploy
  shell: /bin/bash -e {0}
  env:
    AWS_ACCESS_KEY_ID: ***
    AWS_SECRET_ACCESS_KEY: ***
    AWS_REGION: ***
    AWS_S3_BUCKET_NAME: ***
    REACT_APP_API_URI: ***
sed: can't read .env: No such file or directory
npm install
##[error]The operation was canceled.

In this latter case, the error about the .env file is a red herring. My Makefile has it marked as an optional dependency, so when running in this workflow it’s not supposed to find the file.

So I looked at our telemetry and I couldn’t find anything that would indicate a problem on our end. 

Generally, even if there is some degradation reported on https://www.githubstatus.com/, your workflow should behave the same if it starts successfully (during degradation, you would normally see things like delays or problems when starting workflows, or issues streaming logs). Each run gets a fresh VM with the same pre-installed software, so once a job starts it should behave the same, and any issues happening on github.com should not affect users doing things such as deployments to AWS.

I’m not familiar with AWS at all, but our images do have a preinstalled version of the CLI. I would suggest removing the third-party action that sets it up and trying the one that comes preinstalled (sometimes extra installs can cause conflicts). If the preinstalled one needs to be updated, you can file an issue in actions/virtual-environments, but I quickly looked around and didn’t see any issues related to the AWS CLI for Ubuntu. Up-to-date info on what is pre-installed on our runners can be found here: https://help.github.com/en/actions/reference/software-installed-on-github-hosted-runners
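
For example, the steps could look something like this (a rough sketch only — it assumes the preinstalled CLI covers whatever make deploy needs, and the aws --version step is just a sanity check I added, not something from your workflow):

steps:
- uses: actions/checkout@v1
- uses: actions/setup-node@v1
  with:
    node-version: '10.x'
# chrislennon/action-aws-cli removed; rely on the aws CLI preinstalled on ubuntu-latest
- run: aws --version
- run: make deploy
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_REGION: ${{ secrets.AWS_REGION }}
    AWS_S3_BUCKET_NAME: ${{ secrets.AWS_S3_BUCKET_NAME }}
    REACT_APP_API_URI: ${{ secrets.REACT_APP_API_URI }}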

Turning on step debugging can also display some extra information: https://github.com/actions/toolkit/blob/master/docs/action-debugging.md#how-to-access-step-debug-logs

It’s a bit hard to debug the make deploy script past this point :thinking: With our first-party actions we tend to split up all the steps, so we do something like npm install first, then npm test, npm format, etc., so that everything isn’t done in a single step (it makes the logs nicer too). Maybe that would help with debugging.
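
For example, split roughly like this (a sketch only — it assumes the Makefile’s deploy target amounts to an npm install, a react-scripts build, and an aws s3 sync to the bucket, which may not match your actual Makefile):

steps:
- uses: actions/checkout@v1
- uses: actions/setup-node@v1
  with:
    node-version: '10.x'
- name: Install dependencies
  run: npm install
- name: Build
  run: npm run build
  env:
    REACT_APP_API_URI: ${{ secrets.REACT_APP_API_URI }}
- name: Deploy to S3
  run: aws s3 sync build/ "s3://$AWS_S3_BUCKET_NAME" --delete
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_DEFAULT_REGION: ${{ secrets.AWS_REGION }}   # the aws CLI reads AWS_DEFAULT_REGION
    AWS_S3_BUCKET_NAME: ${{ secrets.AWS_S3_BUCKET_NAME }}

That way each step gets its own log section and timing, so a hang points at a specific command.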

Another possible suggestion: if it can’t find certain files, I would double-check that the correct paths are being used. You can try using the working-directory yaml parameter: https://help.github.com/en/actions/reference/workflow-syntax-for-github-actions#defaultsrun
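
For example, if the app lived in a subdirectory (the app/ folder here is purely hypothetical), you could set it once for the whole job:

defaults:
  run:
    working-directory: app

or on an individual step:

- run: npm install
  working-directory: app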

It might be good as well to check the AWS CLI repo, since it’s always possible that others are having a similar issue: https://github.com/aws/aws-cli

Seems a little weird that after an error (such as the Failed to compile. message in your log) the script keeps running and effectively wastes time until you finally see

The operation was canceled

It might be waiting for something to finish, but I don’t know. If a command in the script exits with a non-zero status (for example, exit 1), the step should fail immediately and the job should terminate, so I would investigate why that isn’t the case.
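
As a tiny illustration of that exit-code behavior (a hypothetical step, nothing from your workflow):

- run: |
    echo "about to fail"
    exit 1              # non-zero exit: the step fails here and the job stops
    echo "never reached"

A command that simply hangs, on the other hand, never returns an exit code, so the job keeps running until a timeout kicks in.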

Overall though, I don’t think this is indicative of any problem on our side. If you frequently see network issues between the VM and AWS or any generic configuration problems with the VM, then we can take a look, but I don’t think there is much more we can do.