Jobs getting killed for no reason

Hello,
I’ve been experiencing a lot of this today:

2021-12-09T23:21:58.2790047Z ##[error]The operation was canceled.
2021-12-09T23:21:58.2833095Z Evaluate and set job outputs
2021-12-09T23:21:58.2846150Z Cleaning up orphan processes
2021-12-09T23:21:58.2968575Z Terminate orphan process: pid (20065) (run_in_docker.sh)
2021-12-09T23:21:58.3006327Z Terminate orphan process: pid (20067) (docker)

I didn’t change anything in my scripts that could cause a premature cancellation of the job; the only thing I changed recently was upgrading the runner on my instance.

This is what my workflow file looks like:

name: Release

on:
  # https://github.community/t/workflow-dispatch-event-not-working/128856/2
  workflow_dispatch:
    inputs:
      ai_prod_tag:
        description: 'The AI tag/name'
        required: true

concurrency:
  group: "build_image"
  cancel-in-progress: false

env:
  checkout_path: image_builder
  registry: 233424424.dkr.ecr.us-east-1.amazonaws.com

jobs:
  prepare_env:
    needs: [ cleanup, check_ai_prod_var_set ]
    name: Create the AI image tag
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    outputs:
      image_name: ai-image
      image_tag: v${{ steps.tag.outputs.date }}
    steps:
      - name: Get current date and set it as the image tag
        id: tag
        run: echo "::set-output name=date::$(date +'%Y-%m-%d-%H%M')"

  build_image:
    needs: [ cleanup, prepare_env ]
    name: Build the PW AI image
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          path: ${{ env.checkout_path }}

      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@v1
        with:
          install: true

      # https://github.com/marketplace/actions/build-docker-images-using-cache
      - name: Build the AI image
        uses: whoan/docker-build-with-cache-action@v5
        with:
          registry: ${{ env.registry }}
          push_image_and_stages: false
          image_name: ${{ needs.prepare_env.outputs.image_name }}
          image_tag: ${{ needs.prepare_env.outputs.image_tag }}
          context: ${{ env.checkout_path }}/.
          build_extra_args: >
            --build-arg AWS_ACCESS_KEY_ID=${{ env.WEIGHTS_AWS_ACCESS_KEY_ID }}
            --build-arg AWS_SECRET_ACCESS_KEY=${{ env.WEIGHTS_AWS_SECRET_ACCESS_KEY }}
            --build-arg AI_IMAGE_TAG="${{ needs.prepare_env.outputs.image_tag }}"
            --build-arg PW_ENVIRONMENT=production
            --build-arg LAUNCHER=ai_launch_aws_batch.sh

  ecr-push:
    needs: [ build_image, prepare_env ]
    name: Push the AI image to ECR
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    steps:
      # https://github.com/aws-actions/configure-aws-credentials
      - name: Configure AWS Credentials for ECR
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.ECR_SHARED_SERVICES_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.ECR_SHARED_SERVICES_AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Push the AI image to ECR
        run: ./.github/scripts/push_ecr_release.sh
        env:
          REGISTRY: ${{ env.registry }}
          DOCKER_IMAGE: ${{ needs.prepare_env.outputs.image_name }}
          DOCKER_TAG: ${{ needs.prepare_env.outputs.image_tag }}
          GITHUB_OAUTH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_BRANCH: ${{ github.ref_name }}
          GITHUB_REPOSITORY: ${{ github.repository }}
        working-directory: ${{ env.checkout_path }}

  regression_tester_200_set:
    name: (200 set) Generate the regression metrics for the current AI image
    timeout-minutes: 1440 # 24 hours
    needs: [ prepare_env, ecr-push ]
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    outputs:
      regression_report_s3_path: ${{ steps.report.outputs.regression_report_s3_path }}
    steps:
      - name: Run the regression metrics generation on the 200 set
        run: ./.github/scripts/run_in_docker.sh /code/.github/scripts/regression_tester/generate_metrics.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ env.AWS_BATCH_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ env.AWS_BATCH_SECRET_ACCESS_KEY }}
          DOCKER_IMAGE: "${{ env.registry }}/${{ needs.prepare_env.outputs.image_name }}"
          DOCKER_TAG: ${{ needs.prepare_env.outputs.image_tag }}
        working-directory: ${{ env.checkout_path }}
      - name: Run the regression comparison on the 200 set
        run: ./.github/scripts/run_in_docker.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ env.AWS_BATCH_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ env.AWS_BATCH_SECRET_ACCESS_KEY }}
          DOCKER_IMAGE: "${{ env.registry }}/${{ needs.prepare_env.outputs.image_name }}"
          DOCKER_TAG: ${{ needs.prepare_env.outputs.image_tag }}
        working-directory: ${{ env.checkout_path }}
      - name: Upload the regression report
        id: report
        run: |
          echo "Report uploaded to $report_path"
        env:
          AWS_ACCESS_KEY_ID: ${{ env.AWS_BATCH_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ env.AWS_BATCH_SECRET_ACCESS_KEY }}
          SHARED_DRIVE_PATH: ${{ env.REGRESSIONS_DOCKER_SHARED_DRIVE_PATH_200_set }}
          S3_REPORTS_BUCKET: ${{ env.S3_REPORTS_BUCKET }}
          REGRESSIONS_PROD_AI_TAG: ${{ github.event.inputs.ai_prod_tag }}
          REGRESSIONS_NEW_AI_TAG: ${{ needs.prepare_env.outputs.image_tag }}

My regression_tester_200_set job can take up to 10 hours to execute, but for some reason it gets canceled/killed after 14 min 56 s (sometimes earlier).
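For what it’s worth, the only timeout settings I’m aware of are jobs.<job_id>.timeout-minutes (which I already set to 1440 above) and the per-step timeout-minutes. A trimmed sketch of the same job with both in place (the step-level value is hypothetical, I have not actually added it):

jobs:
  regression_tester_200_set:
    timeout-minutes: 1440 # job-level limit: 24 hours (already in my workflow)
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    steps:
      - name: Run the regression metrics generation on the 200 set
        timeout-minutes: 720 # step-level limit: 12 hours (hypothetical)
        run: ./.github/scripts/run_in_docker.sh /code/.github/scripts/regression_tester/generate_metrics.sh

Neither of these is anywhere near 15 minutes, so I don’t see how a timeout on my side could explain the cancellation.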
How can I prevent this from happening? Thank you


I am also experiencing this with self-hosted runners in AWS (us-east-2).

Jobs that need more than 15 minutes are dying with “##[error]The operation was canceled” at roughly the 15-minute mark, every time.


We are seeing this as well starting today:

  • Self-hosted actions runner
  • All workflows are being killed after 15 minutes

The documentation states the default timeout should be 72 hours. I opened a ticket with GitHub support.


I also encountered the same problem in a self-hosted environment.

  1. The job is aborted after about 15 minutes.
  2. Jobs started before 09 Dec 2021 18:27:00 GMT ran normally beyond 15 minutes.
  3. The job is aborted even if a timeout is specified in jobs.<job_id>.timeout-minutes.
  4. The status of the job is error rather than canceled.
  5. The job is aborted even if data is sent to stdout periodically (a simplified sketch of what I mean is below).
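To illustrate point 5: a minimal heartbeat wrapper (not our actual script) that prints to stdout every minute while the real command runs. Even with this, the job is still aborted at around 15 minutes.

#!/usr/bin/env bash
# heartbeat.sh: print a line every 60 seconds so the job log is never idle.
# Usage: ./heartbeat.sh <long-running-command> [args...]

while true; do
  echo "[heartbeat] $(date -u '+%Y-%m-%dT%H:%M:%SZ') still running"
  sleep 60
done &
heartbeat_pid=$!

# Stop the heartbeat when the script exits, whatever the outcome.
trap 'kill "$heartbeat_pid" 2>/dev/null || true' EXIT

# Run the real command; its exit code becomes the script's exit code.
"$@"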

Exactly. Reducing the job to 11 minutes makes it pass.

Related issues:

I have some logs which may be relevant:

[2021-12-10 08:13:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:23:47
[2021-12-10 08:14:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:24:47
[2021-12-10 08:15:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:25:47
[2021-12-10 08:16:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:26:47
[2021-12-10 08:17:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:27:47
[2021-12-10 08:18:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:28:47
[2021-12-10 08:19:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:29:47
[2021-12-10 08:20:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:30:47
[2021-12-10 08:21:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:31:47
[2021-12-10 08:22:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:32:47
[2021-12-10 08:23:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:33:47
[2021-12-10 08:24:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:34:47
[2021-12-10 08:25:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:35:47
[2021-12-10 08:26:47Z INFO JobDispatcher] Successfully renew job request 783714, job is valid till 12/10/2021 08:36:47
[2021-12-10 08:27:47Z ERR  GitHubActionsService] PATCH request to https://pipelines.actions.githubusercontent.com/[token]/_apis/distributedtask/pools/1/jobrequests/783714?lockToken=00000000-0000-0000-0000-000000000000 failed. HTTP Status: BadRequest, AFD Ref: Ref A: [ref-value] Ref B: [ref-value] Ref C: 2021-12-10T08:27:47Z
[2021-12-10 08:27:47Z INFO JobDispatcher] TaskAgentJobTokenExpiredException received renew job request 783714, job is no longer valid, stop renew job request.
[2021-12-10 08:27:47Z INFO JobDispatcher] Send job cancellation message to worker for job

So it seems that after about 15 minutes, the runner’s phone-home call (the renewal of the job request) gets a 400 Bad Request for some reason, and that is enough to cancel the job.
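For anyone else digging into this, those lines come from the runner’s listener log under _diag. Something along these lines (assuming a standard actions/runner install; adjust RUNNER_DIR to your install path) pulls out the renewal attempts and the token-expired error:

# Adjust to wherever your self-hosted runner is installed.
RUNNER_DIR=/opt/actions-runner
grep -h -E 'renew job request|TaskAgentJobTokenExpiredException' \
  "$RUNNER_DIR"/_diag/Runner_*.log | tail -n 40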


We’re also experiencing this issue with our self-hosted runners on GCP.

Seeing the same log lines in the listener log:

[2021-12-09 19:09:28Z ERR GitHubActionsService] PATCH request to https://pipelines.actions.githubusercontent.com/xxxxx/_apis/distributedtask/pools/1/jobrequests/5210?lockToken=00000000-0000-0000-0000-000000000000 failed. HTTP Status: BadRequest, AFD Ref: Ref A: xxx Ref B: CO1EDGE2014 Ref C: 2021-12-09T19:09:28Z
[2021-12-09 19:09:28Z INFO JobDispatcher] TaskAgentJobTokenExpiredException received renew job request 5210, job is no longer valid, stop renew job request.
[2021-12-09 19:09:28Z INFO JobDispatcher] Send job cancellation message to worker for job xxxxxxxx.
[2021-12-09 19:09:28Z INFO ProcessChannel] Sending message of length 0, with hash 'xxxxxx'
[2021-12-09 19:09:43Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2021-12-09 19:09:43Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.

None of our affected jobs can realistically be brought down to run in less than 15m, so this is severely blocking our CI.

Same for us. Some of our tasks run for 12 to 24 hours, so 15 minutes is nothing in comparison…

GitHub says it’s resolved: GitHub Status - Incident with GitHub Actions

So you can retry your workflows now.
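If you use the GitHub CLI, re-running the affected runs is quick. A small sketch, assuming gh is installed and authenticated for the repository:

# List recent failed runs, then re-run each affected one by its id.
gh run list --status failure --limit 20
gh run rerun <run-id>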