Hello,
I’ve been experiencing a lot of this today:
2021-12-09T23:21:58.2790047Z ##[error]The operation was canceled.
2021-12-09T23:21:58.2833095Z Evaluate and set job outputs
2021-12-09T23:21:58.2846150Z Cleaning up orphan processes
2021-12-09T23:21:58.2968575Z Terminate orphan process: pid (20065) (run_in_docker.sh)
2021-12-09T23:21:58.3006327Z Terminate orphan process: pid (20067) (docker)
I didn’t change anything in my scripts that could cause the job to be cancelled prematurely; the only thing I changed recently was upgrading the runner on my instance.
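In case it helps, this is roughly how I check the runner version and service state on the instance after the upgrade. The install path is an assumption (a default install under /opt/actions-runner); adjust it to wherever the runner actually lives:

# NOTE: /opt/actions-runner is an assumed install location; adjust to your setup.
cd /opt/actions-runner

cat .runner              # registered runner name, labels and pool
./config.sh --version    # should print the runner version after the upgrade
sudo ./svc.sh status     # state of the runner service, if it was installed via svc.sh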
This is what my workflow file looks like:
name: Release

on:
  # https://github.community/t/workflow-dispatch-event-not-working/128856/2
  workflow_dispatch:
    inputs:
      ai_prod_tag:
        description: 'The AI tag/name'
        required: true

concurrency:
  group: "build_image"
  cancel-in-progress: false

env:
  checkout_path: image_builder
  registry: 233424424.dkr.ecr.us-east-1.amazonaws.com

jobs:
  prepare_env:
    needs: [ cleanup, check_ai_prod_var_set ]
    name: Create the AI image tag
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    outputs:
      image_name: ai-image
      image_tag: v${{ steps.tag.outputs.date }}
    steps:
      - name: Get current date and set it as the image tag
        id: tag
        run: echo "::set-output name=date::$(date +'%Y-%m-%d-%H%M')"

  build_image:
    needs: [ cleanup, prepare_env ]
    name: Build the PW AI image
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          path: ${{ env.checkout_path }}
      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@v1
        with:
          install: true
      # https://github.com/marketplace/actions/build-docker-images-using-cache
      - name: Build the AI image
        uses: whoan/docker-build-with-cache-action@v5
        with:
          registry: ${{ env.registry }}
          push_image_and_stages: false
          image_name: ${{ needs.prepare_env.outputs.image_name }}
          image_tag: ${{ needs.prepare_env.outputs.image_tag }}
          context: ${{ env.checkout_path }}/.
          build_extra_args: >
            --build-arg AWS_ACCESS_KEY_ID=${{ env.WEIGHTS_AWS_ACCESS_KEY_ID }}
            --build-arg AWS_SECRET_ACCESS_KEY=${{ env.WEIGHTS_AWS_SECRET_ACCESS_KEY }}
            --build-arg AI_IMAGE_TAG="${{ needs.prepare_env.outputs.image_tag }}"
            --build-arg PW_ENVIRONMENT=production
            --build-arg LAUNCHER=ai_launch_aws_batch.sh

  ecr-push:
    needs: [ build_image, prepare_env ]
    name: Push the AI image to ECR
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    steps:
      # https://github.com/aws-actions/configure-aws-credentials
      - name: Configure AWS Credentials for ECR
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.ECR_SHARED_SERVICES_AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.ECR_SHARED_SERVICES_AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1
      - name: Push the AI image to ECR
        run: ./.github/scripts/push_ecr_release.sh
        env:
          REGISTRY: ${{ env.registry }}
          DOCKER_IMAGE: ${{ needs.prepare_env.outputs.image_name }}
          DOCKER_TAG: ${{ needs.prepare_env.outputs.image_tag }}
          GITHUB_OAUTH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GIT_BRANCH: ${{ github.ref_name }}
          GITHUB_REPOSITORY: ${{ github.repository }}
        working-directory: ${{ env.checkout_path }}

  regression_tester_200_set:
    name: (200 set) Generate the regression metrics for the current AI image
    timeout-minutes: 1440 # 24 hours
    needs: [ prepare_env, ecr-push ]
    runs-on: [ self-hosted, linux, x64, aws-batch ]
    outputs:
      regression_report_s3_path: ${{ steps.report.outputs.regression_report_s3_path }}
    steps:
      - name: Run the regression metrics generation on the 200 set
        run: ./.github/scripts/run_in_docker.sh /code/.github/scripts/regression_tester/generate_metrics.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ env.AWS_BATCH_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ env.AWS_BATCH_SECRET_ACCESS_KEY }}
          DOCKER_IMAGE: "${{ env.registry }}/${{ needs.prepare_env.outputs.image_name }}"
          DOCKER_TAG: ${{ needs.prepare_env.outputs.image_tag }}
        working-directory: ${{ env.checkout_path }}
      - name: Run the regression comparison on the 200 set
        run: ./.github/scripts/run_in_docker.sh
        env:
          AWS_ACCESS_KEY_ID: ${{ env.AWS_BATCH_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ env.AWS_BATCH_SECRET_ACCESS_KEY }}
          DOCKER_IMAGE: "${{ env.registry }}/${{ needs.prepare_env.outputs.image_name }}"
          DOCKER_TAG: ${{ needs.prepare_env.outputs.image_tag }}
        working-directory: ${{ env.checkout_path }}
      - name: Upload the regression report
        id: report
        run: |
          echo "Report uploaded to $report_path"
        env:
          AWS_ACCESS_KEY_ID: ${{ env.AWS_BATCH_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ env.AWS_BATCH_SECRET_ACCESS_KEY }}
          SHARED_DRIVE_PATH: ${{ env.REGRESSIONS_DOCKER_SHARED_DRIVE_PATH_200_set }}
          S3_REPORTS_BUCKET: ${{ env.S3_REPORTS_BUCKET }}
          REGRESSIONS_PROD_AI_TAG: ${{ github.event.inputs.ai_prod_tag }}
          REGRESSIONS_NEW_AI_TAG: ${{ needs.prepare_env.outputs.image_tag }}
My regression_tester_200_set job can take up to 10 hours to run, but for some reason it gets cancelled/killed after 14min56s (sometimes earlier), even though its timeout-minutes is set to 1440.
How can I prevent this from happening? Thank you
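In the meantime I plan to look at the runner's own diagnostic logs on the instance, since (as far as I understand) the worker log written around the failure time should record why the job was shut down. A rough sketch of what I'd check, again assuming the runner lives under /opt/actions-runner:

# _diag/ holds Runner_*.log and Worker_*.log; the most recent Worker_*.log
# should cover the cancelled job.
ls -lt /opt/actions-runner/_diag | head

# Look for cancellation/shutdown/update messages around the time of the failure.
grep -iE 'cancel|shutdown|update' /opt/actions-runner/_diag/Worker_*.log | tail -n 50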