Kafka integration tests in Gradle fail when run in GitHub Actions

We’ve been migrating our company’s applications from CircleCI to GitHub Actions and have run into a strange problem.

There has been no change to the project’s code, but our Kafka integration tests started failing on the GH Actions machines. Everything works fine in CircleCI and locally (macOS and Fedora Linux machines).

Both the CircleCI and GH Actions machines run Ubuntu (we tested versions 18.04 and 20.04). We did not test macOS in GH Actions, as it doesn’t have Docker available there.

Here are the docker-compose and workflow files used by the build and integration tests:

  • docker-compose.yml
version: '2.1'

services:
  postgres:
    container_name: listings-postgres
    image: postgres:10-alpine
    mem_limit: 500m
    networks:
      - listings-stack
    ports:
      - "5432:5432"
    environment:
      POSTGRES_DB: listings
      POSTGRES_PASSWORD: listings
      POSTGRES_USER: listings
      PGUSER: listings
    healthcheck:
      test: ["CMD", "pg_isready"]
      interval: 1s
      timeout: 3s
      retries: 30

  listings-zookeeper:
    container_name: listings-zookeeper
    image: confluentinc/cp-zookeeper:6.2.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
    networks:
      - listings-stack
    ports:
      - "2181:2181"
    healthcheck:
      test: nc -z localhost 2181 || exit -1
      interval: 10s
      timeout: 5s
      retries: 10

  listings-kafka:
    container_name: listings-kafka
    image: confluentinc/cp-kafka:6.2.0
    depends_on:
      listings-zookeeper:
        condition: service_healthy
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT_HOST
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_ZOOKEEPER_CONNECT: listings-zookeeper:2181
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - listings-stack
    ports:
      - "9092:9092"
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 10s
      timeout: 10s
      retries: 50

networks: {listings-stack: {}}

  • build.yml
name: Build

on: [ pull_request ]

env:
  AWS_ACCESS_KEY_ID: ${{ secrets.TUNNEL_AWS_ACCESS_KEY_ID }}
  AWS_SECRET_ACCESS_KEY: ${{ secrets.TUNNEL_AWS_SECRET_ACCESS_KEY }}
  AWS_DEFAULT_REGION: 'us-east-1'
  CIRCLECI_KEY_TUNNEL: ${{ secrets.ID_RSA_CIRCLECI_TUNNEL }}

jobs:
  build:
    name: Listings-API Build
    runs-on: [ self-hosted, zap ]

    steps:
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GH_OLXBR_PAT }}
          submodules: recursive
          path: ./repo
          fetch-depth: 0

      - name: Set up JDK 11
        uses: actions/setup-java@v2
        with:
          distribution: 'adopt'
          java-version: '11'
          architecture: x64
          cache: 'gradle'

      - name: Docker up
        working-directory: ./repo
        run: docker-compose up -d

      - name: Build with Gradle
        working-directory: ./repo
        run: ./gradlew build -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2 -x integrationTest

      - name: Integration tests with Gradle
        working-directory: ./repo
        run: ./gradlew integrationTest -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2

      - name: Sonarqube
        working-directory: ./repo
        env:
          GITHUB_TOKEN: ${{ secrets.GH_OLXBR_PAT }}
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
        run: ./gradlew sonarqube --info -Dhttps.protocols=TLSv1,TLSv1.1,TLSv1.2

      - name: Docker down
        if: always()
        working-directory: ./repo
        run: docker-compose down --remove-orphans

      - name: Cleanup Gradle Cache
        # Remove some files from the Gradle cache, so they aren't cached by GitHub Actions.
        # Restoring these files from a GitHub Actions cache might cause problems for future builds.
        run: |
          rm -f "$HOME/.gradle/caches/modules-2/modules-2.lock"
          rm -f "$HOME/.gradle/caches/modules-2/gc.properties"
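
Note that `docker-compose up -d` in the workflow above returns once the containers have been started; on its own it does not wait for the `listings-kafka` healthcheck to report healthy. A minimal sketch of a gate that could run between the "Docker up" step and the test steps, to rule out startup timing, is shown below. The function name `wait_healthy` and the default attempt count and delay are illustrative assumptions; the container name comes from the compose file. This is a diagnostic aid, not a confirmed fix for the failure.

```shell
# Poll a container's Docker healthcheck status until it reports "healthy".
# Arguments: <container-name> [max-attempts] [sleep-seconds]
# (wait_healthy and its defaults are hypothetical, chosen for illustration.)
wait_healthy() {
  local name="$1"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local i=1
  local status=""
  while [ "$i" -le "$attempts" ]; do
    # .State.Health.Status is "starting", "healthy" or "unhealthy"
    # for containers that define a healthcheck.
    status="$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      echo "$name is healthy after $i check(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "$name did not become healthy (last status: ${status:-unknown})" >&2
  return 1
}
```

In a workflow step this could be called as `wait_healthy listings-kafka 60 5` before `./gradlew integrationTest`, so that a slow broker start shows up as a failed gate rather than as a failed Spock condition.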

The integration tests are written with the Spock framework, and the errors occur in this part:

  boolean compareRecordSend(String topicName, int expected) {
    def condition = new PollingConditions()
    condition.within(kafkaProperties.listener.pollTimeout.getSeconds() * 5) {
      assert expected == getRecordSendTotal(topicName)
    }
    return true
  }

  int getRecordSendTotal(String topicName) {
    kafkaTemplate.flush()
    return kafkaTemplate.metrics().find {
      it.key.name() == "record-send-total" && it.key.tags().get("topic") == topicName
    }?.value?.metricValue() ?: 0
  }

The error we’re getting is:

Condition not satisfied after 50.00 seconds and 496 attempts
    at spock.util.concurrent.PollingConditions.within(PollingConditions.java:185)
    at com.company.listings.KafkaAwareBaseSpec.compareRecordSend(KafkaAwareBaseSpec.groovy:31)
    at com.company.listings.application.worker.listener.notifier.ListingNotifierITSpec.should notify listings(ListingNotifierITSpec.groovy:44)

    Caused by:
    Condition not satisfied:

    expected == getRecordSendTotal(topicName)
    |        |  |                  |
    10       |  0                  v4
                false

We’ve debugged the GH Actions machine over SSH and run everything manually. The error still happens on the first run, but if the integration tests are run a second time (and on every subsequent run), everything works perfectly.

We’ve also tried creating all the necessary topics and sending some messages to them before the tests start, but the behavior was the same.
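
For illustration, pre-initialization of this kind can be scripted roughly as below. This is a sketch and not necessarily the exact commands used: the function names `create_topic` and `send_test_message` and the topic name in the usage line are hypothetical, while the container name and broker address are taken from the compose file above. It assumes the Kafka CLI tools shipped in the `cp-kafka` image.

```shell
# Create a topic inside the broker container, then describe it so the
# partition and leader assignment are printed before any test runs.
# Arguments: <topic-name> [container-name]
create_topic() {
  local topic="$1"
  local container="${2:-listings-kafka}"
  docker exec "$container" kafka-topics \
    --bootstrap-server 127.0.0.1:9092 \
    --create --topic "$topic" \
    --partitions 1 --replication-factor 1
  docker exec "$container" kafka-topics \
    --bootstrap-server 127.0.0.1:9092 \
    --describe --topic "$topic"
}

# Send a single message to a topic via the console producer, reading
# the payload from stdin inside the container.
# Arguments: <topic-name> <message> [container-name]
send_test_message() {
  local topic="$1"
  local message="$2"
  local container="${3:-listings-kafka}"
  printf '%s\n' "$message" | docker exec -i "$container" kafka-console-producer \
    --bootstrap-server 127.0.0.1:9092 \
    --topic "$topic"
}
```

A call such as `create_topic some-topic && send_test_message some-topic warmup` would run before the Gradle integration test step; `some-topic` stands in for the real topic names, which are not listed here.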

The questions we have are:

  • Is there any known issue with running dockerized Kafka on an Ubuntu machine (the error also occurred on a co-worker’s Ubuntu machine)?
  • Any ideas on why this is happening?