How we boosted our productivity by improving our CI

At Global App Testing, we boosted our productivity by improving our continuous integration pipelines. Here’s how.

Why is CI so important?

At Global App Testing we rely heavily on our continuous integration (CI) pipelines to run all sorts of checks, build applications, and deploy them to live environments. We run hundreds of pipelines with thousands of jobs every day.

For that reason, developers need to trust that the pipelines are correct, so that when the software fails a test they can act on the failure. The faster a pipeline returns feedback, the faster developers can address any reported issues and get back to their work. In short, pipelines should be fast and reliable to keep development fast and reliable.

At the beginning of the year, we heard complaints from developers about our CI process. Some tests were failing randomly, unrelated to the changes the developers had made. Retrying the job usually made the test pass, but each failure still required the developer to investigate it and manually retry, which was time-consuming and frustrating.

Developers also complained that slow pipelines made the deployment queue grow longer, making it harder to deploy changes. Combined, these problems led to situations where the deployment queue became stuck: developers had to keep retrying tests, wait to see whether they now passed, and then unblock the people waiting behind them.

We decided to look at the data and started experimenting with some improvements that could reduce manual retries and speed up the whole process.

Note: We use GitLab as our CI tool, so some problems and solutions might be specific to this platform.

Reliability: Spot instances on AWS

We run our own GitLab runners on a Kubernetes cluster and, to save on costs, we use AWS spot instances as much as possible. Because of that, a job can fail for lots of reasons: the pod can be terminated, the Docker image may fail to pull, or the Postgres service might not start before the tests do. Most of these failures have nothing to do with what the developer is doing, and in most cases a manual retry fixes the issue. So we could safely assume that any failure, other than one caused by the CI job script itself, can be retried automatically. We added a default config so that each job is retried when a known infrastructure issue causes it to fail.

    default:
      retry:
        max: 2
        when:
          - api_failure
          - job_execution_timeout
          - runner_system_failure
          - scheduler_failure
          - stuck_or_timeout_failure
          - unknown_failure

The end result is that developers are no longer bothered by things outside their control, the pipeline gets closer to returning only meaningful results, and we can gather statistics on the failures we retry.

Reliability: Flaky tests

The next issue we wanted to address was flaky tests. According to this survey, companies like Microsoft and Google estimate that 26-41% of failed builds are due to flaky tests. The problem is prevalent across our industry, so of course flaky tests are also present in our own code base.

Our aim in dealing with them was twofold. Firstly, we wanted to reduce the impact of existing flaky tests, so that developers can focus on their changes and reliably deploy to the live environments. Secondly, we wanted to reduce the number of new flaky tests being introduced. We probably won’t eliminate them entirely, but the fewer we have, the better.

To reduce the impact of flaky tests, we decided to let the pipeline pass when a failing test usually passes. To achieve that, we developed a command, retry-failed-tests, which parses the output of the test run (in our case from the Jest and RSpec frameworks), finds the failing tests, and reruns them multiple times. If a test usually passes (fewer than 30% of the reruns fail), we allow the job to pass. Any test that fails randomly is also reported to Sentry, where we gather events on failing tests and can decide to investigate a particular one. It's important that retry-failed-tests, rather than the main test run, decides whether the job fails, which is why we chain the two into a single command with the || operator.

    rspec-tests:
      variables:
        TEST_RUNNER: "rspec"
        RETRIES: 10
        FAILURES_THRESHOLD: 3
      script:
        - bundle exec rspec --profile | tee rspec-output.txt ||
          OUTPUT_FILE_NAME=./rspec-output.txt retry-failed-tests

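To make the mechanics more concrete, here is a simplified Python sketch of the idea behind retry-failed-tests. It is not our actual implementation: it assumes the failing RSpec example IDs have already been parsed out of rspec-output.txt and are passed as arguments, and the exact threshold semantics are illustrative.

    # Simplified sketch of the retry-failed-tests idea (not the real command).
    # Assumes the failing example IDs were already parsed from the test output.
    import os
    import subprocess
    import sys

    RETRIES = int(os.environ.get("RETRIES", "10"))
    FAILURES_THRESHOLD = int(os.environ.get("FAILURES_THRESHOLD", "3"))


    def failed_reruns(example_id: str) -> int:
        """Rerun a single failing example RETRIES times and count the failures."""
        failures = 0
        for _ in range(RETRIES):
            result = subprocess.run(["bundle", "exec", "rspec", example_id])
            if result.returncode != 0:
                failures += 1
        return failures


    def main(failed_examples):
        exit_code = 0
        for example_id in failed_examples:
            failures = failed_reruns(example_id)
            if failures >= FAILURES_THRESHOLD:
                exit_code = 1  # fails too often: treat it as a real failure
            elif failures > 0:
                # Flaky but below the threshold: let the job pass, but report the
                # test (in our setup, an event is sent to Sentry at this point).
                print(f"flaky: {example_id} failed {failures}/{RETRIES} reruns",
                      file=sys.stderr)
        return exit_code


    if __name__ == "__main__":
        sys.exit(main(sys.argv[1:]))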
To reduce the number of new flaky tests, we decided to do more of the same. Any tests that are added or changed are retried multiple times in a separate job on the merge request. If the reruns show that a test fails randomly, we fail that pipeline, which blocks the change from being merged to the main branch and pushes the developer to investigate what they changed. For this we added a separate command, retry-changed-tests, which takes the changed test files as input from the cmd-mr-changes command.

    rspec-check-flaky-tests:
      variables:
        TEST_RUNNER: "rspec"
        RETRIES: 10
        FAILURES_THRESHOLD: 3
      script:
        - changed_tests=$(cmd-mr-changes ./spec | (grep _spec.rb || true) | tr '\r\n' ' ')
        - "test -z \"${changed_tests}\" && exit 0 || echo \"Found changed tests: ${changed_tests}\""
        - retry-changed-tests "${changed_tests}"

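For comparison, the decision is inverted for changed tests: here flakiness should fail the job. Below is a minimal sketch of that logic, again illustrative rather than our real retry-changed-tests command (which also honours the RETRIES and FAILURES_THRESHOLD variables shown above).

    # Minimal sketch of the retry-changed-tests decision (illustrative only):
    # any failed rerun of a changed spec file marks it as flaky and fails the job.
    import subprocess
    import sys

    RETRIES = 10


    def is_flaky(spec_file: str) -> bool:
        """Rerun one changed spec file several times; any failure counts as flaky."""
        return any(
            subprocess.run(["bundle", "exec", "rspec", spec_file]).returncode != 0
            for _ in range(RETRIES)
        )


    if __name__ == "__main__":
        flaky = [spec for spec in sys.argv[1:] if is_flaky(spec)]
        if flaky:
            print(f"Flaky tests detected: {' '.join(flaky)}", file=sys.stderr)
            sys.exit(1)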
Speed: Parallel execution

To speed up the CI pipeline, the obvious thing to do is to parallelize it! It’s quite easy to split the workload (i.e. the tests) between multiple jobs and let each one perform its part. Initially, we split the test files evenly between the jobs in a pipeline using a deterministic script that, given the list of test files and the ordinal number of the node, returns the files to be run on that node. The script looked like this:

    #!/usr/bin/env python3
    # Deterministically splits the spec files between parallel CI nodes:
    # ./ci/partition.py <node_index> <node_total> prints the files for that node.
    import hashlib
    import math
    import os
    import sys
    from typing import List


    def visit_directory(path: str) -> List[str]:
        entries = os.listdir(path)
        spec_files = [os.path.join(path, x) for x in entries if x.endswith("_spec.rb")]

        for entry in entries:
            if os.path.isdir(os.path.join(path, entry)):
                spec_files.extend(visit_directory(os.path.join(path, entry)))

        return spec_files


    if __name__ == "__main__":
        my_partition, all_partitions = int(sys.argv[1]), int(sys.argv[2])
        assert my_partition >= 1 and my_partition <= all_partitions

        # Sort by a hash of the path so the order is stable but not alphabetical.
        all_specs = sorted(visit_directory("spec"), key=lambda val: hashlib.md5(val.encode("UTF-8")).hexdigest())
        total_specs = len(all_specs)
        tests_per_part = math.ceil(total_specs / all_partitions)

        for index in range((my_partition - 1) * tests_per_part, my_partition * tests_per_part):
            if index >= total_specs:
                break
            if "VERBOSE_PARTITIONS" in os.environ:
                print(f"{index + 1} / {total_specs} -> ", end="")
            print(f"{all_specs[index]}")

However, this naive approach is not the most efficient one: tests come in different shapes and sizes, so splitting the workload evenly by the number of test files results in jobs of varying lengths. In our case, the difference between the fastest and slowest jobs in that stage was 5 minutes! And since the whole stage is only as fast as its slowest job, we wanted to even it out. Fortunately, we found a solution: a gem named Knapsack. It generates a report of the time spent on each test file in each parallel job and then uses this data to split the test files more evenly between the jobs.

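Conceptually, timing-based splitting is a greedy bin-packing problem: take the slowest test files first and keep assigning each file to whichever job currently has the least accumulated time. The sketch below illustrates the idea; Knapsack's actual algorithm may differ in its details.

    # Rough sketch of timing-based splitting (greedy bin packing); Knapsack's
    # actual algorithm may differ in detail.
    from typing import Dict, List


    def split_by_timing(timings: Dict[str, float], jobs: int) -> List[List[str]]:
        """Assign each spec file to the job with the least accumulated time."""
        buckets = [[] for _ in range(jobs)]
        totals = [0.0] * jobs
        # Place the slowest files first so they can be balanced out later.
        for spec, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
            target = totals.index(min(totals))
            buckets[target].append(spec)
            totals[target] += seconds
        return buckets


    # Example: the timing report maps spec files to the seconds they took.
    report = {"spec/a_spec.rb": 120.0, "spec/b_spec.rb": 90.0, "spec/c_spec.rb": 30.0}
    print(split_by_timing(report, 2))
    # => [['spec/a_spec.rb'], ['spec/b_spec.rb', 'spec/c_spec.rb']]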
As a setup step, we need to fetch the report with the test timings generated on the master branch:

    retrieve-tests-metadata:
      script:
        - mkdir -p ./knapsack
        - aws s3 cp s3://bucket/knapsack/report-master.json ./knapsack/report-master.json
      artifacts:
        paths:
          - knapsack/
        expire_in: 7d
        when: always

Then we needed to switch the test execution from calling RSpec directly to using the Knapsack rake task:

    # BEFORE
    bundle exec rspec --profile -- $(./ci/partition.py ${CI_NODE_INDEX} ${CI_NODE_TOTAL})

    # AFTER
    export KNAPSACK_GENERATE_REPORT=true
    export KNAPSACK_REPORT_PATH="knapsack/$(echo "${CI_JOB_NAME}" | sed -E 's|[/ ]|_|g')_report.json"
    cp knapsack/report-master.json "${KNAPSACK_REPORT_PATH}"
    bundle exec rake "knapsack:rspec[--profile]"

Finally, we need to merge the reports generated by each job and store the result for future runs. Fortunately, GitLab has a script to merge the reports, so we used it in our CI.

    update-tests-metadata:
      dependencies:
        - rspec-tests
      script:
        - ./ci/merge-knapsack-reports.rb ./knapsack/report-master.json knapsack/rspec*.json
        - aws s3 cp ./knapsack/report-master.json s3://bucket/knapsack/report-master.json
        - rm -f knapsack/rspec*.json
      artifacts:
        paths:
          - knapsack/
        expire_in: 7d
        when: always
      only:
        refs:
          - master

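Conceptually, the merge step just combines each job's timing map into a single report keyed by spec file. The Python sketch below is illustrative only; our CI uses the Ruby merge script referenced in the job above.

    # Illustrative only: a Python equivalent of merging per-job Knapsack reports
    # into a single report-master.json (our CI uses the Ruby script above).
    import json
    import sys


    def merge_reports(output_path, report_paths):
        merged = {}
        for path in report_paths:
            with open(path) as report_file:
                # Each report maps spec file paths to the seconds they took.
                merged.update(json.load(report_file))
        with open(output_path, "w") as output_file:
            json.dump(merged, output_file, indent=2, sort_keys=True)


    if __name__ == "__main__":
        # e.g. merge_reports("./knapsack/report-master.json", ["knapsack/rspec_1.json", ...])
        merge_reports(sys.argv[1], sys.argv[2:])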
This brought the difference between the slowest and fastest test jobs down to around 2 minutes.

Speed: Caching dependencies

In the first stage of the pipeline, we install and cache the dependencies the application needs. The cache key is calculated from the lock file for the given application type; for example, Gemfile.lock for Ruby on Rails applications. These jobs take longer whenever the dependencies change: a change to the lock file invalidates the whole cache, so all dependencies have to be installed from scratch. Additionally, in GitLab the cache is uploaded by default at the end of every job, even when nothing in the cached directory has changed.

To speed up the job, we addressed two obvious issues: installing all of the dependencies from scratch, and always uploading the cache. The first change was to add a fallback cache: a cache updated on the application’s main branch which contains all the dependencies.

    variables:
      CACHE_VERSION: "v10"
      CACHE_FALLBACK_KEY: "${CACHE_VERSION}-global-bundle-cache-35"

Now, when the lock-file-based cache is invalidated by a change to the file, the fallback cache is downloaded and only the dependencies affected by the change are installed on top of it. There is one caveat: in GitLab, every cache key gets an automatically appended index that increments when someone manually clears the cache (you can read about it in this issue). When that happens, we need to update the index in our fallback cache key as well.

The second thing taking unnecessary time was uploading the cache when nothing had changed. There's an open GitLab issue to change this behavior, but it has not yet been resolved. Fortunately, there are workarounds, and we used one of them: you can configure when the upload should happen, i.e. when the job fails or when it passes. It isn't the most elegant option, but it's what's available. We redirect the bundler output to a file and check whether it mentions installing any dependencies; if it does, we fail the job with a specific exit code that the job is allowed to fail with. The cache is then uploaded only when the job "fails" this way, which saves us about 30 seconds on each pipeline run.

    dependencies_cache:
      cache:
        - <<: *bundle-cache
          policy: pull-push
          when: on_failure
      script:
        - bundle install | tee -a bundle_log
        # Only upload cache when it changed, workaround based on:
        # https://gitlab.com/gitlab-org/gitlab-runner/-/issues/3523#note_833284715
        - |
          if grep "Installing" bundle_log; then
            exit 123  # push cache
          else
            exit 0    # skip pushing cache
          fi
      allow_failure:
        exit_codes: 123

Speed: Optimizing for Docker layer caching

We use our CI pipeline to build Docker images for our applications, which are later deployed to our cloud infrastructure. Docker also uses a cache, based on image layers. One downside of building Docker images in a CI pipeline is that jobs can be assigned to any one of many runners. When an image is built on one runner, its cached layers stay on that runner, so when another runner picks up the next build job it can't use the layers cached by the previous one.

Well... what if it could? We decided to set up a long-lived Docker build service that accepts requests from CI pipelines, and we point the jobs that build Docker images at it via the DOCKER_HOST environment variable. This way Docker can use its layer cache most of the time and save time building the images.

We had fixed Docker layer caching, but the result still wasn't ideal. We were already caching the app's dependencies in the Dockerfile, but it wasn't enough: changing a single dependency would invalidate the cache for the whole "install dependencies" step. To optimize that, we took a similar approach to the dependency installation in CI jobs and added a fallback. Once a week, we store a copy of the stable lock file from master.

    cache_gemfile_lock:
      rules:
        - if: '$CI_PIPELINE_SOURCE == "schedule"'
          when: always
        - when: manual
          allow_failure: true
      cache: []
      script:
        - cp Gemfile.lock Gemfile.lock.stable
        - cp Gemfile Gemfile.stable
      artifacts:
        paths:
          - ${CI_PROJECT_DIR}/Gemfile.lock.stable
          - ${CI_PROJECT_DIR}/Gemfile.stable
        expire_in: 7 days
        when: always

Then we copy those stable files into the image first and install the stable dependencies:

    COPY ./Gemfile.lock.stable "$APP_PATH/Gemfile.lock"
    COPY ./Gemfile.stable "$APP_PATH/Gemfile"

    RUN bundle install

As the stable lock file doesn't change often, this layer stays cached. When the current, updated lock file is copied in afterwards and the dependencies are installed again, only the new and updated ones need to be installed.

Conclusion

At the end of this first iteration of improvements, we reduced the median pipeline duration from 15 minutes to 10, and the p75 and p90 times also went down. With the number of pipelines and jobs we run, this was a significant improvement, and developers' moods improved as well. The pipeline failure rate dropped from 16% to 8%, and only 0.4% of tests were reported as flaky, giving us less noisy pipelines and higher developer confidence.