
Dogfooding Test Distribution for Maximum Effect at Gradle

Here at Gradle, Inc., we don’t just talk about Developer Productivity Engineering (DPE). We practice it every day. We know that companies that make DPE a priority have focused, productive developers who deliver better code faster while experiencing the joy of coding. That’s why Gradle has engineering resources dedicated to ensuring our developers get the same benefits as our customers. And we eat our own dog food. (Or, if you prefer, we drink our own champagne. Then again, not all of our code is written in France, so maybe we’re drinking our own sparkling wine.) That includes using Develocity Test Distribution. 

Test Distribution speeds up your builds by distributing your test cases across multiple agents. Using data from your build history, it determines how long each test is likely to take and distributes the tests intelligently. It works for local and CI builds, and when the build is finished, all the results from all the agents’ tests are conveniently available in a single Develocity Build Scan™.  
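
For Gradle builds, distribution is enabled per test task via the `distribution` extension that the Test Distribution plugin adds. Here is a minimal sketch using the property names from the plugin's documentation (verify them against your plugin version):

```kotlin
// build.gradle.kts (assumes the Develocity Test Distribution plugin is applied)
tasks.test {
    distribution {
        // allow this task's tests to run on remote agents
        enabled.set(true)
        // optional: cap how many executors this task may claim
        maxLocalExecutors.set(2)
        maxRemoteExecutors.set(10)
    }
}
```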

It’s easy to distribute most unit tests, but some tests have special requirements that make it more difficult. In this post, we’ll look at some of those complications and discuss how we’re dealing with them here at Gradle. 

Challenges of distributing tests

In our case, we have three kinds of tests that are much more difficult to distribute:

  • Tests that require a database
  • Tests that use browser frameworks such as Selenium or Cypress
  • Tests that require virtualization platforms such as Docker or Vagrant

Tests that require a database

Develocity is backed by a Postgres database, so naturally, we have a lot of tests that interact with Postgres. Historically, developers had to make sure a Postgres installation was up and running on their machines; the tests would then connect to the database, set up the required schemas, and clean up afterward. On CI servers, a Postgres Docker container was spun up on demand and destroyed afterward. 

In order to make tests compatible with Test Distribution, we needed to do two things:

  1. We had to make a database available to the Test Distribution agent running on Kubernetes. We did this by installing Postgres into the same Docker image and starting it as a second process alongside the agent process.
  2. We needed to make sure schema setup and teardown happen before and after the tests run. We did this by following the recommendation in the Test Distribution documentation: we added a JUnit Platform listener implementation that executes setup and teardown SQL scripts. The scripts are transferred to the agent by declaring them as additional inputs to the test task. (Both pieces are sketched below.)
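
To make this concrete, here is a minimal sketch of such a listener. The script paths and connection settings are hypothetical; a real implementation would also need error handling and schema management:

```kotlin
import org.junit.platform.launcher.LauncherSession
import org.junit.platform.launcher.LauncherSessionListener
import java.nio.file.Files
import java.nio.file.Paths
import java.sql.DriverManager

// Registered via META-INF/services/org.junit.platform.launcher.LauncherSessionListener,
// so the JUnit Platform also picks it up on Test Distribution agents.
class PostgresLifecycleListener : LauncherSessionListener {

    override fun launcherSessionOpened(session: LauncherSession) {
        runScript("db/setup.sql")    // hypothetical path, shipped as a test-task input
    }

    override fun launcherSessionClosed(session: LauncherSession) {
        runScript("db/teardown.sql")
    }

    private fun runScript(path: String) {
        val sql = Files.readString(Paths.get(path))
        // The agent image runs Postgres as a second process on localhost.
        DriverManager.getConnection(
            "jdbc:postgresql://localhost:5432/test", "test", "test" // hypothetical credentials
        ).use { connection ->
            connection.createStatement().use { it.execute(sql) }
        }
    }
}
```

The scripts only reach the agent if the build declares them as inputs, sketched here in the Gradle Kotlin DSL:

```kotlin
// build.gradle.kts
tasks.withType<Test>().configureEach {
    // transfer the SQL scripts to the Test Distribution agent
    inputs.files("db/setup.sql", "db/teardown.sql") // hypothetical locations
        .withPropertyName("sqlScripts")
        .withPathSensitivity(PathSensitivity.RELATIVE)
}
```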

After using this approach for a while, we noticed that the OOM killer terminates the database every once in a while. When that happens, it’s hard to diagnose what’s going on, because Kubernetes has no knowledge of the database running as a second process inside the Test Distribution agent container. Also, there is currently no mechanism that prevents a broken agent from accepting new test requests, so it sometimes continues to execute tests, which fail immediately because they cannot connect to the database. We were looking at s6-overlay to address this, but we are very close to solving the issue with a different technique. We’ll update this post as soon as the solution is in production.

Tests that use browser frameworks like Selenium or Cypress

Historically, the team has relied heavily on Selenium for implementing UI tests. We’re slowly migrating them to Cypress, but some of our system tests still require Selenium. To run a Selenium-based test, you need two things: the WebDriver library and a browser whose version matches the driver.

To provide both, we implemented a JUnit 5 LauncherSessionListener that installs or upgrades the requested versions of the browser and the driver. The browser and driver versions are the only required inputs for the task. Given those inputs, our code runs two small shell scripts that we added to the Test Distribution agent image. The scripts check for the browser and driver, downloading and installing the correct versions if necessary: the same technique the Gradle wrapper uses. With that done, the scripts exit, and the browser tests are distributed and executed. 
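
Here is a minimal sketch of that listener. The script locations and system-property names are hypothetical; the versions are supplied by the build as test inputs:

```kotlin
import org.junit.platform.launcher.LauncherSession
import org.junit.platform.launcher.LauncherSessionListener

// Registered via META-INF/services/org.junit.platform.launcher.LauncherSessionListener,
// so the JUnit Platform also runs it on Test Distribution agents.
class BrowserProvisioningListener : LauncherSessionListener {

    override fun launcherSessionOpened(session: LauncherSession) {
        // Hypothetical property names; the build passes the requested versions through.
        val browserVersion = System.getProperty("test.chrome.version") ?: return
        val driverVersion = System.getProperty("test.chromedriver.version") ?: return
        // The install scripts are baked into the agent image and only download
        // when the requested version is missing, like the Gradle wrapper does.
        run("/opt/selenium/install-chrome.sh", browserVersion)
        run("/opt/selenium/install-chromedriver.sh", driverVersion)
    }

    private fun run(vararg command: String) {
        val exitCode = ProcessBuilder(*command).inheritIO().start().waitFor()
        check(exitCode == 0) { "Command failed: ${command.joinToString(" ")}" }
    }
}
```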

Tests that require virtualization, such as Docker or Vagrant

We’re using the Testcontainers library in a few places to test integration with external services. We also have system tests that exercise a full Develocity installation end to end. For these, the Gradle build running the tests first creates a Vagrant box. With that done, it installs Develocity using the appropriate installation mechanism (OpenShift, Helm, etc.). Finally, some tests are run against this installation to verify everything is working properly.

On CI, these tests are executed directly on the CI host. That means we install Docker and Vagrant on the CI nodes so the Gradle build and the tests can use them. Because our Test Distribution agent cluster runs on Kubernetes, we can’t make Docker and Vagrant available to these tests when they run on a Test Distribution agent: you cannot easily use virtualization inside a container running on Kubernetes. There are solutions to this problem, such as maintaining a pool of agents on bare-metal machines or using products like Testcontainers Cloud. If only a few tests in a project use Testcontainers, you can annotate them with the @LocalOnly annotation (supplied by gradle-enterprise-testing-annotations). That lets you distribute all tests except the ones that require Docker. 
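
Marking such a test is a one-line change. A sketch, assuming the package name the library publishes its annotations under:

```kotlin
import com.gradle.enterprise.testing.annotations.LocalOnly
import org.junit.jupiter.api.Test

// Runs only on the machine that started the build, where Docker is available;
// the rest of the suite stays eligible for distribution.
@LocalOnly
class ContainerIntegrationTest {

    @Test
    fun `service under test starts in a container`() {
        // ... start a Testcontainers container and assert against it ...
    }
}
```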

Calculating ROI

With Test Distribution, we realized early that measuring its ROI is less straightforward than for other Develocity solutions. Take, for example, the remote build cache. Once it’s activated, you automatically benefit from cache hits that speed up local and CI builds. Furthermore, Develocity provides tools like Build Scan comparisons to help you debug and fix cache misses, improving build speed even more. This makes the ROI easy to calculate: we can compare the performance of a build that runs with the build cache to one that runs without it. 

When evaluating the Test Distribution ROI, there are two questions to consider:

  1. Are there additional infrastructure costs for running Test Distribution agents?
  2. What are the time savings when running tests with Test Distribution?

First, a potential benefit on the CI side is a reduction in the number of CI nodes: builds run faster, and the computing power required to run tests moves to the Test Distribution system. Investing in sufficient real or virtual infrastructure and operating a large pool of Test Distribution agents can be significantly cheaper than running the same number of CI agents, but the actual savings depend on your particular environment.

Second, you need to look at the potential time savings. When a test cycle runs, its configuration asks Test Distribution for some agents, possibly specifying requirements for those agents (“only agents running Java 11 on Linux,” for example). If no other tests are running, Test Distribution can allocate significant resources to those tests. But if many builds take place simultaneously, the same set of tests may not get many resources. In addition, Test Distribution works with local builds as well as CI builds, so it may handle many workloads throughout the day. That means that the impact of Test Distribution, while usually substantial, can vary widely from one build to the next. The less constrained your resources are, the narrower the variation will be. 
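
Those requirements are declared on the test task and matched against capabilities the agents advertise. A sketch using the documented `requirements` property, with key=value pairs as shown in the Test Distribution documentation:

```kotlin
// build.gradle.kts
tasks.test {
    distribution {
        enabled.set(true)
        // only agents advertising both capabilities may execute these tests
        requirements.set(setOf("os=linux", "jdk=11"))
    }
}
```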

As discussed, Test Distribution does require additional work to make your tests compatible. Tests that depend on external services or infrastructure are harder to make compatible than ones that don’t. But tests with these requirements are often end-to-end or system tests that account for a significant share of the build time, which means distributing them has a much larger impact on build times than distributing unit tests. 

In sum, there are a few additional factors to consider when evaluating test distribution ROI. Still, in our deployment and our many customer deployments, Test Distribution is delivering a compelling ROI. 

Conclusion

Develocity Test Distribution is a powerful feature that can make a substantial difference in build times for both CI and local builds. Here at Gradle, it’s a vital part of our software development toolchain. Some tests are easier to distribute than others, but our overall performance gains from using Test Distribution make the effort worthwhile. You can learn more about Test Distribution by watching this informative and entertaining explainer video.