This is part of a series of blog posts on Scala compilation in the cloud:
- Guide To Amazon Instances For Scala Apps (this post)
- How to Make Scala Compilation 5X Faster
A Guide to Amazon EC2 Instance Types
Since the introduction of Hydra, we have focused our efforts on making Scala compilation faster, be it on a laptop, a desktop, or in the cloud. As such, we listen to all our users, and one question we hear very often is: what Amazon EC2 instance type should I use for my Scala build?
What it really comes down to is: “How much does a build cost?” Amazon offers 18 general-purpose instance types, 18 memory-optimized, 18 compute-optimized… and that’s just the latest generation. Which one should you choose?
We ran a number of Scala compilation benchmarks and looked at which instance type is the fastest and which offers the best cost per build. We also checked the speed improvements in the latest 2.12.8 release, so read on for the full story or jump to Conclusions for the tl;dr.
Benchmark
For our benchmark we wanted to choose an open-source project that is representative of a real-world codebase (i.e. not a toy project) and that can build on several Scala versions. We’re particularly interested in seeing how the latest Scala version (2.12.8 at the time of writing) compares to 2.12.1, the first release in the 2.12 series.
We settled on scala-debugger, a tool and library for debugging Scala programs on the JVM:
- Code size: 88,382 lines of code, excluding blank lines and comments
- Compilation time: above 3 minutes
Methodology
Our main question is “What instance type is the fastest for Scala compilation”, so we’re going to measure only the actual compilation time. In particular, we are not going to consider the usual time-consuming tasks in a CI build, like dependency resolution and download, package or publish steps, etc.
Benchmarking on modern hardware is harder than it seems, and running inside the JVM makes it even trickier. Multiple layers of caching (CPU, OS), just in time (JIT) compilation, and garbage collection are just a few of the variables that influence our measurements.
In order to get reliable results, we need to bring the system into a so-called steady, or warm, state. As the JVM starts up and executes a program, it first needs to load classes from disk and interpret their bytecode; as it discovers which methods are executed most often, it compiles them to native code.
Setup
- To run the benchmark, we compile all projects a few times: this is the warm-up run, and should take about 4-5 minutes.
- We then start measuring, and run the sbt `compile` task 8 times to measure the time it takes to do a full build. We pick the median value.
- In between compilation runs, we delete the output directory (we do not run the `clean` task, as that would also remove the dependency resolution results and cause additional work for sbt).
- We report the minimum, maximum and median value for compilation time.
- We used OpenJDK 1.8u191, 7GB of heap (6GB for the smallest instance), and `-J-XX:MaxMetaspaceSize=512m -J-XX:ReservedCodeCacheSize=512M`. We monitored GC times and they didn’t show up as significant.
All of this is automated in our open-source sbt plugin sbt-scalabench (if you’re a Hydra user we already ship this functionality as part of the Hydra sbt plugin).
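To give an idea of what the plugin does, here is a minimal sketch of one measured run as a custom sbt command. The `timedCompile` name is ours, made up for illustration, and the snippet handles a single module only; it is not the actual plugin code:

```scala
// build.sbt -- sketch of one measured run: wipe compiled classes
// (but keep resolved dependencies), then time a full `compile`.
commands += Command.command("timedCompile") { state =>
  val extracted = Project.extract(state)

  // Delete only the class output, so dependency resolution is not redone.
  IO.delete(extracted.get(Compile / classDirectory))

  val start          = System.nanoTime()
  val (newState, _)  = extracted.runTask(Compile / compile, state)
  val elapsedSeconds = (System.nanoTime() - start) / 1e9

  newState.log.info(f"Full compilation took $elapsedSeconds%.1f s")
  newState
}
```

In practice you would run a few untimed `compile` invocations first to warm up the JIT, then invoke `timedCompile` 8 times in the same sbt session so the JVM stays warm.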
Generally, we’re going to look at the median value, but keep an eye on the spread. If the values are spread over too large an interval, the noise may be too high to draw meaningful conclusions.
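For example, with 8 runs the summary we report boils down to this (the timings below are made up for illustration):

```scala
// Summarizing 8 measured compile times (values are made up for illustration).
val timesSec = Vector(191.2, 188.7, 190.4, 189.9, 192.1, 190.0, 189.5, 190.8)
val sorted   = timesSec.sorted
val median   = (sorted(3) + sorted(4)) / 2   // middle two of 8 sorted values
println(f"min=${sorted.head}%.1f s  median=$median%.1f s  max=${sorted.last}%.1f s")
```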
Amazon instances
Before we go on let’s quickly recap the EC2 offering and the configurations we will be looking at. One dimension is the number of vCPUs on each instance, and we’ll be testing four different sizes: 2, 4, 8, and 16 vCPUs.
- General purpose machines (`m5.*`). These are Intel Xeon based, with a number of cores that doubles with each size increment. We are going to test m5.large, m5.xlarge, m5.2xlarge and m5.4xlarge (with 2, 4, 8 and 16 virtual cores respectively). We’ll be using this as the baseline.
- AMD general purpose machines (`m5a.*`). These are based on the AMD EPYC 7000 processor and mirror the m5 nomenclature at a slightly lower price than the Intel line (~10%). We are going to test the same instance sizes.
- Compute optimized (`c5.*`). These are Intel Xeon based and are, no surprise, optimized for compute-heavy tasks. They feature less memory than the general-purpose line, but at a slightly lower price point (~10%), roughly the same as the AMD line.
- Memory optimized (`r5.*`). These instances feature twice the memory of the general purpose line (starting at 16GB instead of 8GB), but at a higher price (~25%).
There are other instance types, such as burstable instances, but we decided against testing these because their variable performance adds too much noise. We used dedicated instances for benchmarking, and the spread of benchmark values was only 1-2% around the median, so we decided not to show error bars as they would be too small to see.
Results
Summary:
The number of vCPUs in a given instance class usually does not make a difference in compilation time. The fastest instance was c5.4xlarge, but only by less than 5%, and its cost/build was more than 6x higher compared to the cheapest instance (for both 2.12.1 and 2.12.8).
The number of cores makes little difference to compilation time overall, which is not surprising: the Scala compiler is single-threaded. Both general purpose and memory optimized behave in the same way, and given the price difference, the overall recommendation is to go for general purpose (more about price in the next section).
One surprising finding is that AMD general purpose instances are significantly slower than their Intel equivalents and do exhibit a slight performance increase with the number of cores. The same can be observed for compute optimized instances. In both cases we can link this to memory behavior: compute optimized instances have half the memory of the equivalent general purpose instance. This is also the reason we don’t have a data point for c5.large, the smallest instance with 2 cores and only 4GB of RAM (the project can’t build without at least 6GB). In the case of AMD, the performance difference could be explained by smaller per-core caches in the EPYC 7000 architecture.
Cost
Let’s have a look at the cost per build of each instance. Bear in mind that we only measure compilation time, while a usual build involves many other tasks (our Part 2 will focus on how to minimize these), including running tests. Since tests are not hard to parallelize, it may pay off to go for a larger instance, even if the cost/build (we’re abusing terminology here; we really mean cost per compilation) is not the cheapest.
As mentioned before, the Scala compiler is single-threaded, so there are no surprises: the most cost-effective instances are the 2-core ones, with performance roughly equal across all classes. Clearly the general purpose instance is the winner. However, sometimes you need more cores, for instance to speed up the test stage through parallelization. In that case compute-optimized instances seem to be slightly ahead, if you can cope with less memory. The bad news is that the cost per build increases in lockstep with the Amazon instance price, which roughly doubles with each step up in instance size.
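To make the cost/build numbers concrete, the arithmetic is simply hourly price times compile time. The figures below are placeholders rather than our measured values (check current AWS pricing for your region):

```scala
// Illustrative cost-per-build arithmetic; the inputs are placeholders.
val pricePerHour   = 0.096   // USD/hour, roughly an m5.large on-demand rate
val compileSeconds = 200.0   // a full compile in the ~3 minute range
val costPerBuild   = pricePerHour * compileSeconds / 3600
println(f"cost per build ≈ $$$costPerBuild%.4f")   // prints: cost per build ≈ $0.0053
```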
How does 2.12.8 improve the situation?
The latest Scala releases in the 2.12 series improved compilation speed, so we wanted to see how the project behaves on the latest release. We ran the same benchmark using Scala 2.12.8 (we needed to upgrade the ammonite dependency to be compatible with 2.12.8) and plotted the 2.12.8 compilation times next to the 2.12.1 ones.
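For reference, the version bump itself is a small change in the build definition. The sketch below is illustrative rather than the actual scala-debugger diff, and the ammonite version shown is a placeholder (pick whichever release is published for Scala 2.12.8):

```scala
// build.sbt -- illustrative version bump; the ammonite version is a placeholder.
scalaVersion := "2.12.8"

// Ammonite is published per full Scala version, hence CrossVersion.full.
libraryDependencies += "com.lihaoyi" % "ammonite" % "1.6.3" cross CrossVersion.full
```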
Performance is consistently between 20% and 25% better on all instance types, and the general purpose instance continues to be the best choice once cost is taken into account. The AMD and compute-optimized instance types are less sensitive to the number of cores on 2.12.8, which we believe is due to improvements in memory consumption, leading to better cache behavior.
To finish, the cost/build table for 2.12.8:
Conclusions
The most cost-effective instance type of the ones we tested is m5.large, and AMD instances, despite being cheaper per hour, are not cheaper per build: the price drop is completely erased by their lower performance in Scala compilation. Larger instances may bring benefits in parallel test execution, but compilation time won’t be affected. Scala 2.12.8 turned out to be 20-25% faster, so the upgrade is totally worth it.
While we have no reason to believe the results would be different on another project, you should run your own tests, and make sure to let us know if your findings are different!
See how Hydra goes beyond 5x speedup in Part 2 of this blog series!