Developer Productivity Engineering Blog

How Zalando Achieved 3.2X Faster Scala Compilation Times

This article was written by Eric Torreborre, Senior Software Engineer at Zalando and author of the well-known specs2 Scala testing library. The article was first published on the Zalando Technology blog and is republished here.

Background

For the Glitch team, which looks after Quality Control at Zalando, the year 2017 started with a new resolution: Make our compile times faster. Toward the end of 2016, the team experienced a steady increase in compilation time on its project. In just one month, the compile time doubled, and it was hard to understand why. This was clearly hampering the team’s productivity, and a number of strategies were attempted to reduce the long compile time. For instance, some improvement was obtained by removing wildcard imports, modularizing the code better, and making some implicit values explicit. However, compilation was still taking too long, and the actual root cause was far from clear. In this article, we describe how, by working with Triplequote, we were able to obtain a 3.2x compilation time speedup.

The problem

The Glitch team is currently working on the delivery of an application called “Quala” (for “QUALity Assessment”). This application enables the Content Team to check the quality of products that merchants want to sell on the Zalando platform: Are the descriptions correct? Are the images of good quality? Does the product come with the right washing instructions?

The backend of Quala is called “tinbox” and is written in Scala, using many type-intensive libraries such as Shapeless, Circe, Grafter, and http4s/rho. One important design goal behind these libraries is to reduce boilerplate by letting the Scala compiler generate as much ceremony code as possible. However, the downside is that compile time can increase substantially. Unexpected interactions between macros and implicit search can lead to an exponential growth of compilation time, and it is usually difficult to understand whether the long compile time is symptomatic of a deeper problem. This pushed us to get in touch with Triplequote, a Swiss company that promises to relieve Scala teams of long compile times.

At the beginning of February, the Triplequote team joined the Glitch team in their Zalando office for three days. The mission included evaluating Hydra, Triplequote’s parallel Scala compiler, and identifying areas for compilation-speed improvements.

Let’s see how the problem was tackled.

Methodology

The first task was to collect baseline metrics to compare against objectively. As the tinbox project uses Sbt as its build tool, it is meaningful to record both the “cold” and the “warm” compile times. The terminology “cold” and “warm” refers to the state of the JVM that an application is running on.

When launching an application, the JVM starts by loading the required classes and interpreting the code, and as it runs it starts just-in-time compiling and optimizing the code paths that are taken more often. We call a JVM that isn’t optimized yet a “cold” JVM, while a JVM that is optimized is referred to as “warm”. Because Sbt is used to compile the project, you can warm up the JVM by entering the Sbt interactive shell and executing a few full compiles. In fact, you will immediately notice that your sources take considerably more time to compile the first time, and that’s indeed because the JVM is initially “cold”. The reason why it’s interesting to collect both “cold” and “warm” compile times is that when sources are compiled on a Continuous Integration (CI) server, we usually observe a “cold” compile time, while a developer compiling on their own machine will observe a “warm” compile time. There are clear productivity benefits in reducing both “cold” and “warm” compile times. After all, the less one needs to wait, the more productive one can be.
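
For instance, a simple way to observe this from the Sbt interactive shell is to run a few clean/compile cycles and watch the reported compile times drop as the JVM optimizes the hot code paths:

$ sbt
> clean
> compile
> clean
> compile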

All experiments were run on a MacBook Pro (Retina, Late 2013) with 16GB of RAM and an Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz, using Scala 2.12.1 and Java 1.8.0_112, and giving Sbt 4GB of memory (using the JVM flag -Xmx4G).

Initial State

The chart below reports the time in seconds that it takes to compile all tinbox sources (both main and test). Take a look at how compilation time improves as the JVM warms up.

We now have the coarse-grained baseline numbers we will compare our work against. Let’s start our journey by discussing how much speedup we could obtain by just using the Hydra Scala parallel compiler, without making any change to the tinbox codebase.

Evaluating Triplequote Hydra

Using Hydra on a Scala project is simple: it only requires adding the sbt-hydra plugin to project/plugins.sbt. After this small change, all of the project’s sources are compiled in parallel by Hydra using four workers. We chose four workers because modern developer machines have four physical cores; Hydra can use more cores if available.
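
For illustration, the change has roughly the following shape. The exact organization, artifact name, and version are assumptions here and must be taken from the official Hydra documentation:

// project/plugins.sbt
// hypothetical coordinates; consult Triplequote's documentation
addSbtPlugin("com.triplequote" % "sbt-hydra" % "<hydra-version>")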

The next chart visually compares the tinbox project’s compile time performance when using the vanilla Scala 2.12.1 versus Hydra.

If we compare the best full compile time with the vanilla Scala 2.12.1 compiler (64 seconds) against the best result with Hydra (24 seconds), we see that using Hydra yields a 2.66x compile time speedup with a warm JVM.

Furthermore, the cold compile time performance is considerably improved. In fact, the cold compile time when using Hydra is shorter than the warm compile time when the vanilla Scala 2.12.1 compiler is used!

After evaluating Hydra, we moved on to the second goal, which consisted of identifying areas for single-threaded compilation-speed improvements.

Improving single-threaded compilation time

To improve single-threaded compile performance, it was paramount to gain greater insight into what the Scala compiler does. We had to be able to answer questions such as:

  1. How much time does each compiler phase take?
  2. Which sources take the longest to compile?
  3. What work is the compiler doing when compiling a single source? (This is especially relevant for sources that take longer than expected to compile.)

But before going any further, let’s take a quick detour and briefly touch on the Scala compiler architecture.

The Scala Compiler Architecture

The Scala compiler is made up of many phases. Each phase takes an Abstract Syntax Tree (AST) as input and returns a new, transformed AST. To see the Scala compiler phases, just pass the flag -Xshow-phases when invoking scalac.

$ scalac -Xshow-phases
    phase name  id  description
    ----------  --  -----------
        parser   1  parse source into ASTs, perform simple desugaring
         namer   2  resolve names, attach symbols to named trees
packageobjects   3  load package objects
         typer   4  the meat and potatoes: type the trees
        patmat   5  translate match expressions
superaccessors   6  add super accessors in traits and nested classes
    extmethods   7  add extension methods for inline classes
       pickler   8  serialize symbol tables
     refchecks   9  reference/override checking, translate nested objects
       uncurry  10  uncurry, translate function values to anonymous classes
        fields  11  synthesize accessors and fields, add bitmaps for lazy vals
     tailcalls  12  replace tail calls by jumps
    specialize  13  @specialized-driven class and method specialization
 explicitouter  14  this refs to outer pointers
       erasure  15  erase types, add interfaces for traits
   posterasure  16  clean up erased inline classes
    lambdalift  17  move nested functions to top level
  constructors  18  move field definitions into constructors
       flatten  19  eliminate inner classes
         mixin  20  mixin composition
       cleanup  21  platform-specific cleanups, generate reflective calls
    delambdafy  22  remove lambdas
           jvm  23  generate JVM bytecode
      terminal  24  the last phase during a compilation run


As you can see, each Scala source has to go through 24 phases before binaries are produced. Of course, some phases take more time than others to execute. In particular, the typer phase is known to often take more than 30% of the whole compile time, as it takes care of typechecking, a fundamental operation in a statically typed language such as Scala.

Gaining Insights

We said we needed to gain visibility into what the compiler is doing, but how can we do so? The bad news is that there is little to no tooling available today that can help with this task. The good news is that Triplequote is developing dedicated tooling to address this problem. All metrics reported in this section were obtained using Triplequote tooling.
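
Although far less detailed, vanilla scalac can provide a first approximation: the -Ystatistics option (which in Scala 2.12 accepts a list of phases) prints timing and counter statistics for the selected phases. For example, to instrument typer on a single source (MySource.scala is a placeholder):

$ scalac -Ystatistics:typer MySource.scala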

The first question that needed to be answered was: How much time does each compiler phase take?

This question is interesting because the time per phase gives us a broad view of whether there might be opportunities to speed up compilation. The histogram below gives a high-level view of the time (in milliseconds) consumed by each phase when compiling the main and test sources (with a cold JVM).

The one phase to pay attention to is typer, which in both cases takes more than 34 seconds to execute. This means typechecking accounts for more than 60% of the whole compile time, which is definitely atypical. Because the tinbox project uses several type-intensive libraries, it is not entirely surprising that typechecking the sources takes time. However, it was remarkable that the typechecking time for the test sources was so long, considering they amount to fewer than 5k LOC. Hence the decision to take a closer look at the test sources.

Investigating Tests

To direct our efforts, we needed to know which test sources took the longest to typecheck. With the help of Triplequote tooling, we collected the following statistics:

Source file                          Typechecking time
ConfigsRouteSpec.scala               5617 ms
RejectionsRouteSpec.scala            3814 ms
DoobieArticleRepositorySpec.scala    3179 ms
CursorPersistenceServiceSpec.scala   2982 ms
MerchantsRouteSpec.scala             2465 ms

ConfigsRouteSpec.scala is the test source file that took the longest to compile. What’s stunning is that ConfigsRouteSpec.scala contains only 56 lines of code, for a total of two unit tests. How could such a small source take so long to typecheck?

We needed more visibility into what the Scala compiler was doing. The next table reports two insightful metrics we collected on ConfigsRouteSpec.scala:

Source file              Macro expansions    Time % spent in macro expansion
ConfigsRouteSpec.scala                       79.4%

The problem was evident: the many macro expansions were responsible for ConfigsRouteSpec.scala’s long typechecking time. To understand whether this was normal, we had to look at the code generated by the triggered macros.

Macro generated code

To see the code generated by macros, we can simply inspect the AST of ConfigsRouteSpec.scala after the typer phase (after typer, all code generated by macros is present in the AST). To print the AST after typer, we use the Scala compiler option -Xprint:typer.
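
For example, from the Sbt shell the option can be scoped to the test sources, after which the typed (and fully macro-expanded) trees are printed during compilation:

> set scalacOptions in Test += "-Xprint:typer"
> test:compile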

As expected, the amount of code generated by macros into the AST was substantial. In particular, we noticed that all macro code was injected into the following helper method:

def route(configDetails: Option[ConfigDetails]): ConfigsRoute =
  configure[ConfigsRoute](ApplicationConfig.test)

You don’t need to understand what the method does. What’s interesting is the definition of the configure method:

def configure[A](c: ApplicationConfig)(implicit r: ConfigReader[A]): A =
  r(c)

Note that configure takes an additional, implicit parameter that needs to be filled in by the compiler. The intriguing part was that the Scala compiler synthesized this value using macros instead of using an existing value in the implicit scope, and that is why the source file took so long to typecheck.

The interesting part in the implicit scope is the ConfigsRoute companion object:

object ConfigsRoute {
  implicit def reader: ConfigReader[ConfigsRoute] =
    createReader
}

As you can see, there is an implicit definition that provides ConfigReader[ConfigsRoute] instances. But why wasn’t this implicit picked up?

Before digging deeper into the problem, we tested whether passing the argument explicitly would reduce compilation time:

def route(configDetails: Option[ConfigDetails]): ConfigsRoute =
  configure[ConfigsRoute](ApplicationConfig.test)(ConfigsRoute.reader)

With this small change, the compilation time of ConfigsRouteSpec.scala dropped to 99ms, 56x faster than it was initially!

While great, the above is not an ideal solution, as no one likes to pass implicit values explicitly. In other words, we treated the symptom but not the cause. Ideally, the Scala compiler would find and use the ConfigsRoute.reader implicit value instead of synthesizing one with expensive macros. So why wasn’t the Scala compiler injecting the desired implicit value?

The answer turned out to be simple: the expensive macros used to synthesize a ConfigReader[ConfigsRoute] instance were imported into the local (lexical) scope via a package object. Because the lexical scope is searched before the implicit scope of the target type, the Scala compiler had no choice but to use these macros to create the implicit ConfigReader[ConfigsRoute] instance.
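
To make the mechanism concrete, here is a minimal, self-contained sketch of the implicit resolution at play. All names are hypothetical stand-ins for the real tinbox code: derivation.derivedReader plays the role of the expensive macro imported through the package object.

trait ConfigReader[A] { def apply(c: ApplicationConfig): A }
case class ApplicationConfig()
class ConfigsRoute

object derivation {
  // stand-in for the expensive macro-based derivation that the
  // package object brought into the lexical scope
  implicit def derivedReader[A]: ConfigReader[A] =
    new ConfigReader[A] { def apply(c: ApplicationConfig): A = ??? }
}

object ConfigsRoute {
  // the cheap, hand-written instance we want the compiler to pick
  implicit def reader: ConfigReader[ConfigsRoute] =
    new ConfigReader[ConfigsRoute] {
      def apply(c: ApplicationConfig): ConfigsRoute = new ConfigsRoute
    }
}

object Demo {
  import derivation._
  // the lexical scope is searched before the implicit scope of
  // ConfigsRoute, so this resolves to derivation.derivedReader;
  // remove the import and it resolves to ConfigsRoute.reader instead
  val picked = implicitly[ConfigReader[ConfigsRoute]]
}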

Armed with this knowledge, the solution consisted of ensuring that the previously triggered macro code was no longer accessible from ConfigsRouteSpec.scala. After this refactoring, the Scala compiler looks for an implicit ConfigReader[ConfigsRoute] value in the implicit scope of ConfigsRoute and finds ConfigsRoute.reader, as expected. Therefore, the implementation of the route method could be reverted to its original state without losing the compile time speedup.

It’s worth mentioning that while the solution to this compile time inefficiency was relatively simple, it would have been impossible to know where to focus our efforts without adequate diagnostic tooling. It’s Triplequote’s intention to integrate diagnostic tooling into Hydra, and hence automate the detection of compile time inefficiencies.

It was now time to re-run a full project compile and compare the compilation time for our current optimized state versus the initial state.

Optimized State

The chart below visually compares the tinbox project’s compile time performance prior to and after implementing the discussed code optimization.

Single-threaded compile time has improved by 17% on a cold JVM and 37% on a warm JVM.

Optimized State with Triplequote Hydra

Finally, we wanted to check that the initial speedup obtained with Hydra was still there after having optimized single-threaded compilation time. Hence, we ran the same experiments once more, but this time using Hydra.

The next chart visually compares the compile time performance when using the vanilla Scala 2.12.1 versus Hydra.

Notice how using Hydra yields a 2x compile time speedup with a warm JVM, and cold compile time performance is 33% faster when using Hydra.

This all looks very promising, and we now need to validate these good results by deploying Hydra across the team and on the Continuous Integration server. In particular, we are checking:

  • Whether we can confirm the productivity gains across the day, which involves a mix of cold/warm compilations, both full and incremental
  • Whether there is still a benefit to running Hydra on machines with only 2 cores and hyperthreading
  • Whether Hydra is robust and doesn’t break on new code structures we may introduce
  • How we can collaborate on better diagnostic tools to better understand performance bottlenecks and how they evolve as the project grows

Editor’s note: The evaluation was very successful and we are very pleased to have Zalando as a customer.

Summary

Reducing the compilation time of Scala programs can be challenging, but with the help of Triplequote we obtained a drastic speedup. Using Hydra yielded a 2.66x compile time speedup for free on the initial tinbox codebase. This is impressive, as all we had to do was add an Sbt plugin to our build.

Moreover, thanks to their expertise and advanced tooling, we were able to pinpoint compile time inefficiencies that would otherwise have gone unnoticed. By fixing single-threaded inefficiencies and using Hydra to parallelize compilation, the tinbox project now compiles 3.2x faster!