How can you use data visualizations with your build data to improve the reliability and performance of your build?
In this recorded webcast, Sam Snyder, Senior Developer with Tableau, demonstrates using Tableau with Develocity for live build-monitoring, debugging tricky performance/reliability issues, and identifying opportunities to take engineering organizations to the next level of productivity. Moderated by Hans Dockter, CEO of Gradle.
Thank you for all of the excellent questions asked during the webcast, which are answered towards the end of the recording.
Hans: Hello everyone, my name is Hans Dockter, I’m the founder and CEO of Gradle. With me is Sam Snyder, an old friend of the Gradle project, senior software engineer from Tableau, and I’m very excited about this presentation.
What Tableau is doing with data analytics around developer productivity, I haven’t seen any organization in the world that is even close to what they are doing. I think you can learn a lot from what benefits you get if you take developer productivity and the data around that seriously, and use it for prioritization and actionable things. Welcome, Sam and thanks.
Sam: Good to be here, Hans. Thank you very much to you and the rest of the Gradle team for having me. I’ve worked at Tableau Software for about four years, and I’ve spent about the last two of those years focused on delivering the best developer experience I possibly can for my peer developers at Tableau. And so in this presentation, I’ll go over some of the things we’ve learned and how you can apply those same lessons to your development organization, combining the power of Tableau’s tools for visual analytics with Develocity’s detailed collection of build telemetry.
First, we’re going to establish a little more context, and I will walk you through our journey through time and our journey through data analysis as we came to build happiness. I’ll share some challenges that we’ve encountered along the way, and the lessons we’ve learned, and strategies to mitigate those challenges. Finally, there’ll be a very brief how-to guide at the end and a question and answer session. But you can, of course, ask questions all throughout the presentation, and Hans will find a suitable point to interrupt and let me know.
Tableau Software, for those of you who might not be familiar, makes data analysis and visualization tools. It’s a bit of a cultural thing that we tend to shorten data visualization to “viz” a lot when we talk. It comes up a lot when you make data visualization software, so having a shortened term is useful. Technically, as a development organization, Tableau Desktop is probably the majority of the code, and it is substantially C++ based with some web components in the UI. There’s a little bit of Gradle orchestration there, an increasing amount, but the primary usage of Gradle at Tableau is in building Tableau Server, which is substantially a Java product.
Tableau Server is where I came from. I was a product developer working on features and testing for the Tableau Server project, and my peers and I were frustrated with how bad the developer experience was. At this point in time, two years ago or so, our build wasn’t on Gradle yet. It was a mixture of Ant and Ivy and homegrown scripts that orchestrated Ant and Ivy, because writing all of that XML was just too much of a pain. So in order to help my friends, and in order to help myself have a more fun and productive day, I set out with some other interested co-workers to migrate our build to Gradle. That immediately delivered some great gains in terms of performance and extensibility, but it was only the first step.
We had a build with some deep problems, and we accurately modeled a build with deep problems in a different tool. So it continued to accurately model substantial problems. Those of us who came to this project came to it from the background of software engineering. We were product developers first and build guys accidentally, I would say.
In terms of how to advance the state of this build, how to achieve developer productivity, the strategy that resonated the most with us was to treat the build like software, because, I mean, it is software. If you have an Ant and Ivy build or a Maven build or any other kind of build with a document format that specifies your build logic, you might be able to pretend that it isn’t software, because you’re ostensibly just writing XML in a document.
If you look at it from any other angle, it really comes away looking like software, with all of the burdens that software has, all of the weight of complexity. And if you want to manage that complexity, the discipline of the software engineer is the way to go: do the testing, do the static analysis, and, key for the purposes of this presentation, gather telemetry and metrics, analyze them, and make data-based decisions about the areas that are most important. That ended up being a key factor in why Gradle, as a build tool, was so useful to us: we could gather this telemetry, establish those metrics, and optimize them over time.
Our journey to build happiness is ongoing. It’s the kind of pursuit where the value is in the chase, in the work you do to seek the aspiration even if you never fully achieve it, or never permanently achieve a perfect state of build happiness, because those pesky developers are going to keep on adding new code and changing things despite your best efforts. The very first, most important thing in terms of developer happiness, at least the place to start, is the error rate of the build. If the build fails, if it falls over on itself, if it spits out cryptic problems that developers can’t reason about, it completely throws them off and sends them down some unrelated rabbit hole.
So our very first priority in terms of using Develocity telemetry with Tableau was to figure out which issues are the most important. When there are dozens of issues and many people crying out for help, it’s not always clear what the most important issue is. If Joe Developer’s build is down and he can’t get his work done, that’s a terrible crisis for him personally. But if everyone has problems to varying degrees, maybe he’s the only one experiencing that particular problem. You’re going to get a lot more value out of figuring out which issues affect the most developers, the most often, with the greatest severity and the least ability to mitigate.
One of the vizzes that we have made that has been the most useful to us in terms of advancing the state of developer productivity has been this data visualization that analyzes local failures. There’s a lot of color going on here, so just to explain what we’re looking at: the left side is a temporal view of all of the problems that have come up on developer machines at Tableau over the past several months. They were ostensibly synced to a state of the build, a state of the enlistment, that should be green, and some local condition, or some flakiness in the tests, or flakiness in the product, or flakiness in our build caused them to fail. On the right, we have a spatial view of the same information.
We have used this to successfully identify what the most serious problem is. It’s going to be in this upper left corner, you know, the biggest problem that affects the most people, and then fix that, and that goes away. Then you recursively pop off and go to the next problem, which over time leads you to a much better state than when you started.
Hans: One quick question. How did you measure open files? How did you filter for that?
Sam: Excellent question. We used Develocity’s build scan custom values feature to record version control system information, as we use both Perforce and Git in various places: the changelist number or the origin commit hash, in order to provide context. If someone’s build was broken, you can easily compare that to any known issues that might be going on at the time, how many files they had open for edit, or how the state of their branch differed from master. Using this information really helped us to cut through a lot of the noise. I mean, if compileJava fails because they forgot a semicolon, it should have failed. So this is to try and identify only those task failures that should not be the developer’s fault.
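[Editor’s aside] For readers who want to capture similar context, here is a minimal sketch of recording VCS details as build scan custom values from a settings script. It assumes a Git workspace and the Gradle Enterprise/Develocity plugin applied at the settings level (newer plugin versions use a `develocity` block instead of `gradleEnterprise`); the `runCommand` helper and the value names are illustrative, not Tableau’s actual setup.

```kotlin
// settings.gradle.kts - a minimal sketch, assuming the build scan plugin is applied.
import java.util.concurrent.TimeUnit

// Illustrative helper for running a short shell command and capturing its output.
fun runCommand(vararg args: String): String =
    ProcessBuilder(*args).redirectErrorStream(true).start().let { process ->
        process.waitFor(10, TimeUnit.SECONDS)
        process.inputStream.bufferedReader().readText().trim()
    }

gradleEnterprise {
    buildScan {
        background {
            // Tag every build with its VCS revision and workspace state so scans can
            // later be joined against known issues and local conditions.
            value("Git Commit", runCommand("git", "rev-parse", "HEAD"))
            value("Git Dirty Files", runCommand("git", "status", "--porcelain"))
        }
    }
}
```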
Oh, and while we’re here, you can dig into the test information and see that. And if you go through a few more scans, this one particular test happens to be the one that accounts for every single one of these failures. So that wasn’t a hard fix. It was only a couple of lines worth of change in order to make this very substantial pain point just go away. But as you can see, this doesn’t happen to everyone all the time every day. There’s a few instances per day of this now. And so we probably wouldn’t even be able to tell without Develocity that this was the biggest problem.
With a particular flaky test like this, if it fails once, the developer runs the build again and it passes; they probably grumble a little bit but don’t actually reach out to anyone to say this needs to be fixed now. So this can cause pain that normal channels of communication aren’t going to clue you in on, or at least won’t clue you in on with enough specificity to really identify and fix it. And one nice little detail is that the two views complement each other really well.
The spatial side shows things from all time, and so you might conclude that clean is a big issue, that, for some reason, clean fails. But if we look, all of the issues with clean were clustered on this one day, and, in fact, it was one developer who had a local issue with a service holding open handles on files, so that the Gradle clean task could not delete them. That points to one of the perils or pitfalls of this kind of analysis: in our pursuit of clean data to work with, data as free of noise as possible, we made ourselves a little bit susceptible to one person or a small number of people doing a lot of builds and amplifying the signal of a relatively minor failure.
In this case, there is no substitute for expertise, for domain knowledge of the build, for domain knowledge of Gradle. This has been one of the single biggest drivers for build happiness, for build reliability: find what’s in that upper left corner of the spatial distribution of problems, fix that, then fix the next one, and then the next one, and then the next one. And it’s good to take a break as you’re doing this exercise and congratulate yourself, pat yourself on the back for how far you’ve come. Because that spatial view always adds up to 100% until there are zero failures, you could be making a lot of progress but not feel that way, which is why I’ve found it to be very impactful to talk to my friends and peers about how they relate to the build, how they’re doing, and how they feel about it, since that’s what we’re ultimately here for: to improve the lives and productivity of our co-workers.
A challenge we ran into in this kind of analysis is that not all failures are co-located with the cause of that failure. Our Gradle build is 300 projects and 6,000 tasks, and any task could potentially change some local state in a way that would mess up any other task. Particularly, if you were hypothetically not an expert in writing Gradle build logic when you first started converting an entire build from Ant and Ivy to Gradle, you might end up with some malformed tasks. And you might end up with failures that, in this data source, show up as being in a test task when, actually, it was the setup task before it that didn’t do its job but returned a-OK.
This is particularly problematic or challenging within the context of an incremental build, as incremental builds, with some fraction of all total tasks running, are simply less idempotent than a build where you run everything all the time, which would obviously be much too expensive in terms of your time. But the knowledge of which task is actually causing the problem does not show up in that viz I just showed you all on its own. And this functionality to try and identify incorrect incrementalism was not built into build scans yet. We found that we really needed it in order to wrangle the bugs and wrangle the complexity of such a large and complicated build.
As a result, we built what we call the IO Detectorator, a detecting decorator, that serializes build execution while instrumenting the file system. With it we can come back with an authoritative list of every file any task reads or writes. And then we just ask Gradle what the declared inputs and outputs of this task are. If those lists don’t line up, that is, almost all of the time, an incremental correctness issue and a bug waiting to happen, and particularly a nasty non-proximate-cause bug waiting to happen.
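[Editor’s aside] The core comparison the detectorator performs can be sketched roughly like this in Gradle Kotlin DSL. The `observedReads` and `observedWrites` hooks stand in for the file-system instrumentation and are hypothetical placeholders; this is an illustration of the idea, not Tableau’s actual plug-in.

```kotlin
// init.gradle.kts - illustrative sketch of the "compare declared vs. observed IO" idea.

// Hypothetical hooks into a file-system instrumentation layer; stubbed out here.
fun observedReads(task: Task): Set<String> = emptySet()
fun observedWrites(task: Task): Set<String> = emptySet()

gradle.taskGraph.afterTask {
    // Ask Gradle for the task's declared inputs and outputs.
    val declaredInputs = inputs.files.files.map { it.canonicalPath }.toSet()
    val declaredOutputs = outputs.files.files.map { it.canonicalPath }.toSet()

    // Anything read or written that was not declared is a potential incremental-correctness bug.
    val undeclaredReads = observedReads(this) - declaredInputs
    val undeclaredWrites = observedWrites(this) - declaredOutputs

    if (undeclaredReads.isNotEmpty() || undeclaredWrites.isNotEmpty()) {
        logger.warn("[io-detectorator] $path undeclared reads=$undeclaredReads writes=$undeclaredWrites")
    }
}
```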
So let me show you what that viz looks like. Here we go. Because doing these detectorator runs slows down the build, you disable all incrementalism, you disable all parallelism, and then you slow things down further by instrumenting the file system, and a lot of file access goes on during a build like this, we run it periodically and on a one-off basis for investigations where the data leads us to suspect that this might be an issue. In the way we have this set up, you can see that genConstantsJava, in the tab native API project, happens to have undeclared reads of Python, because this task happens to invoke a Python script. And so this could be a potential bug: you could, say, update the version of Python, and this task, on some local developer machine, might not rerun. And if changing the version of Python happened to break something, well, then it’s just an opportunity for a bug later on.
As a result, we know where to start. When something comes up where a task doesn’t appear to be idempotent with respect to its inputs, this is the first place to look. And once the issues around here are corrected, it gives you a lot more confidence that failures are proximate to their causes. We have a related check for overlapping file IO, so that we can see when two tasks have contention writing to the same file, for a very similar category of error analysis. I’d love to show you how this changed over time, but because of the one-off nature of how we run these scans, we don’t have a nice temporal view for the detectorator information.
The real takeaway I had from this, and the real takeaway I had from trying to optimize our build for performance to waste as little of people’s time as possible, is that performance is always paid for in complexity. Parallelism and incrementalism are completely essential for running our build in any reasonable amount of time, and not wasting days and days building stuff that’s already been built before. But incrementalism and parallelism both make debugging problems harder. And so if you want to enjoy those performance benefits, a level of rigor around measuring and dealing with their impact is essential for maintaining correctness in the face of performance. And so, like I said, the viz I just showed you doesn’t rely on built-in features of build scans. There’s a Gradle plug-in and a way of doing CI that we ran ourselves.
I’m curious, Hans. Do you think that a deep scan feature that opts you into paying a performance hit in exchange for deeper telemetry might be a built-in part of build scans in the future?
Hans: That’s a good question. With a clear answer, yes.
Sam: Cool.
Hans: But at the same time, as you know, with all the flakiness issues, we still want to make Develocity so scalable and so efficient that most of the data is collected all the time, because otherwise you get into problems of, oh, I need to reproduce this, and then enable the deeper data collection. But there is data that would be so expensive that you cannot collect it for every build but that would still be valuable. And that would be where you would have a CI job running every 24 hours to do some deeper sanity check, so absolutely.
Then, of course, there is that other item you pointed out, strict mode. I think it points, in this case, to too much flexibility on the Gradle side. On the Gradle build tool side, you should be able to say, you know what, the build should fail if there’s any overlapping output. I think we will tackle that from two angles. But the great thing is that you were able to create your own solution, right?
Sam: Right. Without the flexibility of Gradle’s API and plug-in model, I mean, other systems might not have allowed for that kind of introspection into the domain objects of the build at all, so Gradle’s flexibility made this possible.
One request as you build this strict mode: as we saw on our viz of overlapping file I/O, we have a number of problems still to work out. So I hope that when it becomes an option, we can enable it on a project level or on a subtree of the graph, so that we could, say, turn on strict mode everywhere that is compliant, and then work on the problematic areas one by one.
Hans: Yes, so I would assume that it’s anyhow something that you can configure down to a task instance level.
Sam: That would be perfect. Thank you, Hans.
So once you’ve worked out most of the core correctness issues in your build, you might notice something new happening. You might notice people complaining to you about the performance of the build. You might notice them saying that the build was slow. This might feel frustrating or demoralizing to you and your effort to build happiness because you just put all that work into making the build as reliable as possible. But it’s actually a good thing in a certain sense because no one complains that the build is slow if it’s horribly broken. So complaints about slowness could mean that it’s better and more reliable than ever before.
Always appreciate how far you’ve come in reaching this point, and then get to work trying to fix it. Your build is going to get slower without ongoing countervailing effort. People are going to keep adding more tests, keep adding more projects.
Hans: New frameworks, new languages, new compilers.
Sam: New forms of code generation and a new form of static analysis.
Hans: Annotation processors.
Sam: And here again, Develocity data analyzed with Tableau can help you see and understand what the most important bottleneck in your build is. Let’s take a look at one of the performance visualizations that we use to keep an eye on how our incrementalism is doing. One of the features that kind of comes for free with Develocity, well, really not for free, it’s a premium product, is the remote build cache. We bought Develocity specifically for the telemetry and analysis capabilities it was going to grant us, and since that was what we wanted, one of the first things I set out to do was to try to quantify the impact of having that remote build cache, just available turnkey, on the lives of our developers.
On the left here, I put some common cacheable task types that a Java developer would recognize: Checkstyle, Java compilation, unit test execution. I’ve tracked their cache hit rate over time, and this combines the local and remote cache statistics; it doesn’t differentiate, and we make use of both forms of Gradle task output caching. Then I made an estimate of how much time this feature actually saves people by subtracting the average duration of a cache hit from the average duration of a successfully executed task, and then multiplying by the number of cache hits, in order to estimate how much time the cache has saved developers in a given week.
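[Editor’s aside] To make the arithmetic concrete, here is a small sketch of that estimate. The numbers and type names are made up for illustration; the approach is simply (average executed duration minus average cache-hit duration) times the number of cache hits.

```kotlin
// Illustrative only: estimating time saved by the build cache for one task type.
data class TaskTypeStats(
    val avgExecutedMillis: Double,   // average duration when the task actually ran
    val avgCacheHitMillis: Double,   // average duration when its output came from the cache
    val cacheHits: Long              // number of cache hits in the period
)

fun estimatedSavingsHours(stats: TaskTypeStats): Double =
    (stats.avgExecutedMillis - stats.avgCacheHitMillis) * stats.cacheHits / 3_600_000.0

fun main() {
    // e.g. unit tests averaging 40s when executed vs 0.2s as a cache hit, with 5,000 hits in a week
    val testStats = TaskTypeStats(avgExecutedMillis = 40_000.0, avgCacheHitMillis = 200.0, cacheHits = 5_000)
    println(estimatedSavingsHours(testStats))  // ~55 hours of task time saved across the org
}
```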
Here, you can see that the impact of Checkstyle caching is relatively low, Java compilation is pretty moderate, and the biggest impact in terms of time saving is actually on unit test execution, which, I think, comports with our intuition: test execution is one of the more expensive operations going on in a build. Combining all those together and showing all cacheable tasks, we estimate that the remote cache saves somewhere between eight and twelve hours a month per developer. So every developer in the organization saves hours over the course of a month just because this feature is there and available to them.
I recall one period where we updated to a new Gradle version, or changed our initialization scripts, and completely disabled publishing new artifacts, so the hit rate just gradually declined. We noticed there was a problem and went to go fix it. After that point, we started taking advantage of a feature of Tableau called data-driven alerting, where you can pick any given metric out of your workbook and say, send me an email if this drops below a particular value.
In this case, we set it so that a cache hit rate under 50% gets us emails, so we have that running now. The changes we made that caused the cache hit rate to go down we are gradually working to correct, and you can see it gently slide upwards. Caching is just one small part of the performance story for a large Gradle build; all incrementality collectively is substantially more impactful.
Here, I’ve charted a variety of different lines, and the biggest one is up-to-date: the task’s outputs were already downloaded from the cache previously, or you built it successfully before, so you don’t need to do it again. Keeping this rate as high as possible is one of the biggest drivers of the build feeling snappy and rapid, and of keeping people in that state of flow where they work on their business problem rather than on build problems. With a similar methodology to the previous slide, I’ve estimated how much time incrementalism as a whole saves people who are building. Per developer per month, we’re talking between 30 and 55 hours saved. That’s pretty cool. That’s a lot.
This also points to some of the difficulty or complexity of meaningful performance analysis, because Gradle is so highly parallelizable, and we run it in such a highly parallelized mode, that this is CPU time savings, not actual wall-clock time savings. If we go to the timeline view, it shows this kind of thing really well. You can see the extent to which tasks are run on every thread, so things do end up very highly parallelized. But we do have a few problems in terms of the existence of a long pole, which can substantially thwart your parallelization efforts. The more parallelized your build is, the less CPU time savings translate into wall-clock savings. You have to consider what the speedup ratio is for you in practice in order to convert CPU time savings into actual wall-clock time savings for the people we’re trying to help here.
Hans: I’m curious. What we are seeing with our teams is that the machinery for CI is becoming a not irrelevant cost item for companies.
Sam: Oh yeah, we noticed that too.
Hans: So is that also something you analyze?
Sam: I have not personally attempted to analyze the business impact because I’ve focused more on the engineering side. But I think that would be really interesting. In making the case to our organization to purchase Develocity, we definitely highlighted that this would, and does, save CI a lot of time. CI is one of the best cases in terms of getting a high hit rate, because we tend not to have local changes and builds overlap with each other substantially. And it’s two things.
Hans: It’s a time saving on CI, but it’s also that either things are faster or you need to pay less for your CI resources, right?
Sam: Right. One trap that it’s easy to fall into when attempting to analyze and maintain a highly performant build, if you don’t have data like this, if you don’t look at your Gradle build scans and see what the long-pole task is, is that you put a lot of time and sweat into optimizing the performance of tasks or build plug-ins that just don’t matter. If there’s one serialized execution that takes 10 minutes, nothing you do anywhere else is going to make your build any faster than 10 minutes. So knowing when you’re wasting your effort, when it’s not worth doing something, is one of the key benefits of doing this kind of analysis.
Hans: What I see in organizations is, let’s say, an organization struggling with build performance, configuration time, execution time. We went in, let’s say, and improved the situation significantly. And then we came back three or four months later and saw a significant regression. For me, it’s two things. On one hand, you need high-level analytics to see there is a problem. But then, if it is not easy to figure out the root cause, people will just surrender to it.
Those are the two things you need. You need to really see some high-level analytics that say, in this area, we have a problem. But then you also need to be able to deep dive into it to see what is the root cause for that. If either one is missing, you will not be effective. That’s my experience.
Sam: Absolutely, I completely agree, and what you say about regressions is spot on. Builds tend towards entropy and chaos over time, and you need to make a constant effort not only to realize gains but to keep the gains you’ve made in terms of reliability or performance. You either have to be watching this data, or you end up periodically spending extensive effort trying to heroically correct the issue once and for all, only to have similar categories of problems recur.
Hans: But then you’ve left a lot of money on the table for those months, right?
Sam: Right. Exactly.
Hans: With the latency until it was escalated that you had to do something about it.
Sam: Exactly. So trying to preempt that kind of thing through a good system of learning and analysis is key.
We might have noticed this workgroup VQL web project coming up as our long pole, and coming up as a big source of errors as well, if you recall the IO detectorator bits. This guy is so long and so error-prone partially because Gradle doesn’t actually build it. It just calls another build system that builds this very large project, which is idiosyncratic from a language and from a tooling perspective.
And so, without converting its whole build to Gradle, which would be a massive undertaking, we can’t use the tools that Gradle gives us, in terms of having a task that is incremental, parallel, and cacheable, to substantially speed this up. But we don’t want to just give up on what’s costing us the most time and a substantial fraction of the errors. So part of our solution to this has become taking large or problematic projects, moving them into their own build, and consuming them only as binary artifacts, which we can update or download and resolve like any other dependency.
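[Editor’s aside] In Gradle terms, that switch amounts to replacing a project dependency with an external, published coordinate. A rough sketch, with made-up coordinates and project names purely for illustration:

```kotlin
// build.gradle.kts - illustrative only; the module name, group, and version are hypothetical.
dependencies {
    // Before: the big, slow, error-prone project is built inside this multi-project build.
    // implementation(project(":workgroup-vqlweb"))

    // After: the project lives in its own build and publishes to an internal repository,
    // so consumers resolve it like any other binary dependency.
    implementation("com.example.internal:workgroup-vqlweb:1.2.3")
}
```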
You can go even further with the most complicated areas of your build, and the most complicated areas of your product really, by trying to modularize them. This style of building and organizing your software has been exploding in popularity within Tableau. We still have, effectively, a large multi-project monolith for C++ and for Java, for Desktop and Server. But since we added the capability internally to do what is sometimes called Netflix- or Amazon-style development, many modules that come together into independently shippable services or subcomponents, we have found a great deal of hunger for, and a great deal of adoption of, this style of development internally.
If I pop over here, here’s a little chart I have from my peer team that works on facilitating modular development. It shows language adoption over time, broken down by which Gradle plug-in is applied to a module. And you can see this has been incredibly popular right from the get-go when it comes to Java, and C++, Python, and TypeScript projects have also grown substantially.
I will point out that this Java bar includes all JVM languages; it’s just the way we did this analysis, which didn’t notice if you applied both the Java plug-in and the Kotlin plug-in, for example. We are seeing increasing usage and internal popularity of Kotlin, which is my personal favorite language at the moment. So I’d love to see Gradle expanding the Kotlin DSL.
Hans: Coming soon. Gradle 5.0
Sam: Oh, I’m looking forward to it. I’m looking forward to it. And you can see that over the course of only a few months, we had thousands more projects created that use this style. And I largely attribute the fact that our server build hasn’t become a complete mess to this: we were able to provide the capability for developers to organize their software in modules if they so choose.
A year ago, when I gave a presentation at the Gradle conference, we had about 300 projects in the Java server monolith. And today, we still have about 300 projects in the Java server monolith, because a lot of the growth has gone here. That has really helped to keep the performance and reliability characteristics trending in the right direction without having to arbitrarily scale the size of our build team.
Hans: Are you doing any kind of analytics activity per project? To show, OK, this is becoming more legacy code, the change rate is reducing, so there is less need to modularize that part of the codebase.
Sam: Yes, yes, absolutely. That correlates very strongly with how many failures there are in an area. These things end up being very closely related.
Hans: Cool.
Sam: You get something working and you just leave it alone until someone messes with it again. It generally kind of continues to work.
The developers who have been able to work in this style made commits to the code base about three times as often as those who work in the monolithic style (we have an analysis that shows this; the workbook I just showed you merges information from Perforce, Develocity, and GitLab, our internal Git repository). That’s because having that more self-contained project to build means you have fewer dependencies. You have the same number of binary dependencies, but fewer things that can mess you up in the moment.
Hans: And the feedback cycle is faster.
Sam: Yeah.
Hans: So compared to the monolith, you pay a lower price for getting the feedback from the build.
Sam: Yeah, absolutely. It shortens the iteration time, and it helps keep you in that state of flow when you’re developing that way. The human mind can really only handle so much complexity; no matter how intelligent you are, no matter how expert you are in your field, you can only keep so much in your head at once. And a modular style of development is better at fitting code into a single human mind.
We have found this to be a key strategy for our ongoing journey to build happiness. The monolith is very much a push model for changes: you make a change in the code and it affects everything instantly, which gives you great consistency guarantees. If something works or is broken, it works or is broken everywhere. Modules make a trade-off in terms of consistency in order to adopt a pull model: no, I’m not taking your change into my service until it passes my integration tests and is stable enough. It gives you more control over the things or services you own in exchange for possible inconsistency.
Hans: Yes, and that’s how I see it. For me, continuous integration is a continuum. What you’re trading in is that you get a little bit later integration, and you have to pay attention that the version ranges do not become completely unmanageable. At the same time, you get more stability for the team so that they can have a higher feature flow. So you trade continuous integration latency for higher feature flow. And I think people are able to model this now. I think that’s very important, because everything in one repo doesn’t scale above a certain organizational size and organizational structure. That’s our experience.
Sam: Yeah, unless you’re willing and able to make Google levels of investment, and making a monolithic build.
Hans: But even then, you need a culture and a homogeneity of the engineering teams around that. So there are still a lot of organizational qualities you then also need to have. I would say there’s hardly a company in the world that has that.
Sam: Yeah, I would tend to agree. We’re still trying to find the right balance for us internally. And I’m sure that the right balance, what the most productive and effective way of organizing our software is, will continue to change and evolve over time.
Hans: One question here. With the monolith, obviously, everyone is complaining about build time, I’m sure. It was an emotionally strong pain for the developers. Now, in a modular, multi-repo world, let’s say, the build is now 1 minute and 40 seconds.
Sam: Yup, at most.
Hans: They will not complain anymore. They’re not saying, oh, it’s way too long. But still, because you now have many more builds across many more repos, if you get each one down to one minute, the business impact is tremendous, but it’s no longer driven by developers knocking on your door saying, hey, we don’t want to be in this situation; it has to be driven by data and numbers. So would you say that all of the performance optimizations are as important and valuable, from a business-impact perspective, in a multi-repo environment compared to a mono-repo environment?
Sam: I think I would say that because of how Gradle’s dependency resolution features work, and how you guys have put a lot of work into making resolution of binary dependencies fast, performant, parallel, and extremely reliable, durations are cut way down in the modular context. There are still challenges when you need to bring it all together into a functioning piece of software: regardless of how you’ve organized your modules or your monoliths, when it comes time to run the heavy-hitting integration tests, it’s time to run the heavy-hitting integration tests. And so performance still matters.
I think it might be accurate to say that pain or frustration is conserved; you just get to choose to allocate it differently, to different people or at different times. And hopefully you allocate the pain or frustration to the people who are domain experts, who are knowledgeable and motivated and own a particular area, and who are then well equipped to deal with it. Whereas in a monolithic context, maybe my change breaks someone I’ve never met on the other side of the company, causes them a lot of hassle, and there’s a long cycle time to bring it back to me so that I can undo the damage. So, in our ongoing journey to build happiness, we’ve encountered a lot of challenges and learned a lot of lessons, and I think I can help you jump-start your own journey by sharing some of them.
One challenge that’s very relevant to analysis of this kind of data, and that doesn’t come up as much when just interacting with the build as a user, is the heterogeneity of builds. You can call it a build when you build a single project out of the 300-project DAG, or you can call it a build when you build every single one. It could be fully incremental or fully non-incremental. It could have to download dependencies, or they might already be there on your local hard drive.
So we found it very difficult to compare builds to builds for a lot of our analysis, because one build can be so very different from another. As a result, I either had to throw out a lot of builds in order to be comparing only apples to apples, or focus on tasks over projects.
If a task failure is a build failure, and if tasks are slow then the build is slow, then tasks tend to be an actionable level of detail. We found that focusing on tasks over projects, and on tasks over builds, was the right level from a debugging and from an optimizing perspective. And again, this is only possible if the tasks are idempotent, if their problems are self-contained. So we were only able to achieve this because we had the IO detectorator functionality to get the tasks into that good state.
Hans: I want to add one more thing. Comparing builds, even of the same project, turns into apples and oranges so fast. On the roadmap for Develocity, that is one of the things we want to provide: build comparison that compares apples with apples and understands all those things. Because even when you go down to the task level, you have an incremental compile task where one build compiles 30 changed source files and the other compiles two. So it is high on our roadmap to filter that out and give you a real comparison. It is really hard to compare two builds. I think yours is absolutely the right approach, but even that is sometimes too coarse-grained, as you know, especially with incrementalism.
Sam: Absolutely. You can do some clever things with level-of-detail calculations or table calculations in Tableau to, for example, establish an overall metric of incrementalism for a particular build, then bucketize similar builds into histograms and filter by that set. So it’s still possible to wrangle; it’s just a lot of effort compared to just focusing on the task.
Another, related challenge is just the complexity of the domain. In 6,000 tasks, we’ve got at least half a dozen distinct forms of code generation, several programming languages, and dependencies that cross the managed/native boundary, and there has been no replacement for expert knowledge of the different areas.
You can have the chart, but if you don’t have knowledge of the specific areas of the build, or access to the people who do have that knowledge, it can be very hard to dig in in a meaningful way, and you can be distracted by noise. So you need expertise when working on a domain like this, both in how Gradle’s model of the build works, I mean, you’re not going to get very far with Gradle if you don’t understand the difference between the configuration and execution phases, and in your own build: you’re not going to get very far applying it if you don’t know where the pieces of your build come from and how they relate to each other.
I wanted, and tried for a while, to find a simple metric for overall build health, something where I could say to management or to my peers, the build got 10% healthier because of this work we did. I tried to do that for a long time, but I found that this metric, in my mind at least, needed to have several properties that were difficult to realize. If people come in and build a lot more, like you might have seen on the chart of local failures, those spikes, is the build getting worse in a sine wave? No, those are just the workdays of the week. People don’t build as much on the weekends, so there aren’t as many failures.
You’d want an overall metric of health to be insensitive to people just building more because they’re busy, or building less. And since developers know their own scenarios, you’d want it to be insensitive to minor details like what subset of the graph you’re trying to build. That turned out to be very difficult to come up with. I was not ultimately successful, and maybe one of you watching will figure it out; then we’d know how to come up with a really simple and sane metric for overall build health.
Instead of trying to come up with this, we focused on the failures. And more importantly, we focused on the people, because that’s what we’re here for. That’s what I find personally gratifying about working on the build. If I fix a problem with the build, I fix a problem for my colleague, Jill, or my friend, Matt. It makes a difference to the days and the lives of people who I know and care about personally and professionally.
Hans: I have a question. One of the problems that I see is the friction that builds up between developers and build engineers, the developer productivity people. Without the data, there’s not much accountability. A funny example for me is the developer who adds a new annotation processor, purely the developer’s choice, and it extremely slows down the build, and then they complain to the build engineers that the build got slower. And it is not evil by intent; they just couldn’t connect the dots. So how do you deal with this kind of change management, where developers make choices that affect the build time but they’re not aware of it? How does that work?
Sam: Yeah, yeah, absolutely. As you do these kinds of performance analyses, by task execution time, or configuration time, or whatever it may be, you can often trace a particular performance regression back to a particular change. And you don’t want to be in the business of telling your friends and colleagues, no, that tool that you, the expert, decided was just the right fit for your domain, I’m going to be the fun police and the productivity police and say you can’t use that valuable thing. But it does give you a starting point for a conversation.
Hans: So that they can make the trade-off. Oh, I understand the impact now; maybe it’s not worth it, or maybe it is worth it. It’s an organizational solution to the problem.
Sam: Yes, it’s the organizational solution to the problem, and everyone wants their build to be fast and reliable. If you show them that this annotation processor, or this plug-in you brought in, is problematic, then everyone’s going to want to work with you on fixing that aspect of your shared experience. Once you realize what the issue is, then you can talk about what to do about it, and try to preserve the value they’re getting from it with as little impact as possible: constrain the plug-in to only particular CI runs, or to particular projects, or put something in a module. These strategies can help substantially. Another challenge is noise.
CI environments are set up one way. Developer environments can be set up any number of ways, and you want everything to always just work, and that can be very, very challenging, both from an analysis point of view and from a bug-fixing point of view, because builds are so different, and build environments are so different from one another. It’s convenient that Develocity captures information like what operating system a build runs on and some information about CPU resources, because when you see that performance got worse in a build, was it just because someone was running 20 servers and data clusters and virtual machines on their computer?
The noise that you get from the variety of configurations and contexts that developers operate in is a continual battle, because it always changes over time. You can try to identify canonically supported scenarios and supported configurations; this helps you deal with the noise. But as you saw on the reliability viz, one person who has some files locked can cause a noticeable number of failures all on their own. And as you try to pare down this ocean of information you’re swimming in, you can make yourself susceptible to the opposite, inverse kind of noise problem.
If you identify some supported configurations or canonical scenarios, and those aren’t what people actually use or what actually matters to people, you can put a lot of effort into working on things that just don’t match up with the reality of how development is conducted in your organization. I’d say there’s absolutely no replacement for having relationships with your peer developers, and no replacement for their knowledge or for your knowledge, and trying to continually track how development actually happens rather than how you might imagine it to happen.
Hans: Are you using the data also to communicate internally? Is there something like a monthly build newsletter? This is what went well, this is where we have a regression?
Sam: Yeah, our engineering and services platform organization definitely sends out that kind of information, which frequently includes graphs, charts, and visualizations of this data. And it’s relevant to strategic decision-making as well. This whole workbook is about the adoption of this modular way of doing things, and you can use this kind of information to make personnel allocations, budgeting decisions, and strategic investments.
You can say, look, developers are really into this, this just matters to people, it’s a big deal, look at that growth rate; we need people on this, or resources for this. And it started to come up in one-pagers and problem statements and work planning: hey, this area is really problematic, and we can quantify how much. Before, we had a feeling that the workgroup project was a pain. Now we know exactly how much pain: we can say it accounts for 10% or 15% of all the errors experienced.
Hans: I’m just curious. Did you ever look at something like the Windows file system being a little bit slower than the Linux file system? Did you ever do experiments with that, for example, on how it affects the build time?
Sam: Linux is noticeably faster than Windows for running a build like this. Given that we have to support both, it didn’t seem relevant to me beyond being, well, cute. Tableau Desktop ships on Windows and Mac, and Tableau Server ships on Linux and Windows. So it has to work everywhere; it has to be good everywhere. I could say, I enjoy using Linux, look how much faster my build is than yours. Yeah, you can, but it doesn’t change the fact that at the end of the day, it has to work well for everyone.
Then one of the biggest challenges is that you’re never done. This has been a recurring theme of this presentation: if you want your developer experience to be world-class, if you want it to be a competitive advantage relative to other shops in your industry, relative to your competitors, then you never get to stop doing this kind of thing. So long as the software is changing, so long as people are adding to it, they’re going to keep slowing it down, and they’re going to keep adding problems. If you want your developers to be as effective and as happy as they possibly can be, this is a continual investment. And as someone working on the build, you need to be receptive to change, because the needs will change.
That frustrating annotation processor or static analysis tool that slows things down might gradually become just absolutely core to how the product operates. And so you need to be able to be flexible and accommodate that and find strategies. And the key to this is that this is the condition of life for everyone, whether you’re deliberate about it or not. I mean, if you don’t collect the telemetry, if you don’t do the analysis, you’re still going to march forward into the future. You’re just going to do so blindly. You’re going to be on a random walk of the possibility space of how your build and developer productivity might evolve. And if you randomly walk to build happiness, well, that’s great but it might not have been that likely.
Let’s talk very briefly about how you could set this up for your organization. Obviously you need Develocity; the data export APIs are only available there. Are you going to add that to the hosted offering anytime soon, do you think, the data export API?
Hans: We’re not sure yet when.
Sam: Fair enough.
Hans: It will come. We’re planning it for sometime, I guess, next year.
Sam: Cool.
Hans: So yeah.
Sam: Yeah, so you need Develocity, and then you need Tableau, obviously. And you can use a web data connector, which a colleague of mine worked on as a hackathon project, that goes and talks to the data export APIs, extracts the data you ask it to, and puts it in the columnar, fully denormalized format that Tableau tends to work with. And if this sets your computer on fire, I’m sorry, but I’m not liable, so use it at your own risk.
I will be watching this for pull requests and such, and you’re absolutely free to fork it, use it commercially, or use it however you like, and to adapt it to pick out the data from build scans that feels most relevant and salient to you. There’s also the full Tableau documentation on the web data connector APIs and interfaces available to look at.
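[Editor’s aside] For a sense of what consuming the export API looks like without any connector at all, here is a rough sketch in plain Kotlin/JVM. The host, the endpoint path, and the lack of authentication are assumptions (the export API is a server-sent-events stream of build records whose exact path and auth depend on your server version), so treat this purely as an illustration of the shape of the integration.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

fun main() {
    val server = "https://develocity.example.com"                   // hypothetical host
    val since = System.currentTimeMillis() - 24 * 60 * 60 * 1000L   // builds from the last 24 hours

    // Assumed endpoint shape; check your server's export API documentation for the real path and version.
    val request = HttpRequest.newBuilder()
        .uri(URI.create("$server/build-export/v1/builds/since/$since"))
        .header("Accept", "text/event-stream")
        .build()

    val response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofLines())

    // Each SSE "data:" line carries one build record; feed these into your ETL step
    // (for example, write them out for Tableau, or insert them into Postgres).
    response.body()
        .filter { it.startsWith("data:") }
        .forEach { println(it.removePrefix("data:").trim()) }
}
```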
The alternative, though I recommend the web data connector option if you can get it to work, is ge-export, which is a little tool that Gradle helped work on with me early on in our investigation of this kind of project. It first dumps the data from the export APIs to Postgres, which Tableau knows how to talk to natively. That comes at the cost of having an intermediary server that you have to prop up. If you already have Tableau Server, it is perfectly willing and happy to store the data extract, so you don’t need an intermediary server; and if you’re only working on Desktop, you don’t need that intermediary server either. You just need the extract.
Added to that, we’ve used this as part of our ETL pipeline for so long, and it has been useful, but it did add a lot of hassle to keep it working, to keep the CI runs that perform the export in good condition, and to prop up the machine and the resources to do it. If you can avoid using this intermediary service, this intermediary database, to do your analysis, I do recommend that. And that’s all I have.
Hans: Awesome
Sam: That’s been my presentation.
Hans: Yeah, we have a couple of questions. There was an earlier question.
How did you get the data from Develocity into Tableau?
We just answered that.
Sam: Feel free to clarify that if my explanation of how to get it into Tableau was unclear. We can revisit that segment.
Hans: We just mentioned that the export API is an event-based API, and for the purpose of getting data into Tableau, the API is not optimized for that scenario. There’s more you can do with the export API in terms of listening to and just watching certain events.
Sam: Yeah, it suits a scripted program, I guess, or if you want to make a cool little web service that’s charting in real time. For doing substantial amounts of analytics, that interface is not, on its own, the most conducive.
Hans:
What kind of build culture, team culture do you feel is important to be successful with this kind of approach? How easy or hard is it to drive adoption or appreciation of this approach in the development team?
Sam: Absolutely. So for me, it was extremely easy. I can’t promise that it will be for your organization. We work at a data visualization and analysis company, so if I’d said to someone, hey, we should use data to make better decisions, and they’d said, “No, I don’t want to,” the response might have been, “You might not be the right fit for our organization.” So at least within Tableau, everyone immediately recognized how valuable this was and how much we wanted to do this. And so, yeah, that was an easy sell for us.
I suppose I would say, if people are resistant to fact- or evidence-based investigation of the state of your build, or just believe that they know it so well already that there’s nothing further analytics could add to their knowledge, well, the proof is in the pudding. Show them. You can download a free trial of Tableau, you can download a trial of Develocity, then do some of this analysis and ask, “What are the top 10 reliability issues in the build?” If their answer matches up with what the data actually tells you, that person is a valuable expert; keep paying attention to them. But more than likely, it will not match up, or not match up fully. I am very confident that something will surprise even the experts.
Hans: Yeah, we see this all the time. Oh, why does the build of this Spring Boot microservice take 20 minutes? Oh, someone works from home and is using a VPN to a shared network drive, which is the output directory. We have seen that, and people were not aware that anyone was doing those things. Or, oh really, garbage collection is 80% of that build time? You could tell so many surprising stories, I’m sure, that no one here would expect, and we see this all over the place. So, good question here.
Sam, what is your vision for what you want to do next? How far do you think you can take this? Where do you see that you eventually want to take this capability?
Sam: Yeah, excellent question. Where does this go? So where I want to take it is, ultimately, the goal is to have the most productive, happiest development organization you possibly can. That’s what I want. To track down reliability problems, track down performance problems, to track down any impediment between my friends and peers, and doing their best possible work with the least possible hassle.
So I foresee this kind of recursive problem-solving approach, fix the worst thing, and then the next worst thing, and the next worst thing, forever, as being an ongoing and winning strategy. Although, even with a winning strategy, you’re never done, like I said. You always keep focusing on it, until such time as it’s not an issue and no one’s changing anything anymore.
Hans: And in the old days, there was the dream of the build engineers. It was, Hey, we just want to have a stable build to ship stuff to operations
Sam: And we’re going to finish it and we’ll ship software every three years.
Hans: And we don’t care about what you want to do. So that was the culture in the old days, which frustrated developers to a huge degree. Your service perspective, that you’re serving the developers, is absolutely what we need, but we also need data for accountability. The data is so important culturally to show, hey, we have an impact, we care. But there are also certain areas you are accountable for; maybe you need to understand how to better isolate your unit tests from your integration tests, which take so long.
Sam: This is a struggle we had and did go through. One thing I will add to the recursive problem-solving and recursive performance analysis is that a major goal is to be flexible. When a new tool or new methodology comes out and people want to use it, we want to be able to facilitate that as much as possible. It would be such a shame for someone to come and say, hey, I love Kotlin, I want to use it in the project, or, hey, we have this issue in this part of our native code, let’s adopt Rust and its memory safety model, and for you to go, oh, that’s really awkward, I’d have to go do work to support that, could you just keep using the old tools forever? It would be such a shame if it ever came to the point where I had to say that.
Hans: Yeah but there are still too many organizations where you have those discussions, and it’s about, ugh, it’s work. Yeah, well, it’s your job, right?
Sam: We’re going to be working on this anyway so get it done.
Hans: Right. So one thing, there’s a good question here and I’m also curious about that.
So has Develocity with the help of Tableau been able to use the data to establish some best practices for developers?
So did some of the data inform that? Let’s say, one example would be people who don’t do a pre-commit build, but you could see from the build data that they do some builds with a dirty working copy and then just say, CI will figure it out. So things where you could say, hey guys, you should do a build before you commit.
Sam: Yeah, for that specific problem we actually implemented a gated check-in system, where any change that goes in must pass a suite of unit tests, integration tests, and selected faster end-to-end tests before it is actually committed to the source repository, and that was huge for improving our overall build reliability.
Hans: And how do you implement that gate?
Sam: We use TeamCity for continuous integration, and we made a little web service that stages incoming changes and kicks off the continuous integration runs. And if they look good, then it calls out to Perforce.
Hans: So it’s the delayed commit feature, right?
Sam: Exactly. So that was very impactful. Can you remind me of the question again?
Hans: So how has the data helped establish best practices for developer teams? Did you learn anything about best practices that could be improved amongst the developers? Or was that never the focus? Was it more like, we need to get the machinery right, and that’s our focus?
Sam: I’ve definitely seen it as a facilitation type of thing: making it possible for people to do their best work with the flexibility and reliability they need. So I have rarely taken this data to try and directly influence product architecture, as the needs of the product and the needs of the business are what should really dictate that. But the shape of the build is kind of a scaffolding for the software project you’re working on; it naturally mirrors the shape of the thing you’re building with it.
So if there is something pathological about the build, it might also be something pathological about the product architecture. A big, complicated, error-prone chunk to build is very often also a big, complicated, error-prone chunk to run. And so this data can kind of help you see that. It’s not a perfect correlation or anything, if you compare where your bugs come from to where your build problems come from, but there is definitely overlap on that Venn diagram.
Hans: Two more questions from my side. Are you collecting or would you like to also collect data directly from the IDE?
Sam: Yes, we would. I mean, developers spend a lot of their time in an IDE, and different people have different workflows and different relationships to the command-line build. I’ve talked to some people who are on one extreme end of the spectrum: I don’t care how fast the command-line build is, I load it up in IntelliJ and IntelliJ builds it plenty fast. And then other people are like, yeah, I compile everything on the command line as that’s the canonical representation, so the IDE still has to be fast, I still live there, but I care a lot more about the command-line performance.
Hans: So on the one hand, especially for test execution, that’s super-valuable data. And what we see with larger reports is that the IDE is often the bottleneck itself: long re-indexing, memory problems. I’m asking because collecting data directly from the IDE is something we’re also planning for Develocity.
Sam: I’m very glad to hear that. I mean, our build stretches Eclipse and IntelliJ to the seams. It’s a frequent source of frustration amongst people: IntelliJ goes into some reindexing and, well, I’m done for 30 minutes, I can’t do any more work. And I don’t have any insight into what it’s doing or why that slows it down so much. If I go to the command line, I can build the entire project in less time than this is going to take, so what is it doing? And then I have to allocate more memory to it so that it can fit the entire representation of the build. One thing I’ve been hoping might help with that, although it would require a certain amount of re-architecture of the build, is if I could partition the build into a set of smaller composites; then it would be possible to load up fewer of them at once and limit the scope.
Hans: And that is also a Gradle build tool thing. It’s something we’re working on with some of the new features. We don’t have an out-of-the-box solution yet for that, so it’s good to know that this is something we see a lot of.
Sam: It’s definitely a recurring pain point. It would be great if you could say to your IDE, I want to load up this project and all of its transitive dependencies, and that’s it. So you’re working on some subset of the whole graph; you only need your subset.
Hans: And with composite builds, we made a step in that direction, but it doesn’t have this out-of-the-box thing.
Sam: Or you have to re-architect the build to break it into composites, and then you still might not have the granularity you want.
Hans: Yeah.
So your TypeScript developers, what build system are they using for TypeScript?
Sam: There’s a mixture, a couple of things. They use gulp, and Yarn is used for dependency management. And I think, in some cases, they’re directly calling node commands to invoke webpack and to invoke Yarn to do dependency resolution. It kind of varies by project. Everyone was dumping gulp for a while; it seems they kind of put that aside and went back to more raw npm.
Hans: Right. And is that something you have custom solutions to collect data? Or do you see now, hmm, we don’t have the same level of insight and we’re paying a price for that because it’s less stable than, let’s say, Java and the C++ builds?
Sam: Yeah, it definitely is. I mean, we have a Gradle plug-in that wraps these projects, but it just invokes the tools native to that land. And it has been a recurring source of frustration that Yarn, in particular, is a little bit unreliable in how it performs, particularly in a context it was never expected to run in: a large, multi-project, highly parallelized build.
We had to, at the Gradle level, serialize all Yarn invocations across the entire project, so we just never call it twice at the same time, in order to keep it from corrupting its own caches. Hopefully, it’s gotten better since we had to take that measure. But wrapping non-Gradle-native build tools has had varying degrees of success. We’ve always gotten it working, but it does reduce the build’s susceptibility to analysis.
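[Editor’s aside] For readers wondering how that kind of serialization can be expressed today: one option in current Gradle is a shared build service limited to a single concurrent user, which the Yarn-invoking tasks then declare. This is a sketch of that approach under the assumption that those tasks follow a recognizable naming convention; it is not the mechanism Tableau used at the time.

```kotlin
// build.gradle.kts (root project) - illustrative sketch, not Tableau's original implementation.
import org.gradle.api.services.BuildService
import org.gradle.api.services.BuildServiceParameters

// A no-op service whose only job is to act as a cross-project concurrency limit.
abstract class YarnLock : BuildService<BuildServiceParameters.None>

val yarnLock = gradle.sharedServices.registerIfAbsent("yarnLock", YarnLock::class) {
    maxParallelUsages.set(1)  // at most one Yarn invocation at a time across the whole build
}

allprojects {
    tasks.matching { it.name.startsWith("yarn") }.configureEach {  // hypothetical naming convention
        usesService(yarnLock)
    }
}
```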
Hans: The root cause-effect is no longer there.
Sam: If we’re just shelling out to some other build tool, whatever the case may be, whatever compiler or make tool it might be, then we’re kind of right back where we started with the server build, where we wanted to see not just that the build failed but which part failed.
Hans: Exactly. And it is on our roadmap, especially in the JavaScript area, to support the native toolchain, because that’s just what JavaScript people want to use. You cannot force them, and you shouldn’t. You shouldn’t say, oh, you should use the Gradle JavaScript plug-in, the Gradle build tool. No, they want to use those tools. So that’s also our approach with Develocity: let’s embrace that toolchain and provide the same level of insight that we provide for using the Gradle build tool natively.
Sam: Yeah, and we tend to wrap those other language projects with Gradle so that developers who are working on the project can mostly build with their own native tools, and the build orchestration just happens to also call those tools via Gradle.
Hans: Awesome. Thank you so much. It was excellent.
Sam: Thanks, Hans. Pleasure being here. Hope everyone enjoyed my presentation and found it to be valuable. Thank you for your time.
Hans: Thank you. Bye.