Gerrit User Summit: Jenkins forever

Posted on October 24, 2017 by Git and Gerrit Code Review for the Enterprise

This week we are going to publish a talk from the Gerrit User Summit 2017 about Gerrit and Jenkins used together. It is a real-life story on how to set up a CI/CD pipeline for a massive traffic OpenSource project such as Gerrit Code Review and the learnings of how to manage the storage and consumption of the Jenkins build logs and the associated meta-data.

Even if you are not a Gerrit Code Review user, the learnings of this talk are going to be exciting and useful for any high load CI/CD pipeline project with Jenkins.

GerritForge: Gerrit Code Review and Jenkins expertise

I am part of GerritForge, a London-based limited company not specialized in Gerrit, as the name would tell, but also on Jenkins, Continuous Integration and Delivery. Why don’t we use our skills to serve the Gerrit Code Review project? A couple of years ago the project did not have an official CI yet, so we said: “why not help the project and set up an official pipeline to verify all the incoming Gerrit changes to the Gerrit Code Review project itself?”

We then created https://gerrit-ci.gerritforge.com and, as you can see, it is nowadays a jam-packed CI system. We have been running a Hackathon over the weekend, and now, even while people in this room are following this talk, new changes are produced, and reviews are getting pushed to Gerrit, and that keeps our CI busy all the times.

Screen Shot 2017-10-24 at 22.51.04.png

We have a lot of slaves, some of them are provided for free by Google and others are paid by GerritForge. We have been running this service for the last couple of years, and even non-contributors to the Gerrit project like most of you guys are possibly using it for downloading some useful artifacts such as the Gerrit plugins. Additionally, if you want to download and demo the latest and greatest version of Gerrit master, as we just did with some of you before lunch, you can use the Gerrit artifacts on Gerrit-CI instead of building it yourself on your local box.

Gerrit-CI pipeline walkthrough

Let’s have a look at how Gerrit-CI works. You can log in with your GitHub credentials, and then trigger builds for your Gerrit Code Review contribution using a job called “Gerrit verifier change”. That is the most important job of the pipeline and it verifies every single change we make on the Gerrit Code Review project.

How can you manually trigger the build and verification of a change in Gerrit? You navigate to https://gerrit-ci.gerritforge.com/job/Gerrit-verifier-change/ and click on “Build with parameters” link. You enter your change number and then click “Build”: it is straightforward.

What this job does is triggering a workflow developed in Groovy language which it will provide at the end a series of feedback messages to Gerrit. When you go to https://gerrit-review.googlesource.com and list of open changes, you will notice that some of them by one guy that is called “GerritForge CI”. That means that our CI works, yeah!

At a certain point in time, someone in the Gerrit mailing list said: “Houston, we have a problem, we are too productive! We have produced so many changes and patch sets that the time you finish to build a change, we have already produced other 300 patch sets on that job and the build logs get lost”.

The Gerrit change verifier workflow

Let’s go back for a moment to review how the workflow that we came up with works. It does not rely on the Gerrit Trigger plugin, the de-facto out-of-the-box Gerrit/Jenkins integration that most of the people use, but rather on a complete “new thing” that we have built ad-hoc for our purpose.
We couldn’t use the Gerrit Trigger plugin because of two reasons:

Google data-centers do not allow incoming SSH connections
SSH stream event channel would not have been good enough for us, because of the parallelism needed.

The way that our workflow works if very simple.
The verifier flow requests the list of changes that need verification by leveraging Gerrit query language which allows you to search through most of the fields of changes using a Lucene syntax. For each change that needs checking, a corresponding number of parallel jobs are triggered. This parallelism is potentially unlimited; the only limit is the number of machines that Google can assign to the Gerrit-CI, if he can allocate one hundred, we will be able to perform hundreds of parallel changes verifications.

Screen Shot 2017-10-24 at 22.53.30.png

That means that we can produce a lot of verification jobs at the same time. Bear in mind that for every change we do not trigger just one build: we have NoteDb vs. ReviewDb verification, PolyGerrit UX tests, Code-Style check there was a moment in time where a single change needed up to 6 parallel builds! That resulted into a lot of builds, which, as long as you’ve got enough horsepower in the slaves, it was working fine us.

We do not send feedback to Gerrit for every single build, but we rather have a “Gerrit Verifier Change” job coordinating the workflow and makes a decision accordingly. The criteria are the number of failed builds, the build retries for flaky builds. At the end of the process, all build results are collected and a unique coordinated feedback to send back to Gerrit as a unique verification message.

Too many logs for Jenkins lead to a 404 page.

This is all good, but as we said earlier: “Houston, we have a problem, we are too productive!”.
Here are some numbers of our productivity:

300 jobs
170,000 builds
4.8 millions of jar artifacts produced
1.7 billions of lines of logs

And of course, we want to send a link to the build logs we want to give context to the change failure or success. Unfortunately, happened to trace in Gerrit changes some nice links pointing to a quite unpleasant 404 page in Jenkins.

Why did it happen? We have a lot of contributors that generated lots of commit traffic and thus many build runs. There is a policy in Jenkins to remove “old” builds and thus happened that we lost build logs of active changes under review.

Screen Shot 2017-10-24 at 22.59.39.png

Q. (Han-Wen Nienhuys – Google) At Google internal build system we also this kind of numbers but of course with more zeros at the end, but actually we throw away our logs, and if you build binaries, they are very large.

In the beginning, we tried to keep more stuff online in Jenkins but people started saying “Luca, we have a much bigger problem now: gerrit-ci.gerritforge.com doesn’t respond anymore. When I open the Jenkins home page, it takes a very long time and eventually times out.”

That is caused by Jenkins design which is problematic when the number of logs increases considerably: everything is stored as a file and there is no efficient indexing for discovering the data on the filesystem. Additionally, if your company does not have a large infrastructure, your disk space is limited anyway. At GerritForge the Jenkins master has only 8TBytes of disk space, and we don’t have available a system with PetaBytes or more.

Keeping Jenkins logs forever

I made the Gerrit Contributors’ Community aware of the problem and I asked: do we like that? If you think about it, logs are not rubbish. Logs are of immense value, logs are like your money, and analyzing them, crunching and understanding them is our daily job. The timestamps in the logs are like precious diamonds because they tell you that you may have made a mistake in your code and some parts of your pipeline execution start taking a lot more time than before.

When you remove the “old” logs, you make much more difficult to investigate on a failed verification build: the link attached to the change verification message points to a page that returns a 404. That’s not a bug in Jenkins; it’s a feature of removing old logs and keeping the master instance fast and healthy. But actually, it is a real functionality gap because Jenkins doesn’t know yet how to manage logs archiving.

Then I asked the Community: “For how long you want your logs to be retained?” because I needed to raise a PO for a much bigger machine. “One day, one week, one month?” and the answer I got was “Forever!”

If you think about it carefully, the answer is correct. You may not need all those logs at the moment, but in a month’s time, you may need to crunch some data to extract features or metrics. Additionally, getting rid of all logs means generating broken links in my past reviews, which could be an audit requirement stored with Gerrit changes.

Sending Jenkins data to a Logstash appender

It was about time for me to think about a solution and here is a description of what I have done.

First of all, I needed to get more disk space from Google, but then how can I tell Jenkins to use an alternative disk storage mechanism for his logs?
I then started adding to the jobs a plugin called “Logstash” (https://wiki.jenkins.io/display/JENKINS/Logstash+Plugin) which is responsible for capturing and sending Jenkins data to a configured stream appender.
All the Gerrit CI jobs are managed through YAML files which are submitted through code-reviews, using the Jenkins Job Builder tool. However, showing the Logstash configuration on the Jenkins UI is much easier to show where the Logstash is playing a role in the Gerrit-verifier-change job configuration.

Screen Shot 2017-10-24 at 23.05.33.png

I have enabled a new feature to all the jobs to send all the log stream to the Logstash plugin. This works differently to what most of the people would do. Instead of just posting the log file into a stream of lines to ElasticSearch, this plugin gets the information directly from the JVM memory together with its metadata, the timestamp, the build parameters, the environment variables and send them to an endpoint, which could be anything. In this case, I have chosen to use RabbitMQ as stream appender. On RabbitMQ you can notice that I have created a queue for incoming Jenkins messages.

Screen Shot 2017-10-24 at 23.07.02.png

You may notice a lot of activity because every time that the Jenkins jobs produce something, a message is sent to RabbitMQ with the log and the attached meta-data. RabbitMQ is not used though as a storage system but acts only as a vehicle to transfer the information to a long-term storage system, which could be Google Cloud Storage.

The organization of files is straightforward: one file per hour. By looking at the file content, it is a very compressed JSON file that contains all the information I need: the build id, the result, the logs, the parameters.

Spark to the rescue

Problem solved? Can I tell all the Gerrit contributors that they have to look for a build result into a JSON file? Maybe this is not a very nice user experience.
I little more digging is needed to make the solution more transparent to the end user.

Screen Shot 2017-10-24 at 23.14.56.png

GerritForge as a company works and contributes to many BigData projects, including Apache Spark. Why don’t we build an elementary Spark transformation that consumes the input JSON files and materializes back the log into a readable format?
So we built a Spark job that is crunching this data and produces something very very similar to what Jenkins would render. However, we need to make sure to perform all those operations outside the Jenkins domain; otherwise, it would become very soon overloaded and thus unusable.

I have then created another directory that is not actually managed by Jenkins but gets populated by a Spark job. This parallel file structure has exactly the same organization of the build files generated by Jenkins builds.
Let’s have a look for instance at the oldest build that has been recorded by Jenkins: build #31639. For sure if I go to the build #31444, which is older than the #31638, Jenkins would give me a 404 because that job execution has been removed.
However, if I try now to navigate to the build log #31444, wow, I can see the full results as the build log was still accessible.

Screen Shot 2017-10-24 at 23.09.09.png

Additionally, as this log has been produced from the previous JSON file that contains all the meta-data, I can even render more information such as the time-stamps, which are not typically available in Jenkins unless you enable a specific plugin.
Moving forward, by leveraging the same input JSON file, we could do a lot more data crunching as well. It would be interesting for instance to draw a graph of the correlation between the Gerrit changes the build execution times at the different stages.

Uncovering the hidden value of your Jenkins logs

There is a lot more we can do with the JSON I’ve shown you before. It contains not just the log messages, but everything related to the build meta-data of the build and its execution metrics. That means if we go to this change #129553, the link that points to Jenkins logs is not broken anymore, even if it is not served by Jenkins but is backed by the Spark job results.
Starting to applying the same mechanism to all the Gerrit changes and redirecting them to the Google storage where all the files are going to be archived, any change in the Gerrit history will not contain broken links anymore and will be perfectly auditable.

That means that from now, whenever you are going to receive a Verified notification from Gerrit and you navigate to your change links, you are not landing anymore to a 404 page anymore.

Questions.

Q: What if I have a Jenkins instance and I want to do some of this but I don’t have infinite disk-space as Google. Is it is possible to implement?

A: With regards to disk space, you don’t have to go to Google or AWS. You can set up an HDFS filesystem yourself. All the cloud storage implementations available on the Cloud are mainly based on something very similar to HDFS which is an open standard and is available as OpenSource. That means you can store the information there and you do not necessarily need to keep it forever. In practical terms what you need to keep is the lifetime of a release of the software, or a few software iterations, maybe six months, 12 months. As the JSON files are organized on a time-series, it is going to be very easy to remove or archive all the data you do not need anymore. I have shown you how to store those files in JSON, but you can use even more optimised and compressed format such as Avro or Parquet, which may contain 10x times the information in a fraction of disk space. Additionally, when you process them, they can be even faster because they include data encoded in binary format. In a nutshell, the term “keep the logs forever” could be read as “keep for as much as you need: one week, one month, six months, …”. The problem with Jenkins is that for very busy servers like the Gerrit CI, you cannot keep even a single day of logs and when the people are coming the next day to check what’s wrong with a failed verification, would risk having a 404 error page.

Q: So if you do compression and decompression, that needs to happen server-side, so that is transparent to the browser?

A: Yes, that needs to happen on the Server, and there are a lot of ways for doing it, it could be even done on-the-fly, streaming and is pretty fast. There will be a talk tomorrow talking about the methodology to crunch large amounts of data and about the lambda architecture.

Q: Does it generate a RabbitMQ message for each log statement or a unique one at the end of the build?

A: Yes, and the reason is straightforward: If the build crashes or gets aborted for any reason, you do not want to lose your build logs. There was an implementation of Logstash for the Jenkins pipeline that was precisely collecting the logs all at the end of the build, but the design is wrong because if the builds get aborted you do not get feedback at all. Yes, it generates a message for every single line, and possibly RabbitMQ is not the correct implementation of it. But as soon as the Logstash plugin supports the Kafka transport, the performance issues related to the use of RabbitMQ for log streaming will be resolved.

Q: The Logstash plugin that you mentioned, has nothing to do with the “ElasticLogstash” implementation?

A: Yes, it is just unfortunate naming. Actually, the Jenkins Logstash plugin was possibly born before Elastic called his implementation ‘logstash’.

Q: You mentioned that you do Spark processing at some point, but it wasn’t part of your presentation.

A: Yes, it is not part of this presentation for reasons of time, but it is trivial.

Q: Question about the GerritForge CI: I have frequent problems of the test failing not because of my code, and I want to retrigger the tests without having to add a commit to retrigger the CI. Is there a way to retrigger the CI build?

A: Yes, it can be done by going to the Gerrit-verifier-change URL, you click on “Build with Parameters” and enter your change number. You can in this way retrigger any build without having to commit anything.

Q: And if that pass that would assign the Verifier approval to the change?

A: Yes. I would like to add a Button to Gerrit-Review to avoid people to navigate to a different URL.

Q: We are relatively heavy users of Gerrit topics because we have changes that are across multiple repositories. We have a very similar job to this one but we can either put a single or multiple change IDs or a topic name, and it will work out whether it is a consistent declaration. Another thing that you can comment on, you mentioned that the verifier job which runs some independent verifications and then feeds the result as one result to Gerrit, that sounds like something we could use. What is that build using?

A: Tomorrow there will be a presentation of a brand-new integration between Gerrit and Jenkins. The rationale for writing a new integration lies on the thinking that “maybe the Gerrit project is not the only one that needs a bit more from Jenkins.” So why not creating a Jenkins plugin that takes the most of the experience we’ve made in integrating Gerrit with Jenkins for the Gerrit Code Review project and makes it available to the rest of the world? There will a plugin to implement that workflow.

Gerrit User Summit: PolyGerrit UX

Posted on October 19, 2017 by Git and Gerrit Code Review for the Enterprise

The second presentation this week from the Gerrit User Summit 2017 is about PolyGerrit, the new Polymer-based UX for Gerrit Code Review, officially feature complete for Code Review in Gerrit Ver. 2.15.

Logan (PolyGerrit and Gerrit Maintainer), Arnab (PolyGerrit UX Designer), Dustin (PolyGerrit User Researcher) and Jason (Gerrit Code Review and Go Language Product Manager) show and discuss the future of UX of Gerrit.

PolyGerrit Overview (Logan Hanks, Google)

We are going to present three topics today:

I am going to talk about the development of PolyGerrit, why we are making PolyGerrit and what we’ve done over the past two years.
Arnab is going to present a new visual design we are going to start rolling out to googlesource.com, and to the next release of Gerrit.
Dustin is going to give some findings from the user research on the survey I believe that many of the people in this room may have received the questionnaire on Gerrit and have responded.

Established November 2015

commit ba698359647f565421880b0487d20df086e7f82a
Author: Andrew Bonventre <andybons@google.com>
Date: Wed Nov 4 11:14:54 2015 -0500

Add the skeleton of a new UI based on Polymer, PolyGerrit
 
This is the beginnings of an experimental new non-GWT web UI developed
using a modern JS web framework, http://www.polymer-project.org/. It
will coexist alongside the GWT UI until it is feature-complete.

That is the very first commit we did on the PolyGerrit project, it was founded by Andrew Bonventre from the Chromium project and the commit message summaries up pretty well.
The idea is to replace the existing UI based on Google Web Toolkit (or GWT) and replace with a new UI built on modern frameworks with straight HTML and JavaScript. Part of the innovation was just not to renew the UI and make it better but also for the Chromium project move to Gerrit.

Based on the Polymer Project

Polymer is pretty essential to PolyGerrit, and that’s why is part of the name. The Polymer project is not a framework but rather a collection of libraries and tools and components that have been developed by Chrome at Google. The idea is the build beautiful web applications on top of the standard web platform. Some critical aspects of the web platform that we use are web components which are built on top of templates that you can import into your HTML file and shadow DOM which is a way of encapsulating your stylesheets into your templates so that you can build reusable components.

Another significant aspect of the Polymer project is the collection of custom elements. They are organized into groups and given chemistry names. We use iron elements pretty heavily; they are the basic building blocks for HTML.
We are starting to use elements from the paper collection which are higher level UI elements that introduce material design. They look nice, they are high level and reusable elements. They let us focus on building code-review without having to rebuild everything from scratch.

Screen Shot 2017-10-19 at 15.50.05.png

Q: Can you explain a bit more on what material design means?

A. (Arnab) Material design is a new design language that Google came out of it in 2015. It’s a library of all the modern components with great look and feel.

Material design elements come with a lot of features like animations and smooth transitions which makes your UI more dynamic, it’s hard to show in slides all that dynamism.

Current status of PolyGerrit

We have been working on PolyGerrit for two years now, and in that time we got a lot of contributions from outside the project, we have a few dozens of contributors, checked in thousands of commits, and we closed almost a thousand issues from our issue tracker.

As far as I know, there is no other significant deployment of PolyGerrit outside of Google and GerritHub.io, but we have been having used in Google for a while, and we see 10k code-reviews a day involving PolyGerrit, at least on a workday. We have got a few major OpenSource projects using it, Chromium has recently transitioned to PolyGerrit, and brings up almost 2000 developers.

I’ve put some other recognizable OpenSource projects like GoLang and also the Gerrit project itself, where we configured gerrit-review.googlesource.com to use PolyGerrit by default. That helps us get early feedback on the functionality and the design of PolyGerrit.

Chromium migration to PolyGerrit

What about Chromium? It is a significant milestone for our Team, and I also think for the Gerrit project. The Chromium project has a very complex system for checking their code and for builds. They have built the system themselves and they used it to run their code-review system them as well based on Rietveld. Unfortunately, that is a sort of unmaintained instance of Rietveld running on AppEngine they did not want to use anymore and also it wasn’t scaling very well for them. There were even commits that they couldn’t review on that system because they were too large.

So a lot we’ve been working at that time, it was allowing Gerrit to come to a state so that Chromium could use it. Part of that was building a nice UI, and part of that was taking features they have been enjoying in Rietveld and bringing them over to Gerrit. Some of them are small things like user account statuses, so that you can tell people that you are not in the office and you can’t do code reviews this week, or having descriptions to your patch-sets so that you can explain what this revision means.

And also they’ve built a lot of new customizations, and they got us flush out a plugins API, and I’ll show one example here. So Chromium has a pretty heavyweight CI system because they have to build Chromium for a lot of different platforms and different architectures and they have a lot of incoming commits. So they need to find a way to verify that a commit that is coming into the code, actually works and in which order to check in those changes, so that is not breaking the build.

Screen Shot 2017-10-19 at 16.06.00.png

One thing that they’ve built for PolyGerrit is a plugin to add these elements to the screen for visibility to what’s going on in their system. The builds that are running for their commits, their status, often the flaky, so that by clicking on these you can see the output of the build. You can also use this UI to start a particular build and chose which architecture you want to run on or re-run things that failed. I think we should be building better integration to Gerrit for these sort of things because those are what big and small projects need, but it is cool that we can do these things with plugins as they are now.

What’s new in Gerrit 2.15 for PolyGerrit

Gerrit 2.14 was the first release that included PolyGerrit but it is just a sort of experimental preview, but that did work well out of the box. Since then, we have fixed a lot of bugs and added more features, we have converted the entire project to ES6 which for JavaScript developers makes their life so much easier and also makes the project more attractive to other contributors.

We made a lot of performance improvements in the application. For example, sometimes we have commits that have thousands of files, and we have to have a way to display them intelligently, we cannot just throw them on the screen. We needed a sort of new navigation for these kinds of things, and we have adapted PolyGerrit to work with any variety of commits of different shapes and sizes.

Some specific things we added that are kind of cute. PolyGerrit offers a reply dialog to manage your reviews, reply to comments, vote on labels, and submit that all as a single batch operation. We added some nice features to this reply dialog since Ver. 2.14, one is being able to preview how your reply will be formatted. Not just to show you how your answer is going to look like but now it renders in the same it would end up in the e-mail because we have HTML e-mails.

Screen Shot 2017-10-19 at 16.06.49.png

We have also reorganized how voting buttons are arranged, there is this beautiful vertical alignment, and you can see the labels that give you the meaning what +1 means for that label. We have a slightly updated patchset selector. That is important because one of the things people struggle with it is navigating a code-review. If there are more patches, understanding where the comments are can be challenging, because of comments land on different patchsets.

Screen Shot 2017-10-19 at 16.08.31.png

We also introduced the feature of resolvable comments, so that you can mark a comment as fixed or done and we can track that for you. You can see that on some patchsets there are some comments you haven’t reply to yet or has a review, and you can verify that everything has been addressed.

The PolyGerrit Admin UI

Another major area that we have been working at is one of the significant features that PolyGerrit has compared to the old version: the admin UI.

A lot of our users had to switch back and forth to do the admin work. We’ve been building up this part of the application. We have rearranged that a little bit, we changed the title by clicking on the browse link on the top menu. Our idea is that you can get to this like the project home page for a repository, where you have to check out instructions and is also the entry point for administering the project.

Screen Shot 2017-10-19 at 16.25.51.png

We have almost all the primary forms for managing groups and projects completed. We have also released a read-only version of the sophisticated project access rules editor. It is read-only mode because we don’t want anybody to risk losing their project.config because of a bug in this: project.config tends to be a very security sensitive thing.

What’s next for PolyGerrit.

We are still working on replacing the GWT UI; we’re getting close. One of the significant features remaining is in-line edit. We’ll be working on that over the next few months, as long as the long tail of the specific missing pieces of the UI. And then we are going to roll out a new visual design which Arnad will be talking about shortly.

We also want to do in the longer term more significant changes to the UI of Gerrit, to show how accessible is to the users, because when people learn how to use Gerrit they love it, but the initial learning curve is a bit difficult at the moment.

PolyGerrit UX New Look (Arnab Banerjee, Google)

We are working on a new look; I’ll just be giving you a quick glimpse of what it looks like. We know that Gerrit and PolyGerrit work very well for the users, but we want to improve its usability as well as the status of the tool.

For improving usability we think there is a long process. We don’t want to break what’s already working for you, and that’s why it will be more an evolutionary process based on user-research and feedback.

However, we started by looking at the aesthetics of PolyGerrit and giving it a new look. A few critical goal that we had is that design should represent what PolyGerrit is and stands for, having a more contained group of information in a way that you know exactly where to find a particular group of information. Also more consistent UI behavior across the board: as soon as you look at the component you see what you can expect what the element is supposed to do. Also, we want to introduce a more up-to-date style of components and fonts, which means: material design.

What should drive the experience we want to create is what we really want to achieve with PolyGerrit.

Screen Shot 2017-10-19 at 16.26.40.png

Confidence: we want you to be confident with the code you are submitting and is going out there.

Scalability: we want PolyGerrit to manage the small as the significant and more complicated change.

Efficiency: we want to increase the efficiency of your team as making the code review process more efficient.

Intelligence: PolyGerrit should be smart enough to do all the repetitive task that you otherwise would have had to do manually. Wisdom: PolyGerrit should be mean to share the knowledge on the team.

PolyGerrit is Blue

One color that captures all of that very very well is blue. So we picked blue and made the primary color of PolyGerrit.
That is going to be the default primary color but you can change it for your team.

Screen Shot 2017-10-19 at 16.28.34.png

That is the new and refreshed look of PolyGerrit.

There are three distinct layouts for PolyGerrit:

The change screen
Dashboards
Admin screen.

PolyGerrit new Change Screen

The change screen is the most used screen in PolyGerrit, so we thought why not start by giving this a new look.
For this presentation, I will just talk about how the new PolyGerrit style works on it. Let’s talk about the visual elements on this page and look at it with a little bit of detail.

Screen Shot 2017-10-19 at 16.27.51.png

One of the things we have added in PolyGerrit is the breadcrumb where you can quickly jump back to wherever you came from. The next is the status, and now we have a clear status indication of the change so that just looking at this state you can understand if the change is it is closed or opened, and the color line adds to this the indication of the change.

Screen Shot 2017-10-19 at 16.30.54.png

Let’s talk about the meta-data of this page, we have now a much more visual structure of spaces, fonts and colors and actions that are now material flat buttons, and they are different from links, which will be all caps and different fonts altogether. So visually you see the difference in what will take you to another space and what will take action. When you click on add assignee, you get this in-line box which let you add assignee, and this is just a more elegant way of putting inline data, it just makes the action so much more clear.
When you click save, you can see the assignee name populated and assigned to the assignee label.

Once you start typing will just suggest the names while you are typing and topics also work in the same way.

Screen Shot 2017-10-19 at 16.31.29.png

Now coming arguably to the most used section of this page which is the files table, and you may notice these two header rows, the file heading and we just combined the information of these two rows into one row.

Let’s look at the elements of this particular section. The first part is the patchset selector. If have both the selected patchset as well as the original patchset, you can see them together side-by-side, so that you don’t have to look all the way to the right to select the referenced patchset.

Once you click on the patchset dropdown, you have a different layout for this different drop-down menu. On the left side you can see the patchset as well as the patchset description, and on the right side, you can look at the comments and the unresolved comments. That makes scanning through this list so much more comfortable, especially if this list is long.

Screen Shot 2017-10-19 at 16.32.17.png

Then adding patchset description also works in the very same way. It just shows the inline edit box, and you just enter the patchset description and say “save.”

Now we have a separate interaction for marking the file as reviewed when you hover on this rows. The mark reviewed shows up on the right-hand side, when you click it you can see a column on the right-hand side of the mark as reviewed and when you click review this text shows up a toggle button.

Screen Shot 2017-10-19 at 16.33.05.png

On the top-right, we have now the table actions, and we just have two actions by default. The first action is “expand all” and the other is “download.” If you click on expand all you see this diff view expanded and you see some icons that appear on the top. So the “expand all” button changes into “collapse all” and you can change from side-by-side to unified diff-view or even change your diff-view preferences.

That is just a tiny step towards what we can do and is just the tip of the iceberg, and I think we’ll reach out to you for more feedback and do more user research for really making PolyGerrit a tool that you’d love as code-review tool.

PolyGerrit User Research (Dustin Smith, Google)

I am a user researcher at Google, and my job is to communicate with our users and take what you tell me and distill what we can do and what changes we need to do to our product.
Being an advocate for all of you in the meetings because I am not a developer I am not a designer and my role is mainly to communicate what you need to our developers and feedback them to our designers. I collect feedback and distill your needs. Arnab will come up with new designs, and we will come back to you and say “how did we do based on your feedback?” Then we will repeat that process to infinity, and that’s how research gets integrated into the design.

So if you are not a member of the Gerrit Google Group (repo-discuss) , I highly recommend you to join. You can yell at us there; you can say “I really don’t like what’s going on in the product, ” or you can praise us, whatever you like. At some point, you can send us surveys through there and hopefully not a very long survey that helps us aggregate all your needs.

User Survey results on Gerrit Code Review

An example of that we sent out a survey in June and 600 of you responded. We received a lot of nice praise, and that’s great, but we’re also looking at ways to improve Gerrit, and we asked you to rank of what would be your priority.

Screen Shot 2017-10-19 at 16.34.13.png

What you see on the top here is that all the people that answered the survey is interested in improving the navigation of Gerrit, whether it is a patchset navigation, commenting, or other UI navigation. There is also another related to better integration with other tools. We are curing that feedback. Something to keep in mind is that even though there are things at the bottom that are ranked at zero, that doesn’t mean that you don’t want those things but they are relatively listed.

Q (Han-Wen Nienhuys, Google). I am wondering, who is in this group of users you asked because one of the things entirely at the bottom says “improve the reliability of the backend” and then very much at the bottom is “better performance” which is kind of what our Team is continuously worried about. Even yesterday we had a Hackathon and the friends from the Android Team on Sunday morning, and we were all in panic, but I see that “nobody cares about it,” how comes?

A. That’s a fair question, who did we sample? These users were external users of Gerrit, and only 13 of them were PolyGerrit users, so what you’re looking at here, 95% of them are external users. Having said that, why ranking better performance and reliability the lowest, I am not sure how to answer that question.

A. (Gerrit admin from the audience). For us obviously reliability of the backend is important, but the question was “does reliability need improving?” and for us, Gerrit works quite well and we are very happy with it.

A. (Luca Milanesio, GerritForge) I believe that my is very similar to that. First of all, the people you are aking to. You’re not asking Gerrit administrator, but just people that are using as a UI. The people that are using the GUI are saying “reliability is currently 99.99%, do you need to make it 99.99999%? Not really”. What needs improving is what you guys are focused on: usability, discoverability, and leveraging all the power that is out there.

A. That’s a great point and keep in mind that all the items on the bottom of this list don’t mean that people don’t like them just that other aspects that need improvements are ranked above it.
People did not say “I dislike the idea of better performance, ” but more often than not, people selected the options above it before the better performance.

Other highlights from people, maybe people that is one of you, navigation is important “Getting an overview of the change I’m reviewing is hard. When a change is composed of several files, the file navigation feels like a pain”.

We are taking these feedback and then moving forward, and also some complaints about the search itself. The search could be better, for example, “I want to search a commit that includes a particular file.”

And then, integration. I would like more Jenkins integration for passing code coverage reports into Gerrit. If these quotes that I pulled out are jelly with you wonderful, if you like there are things that are missing and you are not being heard, I am the person to talk to. I need to make sure that you are heard. There are also some other options I am going to show you. How many of you are aware of userresearch.google.com?

PolyGerrit Research: call to action

If you’d like to, you can sign up here. What we will do is substantially having a look at the new designs, you can spend an hour with me, maybe 30′ depending on how busy you are, and I’ll show you some new designs, and you’ll get a gift for your time. I am not the only user searcher at Google, and this is not the only product that is supported by this, but I am interested in hearing your feedback about Gerrit and runs studies on here, and you’re welcome to join me.

Screen Shot 2017-10-19 at 16.34.56.png

Today and tomorrow we have a booth that you may have seen already with this banner, and we have a new dashboard design, a new patch navigation system and a new navigation for related changes that we’d loved to have you have a look at. We are not showing them on these slides on purpose because we want you to see the designs and come with a bunch of feedback on your first impression to the UI.

Please come to the booth and talk to any of us, there is Jason Buberel with us (our Product Manager), you’re welcome to come and speak with him as well, and almost all the Googlers here are willing to show some of the new designs.

Questions.

Q. How exactly you come out with a new design, the layout, how to make it better. How do you know that the user experience is better compared to the previous old one?

A. I take what your pinpoints, what you love, and I tell Arnab, and then he comes with a new concept. How do you know if that is better? Typically you generate tasks of what you would usually do with our UI, and we then do a tasks assessment where you see the goal that you go through those tasks and see if they failed. You talk aloud thinking while you go through those tasks and you may say “I really don’t like this new UI” and from that, you do mainly a mini A/B test. You can go through this process for just a few people and you don’t have to build up the entire product that everyone sees.

Q. I remember when a couple of years ago there was the announcement that we are going to have the new PolyGerrit project. We all asked “when the new GUI is going to be available?” and the answer was “six, maybe twelve months.” Now after two years we have something that works that is PolyGerrit, we started using it, and we like it. There are still gaps that are not really on the change screen, which is way better than the old one. I can use it on my Tube in the morning on the mobile phone while the old GUI did not even render or it was so tiny that you could not even see it. The gaps are more on the other parts of the UI. 50% of the users are using Gerrit as a very reliable Git server, have you put on your radar all the other use-cases not directly related to code-review that currently PolyGerrit is not covering?. The on-line editing, the code-browsing, the user-journeys from the code to the review and the way back, integrated search across code and the associated reviews?

A. Something is clearly in the scope of Gerrit, and something is not, but that does not mean we are not interested in it. Source browsing is one example of it, and I don’t think it will be in PolyGerrit but that does not mean that we are not interested in a way to browse source code. We are working on the in-line edit, maybe is not something we looked at the usability of it yet, it is a bit lower on the list of priorities, but are starting with the implementation and Arnad has been helping us with the basic design elements of that. It will be a basic feature at first, but it will look nice and be usable to some extent, and then we iterate.

Q. What about the completely new redesign? Are your efforts directed in completing what is there or are on reinventing everything from the ground up and we have to wait for another two years?

A. We are focused on the immediate terms of completing PolyGerrit because for many technical reasons we want to replace the GWT UI and we want to have more users using PolyGerrit so that we can get feedback to keep going. As we get more and more feedback and more and more iterations on PolyGerrit in the wild, we will come up with a lot more work for us to do. In the near term, we are going to implement what Arnad has presented and while we are doing that Arnad will help us by collecting some more feedback around what you will see at the booth, around the new dashboards and we are going to keep working on that. PolyGerrit is not going to stay static as it is today, it is going to be more iterative process based on Gerrit @Google and the Gerrit master branch. At every release cycle of the Gerrit OpenSource project, we will include some significant improvements on the user experience.

Q. I am just wondering how the new users get represented in your user research. Maybe people coming from GitHub or GitLab but even other people coming from different source control systems such as Perforce or SVN. How do you capture their experience? A lot of us have used Gerrit for a while and understand to a degree, Gerrit. However for other people, at times is a kind of a ping-pong, even though you are a new user for a very short time until you learn Gerrit.

A. On-boarding is definitely on our radar, we take users that have not used Gerrit before, and we ask to perform certain tasks. That is the kind of the recipe: take people that never used the product and working with Arnad on their feedback to make that process smoother.

Q. How many research you have made on the other code-review tools, advantages and disadvantages over Gerrit?

A (Dustin). It is an approach I am willing to take, doing a kind of competitive analysis. I have not started yet, but it is definitely on the roadmap for research.

A. (Jason). If there are other code-review tools that you’ve used and something that worked well, something that Gerrit or PolyGerrit could benefit from their knowledge or usage, please let us know. Nothing is off the table here concerning what we will improve or consider improving. If you used other code review tools, e.g. “hey reviewable.io got this very pretty good thing, you should do something really similar to that”, by all means, please let us know because we are opened to improvements in all areas.

Q. Following-up on this question, there will be talk tomorrow on this, our friends from CollabNet that are contributing to Gerrit Code Review have clients that are using not only Gerrit but even other tools like GitHub and many of them are hearing concerns of people getting lost without their pull requests. Because they are coming from other tools like Atlassian BitBucket, they have the concept that “without a pull request there is no review.” They have as well the misunderstanding that Gerrit forces you to amend the latest commit all the times, which is wrong. Gerrit allows you doing whatever you want. If you want to amend, you can do it; if you want to use feature branches you can do it as well. I remember Martin last year saying that they are using feature branches at Qualcomm and there is nothing wrong with it. CollabNet has developed a very basic UI that allows people that are coming from the pull request workflow to keep on using that paradigm. This UI is a sort of lightweight code review that helps to digest Gerrit gradually. That could be even the key to solve the dilemma raised on a recent discussion in the GoLang project where people asked to stop using Gerrit and instead using GitHub to review contributions.
Following the discussion thread, it emerged that Gerrit is “too good” and does exactly what they need while GitHub doesn’t. However, there are concerns about other potential contributors coming from a GitHub experience having to invest 15′, maybe 30′ or even one day of learning of the Gerrit Code Review workflow and a new UI paradigm. They may decide to give up and avoid contributing at all. If the project would move to GitHub pull requests, this problem would just disappear. Are you guys taking into consideration how to make this transition from GitHub smoother and Gerrit easier to understand for those guys?

A. That’s a great question, and as the formal product manager for the GoLang project, I heard from some members of the community of the project that they were relatively unwilling to contribute. We saw this type of disparity from the casual contributors. Those are the people that find a typo in the docs and just want to fix it quickly. With the GitHub UI is nice because they see the problem, they click Fork, edit the code on a single line or merely a single file, and they create the pull-request with a straightforward review process and gets merged and done. You never had to clone the repository, did not have to figure out the authentication bits, and that is simple. You know, whether we want or not to take Gerrit down a path where we want to make these casual contributions super-simple it is something we are open to if that is what the Gerrit community thinks is a significant capability. At the same time, we do not want to displace GitHub with Gerrit, there is a vast community there for a reason, but there are specific workflows like the casual contributors’ workflow that are certainly easier on GitHub and much more difficult for users that are new to Gerrit. It is an area where if we think there is a need for, we should think about it and consider how we may support such a thing. We are open to hearing more feedback on things like that.

A. I just came from the OpenStack developers’ summit in Colorado a couple of weeks ago, and we were having the same conversation regarding onboarding contributors to OpenStack. Most of the people coming from GitHub are having the same hurdle. One of the things that were discussed out there was having some documentation, some quick-start, cheat-sheet, crash course, for people that are familiar with GitHub. It is not just about enhancing Gerrit code to make it more friendly and port the workflow they are used to use but provide content that helps people, coming over.

A. Dave is one of the tech writers at Google who works explicitly with Code Review. If you have a specific part of the documentation set that you think it needs improving or a brand new part of the documentation set that could be useful to new users, come to him and talk today or during the next days. He will be happy to help.

A. I have started already building a spreadsheet, a kind of rosetta-stone of mapping concepts and commands for mapping one to the other and I’ll show you that. That is just one approach, yes, but there are some pros as well.

Gerrit User Summit: Gerrit at Google

Posted on October 10, 2017 by Git and Gerrit Code Review for the Enterprise

Starting from this week, we are going to share one video per week of the amazing talks that were presented at the Gerrit User Summit 2017 in London.

In addition to the YouTube recording, we are during the extraction of the text and publishing it together with the relevant pictures taken from the presenter’s slides, so that people can start digesting the content at small bites.

This week talk is Patrick Hiesel’s presentation on how Gerrit multi-tenant and multi-master setup has been implemented in Google.

Gerrit@Google – Patrick Hiesel, Google

My name is Patrick, and I am going to talk about the setup of Gerrit we are running at Google. I wanted to take you on a journey starting with Gerrit that you all know and making it the system we run at Google, step-by-step; and at the end will have a multi-master and multi-tenant system.

Multi-what?

Multi-tenant is the ability to serve multiple hosts from the same single Java process. Imagine like the same JVM task serving gerrit-review.googlesource.com and gerrit-chromium.googlesource.com.

Multi-master is the ability to have multiple Gerrit servers all over the world. You can contact any one of them for reads and writes.
Most systems have read replicas, which is straightforward, but write replicas is where the juicy meat is.

Multi-tenant

We have gerrit-review.googlesource.com, based on OpenSource Gerrit that you can download right now and have it running hopefully under ten minutes. That is core-Gerrit, and it depends on three things:

JGit: for all the Git stuff
Multiple indexes for the accounts, changes, and other stuff
Caches

All these three components are based on the filesystem in one way or the other.

Now you have a friend that is accessing go-review.googlesource.com, what are you going to do?
The most natural solution is to start another Gerrit instance for it. You can have all of them on one machine, you can give them different ports, very easy, and in the end, they’ll be all based on the filesystem.
All those Gerrit instances do not need to talk to each other; they can just be separate instances operating on separate ports. This is not a multi-tenant system, but only different Gerrit instances on the same host.

Screen Shot 2017-10-10 at 15.35.08.png

You can add another layer on top of it: a servlet engine which receives all the traffic, check which host the traffic is for, and just delegate to the individual host.
To take one step further, have that selection filter doing that for you. Gerrit has a daemon that runs all the functionalities. You can integrate that daemon into the incoming servlet filter. When you can get a request for go-review.googlesource.com and I do not have where to allocate it, you can just launch it, instantiate all the objects and then run the traffic from there. Also, unload instances would work in the same way.
The Gerrit server engine and the selection filter can run in a single JVM.

How Gerrit can conquer the world

So you have a master here in Europe, and you have got one friend on the west coast in the US. He says: “Oh your Gerrit is so slow I have no idea why and I wish I could move to GitHub.” You say: “Hold on, I can do better than that!” and so you put a new master for the person on West coast.

So the key to that is the replication and comes in two sets:

Objects that you have to replicate related to all the Git data. JGit is putting objects into the disk, and these are the data you need to replicate correctly and fast.
Other stuff that should replicate and fast to provide a pleasant user experience but it really can be best-effort. That is the indexes and the caches.

If it is okay for your master having a 200/600 msec additional latency, then do not replicate the caches. You can have a cold cache in Singapore or the US, and you can reread them without problems.

Screen Shot 2017-10-10 at 15.37.31.png

For the index, replication can be best-effort, but you should make an effort to replicate them. It is still nonetheless a mandatory requirement. One way to achieve that is to use ElasticSearch, but other index implementations that give indexes replication can be used as well.

Multi-tenant and Multi-master together

We talked about a multi-tenant system and then replicate them globally, so we have now a multi-tenant and multi-master system, actually pretty close to what we run at Google.

That is the stack that we run in total. We have a selection filter and two other filters to decide what the traffic is directed. We are also based on JGit, no magic there, we have index and caches we replicate, all our systems are based on filesystem and BigTable.

Some “magic” happens at the Git layer at Google because that is where all the majority consensus across all the cells. When you are pushing anything that is Git and, with NoteDb, anything that is a review is in the repository as well, the system tries to reach the majority of the cells and write the objects to them. When that is acknowledged, you get a green light on the push.
Majority consensus also means that you have it only in so many cells, but don’t have it on all the cells all the times. Some of the replication is happening in the long tail, by replication events eventually get acknowledged by the cells, and then they get written to all the masters.

Our indexes and caches are also replicated, but some of them are just in-memory and a component that gets replicated on top of BigTable.

Redundancy everywhere.

We run five data-centers across three continents (Americas, Europe, Asia-Pacific), with precisely the stack we just saw which gives us a good latency for most of our users worldwide.

Let’s talk about load balancing. We have a system that is multi-master and multi-tenant, and any of the tasks can serve any requests, but just because it can, doesn’t mean that it should.
Maybe it has in cold memory caches, or it is in Singapore, and you are in the US; so the question is what if the biggest machine is not big enough and we want to optimize it?

Screen Shot 2017-10-10 at 15.39.02.png

The idea is that we want to reduce latency that comes out of cold caches and minimize the time the site takes to load.
So you have a request for gerrit-review.googlesource.com, and your instance has a cold cache, and you need to read from disk to memory to serve the request.
You have a fleet of 300 tasks available but you want to serve gerrit-review.googlesource.com from only just five of them. If you serve 300 requests from 300 tasks in a round-robin manner, you pay the latency to load data from disk to memory for every single request. And the second motivation is that you want to distribute the load.

We want a system that can dynamically scale with changing load patterns. We want a system that can optimise the caches, to send a request for a site/repo to the few number of servers and tasks based on two conditions:

serve from one machine as long as it fits on it
server from more machines if you really have to

Level-1 load balancer

In the stack at Google, you saw two load balancing levels. What you see down in the picture is the Gerrit tasks, that contains all the software layers we talked about. We have a user that triggers JSON calls from the browser, with PolyGerrit. The first thing that the JSON request is hitting is the L1 load balancer. The primary routing of your request is by geographic proximity. We have five datacentres at Google; the L1 load balancer picks the one with the lowest latency. When the request goes into the data-center, we have another load balancer which is the one I’ll talk about more, because this is the one where the Gerrit specifics happen.

One thing that L1 is doing as well is managing the spillover of traffic. When a datacentre says “I can handle up to 100 QPS” the L1 load balancer starts redirecting traffic to other datacentres should that threshold be reached.

Level 2 load balancer

Let’s dive into L2 load balancer, we want to know how much traffic we are getting into each Gerrit task, and we want to know in the load balancer where the single request should go, and we want to know that fast!

We added three new components to the architecture:

An element to redirect tasks and provide functionality and can report the load we are handling right now. When I mean load it can be anything: QPS (Queries Per Second), metrics, we just want to know from the tasks: what is your current load? We have a system called slicer, which I am going to talk about in a second and it’s added there in the picture.
A second component we are adding to the load balancer, with a query interface that responds to the following question: “we have a request for gerrit-googlesource.com, where should it go?”. All of that should be done in memory and should be regularly updated with the new elements in the background so that we don’t add another component of latency by having another RPC.
A third part is coordinating everything and is called the assigner, and it takes all the load metrics that we reported generates new assignments and gives them to the query interface.

Introducing the slicer

We have a system that is called slicer. There is a very nice paper that I can recommend, published last fall, that talks about that. It is a load balancer that works on custom keys and can do automatic re-sharding based on new traffic patterns. When your nodes receive more traffic, the slicer will automatically distribute the load or re-shard the whole system. That is a suitable method for local sharding that happens within the data center; we do not use it for inter-datacentre because that is all done via geographic proximity.

Screen Shot 2017-10-10 at 15.40.13.png

The system works with 64-bit keys and gives you a lot of combinations. You can slice the keyspace, for instance, in 400 slices. That gives you 400 ranges, and you can take any of them and assign to one or more tasks. The hostname is my key for instance, and then you hash it, and you end up in the first slice that gets assigned to a single task with an index zero.

What can we do if the load changes? Let’s say that you have key zero that gets assigned to the first range and then the traffic changes. We have two options.

The first option is to assign more tasks, let’s say task 6, and then you round-robin between task 0 and task 6.
The second option is splitting into 600 or 800 slides to get a better grip on each of the keys.

Screen Shot 2017-10-10 at 15.41.20

We can also do that, and then we factor our the load for gerrit-review.googlesource.com and go-review.googlesource.com, and we put them into different hosts.

We do that for Gerrit, and one of the things we want for Gerrit is when we have to split per-host traffic with the affinity on the repository. Caches are based on the project, and because gerrit-android.googlesource.com is a massive host served from a lot of tasks, we don’t want all these tasks just to serve all general traffic for android. We want tasks serving android/project1 from here and android/project2 from there so that we optimise the second layer of caches.

What we do is to mangle these keys together based on both host and project. Before, all these chromium keyspace was served from a single host; when the load increases we just split the keys into Chromium source and the rest of metadata. This is the graph that we obtained after we implemented the load balancer. The load we have on each of the tasks in a single data center is represented by a line with a different color. What you can see is that are all nicely aligned, so that each task is serving precisely the same amount of traffic which is what you want.

Screen Shot 2017-10-10 at 15.42.21.png

What if one project is 100 times the size of the others and we are optimizing on queries per second? The system will just burn resources fast. We had that situation in May, we saw the graphs, and we said “all good, looks nice”; however, people were sending e-mails and raising bugs wondering if the system were serving any traffic at all or if it were down completely.
It turned out that Android had a lot of large repositories, regarding the number of references, and the objects. We were just optimizing the queries per second, but some of the tasks were doing just CPU intensive work, where others were happy with it. Some of them were burning CPU in flames, and others were fine.

So we moved out of the per-request affinity, and we modified the per-repository sharding to optimise all of this.

Warm vs. Cold Cache

There is an extra in the system that is pre-warming caches. What the load balancer can do for you is to tell you that traffic is changing and I need to reconsider how to split the load on the system. For each of the tasks is going to tell “I’m going to give you traffic for gerrit-review.googlesource.com” with a notice of 30 seconds. That time you can use for pre-warm caches.

That is especially nice if you restart your tasks because all the associated in-memory cache gets flushed. The load balancer tells you “oh, this is the list of the tasks I need” and then you can get them all and pre-warm their caches. This graph shows the impact of the cache warmer on our system, on the 99.9% requests latency, really on the long tail of requests latency. That looks nice because we brought the latency down by a third.

Screen Shot 2017-10-10 at 15.43.15.png

What is a task start dying during peak traffic? Imagine that the load balancer is saying “You’re going to handle this” and two seconds later says “I have to reconsider, you’re going to handle that instead”. Again you’re going to watch your system burning on fire, because you’re serving peak traffic and then you’re running close to 100% CPU. That situation causes the load balancer loading and unloading tasks all the time, which is inconvenient. The way we work around this is to make this cache warming a best-effort activity. You can do it if you’re below 50% CPU when you have time to do fancy things, but when you receive peak traffic, you just handle peak traffic without any optimisation made.

Multi-master and multi-tenant outside of Google

The question is: how do we do that in a non-Google setup?
There are plenty of options.

With the new Gerrit release in 2.15, we introduced to a new URL scheme, which includes the project name in the URL. Previously you had gerrit-review.googlesource.com/c/NNN and there was no way to directly know which project this is for and no way to do that load balancing that we just saw.

What we did in 2.15 is just add the project before that, so that extraction for both host and project can be made in a simpler way. You could do the same even before v2.15 but you needed a secondary index lookup, which most opensource load-balancers such as HAProxy or NGINX did not support. And of course, there are lots of products like Google Cloud load balancer, and others that you can use to achieve the same thing.

Wrap-up

We went through a journey where we took OpenSource Gerrit, we added sites selection and got a multi-tenant Gerrit.
Then we took this multi-tenant Gerrit, added replication and obtained multi-master Gerrit.
And then we took that with load balancing and lots of failures and lots of fixes, and we got pretty much the Gerrit that we run at Google, which brings me to the end of this talk 🙂

Q&A

Q: How strategic is Gerrit@Google? Do you have any other code-review systems? If yes, how is used Gerrit vs. the others?

We have another code-review system for internal use only, and Gerrit is used whenever we are doing OpenSource stuff, so for GoLang, Chromium, Android, Gerrit, and whenever the Google Team wants to collaborate with other OpenSource users, or in general with users that are not sitting at Google.

Historically the source at Google was developed in Perforce, and we ported from that to a home-based system called Piper. Around that, we have a tooling ecosystem which is internal. In parallel to that, Google started to do a lot of projects that have nothing to do with the internal search engine and available outside. What we see is that a lot of projects started at Google from scratch were thinking about “what system should we use?”. Many people said: “well, we’re going just to use Git because that’s what we know and we like, ” and when they needed code-review for Git they ended up with us. Gerrit and Git are very popular inside Google.

Q. You have two levels of load balancer. The first one is the location, and the second one is to decide what to do inside the data-center. What about if a location is off? Maybe is not fully off-line but has big problems, or has a very low-percentage of consensus, and some of the locations have not the “latest and greatest” of the repo. Possibly a location that should be “inconvenient for me” actually has the data I want.

You’re talking about replication layer where you have the objects in one location but not in the other. Our replication latency is in the order of seconds, but it may happen that one location is just really slow in getting the objects. That happens from time to time, and we have metrics that says what the replication lag is accounted for. When it exceeds a threshold we just shut the data-center off, which means cut-off the traffic, the data-center will not receive user-traffic anymore but it will still be able to get the replication done, and when the decrease the objects we need to replicate we can send the traffic again.

Cutting off the traffic is happing at the L1 load balancer where we said “don’t send anything there”.

Q. Do all the tasks have the same setup? Or do have a sort of micro-service architecture inside where some of the tasks are more dedicated to this type of operations and other for another type. Serving data from memory in one thing, but calculating diff change is a different type of task.

Not in general. All of our tasks are the same, except for checking access control permissions. We do not go through the whole Gerrit stack but we have only this little task that knows how the project config works and is going to tell us yes or no.