It has been a hectic and productive year for us at GerritForge and the Gerrit Code Review Community.
We want to take this opportunity to recap some of the milestones of 2019 and the exciting prospects for 2020 and beyond.
Gerrit Code Review, 2019 in numbers
Gerrit had over 120 contributors from all around the world, coming from 33 different companies and organisations, which is excellent. There was a robust 6% increase in the number of commits (+231 commits) but a reduction in the number of contributors (-7 authors).
With regards to the overall trend of commits during the year, the success of the Gerrit User Summit 2019 in Sunnyvale is visible, with an increase in the rate of commits around October/November.
Top-three projects of 2019
Gerrit (1,626 commits) is, of course, the most active project. However, it is visibly down in terms of the number of commits from 2018 (-19%). That is a consequence of the shift of focus to the other two key components listed below, which are developed as plugins and therefore not accounted for in the Gerrit core repository statistics.
Checks (315 commits) is the brand-new first-class CI integration API for external build systems, such as Jenkins and Zuul. It is remarkable how, in just 12 months, it has become robust and fully mature. It is currently used for the validation of all changes on the Gerrit project.
Multi-site (234 commits) is the multi-site support that the community has been asking about for years. It is finally available for all active and supported versions (from 2.16 onwards).
Top-three companies contributing to Gerrit
Google is, unsurprisingly, still the top contributor to the Gerrit project overall. Its share is basically stable from 2018 (around 43%), confirming its continued commitment to the project.
GerritForge grew its contribution to the project significantly, reaching exactly half the volume of Google's contributions. This is a significant improvement over 2018, with a 7% growth in involvement.
CollabNet slid to 3rd position (it was 2nd in 2018) with a 3% decrease in contributions. As a notable mention, however, David Pursehouse from CollabNet is still the #1 maintainer in terms of number of commits.
Even though it is outside the top-three contributing companies, SAP deserves a special mention for its continuous involvement in the JGit project, which is at the core of the Gerrit engine, and for its fantastic engagement in improving the Gerrit CI system and integrating it with the checks plugin.
Top-three achievements from GerritForge
GerritForge's outstanding contributions in 2019 were focused on three major topics.
Gerrit multi-site, released and production ready
We released the Gerrit multi-site plugin, allowing seamless balancing across a distributed environment; a technologically advanced development that is crucial for highly distributed companies. See https://gerrit.googlesource.com/plugins/multi-site for more information.
Gerrit User Summits in Europe, USA and streaming
We successfully organised and ran the Gerrit User Summits in Europe and the US. The events were very well received by the community, with an overall attendance of some 87 people on-site and 38 in streaming. Have a look at https://gitenterprise.me/2019/12/23/gerrit-user-summit-survey/ for interesting feedback from the attendees.
We opened our own local office in Sunnyvale, in the heart of Silicon Valley. A crucial move to better serve our ever-expanding US customer base.
Gerrit Analytics for the Android Open-Source Project
We kickstarted the Gerrit Analytics for the Android open-source project initiative: after the successful adoption of the automatic collection of code metrics on the Gerrit project (see https://analytics.gerrithub.io) the Android team asked GerritForge to start working on extracting the same metrics from their code.
What’s coming in 2020
Gerrit v3.2 is currently under development and is planned to be released around April/May 2020. It represents a major milestone for the Gerrit project, with support for Java 11 and large JVM heaps of up to hundreds of GBytes. Gerrit v3.2 is definitely the release that everyone with a big repository (mono-repos) should target as their next upgrade. See the Gerrit roadmap at https://www.gerritcodereview.com/roadmap.html for more details about the planned features.
More work and improvements on the checks plugin, with the aim of fully integrating it into everyone's user journey and CI/CD pipeline. Our first blog post of 2020 will show how to use Jenkins and the checks plugin together with GerritHub.io.
Multi-site and HA will become more integrated with Gerrit, with the aim of moving parts of their technologies (e.g. the global ref-db) into JGit so that they can be used in Gerrit core.
The Gerrit User Summit 2020 will continue the experiment of cross-pollination with other communities, after the success of the interactions with the JGit and OpenStack communities in 2019. Bazel is the next target, as it is used as the de-facto standard build system for Gerrit and its plugins.
Again, best wishes from your friends at GerritForge; we look forward to a continuing successful partnership in the coming years.
Luca Milanesio, Gerrit Maintainer, Release Manager and member of the ESC.
Luca Milanesio, Gerrit Code Review Maintainer and Release Manager, presented the current status of the support for multi-master and multi-site setups with the standard OpenSource Components, developed by GerritForge and the Gerrit Code Review Community.
Introduction
The focus of this talk is sharing with you the experience we have had with the Gerrit server that we maintain, GerritHub.
First of all, I’m just going to tell you how we went through the journey from a single master-slave installation back in 2013 to a fully multi-site setup across two continents.
The evolution of GerritHub to multi-site
GerritHub was born in November 2013. The idea was straightforward: take a single Gerrit server and use the replication plugin to push to GitHub.
To implement a good, scalable and reliable architecture, you don't need to design everything up front. At the beginning of your journey, you don't know who your users are, how many repos they are going to create, what the traffic looks like, what the latency looks like: you know nothing.
You need to start small, and we did that back in 2013, with a single Gerrit master located in Germany, because we had no idea where the users would come from.
Would the people in Europe like it, or rather the people in the U.S., or again the people in China? We did not know. So we started with one server in Germany.
Because we wanted to build a self-service system, what we did was very simple: a plugin called the GitHub plugin, which was just a wizard to add an entry to the replication config.
You have Gerrit incoming traffic, then you configure the replication plugin and eventually push to GitHub. The only complicated part is that, if you do it as a Gerrit administrator, you have to define these remotes in replication.config, but you can express them in an optimised way. On a self-service system, you've got thousands of people who will create thousands of remotes automatically. Luckily, the replication plugin works very well and was able to cope with it.
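To make this concrete, here is roughly what one of those generated entries looks like in replication.config. This is only a hedged sketch: the remote name, organisation and refspecs are placeholders, not the actual GerritHub configuration.

    [remote "github-example"]
        # ${name} is expanded by the replication plugin to the Gerrit project name
        url = git@github.com:example-org/${name}.git
        push = +refs/heads/*:refs/heads/*
        push = +refs/tags/*:refs/tags/*
        # limit this remote to the projects of one organisation
        projects = example-org/*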
Moving to Canada
Then we evolved. The reason why we changed is that people started saying: "Listen, GerritHub is cool, I can use it, it works well in the morning. Why is it so slow in the afternoon?". Uh oh.
We needed to do some data mining to see precisely who was using it, where they were coming from, and what operations they were doing. Then we realized that we had chosen the wrong location: we had decided to put the Gerrit master in Germany, but the majority of users were coming from the USA.
Depending on how the backbone across the Atlantic Ocean was performing, GerritHub could be faster or slower. One of the complaints was that in the morning GitHub was slower than GerritHub, but in the afternoon it was exactly the opposite.
We were doing some performance tuning and analyzing the traffic, and even when people were saying that it was very slow, GerritHub was actually a lot faster than GitHub in terms of throughput. The problem was the number of hops between the end user and GerritHub.
We decided that we needed to move from Germany to the other side of the Atlantic Ocean. We could have moved the service to the USA, but we decided to go with Canada because the latency was practically the same as hosting in the USA, but less expensive.
What we could have done is just move the master from Germany to the other side of the Atlantic Ocean, but because, from the beginning, we wanted to provide a service that is always available, we decided to keep both zones.
We didn’t want to have any downtime, even in this migration. We wanted, definitely, to do one step at a time. No changes in releases, no changes in configurations, only moving stuff around. Whenever you change something, even if it’s a small release change, you change the function, and that has to be properly communicated.
If we had changed data center and version at the same time, when something went wrong you would have doubts about what caused it. Is it the new version that is slower, or the new data center? You don't know. If you change one thing at a time, it must be that thing that isn't working.
We did the migration in two steps.
Step 1: the Gerrit master stayed in Germany, still replicating to GitHub, and the new master in Canada was just one extra replication endpoint.
The traffic was still coming to the German master, but it was replicated to both Canada and GitHub. Then, while we were doing all the testing and waiting for it to become stable, the other master was used as if it were a Gerrit slave; but it was not a slave, all the nodes were masters, just with different roles.
Step 2: flip the switch, so that the Gerrit master is in Canada. When replication was online and everything was aligned, we put a small read-only plugin on the German side, which made the whole node read-only for a few minutes, to give the last replication queue time to drain. When the replication queue was drained, we flipped the switch: traffic going to the new master was already read/write.
The people didn't change their domain and didn't notice any difference, apart from the much-improved performance. The feedback was like: "Oh my, what have you done to GerritHub since yesterday? It's so much faster. It has never been so fast before."
Because it was the same version, and we were testing in parallel for quite some time, nobody had a single disruption.
Zero-downtime migration leveraging multi-site
But that was not enough, because we wanted to keep Gerrit always up and running and always up to date with the latest and greatest version. Gerrit is typically released twice a year; however, the code-base is stable in every single commit.
However, we were still forced to play ping pong between the two data centers whenever we did a rollout. It means that every time an upgrade was done, users had a few minutes of read-only state. That was not good enough, because we wanted to release more frequently.
When you upgrade Gerrit within the same release, let's say between 2.15.4 and 2.15.5, the process is really straightforward: you just replace the .war, restart Gerrit, done.
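As a rough sketch of that minor-upgrade procedure (the paths and the init step are assumptions; adapt them to your own site layout):

    # stop the node, swap the war, re-run init, start again
    $GERRIT_SITE/bin/gerrit.sh stop
    cp gerrit-2.15.5.war $GERRIT_SITE/bin/gerrit.war
    java -jar $GERRIT_SITE/bin/gerrit.war init --batch -d $GERRIT_SITE
    $GERRIT_SITE/bin/gerrit.sh start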
However, if you don't have at least two nodes on each side, you need to ping pong between the two different data centers and then apply the read-only window, which isn't great.
We started with a second server on the Canadian side so that each node could deal with the entire traffic of GerritHub. We were not concerned about the German side, because we were just using it as disaster recovery.
Going multi-site: issues
We started doubling the Canadian side with one extra server. Of course, if you do that with version v2.14, which problems do you have?
Sessions. How do you share the sessions? If you log into one Gerrit server, you create one session; then you go to the other one and you don't have a session anymore.
Caches. That is easier to resolve: you just set the TTL of the cache to a very low value, add some stickiness, and you may sort this out. However, cache consistency is another problem and needs to be sorted as well.
Indexes are the really painful one because, at that time, there was no support for ElasticSearch. Now things are different, but back in 2017 it wasn't there.
What happens is that every single node has its own index. If an index entry is stale, it is not correct until someone re-indexes it.
The guys from Ericsson were developing a high-availability plugin. We said: instead of reinventing the wheel, why don't we use the high-availability plugin? We started rolling it out to GerritHub and, actually, the configuration became more complex and looked like this one.
So, imagine that in Canada you've got two different masters, and still only one in Germany. They keep the Lucene index and the caches consistent through the HA plugin, and they still use the replication plugin.
How do you share the repository between the two? You need a shared file system. We use exactly the same one used by Ericsson: NFS.
Then, to expose the service, we needed HAProxy; not just one but at least two. If you put in only one HAProxy, you're not HA anymore, because if that HAProxy dies, your service goes down. So you have two HAProxies, and they must have a crossed configuration: both of them can redirect traffic to the first master or to the second one. It's not one primary and one backup: they have exactly the same role. The masters do exactly the same thing, they contain exactly the same code, they've got exactly the same cache, exactly the same index. They're both running at the same time, and they're both accepting traffic.
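As a minimal sketch, each of the two identical HAProxy configurations could look roughly like this; the addresses, names and health check are invented for the example, the real GerritHub rules are more elaborate:

    frontend gerrit_http
        bind *:80
        default_backend gerrit_masters

    backend gerrit_masters
        balance roundrobin
        option httpchk GET /
        # both masters are active; either HAProxy can send traffic to either one
        server master1 10.0.0.10:8080 check
        server master2 10.0.0.11:8080 check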
This is something similar to what Martin Fick (Qualcomm) did, I believe, last year, with the only difference that they did not use HAProxy but only DNS round-robin.
Adoption of the high-availability plugin
Based on the experience of running this configuration on GerritHub, we started contributing a lot of fixes to the high-availability plugin.
A lot of people are asking “Oh, GerritHub is amazing. Can you sell me GerritHub?”. I reply with “What do you mean exactly?”
GerritHub is just a domain name that I own, with Gerrit 2.15 plus a bunch of plugins: the replication plugin, the GitHub plugin, the high-availability plugin (we use a fork), the flat-file web session plugin, and a bunch of scripts to implement the health checks.
If you want to build the same configuration on your own, you don't need to buy any commercial product. We don't sell commercial products; we just put the ideas into the OpenSource community to make them happen. Then, if you need enterprise support, we can help you implement it.
The need for a Gerrit disaster-recovery site
Then we needed to do something more because we had one problem. OVH had historically been very reliable, but, of course, shit happens sometimes.
It happened one day that the OVH network backbone was down for a few hours.
That meant that any server in that data center was absolutely unreachable. You couldn't even connect to them, you couldn't even check their status: zero. We turned the traffic over to the disaster recovery side, but then we faced a different challenge, because we had only one master there.
It means that if something had happened to that master (a peak of traffic, or whatever) and it had become a little bit unhealthy, then we would have had an outage. We didn't want to risk an outage in that situation.
Your disaster recovery site is never really safe until you actually need it and use it. So use it all the time, on a regular basis. This is what we ended up implementing.
We now have two different data centers, managed by two different cloud providers. One is still OVH in Canada, and the other is Hetzner in Germany. We still have the same configuration, the HA plugin over a shared NFS, and everything is completely replicated to the disaster recovery site. We use the disaster recovery site continuously to make sure that it is always healthy and aligned.
Leverage Gerrit-DR site for Analytics
Because we didn't want to serve actual user traffic on the disaster recovery site, due to the synchronization lag between the two sites, we ended up using it for all the data-mining activities. There are a lot of things that we do with data, trying to understand how our system performs. There is a universe of data that typically is either never looked at, or never extracted and processed in the right way.
Have you ever noticed how much data Gerrit generates under the logs directory? A tremendous amount, and that data tells you exactly the stories that you want to know. It tells you exactly how Gerrit works today, what I need to do to make sure that Gerrit will work tomorrow, how functions are performing for the end users, and whether something has blown up. It's all there.
We started working in the DevOps analytics space a long time ago, and we started providing those metrics and insights for the Gerrit Code Review project itself, reporting them back to the project through the service https://analytics.gerrithub.io.
Therefore, we started using the disaster recovery site for analytics traffic: if I extract and process data from my logs and my activities, do I really need to analyse the visits of 10 seconds ago? A small time lag in the data doesn't make any difference from an analytics perspective.
We were running Gerrit v2.15 there, so the HA plugin needed to be radically different from the one that exists today. We are still largely on our HA plugin fork, but all the changes have been pushed for review upstream to the high-availability plugin.
However, the solution was still not good enough, because there were still some problems. The problem is that the HA plugin relies on a shared file system within the same data center. We knew that within the same data center that was not an issue. But what about creating an NFS across data centers on different continents? It would never work effectively, because of the latency limitations.
We then started with a low-tech solution: rely on the replication plugin for the Git data in the repositories. Then, every 30 minutes, a cronjob checked the consistency between the two sites and performed a delta re-index between them.
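Something along these lines, with a made-up script name standing in for the real site-specific job:

    # crontab on the DR site: every 30 minutes, compare refs between the two
    # sites and re-index only the changes that differ
    */30 * * * *  /opt/gerrit/bin/check-sites-consistency.sh --delta-reindex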
Back in Gerrit v2.14, I also needed to do a database export and import, because the database contained the reviews.
But there was also a timing problem: if a disaster occurred, people would have to wait up to half an hour to get their data back after the re-index.
They would not have lost anything, because the index can be recreated at any time, but the user experience was not ideal. And you also have the DNS-related issues of moving from one zone to the other.
Sharding Gerrit across sites
First of all, we want to leverage sharding based on the repository, available from Gerrit 2.15, which includes the project name in each page and REST API URL. That allows achieving basic sharding across different nodes even with a simple OpenSource HAProxy or another workload balancer, without needing the magic of Google's intelligent Gerrit/Git proxy. Bear in mind this is not pure sharding, because all the nodes keep all the repositories available on every node. HAProxy is clever: based on the project name and the action, it will use the most appropriate backend for read and write operations. In that way, we make sure that you never push to the same branch of the same repository from two different nodes concurrently.
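As an illustration of that idea, a trimmed-down HAProxy rule set could look like the following. The URL patterns and backend names are assumptions made for the example; a real setup would additionally match the project name from the path so that each repository's writes always land on the backend that "owns" it.

    frontend gerrit_http
        bind *:80
        # writes (git push over HTTP) go to the backend that owns the repository
        acl is_push path_end /git-receive-pack
        acl is_push urlp(service) -m str git-receive-pack
        use_backend gerrit_rw if is_push
        default_backend gerrit_ro

    backend gerrit_rw
        server master1 10.0.0.10:8080 check

    backend gerrit_ro
        balance roundrobin
        server master1 10.0.0.10:8080 check
        server master2 10.0.0.11:8080 check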
How does the magic work? The replication plugin takes care of syncing the nodes. Historically, with master-slave, the synchronization was unidirectional. With multi-site instead, all masters replicate to each other: the first master replicates to the second and the other way around. Thanks to the new URL scheme, Google made our lives so much easier.
Introducing the multi-site plugin
We have already started writing a brand-new plugin called multi-site. Yes, it has a different name, because the direction it is taking is completely different in terms of design decisions compared to the high-availability plugin, which required a shared file system. Secondly, the synchronous HTTP communication with the other site was not feasible anymore. What happens if, for instance, a remote site in India is not reachable for 50 seconds? I still want events and data to be put into a persistent queue and eventually sent to the remote site.
For the multi-site broker implementation, we have decided to start with Kafka, because there is already an implementation of stream events for it in Gerrit as a plugin. The multi-site plugin, however, won't be limited to Kafka: it could be extended to support other popular brokers, such as NATS.
We will go with location-aware DNS, because we want users to access the HAProxy that is closest to them. If I live in Germany, maybe it makes sense for me to use only the European servers, but definitely not the servers in California.
You will go by default to a "location-aware" site that is closer to you. However, depending on what you do, you could still be redirected later on to another server across zones.
For example, if I want to fetch data, I can fetch everything from my local site. But if I want to push data, the operation can either be executed locally or forwarded to the remote site, if the target repository has been sharded remotely.
The advantage is location-awareness and data-locality: the majority of data transfer will happen with my local site because, at the end of the day, the major complaint of people using Gerrit with remote masters is a sluggish GUI.
If you have Gerrit master/slave, for instance a Gerrit master in San Francisco and your Gerrit slaves in India, you have the problem that everyone in India still has to access the remote GUI from the server in San Francisco, and thus would experience a very slow GUI.
Even if only 10 percent of your traffic is write operations, you would suffer all the performance penalties of using a remote server in San Francisco. However, if I have an intelligent HAProxy, with intelligent logic in Gerrit that understands where the traffic needs to go, I can always talk to the local server in my zone. Then, depending on what I do, I use my local server or a remote server. This happens behind the scenes and is completely transparent to the user.
Q&A
Q: I just wanted to ask if you're ignoring SSH?
SSH is a problem in this architecture, right, because HAProxy supports SSH, but has a big problem.
Q: You don’t know what projects, you don’t have any idea of the operation, whether it’s read or write.
Exactly. HAProxy works at the transport level, which means that it sees the flow of encrypted data going back and forth to the Gerrit server, but it cannot see anything in the clear and thus cannot make an educated decision based on it. Your data may end up being forwarded to a zone where your traffic is not optimized, and you will end up waiting a little bit longer. But bear in mind that the majority of complaints about Gerrit master/slave are not really about the speed of push/pull, but rather about the sluggish Gerrit GUI.
Of course, we’ll find as a Community a solution to the SSH problem, I would be more than happy to address that in multi-site as well.
Q: Solution is to get a better LAN because we don’t have any complaints about the GUI across the continents.
Okay, yeah, get a super-fast network across the globe, so everyone, everywhere in the world, will be super fast at accessing a single, unique, central point. However, not everyone can have a super-fast network. For instance, with GerritHub we don't own the infrastructure, so we cannot control the speed. Similarly, a lot of people are adopting cloud infrastructure, where you don't own or control the infrastructure anymore.
Q: Would you also have some Gerrit slaves in this setup? Would there be several masters and some slaves? Or will everything just become a master?
In a multi-site architecture, everything can be a master. The distinction between master and slave does not exist anymore. Functionally speaking, some of them will be masters for certain repositories and slaves for other repositories. You can say that every single node is capable of managing any type of operation. Based on this smart routing, any node can potentially handle everything, but it effectively forwards the calls based on where it is best to execute them.
If with this simple multi-site solution you can serve 99% of the traffic locally, and for the remaining one percent we accept the compromise of going a bit slower, then it could still sound very good.
We implemented this architecture last year with a client that is distributed worldwide, and it just worked, because it is very simple and very effective. There isn't any "magic product" involved, but simply standard OpenSource components.
We are on a journey, because as Martin said last year in his talk, “Don’t try to do multi-master all at once.” Every single situation is different, every client is different, every installation is different, and all your projects are different. You will always need to compromise, find the trade-off, and tailor the solution to your needs.
Q: I think in most of the cases that I saw up there, you were deploying a Gerrit master in multiple locations? If you want to deploy multi-master in the same sites, like say I want two Gerrit servers in Palo Alto, does that change the architecture at all? Basically, I am referring to shared NFS.
This is actually the point made by Martin: if you have a problem with multi-site, it's not a problem of Gerrit, it's a problem of your network, because your network is too slow. You should fix the network; you shouldn't solve this problem in Gerrit.
Actually, the architecture is this one: imagine you have a super-fast network all across the globe, and everyone reaches the same Gerrit in the same way. You have a super-fast direct connection to your shared NFS, so you can grow your master and scale horizontally.
The answer is yes, if your company can do that; in that case I wouldn't recommend doing multi-site at all. But if you cannot do anything about speed, and you have people working remotely, and unfortunately you cannot lay a cable under the sea between the U.S. and India just for your company, then maybe you want to address the problem in a multi-site fashion.
One more thing that I wanted to point out is that the multi-site plugin makes sense even on a single site. Why is this picture better than the previous one? I'm not talking about multi-site, I'm talking about multi-master in the same data center. Why is this one better?
You read and write on both: read and write traffic goes to both. The difference between this one and the previous one is that there is no shared file system here. With the previous one, even within the same data center, if you use a shared file system you still have a single point of failure. Even if you are willing to buy the most expensive shared file system, with the most expensive network and system that exists in the world, it will still fail. Reliability cannot be achieved simply by throwing more money at it.
It's not money that makes my system reliable. Machines and disks will fail one day or another; if you don't plan for failures, you're going to fail anyway.
Q: Let's say you're doing an upgrade on all of your Gerrit masters, right? Do you have an automatic mechanism to upgrade? I'm coming from a perspective where we have our Gerrit on Kubernetes. So, sure, if I'm doing a rolling upgrade, it's kind of like a piece of cake: HAProxy is taken care of by confd and all that stuff, right? In your situation, what would an upgrade look like, especially if you're trying to upgrade all the masters across the globe? Do you pick and choose which site gets upgraded first and then let the replication plugins catch up after that?
First of all, the rolling upgrade with Kubernetes works if you do minor upgrades. If you're doing major Gerrit upgrades, it doesn't, because a major upgrade changes the data. With minor upgrades, you do exactly the same here: even if the nodes in the other zones have a different version, they are still interoperable, because they talk to exactly the same schema, exactly the same API, in exactly the same way.
Q: I guess taking Kubernetes out of the picture, even, so with the current multi-master setup you have, if you’re not doing just a trivial upgrade, how do you approach that usually?
If you are doing a major version upgrade, it has to be orchestrated exactly like the GerritHub upgrade from 2.14 to 2.15. You need to use what I call the ping-pong technique, which you basically have to do across data centers.
It’s not fully automated yet. I’m trying to automate it as much as possible and contribute back to the community.
Q: In multi-master, when you're doing a major upgrade, even with the ping pong, if the schema has changed and you're adding the events to the replication plugin, are you going to temporarily suspend replication during that period of time? Because the other nodes on the earlier version don't understand that schema yet. Can you explain that a little?
Okay, when you do the ping pong that was described in this morning's presentation, what happens is: you upgrade the first node and interrupt the traffic there, you do all the testing you want; you are behind with respect to the original master, but it catches up with replication, and you do all the testing.
With regard to the replication events, they are not understood by the clients, such as the Jenkins Gerrit Trigger plugin. That point was raised this morning as well. If you go to YouTube.com/GerritForgeTV, there is the recording of my talk from last year about a new plugin that is not subject to these fluctuations of the Gerrit version.
Luca Milanesio – Gerrit Code Review Maintainer and Release Manager
My name is Martin Fick, I work at Qualcomm as a Gerrit administrator, and I have been one of the key Gerrit maintainers since its very beginning.
I am going to talk about my experience in introducing a truly multi-master Gerrit setup at Qualcomm, and about the findings and the experiences we made along the way.
Let’s start with a question so that we can make this session very interactive from the start.
Q: Who would like to have multi-master in Gerrit Code Review? Why would you need it on your server? Failover? High-availability? Both?
A: All of that, but also latency. We have developers all around the world, and we would have at least two datacentres, one in the States and another one in Europe; we were looking for a multi-site multi-master setup to decrease latency between the sites.
A: (Luca Milanesio – GerritForge) Scalability and elasticity that allow us to grow and shrink the instances based on the incoming traffic. Additionally, we would like to have zero downtime and rolling upgrades, without stopping the service when moving between minor versions of Gerrit.
Q: (Han-Wen Nienhuys – Google) Let me reverse the question: who is running multi-master and would like to have a single master?
A: (Dave Borowitz – Google) I would love it if we had a single server that was powerful enough to serve all our traffic with no latency, and we did not have to deal with slow database backends and replication lags between sites.
Q: Does anybody need multi-master? How are you surviving today if you needed it?
A: Right now we maintain a second master server for DR purposes; it is a passive standby backup server. If our master went down, we would have to fail over to it.
A: We will need multi-master soon. Currently we are experimenting with the HA plugin as a fallback solution, but we are working towards multi-master so that we can scale up, which is our primary concern here.
Gerrit Multi-Master requirements
These are the needs that I hear from the audience: high availability and scalability. They are not necessarily the same requirements, and I believe that multi-master would address some of them. What's important is that everyone focuses on their own needs and thinks about how to approach those first, instead of going for the "big everything" solution.
We have high availability as an issue somewhere, so we want better availability. We have some scalability issues as well. We do have some site issues at times, but we generally deal with those through slaves all over the world, between 50 and 100 of them, and we primarily have one data center in San Diego, USA. If that goes down, nobody gets anything done even if Gerrit is up, which makes failing over to other sites not particularly useful right now.
Our current focus is on better availability and better scalability at one site.
Scalability problems at Qualcomm
We have a pretty hefty load: we have around 6k projects, but our load is enormous on a few large projects. They are not massive projects, but they are significant from the standpoint of having many changes and a lot of users working on them. Mainly a few kernel projects are used; even though there are many other little projects, most of the users are pushing to those few projects.
We are a single-tenant server, so we are primarily dealing with that. Other people need other Gerrit servers and manage their own, but at the moment we don't have to deal with those. However, if we had a more scalable system, I believe multi-tenancy would become the next problem on the horizon. If I had a great scalable system, why am I creating these instances here and there? I would instead put all of them on one central system, so that I can manage a single system. But then I would have the problem that Gerrit doesn't currently handle multi-tenancy well. Maybe that is because we don't have a multi-master system, and so we are not ready to put everything on one server anyway. Once we have a multi-master solution, multi-tenancy becomes something more interesting.
On one server we have around 6k projects, the main project has about half a million refs, and we have 2.3M changes in total on the entire server.
I like to break the scaling topic down into two different ideas: horizontal scaling and vertical scaling. In our case we need to scale vertically more. Horizontal is where you have lots of projects and lots of users; vertical is where they are concentrated on a few projects.
Stairway to Gerrit Multi-Master
Here is what we decided to do and we decided to build it incrementally. These are the four phases in which we have approached it.
Step #1: Active/Passive
We have an active/passive standby, and we started doing regular failovers. Let's face it: if you have a failover machine that you have probably never failed over to, it will probably not work.
We have so much software around the Gerrit ecosystem: a lot of scripts, cronjobs and things that run on our server, and they use a lot of software; there are a lot of Python hooks and things like that.
All that software gets updated. If the failover machine is not updated through the same process and tested, there could be missing dependencies, wrong package versions, etc. Maybe some hacks or links on the filesystem point here and not there, or maybe the filesystem layouts are different, and things like that.
Before even dreaming about multi-master, you need a way to keep your server systems consistent. If you don't think that your organization can do that and make two systems precisely the same, you are not going to be anywhere near ready to start doing multi-master. If you are a large organization with a lot of history, it's a bigger challenge. If you are starting from scratch, it is a lot easier: think about it from the start, about how you can focus on repeatability and on deploying the same stack of software to multiple servers, and then concentrate on failing over.
Sometime this year, around February/March, we started failing over, and now we try to fail over weekly. We have been doing that to get the team used to it and to ensure that our processes work.
Step #2: Active/Hot Standby
Then we moved to hot standby. The difference is that before, the passive node had all the software in place but was not necessarily running; now we actually started the other server, but nobody was pointing to it. The hot standby gives you the opportunity to test the software you have around your Gerrit master. If you have hooks or cronjobs, try to run them on the failover server, even if it is not the active master. Do you have the coordination needed? Do you have locks in place? If you are doing repository repacking, can you do it on both servers at the same time? Do they conflict with each other?
The assumption is that you have shared repositories on a backend such as NFS, and thus activities like repacking need that coordination. You need that stuff to work before you can go to multi-master.
Step #3: Active/Active with Round-Robin DNS (low tech)
The third step, which is where we are at, is active/active. We chose a simple, low-tech and easy-to-back-out solution: round-robin DNS. It sucks from a load-balancing standpoint and for failover recovery, but it gets traffic going to two servers. We have been running that for quite some time now: 15 hours!
Step #4: Active/Active with Load Balancer(s)
And the next step, of course, is to have a load balancer. We already have a load-balancer setup that we use for slaves. It is much harder to back out of, because you have to set up IPs and have DNS in place pointing to them, but we plan to transition to that.
What Gerrit data to share in a multi-master setup?
In a single-site Gerrit multi-master setup, you necessarily need to have shared Git repositories. You don't have to share your H2 caches, because you need one set on each node; and then the ReviewDb needs to be shared, until it goes away with Gerrit v2.15.
Sharing sessions
There is other stuff you need to share, like web sessions.
When you log in with your web browser and you hit one machine, your browser keeps track that you are logged in with a cookie, while the server keeps track of it in a persisted session. If your next hit goes to the other server, you don't want the user to be kicked out, so you need to share the server session data across nodes.
In 2014 we developed the websession-flatfile plugin, which takes care of sharing web sessions, and I know that a lot of people have already used it for HA solutions. It is straightforward: it just stores the sessions on the filesystem, and if you are keeping your Git repositories on NFS, just put the sessions there too and share them.
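If I recall correctly, the only configuration the plugin needs in gerrit.config is a directory, which you point at the shared filesystem; the exact key name here is an assumption, so check the plugin documentation for your version:

    [plugin "websession-flatfile"]
        directory = /shared/gerrit/websessions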
Sharing events
The other thing that you need to share is the events data. Our events need persistence on the server side, and we share them on the filesystem too. That allows sharing them across masters as well. If you connect to one of our masters through SSH, you get the events from both masters. It's a poor man's sharing: it is a simple set of files that sits there and is shared. One thing that we realized using multi-master is that your events are possibly going to be disrupted much more often. Even if you think you have high availability, a client can only connect to one server; it doesn't matter if the cluster has ten servers, the client is still attached to just one of them.
Generally speaking, most of us want to do rolling upgrades, and that brings one server down; that is one of the reasons we want to go to multi-master. If you are doing that frequently, you are disrupting your SSH clients, and even more so the ones that are permanently connected, such as the stream-events listeners.
I have contributed a new stream events plugin, and I merged it upstream during the last hackathon, so you can download it if you want to look at it; it helps you store the events. You get the added benefit of some flags to get IDs on your events and then to replay from an ID. Since the events are stored on the filesystem, when you connect you can tell what the last ID you received was, and you will start receiving all the events after that. It does it through the same interface as the old stream-events: you drop it in, and it works. It respects Gerrit permissions, just like the core feature does, so the connecting user will see only the events of the projects and branches he has access to.
Coordinating other processes on the Gerrit nodes
As processes on our servers we have the Gerrit JVM, but we also have a lot of hooks, some of which run for a very long time. You need to get those to run on both servers. We have a few cronjobs that do repacking, others that check whether users are active or inactive in LDAP, update their status in Gerrit and check their e-mails to make sure they really are what they say they are. And lastly we have various maintenance cronjobs.
We focused on the background stuff first, because you need to get that done anyway and you can tackle it piecemeal, so we tried to coordinate those first.
We went for the low-tech solution, sharing as much as possible via the filesystem.
Years ago I created a lock implementation based on the filesystem. It is similar to echoing a PID to a file, but it is a lot more robust and can recover from deadlocks, as long as it can contact the other server using SSH, and it does all of that automatically.
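As a much-simplified illustration of the idea (the real implementation is more robust and can break stale locks by ssh-ing to the other node), a lock over a shared NFS directory can be as simple as an atomic mkdir; the paths and repository name are invented:

    # take the lock; mkdir either succeeds atomically or fails, even over NFS
    LOCK=/shared/locks/repack-kernel
    if mkdir "$LOCK" 2>/dev/null; then
        echo "$(hostname):$$" > "$LOCK/owner"   # record who holds it, for debugging
        git -C /shared/git/kernel.git repack -a -d
        rm -rf "$LOCK"                          # release
    else
        echo "another node is already repacking, skipping"
    fi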
We use that as the primary method to coordinate things, and on top of it I have built a scripting queue executor for hooks. Years ago, before multi-master, we had issues with hooks that were running for too long. If you remember how Gerrit manages hooks, they run from the Java process with a queue that executes only one at a time. If your hooks take, for instance, three minutes, you may build up a backlog on your server. If your node goes down, you lose that backlog. What we decided years ago is to have the Java process just write the hook invocation to a file, and then we have a scheduler that runs them in the background. The queue in Gerrit is then always at zero: it takes less than a second to write to the file, and then it is done.
We made that multi-master friendly, and because that queue uses the locks, it just worked when we ran it on more than one master. One server can write the hook to a file, and the other can be the one that runs it, and in that way the load gets distributed.
Gerrit cluster events
If you think about it, there are a lot of things that are hooked into startup, so I put in a little framework that checks whether the servers are running, by ssh-ing into the other nodes. That allows identifying that, if both servers are down and one comes up, your node is starting but also your cluster is starting; if the other one then comes up, you are not starting the cluster, you are only starting a node. Likewise, if one node goes down, you haven't stopped the cluster, you've only stopped the node, and if the other one also goes down, you have now stopped the cluster.
I have created a little hooks framework so that you can plug into those events and do things based on them. For example, at the startup of any node we start the hooks, just to make sure that they are running; but when you stop a node, we disable them only when the whole cluster goes down. If one node goes down, the hooks need to keep running, because they can contact the other server: they go to the domain name and not to the local server.
We plan to use the hooks framework for replication too. Currently, if you are familiar with the replication plugin, at startup it tries to replicate all the projects to all the slaves to see whether they are up to date. The theory behind it is that data could have been modified behind Gerrit's back, possibly because when the server was shut down it was in the middle of doing something and never finished it. In a cluster situation, if at any time one node goes down, it could have been in the middle of replicating, and the other node won't know. But when a single node goes down, you don't want to trigger a full replication: there is no need to do replication at startup when any node starts, only when the cluster starts.
If there is at least one node up, when the other one comes up you don't want to do a global replication. You could do it, but it would just be extra load.
Caching across the cluster nodes
You can do some of this coordination better than with the simple filesystem. For instance, the high-availability plugin does one-to-one connections: you configure the IP of one master to connect to the other, and they connect to each other. That works for two, but it is not a super-scalable system; it is an N-squared problem. Eventually, you want to move to a pub/sub system for things like that.
Something I haven't mentioned yet, which we had to deal with coordination-wise, is caching. The high-availability plugin evicts caches, but we don't do that yet; we consider it an optimization. The primary remedy is just to reduce the time-to-live on the caches, mainly decreasing it to the values that we had on our slaves. You probably have lower cache times on your slaves anyway, to make sure that you check your projects' ACLs and your group memberships a little more often than on your master. The Gerrit master knows when things change without needing to evict any cache, but the slaves do not know when things changed. We made the masters a little bit dumber, like the slaves: they know when they changed something themselves, but they don't know when the other masters did. As long as you are fine with the delay, let's say 5 minutes, it means that when you change an ACL, it may take up to 5 minutes for it to be seen by the other masters.
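In gerrit.config that simply means lowering maxAge on the relevant caches; the cache names and the 5-minute value here are only an example of the idea, not Qualcomm's actual settings:

    [cache "projects"]
        maxAge = 5 minutes
    [cache "accounts"]
        maxAge = 5 minutes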
Demo: make your laptop a multi-master Gerrit server
Has anybody tried to run two Gerrit instances on the same laptop?
The first problem that you will see is that if you point them to the same H2 caches, they will just not start. But if you run init on two different directories and you don't share any data between the two masters, then you need to go and modify your gerrit.config files to start sharing something.
Out of the box sharing abilities
You need to reconfigure the database to be the same on both, and you should use PostgreSQL or MySQL, because the default H2 won't allow any sharing. I am running 2.14 here, and I did not share the indexes, so they are going to be out of date for this demo. The user sessions are not going to be shared either, so you need to configure them manually to make them shared.
If you use the websession-flatfile plugin, it will put the sessions on the filesystem, and then you can share them via the filesystem.
Lastly, the stream events won’t be shared, but most of the rest will just work.
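For reference, the two test sites in a demo like this end up with gerrit.config files that differ essentially only in the ports; something roughly like the following, where all values are illustrative for a local 2.14 setup:

    [gerrit]
        basePath = /shared/git          ; both sites point at the same repositories
    [database]
        type = postgresql               ; the default H2 database cannot be shared
        hostname = localhost
        database = reviewdb
    [httpd]
        listenUrl = http://*:8080/      ; 8081 on the second site
    [sshd]
        listenAddress = *:29418         ; 29419 on the second site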
I am about to run stream events on both masters, on the left screen I am connecting to port 29418 and on the right screen on port 29419.
Now, if I go to the Web UI, on port 8080 I have the same master as the one on the left, and on port 8081 I have the one on the right.
I am going to create a project using the master on the left: I hit create, and you can see that the events appear in the stream events on 29418 but not on 29419.
Now let's try to go to port 8081, and I have to log in again because the web sessions are not shared yet. I am showing you what it does out of the box. The events now come up on the right screen but not on the left one.
The two masters are just pointing to the same Git data: I am trying to make them multi-master, but so far it is not working very well.
Adding plugins to share events and sessions
So, let's step in and start deploying some new plugins. What we did with the plugins is create a new one called "events", so that when you want to get the distributed events, you just call "events stream", where "events" is the name of the plugin and "stream" is the command.
What we did in core is remap the actual "gerrit stream-events" to "gerrit stream-events-core", so that you can still get to it if you want, which helps for debugging. And then we pointed "gerrit stream-events" to "events stream", so that users automatically get the new plugin events; to them it is invisible.
I am now on 8081 in the browser over here, I create a new project called "Luca", and here we go: events are generated on both the 29418 and 29419 ports. You may have noticed that they showed up on the right first and then on the left.
The events are on the filesystem, so the servers just have to poll; here I am polling every second, so if something happens on this server, it will automatically catch up on all the events generated on the other server as well.
The polling is a backup: the events should come through even without a polling mechanism. If we eventually had a pub/sub mechanism, I would still suggest keeping the polling. Events are inherently unreliable and will get lost: you need a polling mechanism in place if you want true reliability. Polling is not great, but it is reliable. Events are effectively an optimisation for speed.
Let's go back now to the other server on port 8080, and I did not have to log in because the sessions are now shared and kept across masters. I create a "hugo" project, and it comes up pretty quickly in the stream events on both masters.
So we now have shared web sessions and shared events. That's on 2.14, but to make it a true multi-master you would need to share your indexes as well. As we are still on Gerrit v2.7 at Qualcomm, we don't have an index and don't have to share it.
More exciting features in the events plugin
We've been using the events plugin in a single-master scenario since early this year, and it has been pretty reliable. I can show you some other exciting features that are in there. There are some new options: IDs and resume-after. If I pass --ids and --resume-after 0, it returns all the events since I started and initialized the instance. Even if the server has started and stopped a bunch of times, I still have all those events recorded. The ID of an event has two parts: the left part is a UUID, and the right part is a number. The idea is that the UUID identifies the event store, and the number identifies the event within it. If someone deleted the store on disk, a new one would be created. So if you had event 1M, someone removed the file store, and you asked for everything after event 1M, a numeric-only store would restart at zero and you would get nothing. With the two-part IDs, if you ask "give me all the events after ID 1M", the UUID has changed, and the plugin realizes that you need everything. That gives some extra safety. The plugin works on Gerrit v2.14, and I made some changes to make it work on v2.15 as well.
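In practice that looks something like this, where the host, port and user are placeholders:

    # replay every recorded event, with IDs, then keep streaming new ones
    ssh -p 29418 admin@gerrit.example.com events stream --ids --resume-after 0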
Q: If you use Hugo's high-availability plugin for the index, you can probably do multi-master right now. Why then are you not running multi-master? Are you concerned about NFS and ref-updates?
A: (Hugo Ares – Gerrit Maintainer) We use it in failover mode for the simple reason that, if you only have a little bit of load, you are safe and consistency will be kept. But if you push it a bit more, then sharing Git repositories from different machines in write mode doesn't work. If you have two computers writing to precisely the same Git repository, to the same branch, to the same file, at the same time, you are going to lose history, and we know because it happened to us. That's why we don't do multi-master right now.
We have been writing to the same repositories from different machines over NFS for several years, and we haven't found any issues, because we do repacking all over. Perhaps we have different NFS settings or just a different implementation.
A: It did not happen that often, but at least a couple of times, and that's why we don't do it. We mainly use the high-availability plugin for evicting the caches, so that we don't need to lower the cache settings, because entries are removed automatically; it also takes care of the events in a slightly different way and triggers the reindex on the other copies. We can keep nodes down during the day and people would not notice any difference at all, no big deal. We are just a little bit reluctant to write from both nodes, but that's the plan.
So here you are already on a "multi-master ready" setup. We hear your reports of the JGit problems, and we will keep an eye on them. I am confident we can fix any issues that show up over there, understanding how Git works on NFS. There are some tricks that we can pull from our code base.
The (in)famous NFS stale file handle bug
I know we fixed a bug in JGit in the past to handle NFS: when you delete a file and someone else tries to access it from a different node, they get a stale file handle error. That's the main problem you are running into; we encountered and fixed some of those issues in JGit, and there are possibly a few more left here and there. The workaround to the stale file handle problem is just to make a copy or a hard link, keep it around a bit longer and remove it later; then you are fine. Generally speaking, it is just the brief moment after the operation that is a problem. It is what would happen on a local file system anyway: if you delete a file that is still opened by other processes, the inode stays there for a while until it gets unreferenced. That is the main issue we have been running into, and most of the other problems have already been fixed in JGit.
Distributed GC on a multi-master setup
We have distributed Git GC and repacking relying on the SSH/filesystem lock mechanism that I talked about before. We basically keep a list of repos to be GCed on the filesystem. One node says: "I am going to do this one," and another node says: "Oh, good, then I am going to do this other one"; in this way, the GC load gets fairly well distributed, and it takes only a few hours to get through the whole thing. The lock prevents the nodes from stepping on each other's toes when repacking. As far as backup is concerned, we do an rsync and take database dumps, so that we don't get inconsistent snapshots. You need to take your database backup first and then the repositories next. If you restore an older version of the database backup, you are generally safe, because you don't risk having references to nonexistent objects.
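A hedged sketch of that backup ordering, with made-up paths and a PostgreSQL-style dump standing in for whatever database is actually in use:

    # 1. dump the database first...
    pg_dump reviewdb > /backup/reviewdb-$(date +%F).sql
    # 2. ...then snapshot the repositories; an older database with newer
    #    repositories never references objects that do not exist
    rsync -a /shared/git/ /backup/git/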
Questions
Q: I may have missed one point: the shared filesystem approach is assuming you are not doing multi-site, right? If you were doing multi-site, it would add extra latency that would introduce more problems and add more latency which was the problem you were trying to solve.
A: Yes, that’s correct. We are not interested in multi-site because there are a lot of downsides to it: it is a lot more complicated and makes your writes a lot slower while the reads would be faster. When you are pushing to gerrit-review.googlesource.com you realize that is not the best experience. You need to go to a quorum to the majority of the people around the world to get the agreement of the majority of them to finalize a git push operation. Single-site is the first initial step for us, multi-site requires a very different approach. You need to get the Git objects to be replicated and then having somewhere to store the refs that have to be shared without a system like Google’s database that assures that the refs are replicated around the world. It is not impossible; we made some experiments on it. A few years ago Dave Borowitz uploaded a ZooKeeper implementation that does the refs sharing, but that only does the refs. Then you need to do something for the Git objects replication. You could use the regular Git replication to do that, but you need to come up with a bit of magical ref scheme that does the job behind the scene.
A: (Luca Milanesio – GerritForge) Last year at the Gerrit Hackathon in Mountain View, we presented a new implementation of the “missing bit” you were talking about. It an OpenSource project based on JGit and leverages Apache Cassandra for the Git objects, while uses the ZooKeeper implementation for the refs you mentioned. We are making a lot of progress on the project and it will soon be possibly THE solution for Gerrit Multi-Site.
Q: (Luca Milanesio – GerritForge) We found out and fixed recently in JGit a problem with the cache consistency: it was reported on NFS but had nothing to do with it. When JGit was not able to open a file for whatever reason, NFS latency, maximum opened files, whatever, then the packfile was removed from the list of packs because it was considered missing. Then JGit started to failing fetching objects from the repository, and the people were panicking thinking that their Git repository got corrupted. But then, why the other nodes accessing the same repository did not have the same problem? Then “magically” the problem disappeared when you restarted Gerrit. The problem has been fixed in recent versions of JGit but is still there for the version you are currently using as the basis of Gerrit v2.7. Did you find the same problem? How do you manage to keep up with the recent releases of JGit but staying on Gerrity v2.7?
A: We have an old version of JGit, but we have put our own patches on it. I believe we discovered and fixed a bug like that years ago. We ported a series of patches upstream, and we have a set of performance improvements on our branch. However, it is sometimes hard to get stuff merged on JGit; there are not enough maintainers to even look at the changes.
Starting from this week, we are going to share one video per week of the amazing talks that were presented at the Gerrit User Summit 2017 in London.
In addition to the YouTube recording, we are extracting the text and publishing it together with the relevant pictures taken from the presenters’ slides, so that people can start digesting the content in small bites.
This week’s talk is Patrick Hiesel’s presentation on how the Gerrit multi-tenant and multi-master setup has been implemented at Google.
Gerrit@Google – Patrick Hiesel, Google
My name is Patrick, and I am going to talk about the setup of Gerrit we are running at Google. I want to take you on a journey that starts with the Gerrit you all know and turns it, step by step, into the system we run at Google; at the end we will have a multi-master and multi-tenant system.
Multi-what?
Multi-tenant is the ability to serve multiple hosts from the same single Java process. Imagine the same JVM task serving gerrit-review.googlesource.com and gerrit-chromium.googlesource.com.
Multi-master is the ability to have multiple Gerrit servers all over the world. You can contact any one of them for reads and writes.
Most systems have read replicas, which is straightforward, but write replicas are where the juicy meat is.
Multi-tenant
We have gerrit-review.googlesource.com, based on OpenSource Gerrit that you can download right now and have running, hopefully, in under ten minutes. That is core Gerrit, and it depends on three things:
JGit: for all the Git stuff
Multiple indexes for the accounts, changes, and other stuff
Caches
All these three components are based on the filesystem in one way or the other.
Now you have a friend who is accessing go-review.googlesource.com; what are you going to do?
The most natural solution is to start another Gerrit instance for it. You can have all of them on one machine and give them different ports, very easy, and in the end they’ll all be based on the filesystem.
All those Gerrit instances do not need to talk to each other; they can just be separate instances operating on separate ports. This is not a multi-tenant system, but only different Gerrit instances on the same host.
You can add another layer on top of it: a servlet engine which receives all the traffic, checks which host the traffic is for, and just delegates to the individual instance.
To take it one step further, have that selection filter do the instantiation for you. Gerrit has a daemon that runs all the functionality; you can integrate that daemon into the incoming servlet filter. When you get a request for go-review.googlesource.com and there is nowhere to allocate it yet, you can just launch a new instance, instantiate all the objects and then route the traffic there. Unloading instances works in the same way.
The Gerrit server engine and the selection filter can run in a single JVM.
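A minimal sketch of such a host-selection filter (hypothetical class and method names, assuming a standard servlet container) could look like this:

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

public class HostSelectionFilter implements Filter {
  // One lazily created Gerrit instance per host, e.g. "go-review.googlesource.com".
  private final Map<String, GerritSite> sites = new ConcurrentHashMap<>();

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    String host = ((HttpServletRequest) req).getServerName();
    // Instantiate the per-host Gerrit objects on first access.
    GerritSite site = sites.computeIfAbsent(host, GerritSite::startFor);
    site.serve(req, res);  // delegate the request to the per-host instance
  }

  @Override public void init(FilterConfig cfg) {}
  @Override public void destroy() { sites.values().forEach(GerritSite::stop); }
}

// Hypothetical wrapper around one tenant's Gerrit daemon, indexes and caches.
interface GerritSite {
  static GerritSite startFor(String host) { throw new UnsupportedOperationException("sketch only"); }
  void serve(ServletRequest req, ServletResponse res);
  void stop();
}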
How Gerrit can conquer the world
So you have a master here in Europe, and you have got one friend on the West coast in the US. He says: “Oh, your Gerrit is so slow, I have no idea why, and I wish I could move to GitHub.” You say: “Hold on, I can do better than that!” and so you put a new master close to the person on the West coast.
The key to that is replication, and it comes in two sets:
Objects that you have to replicate, which relate to all the Git data. JGit is putting objects onto the disk, and these are the data you need to replicate correctly and fast.
Other stuff that should replicate, and fast, to provide a pleasant user experience, but that really can be best-effort. That is the indexes and the caches.
If it is okay for your master to have 200-600 msec of additional latency, then do not replicate the caches. You can have a cold cache in Singapore or the US, and you can re-read them without problems.
For the indexes, replication can be best-effort, but you should make an effort to replicate them; it is nonetheless still a mandatory requirement. One way to achieve that is to use Elasticsearch, but other index implementations that provide index replication can be used as well.
Multi-tenant and Multi-master together
We talked about a multi-tenant system; if we then replicate it globally, we now have a multi-tenant and multi-master system, actually pretty close to what we run at Google.
That is the stack that we run in total. We have a selection filter and two other filters to decide where the traffic is directed. We are also based on JGit, no magic there; we have indexes and caches that we replicate, and all our systems are based on the filesystem and BigTable.
Some “magic” happens at the Git layer at Google, because that is where the majority consensus across all the cells takes place. When you are pushing anything that is Git (and, with NoteDb, anything that is a review is in the repository as well), the system tries to reach the majority of the cells and writes the objects to them. When that is acknowledged, you get a green light on the push.
Majority consensus also means that you have the data only in so many cells, but you don’t have it in all the cells all the time. Some of the replication happens in the long tail: replication events eventually get acknowledged by the remaining cells, and then the data gets written to all the masters.
Our indexes and caches are also replicated, but some of them are just in-memory, with a component that gets replicated on top of BigTable.
Redundancy everywhere.
We run five data-centers across three continents (Americas, Europe, Asia-Pacific), with precisely the stack we just saw which gives us a good latency for most of our users worldwide.
Let’s talk about load balancing. We have a system that is multi-master and multi-tenant, and any of the tasks can serve any requests, but just because it can, doesn’t mean that it should.
Maybe it has cold in-memory caches, or it is in Singapore and you are in the US; and what if the biggest machine is not big enough? So the question is how we optimise this.
The idea is that we want to reduce latency that comes out of cold caches and minimize the time the site takes to load.
So you have a request for gerrit-review.googlesource.com, and your instance has a cold cache, and you need to read from disk to memory to serve the request.
You have a fleet of 300 tasks available, but you want to serve gerrit-review.googlesource.com from only five of them. If you serve 300 requests from 300 tasks in a round-robin manner, you pay the latency to load data from disk to memory for every single request. The second motivation is that you want to distribute the load.
We want a system that can dynamically scale with changing load patterns. We want a system that can optimise the caches, sending requests for a site/repo to a small number of servers and tasks based on two conditions:
serve from one machine as long as it fits on it
serve from more machines if you really have to
Level-1 load balancer
In the stack at Google, you saw two load-balancing levels. What you see at the bottom of the picture are the Gerrit tasks, which contain all the software layers we talked about. We have a user that triggers JSON calls from the browser, with PolyGerrit. The first thing that the JSON request hits is the L1 load balancer. The primary routing of your request is by geographic proximity. We have five datacentres at Google; the L1 load balancer picks the one with the lowest latency. When the request goes into the data-center, we have another load balancer, which is the one I’ll talk about more, because this is the one where the Gerrit specifics happen.
One thing that L1 is doing as well is managing the spillover of traffic. When a datacentre says “I can handle up to 100 QPS” the L1 load balancer starts redirecting traffic to other datacentres should that threshold be reached.
Level 2 load balancer
Let’s dive into the L2 load balancer: we want to know how much traffic we are getting into each Gerrit task, we want the load balancer to know where each single request should go, and we want to know that fast!
We added three new components to the architecture:
An element added to the tasks that can report the load we are handling right now. When I say load, it can be anything: QPS (Queries Per Second) or other metrics; we just want to know from the tasks: what is your current load? We have a system called the slicer, which I am going to talk about in a second, and it is added there in the picture.
A second component we are adding to the load balancer, with a query interface that answers the following question: “we have a request for gerrit-review.googlesource.com, where should it go?”. All of that should be answered from memory and regularly updated in the background with the new assignments, so that we don’t add another source of latency by making another RPC.
A third component coordinates everything and is called the assigner: it takes all the load metrics that were reported, generates new assignments and gives them to the query interface.
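Roughly, the three roles could be sketched as the following interfaces (hypothetical names, just to make the division of responsibilities explicit):

import java.util.List;
import java.util.Map;

interface LoadReporter {
  // Implemented by every Gerrit task: any metric works (QPS, CPU, etc.).
  double currentLoad();
}

interface AssignmentQuery {
  // Answered purely from memory: "a request for this host arrived, where should it go?"
  List<String> tasksFor(String host);
}

interface Assigner {
  // Periodically recomputes host -> tasks assignments from the reported loads
  // and pushes them to the query interface in the background.
  Map<String, List<String>> reassign(Map<String, Double> loadByTask);
}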
Introducing the slicer
We have a system called the slicer. There is a very nice paper about it that I can recommend, published last fall. It is a load balancer that works on custom keys and can do automatic re-sharding based on new traffic patterns. When your nodes receive more traffic, the slicer will automatically distribute the load or re-shard the whole system. That is a suitable method for local sharding that happens within the data center; we do not use it for inter-datacentre balancing because that is all done via geographic proximity.
The system works with 64-bit keys, which gives you a lot of combinations. You can slice the keyspace, for instance, into 400 slices. That gives you 400 ranges, and you can take any of them and assign it to one or more tasks. The hostname is my key, for instance; you hash it, and you end up in the first slice, which gets assigned to a single task with index zero.
What can we do if the load changes? Let’s say that you have key zero that gets assigned to the first range and then the traffic changes. We have two options.
The first option is to assign more tasks, let’s say task 6, and then you round-robin between task 0 and task 6.
The second option is splitting into 600 or 800 slices to get a better grip on each of the keys.
We can also do that, and then we factor out the load for gerrit-review.googlesource.com and go-review.googlesource.com, and we put them into different hosts.
We do that for Gerrit, and one of the things we want for Gerrit, when we have to split per-host traffic, is affinity on the repository. Caches are based on the project, and because gerrit-android.googlesource.com is a massive host served from a lot of tasks, we don’t want all these tasks to just serve all of the general traffic for Android. We want tasks serving android/project1 from here and android/project2 from there, so that we optimise the second layer of caches.
What we do is mangle these keys together based on both host and project. Before, the whole chromium keyspace was served as a single host; when the load increases, we just split the keys into the Chromium source and the rest of the metadata. This is the graph that we obtained after we implemented the load balancer. The load we have on each of the tasks in a single data center is represented by a line of a different colour. What you can see is that they are all nicely aligned, so that each task is serving precisely the same amount of traffic, which is what you want.
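As a rough illustration of the host-and-project key mangling (a simplified sketch, not Google’s actual slicer), mapping a (host, project) pair onto a sliced 64-bit keyspace might look like this:

import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

class SliceRouter {
  private final int slices;                              // e.g. 400 slices over the 64-bit keyspace
  private final Map<Integer, List<String>> assignment;   // slice index -> task names (from the assigner)

  SliceRouter(int slices, Map<Integer, List<String>> assignment) {
    this.slices = slices;
    this.assignment = assignment;
  }

  // Mangle host and project together so repositories of a big host land in different slices.
  static long keyOf(String host, String project) {
    byte[] data = (host + "/" + project).getBytes(StandardCharsets.UTF_8);
    long h = 1125899906842597L;                 // simple polynomial hash, illustration only
    for (byte b : data) h = 31 * h + b;
    return h;
  }

  List<String> tasksFor(String host, String project) {
    // Map the 64-bit key onto one of the slices, then return its assigned tasks
    // (round-robin between those tasks is left to the caller).
    int slice = (int) Math.floorMod(keyOf(host, project), (long) slices);
    return assignment.getOrDefault(slice, List.of());
  }
}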
What if one project is 100 times the size of the others and we are optimising on queries per second? The system will just burn resources fast. We had that situation in May: we saw the graphs and said “all good, looks nice”; however, people were sending e-mails and raising bugs wondering if the system was serving any traffic at all or if it was down completely.
It turned out that Android had a lot of large repositories, in terms of the number of references and of objects. We were just optimising for queries per second, but some of the tasks were doing only CPU-intensive work while others had it easy. Some of them were burning CPU in flames, and others were fine.
So we moved out of the per-request affinity, and we modified the per-repository sharding to optimise all of this.
Warm vs. Cold Cache
There is an extra feature in the system, which is pre-warming caches. What the load balancer can do for you is tell you that traffic is changing and that it needs to reconsider how to split the load on the system. It is going to tell each of the tasks “I’m going to give you traffic for gerrit-review.googlesource.com” with 30 seconds’ notice. That time you can use to pre-warm the caches.
That is especially nice if you restart your tasks, because all the associated in-memory caches get flushed. The load balancer tells you “oh, this is the list of the tasks I need”, and then you can get them all and pre-warm their caches. This graph shows the impact of the cache warmer on our system, on the 99.9th percentile request latency, really the long tail of request latency. That looks nice because we brought the latency down by a third.
What if a task starts dying during peak traffic? Imagine that the load balancer is saying “You’re going to handle this” and two seconds later says “I have to reconsider, you’re going to handle that instead”. Again, you’re going to watch your system burning, because you’re serving peak traffic and already running close to 100% CPU. That situation causes the load balancer to load and unload tasks all the time, which is inconvenient. The way we work around this is to make the cache warming a best-effort activity: you can do it if you’re below 50% CPU, when you have time to do fancy things, but when you receive peak traffic, you just handle peak traffic without any optimisation.
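In code, the best-effort rule could be as simple as this sketch (hypothetical interfaces, with the 50% CPU threshold taken from the talk):

class CacheWarmer {
  private static final double CPU_HEADROOM_THRESHOLD = 0.5;  // 50% CPU, as described above

  private final SystemLoad load;    // hypothetical source of the task's current CPU usage (0.0-1.0)
  private final HostCaches caches;  // hypothetical handle to the per-host caches

  CacheWarmer(SystemLoad load, HostCaches caches) {
    this.load = load;
    this.caches = caches;
  }

  // Called when the load balancer announces "you will receive traffic for this host in ~30 seconds".
  void onUpcomingAssignment(String host) {
    if (load.cpuUsage() < CPU_HEADROOM_THRESHOLD) {
      caches.preload(host);   // warm the caches while there is spare capacity
    }
    // Under peak traffic we simply skip the optimisation and serve requests as they come.
  }
}

interface SystemLoad { double cpuUsage(); }
interface HostCaches { void preload(String host); }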
Multi-master and multi-tenant outside of Google
The question is: how do we do that in a non-Google setup?
There are plenty of options.
With the new Gerrit release 2.15, we introduced a new URL scheme which includes the project name in the URL. Previously you had gerrit-review.googlesource.com/c/NNN, and there was no way to directly know which project it was for, and therefore no way to do the load balancing that we just saw.
What we did in 2.15 is just add the project before the change number, so that both host and project can be extracted in a simpler way. You could do the same even before v2.15, but you needed a secondary index lookup, which most open-source load balancers such as HAProxy or NGINX did not support. And of course, there are lots of products, like the Google Cloud load balancer and others, that you can use to achieve the same thing.
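For example, with the 2.15-style change URLs, a routing layer can pull out the host and the project with a simple parser; here is a minimal sketch (hypothetical class, assuming the /c/&lt;project&gt;/+/&lt;number&gt; scheme):

import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChangeUrlParser {
  // /c/<project>/+/<change-number>; the project name may itself contain slashes.
  private static final Pattern CHANGE_PATH = Pattern.compile("^/c/(.+)/\\+/\\d+");

  static Optional<String> projectOf(String url) {
    URI uri = URI.create(url);
    Matcher m = CHANGE_PATH.matcher(uri.getPath());
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    String url = "https://gerrit-review.googlesource.com/c/gerrit/+/12345";
    System.out.println(URI.create(url).getHost());          // gerrit-review.googlesource.com
    System.out.println(projectOf(url).orElse("<unknown>")); // gerrit
  }
}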
Wrap-up
We went through a journey where we took OpenSource Gerrit, we added sites selection and got a multi-tenant Gerrit.
Then we took this multi-tenant Gerrit, added replication and obtained multi-master Gerrit.
And then we took that with load balancing and lots of failures and lots of fixes, and we got pretty much the Gerrit that we run at Google, which brings me to the end of this talk 🙂
Q&A
Q: How strategic is Gerrit@Google? Do you have any other code-review systems? If yes, how is Gerrit used vs. the others?
We have another code-review system for internal use only, and Gerrit is used whenever we are doing OpenSource stuff, so for GoLang, Chromium, Android, Gerrit, and whenever the Google Team wants to collaborate with other OpenSource users, or in general with users that are not sitting at Google.
Historically the source at Google was developed in Perforce, and we ported from that to a home-grown system called Piper. Around that, we have a tooling ecosystem which is internal. In parallel to that, Google started a lot of projects that have nothing to do with the internal search engine and are available outside. What we see is that a lot of projects started at Google from scratch were asking themselves “what system should we use?”. Many people said: “well, we’re just going to use Git because that’s what we know and like,” and when they needed code review for Git they ended up with us. Gerrit and Git are very popular inside Google.
Q. You have two levels of load balancer. The first one picks the location, and the second one decides what to do inside the data-center. What if a location is off? Maybe it is not fully off-line but has big problems, or has a very low percentage of consensus, and some of the locations do not have the “latest and greatest” of the repo. Possibly a location that should be “inconvenient” for me actually has the data I want.
You’re talking about the replication layer, where you have the objects in one location but not in the other. Our replication latency is in the order of seconds, but it may happen that one location is just really slow in getting the objects. That happens from time to time, and we have metrics that tell us what the replication lag is. When it exceeds a threshold, we just shut the data-center off, which means we cut off the traffic: the data-center will not receive user traffic anymore, but it will still be able to get the replication done, and when the backlog of objects to replicate decreases, we can send traffic to it again.
Cutting off the traffic happens at the L1 load balancer, where we say “don’t send anything there”.
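A minimal sketch of that lag-based cutoff (hypothetical names and an illustrative threshold, since the talk does not give the exact value) might look like this:

import java.time.Duration;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ReplicationLagCutoff {
  private static final Duration MAX_LAG = Duration.ofSeconds(30);  // illustrative threshold only
  private final Set<String> drained = ConcurrentHashMap.newKeySet();

  // Called periodically with each data-center's measured replication lag.
  void onLagReport(String dataCenter, Duration lag) {
    if (lag.compareTo(MAX_LAG) > 0) {
      drained.add(dataCenter);      // stop routing user traffic there; replication keeps running
    } else {
      drained.remove(dataCenter);   // lag recovered, user traffic can resume
    }
  }

  boolean acceptsUserTraffic(String dataCenter) {
    return !drained.contains(dataCenter);
  }
}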
Q. Do all the tasks have the same setup? Or do you have a sort of micro-service architecture inside, where some of the tasks are more dedicated to one type of operation and others to another type? Serving data from memory is one thing, but calculating a change diff is a different type of task.
Not in general. All of our tasks are the same, except for checking access-control permissions: there we do not go through the whole Gerrit stack, but we have only this little task that knows how the project config works and is going to tell us yes or no.