Gerrit: 2021 in review

Yet another year has passed for the Gerrit Code Review project with many challenges posed by the COVID-19 pandemic, new exciting releases, and the most popular Gerrit User Summit with the largest audience ever in its 12 years of history.

2021 in numbers

  • 93 registered attendees to the Gerrit Virtual User Summit 2021, connecting from 56 companies over 17 countries, 14 talks showcased by 15 presenters over 2 days
  • 1 Gerrit Contributors’ Summit
  • 35 releases of which 2 major versions (v3.4.0 and v3.5.0.1) and 33 patches
  • 107 contributors from 32 organizations, merging 4763 changes to 84 projects

The Gerrit Code Review community has shown resiliency during these difficult times, with outstanding participation in the events organized during the year, all remote and lacking the much-needed face-to-face interaction.

  • Commits: -26%
  • Projects: -16%
  • Contributors: -30%
  • Companies: -41%
  • Average changes/contributor: +10%

The engagement has paid its toll after two years of pandemics with fewer organizations willing to invest time in contributing to Gerrit, possibly also impacted by the uncertainty of the future. 2021 has also been the first whole year of the project without David Pursehouse, one of the Gerrit project’s top #3 contributors. He was used to contributing 1.5k changes per year, which would alone easily justify the drop observed.

On the bright side, the contributors that continued over the year 2021 have shown an increased commitment as the number of active projects and commits has dropped less than the contributors, increasing the change/contributor rate compared to 2020.

Major organisations contributing to Gerrit in 2021

Google is confirmed to be the leading force of the Gerrit Code Review project, with over 62% of the changes merged, while GerritForge continues to be the #1 top contributor from the rest of the community. There are a couple of pleasant special surprises from the contributors.

  • Wikimedia Foundation confirmed to be the #2 top contributor from the community, all provided by Paladox who has been awarded Gerrit Maintainer in November.
  • SAP continues to be a strong contributor, just below Wikimedia Foundation, with Thomas being awarded Gerrit Maintainer in November.
  • Qualcomm is back on the shortlist of the top maintainers, with many new names in the list of contributors, well done!

Top-ten projects with major activity in 2021

  1. gerrit (2,903 changes)
  2. plugins/code-owners (447 changes)
  3. jgit (287 changes)
  4. plugins/task (83 changes)
  5. plugins/multi-site (57 changes)
  6. aws-gerrit (44 changes)
  7. modules/cache-chroniclemap (40 changes)
  8. plugins/checks (39 changes)
  9. plugins/high-availability (39 changes)
  10. plugins/replication (38 changes)

The first surprise is that the code-owners, the emerging star of the Gerrit plugins, received a massive investment of effort from Edwin (Google), who contributed 89% of the changes to it. The code-owners plugin has also been presented at the Gerrit Virtual User Summit 2021 and attracted the community’s attention.

The second surprise is the decline in contributions in the jgit project during the past two years: from 820 changes/year is now down to 374 changes in 2021.

Task is now the #2 plugin project in terms of merged changes in 2021. Qualcomm keeps the project’s full ownership with 98.9% of changes in 2021.

GerritForge confirm their commitment to improving Gerrit Multi-Site, as its plugin is the #3 in terms of changes merged in 2021.

Aws-gerrit is a relatively new project, presented less than two years ago and contributed by GerritForge, who contributed over 99% of the changes. It confirms to be a very active project that has helped the Gerrit Code Review open-source project deploy and test well-known “recipes” of infrastructure setups and see how Gerrit performs and works on those. Many bugs have been detected before the release and identified by the aws-gerrit project and CI integration.

The cache-chroniclemap module confirms to be very active in 2021, with 40 changes all provided by GerritForge. This relatively new module allows existing Gerrit setups to increase the overall performance of all persistent caches, which are vital in reducing the REST-API latency across all Gerrit features.

The checks plugin was deprecated back in 2020. However, it still shows significant changes and investment from Google in supporting the new Gerrit checks-API and UI. However, the rate of contributions is in stiff decline, down from the 324 changes in 2019 when it was still an actively developed project.

The last two plugins projects in the top tens are the replication and high-availability plugins, which has received major contributions from Qualcomm, GerritForge, Google and Ericsson.

Top events in 2021

The Gerrit Code Review community abandoned the idea of a face-to-face event in 2021 because of the continued global pandemic of COVID-19.
Instead, there were two separate virtual events for sharing the news of what is happening on the platform and the expectations from the community.

Virtual Gerrit Contributors’ Summit – 9th of June

The summit was organized by the Gerrit Community Managers and had an amazing audience amongst the contributors. The presentations showed what different teams are working on and reported into the summit notes:

Gerrit Virtual User Summit 2021 – 2-3 of December

It was the first experiment of an entirely Virtual User Summit of the Gerrit Code Review project history. The challenges were multiple, including the limitations of allowing up to 100s of attendees, shortening the overall time to 3h x 2 days, and still allowing some interactions between the audience and the presenters. After two years of silence, we have finally received some user stories of using Gerrit in the wild.

The Summit has received vast overall positive feedback and rated 7.9/10, making it a fantastic achievement. The quality and interest of the talks were scored even higher, reaching 8.2/10.

The talks have been fully recorded and published on the GerritForge TV channel:

It was definitely a lot of information and sharing, which showed that the Gerrit Code Review open-source project is alive and active more than ever.

Gerrit features highlights in 2021

Gerrit Code Review has major innovations developed and decisions made over 2021. See below a short recap of the ones that represent a turning point in the evolution of the Gerrit open-source project. Some of them are considered breaking changes and, therefore, need careful analysis and a planned upgrade path.

Speed up of Gerrit upgrade from v2.7 to the latest version

2021 has seen a significant increase in the cooperation and contributions of Qualcomm to the rest of the Gerrit Code Review community, focussed on the speed-up of the Gerrit upgrade process from v2.7 to v3.5.
The contributions and cooperation have brought many improvements to JGit and Gerrit and will allow many more companies to migrate faster and smoother than ever before.

Goodbye to Java 8

From Gerrit v3.5 onwards, the source code and binaries of Gerrit Code Review won’t be compatible with Java 8 anymore.

JSch SSH library is completed removed from Gerrit Code Review

The quirks and obsolescence of the JSch library has cursed Gerrit’s destiny for years. Thanks to Thomas Wolf (Paranor) JGit moved away from it and rebuilt all its Git/SSH stack on top of Apache Mina. That has allowed to remove the JSch library from the Gerrit dependencies and used the Apache Mina SSHD client stack instead.

ElasticSearch is removed from Gerrit Code Review

On the 2nd of February 2021, Elasticsearch B.V. changed its license model and abandoned the Apache 2.0 open-source license for the new versions of ElasticSearch v8 and over.

Gerrit cannot include or require any commercial product not released under one of the open-source licenses allowed by the project. The ElasticSearch backend has not been widely used in the community anyway, based on a recent survey sent to the community therefore the ESC decided on the 3rd of November that the ElasticSearch backend will be removed from Gerrit core and moved into a libModule.

Submit Requirements waving goodbye to Prolog

The Gerrit Code Review project does not use anymore Prolog rules for the submit rules of the project from the 16th of December. The support for Prolog-less submit rules is now mature and it will be part of the forthcoming v3.6 release in 2022.

What’s coming in 2022?

The future of Gerrit Code Review is bright and full of innovative ideas and improvements on the overall development and CI/CD lifecycle. With the forced remote working of millions of developers worldwide, more and more companies are looking on how to make remote interactions more useful and fruitful, reducing frictions and making the workflow smoother and faster more than ever.

Stay tuned and keep on using and contributing to Gerrit Code Review, one of the most innovative and productive platforms for code review and collaboration.

Happy New Year, Gerrit Code Review

It has been a hectic and productive year for ourselves at GerritForge and the Gerrit Code Review Community.
We want to take this opportunity to recap some of the milestones of the 2019 and the exciting perspectives for 2020 and beyond.

Gerrit Code Review, 2019 in numbers

gerrit-2019-commits.png

Gerrit had over 120+ contributors from all around the world coming from 33 different companies and organisations, which is excellent. There is a robust 6% increase in the number of commits (+231 commits) but a reduction in the number of contributors (-7 authors).

With regards to the overall trend of commits during the year, the success of the Gerrit User Summit 2019 in Sunnyvale is visible, with an increase of the rate of commits around October/November.

Top-three projects of the 2019

  1. Gerrit (1,626 commits) is, of course, the most active project. However, it is visibly down in terms of number of commits from 2018 (-19%). That is a consequence of the shift of focus to the other two key components listed below, which are available as plugins and then not accounted for the overall gerrit core repository statistics.
  2. Checks (315 commits) is the brand-new 1st class CI integration API for external build systems, such as Jenkins and Zuul. It is incredible how in just 12 months it has become robust and fully mature. It is currently used for the validation of all changes on the Gerrit project.
  3. Multi-site (234 commits) is the long-awaited support for Gerrit that everyone has been waiting for years. It is finally available for all active and supported versions (from 2.16+ onwards).

Top-three companies contributing to Gerrit

gerrit-contributors-2019.png

  1. Google is, with no surprise, still the top contributor of the Gerrit project overall. It is basically stable from 2018 (around 43%) as a confirmation of the continued commitment to the project.
  2. GerritForge is growing significantly in the contribution to the project, with exactly half of the contributions of Google. This is a significant result from 2018 with a 7% growth of involvement.
  3. CollabNet is sliding to the 3rd position (it was 2nd in 2018) with a 3% decrease of contributions. As noticeable mention, however, David Pursehouse from CollabNet is still the number #1 maintainer in terms of number of commits.

Even if it is outside the top#3 contributors companies, SAP deserves a special mention for its continuous involvement in the JGit project, which is at the basis of Gerrit engine, and its fantastic engagement in improving the Gerrit CI system and integrating it with the checks plugin.

Top-three achievements from GerritForge

The outstanding results of contributions of GerritForge in 2019 have been focused on three major topics.

Gerrit multi-site, released and production ready

We released the Gerrit Multi-Site plugin, allowing seamless balancing in a distributed environment, a technologically highly advanced development, crucial for very distributed companies. See https://gerrit.googlesource.com/plugins/multi-site for more information.

Gerrit User Summits in Europe, USA and streaming

We successfully organised and executed the Gerrit User Group in Europe and the US. The event was very well received by the community with an overall attendance of some 87 on-site and 38 in streaming. Have a look at https://gitenterprise.me/2019/12/23/gerrit-user-summit-survey/ for interesting feedback on those from the attendees.
We opened our own local office in Sunnyvale, in the heart of Silicon Valley. A crucial move to better serve our ever-expanding US customer base.

Gerrit Analytics for the Android Open-Source Project

We kickstarted the Gerrit Analytics for the Android open-source project initiative: after the successful adoption of the automatic collection of code metrics on the Gerrit project (see https://analytics.gerrithub.io) the Android team asked GerritForge to start working on extracting the same metrics from their code.

What’s coming in 2020

Gerrit v3.2 is currently under development and it is planned to be released around April/May 2020. It represents a major milestone for the Gerrit project with the support for Java 11 and large JVM heaps, up to hundreds of GBytes. Gerrit v3.2 is definitely the release that everyone that has a big repository (mono-repos) should target as next upgrade. See the Gerrit .roadmap at https://www.gerritcodereview.com/roadmap.html for more details about the planned features.

More work and improvements on the checks plugin, with the aim of fully integrating it into everyone’s user-journey and their CI/CD pipeline. Our first blog-post of 2020 will be how to use Jenkins and Checks plugin together with GerritHub.io.

Multi-site and HA will become more integrated with Gerrit, with the aim of moving parts of their technologies (e.g. global ref-db) into JGit and thus used in Gerrit core.

The Gerrit User Summit 2020 will continue the experiment of cross-pollination with other communities, after the success of the interactions with the JGit and OpenStack communities in 2019. Bazel is the next target, as it is used as the de-facto standard build system for Gerrit and its plugins.


 

Again, Best wishes from your friends at GerritForge and looking forward to a continuing successful partnership in the coming years.

Luca Milanesio
Gerrit Maintainer, Release Manager and member of the ESC.

Gerrit: OpenSource and Multi-Site

One more recording from the Gerrit User Summit 2018 at Cloudera in Palo Alto.

Luca Milanesio, Gerrit Code Review Maintainer and Release Manager, presented the current status of the support for multi-master and multi-site setups with the standard OpenSource Components, developed by GerritForge and the Gerrit Code Review Community.

Introduction

The focus of this talk is sharing with you one experience that we did with the Gerrit server that we maintained, GerritHub.

First of all, I’m just going to tell you how we went through the journey from a single master-slave installation back in 2013 to a fully multi-site setup across two continents.

The evolution of GerritHub to multi-site

GerritHub was born in November 2013. The idea was straightforward. It was just an idea on how to take a single Gerrit server and put the replication plug-in to push to GitHub.

To implement a good and scalable and reliable architecture, you don’t need to design everything up front. At the beginning of your journey, you don’t know who your users are, how many repos are going to create, what the traffic looks like, what the latency looks like: you know nothing.

You need to start small, and we did back in 2013, with a single Gerrit master located in Germany, because we had no idea of where the users would have come from.

Would the people in Europe like it, or rather the people in the U.S. like it, or again the people in China like it? We did not know. So we started with one in Germany.

Screenshot 2019-03-03 at 00.07.01

Because we wanted to make a self-service system what we did was very simple: a simple plugin called, “The GitHub plug-in”. That was just a wizard to add an entry in the replication config.

You have Gerrit incoming traffic, then you configure replication, plugin and eventually push to GitHub. The only complicated part here is that if you do it as a Gerrit administrator you have to define these remotes in the replication.config but you can express it in an optimized way. On a self-service system, you’ve got 1000s of people then will create 1000s remotes automatically. Luckily, the replication plugin works very well and was able to cope with it very well.

Moving to Canada

Then we evolved. The reason why we changed is that people started saying ‘Listen, GerritHub is cool, I can use it, works well in the morning. Why in the afternoon is so slow?“. Uh oh.

We needed to do some data mining to see precisely who was using it, where they were coming from, and what operations they were doing. Then we realized that we had chosen the wrong location because we decided that we wanted to put the Gerrit master in Germany, but the majority of people are coming from the USA.

Depending on how the backbone between the Atlantic Ocean was performing, GerritHub could be faster or slower. One of the complaints that they were saying is that in the morning, GitHub was slower than GerritHub, but in the afternoon it was exactly the opposite.

We were doing some performance tuning and analyzing the traffic, and even when people were saying that it’s very slow, actually GerritHub was a lot faster than GitHub in terms of throughput. The problem was the number of hops between the end user and GerritHub.

We decided that we needed to move from Germany to the other side of the Atlantic Ocean. We could have done to move the service to the USA but we decided to go with Canada because the latency was precisely the same as hosting in the USA but less expensive.

Screenshot 2019-03-02 at 01.19.34

What we could have done is just to move the Master from Germany to the other side of the Atlantic Ocean, but because, from the beginning, we wanted to give a service that is always available, we decided to keep both zones.

We didn’t want to have any downtime, even in this migration. We wanted, definitely, to do one step at a time. No changes in releases, no changes in configurations, only moving stuff around. Whenever you change something, even if it’s a small release change, you change the function, and that has to be properly communicated.

If we were changing data center and version, when something goes wrong, you would have the doubt of what it is. Would it be the new version that is slower or the new data center that is slower? You don’t know. If you change one thing at a time, it must be that thing that wasn’t working.

We did the migration in two steps.

  • Step-1: The Gerrit master in Germany, still the replication to GitHub, and the new master in Canada was just one extra replication end.

    The traffic was still coming on this side of the master, but it was replicated to both Canada and the other GitHub. Then, when that was stable, so we were doing all the testing, the other master was used as it was at Gerrit slave, but was not a slave, all the nodes were master, with just a different role.

  • Step-2: Flip the switch the Gerrit master is in Canada. When replication was online and everything was aligned, we have put a small read-only plug on the German side, which was making the whole node read-only for a few minutes, to give time to the last replication queue to drain.When the replication queue was drained, we flipped the switch, when it was going to the new master it was already read/write.

The people didn’t change their domain, didn’t notice any difference, apart from the much-improved performance. The feedback was like “Oh my, what have you done to GerritHub since yesterday? It’s so much faster. It has been never so fast before.
Because it was the same version, and we were testing in parallel for quite some time, nobody had a single disruption.

Zero-downtime migration leveraging multi-site

But that was not enough, because we wanted to keep Gerrit always up and running and always up to date with the latest and greatest version. Gerrit is typically released twice a year; however, the code-base is stable in every single commit.
However, we were still forced to do the ping pong between the two data centers when we were doing our roll out. It means that every time that an upgrade was done, users had a few minutes of read-only state. That was not good enough, because we wanted to release more frequently.

When you upgrade Gerrit within the same release, let’s say between 2.15.4 and 2.15.5, the process is really straight forward, because you just replace the .war, restart Gerrit, done.

However, If you don’t have at least two nodes on either side, you need to ping pong between the two different data centers, then apply the read-only window, which isn’t great.
We started with a second server on the central node so each node can deal with the entire traffic of GerritHub. We were not concerned about the German side, because we were just using it as disaster recovery.

Going multi-site: issues

We started doubling the Canadian side with one extra server. Of course, if you do that with version v2.14 which problem do you have?

  • Sessions. So, how do you share the sessions? If you login into one Gerrit server, you create one session, then you go to the other and you don’t have have a session anymore.
  • Caches. That is easier to resolve, you just put the TTL of the cache to a very low value, put some stickiness, you may sort this out. However, cache consistency is another problem and needs to be sorted.
  • Indexes are the very painful one, because, at that time there was no support for ElasticSearch. Now things are different, but back in 2017, it wasn’t there.
    What happens is that every single node has its own index. If an index entry is stale, it’s not right until someone is going to re-index.

The guys from Ericsson were developing a high availability plugin. We said instead of reinventing the wheel, why don’t we use a high availability plugin? We started rolling it out to GerritHub and actually, the configuration is more complex and looks like this one.

So, imagine you’ve got in Canada you’ve got two different masters, still only one in Germany. They align the consistency of the Lucene index and the cache through the HA plugin and they still use the replication plugin.

How do you share the repository between the two? You need to have a shared file system. We use what exactly the same used by Ericsson, NFS.

Screenshot 2019-03-02 at 01.21.36

Then, for exposing the service we needed HAProxy, not just one but at least two. If you put one HAProxy, you’re not HA anymore, because if that HA proxy dies, your service goes down. So, you have two HAProxy, they must have a cross configuration’s, it means that both of them, they can redirect traffic to one master or the second master, it’s not one primary and the second backup: they have exactly the same role. They do exactly the same thing, they contain exactly the same code, they’ve got exactly the same cache, exactly the same index. They’re both running at the same time. They’re both accepting traffic.

This is something similar to what Martin Fick (Qualcomm) did, I believe, last year, with the only difference that they did not use HAProxy but only DNS round-robin.

Adoption of the high-availability plugin

Based on the experience of running this configuration on GerritHub, we started contributing a lot of fixes to the high-availability plugin.

A lot of people are asking “Oh, GerritHub is amazing. Can you sell me GerritHub?”. I reply with “What do you mean exactly?”

GerritHub is just a domain name that I own, with Gerrit 2.15, plus a bunch of plugins: replication plugin, GitHub plugin, the high-availability plugin (we use a fork), the web session flat file and a bunch of scripts to implement the health check.

If you guys want to do your own, the same configuration, you don’t need to buy any commercial product. We don’t sell commercial products. We just put the ideas into the OpenSource community to make it happen. Then if you need Enterprise Support, we can help you implement it.

The need for a Gerrit disaster-recovery site

Then we needed to do something more because we had one problem. OVH had historically been very reliable, but, of course, shit happens sometimes.
It happened one day that OVH network backbone was down for a few hours.

That means that any server that was on that Data-Center was absolutely unreachable. You couldn’t even connect to them, you couldn’t even check their status, zero. We turned the traffic to the disaster recovery side, but then we faced a different challenge because we had only one master.
It means that if something happens to that master, I don’t know, a peak, or whatever, is becoming a little bit unhealthy, then we are going to have an outage. We didn’t want to risk to have an outage in that situation.

So we moved to Germany with two servers, OVH with two servers, and afterward, we migrated to Gerrit v2.15 and NoteDb.

Your disaster recovery side is never really safe until you are going to need and use it. Then use it all the times, on a regular basis. This is what we ended up to implement.

We have now two different data centers, managed by two different cloud providers. One is still OVH with Canada, and the other is Hetzner in Germany. We still have the same configuration, HA plugin over a shared NFS, so this one is completely replicated into the disaster recovery site, and we are using the disaster recovery continuously to make sure that is always healthy and aligned.

Leverage Gerrit-DR site for Analytics

Because we didn’t want to serve actual user traffic on the disaster recovery site, because of the synchronization lag between the two sides, we ended up using for all the data mining activities. There is a lot of things that we do on data, trying to understand how our system performs. There is a universe of data that typically either never looked at or you don’t really extract and process in the right way.

Have you ever noticed how much data Gerrit generates under the logs directory? A tremendous amount of data and that data tells you exactly the stories that you want to know. It tells you exactly how Gerrit works today, what I need to do to makes sure that Gerrit will work tomorrow, how functions are performing for the end users, if something has blown up, it’s all there.

We started really long ago to work on that DevOps Analytics space, and we started providing that metrics and insights data for the Gerrit Code Review project itself and reporting it back to the Gerrit Code Review project through the service https://analytics.gerrithub.io.

Therefore we started using the disaster recovery site for analytics traffic because if I do an extraction and processing of my data on my logging, on my activities, is there really a need for an analysis of the visit of 10 seconds ago or not? A small time lag on data doesn’t make any difference from an analytics’s perspective.

Screenshot 2019-03-03 at 00.19.54.png

We were running Gerrit v2.15 here, so the HA plugin needed to be radically different from the one that it is today. We are still massively on the HA plugin fork, but all the changes have been pushed for review on the high availability.

However, the solution was still not good enough because there were still some problems. The problem is that the HA plugin within the same data center relies on the shared file system. We knew within the same data center that was not a problem. But, what about creating an NFS across data-centers in different continents? It would never work effectively because of the latency limitations.

We then started with a low tech solution: rely on the replication plugin for the Git data in the repository. Then every 30 minutes, there was a cronjob that was checking the consistency between the two and then does a delta re-index between the different sites.

Back in Gerrit v2.14, I needed to do as well a database export and import, because contained the reviews.

But, there was also the timing problem: in case of a disaster occurred, people will have to wait for half an hour to get the data after re-index.
They would have not lost anything because the index can be recreated at any time, but the user experience was not ideal. And also, you’ve got the DNS related issues for going from one zone to the other.

Sharding Gerrit across sites

First of all, we want to leverage the sharding based on the repository, available from Gerrit 2.15, which include the project name in each page or REST-API URLs. That allows achieving basic sharding across different nodes even with a simple OpenSource HAProxy or other Workload Balancer, without having the magic of the Google intelligent Gerrit/Git proxy. Bear in mind this is not pure sharding, because all the nodes keep on having all the repositories available on every node. HAProxy is going to be clever and based on the project name and action, will use the most appropriate backend for the reads and writes operation. In that way, we are making sure that you never push on the same repository on the same branch from two different nodes concurrently.

Screenshot 2019-03-02 at 01.23.16

How magic works? The replication plugin takes care of synching the nodes. Historically with master-slave, the synchronization was unidirectional. With multi-site instead, all masters are replicating to each other. It means that the first master will replicate to the second and the other way around. Thanks to the new URL scheme, Google made our lives so much easier.

Introducing the multi-site plugin

We have already started writing a brand-new plugin called multi-site. Yes, it has a different name, because the direction that it’s taking is completely different in terms of decisions compared to the high-availability plugin, which required having a shared file system. Secondly, the HTTP synchronous communication with the other site was not feasible anymore. What happens if, for instance, a remote Site in India is not reachable for 50 seconds? I still want things events and data to be put into a persistent queue to be eventually sent through the remote site.

Screenshot 2019-03-02 at 01.24.06
For the multi-site broker implementation, were have decided to start with Kafka, because there is already an implementation of the stream events it in Gerrit as a plugin. The multi-site plugin, however, won’t be limited to Kafka but could be extended to support other popular brokers such as NATS.

We will go to the location-aware DNS because we want the users to access the HAProxy that is closer to him. If I am living in Germany maybe it makes sense for me to use only the European servers, but definitely not the servers in California.

You will go by default to a “location-aware” site that is closer to you. However, depending on what you do, you could still be laster on redirected to another server across zones.

For example, if I want to fetch data, I can still fetch everything on my local site. But, if I want to push data, the operation can either be executed locally or forwarded to the remote site, if the target repository has been sharded remotely.

Screenshot 2019-03-02 at 01.24.43

The advantage is the location-awareness and data-locality. The majority of data transfer will happen with my local site because, at the end of the day, the major complaints of people using Gerrit with remote masters is a sluggish GUI.

If you have Gerrit master/slave, for instance, a Gerrit master in San Francisco and your Gerrit slaves in India, you have the problem that everyone from India will still have to access a remote GUI from the server in San Francisco, and thus would experience a very slow GUI.

Even if only 10 percent of your traffic is a write operation, you would suffer from all the performance penalties of using a remote server in San Francisco. However, if I have an intelligent HAProxy, with also an intelligent logic in Gerrit that understands where the traffic needs to go to, I can always talk to my local server is in my zone. Then, depending on what I do, I can use my local server or a remote server. This happens behind the scenes and is completely transparent to the user.

Q&A

Screenshot 2019-03-02 at 01.25.33

Q I just wanted to ask if you’re ignoring SSH?

SSH is a problem in this architecture, right, because HAProxy supports SSH, but has a big problem.

Q: You don’t know what projects, you don’t have any idea of the operation, whether it’s read or write.

Exactly. HAProxy works at transport levels, so it means that it knows that flow of encrypted data going back and forth to the Gerrit server, but cannot see anything in clear and thus cannot make an educated decision based on it. Your data may end up being forwarded to the zone where your traffic is not optimized, and you will end up waiting a little bit more. But, bear in mind, that the majority of complaints about Gerrit master/slave is not really on the speed of the push/pull but rather on the sluggish Gerrit GUI.

Of course, we’ll find as a Community a solution to the SSH problem, I would be more than happy to address that in multi-site as well.

Q: Solution is to get a better LAN because we don’t have any complaints about the GUI across the continents.

Okay, yeah, get a super fast network across the globe, so everyone and everywhere in the world will be super fast accessing a single unique central point. However, not everyone can have a super-fast network. For instance in GerritHub, we don’t own the infrastructure so we cannot control the speed. Similarly, a lot of people are adopting a cloud infrastructure, where you don’t own and control the infrastructure anymore.

Q: Would you also have some Gerrit slaves in this setup. Would there be several masters and some of the slaves? Or everything will become just a master?

In multi-site architecture, everything can be a master. The distinction between master and slave does not exist anymore. Functionally speaking, some of them will be master for certain repositories and slave for other repositories. You can say that every single node is capable of managing any type of operations. But, based on this smart routing, any node can potentially manage everything, but effectively forwards the calls based on where is best to execute them.
If with this simple multi-site solution you can serve the 99% of the traffic locally, and for that one percent, we accept the compromise of going a bit slower, than it could still sound very good.

We implement this architecture last year with a client that is distributed worldwide and it just worked, because it is very simple and very effective. There isn’t any “magic product” involved but simply standard OpenSource components.

We are on a journey, because as Martin said last year in his talk, “Don’t try to do multi-master all at once.” Every single situation is different, every client is different, every installation is different, and all your projects are different. You will always need to compromise, find the trade-off, and tailor the solution to your needs.

Q: I think in most of the cases that I saw up there, you were deploying a Gerrit master in multiple locations? If you want to deploy multi-master in the same sites, like say I want two Gerrit servers in Palo Alto, does that change the architecture at all? Basically, I am referring to shared NFS.

This is actually the point made by Martin: if you have the problem on multi-site, it’s not a problem of Gerrit, it’s a problem of your network because your network is too slow. You should solve the network, you shouldn’t solve this problem in Gerrit.
Actually, the architecture is this one. Imagine you have a super fast network all across the globe, everyone is reaching the same Gerrit in the same way. You have a super fast direct connection with your shared NFS, so you can go and grow your master and scale horizontally.

The answer is, for instance, yes, if your company can do that. Absolutely, I wouldn’t recommend doing multi-site at all. But if you cannot do anything about speed, and you got people that work remotely and unfortunately, you cannot go and put a cable between the U.S. and India, under the sea just for your company, then maybe you want to address the problem in a multi-site fashion way.

One more thing that I wanted to point out, is that the multi-site plugin makes sense even in the same site. Why this picture is better than the previous one? I’m not talking about multi-site, I’m talking about multi-master on the same data center. Why this one is better?

You’ll read and write into both. Read, write, traffic to both. The difference between this one and previous is that there is no shared file system on this one. The previous, even within the same data center, if you use a shared file system, you still have a single point of failure. Because, even if you are willing to buy the most expensive shared file system with the most expensive network and system that exists in the world, they will still fail. Reliability cannot be simply resolved by throwing more money on it.
It’s not the money that makes my system reliable. Machines and disks will fail one day or another. If you don’t plan for failures, you’re going to fail anyway.

Q: Let’s say when you’re doing an upgrade on all of your Gerrit masters right? Do you have an automatic mechanism to upgrade? I’m coming from a perspective where we have our Gerrit on Kubernetes. So, sure, if I’m doing a rolling upgrade, it’s kind of like a piece of cake, HAProxy is taking care of by confd and all that stuff, right?
In your situation, how does an upgrade would look like, especially if you’re trying to upgrade all the masters across the globe? Do you pick and choose which site gets upgraded first and then let the replication plugins catch up after that?

First of all, the rolling upgrade with Kubernetes will work if you do minor upgrades. If you’re doing major Gerrit upgrades, it doesn’t. This is because that changes the data, right? With the minor upgrades, you do exactly the same here. Even if the others in the other zones have a different version, they are still interoperable, right, because they talk exactly to the same schema, exactly to the same API, exactly in the same way.

Q: I guess taking Kubernetes out of the picture, even, so with the current multi-master setup you have, if you’re not doing just a trivial upgrade, how do you approach that usually?

If you are doing a major version upgrade, that is going to orchestrate exactly like the GerritHub upgrade from 2.14 to 2.15. You need to use what I called a ping-pong technique. You basically need to do across data centers.

It’s not fully automated yet. I’m trying to automate it as much as possible and contribute back to the community.

Q: In the multi-master, when you’re doing the major upgrade, and even in the ping pong, so if the schemas changed, and you’re adding the events to replication plugin, are you going to temporarily suspend replication during the period of time? Because the other items on the earlier version don’t understand that schema yet. Can you explain that a little?

Okay, when you do the ping pong that was this morning presentation, what happens is the upgrade on the first node, interruption the traffic there, you do all the testing you want, you are behind with the original master but it catches up with replication and does all the testing.

With regards to the replication event, they are not understood by the client, such as the Jenkins Gerrit trigger plugin. That point was raised this morning as well. If you go to YouTube.com/GerritForgeTV, there is the recording of my talk of last year about a new plugin that is not subject to this fluctuation of Garret version.

Luca Milanesio – Gerrit Code Review Maintainer and Release Manager