Gerrit User Summit 2016: Ten days to go


In ten days’ time, the 2016 Gerrit User Summit will open its doors at the Googleplex Campus in Mountain View – CA.

It is going to be an amazing event, with a lot of exciting updates from the Gerrit Community, thanks to the many innovations coming into the platform: large file support (Git LFS), a new HTML5 Polymer-based UX, NoteDb, multi-master updates and a lot more.

GerritForge will be present with five attendees and six amazing talks, covering many aspects of the world of Code Review and the Continuous Delivery Pipeline:

  • Code Review Analytics
  • High Availability and Zero-Downtime upgrades
  • Infinite scalability with Gerrit on Apache Cassandra
  • Jenkins 2.0 Continuous Delivery pipeline for Gerrit
  • Speeding up builds with Bazel on Gerrit
  • Robot feedback management for Gerrit

Once again, as we have always done, we continue to fuel new ideas and innovation in the world of Code Review, guided by the feedback of the Gerrit users’ community and developing those ideas as OpenSource, always.

See you soon at the Gerrit User Summit 2016.

Code Review Analytics at Gerrit User Summit

I love hiking mountains; I always have, since I was a child. There is a mix of challenge, enthusiasm, and learning in walking for long hours along narrow, tortuous and steep trails: it challenges your mind and body and makes you stronger.

One thing that has always fascinated me is how the shape of the peaks and valleys changes as you go higher and raise your perspective. When, finally exhausted, you reach the mountain peak, you feel a mix of pleasure and relief. You have achieved your goal, and you can, at last, take in the full view of the horizon, understand how the mountain chains are linked together, where rivers start and end up in lakes, and see farther than ever before.

Continuous Delivery is a landscape of rivers, lakes, and mountains

I believe today’s landscape of Software Engineering is very diverse: there are so many powerful tools that generate streams of data continuously, and we need to stay on top of them every day to be successful in an ever-evolving market space.
Continuous Delivery is the key methodology that nowadays allows the entire “software production chain” to work smoothly. However, it poses challenges which are not always technical but are often related to the flow of information across systems and people.

See the big picture

To succeed, we need to raise our point of view to see the “big picture”, understand how things are connected and where the improvement points are, considering all the data we have about:

  • Tools
    collect software and system metrics, logs, test results, build trends
  • Projects
    repositories, commits, branches, pull requests, patch sets
  • People and Teams
    active and passive collaborators, contributors, reviewers, comments and replies

Collecting data is not enough: we need to raise our point of view, hike the Continuous Delivery mountain of problems, and reach a point where all those elements make sense because they are:

  • all visible from a single perspective
  • correlated
  • aggregated

Yet another BigData problem

The problem is that there are too many data sources to manage.
The typical solution is to take all the logs from everywhere, publish them to a single repository using an ELK (ElasticSearch + Logstash + Kibana) stack, and build fancy dashboards. I have used this approach for small-scale projects, and it works quite well, but … when I tried to scale it to a much bigger Continuous Delivery pipeline, the complexity, diversity, and granularity of the data just killed my ability to see the “bigger picture”, and I felt almost helpless in front of my Kibana dashboards.
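
For a small setup, the plumbing behind that approach can be as simple as posting JSON events to Elasticsearch over its REST API. The sketch below only illustrates the idea; the host, index and type names and the event fields are assumptions made up for the example, not part of any specific pipeline.

```python
# Minimal sketch: push one CI build event into Elasticsearch over its REST
# API. Host, index/type names and the event fields are illustrative only.
import datetime
import requests

event = {
    "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
    "job": "gerrit-verify",      # hypothetical CI job name
    "result": "SUCCESS",
    "duration_sec": 312,
}

# Index the event as a document of type "build" in the "ci-builds" index
# (endpoint shape as per the Elasticsearch versions current at the time).
resp = requests.post("http://localhost:9200/ci-builds/build",
                     json=event, timeout=10)
resp.raise_for_status()
print(resp.json()["_id"])        # auto-generated document id
```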

Back to the source of data

Trying to understand what was missing, I realized that one important dimension was neither captured nor correlated: Code Reviews.

All the tooling was about test results, builds, system and application logs, but none of it took the code into account, even though the code is where the whole pipeline starts. You can only understand where to go if you know where you come from: the code is the source of every build chain.
When I started collecting data from the code repository and the reviews, everything made sense again, and I felt I had reached the peak of my Continuous Delivery hike: I could finally see the overall perspective.
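
Most of that review data is already exposed by the Gerrit REST API. As a minimal sketch (the server URL, query and limits below are assumptions chosen only for illustration), collecting the recently merged changes can be as simple as:

```python
# Minimal sketch: pull recently merged changes from a Gerrit server through
# its REST API. Server URL, query and limits are illustrative assumptions.
import json
import requests

GERRIT = "https://review.gerrithub.io"

resp = requests.get(
    f"{GERRIT}/changes/",
    params={"q": "status:merged", "n": 25, "o": "DETAILED_ACCOUNTS"},
    timeout=30,
)
resp.raise_for_status()

# Gerrit prefixes JSON responses with ")]}'" to prevent XSSI; strip it first.
changes = json.loads(resp.text.split("\n", 1)[1])

for change in changes:
    print(change["project"], change["_number"], change["owner"].get("name"))
```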

Continuous Delivery Analytics

In exactly the same way Application Analytics collects data about our production systems, splits it into dimensions and analyzes it, we need to start doing the same with our Continuous Delivery pipeline.
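
Continuing the previous sketch, even a trivial aggregation turns the raw list of changes into dimensions (per project, per owner) that can be compared and trended over time; again, this is only an illustration:

```python
# Minimal sketch: bucket the changes fetched above into simple dimensions
# (project and owner) to get a first aggregated view. Purely illustrative.
from collections import Counter

by_project = Counter(change["project"] for change in changes)
by_owner = Counter(change["owner"].get("name", "unknown") for change in changes)

print(by_project.most_common(5))   # most active projects
print(by_owner.most_common(5))     # most active change owners
```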

There are already tools on the market that integrate some parts of it, but I haven’t seen a single one that gives you the bird’s-eye perspective you need to understand the big picture.

That’s why I started writing one, in the only way that makes sense to me: in the open, with the help and cooperation of the OpenSource community of people and companies who share the same problem and the same perspective.

Gerrit Analytics coming at the User Summit 2016

I presented my ideas at a couple of conferences (Devoxx, JenkinsWorld) and received a lot of appreciation and feedback: the next stop is the Gerrit User Summit in Mountain View – CA.

Large companies like SAP, Qualcomm, Ericsson, Google and Intel exchange their problems and ideas every year on how to make their Continuous Delivery pipelines smoother, better and faster.
The focus will be, of course, more Gerrit Code Review centric, with more data and views that make sense from a review perspective.

Call to action

Come to the Gerrit User Summit 2016 at Google HQ in Mountain View – CA, on the 12th and 13th of November, and see Gerrit and the Continuous Delivery Pipeline in action.

The event is FREE, register now at https://goo.gl/forms/oeEnQweHl2noNSnn1

 

Gerrit Upgraded with No Downtime

[Pingdom uptime and response-time graph for the past 24 hours]

A Zero-Downtime success story.

From today at 08:06 GMT, GerritHub users are served by our brand new infrastructure geo-located in Beauharnois, Quebec, Canada. It is the first time we have applied a zero-downtime roll-out scheme: Pingdom reported 100% uptime over the past 24h and an average response time of 688 ms for the page listing open changes. The two response-time spikes on the graph above are actually due to the old German infrastructure and happened before the roll-out started.

The switch of traffic to the new infrastructure is visible as an increase in the overall response time (IP packets were initially routed from Germany to Canada, causing extra hops); as the DNS propagation spread across the world, the overall number of hops gradually came back to normal.

Timeline of the events.

  • 08:00:00 GMT – Phase 1 – Set Gerrit READ-ONLY. All changes and Git repositories started to refuse push and updates.
  • 08:00:01 GMT – Phase 2 – Wait for pending replication to complete. Replication queue was empty; there was no need to wait.
  • 08:00:02 GMT – Phase 3 – Mirror DB and Git for the last time, delta-reindex, DB upgrade and Gerrit restart. This was the longest part of the roll-out and lasted 5′ 32”, in line with our estimates.
  • 08:05:34 GMT – Phase 4 – Cache warm-up. 20K projects, 8K accounts and 4.6K groups were pre-loaded into Gerrit. This step was optional but allowed us to redirect all the traffic without the risk of causing thread spikes on the new infrastructure.
  • 08:06:23 GMT – Phase 5 – Redirect traffic to the new infrastructure.

Did anybody notice the rollout?

During the rollout the Git projects and Gerrit changes were read-only for 6′ and 23”. According to the logs, 493 Git/HTTP and 172 Git/SSH invocations were made and completed successfully: none of them failed.

What is the situation right now?

The new infrastructure’s public IP (192.99.233.76) has almost completed its DNS propagation around the world; the only countries not yet entirely covered are Australia and China. The rest of the world now reaches Canada directly, avoiding the German hops. Metrics are good: lower CPU utilization and thread consumption compared to the old German infrastructure, a symptom of reduced execution and serving times and lower latency.
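
If you want to check whether the propagation has already reached your network, a quick way is to resolve the hostname locally and compare it with the new IP; the snippet below is just a convenience sketch using the Python standard library.

```python
# Minimal sketch: check whether your local DNS already resolves
# review.gerrithub.io to the new Canadian IP (192.99.233.76).
import socket

NEW_IP = "192.99.233.76"

resolved = socket.gethostbyname("review.gerrithub.io")
if resolved == NEW_IP:
    print("DNS propagated: traffic goes straight to the new infrastructure")
else:
    print(f"Still resolving to {resolved}; propagation has not reached you yet")
```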

What’s next?

From now on we will continue to use this Blue/Green roll-out strategy, possibly shrinking the read-only window by introducing live distributed reindexing and cache warm-up.

We are fully committed to Zero-Downtime and Stability, the most valuable assets for our clients.

GerritHub and Zero-Downtime Upgrade

GerritHub gets bigger on Mon, 21 March 08:00 GMT


GerritHub has experienced unprecedented growth over the past two years. The November 2015 numbers presented at the Gerrit User Summit in Mountain View – CA have been surpassed again, and we need to make sure that our infrastructure is still capable of dealing with current and future users’ needs.

What is changing in GerritHub.io?

We are changing everything, from the version of Gerrit to the hardware, network and storage infrastructure. Data, DBMS, indexes and caches need to be upgraded and refreshed to make sure that the new systems reflect exactly the current production data and sessions.
We are also changing the geo-location of our servers, from the current server farm in Germany (Bavaria, Nuremberg – 100 Mbps) to a new server farm in Canada (Quebec, Beauharnois – 1 Gbps).

Why so many changes?

We started to measure significant delays in the Git and review operations on the old infrastructure, mainly due to three factors:

  1. More users, more repositories, more concurrency. Individuals, OpenSource projects and businesses started using GerritHub.io for their mission-critical repositories, considering Gerrit the “source of truth” of their review workflow. We needed more horsepower, memory, storage and the ability to scale even further.
  2. Bandwidth from the USA and the Far East. The majority of people using GerritHub.io are on the other side of the Atlantic Ocean: this is typically not a problem from 7 AM to 3 PM … but after 4 PM the connectivity between Europe and the Americas becomes slow. Additionally, people using GerritHub.io from India, Japan, Australia and New Zealand experienced terrible slowdowns because of the excessive number of hops needed to reach Germany.
  3. Gerrit master is much faster. Based on the data and metrics measured on GerritHub.io, we have contributed a lot of patches to reduce the overhead caused by the Gerrit DB and lessen the number of SQL queries per minute. All those improvements are on Gerrit master, and we need to catch up with the “latest and greatest” version.

Will I experience any GerritHub.io outage?

The last time GitHub needed to make a major upgrade, it asked its 5M users to stop working for 23 minutes. That translates into a loss of two million hours of continuous delivery lifecycle, equivalent to over 130 man-years, worth no less than eight million dollars.
We are adopting a new Zero-Downtime Gerrit roll-out strategy to make sure that all these changes do not impact your day-by-day activity. If you were not reading this post, you would possibly not even notice the “switch” from the old to the new infrastructure, apart from the increase in speed and bandwidth.

Zero-downtime GerritHub.io migration, step by step with the associated expected timings.

Phase 0 – Replication to the new Gerrit infrastructure. (– 1 month)
We started migrating everything one month ago, and the old and new infrastructures have been working side by side since then, thanks to Gerrit master-slave replication. The new Gerrit servers are active as slaves and are read-only.

Phase 1 – Migration kick-off. (08:00 GMT)
We install a Gerrit plugin that rejects all pushes to GerritHub.io repositories with a courtesy message: “Gerrit is under maintenance, all projects are READ ONLY”. All HTTP POST, PUT and DELETE methods are disabled on the Gerrit REST API.

Phase 2 – Wait for replication events to complete and migrate DB. (08:02 GMT)
Git repositories are continuously replicated, but we need to make sure that the event queue is empty. Once that happens, we perform the final DB migration to the new infrastructure.

Phase 3 – Gerrit DB upgrade and reindex (08:04 GMT)
The new Gerrit server executes the final upgrade and the off-line reindex of the latest received changes.

Phase 4 – Gerrit start-up and cache warm-up (08:05 GMT)
The new Gerrit is started and the most critical Gerrit caches (projects, accounts and groups) are pre-loaded into memory. This prevents the incoming traffic spike from exhausting the available threads and makes the transition as smooth as possible, without slowdowns.
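
One way such a warm-up can be scripted is by simply listing projects, accounts and groups through the Gerrit REST API right before the switch. The sketch below only illustrates the idea; the URL, limits and credentials are assumptions, not the actual procedure used on GerritHub.io.

```python
# Minimal sketch: warm the projects, accounts and groups caches by listing
# them through the Gerrit REST API before traffic is switched over.
# Server URL, limits and credentials are assumptions for illustration.
import requests

GERRIT = "https://review.gerrithub.io"

session = requests.Session()
session.auth = ("admin-user", "http-password")   # hypothetical credentials

# Exact query support (e.g. account queries) depends on the Gerrit version.
for endpoint in ("/a/projects/?n=25000",
                 "/a/accounts/?q=is:active&n=10000",
                 "/a/groups/"):
    resp = session.get(GERRIT + endpoint, timeout=120)
    resp.raise_for_status()
    print(endpoint, "warmed, payload size:", len(resp.text))
```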

Phase 5 – Traffic switch and DNS updates (08:06 GMT)
GerritHub.io redirects all incoming HTTPS and SSH traffic to the new infrastructure. Git pushes and HTTP PUT, POST and DELETE operations of the REST API are operational again and served by the new Gerrit infrastructure. GerritHub.io DNS is updated to the new Canadian IPs.

Phase 6 – New IPs propagate to all worldwide DNS servers (+ 1 day)
Once all the DNS servers in the world have been updated, everyone will reach the new infrastructure directly, without further hops or redirections from Germany. Customers from the USA, Canada, South America, Asia, Japan, New Zealand and Australia should see a significant reduction in network latency and an increase in GerritHub.io responsiveness.

Firewall and SSH considerations

Even though the Gerrit server’s SSH key is not changing, some of you may see a warning similar to this when pushing or pulling over SSH:

Warning: the RSA host key for ‘review.gerrithub.io’ differs from the key for the IP address ‘148.251.77.70’

The warning message will also tell you which lines in your ~/.ssh/known_hosts need to change. Open that file in your favorite editor, remove or comment out those lines, then retry your push or pull.

Should your network have strict firewall rules for accessing external sites, you may want to whitelist the IP of the new infrastructure’s WLB: 192.99.233.76.

Follow GerritHub.io migration progress.

We will advertise the migration progress on Twitter at @GitEnterprise. Should you have any issues, you can tweet us or contact GerritForge Customer Support at support@gerritforge.com.