How to migrate Gerrit from v2.15 to v2.16

The time has come to migrate gerrithub.io to the latest Gerrit v2.16, from the outdated v2.15 we had so far. The big change between the two is the full adoption of NoteDb: on v2.15 the internal Gerrit groups were still kept in ReviewDb, which forced us to keep a PostgreSQL instance active in production. With v2.16 we can finally say goodbye to ReviewDb 👋 and eliminate yet another SPoF (Single Point of Failure) from the GerritHub high-availability infrastructure.

Migrating to Gerrit v2.16 implies:

  1. Gerrit WAR upgrade
  2. Git repos upgrade because of a change in the NoteDb format
  3. Change in the database used, from PostgreSQL to H2 (for the schema_version)
  4. Introduction of the new Projects index

The above is quite a complex process and, here at GerritForge, we executed the migration on a running GerritHub.io with 15k active users, avoiding any downtime.

Architecture

This is the initial architecture we are starting the GerritHub.io v2.15 migration from:

Initial status - 15_01.png

In this setup, we have 2 sites, one in Canada (active) and one in Germany (active for analytics and disaster recovery). The latter is kept aligned with the active master via the replication plugin.

The HA Plugin used between the 2 Canadian nodes is a GerritForge fork enhanced with the ability to align the Lucene Indexes, Caches and Events when sharing repositories via NFS with caching enabled.

NOTE: The original High-Availability plugin is certified and tested on Gerrit v2.14 / ReviewDb only and requires the use of NFS without caching, which in turn requires a direct fiber-channel connection between the Gerrit nodes and the disks.

The traffic is routed with HAProxy to the active node. This allows us easy code migrations with no downtime, using what we call the “ping-pong” technique between the Canadian and the German site, which is inspired by the classical Blue/Green deployment with some adjustments for the peculiarities of the Gerrit multi-site setup.

The migration pattern, in a nutshell, is composed of the following phases:

  1. Upgrade code in Germany
    The Gerrit site in Germany is used for Analytics and thus can be upgraded first with low risk.
    German site -> passive, Canadian site -> active
     
  2. Redirect traffic to Germany
    Once the site in Germany is ready and warmed up, the GerritHub users are redirected to it. GerritHub is technically serving the v2.16 user-experience to all users.
    German site -> active, Canadian site -> passive
     
  3. Upgrade code in Canada
    The site in Canada is put offline and upgraded as well.
    German site -> active, Canadian site -> passive
     
  4. Redirect traffic back to Canada
    Once the site in Canada is fully ready and warmed up, the entire user-base is redirected back.
    German site -> passive, Canadian site -> active

Each HAProxy has the same configuration with a primary and 2 backups as follows:

HAProxy CA Primary.png
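
As a rough illustration of that layout, the helper below writes a backend stanza with one primary and two `backup` servers. It is a minimal sketch and not our production configuration: the function name `write_gerrit_backend`, the backend name, the host names and the port are all hypothetical placeholders.

```shell
# Sketch: write an HAProxy backend with one primary and two backup servers.
# Backend name, host names and port are hypothetical placeholders.
write_gerrit_backend() {
  out_file="$1"
  cat > "$out_file" <<'EOF'
backend gerrit_http
    # Primary: receives the traffic while its health check passes.
    server review-1  review-1.example.com:8080  check
    # Backups: only used when the primary is reported down.
    server review-2  review-2.example.com:8080  check backup
    server review-de review-de.example.com:8080 check backup
EOF
}
```

Calling `write_gerrit_backend /tmp/gerrit-backend.cfg` produces the fragment. With a layout like this, promoting another node would come down to moving the `backup` keyword to the other servers and reloading HAProxy.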

Timeline of events – 2nd of Jan 2019

2/1/2019 – 8:00 GMT: Starting point of the GerritHub configuration

  • Review-1 – Gerrit 2.15 – active node
  • Review-2 – Gerrit 2.15 –  failover node
  • Review-DE – Gerrit 2.15 – analytics node, used for disaster recovery

2/1/2019 – 10:10 GMT: Upgrade disaster recovery server

  • Stopped all services using Gerrit on review-de (we use the disaster recovery to crunch and serve the analytics dashboard)
  • Disabled replication plugin
  • Stopped Gerrit 2.15 and upgraded to Gerrit 2.16
  • Restarted Gerrit

2/1/2019 – 10:44 GMT: Re-enabled disaster recovery server 

  • Re-enabled replication from review-1… boom!
    • First issue: the mirror option of the replication plugin was set to true, hence all the branches containing the groups in the All-Users repo were dropped from the recovery server. All the groups were suddenly gone from the disaster recovery server
  • Removed the mirror option in the replication plugin
  • Re-enabled replication from review-1… this time everything was ok!
  • Re-executed the migration and everything was fine
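
For reference, the mirror setting lives under a `remote` section of `replication.config`, which uses the Git config file format. The helper below is a minimal sketch of removing it with `git config`; the function name `remove_mirror_option` is hypothetical, and the remote name is whatever your own `replication.config` uses.

```shell
# Sketch: drop the "mirror" key from one remote in replication.config.
# Without the key, the replication plugin falls back to its default
# (no mirroring), so refs missing on the source are not deleted remotely.
remove_mirror_option() {
  config_file="$1"   # e.g. <site>/etc/replication.config
  remote_name="$2"   # name of the [remote "..."] section (site-specific)
  git config -f "$config_file" --unset "remote.${remote_name}.mirror"
}
```

A call would look like `remove_mirror_option etc/replication.config review-de`, where `review-de` is an assumed remote name. The replication plugin needs to reload its configuration before the change takes effect.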

2/1/2019 – 11:00 GMT: Removed ReviewDB

  • Once we were happy with the replication of the Groups we could remove PostgreSQL

The only information left outside NoteDb is the schema_version table, which contains a single, static row. We moved it into H2 by copying the DB from a vanilla 2.16 installation and changing the Gerrit config to use it.
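
The config change can be sketched with `git config`, since `gerrit.config` uses the Git config file format. This is an illustration under assumptions, not the exact commands we ran: the function name `use_h2_database` is hypothetical, and the H2 path shown in the usage note is the one a default installation uses, which may differ on your site.

```shell
# Sketch: point gerrit.config at an H2 database file.
use_h2_database() {
  gerrit_config="$1"   # e.g. <site>/etc/gerrit.config
  h2_path="$2"         # H2 database path (assumption: copied from a vanilla site)
  git config -f "$gerrit_config" database.type h2
  git config -f "$gerrit_config" database.database "$h2_path"
  # Remove connection settings that only applied to the PostgreSQL server;
  # a missing key is not an error here.
  for key in hostname port username; do
    git config -f "$gerrit_config" --unset "database.${key}" 2>/dev/null || true
  done
}
```

For example, `use_h2_database etc/gerrit.config db/ReviewDB` after copying the H2 files into the site's `db` directory.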

DE 2.16 - 15_01.png

Before the next step, we had to wait for the online reindexing on review-de to finish (~2 hours).

Note: we didn’t consider offline reindexing since it is basically sequential, and it would have been way slower compared to the concurrent online one. Additionally, it does not compute all the Prolog rules in full.

2/1/2019 – 15:15 GMT: Reduce delta between masters

  • Reducing the delta of data between the 2 sites (Canada and Germany) allows a shorter read-only window when upgrading the currently active master
  • Manually replicated and reindexed misaligned repositories on review-de (see below the effect on the system load)

Screenshot 2019-01-14 at 20.33.10.png

Screenshot 2019-01-14 at 20.33.23.png

  • Pro tip: if you want to check the queue status to see, for example, whether the replication is still ongoing, this command can be used:

    ssh -p 29419 <gerrit_admin_user>@localhost \
                 gerrit show-queue --by-queue --wide
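
If you prefer not to re-run the command by hand, it can be wrapped in a small polling loop. The sketch below makes a loud assumption: that pending replication tasks show up in the queue output as lines containing the word `push`. Check that against your own output and pass a different pattern if needed. Both function names are hypothetical.

```shell
# Count lines on stdin that look like replication tasks.
# The default pattern "push" is an assumption about the queue output.
count_replication_tasks() {
  grep -c -E "${1:-push}" || true   # grep -c prints 0 itself when nothing matches
}

# Poll show-queue until no matching task is left.
wait_for_replication() {
  admin_user="$1"
  interval="${2:-30}"     # seconds between polls
  pattern="${3:-push}"
  while true; do
    # Bail out if ssh fails, so that an error is not mistaken for an empty queue.
    queue="$(ssh -p 29419 "${admin_user}@localhost" \
                 gerrit show-queue --by-queue --wide)" || return 1
    pending="$(printf '%s\n' "$queue" | count_replication_tasks "$pattern")"
    if [ "$pending" -eq 0 ]; then
      break
    fi
    echo "replication tasks still queued: $pending"
    sleep "$interval"
  done
}
```

Usage would be `wait_for_replication <gerrit_admin_user> 60`, polling once a minute.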

2/1/2019 – 15:50 GMT: Final master catchup

  • Switched on the read-only plugin on the active master
  • Service degraded for a few minutes (i.e.: Gerrit was read-only), but most of the operations were available, i.e.: Gerrit index/query/plugin/version, git-upload-pack, replication
  • Waited for review-de to catch up with the latest changes that came in on review-1 (we monitored it using the above “gerrit show-queue” command)

CA Readonly - 15_01.png

2/1/2019 – 15:54 GMT: Made disaster recovery active

  • Changed the HAProxy configuration, and reloaded it, to redirect all the traffic to review-de, which became the active node in the cluster

HAProxy-DE-primary-transition.png

  • See the transition of the traffic to review-de

Screenshot 2019-01-14 at 20.39.22.png

  • Left review-de the whole night as the primary site. This way we also tested the disaster recovery site stability

DE Active - 15_01.png

2/1/2019 – 19:47 GMT: Upgrade review-1 and review-2 to Gerrit 2.16

  • Stopped Gerrit 2.15 and upgraded to Gerrit 2.16
  • Waited for offline reindexing of Projects, Accounts and Groups
  • Started Gerrit 2.16 with online reindexing of the changes

CA 2.16 - 15_01.png
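
The steps above can be sketched as a small script. It is an outline under assumptions rather than the exact procedure we ran: the function names `run` and `upgrade_site`, the WAR file name and the site path are hypothetical, and the `DRY_RUN` switch exists only so the command sequence can be printed without touching a real site.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stop Gerrit, upgrade the site with the new WAR, reindex the small
# indexes offline, then start Gerrit again.
upgrade_site() {
  war="$1"    # path to the new gerrit.war (hypothetical name)
  site="$2"   # Gerrit site directory
  run "$site/bin/gerrit.sh" stop || return 1
  run java -jar "$war" init -d "$site" --batch || return 1
  for index in projects accounts groups; do
    run java -jar "$war" reindex -d "$site" --index "$index" || return 1
  done
  run "$site/bin/gerrit.sh" start
}
```

Running `DRY_RUN=1; upgrade_site gerrit-2.16.war /srv/gerrit` prints the six commands in order. The changes index is deliberately left out of the offline loop, since we let it reindex online after the restart.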

As expected, the system load increased due to the reindexing, which lasted for about 2 hours:

System load.png

Furthermore, despite review-1 not being the active node, the HTTP workload grew disproportionately:

HTTP requests.png

This was due to a well-known issue of the high-availability plugin, where reindexing events are forwarded to the passive nodes, creating an excessive workload on them.

3/1/2019 – 10:14 GMT: Made review-1 active

  • We used the same pattern used when upgrading review-de to align the data between masters
  • Changed the HAProxy configuration, and reloaded it, to redirect all the traffic back to review-1

 

Final - 15_01.png

Conclusions

The migration was completed and production is stable again with the latest and greatest Gerrit v2.16.2 and the full PolyGerrit UI. With the migration of the groups to NoteDb, ReviewDb leaves the stage completely. PostgreSQL is no longer needed, simplifying the overall architecture.

The migration itself was quite smooth; the only issue was due to a plugin misconfiguration and had nothing to do with Gerrit core. With the good monitoring we have in place, we managed to spot the issue straight away. Still, we will further automate our release process to prevent these issues from happening again.

Fabio Ponciroli (aka Ponch) – Gerrit Code Review Contributor – GerritForge