GerritHub is on NoteDb … with a bump

On the 26th of April at 9:10 AM EDT, the 400K changes on GerritHub.io were successfully migrated to NoteDb.
Below is the historic log entry from error_log.2018-04-26:

[2018-04-26 09:10:55,429] [OnlineNoteDbMigrator] INFO com.google.gerrit.server.notedb.rebuild.OnlineNoteDbMigrator : Online NoteDb migration completed in 8630s

What is NoteDb?

NoteDb is the next generation of Gerrit’s storage backend: it replaces the traditional SQL backend for change and account metadata by storing that data in the same repository as the code changes. In a nutshell, you can access all the reviews from your local Git repository with the “git log -p” command, even when you are offline, which is really neat.

Whilst all the major competitors of Gerrit Code Review still rely on a traditional database for reviews, NoteDb is innovative and provides many major benefits:

  • Simplicity
    All data is stored in one location in the site directory, rather than being split between the site directory and a possibly external database server.
  • Consistency
    Replication and backups can use a snapshot of the Git repository refs, which will include both the branch and patch set refs, and the change metadata that points to them.
  • Auditability
    Rather than storing mutable rows in a database, modifications to changes are stored as a sequence of Git commits, automatically preserving history of the metadata.
  • Extensibility
    Plugin developers can add new fields to metadata without the core database schema having to know about them.
  • New features
    Enables simple federation between Gerrit servers, as well as offline code review and interoperation with other tools.

Large-scale, world’s first.

GerritHub.io is the first large-scale Gerrit Code Review installation, apart from Google’s of course, to reach two important milestones:

  1. The world’s most advanced and up-to-date Gerrit release in production: v2.15.1-143
  2. The world’s first NoteDb on-line migration in production

Being the “first” has a lot of advantages because it allows people and companies to work faster and more efficiently than their competitors, which is paramount in the modern global economy. However, there are disadvantages as well: being the “first” means that at times you are going into unexplored space, and the road can be bumpy.

See below a summary of what happened yesterday on GerritHub.io during the NoteDb migration.

Timeline of events

06:47 AM – Starting online NoteDb migration

The online migration process starts. All incoming changes and reviews still happen on ReviewDb; however, Gerrit starts creating the /meta refs on the existing changes to translate all the existing DBMS records into Review Notes.

This migration state is called: WRITE (changes are written to both NoteDb and ReviewDb)

07:58 AM – Setting primary storage to NoteDb

The primary storage for new changes is moved to NoteDb. New changes are stored in NoteDb, while existing changes that were modified between 6:47 AM and 7:58 AM are delta-migrated and then flagged as “NoteDb only” one by one.

When a new change is created, it is assigned a sequence number coming from NoteDb and no longer from ReviewDb.

08:01 AM – Errors when trying to push new changes to GerritHub.io

One developer of the Python zVM SDK open-source project tries to push a new change to GerritHub.io but receives the following error:

$ git push origin HEAD:refs/for/master
Counting objects: 10, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (10/10), 657 bytes | 0 bytes/s, done.
Total 10 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4)
remote: Processing changes: new: 1, refs: 1, done
remote:
remote: New Changes:
remote: https://review.gerrithub.io/#/c/mfcloud/python-zvm-sdk/+/407671
remote:
To ssh://balaskoa@review.gerrithub.io:29418/mfcloud/python-zvm-sdk
! [remote rejected] HEAD -> refs/for/master (internal server error: Error inserting change/patchset)

Other errors with identical symptoms appear on other projects. It isn’t, however, a general failure: other new changes are getting through, and existing changes are being reviewed as expected.

08:36 AM – Problem notified to the Gerrit mailing list

The troubleshooting starts, and it seems that some of the new changes created on NoteDb have sequence numbers that conflict with existing changes on ReviewDb belonging to other projects.

Not all changes are impacted though, so migration continues.

09:02 AM – Migrated primary storage

All the changes have been migrated and flagged as “NoteDb only”; there will be no more read access to ReviewDb for those.

09:06 AM – Cause identified

A bug has been identified in the code that manages the generation of sequence numbers for new changes on NoteDb: switching the primary storage to NoteDb did not update the sequence number stored in All-Projects/refs/sequences/changes, and thus newly created changes may conflict with existing ones on ReviewDb.
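
The failure mode can be illustrated with a toy model. This is only a sketch, not Gerrit code: the `ChangeSequence` class, the `insert_change` helper and the numbers are all hypothetical, and the real sequence lives in a Git ref rather than in memory. It shows why a counter that was not moved past the highest existing change number produces a conflict, and why the insert is rejected instead of overwriting anything.

```python
# Toy model of the sequence bug: NOT Gerrit code, all names are hypothetical.


class ChangeSequence:
    """Hands out monotonically increasing change numbers."""

    def __init__(self, next_value):
        self.next_value = next_value

    def next(self):
        value = self.next_value
        self.next_value += 1
        return value


def insert_change(existing_ids, sequence):
    """Insert a change, rejecting any number that is already taken."""
    change_id = sequence.next()
    if change_id in existing_ids:
        raise ValueError(f"change {change_id} already exists")
    existing_ids.add(change_id)
    return change_id


# Changes 1..5 were created while ReviewDb owned the sequence.
reviewdb_ids = {1, 2, 3, 4, 5}

# Buggy switch: the new sequence still holds a value that is too low.
stale = ChangeSequence(next_value=4)
try:
    insert_change(set(reviewdb_ids), stale)
    collided = False
except ValueError:
    collided = True

# Correct switch: the new sequence starts past the highest known number.
seeded = ChangeSequence(next_value=max(reviewdb_ids) + 1)
new_id = insert_change(set(reviewdb_ids), seeded)
```

In this model the stale counter triggers the rejection, while the correctly seeded counter hands out the next free number.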

09:10 AM – Migration completed

09:49 AM – Acknowledged by Google

Dave Borowitz, the leader of the Gerrit Code Review project, analyzes the discussion topic on the mailing list and agrees on the diagnosis of the issue.

Dave Borowitz’s words were: “Nice catch, thank you Luca.”

10:02 AM – GerritHub.io production patched, problem resolved.

09:05 PM – A software fix for the Gerrit v2.15 stable branch is uploaded

A definitive fix for the software glitch is uploaded to Gerrit-Review and is reviewed by the Gerrit Code Review contributors.

The “bump” on the road

Migration is always a pain: you need to plan it, test it, and fix all the issues you can potentially verify in a “like-for-like” pre-production environment. This time, however, the testing itself had produced a situation that was unprecedented.

When Gerrit was migrated from v2.14 to v2.15.1, traffic was moved between Data-Centers (DCs), from Canada to Germany and then back to Canada, using a “ping-pong” technique with zero-downtime.
That means that the on-line migration to NoteDb had *already* been tried on the Canada-DC a few days earlier: it succeeded, and Gerrit stored the “last known sequence number” from ReviewDb into NoteDb.

The second NoteDb migration yesterday followed exactly the same path as the previous test on the Canada-DC but, this time, the “last known sequence number” was not updated.
That is an “edge-case” that was not foreseen when the code was written, and it produced the failures experienced by new changes.

The Gerrit NoteDb code is very resilient: it immediately detected the situation and refused to insert and index the changes with conflicting IDs.

Statistics of migration

  • Total migration time: 2h 23m
  • Reaction time to investigate failures: 36m
  • Resolution time: 2h
  • Software fix: 13h
  • Number of changes impacted: 33 out of 400k – 0.008%
  • Number of projects impacted: 14 out of 14k – 0.1%
  • Data loss: 0%
  • Incidents created and closed: 3
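
Some of these figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this post (the 8630s from the log entry and the rounded 400k and 14k totals), so the percentages are approximate; the `format_duration` helper is a made-up name for illustration.

```python
# Figures copied from the post; the helper name is made up for illustration.


def format_duration(seconds):
    """Render a number of seconds as 'Xh Ym Zs'."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes}m {secs}s"


# Duration reported by OnlineNoteDbMigrator in the error_log entry.
migration = format_duration(8630)

# Shares of impacted changes and projects, based on the rounded totals.
changes_pct = 33 / 400_000 * 100
projects_pct = 14 / 14_000 * 100

print(migration)
print(f"{changes_pct:.3f}% of changes, {projects_pct:.1f}% of projects")
```

The logged duration corresponds to the 2h 23m total above, and the two ratios round to the 0.008% and 0.1% listed.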

Current situation

No more errors or problems have been reported; production is stable.

GerritHub adopts 100% PolyGerrit

GerritHub.io has been successfully migrated to Gerrit Code Review v2.15. Thanks to the 5-day Gerrit Hackathon hosted by Axis in Lund, all the remaining issues we had on v2.15 have been resolved, and from today all the 15k active users of GerritHub can use the 100% feature-complete PolyGerrit UX.

What is PolyGerrit?

PolyGerrit is the code-name of the new UX, wholly redesigned using web components, a set of web platform APIs that allow you to create new custom, reusable, encapsulated HTML tags to use in web pages and web apps. Custom components and widgets built on the Web Components standards work across modern browsers and can be used with any JavaScript library or framework that works with HTML.

To access PolyGerrit, just go to the GerritHub.io footer and click on the link “Switch to New UI”, or add “?polygerrit=1” as an extra query string (e.g. https://review.gerrithub.io/?polygerrit=1).
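
If you build links to GerritHub.io from scripts, the query string can be appended programmatically. The following is a minimal sketch using Python’s standard urllib.parse module; the `with_polygerrit` helper is a hypothetical name, and only the “polygerrit=1” parameter comes from this post.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_polygerrit(url):
    """Return the URL with polygerrit=1 set in its query string."""
    parts = urlsplit(url)
    # Keep any query parameters that are already present.
    query = dict(parse_qsl(parts.query))
    query["polygerrit"] = "1"
    return urlunsplit(parts._replace(query=urlencode(query)))


new_ui_url = with_polygerrit("https://review.gerrithub.io/")
print(new_ui_url)
```

Parsing the URL rather than concatenating strings keeps existing parameters intact and avoids producing a second “?” in the link.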

To have a more comprehensive description of PolyGerrit, you can read the previous blog post about the Google talk at the past Gerrit User Summit in London.

Zero-downtime “ping-pong” migration to Gerrit v2.15

As usual, we migrated with near-zero-downtime, with only a “ten minutes *read-only* window” during which we waited for the final replications to drain while traffic was moved between Data-Centers.

There are two Data-Centers (DCs) active for GerritHub.io; the main one is hosted in Canada and the second in Germany. Both DCs have a high-availability configuration and run the same version of Gerrit with the same data. However, during major upgrades, we use a “ping-pong” technique to give a seamless experience to the users.

Ping phase

DC-Canada stays on Gerrit v2.14 while DC-Germany gets upgraded to v2.15. Traffic is then forwarded smoothly from DC-Canada to DC-Germany thanks to the HAProxy that serves traffic for https://review.gerrithub.io.

Pong phase

DC-Canada gets upgraded to v2.15 while DC-Germany syncs back to Canada in near-real time. Once the upgrade is complete, the HAProxy forwards the traffic back to DC-Canada.
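
The order of the two phases can be summarised with a small toy model. It is a sketch under simplifying assumptions: the `DataCenter` class and the `ping_pong_upgrade` function are hypothetical names, and the model ignores replication, the read-only window and the actual HAProxy configuration. It only records which site is serving traffic after each step.

```python
# Toy model of the ping-pong upgrade: hypothetical names, no real HAProxy calls.

from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    version: str


def ping_pong_upgrade(primary, standby, new_version):
    """Upgrade both sites while one of them always serves traffic.

    Returns a list of (step, active site name) pairs for inspection.
    """
    log = [("start", primary.name)]

    # Ping: upgrade the idle site, then forward traffic to it.
    standby.version = new_version
    active = standby
    log.append(("ping", active.name))

    # Pong: upgrade the original site, then forward traffic back.
    primary.version = new_version
    active = primary
    log.append(("pong", active.name))

    return log


canada = DataCenter("DC-Canada", "v2.14")
germany = DataCenter("DC-Germany", "v2.14")
steps = ping_pong_upgrade(canada, germany, "v2.15")
```

The point of the model is that a site is only ever upgraded while it is not serving traffic, and that the run ends with both sites on the new version and the main site active again.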

Benefits of the ping-pong upgrade

There are two significant benefits of combining the zero-downtime rollout with the ping-pong technique:

  1. No general service disruption, minimal read-only time.
    Nobody would notice any significant service disruption on GerritHub.io: the read-only window is a minor service degradation that lasts for only a few minutes. Given that 90% of the traffic is represented by “git fetch/clone” and web browsing, the degradation is hardly noticed by anyone.
  2. Validation of the disaster recovery procedure.
    Because DC-Germany is used as the disaster recovery site, it is essential to make sure that it is always working fine and that you can actually fail over to it at any time when needed.
    You do not want to find out that the disaster recovery isn’t working when it is too late.

What’s new in Gerrit v2.15?

There are many changes in Gerrit v2.15; for the details, you can have a look at the Google presentation at the Gerrit User Summit 2017 in London.

In a nutshell, here are the headlines of the most visible changes you will notice:

  1. Support for draft changes and draft patch sets has been completely removed.
    You now have two possible states for a change: WIP (work-in-progress) and Private. All the changes that were in “draft” status at the time of the migration have been moved to the WIP state.
  2. New URL Scheme
    Gerrit URLs generated and used by the UI include not just the change number but the project name as well.
    For instance, the Change 123 on project ‘myproject’ would now be accessible on the URL: https://review.gerrithub.io/#/c/myproject/+/123.
    Existing URLs containing only the change number (e.g. https://review.gerrithub.io/123) are redirected to the new scheme.
  3. New workflows on the PolyGerrit UX
    The PolyGerrit UX is now 100% feature complete. It is not only an engineering rewrite but also a whole redesign of the user-flows and experience.
    See https://www.gerritcodereview.com/releases/2.15.md#new-workflows for the details of all the new flows.
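
For scripts that generate links to changes, the new URL scheme from point 2 can be sketched as follows. The `change_url` helper and the `BASE_URL` constant are hypothetical names; the URL layout itself is the one shown in the example above.

```python
# Hypothetical helper: builds a change URL in the new v2.15 scheme.

BASE_URL = "https://review.gerrithub.io"


def change_url(project, number):
    """Return the URL of a change, including its project name."""
    return f"{BASE_URL}/#/c/{project}/+/{number}"


url = change_url("myproject", 123)
print(url)
```

Project names that contain a slash, such as those in the “owner/repository” form, fit into the same layout without any special handling in this sketch.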

What’s next?

Gerrit v2.15 includes a brand-new storage for reviews, code-named “NoteDb”, that is actually your Git repository itself. That means that all the meta-data, comments, scores, history and audit will be stored in the same GitHub repository as your code.

Our next step is to perform an online migration from ReviewDb to NoteDb, which will again be done with zero-downtime. Thanks to the fantastic work done by Dave Borowitz (Google), there will be no need for a read-only window: you will not even notice.

Want to know all the details of Gerrit v2.15?

For an overview of what’s new in GerritHub with v2.15, you can look at the Gerrit User Summit 2017 presentation.