GerritHub is on NoteDb … with a bump

516px-Road-sign-Speed_bump.svg

The 26th of April at 9:10 AM EDT, the 400K changes on GerritHub.io have been successfully migrated to NoteDb.
See below the historical log entry in error_log.2018-04-26

[2018-04-26 09:10:55,429] [OnlineNoteDbMigrator] INFO com.google.gerrit.server.notedb.rebuild.OnlineNoteDbMigrator : Online NoteDb migration completed in 8630s

What is NoteDb?

NoteDb is the next generation of Gerrit storage backend, which replaces the traditional SQL backend for change and account metadata with storing data in the same repository as code changes. In a nutshell, you can access all the reviews from your local Git repository as well by using the “git log -p” command line and even when you are offline, which is really neat.

Whilst all the major competitors of Gerrit Code Review still rely on a traditional DataBase for reviews, NoteDb is innovative and provides many major benefits:

  • Simplicity
    All data is stored in one location in the site directory, rather than being split between the site directory and a possibly external database server.
  • Consistency
    Replication and backups can use a snapshot of the Git repository refs, which will include both the branch and patch set refs, and the change metadata that points to them.
  • Auditability
    Rather than storing mutable rows in a database, modifications to changes are stored as a sequence of Git commits, automatically preserving history of the metadata.
  • Extensibility
    Plugin developers can add new fields to metadata without the core database schema having to know about them.
  • New features
    Enables simple federation between Gerrit servers, as well as offline code review and interoperation with other tools.

Large-scale, world’s first.

GerritHub.io is the first large-scale Gerrit Code Review installation, apart from Google’s of course, that has hit essential records targets:

  1. The world’s most advanced and up-to-date Gerrit release in production: v2.15.1-143
  2. The world’s first NoteDb on-line migration in production

Being the “first” has a lot of advantages because allow people and companies to work faster and more efficiently than the competitors, which is paramount of the modern global economy. However, there are disadvantages as well: being the “first” means that at times you are going into unexplored space, and the road could be bumpy.

See below a summary of what happened yesterday on GerritHub.io during the NoteDb migration.

Timeline of events

06:47 AM – Starting online NoteDb migration

The online migration process starts. All incoming changes and reviews are still happening on ReviewDb, however, Gerrit start creating the /meta refs on the existing changes to translate all the existing DBMS records into Review Notes.

This migration state is called: WRITE (changes are written to both NoteDb and ReviewDb)

07:58 AM – Setting primary storage to NoteDb

The primary storage for new changes is moved to NoteDb. New changes will be stored to NoteDb while existing changes that have been modified between 6:47 AM and 7:58 AM will be delta-migrated and then flagged as “NoteDb only” one by one.

When a new change is created, it will be assigned from a sequence number coming from NoteDb and not anymore from ReviewDb.

08:01 AM – Errors when trying to push new changes to GerritHub.io

One developer of the Python zVM SDK OpenSource project tries to create a new change to GerritHub.io but receives the following error:

$ git push origin HEAD:refs/for/master
Counting objects: 10, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (5/5), done.
Writing objects: 100% (10/10), 657 bytes | 0 bytes/s, done.
Total 10 (delta 4), reused 0 (delta 0)
remote: Resolving deltas: 100% (4/4)
remote: Processing changes: new: 1, refs: 1, done
remote:
remote: New Changes:
remote: https://review.gerrithub.io/#/c/mfcloud/python-zvm-sdk/+/407671
remote:
To ssh://balaskoa@review.gerrithub.io:29418/mfcloud/python-zvm-sdk
! [remote rejected] HEAD -> refs/for/master (internal server error: Error inserting change/patchset)

Other errors are appearing with the identical symptoms on other projects. It isn’t, however, a general failure because other new changes are getting through and existing changes are reviewed correctly as expected.

08:36 AM – Problem notified to the Gerrit mailing list

The troubleshooting starts, and it seems that some of the new changes created on NoteDb have sequences in conflict with existing changes on ReviewDb but on other projects.

Not all changes are impacted though, so migration continues.

09:02 AM – Migrated primary storage

All the changes have been migrated and flagged as “NoteDb only”, there will be no more read access to ReviewDb for those.

09:06 AM – Cause identified

A bug has been identified in the code that manages the generation of sequencing numbers for the new changes on NoteDb: the switch to the primary storage to NoteDb has not updated the sequencing number on the All-Projects/refs/sequences/changes and thus new changes created may be conflicting with existing ones on ReviewDb.

09:10 AM – Migration completed

09:49 AM – Acknowledge by Google

Dave Borowitz, the leader of the Gerrit Code Review project, analyzes the discussion topic on the mailing list and agrees on the diagnosis of the issue.

Dave Borowitz words were: “Nice catch, thank you Luca.”

10:02 AM – GerritHub.io production patched, problem resolved.

9:05 PM – A software fix to the Gerrit v2.15 stable branch uploaded

A definite fix for the software glitch is uploaded to Gerrit-Review and is reviewed by the Gerrit Code Review contributors.

The “bump” on the road

Migration is always a pain, and you need to plan it, test and fix all the issues you can potentially verify in a “like-for-like” pre-production environment. However, this time at least, testing had produced a situation that was unprecedented.

When Gerrit was migrated from v2.14 to v2.15.1, traffic has been moved between Data-Centers (DCs), from Canada to Germany and then back to Canada, using a “ping-pong” technique with zero-downtime.
That means that the testing of the on-line migration to NoteDb has been tried *already* on the Canada-DC a few days ago and it actually succeeded and Gerrit stored the “last known sequence number” in ReviewDb into NoteDb.

The second NoteDb migration yesterday followed exactly the same trace of the previous test made on Canada-DC but, this time, the “last known sequence number” was not updated.
That is an “edge-case” that was not foreseen when writing the code and has produced the failures experienced by new changes.

Gerrit NoteDb code is very resilient and immediately detected the situation and avoided to insert and index the changes with conflicting IDs.

Statistics of migration

  • Total migration time: 2h 23m
  • Reaction time to investigate failures: 36m
  • Resolution time: 2h
  • Software fix: 13h
  • Number of changes impacted: 33 over 400k – 0.008%
  • Number of projects impacted: 14 over 14k – 0.1%
  • Data loss: 0%
  • Incidents created and closed: 3

Current situation

No more errors or problems reported, production is stable

Leave a comment