GerritHub gets bigger on Mon, 21 March 08:00 GMT
GerritHub has experienced unprecedented growth over the past two years. The November 2015 numbers presented at the Google User Summit in Mountain View – CA have been surpassed again, and we do need to make sure that our infrastructure is still capable of dealing with current and future users’ needs.
What is changing in GerritHub.io?
We are changing everything, from the version of Gerrit to the hardware, network and storage infrastructure. Data, DBMS, Indexes and cache, need to be upgraded and refreshed to make sure that the new systems are reflecting exacting the current production data and sessions.
We are changing as well the geo-location of our servers, from the current server farm in Germany (Bayern, Nuremberg – 100 MBps) to a new server farm in Canada (Quebec, Beauharnois – 1 GBps).
Why have so many changes?
We started to measure some significant delay in the Git and review operations on the old infrastructure, mainly due to three factors:
- More users, more repositories, more concurrency. Individuals, OpenSource projects and Businesses started using GerritHub.io for their mission-critical repositories, considering Gerrit the “source of truth” of their review workflow. We needed more horsepower, memory, storage and ability to scale even further.
- Bandwidth from USA and Far-east. The majority of people using GerritHub.io are from the other side of the Atlantic Ocean: this is typically not a problem from 7 AM to 3 PM … but after 4 PM the connectivity between Europe and the Americas becomes slow. Additionally, people using GerritHub.io from India, Japan, Australia and New Zealand experienced terrible slowdowns because of the excessive number of hops to reach Germany.
- Gerrit master is much faster. Based on the current data and metrics measured on GerritHub.io, we have contributed a lot of patches to reduce the overhead caused by Gerrit DB and lessen the number of SQL queries per minute. All those new improvements are on Gerrit master, and we need to catch-up with the “latest and greatest” version.
Will I experience any GerritHub.io outage?
Last time that GitHub needed to make a major upgrade, asked his 5M users to stop working for 23 minutes,. This translates to a loss of two millions of hours of continuous delivery lifecycle, equivalent to over 130 man/years, worth no less than eight millions dollars.
We are going to adopt a new Zero-Downtime Gerrit roll-out strategy to make sure that all those changes are not going to impact your day-by-day activity. If you were not reading this post you would possibly even not notice the “switch” from the old to the new infrastructure, apart from the increase in speed and bandwidth.
Zero-downtime GerritHub.io migration, step by step with the associated expected timings.
Phase 0 – Replication to the new Gerrit infrastructure. (- 1 month ago)
We started migrating everything one month ago, and the old and new infrastructure are working side-by-side, thanks to Gerrit master-slave replication. The new Gerrit servers are active as slaves and are read-only.
Phase 1 – Migration kick-off. (08:00 GMT)
We install a Gerrit plugin that rejects all the push to GerritHub.io repositories providing a courtesy message: “Gerrit is under maintenance, all projects are READ ONLY”. All the HTTP POST, PUT and DELETE are disable on the Gerrit REST-API.
Phase 2 – Wait for replication events to complete and migrate DB. (08:02 GMT)
Git repositories are continuously replicated, but we do need to make sure that the event queue is empty. Once that happens we schedule the last final DB migration to the new infrastructure.
Phase 3 – Gerrit DB upgrade and reindex (08:04 GMT)
New Gerrit server executes the final upgrade and off-line reindex of the latest received changes.
Phase 4 – Gerrit start-up and cache warm-up (08:05 GMT)
New Gerrit is restarted and the most critical Gerrit caches (projects, accounts and groups) are pre-loaded in memory. This allows the incoming traffic spike to avoid the collapse of used threads and makes the transition as smooth as possible without slowdowns.
Phase 5 – Traffic switch and DNS updates (08:06 GMT)
GerritHub.io redirects all incoming HTTPS and SSH traffic to the new infrastructure. Git pushes and HTTP PUT, POST and DELETE operations of the REST API are operational again and served by the new Gerrit infrastructure. GerritHub.io DNS is updated to the new Canadian IPs.
Phase 6 – New IPs gets propagate to all worldwide DNSs (+ 1 day)
Once all the DNSs in the world would have been updated, everyone will start going directly to the new infrastructure without further hops or redirection from Germany. Customers from USA, Canada, South America, Asia, Japan, New Zealand and Australia should see a significant reduction of the network latency and increase of GerritHub.io responsiveness.
Firewall and SSH considerations
Even if Gerrit server’s SSH key is not changing, some of you may see a warning similar to this when they push or pull over SSH:
Warning: the RSA host key for ‘review.gerrithub.io’ differs from the key for the IP address ‘220.127.116.11’
The warning message will also tell you which lines in your ~/.ssh/known_hosts need to change. Open that file in your favorite editor, remove or comment out those lines, then retry your push or pull.
Should your network have some strict firewall rules to access external sites, you may want to whitelist the IP of the new infrastructure WLB to: 18.104.22.168.