Gerrit User Summit 2017, 2-3 Oct, London

Posted on September 7, 2017 by Git and Gerrit Code Review for the Enterprise

New and exciting features are coming for this year Gerrit User Summit, with the launch of Ver. 2.15, NoteDb, high-availability, multi-master and much more.

The Summit will take place for the very first time in Europe, London, the location chosen by the community after a public consultation, the 2nd and 3rd of October at CodeNode (Skills Matter).

There are still a few places available but hurry up and register now at https://gerritusersummit.eventbrite.com.

See below an overview of the topics that will be presented and discussed during the User Summit.

What’s new in Gerrit 2.14.x.

Gerrit v2.14 was released during the last Hackathon in April and has gone through three patch releases. David Pursehouse from CollabNet will give an overview of the new features introduced which would be highly beneficial for all of those who haven’t migrated yet.

Gerrit at Google: Multi-master, Mutli-tenant.

Google is the founder, main contributor and possibly the most advanced user of the Gerrit Code Review: learning from their experience is a unique opportunity to learn and being able to leverage and use the tool at its best.

Patrick Hiesel from Google will go through the insights of their Gerrit Code Review architecture and will provide some of their metrics of scale. In addition to that, he will present some findings from the recent switch of their load-balancing infrastructure and the associated pitfalls encountered.

Google is possibly the only one in the world using Gerrit in a multi-tenant setup, having a unique multi-master installation that serves a constellation of domains and projects, including huge and familiar ones like Android and Chromium.

Standing “on the shoulders of giants” like Google helps a lot in preventing scalability issues as the audience and adoption of Gerrit Code Review grows in large companies: being part of the audience in the talk is a unique opportunity to learn and ask questions directly to the maintainers of their infrastructure.

PolyGerrit: a new UX experience for Gerrit Code Review

Google has invested a lot in reinventing and reengineering the user interface of Gerrit Code Review, which remained mostly unchanged for almost a decade. A new team has been put together in their San Francisco offices with experienced UX developers that leveraged the new Polymer framework of web components.

The result is PolyGerrit, a modern web UX which provides an unprecedented browsing speed and flexible rendering across different devices, including mobile and tablets.

The PolyGerrit Team will be presenting the findings of their user-experience research and show some of the features and insights of the new UX.

Gerrit CI and keeping logs forever.

Gerrit Code Review itself is a large project, involving over 300 developers across the globe and using the most advanced DevOps practices. The CI/CD pipeline has been provided and managed by GerritForge on the https://gerrit-ci.gerritforge.com and Luca Milanesio from GerritForge will present the latest improvements in the pipeline plus an interesting way of collecting and reusing the logs.

Leveraging the logs for identifying the bottlenecks of the CI/CD pipeline is the way to drive improvement. GerritForge leveraged the expertise of his engineers to harvest and organize data and will give it back to the community as powerful dashboards.

Beyond Gerrit.

Gerrit is great. However, it is also quite an important part of a bigger ALM process. Jacek Centkowski from CollabNet will describe how multiple tools can be unified under a single TeamForge umbrella and what are the immediate benefits of it.

What’s coming in Gerrit 2.15

After only four months, we are already close to the v2.15 of Gerrit Code Review, which would be possibly the last one before the step to the v3.0.

Dave Borowitz from Google, principal maintainer of the Gerrit Code Review project, will go through the new features of v2.15 and possibly give a glimpse in what to expect from v3.0.

Mining Gerrit Data to Study Contentious Reviews and Community Evolution

Gerrit Code Review is much more than a tool, it is a way for people working together in companies that are large and mostly distributed across the globe.

Shane McIntosh from McGill University has been running a research lab on this topic. The Software REBELs—a research lab at McGill University—mine code review data to study topics like the impact that code review practices have on software release and design quality. Our more recent work mines code review data to study the reviewing process itself. In this talk, I will describe the results of two empirical studies of data that we collected from the Gerrit instances of the OpenStack project. The first study aims to understand the reviews where reviewers disagree about a patch. The second study follows how the concerns that reviewers raise evolve as the OpenStack community ages and individual reviews accrue experience.

Gerrit Analytics: dashboards, networks, KPI

Gerrit has always been lacking major code analytics features compared to other Git Server tools like GitBlit or GitLab. GerritForge Ltd is filling the gap and adds one important asset to the Gerrit Code Review platform: code review analytics.

We need to harvest and unify the logs and events coming from the different components of the CI/CD pipeline by putting at the center of it the people and teams that are building and discussing the code on Gerrit. The resulting data-lake of information can be later analyzed and correlated to calculate the cycle time of the entire pipeline.

Luca Milanesio from GerritForge will show the new analytics dashboards that are going to be published and provided back to the Team that is developing the Gerrit Code Review project as a precious contribution to the community.

How to extend Gerrit using Scripting Plugins

Gerrit Code Review has a robust set of API that can be used to extend its functionalities and provide a more integrated development workflow for the Teams.

Luca Milanesio from GerritForge will present how to use different scripting tools to extend the capabilities of Gerrit without the need of developing and building a plugin, using Jython, Groovy and Scala.

A new simpler but powerful Gerrit Jenkins plugin

Gerrit Code Review is an essential part of a larger CI/CD pipeline. Most of the times it is used in conjunction with Jenkins, the most popular OpenSource Continuous Integration and Delivery tool.

The integration between Gerrit and Jenkins (Gerrit Trigger Plugin) was developed back in 2010 at Sony and since then has been extended and adopted in thousands of Jenkins installations. However, Jenkins has evolved too and has now a brand new concept and definition of multi-branch pipeline which struggles to be seamlessly integrated with the current Gerrit Trigger Plugin.

Luca Milanesio from GerritForge will present a brand new plugin based on the new Jenkins branch discovery API which works seamlessly with Jenkins multi-branch pipelines and provides a simpler interface with Gerrit by leveraging the new WebHooks.

Diffy with enterprise grade

Since 2012 CollabNet has been working on improving Gerrit integration with TeamForge. Many features have been created to satisfy the needs of enterprise customers. Eryk Szymanski from CollabNet will present features like RBAC, history protection, Git style notifications, quality gates, pull request and code browser which have been implemented on top of vanilla Gerrit.

Q&A with the maintainers

Have you ever wondered why something is working in a certain way? Have you ever wanted to explain any complaint about some parts of Gerrit? Would you give your congratulation to the people that made this project? Would you like to make a feature request or propose new ideas?

This is the moment where you can speak directly face-to-face to the people that are building this project every single day, the Gerrit maintainers.

The event is free for everyone, thanks to the contribution of our sponsors, CollabNet Inc, GerritForge Ltd and Skills Matter Ltd.

Gerrit Upgraded with No Downtime

Posted on March 21, 2016 by Git and Gerrit Code Review for the Enterprise

Zero DownTime success story.

From today at 08:06 GMT GerritHub users are served by our brand new infrastructure geo-located in Canada, Quebec, Beauharnois. It is the first time we applied a zero-downtime roll-out scheme, the PingDom uptime for the past 24h reported 100% uptime and 688 msec average response time for the page of the list of opened changes. The two response times spike on the above graphs are actually due to the old German infrastructure and happened before the start of the roll-out.

We can see the switch of the traffic to the new infrastructure from the increase of the overall response time (IP packets were routed from Germany to Canada causing extra hops); as the DNS propagation was spreading across the world, the overall number of hops gradually came back to normal.

Timeline of the events.

08:00:00 GMT – Phase 1 – Set Gerrit READ-ONLY. All changes and Git repositories started to refuse push and updates.
08:00:01 GMT – Phase 2 – Wait for pending replication to complete. Replication queue was empty; there was no need to wait.
08:00:02 GMT – Phase 3 – Mirror DB and Git for the last time, delta-reindex, DB upgrade and Gerrit restart. It has been the longest part of the roll-out and lasted 5′ 32”, aligned with our estimates.
08:05:34 GMT – Phase 4 – Cache warm-up. 20K projects, 8K accounts and 4.6K groups were pre-loaded in Gerrit. This step was optional but allowed us to redirect all the traffic without risks of causing thread spikes on the new infrastructure.
08:06:23 GMT – Phase 5 – Redirect traffic to the new infrastructure.

Did anybody notice the rollout?

During the rollout the Git projects and Gerrit changes were read-only for 6′ and 23”. According to the logs, 493 Git/HTTP and 172 Git/SSH invocations were made and completed successfully: none of them failed.

What is the situation right now?

The new infrastructure public IP (192.99.233.76) has almost completed his DNS propagation around the world, the only countries not entirely covered are Australia and China. The rest of the world is coming directly to Canada avoid the German hops. Metrics are good, low CPU utilization and threads consumption compared to the old German infrastructure, symptom of the reduction of the execution and serving times and latency.

What’s next?

From now on we will continue to use this Blue/Green roll-out strategy, possibly improving in the ReadOnly window by introducing live distributed reindex and cache warm-up.

We fully commit to Zero-Downtime and Stability, the most valuable assets for our clients.

GerritHub and Zero-Downtime Upgrade

Posted on March 16, 2016 by Git and Gerrit Code Review for the Enterprise

GerritHub gets bigger on Mon, 21 March 08:00 GMT

GerritHub has experienced unprecedented growth over the past two years. The November 2015 numbers presented at the Google User Summit in Mountain View – CA have been surpassed again, and we do need to make sure that our infrastructure is still capable of dealing with current and future users’ needs.

What is changing in GerritHub.io?

We are changing everything, from the version of Gerrit to the hardware, network and storage infrastructure. Data, DBMS, Indexes and cache, need to be upgraded and refreshed to make sure that the new systems are reflecting exacting the current production data and sessions.
We are changing as well the geo-location of our servers, from the current server farm in Germany (Bayern, Nuremberg – 100 MBps) to a new server farm in Canada (Quebec, Beauharnois – 1 GBps).

Why have so many changes?

We started to measure some significant delay in the Git and review operations on the old infrastructure, mainly due to three factors:

More users, more repositories, more concurrency. Individuals, OpenSource projects and Businesses started using GerritHub.io for their mission-critical repositories, considering Gerrit the “source of truth” of their review workflow. We needed more horsepower, memory, storage and ability to scale even further.
Bandwidth from USA and Far-east. The majority of people using GerritHub.io are from the other side of the Atlantic Ocean: this is typically not a problem from 7 AM to 3 PM … but after 4 PM the connectivity between Europe and the Americas becomes slow. Additionally, people using GerritHub.io from India, Japan, Australia and New Zealand experienced terrible slowdowns because of the excessive number of hops to reach Germany.
Gerrit master is much faster. Based on the current data and metrics measured on GerritHub.io, we have contributed a lot of patches to reduce the overhead caused by Gerrit DB and lessen the number of SQL queries per minute. All those new improvements are on Gerrit master, and we need to catch-up with the “latest and greatest” version.

Will I experience any GerritHub.io outage?

Last time that GitHub needed to make a major upgrade, asked his 5M users to stop working for 23 minutes,. This translates to a loss of two millions of hours of continuous delivery lifecycle, equivalent to over 130 man/years, worth no less than eight millions dollars.
We are going to adopt a new Zero-Downtime Gerrit roll-out strategy to make sure that all those changes are not going to impact your day-by-day activity. If you were not reading this post you would possibly even not notice the “switch” from the old to the new infrastructure, apart from the increase in speed and bandwidth.

Zero-downtime GerritHub.io migration, step by step with the associated expected timings.

Phase 0 – Replication to the new Gerrit infrastructure. (- 1 month ago)
We started migrating everything one month ago, and the old and new infrastructure are working side-by-side, thanks to Gerrit master-slave replication. The new Gerrit servers are active as slaves and are read-only.

Phase 1 – Migration kick-off. (08:00 GMT)
We install a Gerrit plugin that rejects all the push to GerritHub.io repositories providing a courtesy message: “Gerrit is under maintenance, all projects are READ ONLY”. All the HTTP POST, PUT and DELETE are disable on the Gerrit REST-API.

Phase 2 – Wait for replication events to complete and migrate DB. (08:02 GMT)
Git repositories are continuously replicated, but we do need to make sure that the event queue is empty. Once that happens we schedule the last final DB migration to the new infrastructure.

Phase 3 – Gerrit DB upgrade and reindex (08:04 GMT)
New Gerrit server executes the final upgrade and off-line reindex of the latest received changes.

Phase 4 – Gerrit start-up and cache warm-up (08:05 GMT)
New Gerrit is restarted and the most critical Gerrit caches (projects, accounts and groups) are pre-loaded in memory. This allows the incoming traffic spike to avoid the collapse of used threads and makes the transition as smooth as possible without slowdowns.

Phase 5 – Traffic switch and DNS updates (08:06 GMT)
GerritHub.io redirects all incoming HTTPS and SSH traffic to the new infrastructure. Git pushes and HTTP PUT, POST and DELETE operations of the REST API are operational again and served by the new Gerrit infrastructure. GerritHub.io DNS is updated to the new Canadian IPs.

Phase 6 – New IPs gets propagate to all worldwide DNSs (+ 1 day)
Once all the DNSs in the world would have been updated, everyone will start going directly to the new infrastructure without further hops or redirection from Germany. Customers from USA, Canada, South America, Asia, Japan, New Zealand and Australia should see a significant reduction of the network latency and increase of GerritHub.io responsiveness.

Firewall and SSH considerations

Even if Gerrit server’s SSH key is not changing, some of you may see a warning similar to this when they push or pull over SSH:

Warning: the RSA host key for ‘review.gerrithub.io’ differs from the key for the IP address ‘148.251.77.70’

The warning message will also tell you which lines in your ~/.ssh/known_hosts need to change. Open that file in your favorite editor, remove or comment out those lines, then retry your push or pull.

Should your network have some strict firewall rules to access external sites, you may want to whitelist the IP of the new infrastructure WLB to: 192.99.233.76.

Follow GerritHub.io migration progress.

We will advertise the migration progress on Twitter at @GitEnterprise. Should you have any issue you can tweet us or contact GerritForge Customer Support at support@gerritforge.com.

Gerrit User Summit & Hackathon 2015

Posted on November 16, 2015 by Git and Gerrit Code Review for the Enterprise

Gerrit User Summit at Google Mountain View – CA

Exciting times this year at the Gerrit User Summit and Hackathon 2015: the major contributors and players of the Gerrit community shared experiencing, opinions and news in an intense 7-days event in Google – Mountain View – CA.

The User Summit

The first two days have seen the User Summit at the centre of the stage: scalability, scalability and scalability have been the Leitmotif of the discussion.
GerritHub.io started the saga with the astonishing numbers of the two years growth:

6.5K users (+580%)
41K changes (+1700%)
16K projects (+530%)
500GBytes (+500%)

The two years of live production experience with users coming from GitHub have highlighted problems (replication, repositories sharding, multi-master) and possible solutions, ranging from Gerrit Virtual Private Hosting to the on-premises deployment integrated with BitBucket and GitHub:Enterprise.
Ericsson continued with a very detailed description of how Gerrit is used as “Enterprise Washing Machine” of all code that goes through the development pipeline: scalability and control are the fundamental keywords that were repeatedly mentioned and enforced.
The replication across sites have been massively improved regarding performance and stability since the introduction of Git/HTTPS as the main protocol for replication, as previously advised by GerritForge in 2014.
The first day ended with Google smashing out everyone his hypersonic numbers:

1.6M changes / 3.8M patch-sets
240 Gerrit virtual nodes
2.5M repositories

The second day was all projected on the future of Gerrit, with three very interesting features coming soon.

Gerrit 3.0 and NoteDB

Dave Borowitz presented the status of the replacement of Gerrit DBMS
with a 100% open standard solution based on commit notes, implemented by all OpenSource and Commercial “flavours” of Git. The solution will allow interoperability with other code-review implementations (e.g. Phabricator) and fully enable reviews replication and off-line operations.
The new DBMS-free version of Gerrit will be called the Ver. 3.0 and it will be the next version after current Gerrit master (Ver. 2.13) gets released. released.

Gerrit PluginManager: one plugin to rule them all

Luca Milanesio presented a new vision on how to discover and install plugins on the Gerrit platform: Gerrit PluginManager. We should not bundle more and more plugins with Gerrit, which eventually would lead to an explosion of the WAR file-size. You can now install only a “plugin-manager” which then guide you through the “one-click” set-up of all the others.
Differently from similar solutions such as Jenkins Update Centre, the Gerrit PluginManager is based on the live status of the plugins and their compatibility with Gerrit: as soon as one plugin gets patched and successfully compiled with a version of Gerrit, it will automatically be listed and made available by the PluginManager. Additionally GerritForge will provide a list of certified and guaranteed plugins that have been successfully tested with Gerrit.

PolyGerrit, the new web-component UX for Gerrit.

Andrew Bonventre is the Googler that is driving the transformation of Gerrit UX to the new Polymer platform, however for personal reasons he was not able to attend the Summit. Dave Borowitz replaced him and unveiled that will be the (very) near future of Gerrit UX: no more GWT and complex and black magic transforming Java in obscure JavaScript, the future is coming through web-components, a new emerging and promising standard for the HTML and modern web-browsers.
Despite the UX being at very early stages, Dave was able to showcase a fully functional list of changes and search box powered by Polymer and calling Gerrit REST-API, really fast and promising!

Gertty, the text-only Gerrit Code Review.

On the complete opposite side, why not using Gerrit from an 80×25 text-only console? Gertty is an astonishing 100% char-based console, all based on Gerrit REST-Api and a local SqLite local DB for caching changes. Allows complete off-line operations and synchronisation with Gerrit changes. Productive and effective while you are on the go.

CollabNet and Gerrit tuning cheat-sheet

CollabNet presented a useful four pages brochure to guide through the tuning of a Gerrit set-up for a small, medium or large installation. Based on their experience of running TeamForge SCM, the commercial fork of Gerrit Code Review based on their existing TeamForge ALM proprietary solution, they have been able to experiment the Gerrit default settings and learn how to adjust them to leverage the full power of your setup.
The audience appreciated the effort and encouraged CollabNet to post all the findings as Gerrit reviews to get the code-base better and improve the default settings of the set-up.

Gerrit and Jenkins Workflow dance together with Docker.

Cloudbees presented a very effective demo on how to use Gerrit and Jenkins Workflow plugin to implement a real-life Continuous Delivery Pipeline. The presentation leveraged the use of Docker images as previously presented by Stefano Galarraga in his “Gerrit and Jenkins Continuous Delivery Pipeline for BigData” talk. They both explained and showed how Code Review is key in implementing a successful and smooth code validation and roll-out. Stefano’s presentation made use of the new exciting feature of “topic submission” that enables the grouping and commit of multiple changes across repositories.

The Hackathon, topics and improvements.

A lot of code have been pushed during the hackathon:

400 changes merged
3 Gerrit version released

Gerrit metrics

Surely the hot-topic of the hackathon has been the introduction of a new pluggable metrics engine in Gerrit, currently half-merged in master branch. Gustaf demoed on how now is possible to use standard tools such as Graphite and JMX console to extract, display and graph the most relevant Gerrit metrics in real-time. This is similar to what the JavaMelody plugin was providing, with the added value of getting the data outside the Gerrit JVM and analyse with greater detail and a standard monitoring platform.

PolyGerrit

Gerrit master has been officially upgraded to allow the development of the new Polymer-based UX for Gerrit, code-named “PolyGerrit”. Gerrit master will need from now on the installation of NodeJS during development. This is needed for building and packaging the “vulcanised” version of Gerrit UX which contains the basic components of the user interface. At the moment the only thing that will be visible is the demo of list of changes presented by Dave Borowitz at the User Summit, however new changes are coming over and Google announced that they are targeting Q4-2015 for a first internal release of the new UX.

GerritForge CI Verifier

As Gerrit 3.0 will be completely revamped in terms of reviews persistence, the community pushed for having a stricter changes validation on both the old DBMS and new NoteDB based persistence. GerritForge extended the use of the CI system (https://gerrit-ci.gerritforge.com) to cover the validation of every change / patch-set that will be uploaded to Gerrit from now on. This is a substantial improvement on the Code Review workflow of Gerrit itself and will hopefully contribute to a stable and solid Ver. 3.0 release next year.

Externalisation of Gerrit Hooks and Events to plugins

Qualcomm worked at completing the externalisation of Gerrit hooks and stream events into plugins. This change will allow to plug different events providers depending on the type of Gerrit set-up, single node or multi-master. One more important step towards an OpenSource implementation of Gerrit Multi-Master.

Gerrit Code Review and Jenkins Continuous Delivery Pipeline on BigData

Posted on July 21, 2015 by Git and Gerrit Code Review for the Enterprise

Gerrit at the Jenkins User Conference 2015 – London

For the very first time, CloudBees organised a full User Conference in London and we have been very pleased to speak to present a real-life case-study of Continuous Integration and Continuous Delivery applied to a large-scale BigData Project.

See below a summary of the overall presentation published on the above YouTube video.

The trap of the BigData production phase

BigData has been historically used by data scientists in order to analyse data and extract features that are relevant for the business. This has typically been a very interactive process happing mostly on “notebook-style” environments where almost everything, from ad-hoc queries and graphs, could have been edited and executed interactively. This early stage of the process is typically known as “exploration” or “prototype analysis” phase. Sometimes last only a few days but often is used as day-by-day modus operandi.

However when the exploration phase is over, projects needed to be rewritten or adapted using a programming language (Scala, Python or Java) and transformations and aggregations expressed in jobs. During the “production-isation” phase code needs to be properly written and tested to be suitable for production.

Many projects fall into the trap of reducing the “production phase” to a mere translation of notebooks (or spreadsheets) into Scala, Java or Python code, relying only on the manual analysis of the resulting data as unique testing methodology. The lack of software engineering practices generates complex monolithic code, difficult to maintain, to understand and thus to validate: the agility of the initial “exploration” phase was then miserably lost in the translation into production code.

Why Continuous Delivery on BigData?

We have approached the development of BigData projects in a radically different way: instead of simply relying on existing tools, often not enough for setting up a proper Agile Delivery Pipeline, we introduced brand-new frameworks and applied them to the building blocks of a Continuous Delivery pipeline.

This is how Stefano Galarraga wrote started the ScaldingUnit project, aimed in de-composing the development of complex Scalding MapReduce jobs in simple and testable units.

We started then to benefit from the improved Agility and speed of delivery, giving constant feedback to data-scientists and delivering constant value to the Business stakeholders during the production phase. The talk presented at the Jenkins User Conference 2015 is smaller-scale show-case of the pipeline we created for our large clients.

Continuous Delivery Pipeline Building Blocks

In order to build a robust continuous delivery pipeline, we do need a robust code-base to start with: seems a bit obvious but is often forgotten. The only way to create a stable code-base, collectively developed and shared across different [distributed] Teams, is to adopt a robust code review lifecycle.

Gerrit Code Review is the most robust and scalable collaboration system that allows distributed teams to submit their changes and provide valuable feedback about the building blocks of the BigData solution. Data scientists can participate as well during the early stage of the production code development, giving suggestions and insight on the solution whilst is still in progress.

Docker provided the pipeline with the ability to define a set of “standard disposable systems” to host the real-life components of the target runtime, from Oracle to a BigData CDH Cluster.

Jenkins Continuous Integration is the glue that allowed coordinating all the different actors of the pipeline, activating the builds based on the stream events received from Gerrit Code Review and orchestrating the activation of the integration test environments on Docker.

Mesos and Marathon managed all the physical resources to allow a balanced allocation of all the Docker containers across the cluster. Everything has been managed through Mesos / Marathon, including the Gerrit and Jenkins services.

Pipeline flow – Pushing a new change to Gerrit Code Review

The BigData pipeline starts when a new piece of code is changed on the local development environment. Typically developers test local changes using the IDE and the Hadoop “local mode” which allows the local machine to “simulate” the behaviour of the runtime cluster.

The local mode testing is typically good enough for running unit-tests but often is unable to detect problems (e.g. non-serialisable objects, compression, performance) that are likely to appear in the target BigData cluster only. Allowing to push a code change to a target branch without having tested on a real cluster represent a potential risk of breaking the continuous delivery pipeline.

Gerrit Code Review allows the change to be committed and pushed to the Server repository and built on Jenkins Continuous Integration before the code is actually merged into the master branch (pre-commit validation).

Pipeline flow – Build and Unit-tests execution

Jenkins uses the Gerrit Trigger Plugin to fetch the code currently under review (which is not on master but on an open change) and triggers the standard Scala SBT build. This phase is typically very fast and takes only a few seconds to complete and provide the first validation feedback to Gerrit Code Review (Verified +1).

Until now we haven’t done anything special of different than a normal git-flow based continuous integration: we pushed our code and we got it validated in Jenkins before merging it to master. You could actually implement the pipeline until this point using GitHub Pull Requests or similar.

Pipeline flow – Integration test automation with a real BigData Cloudera CDH Cluster

Instead of considering the change “good enough” after a unit-test validation phase and then automatically merging it, we wanted to go through a further validation on a real cluster. We have completely automated the provisioning of a fully featured Cloudera CDH BigData cluster for running our change under review with the real Hadoop components.

In a typical pipeline, integration tests in a BigData Cluster are executed *after* the code is merged, mainly because of the intrinsic latencies associated to the provisioning of a proper reproducible integration environment. How then to speed-up the integration phase without necessarily blocking the development of new features?

We introduced Docker with Mesos / Marathon to have a much more flexible and intelligent management of the virtual resources: without having to virtualise the Hardware we were able to spawn new Docker instances in seconds instead of minutes ! Additionally the provisioning was coordinated by the Docker Build Step Jenkins plugin to allow the orchestration of the integration tests execution and the feedback on Gerrit Code Review.

Whenever an integration test phase succeeded or failed, Jenkins would have then submitted an “Integrated +1/-1” feedback to the original Gerrit Code Review change that triggered the test.

Pipeline flow – Change submission and release

When a change has received the Verified+1 (build + unit-tests successful) and Integrated+1 (integration-tests successful) is definitely ready to be reviewed and submitted to the master branch. The additional commit triggers the final release build that tags the code and uploads it to Nexus ready to be elected for production.

Pipeline flow – Rollout to production

The decision to rollout to production with a new change is typically enabled by a continuous delivery pipeline but manually operated by the Business stakeholders. Even though we could *potentially* rollout every change, we did not want *necessarily* do that because of the associated business implications.

Our approach was then to publish to Nexus all the potential *candidates* to production and roll-them-out to a pre-production environment, ready to be assessed by Data-Scientists and Business in real-time. The daily job scheduler had a configuration parameter that simply allowed to “pointing” to the version of the code to run every day. In this way whatever is deployed to Nexus is potentially fully working in production and rollout or rollback a release is just a matter of changing a label in the daily job scheduler.

Summary

Building a Continuous Delivery Pipeline for BigData has been a lot of fun and improved the agility of the Business in rolling out changes more quickly without having to compromise on features or stability.

When using a traditional Continuous Integration pipeline, the different stages (build + unit-test, integration-tests, system-tests, rollout) are all happening on the target branch causing it to be amber or red at times: whenever tests are failing the pipeline need to be restarted from start and people are blocked.

By adopting a Code Review-driven Continuous Integration Pipeline we managed to get the best of both worlds, avoiding feature branches but still keeping the ability to validate the code at each stage of the pipeline and reporting it back to the original change and the associated developer without to compromise the stability of the target branch or introducing artificial and distracting feature branches.

Resources

The slides of the talk are published on SlideShare.

All the docker images used during the presentation are available on GitHub:

GitHub outage, again :-( What is the real cost of FREE services?

Posted on May 6, 2015 by Git and Gerrit Code Review for the Enterprise

As a bitter surprise today, we are experiencing another GitHub outage. This time it seems a more serious problem than the average DDoS: GitHub’s Ops Team is perform an emergency maintenance on the whole site to recover the situation.

How much a FREE GitHub Service outage really costs me?

Everyone loves GitHub because it is nice, easy and most of all … it’s FREE ! Lots of projects started using it for much more than pure source code versioning:

People write books and documentation with it (see gitbook.com)
Teams started using it as free artifacts repository manager: projects wouldn’t build at all when GitHub is down
Companies started hosting web-pages on GitHub (see the nicely rendered microsoft.github.io)
GitHub Issue tracking and wikis are so simple that people are using for project collaboration

When everything works, it is amazing how your Team can be productive using GitHub on a daily basis. But when it fails, what can you do? And what if my Team cannot progress because they can’t see the tasks, wikis, requirement documents, web-pages … how much money am I really wasting when people is hanging around for hours?

Let’s consider a small Agile Team composed by a 1 x BA, 8 x Agile Devs, 1 x Scrum Master, 2 x DevOps and 2 x QA: a 30′ minutes outage like the one today would have an impact on 16 people of 1 man/day that means (for the US market) roughly $1,000 (as optimistic guess, it may cost even more). Even if GitHub goes down twice a year (gosh this happened more than twice I am afraid) your start-up will end up paying around $ 2,000 /year for GitHub. The overall amount doesn’t sound that expensive … but you wonder why GitHub “was supposed to be really FREE” if you end up spending money with it.

If we apply the same figures to a medium size company with at least 160 people working on development, your overall figure would jump to $20,000 /year. More importantly the time lost and delays caused on the project schedule may then have an avalanche effect on other teams and maybe causing additional pain and costs across your organisation and programme plan. Those extra costs can be sometimes difficult to quantify but for sure are much more relevant on your overall business.

Shall we give up using GitHub then? Or shall we move to GitHub:Enterprise instead?

The typical reaction to a GitHub outage is: “we cannot rely on the FREE version, we should buy GitHub:Enterprise which will run inside our company network” and use this argument with your manager to get a Purchase Order finalised NOW (I may be too malicious … but a outage may actually generate more money to GitHub than loss of reputation). When you look at the GitHub:Enterprise pricing it ends up that for your 160 people you would need to spend only $36,000 /year which seems on the same order of magnitude of your $20,000 wasted money without considering the extra hidden costs of project delays.

But are you really solving the problem? GitHub and GitHub:Enterprise are the same product, same code-base, just different pricing. What makes you wonder that your internal Ops Team can do a better job than GitHub? What makes you wonder that a GitHub bug would not appear on your GitHub:Enterprise set-up? Are you just an optimistic person?

Moving to GitHub:Enterprise is typicall needed when you have compliance / security requirements on data at-rest, but is not really addressing the problem of reliability and would potentially expose your Team to even further outages for software upgrade and management that typically you don’t have using GitHub alone. You are then spending $36,000 on top of your $20,000 (or even more) wasted previously without having real benefits.

Learning how to fly with GitHub

How to solve the problem then? Can we learn from somebody’s else experience?

Airplanes have exactly (if not even more demanding) requirements on their engines as we on a Version Control System. For an aircraft the cruising speed is everything, without that speed provided by its engines he cannot fly; we have similar requirements in our Development Team where GitHub is really what we need for progressing our development otherwise we are blocked.

The solution to the problem for an airplane to be reliable is not buying more expensive engines (which are not necessarily more reliable) but instead using two engines instead of one. Can we apply the same to GitHub? GitHub is in a nutshell a Git Server, why not relying on redundancy and replication? Can I set-up a replica of GitHub and use it for my reviews?

You can of course build your own replica using plain Git and GitHub WebHooks: it would require a bit of scripting but it can be done. During an outage you can use the replica and when GitHub is back all the pending changes can be pushed back to GitHub.

Can I have another FREE and automated replica of GitHub?

This is becoming challenging now: we want something that is completely FREE (no time spent in writing scripts, webhooks, no service provider to pay, no commercial product) but that allows us to use GitHub replicated, including Code Reviews.

It seems strange but what we are looking for actually exists and it is an OpenSource project called Gerrit Code Review. It is not only a Code Review and Git Server like GitHub but offers as well more advanced security and replication capabilities. It has been designed taking into account the needs of large distributed Teams and making their daily development lifecycle more reliable independently from local failures.

Cool, how can I get started with Gerrit and GitHub now with no hassles?

You read this quick introduction for getting started in setting up your private replica or, you are really in a hurry and you wanted a FREE hosted service, you can sign-up with 3 clicks to GerritHub.io.

I have only 5 mins of free time today: what can I read/watch to understand how it works?

Well, there are plenty of resources but if you are really in a hurry, you can watch the following YouTube Video:

If you have more time, you can read the Gerrit Code Review overview and tutorial at: https://review.gerrithub.io/Documentation/intro-quick.html

Get ready now to avoid wasting again money when the next GitHub outage … that nobody wishes … will (sadly) happen 😦

No more tears with Gerrit Code Review, thanks to Docker

Posted on April 30, 2015 by Git and Gerrit Code Review for the Enterprise

Gerrit has finally landed on Docker!
The first official set of Docker images have been uploaded and are available on DockerHub.

Gerrit with its default configuration is now available for CentOS 7 and Ubuntu 15.04

Thanks to the new Gerrit installer project a new set of native distribution means are becoming available: starting from the native packages for Linux to the Docker images published today.

Why Docker?

Well, why not? Docker is an amazing technology for packaging an application with its dependencies and activating it in a sandbox virtualised environment. It allows a more effective isolation of an application container and at the same time assures to have a clear separation between the application distribution and its data.

Additionally for those who want to run more than one Gerrit instance at a time, it allows to define specific QoS for the activated docker containers and assign to them an internal IP and ports to be routed in a multi-hosted environment.

How can I get Gerrit on Docker?

Well, it is simpler than you can imagine. Once you’ve installed Docker on your Linux box you just need to execute the following commands:

To download and run Gerrit on CentOS 7:

$ docker pull gerritforge/gerrit-centos7
$ docker run -d -p 8080:8080 -p 29418:29418 \
  gerritforge/gerrit-centos7

To download Gerrit and run Ubuntu 15.04:

$ docker pull gerritforge/gerrit-ubuntu15.04
$ docker run -d -p 8080:8080 -p 29418:29418 \
  gerritforge/gerrit-ubuntu15.04

Gerrit will be started inside a Docker container and will be exposed on ports 8080 and 29418 on the host machine IP.

How can I customise my Docker container?

Gerrit Dockerfiles are available in the gerrit-installer project and can be easily customised and tailored to your needs. A much better idea however would be to generate a new Dockerfile that starts from Dockerhub image and then change your Gerrit configuration files and steps to perform your desired set-up.

Dockerfile sample for Gerrit listening on HTTP port 8090:

FROM gerritforge/gerrit-centos7
MAINTAINER GerritForge

USER gerrit
RUN git config -f /var/gerrit/etc/gerrit.config httpd.listenurl http://*:8090/
EXPOSE 29418 8090

# Start Gerrit
CMD /var/gerrit/bin/gerrit.sh start && tail -f /var/gerrit/logs/error_log

How can I install a specific Gerrit Docker version?

All Docker images published are associated to a specific Gerrit tag representing the version installed on that image. The default is always the latest Gerrit version that in this case is 2.11.

To download and run a Gerrit 2.10.3.1 Docker image on CentOS 7:

$ docker pull gerritforge/gerrit-centos7:2.10.3.1
$ docker run -d -p 8080:8080 -p 29418:29418 \
  gerritforge/gerrit-centos7:2.10.3.1

What about having typical Gerrit configurations as Dockerfiles?

There are a lot of possible Gerrit configuration settings but the most typical ones are:

LDAP authentication
OAuth with Google / GitHub authentication
Master / Slave with Git over SSH
Master / Slave with Git over HTTPS
PostgreSQL DB
MySQL DB

All the above settings can be represented by a set of Dockerfiles similar to the one above mentioned: they will all start from a plain Gerrit Docker image (e.g. gerritforge/gerrit-centos7) and follow with the amended settings.

Where can I find the “pre-digested” Dockerfiles for the typical Gerrit configurations?

We are planning to enrich the Gerrit installer project with all the above typical scenarios and publish the associated Dockerfiles so that people can “pick&mix” the perfect recipe for a flawless installation.

Have fun with Gerrit and Docker, no more installation tears with Gerrit!

Gerrit Ver. 2.11 is now available for download

Posted on April 17, 2015 by Git and Gerrit Code Review for the Enterprise

Gerrit version 2.11 is now available.

Release highlights:

Inline editing in the browser
Removal of the old change screen
Many improvements in the change screen UI

Release notes:
https://gerrit-documentation.storage.googleapis.com/ReleaseNotes/ReleaseNotes-2.11.html

Download:
https://gerrit-releases.storage.googleapis.com/gerrit-2.11.war

Documentation:
https://gerrit-documentation.storage.googleapis.com/Documentation/2.11/index.html

Native packages:
If you have already installed Gerrit 2.10 using them, you need to execute:

(on Debian / Ubuntu)

apt-get update & apt-get install gerrit

(on CentOS / RedHat)

yum clean all && yum install gerrit

If you have not configured yet the GerritForge / BinTray repositories, please follow the instructions at:
https://gitenterprise.me/2015/02/27/gerrit-2-10-rpm-and-debian-packages-available/

Yet again, Gerrit delivers new features and cutting-edge technology improvements for Git and Code Review for the Enterprise!

Zero-downtime Git and Gerrit Code Review @GoogleSource.com

Posted on March 22, 2015 by Git and Gerrit Code Review for the Enterprise

Where is this coming from?

Zero downtime image from http://www.couchbase.com

Yesterday GitHub was down for a DB upgrade, an outage that overall lasted for 23 minutes. This may not sound a problematic downtime at all, but when you think that nowadays GitHub is used not only for Software development worldwide as a Git server but as also as a source and binary packaging repository and distribution centre, a Markdown pages server and possibly much more … and you multiply by the number of users / repos hosted, then 23 minutes may translate in a significant disruption and, for some mission-critical business use-cases, even financial loss.

We never needed planned outages for DB upgrades on Gerrit Code Review used for a lot of other OpenSource projects (ranging from Android to Chromium): how the Gerrit team is managing to outperform GitHub? I asked Shawn Pearce to spend some time to describe how his team at Google managed to implement its roll-out strategy in the delivery pipeline going through tons of major DB upgrades with zero downtime worldwide.

He kindly responded on the Gerrit Code Review mailing list with this post, and we are very thankful for having shared his experience with us, hoping that GitHub guys will read this post and may learn from it for future GitHub DB upgrades.

I am reporting here Shawn’s post AS-IS, in order to maximise the audience and enable more people to access its content.

How googlesource.com manages database upgrades with no downtime (by Shawn Pearce)

In light of the recent GitHub database outage, Luca Milanesio asked me to describe how googlesource.com has managed nearly 3 years of database upgrades with zero downtime. So… here is an attempt. 🙂

tl;dr: protobuf, Bigtable, and multi-master.

Long version…

Bigtable … not SQL

Years ago we settled on using Google Bigtable as the backing database for googlesource.com instead of MySQL or PostgreSQL. This decision actually came about because of virtual hosting (see below), not because Google is any better at running Bigtable than MySQL or PostgreSQL (we run those well too).

Briefly, Bigtable is a NoSQL database that organizes data into tables of column families; read the Bigtable paper for an overview. Rows can contain irregular shapes of columns, and two rows in the same table do not need to have the same layout (columns).

To support Gerrit Code Review I hand-wrote a complete implementation of the ReviewDB interface and all of its sub interfaces to transport data between the application and Bigtable.

Data is stored in ~3 Bigtables:

Accounts: Accounts, AccountDiffPreferences, AccountExternalIds, …
Changes: Changes, PatchSets, PatchLineComments, …
SiteData: AccountGroups, AccountGroupByIds, …

We mash data for multiple ReviewDb tables into the same Bigtable by assigning the tables to different column families. Data for an Accounts row goes into the “Accounts.data” column family, while data for an AccountDiffPreferences row goes into the “AccountDiffPreferences.data” column family. E.g.:

row: 100151 # account_id
Accounts.data:
... data for account object ...

AccountsDiffPreferences.data:
... data for diff pref object ...

Our guiding principal for what goes where is based (mostly) on the primary key declaration. If Account.Id was first in the primary key, the row(s) go into the Accounts Bigtable. If Change.Id was first in the primary key, the row(s) go into the Changes Bigtable. This means the StarredChanges data is stored in the Accounts Bigtable, and PatchLineComments is in the Changes Bigtable.

Everything else that didn’t quite fit the Accounts or Changes pattern went into SiteData. AccountGroups for example are in SiteData.

To be honest, this is all arbitrary. I could have randomly assigned ReviewDb tables to Bigtables. Or put them all in a single Bigtable.

Creating new tables

New table creation is handled by pushing a new column family to Bigtable. This is an online operation that does not require changing any existing data. Internally column families are just unique tags written before the stored data. Adding a column family just assigns a new tag that has not been used yet.

Protobuf

The really important part of our online schema upgrade process is actually Google protobuf.

Bigtable doesn’t store structured data. Bigtable stores sequences of bytes in column families. Googlers get structure by storing encoded protobuf messages in column families. Protobuf encodes messages by writing a unique integer tag before each field. The tag allows readers to match data back up to the runtime object during decoding.

Protobuf gives us very critical features:

– Unknown fields are skipped (and ignored). If a field has been deleted from the model, but still exists in data records, the application code can safely skip over that data by reading the tag, recognizing its an unknown field, skipping its encoded bytes, and continuing onto the next field.

– Unknown fields are preserved. If a field is not recognized its encoded bytes are kept in memory. When the application makes changes to a message and writes the message back to the database table, the unknown fields are preserved and written back as-is.

– Fields can be missing. If a field is not present in the data, it simply has no tag present in the encoded message. The field is assumed to be its default value by the application.

Each database table in ReviewDb is described by its own protobuf message. The @Column() annotations in ReviewDb include the unique field numbers used by protobuf to tag data in encoded messages. You can see this schema by printing the protobuf schema out:

java -jar gerrit.war ProtoGen -o reviewdb.proto ; cat reviewdb.proto

In our Bigtable mapping the Gerrit application server encodes an Account object into a protobuf message, then writes the encoded protobuf to the Accounts.data column family. Reading from the database is merely the reverse process.

Column deletion

Columns can be removed from a table by removing its @Column annotation from the Java object. The field definitions will be omitted from the protobuf description. New application code that reads from the database table will skip over the (now unknown) field. During updates of a row the deleted/unknown field will be preserved and written back to the database table.

It is very important that the field number is never reused.

Nothing prunes the old fields from the Bigtable. Disk storage is cheap, disk IOs are not. Leaving the deleted data on disk is cheaper than scanning through every row and clipping out the deleted fields.

This is why we leave deleted fields commented out in source code, so future developers know not to reuse a field number.

Column addition

Columns can be trivially added to an existing table by assigning a new field number. When newer application code reads an old record it won’t find the new tags and will simply assume the default that is supplied in the protobuf description.

Unfortunately the defaults used in @Column annotations don’t always match with the real intended defaults. We have had to hack this at Google by applying a patch to every version of Gerrit for 2 fields:

- optional bool size_bar_in_change_table = 16;
+ optional bool size_bar_in_change_table = 16 [default = true];
optional bool legacycid_in_change_table = 17;
optional string review_category_strategy = 18;
- optional bool mute_common_path_prefixes = 19;
+ optional bool mute_common_path_prefixes = 19 [default = true];

The open source project chose to apply these defaults using Schema_NNN upgrade files that rewrite all existing accounts to set the fields true during init. We do not have that luxury and instead patch every release we make to assume the “correct” default if the field is not present in the stored data. This is why I lobby so hard against boolean columns being true by default via Schema_NNN upgrades. 🙂

Because of the unknown field properties described earlier, it is (usually) safe to run newer binaries alongside old binaries against the same database. A newer binary may store new fields to a row. The older binary will ignore these, but preserves the unknown field data during updates.

Of course cross-field semantics could be confused if we attempted this. We limit our risk by staying close to HEAD and try really, really hard to avoid cross-field semantic issues (e.g. anything like status and open in changes).

Column rename

We really don’t care about column renames. The column names themselves are not stored in Bigtable or in the encoded protobuf messages. Column names are only in the application software. A column name change is just a recompile, similar to a method name change.

What we cannot do is change field IDs. Once used in an @Column annotation, we are stuck with that ID number forever. 🙂

Virtual Hosting

googlesource.com implements virtual hosting for hundreds of Gerrit sites. All sites are combined together into the same 3 Bigtables by prefixing each row with the site name, for example:

row: gerrit:100151 # $site:$account_id
Accounts.data:
... data for account object ...

AccountsDiffPreferences.data:
... data for diff pref object ...

The application server itself is virtual hosted by running a javax.servlet.Filter in front of Gerrit. The filter extracts the host name from the HTTP Host header and stores it somewhere accessible by the hand-coded ReviewDb implementation. All database operations include the host name as part of the row keys being accessed.

It is this virtual hosting strategy that forced our hand and required such smooth online schema migrations.

When we update the binary, we update the binary for hundreds of “servers” at once. We can’t shutdown everyone for 200 * 5 minutes to upgrade 200 sites at 5 minutes each while we run a Schema_NNN process serially. We also don’t want to use 200 CPUs to update 200 sites in parallel during a global 5 minute downtime window, too much can go wrong, and there will always be straggling sites. Neither option appealed to us.

So smooth online migrations it was. 🙂

Multi-master hosting

We don’t run one Gerrit server. We run many Gerrit servers against the same Bigtables. Requests load-balance across this pool of servers, based on a number of factors that are out of scope for this particular article.

We use this multi-master hosting to help do online binary upgrades of Gerrit.

Given N servers where N >= 3:

1) we take one out of the load balancing rotation
2) wait for in-flight requests to finish
3) stop the process
4) install the new version
5) restart it
6) add it back to the rotation
7) goto 1

We size N such that N is larger than the number we actually need to handle traffic; this allows us to lose a server without impact to traffic to do the upgrade dance.

Linux operating system upgrades can be coordinated the same way, as the servers are on different machines.

Multi-data center hosting

Given multi-master hosting, we don’t put all of our servers in the same data center. We run them in multiple data centers and allow the load balancers to route across all of them.

This strategy allows us to perform data center level maintenance without service interruption by taking some servers out of the load balancing rotation before maintenance starts.

Sometimes data center level maintenance is power related; e.g. servers may need to be shutdown to repair a failed UPS. Other times its database related. I recently corrupted a database replica in one data center by accident. I “shutdown” our servers in that data center while I manually restored a known good database. Nobody except my team at Google knew about my mistake, or the impact.

Once you are multi-data center, cross-site database consistency becomes an issue. Frankly we just reuse Google Megastore to get cross data center consistency based on a high quality Paxos implementation. Each of our data centers has a full copy of the database local to it and Paxos is used to ensure the application has a consistent view.

And by this point, you are probably wishing you had stopped at the tl;dr … 🙂

GitHub fully operational again

Posted on March 21, 2015 by Git and Gerrit Code Review for the Enterprise

GitHub outage latest for around 23 minutes and now the site has resumed normal stable operations.

GerritHub.io and his users have not been impacted by the GitHub outage, everything went smoothly and the cache TTL extension avoided any negative effects on our systems. Replication to GitHub resumed smoothly without any misalignment caused by the the outage.

Will this be the last GitHub outage? Have they learned how to implement effectively DB roll-outs with Continuous Delivery practices?

It would be very interesting if Shawn Pearce could put together a presentation on how Continuous Delivery is achieved for Gerrit Code Review at Google, avoiding downtime even during DB upgrades and roll-outs. Possibly GitHub could be inspired by us 🙂