Gerrit Upgraded with No Downtime

Screen Shot 2016-03-21 at 20.48.12

Zero DownTime success story.

From today at 08:06 GMT GerritHub users are served by our brand new infrastructure geo-located in Canada, Quebec, Beauharnois. It is the first time we applied a zero-downtime roll-out scheme, the PingDom uptime for the past 24h reported 100% uptime and 688 msec average response time for the page of the list of opened changes. The two response times spike on the above graphs are actually due to the old German infrastructure and happened before the start of the roll-out.

We can see the switch of the traffic to the new infrastructure from the increase of the overall response time (IP packets were routed from Germany to Canada causing extra hops); as the DNS propagation was spreading across the world, the overall number of hops gradually came back to normal.

Timeline of the events.

  • 08:00:00 GMT – Phase 1 – Set Gerrit READ-ONLY. All changes and Git repositories started to refuse push and updates.
  • 08:00:01 GMT – Phase 2 – Wait for pending replication to complete. Replication queue was empty; there was no need to wait.
  • 08:00:02 GMT – Phase 3 – Mirror DB and Git for the last time, delta-reindex, DB upgrade and Gerrit restart. It has been the longest part of the roll-out and lasted 5′ 32”, aligned with our estimates.
  • 08:05:34 GMT – Phase 4 – Cache warm-up. 20K projects, 8K accounts and 4.6K groups were pre-loaded in Gerrit. This step was optional but allowed us to redirect all the traffic without risks of causing thread spikes on the new infrastructure.
  • 08:06:23 GMT – Phase 5 – Redirect traffic to the new infrastructure.

Did anybody notice the rollout?

During the rollout the Git projects and Gerrit changes were read-only for 6′ and 23”. According to the logs, 493 Git/HTTP and 172 Git/SSH invocations were made and completed successfully: none of them failed.

What is the situation right now?

The new infrastructure public IP (192.99.233.76) has almost completed his DNS propagation around the world, the only countries not entirely covered are Australia and China. The rest of the world is coming directly to Canada avoid the German hops. Metrics are good, low CPU utilization and threads consumption compared to the old German infrastructure, symptom of the reduction of the execution and serving times and latency.

What’s next?

From now on we will continue to use this Blue/Green roll-out strategy, possibly improving in the ReadOnly window by introducing live distributed reindex and cache warm-up.

We fully commit to Zero-Downtime and Stability, the most valuable assets for our clients.

Gerrit Code Review and Jenkins Continuous Delivery Pipeline on BigData

Gerrit at the Jenkins User Conference 2015 – London

For the very first time, CloudBees organised a full User Conference in London and we have been very pleased to speak to present a real-life case-study of Continuous Integration and Continuous Delivery applied to a large-scale BigData Project.

See below a summary of the overall presentation published on the above YouTube video.

The trap of the BigData production phase

BigData has been historically used by data scientists in order to analyse data and extract  features that are relevant for the business. This has typically been a very interactive process happing mostly on “notebook-style” environments where almost everything, from ad-hoc queries and graphs, could have been edited and executed interactively. This early stage of the process is typically known as “exploration” or “prototype analysis” phase. Sometimes last only a few days but often is used as day-by-day modus operandi.

However when the exploration phase is over, projects needed to be rewritten or adapted using a programming language (Scala, Python or Java) and transformations and aggregations expressed in jobs. During the “production-isation” phase code needs to be properly written and tested to be suitable for production.

Many projects fall into the trap of reducing the “production phase” to a mere translation of notebooks (or spreadsheets) into Scala, Java or Python code, relying only on the manual analysis of the resulting data as unique testing methodology. The lack of software engineering practices generates complex monolithic code,  difficult to maintain, to understand and thus to validate: the agility of the initial “exploration” phase was then miserably lost in the translation into production code.

Why Continuous Delivery on BigData?

We have approached the development of BigData projects in a radically different way: instead of simply relying on existing tools, often not enough for setting up a proper Agile Delivery Pipeline, we introduced brand-new frameworks and applied them to the building blocks of a Continuous Delivery pipeline.

This is how Stefano Galarraga wrote started the ScaldingUnit project, aimed in de-composing the development of complex Scalding MapReduce jobs in simple and testable units.

We started then to benefit from the improved Agility and speed of delivery, giving constant feedback to data-scientists and delivering constant value to the Business stakeholders during the production phase. The talk presented at the Jenkins User Conference 2015 is smaller-scale show-case of the pipeline we created for our large clients.

Continuous Delivery Pipeline Building Blocks

In order to build a robust continuous delivery pipeline, we do need a robust code-base to start with: seems a bit obvious but is often forgotten. The only way to create a stable code-base,  collectively developed and shared across different [distributed] Teams, is to adopt a robust code review lifecycle.

Gerrit Code Review is the most robust and scalable collaboration system that allows distributed teams to submit their changes and provide valuable feedback about the building blocks of the BigData solution. Data scientists can participate as well during the early stage of the production code development, giving suggestions and insight on the solution whilst is still in progress.

Docker provided the pipeline with the ability to define a set of “standard disposable systems” to host the real-life components of the target runtime, from Oracle to a BigData CDH Cluster.

Jenkins Continuous Integration is the glue that allowed coordinating all the different actors of the pipeline, activating the builds based on the stream events received from Gerrit Code Review and orchestrating the activation of the integration test environments on Docker.

Mesos and Marathon managed all the physical resources to allow a balanced allocation of all the Docker containers across the cluster. Everything has been managed through Mesos / Marathon, including the Gerrit and Jenkins services.

Pipeline flow – Pushing a new change to Gerrit Code Review

The BigData pipeline starts when a new piece of code is changed on the local development environment. Typically developers test local changes using the IDE and the Hadoop “local mode” which allows the local machine to “simulate” the behaviour of the runtime cluster.

The local mode testing is typically good enough for running unit-tests but often is unable to detect problems (e.g. non-serialisable objects, compression, performance) that are likely to appear in the target BigData cluster only. Allowing to push a code change to a target branch without having tested on a real cluster represent a potential risk of breaking the continuous delivery pipeline.

Gerrit Code Review allows the change to be committed and pushed to the Server repository and built on Jenkins Continuous Integration before the code is actually merged into the master branch (pre-commit validation).

Pipeline flow – Build and Unit-tests execution

Jenkins uses the Gerrit Trigger Plugin to fetch the code currently under review (which is not on master but on an open change) and triggers the standard Scala SBT build. This phase is typically very fast and takes only a few seconds to complete and provide the first validation feedback to Gerrit Code Review (Verified +1).

Until now we haven’t done anything special of different than a normal git-flow based continuous integration: we pushed our code and we got it validated in Jenkins before merging it to master. You could actually implement the pipeline until this point using GitHub Pull Requests or similar.

Pipeline flow – Integration test automation with a real BigData Cloudera CDH Cluster

Instead of considering the change “good enough” after a unit-test validation phase and then automatically merging it, we wanted to go through a further validation on a real cluster. We have completely automated the provisioning of a fully featured Cloudera CDH BigData cluster for running our change under review with the real Hadoop components.

In a typical pipeline, integration tests in a BigData Cluster are executed *after* the code is merged, mainly because of the intrinsic latencies associated to the provisioning of a proper reproducible integration environment. How then to speed-up the integration phase without necessarily blocking the development of new features?

We introduced Docker with Mesos / Marathon to have a much more flexible and intelligent management of the virtual resources: without having to virtualise the Hardware we were able to spawn new Docker instances in seconds instead of minutes ! Additionally the provisioning was coordinated by the Docker Build Step Jenkins plugin to allow the orchestration of the integration tests execution and the feedback on Gerrit Code Review.

Whenever an integration test phase succeeded or failed, Jenkins would have then submitted an “Integrated +1/-1” feedback to the original Gerrit Code Review change that triggered the test.

Pipeline flow – Change submission and release

When a change has received the Verified+1 (build + unit-tests successful) and Integrated+1 (integration-tests successful) is definitely ready to be reviewed and submitted to the master branch. The additional commit triggers the final release build that tags the code and uploads it to Nexus ready to be elected for production.

Pipeline flow – Rollout to production

The decision to rollout to production with a new change is typically enabled by a continuous delivery pipeline but manually operated by the Business stakeholders. Even though we could *potentially* rollout every change, we did not want *necessarily* do that because of the associated business implications.

Our approach was then to publish to Nexus all the potential *candidates* to production and roll-them-out to a pre-production environment, ready to be assessed by Data-Scientists and Business in real-time. The daily job scheduler had a configuration parameter that simply allowed to “pointing” to the version of the code to run every day. In this way whatever is deployed to Nexus is potentially fully working in production and rollout or rollback a release is just a matter of changing a label in the daily job scheduler.

Summary

Building a Continuous Delivery Pipeline for BigData has been a lot of fun and improved the agility of the Business in rolling out changes more quickly without having to compromise on features or stability.

When using a traditional Continuous Integration pipeline, the different stages (build + unit-test, integration-tests, system-tests, rollout) are all happening on the target branch causing it to be amber or red at times: whenever tests are failing the pipeline need to be restarted from start and people are blocked.

By adopting a Code Review-driven Continuous Integration Pipeline we managed to get the best of both worlds, avoiding feature branches but still keeping the ability to validate the code at each stage of the pipeline and reporting it back to the original change and the associated developer without to compromise the stability of the target branch or introducing artificial and distracting feature branches.

Resources

The slides of the talk are published on SlideShare.

All the docker images used during the presentation are available on GitHub: