Stress your Gerrit with Gatling

As a Gerrit administrator, it can be difficult to make sure that upgrading from one version to another has no performance impact.

It is essential to:

  • have a smooth and maintainable way to reproduce traffic profiles to stress your server
  • easily interpret the results of your tests

Tools like wrk and ab are good for running simple benchmarks, but when it comes to more complex scenarios and the collection of client-side metrics, they are not the best tools to use.

Furthermore, they only support Closed Workload Models, which might not always fit the behaviour of your system: in a closed model the number of concurrent users is capped, while in an open model new users keep arriving regardless of how many are already in the system.
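
With the Gatling DSL, the two models simply map to different injection profiles. Here is a minimal sketch, assuming a scenario scn has already been defined:

// Closed model: the number of concurrent users is capped;
// a new virtual user starts only when a running one finishes
setUp(scn.inject(constantConcurrentUsers(10) during (60 seconds)))

// Open model: new virtual users keep arriving at a fixed rate,
// regardless of how many are already in the system
setUp(scn.inject(constantUsersPerSec(20) during (60 seconds)))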

For those reasons, at GerritForge we started looking at more sophisticated tools and adopted Gatling, an open-source load testing framework.

The tool

The Gatling homepage describes it this way:

“Gatling is a highly capable load testing tool. It is designed for ease of use, maintainability and high performance…

Out of the box, Gatling comes with excellent support of the HTTP protocol…

As the core engine is actually protocol-agnostic, it is perfectly possible to implement support for other protocols…

Based on an expressive DSL, the scenarios are self-explanatory. They are easy to maintain and can be kept in a version control system…”

In this article, we focus on the maintainability and protocol agnosticism of the tool.

What about Git?

Gatling natively supports the HTTP protocol but, since the core engine is protocol-agnostic, it was easy to write an extension implementing the Git protocol. I started working on the Gatling git extension in August during a Gerrit hackathon in Sweden, and I am happy to see that it is starting to get traction in the community.

This way we ended up among the official Gatling extensions listed on the Gatling homepage:

[Screenshot: the gatling-git extension listed among the official Gatling extensions]

The code of the Git extension is open source and free to use. It can be found here, and the library can be downloaded from Maven Central. In case you want to raise a bug, you can do it here.
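
If you build your simulations with sbt, pulling in the extension is a one-liner in your build.sbt (a sketch; check Maven Central for the current version number):

// build.sbt
libraryDependencies += "com.gerritforge" %% "gatling-git" % "<latest-version>"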

Maintainability

Gatling is written in Scala, and it expects the load test scenarios to be written in Scala. Don’t be scared; there is no need to learn crazy functional programming paradigms: the Gatling DSL does a good job of abstracting away the underlying framework. To write a scenario, you just have to learn the building blocks made available by the DSL.
Here are a couple of snippets extracted from a scenario, to give a feel for what the DSL looks like:

class ReplayRecordsFromFeederScenario extends Simulation {
  // Boilerplate to select the protocol and import the configuration
  val gitProtocol = GitProtocol()
  implicit val conf = GatlingGitConfiguration()

  // Feeder definition: the data used for the scenario will be loaded from "data/requests.json"
  val feeder = jsonFile("data/requests.json").circular

  // Scenario definition: 
  val replayCallsScenario: ScenarioBuilder =
    scenario("Git commands") // What's the scenario's name?
      .forever {  // How many times do I need to run through the feed?
        feed(feeder) // Where shall I get my data?
          .exec(new GitRequestBuilder(GitRequestSession("${cmd}", "${url}"))) // Build a Git request
      }

  setUp(
    replayCallsScenario.inject(
      // Traffic shape definition... pretty self-explanatory
      nothingFor(4 seconds),
      atOnceUsers(10),
      rampUsers(10) during (5 seconds),
      constantUsersPerSec(20) during (15 seconds),
      constantUsersPerSec(20) during (15 seconds) randomized
    ))
    .protocols(gitProtocol) // Which protocol should I use?
    .maxDuration(60 seconds) // How long should I run the scenario for?
}

This is what the feeder looks like; each entry provides the cmd and url (and, for pushes, a ref-spec) values referenced in the scenario above:

[
  {
    "url": "ssh://admin@localhost:29418/loadtest-repo.git",
    "cmd": "clone"
  },
  {
    "url": "http://localhost:8080/loadtest-repo.git",
    "cmd": "fetch"
  },
  {
    "url": "http://localhost:8080/loadtest-repo.git",
    "cmd": "push",
    "ref-spec": "HEAD:refs/for/master"
  }
]
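
Assuming the simulations live in a project built with the gatling-sbt plugin, a scenario like the one above can typically be launched with:

sbt "gatling:test"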

Here is another example, reproducing the creation of a WIP change via the REST API:

class WIPWorkflow extends Simulation {

  // Configuration boilerplate
  implicit val conf: GatlingGitConfiguration = GatlingGitConfiguration()
  val baseUrl = "https://review.gerrithub.io"
  val username: String = conf.httpConfiguration.userName
  val password: String = conf.httpConfiguration.password
  val httpProtocol: HttpProtocolBuilder = http
    .baseUrl(baseUrl)
    .userAgentHeader("Gatling test")
  val request_headers: Map[String, String] = Map(
    "Content-Type" -> "application/json"
  )
  
  val scn = scenario("WIP Workflow") // What's the name of my scenario?
    .exec(
      http("Create WIP change")
        .post("/a/changes/") // Which url and which HTTP verb should I use? 
        .headers(request_headers) 
        .basicAuth(username, password) // How do I authenticate?
        .body( // What's the body of my request?
          StringBody("""{
            "project" : "GerritForge/sandbox/e2e-tests",
            "subject" : "Let's test this Gerrit! Create WIP changes!",
            "branch" : "master",
            "work_in_progress": "true",
            "status" : "NEW"
          }""")
        )
        .check(status.is(201)) // What's the response code I expect?
    )

  setUp(
    scn.inject(
      atOnceUsers(1) // Traffic profile
    )
  ).protocols(httpProtocol)
}

Jenkins integration

Running load tests can be tedious and time-consuming, yet it is essential for spotting possible performance regressions in your application.
Providing the least possible friction is essential to encourage people to run them. If you are already using Jenkins in your company, you can leverage the Gatling plugin to scale your load tests quickly and provide easy access to the metrics.
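
As a sketch of what this can look like with the Gatling Jenkins plugin installed, a minimal declarative pipeline could run the simulations and archive the reports (the sbt invocation assumes a project using the gatling-sbt plugin):

pipeline {
  agent any
  stages {
    stage('Load tests') {
      steps {
        // Run all the Gatling simulations in the project
        sh 'sbt "gatling:test"'
      }
    }
  }
  post {
    always {
      // Archive the Gatling reports and track the trend across builds
      gatlingArchive()
    }
  }
}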

A real use case: Gerrit v3.0 vs Gerrit v3.1 load test… in production!

Let’s go through a real-life scenario to show how useful and easy to read the metrics provided by Gatling are.

The closest environment to your production one is…production!

gerrithub.io runs Gerrit in a multi-site configuration, and this gives us the luxury of doing canary releases, upgrading only a subset of the master nodes running Gerrit. One of the significant advantages is that we can run A/B tests in production.

This also allows us to run meaningful load tests against the production environment. See below a simplified picture of our Gerrit setup, where you can see the canary server running a higher Gerrit version.

[Diagram: simplified GerritHub multi-site setup, with the canary server running a higher Gerrit version]

We ran the same load test, sketched below, against the 2 servers in Germany. The test:

  • creates 100 chained change sets via the REST API
  • submits all the changes together via the REST API
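
Here is a minimal sketch of how such a scenario could be expressed with the Gatling DSL. This is illustrative rather than our actual test code: the host, credentials and project name are made-up placeholders, and the chaining and final submission steps are elided:

class ChainedChangesScenario extends Simulation {
  val httpProtocol: HttpProtocolBuilder = http.baseUrl("https://review.example.com")

  val scn = scenario("Create and submit chained changes")
    .repeat(100) { // Create the 100 changes one after the other
      exec(
        http("Create change")
          .post("/a/changes/")
          .basicAuth("admin", "secret")
          .header("Content-Type", "application/json")
          .body(StringBody("""{
            "project" : "loadtest-repo",
            "branch" : "master",
            "subject" : "Chained load test change"
          }"""))
          .check(status.is(201))
          // Chaining each change onto the previous one (e.g. via the
          // base_change field of ChangeInput) is elided for brevity
      )
    }
    // The final step, submitting the whole chain via the REST API,
    // is elided as well

  setUp(scn.inject(atOnceUsers(1))).protocols(httpProtocol)
}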

We then compared the server-side and client-side metrics to see whether the latest Gerrit version does a better job.

Server-side metrics

Server-side metrics come from the Prometheus exporter plugin. The image below shows the mean HTTP request time:

[Graph: mean HTTP request time, Gerrit v3.0 vs Gerrit v3.1]

We can see the improvement in the latest Gerrit version: the mean request time is almost halved and, of course, the overall duration decreases.

Client-side metrics

Let’s see what is going on on the client side, using the metrics provided by Gatling.

Among all the metrics, we are going to focus on one step of the test, the creation of the change, since we have more data points for it:

[Gatling report: statistics for the change-creation step]

The overall report already shows the reduction in response time in the latest Gerrit version:

[Gatling report: overall response times, Gerrit v3.0 vs Gerrit v3.1]

If we look more closely at the response times, we can see that their distribution is pretty much the same, but the scale is different… again, this confirms the result we saw before.

[Gatling report: response time distribution, Gerrit v3.0 vs Gerrit v3.1]

What can we say… good job, Gerrit community!

Wrapping up

I have touched on several topics in this blog post:

  • simplicity and maintainability provided by the Gatling DSL
  • Integration with Jenkins
  • Gatling extensions and reuse of the statistic engine
  • Example scenarios

I would like to write more in-depth about all these topics in follow-up blog posts. Feel free to vote for the topic you are most interested in, or to suggest new ones, in the comments section of this post.
If you need any help setting up your scenarios, or understanding how to effectively run load tests against your Gerrit installation, GerritForge can help you.

Fabio Ponciroli (aka Ponch) – GerritForge
Gerrit Code Review and Gatling Contributor

GerritHub and Zero-Downtime Upgrade

GerritHub gets bigger on Mon, 21 March 08:00 GMT


GerritHub has experienced unprecedented growth over the past two years. The November 2015 numbers presented at the Google User Summit in Mountain View, CA have been surpassed again, and we need to make sure that our infrastructure is still capable of dealing with current and future users’ needs.

What is changing in GerritHub.io?

We are changing everything, from the version of Gerrit to the hardware, network and storage infrastructure. Data, DBMS, indexes and caches need to be upgraded and refreshed to make sure that the new systems exactly reflect the current production data and sessions.
We are also changing the geo-location of our servers, from the current server farm in Germany (Bayern, Nuremberg – 100 Mbps) to a new server farm in Canada (Quebec, Beauharnois – 1 Gbps).

Why so many changes?

We started to measure significant delays in the Git and review operations on the old infrastructure, mainly due to three factors:

  1. More users, more repositories, more concurrency. Individuals, open-source projects and businesses started using GerritHub.io for their mission-critical repositories, considering Gerrit the “source of truth” of their review workflow. We needed more horsepower, memory, storage and the ability to scale even further.
  2. Bandwidth from the USA and the Far East. The majority of people using GerritHub.io are on the other side of the Atlantic Ocean: this is typically not a problem from 7 AM to 3 PM … but after 4 PM the connectivity between Europe and the Americas becomes slow. Additionally, people using GerritHub.io from India, Japan, Australia and New Zealand experienced terrible slowdowns because of the excessive number of hops needed to reach Germany.
  3. Gerrit master is much faster. Based on the current data and metrics measured on GerritHub.io, we have contributed a lot of patches to reduce the overhead caused by the Gerrit DB and lessen the number of SQL queries per minute. All those improvements are on Gerrit master, and we need to catch up with the “latest and greatest” version.

Will I experience any GerritHub.io outage?

The last time GitHub needed to make a major upgrade, it asked its 5M users to stop working for 23 minutes. This translates to a loss of two million hours of continuous delivery lifecycle, equivalent to over 130 man-years, worth no less than eight million dollars.
We are going to adopt a new zero-downtime Gerrit roll-out strategy to make sure that all those changes are not going to impact your day-to-day activity. If you were not reading this post, you would possibly not even notice the “switch” from the old to the new infrastructure, apart from the increase in speed and bandwidth.

Here is the zero-downtime GerritHub.io migration, step by step, with the expected timings.

Phase 0 – Replication to the new Gerrit infrastructure. (1 month ago)
We started migrating everything one month ago, and the old and new infrastructure are working side-by-side, thanks to Gerrit master-slave replication. The new Gerrit servers are active as slaves and are read-only.
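
Side-by-side replication like this is typically handled by the Gerrit replication plugin. As an illustrative sketch (the host and path are made up), a remote pointing at the new infrastructure in replication.config could look like this:

[remote "new-infrastructure"]
  url = gerrit@new-gerrit-host:/var/gerrit/git/${name}.git
  mirror = true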

Phase 1 – Migration kick-off. (08:00 GMT)
We install a Gerrit plugin that rejects all pushes to GerritHub.io repositories with a courtesy message: “Gerrit is under maintenance, all projects are READ ONLY”. All HTTP POST, PUT and DELETE operations are disabled on the Gerrit REST API.

Phase 2 – Wait for replication events to complete and migrate DB. (08:02 GMT)
Git repositories are continuously replicated, but we do need to make sure that the event queue is empty. Once that happens, we schedule the final DB migration to the new infrastructure.

Phase 3 – Gerrit DB upgrade and reindex (08:04 GMT)
The new Gerrit server executes the final upgrade and an off-line reindex of the most recently received changes.

Phase 4 – Gerrit start-up and cache warm-up (08:05 GMT)
The new Gerrit is restarted and the most critical Gerrit caches (projects, accounts and groups) are pre-loaded in memory. This prevents the incoming traffic spike from exhausting the available threads and makes the transition as smooth as possible, without slowdowns.

Phase 5 – Traffic switch and DNS updates (08:06 GMT)
GerritHub.io redirects all incoming HTTPS and SSH traffic to the new infrastructure. Git pushes and HTTP PUT, POST and DELETE operations of the REST API are operational again and served by the new Gerrit infrastructure. GerritHub.io DNS is updated to the new Canadian IPs.

Phase 6 – New IPs get propagated to all worldwide DNS servers (+ 1 day)
Once all the DNS servers in the world have been updated, everyone will go directly to the new infrastructure, without further hops or redirections from Germany. Customers from the USA, Canada, South America, Asia, Japan, New Zealand and Australia should see a significant reduction in network latency and an increase in GerritHub.io responsiveness.

Firewall and SSH considerations

Even though the Gerrit server’s SSH key is not changing, some of you may see a warning similar to this when you push or pull over SSH:

Warning: the RSA host key for 'review.gerrithub.io' differs from the key for the IP address '148.251.77.70'

The warning message will also tell you which lines in your ~/.ssh/known_hosts need to change. Open that file in your favorite editor, remove or comment out those lines, then retry your push or pull.
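
Alternatively, ssh-keygen can drop the stale entries for you; the keys will be recorded again on the next connection:

ssh-keygen -R review.gerrithub.io
ssh-keygen -R 148.251.77.70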

Should your network have strict firewall rules for accessing external sites, you may want to whitelist the IP address of the new infrastructure’s WLB: 192.99.233.76.

Follow the GerritHub.io migration progress.

We will advertise the migration progress on Twitter at @GitEnterprise. Should you have any issues, you can tweet us or contact GerritForge Customer Support at support@gerritforge.com.

More capacity and performance for GerritHub.io

We are pleased to announce that we have successfully completed a new major hardware upgrade to the GerritHub.io platform.

What has been upgraded on GerritHub.io?

There is a brand new production cluster, which provides:

  • More memory (up to 32 GBytes per node)
  • More disk space (up to 2 TBytes per node)
  • More concurrency (up to 8 CPUs per node)

The new cluster is located in Germany (Bayern) and provides much better and more stable bandwidth.

Do I need to change anything on my client?

GerritHub.io is accessible from the old and the new IP addresses:

  • old IP address: 94.23.71.44
  • new IP address: 148.251.77.70

The GerritHub DNS is currently propagating the new IP to the gerrithub.io and review.gerrithub.io hostnames: during the propagation time (max 24h), both IPs will provide the same access to the Germany-based production cluster.

What about my Git/SSH access?

When using Git over SSH, the remote host’s SSH key is exchanged and associated with the resolved remote IP in the ~/.ssh/known_hosts file on your local machine. This means that you have currently associated the GerritHub.io SSH public key with 94.23.71.44 (the old IP address).

When the DNS propagation is complete, you will see a warning from your SSH client asking you to verify that the new IP address is OK. In some cases you may be asked to verify and re-accept the SSH public key.

Here is an example of the warning you will probably see on your Git client over SSH:

Warning: Permanently added the RSA host key for IP address '[148.251.77.70]:29418' to the list of known hosts.

Double-check that the IP address shown corresponds to the new GerritHub.io cluster (148.251.77.70). This warning will be displayed only once; after that, the new IP will be stored in your ~/.ssh/known_hosts file.