However, the world it was built for – local repositories and teams, along with simple, human-centric CI/CD pipelines – no longer exists.
Today, codebases are massive. Teams are global. And AI is generating code faster than ever before. Our infrastructure hasn’t kept pace, and traditional Git servers are reaching their limits.
Development Has Outgrown Traditional Git
Software development has exploded in both volume and velocity. According to Archive Market Research, the global software repository market will hit $15 billion in revenue in 2025, growing at 15% annually through 2033. GitHub now hosts over 420 million repositories, supporting more than 100 million developers – each pushing more commits, more frequently, and with more automation than ever.
At the same time, 97% of teams now use AI-assisted tools: vibe-coding, AI agents creating PRs and commits, automated AI code review, and AI-generated documentation. The result is a massive surge in server traffic, particularly read operations from the growing volume of test jobs and automated quality checks.
Traditional scaling methods – adding more servers, bigger instances, or wider clusters – simply don’t address the core issue: the world has changed since Git was built 20 years ago.
The Breaking Point for Traditional Git Servers
The symptoms are now familiar to most engineering teams:
Slow performance. Git clones and fetches take several minutes or even time out, instead of completing in a few seconds.
Pipeline delays. CI/CD is waiting for repositories to be available for cloning or fetching, which delays the feedback cycle.
Storage capacity strain. Large monorepos and multi-repo setups are hitting massive blob counts and ref limits. When storage is hosted in the cloud or on shared disks, the growth in directory entries and I/O traffic makes Git almost unusable.
Rising costs. Over-provisioned infrastructure is trying (and failing) to keep up. Provisioned or elastic filesystem limits have to be raised again and again, driving operational costs to unprecedented heights and putting more pressure on budgets.
The Archive Market Research report warns: “The rising complexity of software development necessitates efficient and reliable repositories for managing code, dependencies, and artifacts.”
In other words, the systems we depend on to deliver software are now slowing software down.
Scaling Smarter, Not Harder
At GerritForge, we believe that simply throwing more hardware at the problem is no longer enough: it is Git and how it is used that need to be smarter. We have been researching the use of AI for optimising the performance and scalability of repositories for over 3 years and have shared our findings with the global scientific community.
That’s why we built GHS, an AI-powered accelerator for Git-based SCMs.
We have simulated and demonstrated how traditional tools collapse under this load, and developed GHS, a brand-new model that overcomes the traditional limitations of Git repositories.
GHS learns how your repositories behave, your access patterns, your CI/CD workflows, and your traffic peaks – and, using reinforcement learning, automatically optimises performance in real-time.
With GHS, the whole development team will experience:
Up to 100× faster Git operations. Clones and fetches complete at raw speed, without slowdowns, timeouts or stalls.
Real-time auto-optimisation. The Git repositories are automatically monitored and optimised for speed, without manual operations.
Lower operational costs. CPU and storage no longer suffer from the overload of incoming traffic, delivering better performance on existing hardware, with no need for massive infrastructure scale-ups.
Increased reliability. Git servers remain stable even under extreme load, eliminating the need for emergency restarts or maintenance.
Traditional scaling adds more servers. GHS scales using its sophisticated AI model to get the most out of what you already have.
The Gerrit User Summit 2025 made one thing clear: the era of traditional scaling is coming to an end.
As repositories grow and workflows evolve, the next generation of developer infrastructure must be intelligent, simple, and resilient. AI isn’t replacing developers – it’s augmenting them. But if your Git infrastructure can’t keep pace with this new level of activity, your entire SDLC pipeline will suffer.
2025 marks a turning point for software delivery. Traditional Git servers can’t keep up, but that doesn’t mean your team can’t. Because in modern DevOps, speed isn’t just efficiency, it’s a competitive edge.
Slow Git isn’t just annoying – it’s expensive for everyone: developers wait for CI/CD validation of their changes, SCM admins waste their time fire-fighting SCM slowdowns and broken pipelines, and IT managers waste millions of dollars in CPU and disk hosting costs. What if your Git operations could be up to 100x faster and keep up with the new landslide of AI-generated PRs and changes?
100x Faster Git, Powered by AI
GHS is an AI-based accelerator for Git SCM that redefines performance:
Up to 100x faster clones and fetches.
CI/CD pipelines that run without SCM barriers.
Automatic adaptation to your repository’s shape for maximum speed.
Scaling without slowdown, even under heavy load.
This isn’t traditional “tuning.” GHS learns your repos and access patterns, then continuously optimises them so your Git server is always running at maximum speed.
How Does GHS Deliver 100x Faster Git?
Measure Everything. GHS collects detailed metrics on your repositories in real time.
Spot Bottlenecks. The GHS AI model is trained to recognise bottlenecks and take immediate action before they become a problem.
Stay Fast. Your Git stays consistently accelerated, not just temporarily boosted.
Why Speed Matters
Developers stop wasting hours on slow fetches and builds.
Release managers push features out faster.
IT leaders reduce infra costs by doing more with less.
Admins no longer fire-fight performance issues.
Every 1% of time saved in Git can add up to days of productivity across a large team. Imagine saving 100x that.
Who Needs 100x Faster Git?
Teams with repositories of all sizes: AI-driven code generation and “vibe-coding” have dramatically accelerated the pace of software delivery.
Enterprises that have adopted AI pipelines and want that value delivered to production faster.
Any team frustrated with slow CI/CD pipelines.
The GHS Advantage
Transformative speed: not just 2x or 5x faster, but up to 100x.
SCM expertise: GerritForge’s decades of enterprise SCM know-how built in.
Proven reliability: Stability and uptime as performance scales.
Get Started Today
You can try GHS free for 30 days and experience the difference for yourself.
First, a huge thank you to the OpenInfra Foundation for hosting this event in Paris. Their invitation to have the Gerrit User Summit join the rest of the community set the stage for a truly collaborative and impactful gathering.
Paris last weekend wasn’t just a conference; it was a reunion. Fourteen years after the last GitTogether at Google’s Mountain View HQ, the “Git and Gerrit, together again” spirit was electric.
On October 18-19, luminaries from the early days (Scott, Martin, Luca, and many others) reconvened, sharing the floor with the new generation of innovators. The atmosphere was intense, filled with the same collaborative energy of 2011, but focused on a new set of challenges. The core question: how to evolve Git and Gerrit for the next decade of software development, a future dominated by AI, massive scale, and an urgent demand for smarter workflows.
Here are the key dispatches from the summit floor.
A Historic Reunion, A Shared Future
This event was a powerful reminder that the open-source spirit of cross-pollination is alive and well. The discussions were invigorated by the “fresh air” from new-school tools like GitButler and Jujutsu (JJ), which are fundamentally rethinking the developer experience.
In a significant show of industry-wide collaboration, we were delighted to have GitLab actively participating. Patrick’s technical presentation on the status of reftable was a highlight, but his engagement in discussions on collaborative solutions moving forward with the Gerrit community truly set the tone. It’s clear that the challenges ahead are shared by all platforms, and the solutions will be too.
Scaling Git in the Age of AI
The central theme was scale. In this rapidly accelerating AI era, software repositories are growing at an unprecedented rate across all platforms—Gerrit, GitHub, and GitLab alike. This isn’t a linear increase; it’s an explosion, and it’s pushing SCM systems to their breaking point.
The consensus was clear: traditional vertical and horizontal scaling is no longer enough. The community is now in a race to explore new techniques—from the metal up—to improve performance, slash memory usage, and make core Git operations efficient at a scale we’ve never seen before. This summit was a rare chance for maintainers from different ecosystems to align on these shared problems and forge collaborative paths to solutions.
Dispatches from the Front Lines: NVIDIA and Qualcomm
This challenge isn’t theoretical. We heard powerful testimonials from industry giants NVIDIA and Qualcomm, who are on the front lines of the AI revolution.
They shared fascinating and sobering insights into the repository explosion they are actively managing. Their AI workflows—encompassing massive datasets, huge model binaries, and unprecedented CI/CD activity—are generating data on a scale that is stressing even the most robust SCM systems. Their presentations detailed the unique challenges and innovative approaches they are pioneering to tackle this data gravity, providing invaluable real-world context that fueled the summit’s technical deep dives.
Beyond the Pull Request: The Quest for a ‘Commit-First’ World
One of the most passionate debates centered on the developer workflow itself. The wider Git community increasingly recognizes that the traditional, monolithic “pull request” model is ill-suited to the “change-focused” code review that platforms like Gerrit have championed for years.
The benefits of a change-based workflow – cleaner history, better hygiene, and higher-quality atomic changes – are driving a growing interest in standardizing a persistent Change-ID for each commit. This would make structured, atomic reviews a first-class citizen in Git itself. The collaboration at the summit between the Gerrit community, GitButler, JJ, and other Git contributors on defining this standard was a major breakthrough.
This shift is being powered by tools like GitButler and JJ, which are built on a core philosophy: Workflow Over Plumbing. Modifying commits, rebasing, and resolving conflicts remain intimidating hurdles for many developers. The Git command line can be complex and unintuitive. These new tools abstract that complexity away, guiding users through commit management in a way that feels natural. The result is faster iteration, higher confidence, and a far better developer experience.
AI and the Evolving Craft of Code Review
Finally, no technical summit in 2025 would be complete without a deep dive into AI. The arrival of AI-assisted coding is fundamentally shifting the dynamic between author and reviewer.
Engineers at the summit expressed a cautious optimism. On one hand, AI is a powerful tool to accelerate reviews, improve consistency, and bolster safety. On the other, everyone is aware of the trade-offs. Carelessly used, AI-generated code can weaken knowledge sharing, blur IP boundaries, and erode a team’s deep, institutional understanding of its own codebase.
The challenge going forward is not to replace the human in the loop, but to strengthen the craft of collaborative review by integrating AI as a true co-pilot.
A Path to 100x Scale: The GHS Initiative
The most forward-looking discussions at the summit centered on how to achieve the massive scale required. One of the most promising solutions presented was GHS (Git-at-High-Speed). This innovative approach is not just an incremental improvement; it’s a strategic initiative designed to increase SCM throughput by as much as 100x.
The project’s vision is to enable platforms like Gerrit, GitLab, and GitHub Enterprise to handle the explosive repository growth and build traffic generated by modern AI workflows. By re-architecting key components for hyper-scalability, GHS represents a concrete path forward, ensuring that the industry’s most critical SCMs can meet the unprecedented demands of the AI-driven future.
The Road from Paris
The Gerrit User Summit 2025 was more than a look back at the “glorious days.” It was a statement. The Git and Gerrit communities are unified, energized, and actively building the next generation of SCM. The spirit of GitTogether 2011 is back, but this time it’s armed with 14 years of experience and a clear-eyed view of the challenges and opportunities ahead.
So you think PRs are the only way you can do effective code review? Well, then you’ll be surprised to hear that ain’t the case!
If you’ve read my previous article, you’ll be aware that GitHub or GitLab are great for quick and easy integrations with CI/CD tooling and average for issue tracking/planning, but, in my opinion, miss the mark when it comes to facilitating effective code review.
Today, I want to show you a typical Gerrit workflow and explain why I believe it can greatly improve the quality of your reviews.
Let’s start by reiterating that PRs should be small so as to be easily reviewable, and it’s the creator’s job to make sure they are paired with a meaningful description. But don’t take it from me, you can read it directly on GitHub’s blog.
However, splitting out work into easily reviewable chunks is not easy or intuitive when following a PR-based approach, as maintaining dependent PRs is cumbersome and error-prone. So let’s see how Gerrit facilitates this for developers.
Relation chains vs PRs
Let’s say that you’ve started working on a new feature. You realize early on that it’ll be a lot easier to implement if you do some refactoring beforehand. So you go ahead, do the refactoring, commit your work, then add the new feature and commit that too, as a separate commit. If you’re working with a PR-based workflow, before doing the above you’ve probably created a branch and are now about to push it to the remote to then create a PR from the UI, am I right?
Well, if you’re using Gerrit, you don’t need to do that. If your work is supposed to be merged on main, then you can develop directly on that branch, create those 2 commits, and push them with git push origin HEAD:refs/for/main. We’ll decompose this command later, but for now let’s look at its effects.
As you can see in the screenshot above, Gerrit detects that you’re pushing two separate commits, so it automatically creates 2 new changes, one for each commit, against the branch you’re working on. There was no need to interact with the UI either.
But how did this happen? HEAD:refs/for/<branch-name> tells the Gerrit backend that you’d like to push your current local changes (HEAD) up for review against <branch-name>; Gerrit then does all the magic for you.
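Putting it all together, the whole sequence from the example above looks like this on the command line (a minimal sketch; the file paths and commit subjects are made up for illustration):

```bash
# Work directly on main: no feature branch needed
git checkout main

# First commit: the preparatory refactoring
git add .
git commit -m "Refactor payment module to extract validation logic"

# Second commit: the new feature, built on top of the refactoring
git add .
git commit -m "Add support for recurring payments"

# Push both commits for review against main:
# Gerrit creates one change per commit, forming a relation chain
git push origin HEAD:refs/for/main
```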
Let’s now take a look at the UI:
There’s quite a lot to unpack here, so let’s go in order. First, Gerrit created a Relation chain for us that includes both commits. Then, in the commit message, we notice something maybe unexpected: a Change-Id. This is Gerrit’s way of tracking updates to a change. From now on, every time we push a new commit with the same Change-Id, Gerrit will understand it belongs to this change and update it accordingly. You could completely change the commit message, but, as long as you leave the Change-Id unchanged, Gerrit will understand where it belongs.
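For illustration, a commit message carrying its Change-Id footer looks roughly like this (the subject and ID are hypothetical; Gerrit’s commit-msg hook generates the ID automatically):

```
Refactor payment module to extract validation logic

Change-Id: I8f3a2b91c4d5e6f708192a3b4c5d6e7f80912345
```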
So let’s try doing that.
How to update changes
The main difference from what most people are used to here is that when you want to update a change, you don’t push a new commit; rather, you amend the existing one.
First, I need to check out the change. I can do that by clicking on the DOWNLOAD button (highlighted in the image above), which will show this modal window:
This gives me a few options for checking out said change without having to Google how to check out a specific ref or how to cherry-pick each time. So, I copy-paste the “Checkout” option and then run the following on my terminal.
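The copied command plus the amend-and-push steps look roughly like this (a sketch; the actual change ref and patchset number come from your own Gerrit instance):

```bash
# The "Checkout" option from the DOWNLOAD modal: fetch the
# change's ref and check it out
git fetch origin refs/changes/01/101/1 && git checkout FETCH_HEAD

# Update the commit in place instead of creating a new one
git commit --amend

# Push it back: the unchanged Change-Id tells Gerrit this is
# a new patchset of the same change
git push origin HEAD:refs/for/main
```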
You can see here that I’m not creating a new commit, but rather amending the same one.
Gerrit now warns me that I’ve not amended any files, only the commit message: no problem there, just a handy reminder.
As I’ve not changed the Change-Id, we can see that the change number is still the same. The UI now looks like this:
A few things to notice here. First, I’ve updated the commit message to add a description, great, but surely I wouldn’t want to lose track of the initial state of the change? Can I still see the previous version of this commit? YES!
When you amend your commit, you’re not losing the history of what was there before, you’re merely adding a new patchset to the change. Patchsets are what Gerrit calls each iteration of the commit.
You can also see how the next commit in the relation chain is now marked as an “Indirect relation.” This is because it is no longer based on the latest version of its parent commit, which has just been updated, so let’s go ahead and fix it. First, we’ll need to select the commit we want to update and then click on the handy “Rebase” button, which will show the following modal:
As this is a straightforward case, we’ll confirm the rebase on the parent change.
We can now see that the indirect relation message has disappeared. But how did Gerrit keep track of that? By creating a new patchset which tracks the updated version of the parent commit! Pretty neat, huh?
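For reference, the Rebase button does server-side roughly what you could also do locally, along these lines (a sketch; the change refs are hypothetical):

```bash
# Check out the child change
git fetch origin refs/changes/02/102/1 && git checkout FETCH_HEAD

# Rebase it onto the latest patchset of its parent
git fetch origin refs/changes/01/101/2
git rebase FETCH_HEAD

# Push: same Change-Id, so Gerrit records a new patchset
# rather than creating a new change
git push origin HEAD:refs/for/main
```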
This leads us to our next topic:
Meaningful commit messages by design
Another problem I see with the PR workflow, or at least how it’s been implemented by the most popular tools today, is that it essentially leaves it up to each individual developer to write meaningful commit messages. Whichever merge strategy you pick, none of them allows the commit message itself to be reviewed before it gets merged.
In Gerrit there is no PR description: the commit message IS the “PR description”. Furthermore, the commit message appears as a file in its own right, and users can comment on it. This allows reviewers to add comments on the message itself to ensure it’s up to scratch with whatever the company standards are.
Submitting the change
Finally, once the comment has been addressed and the change approved, we can click on Submit including parents, and both changes will be merged to main.
You can verify this by running git log.
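Illustratively, the log would show both commits now on main, each retaining its Change-Id (the hashes, author and dates here are made up):

```
commit 9c2f1e4d8a7b3c5f6e0d1a2b3c4d5e6f7a8b9c0d (HEAD -> main, origin/main)
Author: Jane Doe <jane.doe@example.com>
Date:   Tue Mar 4 10:42:17 2025 +0000

    Add support for recurring payments

    Change-Id: I2b4c6d8e0f1a3b5c7d9e1f2a3b4c5d6e7f809123

commit 1a2b3c4d5e6f708192a3b4c5d6e7f8091a2b3c4d
Author: Jane Doe <jane.doe@example.com>
Date:   Tue Mar 4 09:15:03 2025 +0000

    Refactor payment module to extract validation logic

    Change-Id: I8f3a2b91c4d5e6f708192a3b4c5d6e7f80912345
```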
Conclusion
The one thing I’d like readers to take away from this article is that no, PRs are not the only way forward, and there are solutions to the old problem of wanting to split up code reviews into more manageable chunks.
So let’s quickly review how Gerrit works: you create changes by pushing commits using the “magic” ref refs/for/<branch-name>. When you want to update a change, you don’t push a new commit; rather, you amend the existing one. Gerrit keeps track of each update in the form of patchsets, so you can see how each commit evolved over time.
You can then chain commits as you wish, allowing you to split your work into easily reviewable units.
Finally, commit messages are treated as first-class citizens: they’re like any other file in the change, they can be reviewed and commented on, ensuring that no unhelpful commit messages are merged to the target branch.
(TL;DR) The Gerrit Code Review project has an ambitious roadmap for 2025 and beyond. Gerrit 3.12 (H1 2025) will focus on JGit performance improvements, X.509 signed commit support, and enhanced Owners and Analytics plugins. Gerrit 3.13 (H2 2025) aims to further optimize push performance, UI usability, and plugin updates, including Kafka and Zookeeper integrations.
The k8s-Gerrit initiative will improve Gerrit’s deployment on Kubernetes, ensuring better scalability and resilience. Looking ahead, Gerrit 4.0 (2026/2027) plans to decouple the review UI from the JGit server, enabling alternative review interfaces and improved flexibility.
The development team is prioritizing significant performance enhancements in JGit, the Java implementation of Git used by Gerrit. Key objectives include:
Speeding Up Conflicting Ref Names on Push: This aims to reduce delays during push operations when ref name conflicts occur.
Enhancing SearchForReuse Latency for Large Monorepos: The goal is to improve latency by at least an order of magnitude, facilitating more efficient operations in extensive monorepositories.
Improving Object Lookup Across Multiple Packfiles: Efforts are underway to accelerate object lookup times by at least tenfold, enhancing performance in repositories with numerous packfiles.
Parallelizing Bitmap Generation: By enabling bitmap generation across multiple cores, the team aims to expedite this process, contributing to overall performance gains.
Gerrit Core Experience Enhancements
A notable feature planned for this release is the support for X.509 signed commits, which will bolster security and authenticity in the code review process.
Owners Plugin Improvements
Enhancements to the Owners Plugin are set to provide clearer insights and streamlined interactions:
Action Requirements Display: Explicitly showing required actions by each owner at the file level.
Detailed Pending Reviews: Offering more comprehensive information on pending reviews by owners.
Easier Contact with File Owners: Facilitating more straightforward communication with file owners.
Analytics Plugin Optimization
The analytics plugin is slated for improvements to enhance speed and usability:
Repo Manifest Discovery: Native support for discovering repository manifests.
Faster Metrics Extraction: Accelerated extraction of metrics for branches, aiding in quicker data analysis.
Gerrit 3.13 (Target: H2 2025)
JGit Performance and Concurrency Enhancements
Building upon previous improvements, version 3.13 aims to further optimize performance:
Optional Connectivity and Collision Checks: Improving push performance by allowing the skipping of certain checks when appropriate.
Customizable Lock-Interval Retries: Providing flexibility in managing lock intervals to enhance concurrency handling.
Read-Only Multi-Pack Index Support: Introducing support for read-only multi-pack indexes to improve repository access efficiency.
Gerrit Core and UI Experience Enhancements
User experience remains a focal point with planned features such as:
File List Filtering: Allowing users to filter the file list in the change review screen for more efficient navigation.
Headless Gerrit Packaging: Offering a version of Gerrit that serves only read/write Git protocols, catering to users seeking a streamlined experience.
Plugin Updates
The roadmap includes updates to key plugins:
Kafka Events-Broker: Upgrading to support Kafka 3.9.0, enhancing event handling capabilities.
Zookeeper Global-Refdb: Updating to support Zookeeper 3.9.3, improving global reference database management.
Replication Plugin Enhancements
Efforts to simplify configuration and improve performance include:
Dynamic Endpoint Management: Introducing APIs for creating and updating replication endpoints dynamically.
UI Integration: Displaying replication status within the user interface for better visibility.
Reduced Latency on Force-Push: Improving replication latency during force-push operations by applying objects with prerequisites.
k8s-Gerrit
The k8s-Gerrit initiative focuses on deploying Gerrit within Kubernetes environments, aiming to enhance scalability, resilience, and ease of management. This approach leverages Kubernetes’ orchestration capabilities to provide automated deployment, scaling, and management of Gerrit instances, facilitating more efficient resource utilization and operational consistency.
Gerrit 4.0 (Target: 2026/2027)
Looking ahead, Gerrit 4.0 is set to introduce significant architectural changes:
Decoupling Gerrit Review UI and JGit Server: This separation will allow the Gerrit UI and JGit server to operate as independent services, providing greater flexibility in deployment and scaling.
Enabling Alternative Review UIs: By decoupling components, the platform will support the integration of other review interfaces, such as pull-request systems, offering users a broader range of tools tailored to their workflows.
The Gerrit community is encouraged to stay engaged with these developments, as the roadmap is subject to change. Contributors planning to work on features not listed are advised to inform the Engineering Steering Committee (ESC) to ensure alignment with the project’s goals.
As 2024 draws to a close, it’s a great time to reflect on the milestones and achievements of the GerritForge team in innovating Git performance with the release of GHS, supporting and advancing the Gerrit Code Review project with 20 releases and over 1.5k commits. This year has been one of exciting developments and initiatives that have made Gerrit stronger, more accessible, and better equipped to meet the needs of modern software teams. Here’s what GerritForge accomplished for the Git and Gerrit Code Review ecosystem in 2024.
2024 in numbers
1,500 changes merged
40 repositories involved in the Gerrit and JGit projects
20 releases
5 conferences, 6 meetups
11 contributors, 4 maintainers
20k man-hours spent on open-source projects
Managing the Releases of Gerrit 3.10 and 3.11
As usual, we continue to ensure the regular releases of new Gerrit versions. This year was no exception with the releases of Gerrit 3.10 and 3.11. These releases introduced new features, performance optimisations, and bug fixes to further solidify Gerrit as the leading code review tool for large-scale projects.
Our team cooperated with the Gerrit Code Review community to ensure these releases were stable, well-documented, and aligned with community needs. From coordinating release schedules to ensuring compatibility with plugins and integrations, GerritForge played a key role in making these versions a reality.
Key Highlights
Gerrit 3.10 brought major improvements in user experience, including refined UI capabilities and enhanced search functionality.
Gerrit 3.11 introduced critical updates to multi-site support, security enhancements, and significant performance boosts.
GerritMeets success in both the USA and Europe
The GerritMeets series has become a cornerstone of community engagement, and 2024 was no exception. GerritForge organised six GerritMeets across several locations, including California, Germany, and the UK, bringing together contributors, users, and maintainers to share knowledge, discuss trends, and explore use cases. Each session covered a diverse range of topics, including:
Best practices for multi-site setups.
Advanced plugin development.
Gerrit performance tuning for large repositories.
Innovations in CI/CD workflows with Gerrit.
These virtual meetups provided a platform for collaboration and learning, reinforcing Gerrit’s community spirit. All the recordings of the 2024 GerritMeets are available as a playlist on the GerritForge TV YouTube channel.
World-class Enterprise Gerrit-as-a-Service on Google Cloud Platform
In 2024, GerritForge expanded Gerrit’s capabilities by adding support for Gerrit-as-a-Service (GaaS) on the Google Cloud Platform (GCP). This initiative makes it easier than ever for organizations to adopt Gerrit without the operational overhead of managing infrastructure.
Benefits of Gerrit-as-a-Service on GCP
Scalability: Leveraging GCP’s powerful infrastructure to scale Gerrit deployments for enterprises of any size.
Simplicity: Reducing setup and maintenance complexity, allowing teams to focus on development and code reviews.
High Availability: Utilizing GCP’s advanced networking and storage capabilities for improved uptime and disaster recovery.
By enabling Gerrit on GCP, GerritForge is broadening the tool’s accessibility, particularly for teams looking for a cloud-native, fully managed solution.
Gerrit User Summit with Qualcomm
This year’s Gerrit User Summit, organized by Qualcomm in collaboration with GerritForge, was a highlight for the community. Held at the Qualcomm HQ in San Diego, CA (USA), the summit offered a chance for Gerrit enthusiasts worldwide to come together in person. The agenda featured:
Keynotes by Gerrit maintainers and industry leaders.
Hands-on hackathon for contributors and users.
Insightful panels on the future of Gerrit.
The collaboration with Qualcomm not only expanded the summit’s reach but also highlighted Gerrit’s growing importance in enterprise environments.
Showcasing GHS and Gerrit Code Review worldwide
This year, GerritForge actively participated in numerous conferences to showcase GHS, expand Gerrit’s visibility, and demonstrate its unique capabilities. These events were a fantastic opportunity to unveil GHS and introduce Gerrit to hundreds of developers who had never encountered it.
At these conferences, we showcased:
GHS: demonstrating its seamless integration with Git and showcasing a whopping 100x performance improvement over a vanilla setup, thanks to the power of AI and reinforcement learning.
K8s-Gerrit: Highlighting how Gerrit deployed on Kubernetes provides unparalleled flexibility, performance, and multi-site support.
Through live demos, presentations, and Q&A sessions, we highlighted GHS and Gerrit’s ability to scale, Gerrit’s unique review model, and their role in making software delivery pipelines up to 100x faster.
K8s-Gerrit Hackathon with SAP
Collaboration was at the forefront of the k8s-Gerrit Hackathon, co-hosted by GerritForge and SAP. The hackathon brought together developers from both SAP and GerritForge teams to tackle the challenges of Kubernetes-based Gerrit deployments and multi-site support.
Outcomes of the Hackathon
Enhanced scalability for k8s-Gerrit deployments.
Breakthroughs in multi-site replication and disaster recovery.
Valuable contributions to the Gerrit codebase and documentation.
The event exemplified the power of open collaboration, pushing Gerrit further into cloud-native development.
Looking Ahead to 2025
As we celebrate the progress made in 2024, we remain focused on the road ahead. GerritForge is committed to:
Public offering of GHS
Gerrit-as-a-Service in Google Cloud
Gerrit Code Review v3.12 and v3.13
More GerritMeets and Gerrit User Summit events
Our gratitude goes out to the Git and Gerrit Code Review communities, contributors, and partners who have made this year a success. Together, we’re building a tool that empowers teams to deliver high-quality code faster and more efficiently. Here’s to an even more impactful 2025!
We’re thrilled to announce that our team will be speaking about our advancements with GerritForge AI Health Service (GHS) at several prestigious conferences in the coming months. These events provide an incredible opportunity to share our innovative AI solutions with a broader audience, engage with industry experts, and showcase how GHS is revolutionizing the way organizations maintain the health and stability of their Gerrit and Git systems.
Our journey begins at the Linux Open Source Summit in Vienna, from the 16th to the 18th of September. This summit is a cornerstone event for the open-source community, and we couldn’t be more excited to discuss how GHS leverages AI to ensure the seamless performance of Git and Gerrit systems, even in the most demanding environments.
Next, we’ll be in Berlin for Git Merge on the 19th and 20th of September. Git Merge is the go-to event for Git enthusiasts and professionals alike, and we’re eager to dive deep into the technical aspects of GHS, sharing insights on how our AI solution optimizes system performance, reduces downtime, and empowers development teams to focus on what they do best—creating great software.
In October, we’re particularly excited about the Gerrit User Summit in San Diego, on the 10th and 11th. This event is especially important to us as it brings together the Gerrit community to discuss the latest developments and best practices. We’ll be showcasing how GHS is enhancing Gerrit environments by providing intelligent and automated health monitoring and ensuring peak performance.
Following that, we’ll speak at the OCX conference in Mainz, from the 22nd to the 24th of October. OCX is known for bringing together top minds in DevOps and open-source technology, making it the perfect venue to highlight how GHS is transforming the management of code review and source control systems with intelligent, automated health monitoring and remediation.
Finally, we’re thrilled to wrap up our conference tour at KubeCon in Salt Lake City, from the 12th to the 15th of November. As one of the most anticipated events in the cloud-native ecosystem, KubeCon offers an unparalleled platform to demonstrate how GHS integrates with Kubernetes environments, ensuring that your SCM systems are always running at peak performance.
These conferences represent more than just speaking engagements for us—they are an opportunity to engage with the community, learn from our peers, and continue pushing the boundaries of what’s possible with AI in software development. We can’t wait to connect with you at these events and share how GHS can make a tangible difference in your organization’s success.
Stay tuned for more updates as we approach these dates, and be sure to catch our sessions if you’re attending any of these events!
Daniele Sassoli GerritForge Engineering Manager Gerrit Code Review Community Manager and Contributor
GerritForge has confirmed over 2023 its commitment to the Gerrit Code Review platform, helping deliver two major releases: Gerrit v3.8 and v3.9.
The major contributions focused on the plugins that extend the reach of the Gerrit platform, first and foremost pull-replication and multi-site, as shown by the split of the 853 contributions across the projects, weighted by the number of changes and the average modifications per change.
Pull-replication plugin
This is where GerritForge excelled, providing an unprecedented level of performance over anything built so far in terms of Git replication for Gerrit. Roughly one-third of the team’s efforts went into the pull-replication plugin, which over 2022/23 delivered a 1000x speedup compared to Gerrit’s traditional replication. GerritForge has further improved its stability, resilience and self-healing capabilities thanks to a fully distributed and pluggable message-broker system.
Gerrit v3.8 and v3.9
GerritForge helped release two major versions of Gerrit Code Review, contributing noteworthy features like Java 17 support, cross-plugin communication, the importing of projects across instances, and the migration to Bazel 7.
Owners plugin
Jacek has completely revamped the engine of the owners plugin, boosting it to an unprecedented level of performance, hundreds of times faster than the previous release, and bringing it to the modernity of submit requirements without the need to write any Prolog rules.
Multi-site plugin
The whole team helped provide more stability and bug fixes across multiple versions of Gerrit, from v3.4 up to the latest v3.9.
JGit
GerritForge kept its promise to step up its efforts in getting important fixes merged, including the optimisation of refs scanning in Git protocol v2 and the fix for bitmap processing under concurrent Git receive-pack operations, both of which we committed to at the beginning of 2023.
Migration of Eclipse JGit/EGit to GerritHub.io
2023 also saw a major improvement in GerritHub stability and availability, cutting the total outage over a 12-month period from 19 to 10 minutes, for a total uptime of 99.998% (source: Pingdom.com).
The whole process was completed without any downtime and with only a reduced read-only window on the legacy Eclipse instance git.eclipse.org, which was needed because of the lack of multi-site support on the Eclipse side.
What we achieved from our 2023 goals
JGit changes: we merged 22 changes in 2023, most of them within the list of our targets for the year. One, related to the packed-refs loading optimisation, was abandoned (it didn’t get much traction from the rest of the community), and the last major one left is the priority-queue refactoring, still in progress on stable-6.6. Also, thanks to the migration of JGit/EGit to GerritHub.io, David Ostrovsky managed to regain his committer status and will now be able to provide more help in getting changes reviewed and merged.
JGit multi-pack index support: we did not have the bandwidth and focus to tackle this major improvement. The task is still open for anyone willing to help implement it.
Git repository optimiser: we kick-started the activity and researched the topic, with Ponch presenting the current status at the Gerrit User Summit 2023 in Sunnyvale CA.
Gerrit v3.8 and project-specific change numbers: the design document has been abandoned because its end-to-end user goals needed rethinking. However, we found and fixed many use cases where Gerrit wasn’t using the project/change-number pair for identifying changes, which is a prerequisite for implementing any future project-specific change-number use case.
Gerrit Certified Binaries: the Platinum Enterprise Support for Gerrit has been enriched in 2023 with the certified binaries programme, with enhanced Gatling tests and E2E validation using AWS-Gerrit. Many bugs have been found and fixed in all the active versions of Gerrit; some of them were very critical and surprisingly undiscovered for months.
GerritForge Inc. revenue targets in the USA: revenues increased by 50% in 2023, which was slightly below the initial expectations but still remarkable, despite the economic downturn of the past 12 months. 100% of the business has been transferred to the USA, including the GerritForge trademark and logo, and we are now ready to start a new robust growth cycle in 2024 and beyond.
Looking at the future with AI in 2024
The economic news of the past 6 months has highlighted a difficult moment after the COVID-19 pandemic: the conjunction of the cost-of-living crisis, rising interest rates and two new major wars across the globe has pushed major tech companies to revise their short- to medium-term growth figures, resulting in successive waves of layoffs in the tech sector and beyond.
Whilst the layoffs are not immediately related to a lack of profitability of the companies involved, they highlight that in the medium term there will be far fewer engineers looking after production systems across these companies, including SCM.
SCM and Code Review are at the heart of the software lifecycle of tech companies and, therefore, represent the most critical part of the business that would need to be protected at all costs. GerritForge sees this change as a pivotal moment for stepping up its efforts in serving the community and helping companies to thrive with Gerrit and its Git SCM projects.
How do we maintain SCM stability with fewer people?
Gerrit Code Review has become more and more stable and reliable over the years, which should sound reassuring to all those companies facing reduced staff and the challenge of keeping the SCM lights on. However, the major cause of disruption is not the SCM code itself but rather its data.
The Git repositories and their status are nowadays responsible for 80% of the stability issues with Gerrit, and possibly with other Git servers as well. Imagine a system receiving a high rate of Git traffic (e.g. git clones) of 100 operations per minute, coping thanks to a well-optimised repository and its bitmaps. However, things may change quickly: a single user action (e.g. a force-push on a feature branch) could invalidate the effectiveness of the Git bitmap, and the server will start accumulating a backlog of traffic.
In a fully staffed team of SCM administrators, and with all the necessary metrics and alerts in place, the above condition would trigger a specific alert that can be noticed, analysed, and actioned swiftly before users experience any service degradation.
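For a bitmap invalidated in this way, the remediation is typically a repack that rebuilds the bitmap. Here is a sketch with plain git (the exact flags depend on your Git version and server layout; the repository path is hypothetical):

```bash
# On the server, inside the affected bare repository
cd /var/gerrit/git/my-project.git

# Inspect loose vs packed objects before acting
git count-objects -v

# Repack everything into a single pack and rebuild the
# reachability bitmap, so clones/fetches are served cheaply again
git repack -a -d --write-bitmap-index

# Optionally prune loose unreachable objects afterwards
git prune --expire=2.weeks.ago
```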
However, when there is a shortage of Git SCM admins, the number of metrics and alerts to keep under control could be overwhelming, and the trade-offs could leave the system congestion classified as a lower-priority problem.
When system congestion lasts too long, the incoming task queue could reach its limits, and users may start noticing issues. If the resource pools are too congested, the system could also enter a catastrophic failure loop, where the workload further reduces the fan-out of the execution pool, soon causing a global outage.
The above condition is only one example of what could happen to a Git SCM system. There are many variables to take into account to prevent a system from failing; the knowledge and experience of managing them is embedded in many of the engineers being laid off, with potentially serious consequences for the tech companies.
GerritForge brings AI to the rescue of Git SCM stability
GerritForge has been active for the past 14 years in making the Git SCM system more suitable for enterprises, from its very first inception: that’s the reason why this blog is named “GitEnterprise”, after all.
Over 2022 and 2023 we have been investing in analysing, gathering and exporting all the metrics of the Git repositories to the eyes and minds of SCM administrators, thanks to open-source products like the Git repo-metrics plugin. However, the recent economic downturn could leave all the knowledge and value of this data in a black hole if it stays in its current form.
When the work of analysing, monitoring and taking action on the data becomes too overwhelming for the size of the SCM Team left after the layoffs, there are other AI-based tools that can come to the rescue. However, none of them is available “out of the box” and their setup, maintenance and operation could also become an impediment.
GerritForge has historic know-how in knowledge-based systems and has been educating the Gerrit Code Review community about data collection and analysis for many years, for example with the Gerrit DevOps Analytics initiative back in 2017. It is now the right time to push on these technologies and package them in a form that is directly usable by all the companies who need it now.
Introducing GHS – GerritForge-AI Health Service
As part of our 2024 goals, GerritForge will release a brand-new service called GHS, directly addressing the needs of all companies that need to have a fully automated intelligent system for collecting, analysing and acting on the Git repository metrics.
The high-level description of the service has already been anticipated at the Gerrit User Summit 2023 in Sunnyvale by Ponch and the first release of the product is due in Q1 of 2024.
How does GHS work?
GHS is a multi-stage system composed of four basic processes:
Collect the metrics of your Gerrit or other Git repositories automatically and publish them to your metrics registry of choice (e.g. Prometheus); the sketch at the end of this section shows the kind of raw data involved.
Combine the repository metrics with the other metrics of the system, including the CPU, memory and system load, automatically.
Detect dangerous situations where the repository or the system is starting to struggle and suggest a series of remediation policies, using the knowledge base and experience of GerritForge’s Team encoded as part of the AI engine.
Define a direct remediation plan with suggested priorities and, if requested, act on them automatically, assessing the results.
Stage 4, the automatic execution of the suggested remediation, can also be performed in cooperation with the SCM Administrators’ Team, as it may need to go through the company’s procedures for its execution, such as a change-management process or communication with the business.
However, if needed, stage 4 can also be fully automated, allowing GHS to act whenever the SCM admins do not veto the proposed actions.
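To give a concrete flavour of stage 1, here is the kind of raw repository data involved, gathered with plain git commands rather than GHS itself (a rough illustration; the repository path is hypothetical, and GHS collects and publishes these automatically):

```bash
# Inside a bare repository on the server
cd /var/gerrit/git/my-project.git

# Loose vs packed object counts, pack sizes, garbage
git count-objects -v

# Total number of refs (branches, tags, change refs)
git for-each-ref | wc -l

# Number of packfiles: many small packs slow down object lookup
ls objects/pack/*.pack | wc -l

# Is a reachability bitmap present?
ls objects/pack/*.bitmap 2>/dev/null || echo "no bitmap"
```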
What are the benefits of GHS for the SCM Team?
GHS is the natural evolution of GerritForge’s services, which have historically been proactive in analysing Git SCM data and proposing an action plan. The GerritForge Health Check is a service we have successfully provided to our customers for years; the GerritForge Health Service completes the end-to-end stability story that the SCM Team needs now more than ever to survive with a reduced workforce.
To the SCM Administrator, GHS provides the metrics, analysis and tailored recommendations in real-time.
To the Head of SCM and Release Management Team, GHS gives the peace of mind of keeping the system stable with a reduced workforce.
To SCM users and developers, GHS provides a stable and responsive system throughout the day, without slowdowns or outages.
To the Head of IT, GHS makes it possible to meet the company’s cost and headcount reduction targets without sacrificing the overall productivity of the teams involved.
GerritForge’s pledges to Gerrit Code Review quality and Open-Source
GerritForge has provided Enterprise Support and free contributions to Gerrit Code Review for 14 years, pretty much since the beginning of the project. We pledged in the past to remain 100% Open-Source, and we stand by that promise.
For 2024, GerritForge will focus on delivering its promised Open-Source contributions to the stability and reliability of Gerrit Code Review, with:
Support for the Gerrit Code Review platform releases, Gerrit v3.10 and v3.11
Free support and development of the Gerrit CI validation process, in collaboration with all the other Gerrit Code Review contributors and maintainers
Free Open-Source fixes for all critical problems raised by any of its Enterprise Support Customers, available to everyone in the Gerrit Code Review community
Free Open-Source code base for the four main components of the new GHS product, following the Open-Core methodology for developing the service.
With regards to the initiatives that we started in the past few years, including pull-replication and multi-site, we believe they have reached a maturity level that would not need further major refactoring and extensions in 2024. We will continue to support and improve them over the years, based on the feedback and support requests coming from the Enterprise Support Customers and the wider Gerrit Open-Source community.
GHS AI engine and dogfooding on GerritHub.io
GHS will have a rule-based AI system that will drive all the main decisions on the selection and prioritisation of the corrective actions on the system. As with all AI systems, the engine will need to start with a baseline knowledge and intelligence and evolve based on the experience made on real-life systems.
GerritForge’s commitment to quality is based on the core principle of dogfooding: we use the systems we develop every single day and learn from them. This paradigm has been the basis of our 14 years of success, and we are committed to applying it to the development of GHS as well.
GerritForge has been hosting GerritHub.io since 2013, and tens of thousands of people and hundreds of companies use it for their private and Open-Source projects every single day. The system is fully self-serviced; however, it still relies on manual maintenance from our Gerrit and Git SCM admins.
We are committed to using GHS on GerritHub.io from day 1 and to using the metrics and learnings from the system to improve its AI rule engine continuously. All customers of GerritForge’s GHS service will therefore benefit from the historic knowledge and experience derived from the learnings and optimisations made on GerritHub.io over the months and years to come.
GHS = human Git SCM admins and AI robots working together
GHS will revolutionise the way Git SCM admins manage the system today: they will no longer be alone, juggling a series of tools to understand what’s going on, but will have an intelligent and expert robot at their side, driven by the wisdom and continuous learnings of GerritForge, at their service every single day.
We expect a very different way of working ahead of us: we are embracing this radical change in how people and companies work, and making GHS serve them effectively and in line with our Open-Source pledges.
The future is bright with GerritForge-AI Health Service, Git and Gerrit Code Review at your service!
Luca Milanesio GerritForge CEO Gerrit Code Review Release Manager and member of the Engineering Steering Committee
Our mission is to improve the Gerrit Code Review platform, with the Open-Source community and in full transparency. Everything we do is published, is open, can be analysed and scrutinised by any member of the community at any time.
In this article, we are publishing what GerritForge is planning to do in 2021, a pivotal year that will see the entire world bouncing back and looking at the future of technology in a different light.
Goal: cloud-native Gerrit Code Review
Since our initial post in 2019 about our plans for the future, we have been making steady progress toward a unified global Gerrit platform, highly available and distributed across the globe. A lot of steps have been taken since then, many of them in 2020.
Thanks to SAP and GerritForge, we have now two Open-Source standard deployments on the Cloud:
K8s-Gerrit – Helm-charts for deploying Gerrit Code Review to a Kubernetes Cluster (beta)
AWS-Gerrit – CloudFormation templates for creating a production-like setup of Gerrit HA and Multi-Site to ECS on the AWS platform.
Even if Gerrit can now be deployed into a Cloud environment, that does not mean that Gerrit is cloud-native. The main requirements for making an application cloud-native are captured by the 12-factor principles. We will be tackling some of them in our 2021 roadmap.
Cloud-native backing services
Thanks to the advances in the Gerrit Multi-Site plugin in 2020, it is today possible to have a geographically distributed cluster of Gerrit servers around the globe. The services that are needed for coordinating the events across Gerrit nodes and ensuring global consistency are:
Event-broker (Kafka)
Global ref-db (Zookeeper)
The price to pay for a global Gerrit Multi-Site is currently quite high, because one may argue that managing a cluster of Kafka brokers or Zookeeper nodes can be even more challenging than managing Gerrit itself.
A cloud-native application should be fully independent of its backing services and be open to swapping one specific implementation (e.g. Kafka) with a serviced off-the-shelf solution (e.g. AWS Kinesis, GCP PubSub).
GerritForge is joining forces with the Gerrit Code Review community in 2021 to refactor the Gerrit event system and make it more cloud-native, using open standards for representing and publishing events across the network.
The immediate actions are:
Build a new protobuf-based events representation in Gerrit v3.4
Abstract the serialisation of events from their transport
Implement different event brokers and listeners other than Kafka (e.g. NATS)
Provide a GCP PubSub broker connector in cooperation with Google
Gerrit performance across sites
Making a stateful application like Gerrit Code Review a truly cloud-native and distributed platform brings some challenges in terms of latency and performance.
The current Gerrit Multi-Site architecture still relies heavily on the traditional (push) replication plugin, which has been working nicely for over a decade for a Gerrit primary/replica setup. When the repository size increases and the need for global consistency becomes more stringent, relying on a traditional async git push is not enough anymore: you need a lot more tools at your service.
In 2020 GerritForge introduced the pull-replication plugin, which inverts the replication logic: instead of having one Gerrit server push everything to everyone else, you just tell the other nodes what’s new in the repository, and it is their responsibility to fetch only what has changed.
In 2021 we are going to improve the plugin with native JGit client protocol v2 support and features that will go well beyond the limits of the git protocol:
BLOB and ref-update on remote git pulls. The remote Gerrit node can receive the content of what has changed directly in the event payload. This avoids the expensive git pull operation, especially for NoteDb-related updates.
Event broker integration for git pulls. Instead of relying on the REST API, the pull-replication plugin can produce pull-replication events on the event broker.
The implementation of those improvements will drastically reduce the incidence of traditional (push) replication activity and make the Gerrit user experience more seamless across sites.
CI integration
We aim to integrate Gerrit more seamlessly with other cloud-native services, such as CI/CD. Gerrit v3.4 is planning to introduce a brand-new style of integration which would no longer require any 3rd-party adaptation or API support.
GerritForge is committed to endorsing this new module and allowing Gerrit v3.4 to be fully connected with Jenkins on day 1, and potentially with other cloud-based CI/CD systems.
Git reftable: provide more performance for mono-repos
GerritForge will experiment and test at scale the introduction of Google’s Git reftable, already included from Gerrit v3.2 onwards, to assess its performance impact on large mono-repos, with hundreds of thousands of refs.
Have your say on Gerrit future
This is our GerritForge shopping list for 2021, wide open for discussion, contributions, amendments, and voting.
Why should you care about what we do? Simply because the union of intents and cross-pollination of ideas make the Open-Source products and community stronger: if we join forces in building a solution that is a common interest for all of us, we will all benefit from it.
Your feedback is welcome and precious: feel free to have your say and comment on this post with your ideas, likes and suggestions on how to make them even better.
Starting from this week, we are going to share one video per week of the amazing talks that were presented at the Gerrit User Summit 2017 in London.
In addition to the YouTube recording, we are extracting the text and publishing it together with the relevant pictures taken from the presenter’s slides, so that people can start digesting the content in small bites.
This week’s talk is Patrick Hiesel’s presentation on how Gerrit’s multi-tenant and multi-master setup has been implemented at Google.
Gerrit@Google – Patrick Hiesel, Google
My name is Patrick, and I am going to talk about the setup of Gerrit we are running at Google. I wanted to take you on a journey, starting with the Gerrit that you all know and making it the system we run at Google, step by step; at the end we will have a multi-master and multi-tenant system.
Multi-what?
Multi-tenant is the ability to serve multiple hosts from the same single Java process. Imagine like the same JVM task serving gerrit-review.googlesource.com and gerrit-chromium.googlesource.com.
Multi-master is the ability to have multiple Gerrit servers all over the world. You can contact any one of them for reads and writes.
Most systems have read replicas, which is straightforward, but write replicas is where the juicy meat is.
Multi-tenant
We have gerrit-review.googlesource.com, based on the open-source Gerrit that you can download right now and have running in hopefully under ten minutes. That is core Gerrit, and it depends on three things:
JGit: for all the Git stuff
Multiple indexes for the accounts, changes, and other stuff
Caches
All these three components are based on the filesystem in one way or the other.
Now you have a friend that is accessing go-review.googlesource.com, what are you going to do?
The most natural solution is to start another Gerrit instance for it. You can have all of them on one machine, you can give them different ports, very easy, and in the end, they’ll be all based on the filesystem.
All those Gerrit instances do not need to talk to each other; they can just be separate instances operating on separate ports. This is not a multi-tenant system, but only different Gerrit instances on the same host.
You can add another layer on top of it: a servlet engine which receives all the traffic, checks which host the traffic is for, and just delegates to the individual host.
To take one step further, have that selection filter doing that for you. Gerrit has a daemon that runs all the functionalities. You can integrate that daemon into the incoming servlet filter. When you can get a request for go-review.googlesource.com and I do not have where to allocate it, you can just launch it, instantiate all the objects and then run the traffic from there. Also, unload instances would work in the same way.
The Gerrit server engine and the selection filter can run in a single JVM.
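To make that concrete, here is a minimal sketch of such a host-selection filter, assuming the standard javax.servlet API; this is not Google’s code, and the GerritInstance interface and launchInstance bootstrap are hypothetical stand-ins for the per-host Gerrit daemon.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;

/**
 * Hypothetical host-selection filter: keeps one Gerrit "instance" per
 * virtual host inside the same JVM and lazily starts missing tenants.
 */
public class HostSelectionFilter implements Filter {

  /** Stand-in for the per-host Gerrit daemon with its own indexes and caches. */
  interface GerritInstance {
    void serve(ServletRequest req, ServletResponse res) throws IOException, ServletException;
  }

  private final Map<String, GerritInstance> instances = new ConcurrentHashMap<>();

  @Override
  public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
      throws IOException, ServletException {
    String host = ((HttpServletRequest) req).getServerName();
    // Lazily instantiate the tenant on the first request, as described in the talk.
    instances.computeIfAbsent(host, this::launchInstance).serve(req, res);
  }

  /** Hypothetical bootstrap: allocate the site, indexes and caches for one host. */
  private GerritInstance launchInstance(String host) {
    return (req, res) -> { /* delegate to the per-host Gerrit daemon here */ };
  }

  @Override public void init(FilterConfig cfg) {}
  @Override public void destroy() {}
}
```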
How Gerrit can conquer the world
So you have a master here in Europe, and you have a friend on the West Coast of the US. He says: “Oh, your Gerrit is so slow, I have no idea why, and I wish I could move to GitHub.” You say: “Hold on, I can do better than that!” and so you add a new master for the people on the West Coast.
The key to that is replication, and it comes in two sets:
Objects that you have to replicate: all the Git data. JGit puts objects onto the disk, and this is the data you need to replicate correctly and fast.
Other data that should also replicate fast, to provide a pleasant user experience, but that really can be best-effort: the indexes and the caches.
If it is okay for your master to have 200 to 600 ms of additional latency, then do not replicate the caches. You can have a cold cache in Singapore or the US and re-read the data without problems.
For the indexes, replication can be best-effort, but you should make an effort to replicate them: having them replicated is nonetheless a requirement. One way to achieve that is to use ElasticSearch, but other index implementations that provide index replication can be used as well.
Multi-tenant and Multi-master together
We talked about making the system multi-tenant and then replicating it globally, so we now have a multi-tenant and multi-master system, actually pretty close to what we run at Google.
That is the stack that we run in total. We have a selection filter and two other filters to decide where the traffic is directed. We are also based on JGit, no magic there; we have indexes and caches that we replicate, and all our systems are based on the filesystem and BigTable.
Some “magic” happens at the Git layer at Google, because that is where the majority consensus across all the cells takes place. When you push anything that is Git (and, with NoteDb, anything that is a review is in the repository as well), the system tries to reach a majority of the cells and write the objects to them. When that is acknowledged, you get a green light on the push.
Majority consensus also means that the data is only in so many cells, not in all the cells all the time. Some of the replication happens in the long tail: replication events eventually get acknowledged by the remaining cells, and then the data gets written to all the masters.
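As a rough illustration of what majority-acknowledged writes mean here (the Cell type and tryWrite call are invented for this sketch, not Google’s internal API):

```java
import java.util.List;

/** Invented types: a minimal sketch of majority-acknowledged Git writes. */
final class MajorityWriter {

  interface Cell {
    boolean tryWrite(byte[] packedObjects); // true once this cell acknowledges
  }

  /** Returns true (green light on the push) once a majority of cells acked. */
  static boolean push(List<Cell> cells, byte[] packedObjects) {
    int acks = 0;
    for (Cell cell : cells) {
      if (cell.tryWrite(packedObjects)) {
        acks++;
      }
    }
    // Remaining cells catch up later, in the long tail of replication events.
    return acks > cells.size() / 2;
  }
}
```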
Our indexes and caches are replicated as well; some of them are just in-memory, with a component that gets replicated on top of BigTable.
Redundancy everywhere.
We run five data-centers across three continents (Americas, Europe, Asia-Pacific), with precisely the stack we just saw, which gives good latency to most of our users worldwide.
Let’s talk about load balancing. We have a system that is multi-master and multi-tenant, and any of the tasks can serve any request; but just because it can, doesn’t mean that it should.
Maybe that task has cold in-memory caches, or it is in Singapore and you are in the US; so the question is: what happens when the biggest machine is not big enough, and how do we optimise?
The idea is that we want to reduce the latency that comes from cold caches and minimise the time the site takes to load.
So you have a request for gerrit-review.googlesource.com, and your instance has a cold cache, and you need to read from disk to memory to serve the request.
You have a fleet of 300 tasks available, but you want to serve gerrit-review.googlesource.com from just five of them. If you serve 300 requests from 300 tasks in a round-robin manner, you pay the latency of loading data from disk to memory for every single request. The second motivation is that you want to distribute the load.
We want a system that can dynamically scale with changing load patterns. We want a system that can optimise the caches by sending requests for a site or repo to a small number of servers and tasks, based on two conditions:
serve from one machine as long as the load fits on it
serve from more machines if you really have to
Level-1 load balancer
In the stack at Google, you saw two load-balancing levels. What you see at the bottom of the picture are the Gerrit tasks, which contain all the software layers we talked about. We have a user that triggers JSON calls from the browser, with PolyGerrit. The first thing the JSON request hits is the L1 load balancer. The primary routing of your request is by geographic proximity: we have five datacentres at Google, and the L1 load balancer picks the one with the lowest latency. When the request goes into the datacentre, we have another load balancer, which is the one I’ll talk about more, because this is where the Gerrit specifics happen.
One thing that L1 does as well is manage the spillover of traffic. When a datacentre says “I can handle up to 100 QPS”, the L1 load balancer starts redirecting traffic to other datacentres once that threshold is reached.
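A hedged sketch of that L1 behaviour, assuming a per-datacentre QPS capacity and a measured per-user latency (all names invented): route by lowest latency, and spill over when the closest datacentre is saturated.

```java
import java.util.Comparator;
import java.util.List;

/** Invented model of the L1 decision: proximity first, spillover on saturation. */
final class L1LoadBalancer {

  static final class Datacenter {
    final String name;
    final double latencyMs;  // measured latency from the requesting user
    final int capacityQps;   // e.g. "I can handle up to 100 QPS"
    int currentQps;

    Datacenter(String name, double latencyMs, int capacityQps) {
      this.name = name;
      this.latencyMs = latencyMs;
      this.capacityQps = capacityQps;
    }

    boolean hasHeadroom() { return currentQps < capacityQps; }
  }

  /** Pick the closest datacentre that still has headroom. */
  static Datacenter route(List<Datacenter> datacenters) {
    return datacenters.stream()
        .filter(Datacenter::hasHeadroom) // spillover: skip saturated datacentres
        .min(Comparator.comparingDouble((Datacenter dc) -> dc.latencyMs))
        .orElseThrow(() -> new IllegalStateException("all datacentres saturated"));
  }
}
```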
Level-2 load balancer
Let’s dive into the L2 load balancer. We want to know how much traffic we are getting on each Gerrit task, we want the load balancer to know where each single request should go, and we want to know that fast!
We added three new components to the architecture:
An element on each task that can report the load it is handling right now. By load I mean anything: QPS (Queries Per Second) or other metrics; we just want to know from the tasks: what is your current load? We have a system called the slicer, which I am going to talk about in a second; it is added there in the picture.
A second component we are adding to the load balancer: a query interface that answers the question “we have a request for gerrit-review.googlesource.com, where should it go?”. All of that should be done in memory and regularly updated in the background, so that we don’t add another component of latency through another RPC.
A third component coordinates everything and is called the assigner: it takes all the reported load metrics, generates new assignments, and gives them to the query interface. A sketch of this loop follows below.
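Here is a minimal sketch of how those three components could fit together in one background control loop; the interfaces are invented for illustration, not the actual Google internals.

```java
import java.util.Map;

/** Invented interfaces sketching the load-report / assigner / query-interface loop. */
final class L2ControlLoop {

  interface LoadReporter { Map<String, Double> currentLoadPerTask(); }   // task -> QPS
  interface Assigner { Map<String, String> assign(Map<String, Double> load); }
  interface QueryInterface { void install(Map<String, String> keyToTask); } // in-memory

  /** One background iteration: gather load, compute assignments, publish them. */
  static void tick(LoadReporter reporter, Assigner assigner, QueryInterface query) {
    Map<String, Double> load = reporter.currentLoadPerTask();
    Map<String, String> assignments = assigner.assign(load);
    // Installed in memory, so request routing never needs an extra RPC.
    query.install(assignments);
  }
}
```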
Introducing the slicer
We have a system called the slicer. There is a very nice paper about it, published last fall, that I can recommend. It is a load balancer that works on custom keys and can do automatic re-sharding based on new traffic patterns. When your nodes receive more traffic, the slicer will automatically redistribute the load or re-shard the whole system. That is a suitable method for the local sharding that happens within the datacentre; we do not use it for inter-datacentre balancing, because that is all done via geographic proximity.
The system works with 64-bit keys, which gives you a lot of combinations. You can slice the keyspace, for instance, into 400 slices. That gives you 400 ranges, and you can take any of them and assign it to one or more tasks. Take the hostname as the key, for instance: you hash it, and you end up in the first slice, which gets assigned to a single task with index zero.
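A toy version of the idea, assuming 400 slices and a hash of the hostname; the real slicer assigns ranges of a 64-bit keyspace, while this sketch takes a simple modulo for brevity, and the assignment map is invented.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

/** Toy slicer: hash a key into one of N slices, then look up the owning tasks. */
final class ToySlicer {
  private final int slices;
  private final Map<Integer, List<Integer>> sliceToTasks; // slice -> task indexes

  ToySlicer(int slices, Map<Integer, List<Integer>> sliceToTasks) {
    this.slices = slices;
    this.sliceToTasks = sliceToTasks;
  }

  /** e.g. slice("gerrit-review.googlesource.com") -> some slice in [0, 400). */
  int slice(String key) {
    CRC32 crc = new CRC32(); // any stable hash works for this sketch
    crc.update(key.getBytes(StandardCharsets.UTF_8));
    return (int) (crc.getValue() % slices);
  }

  List<Integer> tasksFor(String key) {
    return sliceToTasks.get(slice(key));
  }
}
```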
What can we do if the load changes? Let’s say that key zero gets assigned to the first range, and then the traffic changes. We have two options.
The first option is to assign more tasks, let’s say task 6, and then you round-robin between task 0 and task 6.
The second option is splitting into 600 or 800 slices to get a better grip on each of the keys.
We can also do that, and then we factor out the load for gerrit-review.googlesource.com and go-review.googlesource.com and serve them from different tasks.
We do that for Gerrit, and one of the things we want for Gerrit is to split the per-host traffic with affinity on the repository. Caches are based on the project, and because gerrit-android.googlesource.com is a massive host served by a lot of tasks, we don’t want all those tasks to serve all the general traffic for Android. We want some tasks serving android/project1 from here and android/project2 from there, so that we optimise the second layer of caches.
What we do is mangle these keys together, based on both host and project. Before, the whole Chromium keyspace was served as a single key; when the load increases, we just split the keys into the Chromium source and the rest of the metadata. This is the graph we obtained after we implemented the load balancer: the load on each of the tasks in a single datacentre is represented by a line of a different colour. What you can see is that they are all nicely aligned, so each task serves precisely the same amount of traffic, which is what you want.
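The host-plus-project affinity can be sketched by mangling both values into the slicing key before hashing; reusing the toy slicer above (again, invented code):

```java
/** Composite slicing key: hot hosts are split further by repository. */
final class AffinityKeys {
  static String keyFor(String host, String project) {
    // e.g. keyFor("gerrit-android.googlesource.com", "android/project1")
    return host + "/" + project;
  }
}
```

With toySlicer.tasksFor(AffinityKeys.keyFor(host, project)), android/project1 and android/project2 hash into different slices and therefore land on different task sets, each of which keeps its own project caches warm.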
What if one project is 100 times the size of the others and we are optimising on queries per second? The system will just burn resources fast. We had that situation in May: we saw the graphs and said “all good, looks nice”; however, people were sending e-mails and raising bugs, wondering whether the system was serving any traffic at all or was down completely.
It turned out that Android had a lot of large repositories, in terms of both the number of references and the number of objects. We were just optimising for queries per second, but some of the tasks were doing only CPU-intensive work while others were not: some of them were burning CPU in flames, and others were fine.
So we moved away from per-request affinity and modified the per-repository sharding to optimise for all of this.
Warm vs. Cold Cache
There is an extra feature in the system: pre-warming caches. What the load balancer can do for you is tell you that the traffic is changing and that it needs to reconsider how to split the load across the system. It tells each of the tasks “I’m going to give you traffic for gerrit-review.googlesource.com” with 30 seconds’ notice. You can use that time to pre-warm the caches.
That is especially nice when you restart your tasks, because all the associated in-memory caches get flushed. The load balancer tells you “this is the list of tasks I need”, and then you can get them all and pre-warm their caches. This graph shows the impact of the cache warmer on our system, on the 99.9th-percentile request latency, really the long tail of request latency. It looks nice because we brought that latency down by a third.
What if a task starts dying during peak traffic? Imagine that the load balancer says “you’re going to handle this”, and two seconds later says “I have to reconsider, you’re going to handle that instead”. Again, you’re going to watch your system burning, because you’re serving peak traffic and running close to 100% CPU. That situation causes the load balancer to load and unload tasks all the time, which is inconvenient. The way we work around this is to make cache warming a best-effort activity: you can do it if you’re below 50% CPU, when you have time to do fancy things, but when you receive peak traffic, you just handle the peak traffic without any optimisation.
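A hedged sketch of that best-effort rule, warming only when CPU headroom allows; the probe interface and the exact threshold are invented for illustration.

```java
/** Invented sketch: pre-warm on 30s notice, but only below a CPU threshold. */
final class CacheWarmer {
  private static final double MAX_CPU_FOR_WARMING = 0.50; // "below 50% CPU"

  interface CpuProbe { double utilization(); }        // 0.0 .. 1.0
  interface HostCaches { void preload(String host); } // load disk -> memory

  private final CpuProbe cpu;
  private final HostCaches caches;

  CacheWarmer(CpuProbe cpu, HostCaches caches) {
    this.cpu = cpu;
    this.caches = caches;
  }

  /** Called when the load balancer announces upcoming traffic for a host. */
  void onTrafficNotice(String host) {
    if (cpu.utilization() < MAX_CPU_FOR_WARMING) {
      caches.preload(host); // fancy things only when there is headroom
    }
    // Under peak traffic: skip warming and just serve requests.
  }
}
```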
Multi-master and multi-tenant outside of Google
The question is: how do we do that in a non-Google setup?
There are plenty of options.
With the new Gerrit release 2.15, we introduced a new URL scheme, which includes the project name in the URL. Previously you had gerrit-review.googlesource.com/c/NNN, and there was no way to tell directly which project the change belongs to, and thus no way to do the load balancing we just saw.
What we did in 2.15 is simply add the project before the change number, so that both host and project can be extracted in a simpler way. You could do the same even before v2.15, but you needed a secondary index lookup, which most open-source load balancers, such as HAProxy or NGINX, do not support. And of course, there are lots of products, like the Google Cloud load balancer and others, that you can use to achieve the same thing.
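With the 2.15-style path layout (/c/&lt;project&gt;/+/&lt;change-number&gt;), the project can be recovered with plain string matching; a minimal sketch:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extract the project from a 2.15-style change URL path: /c/<project>/+/<change>. */
final class ProjectExtractor {
  private static final Pattern CHANGE_PATH = Pattern.compile("^/c/(.+)/\\+/\\d+");

  static Optional<String> projectOf(String path) {
    Matcher m = CHANGE_PATH.matcher(path);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    // A routing key can now combine the Host header and the project, no index lookup.
    System.out.println(projectOf("/c/chromium/src/+/123456")); // Optional[chromium/src]
    System.out.println(projectOf("/c/123456"));                // Optional.empty (pre-2.15)
  }
}
```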
Wrap-up
We went through a journey where we took the OpenSource Gerrit, added site selection, and got a multi-tenant Gerrit.
Then we took this multi-tenant Gerrit, added replication and obtained multi-master Gerrit.
And then we took that with load balancing and lots of failures and lots of fixes, and we got pretty much the Gerrit that we run at Google, which brings me to the end of this talk 🙂
Q&A
Q: How strategic is Gerrit@Google? Do you have any other code-review systems? If yes, how is Gerrit used vs. the others?
We have another code-review system for internal use only, and Gerrit is used whenever we are doing OpenSource stuff, so for GoLang, Chromium, Android, Gerrit, and whenever the Google Team wants to collaborate with other OpenSource users, or in general with users that are not sitting at Google.
Historically, the source at Google was developed in Perforce, and we moved from that to a home-grown system called Piper. Around that, we have an internal tooling ecosystem. In parallel, Google started a lot of projects that have nothing to do with the internal search engine and are available outside. What we see is that a lot of projects started at Google from scratch were asking themselves “what system should we use?”. Many people said: “well, we’re just going to use Git because that’s what we know and like”, and when they needed code review for Git, they ended up with us. Gerrit and Git are very popular inside Google.
Q. You have two levels of load balancer: the first one picks the location, and the second one decides what to do inside the datacentre. What happens if a location is off? Maybe it is not fully offline, but it has big problems, or it has a very low percentage of consensus, and some of the locations do not have the “latest and greatest” of the repo. Possibly a location that should be “inconvenient” for me actually has the data I want.
You’re talking about the replication layer, where you have the objects in one location but not in another. Our replication latency is in the order of seconds, but it may happen that one location is just really slow in getting the objects. That happens from time to time, and we have metrics that tell us what the replication lag is. When it exceeds a threshold, we just shut the datacentre off, which means cutting off the traffic: the datacentre will not receive user traffic anymore, but it will still be able to get the replication done, and when the backlog of objects to replicate decreases, we can send traffic there again.
Cutting off the traffic happens at the L1 load balancer, where we say “don’t send anything there”.
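That lag-based cutoff can be sketched as a simple health check feeding the L1 routing decision; the metric interface and the threshold below are invented for illustration.

```java
/** Invented sketch: drain user traffic from a datacentre whose replication lags. */
final class ReplicationLagGuard {
  private static final long MAX_LAG_SECONDS = 30; // illustrative threshold only

  interface LagMetric { long lagSeconds(String datacenter); }
  interface L1Routing { void setAcceptsUserTraffic(String datacenter, boolean on); }

  static void check(String dc, LagMetric metric, L1Routing routing) {
    boolean healthy = metric.lagSeconds(dc) <= MAX_LAG_SECONDS;
    // Replication traffic keeps flowing either way; only user traffic is cut.
    routing.setAcceptsUserTraffic(dc, healthy);
  }
}
```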
Q. Do all the tasks have the same setup? Or do you have a sort of micro-service architecture inside, where some tasks are dedicated to one type of operation and others to another? Serving data from memory is one thing, but calculating a change diff is a different type of task.
Not in general: all of our tasks are the same, except for checking access-control permissions. For that, we do not go through the whole Gerrit stack; we have only a little task that knows how the project config works and tells us yes or no.