Inside the OpenGov Cloud: Evolution of Infrastructure and Operations—The How
This is Part Two in a three-part series of posts about the OpenGov Engineering team’s journey in 2018 and the evolution of its infrastructure and operations. Part One focused on the “why and what” (i.e., the problem space). This part focuses on the “how” (i.e., the solution space, and the wins that we secured).
In late 2017 and early 2018, we began looking in earnest at getting ahead of these challenges and streamlining our operations. Across the board, unification and standardization would be the key to such an endeavor. Given that most of our component services were already containerized, we decided to use Kubernetes as our standard deployment platform. At that time AWS EKS (Elastic Container Service for Kubernetes) was not yet generally available, so we started off by installing Kubernetes 1.9 on AWS using Kops. Our first real application on Kubernetes was a new, next-generation Jenkins setup that was meant, over time, to replace all the other aforementioned CI systems. Our first Kubernetes cluster, the “EngOps cluster,” which runs our CI/CD and other internal build systems, went live in February 2018. Kubernetes was new to all of us, and we had to establish certain key patterns by learning the hard way. I still recall solving key issues using Spotify’s docker-gc and by setting resource limits for our containers. We also ran into some interesting networking issues with our overlay service, Weave.
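As a concrete illustration of the resource-limits lesson, here is a minimal pod spec fragment of the kind we mean (all names and values are hypothetical, not our actual configuration):

```yaml
# Hypothetical pod spec: "requests" reserve capacity for scheduling, while
# "limits" cap what a container may consume before it is throttled (CPU)
# or OOM-killed (memory). Without limits, a runaway CI job can starve
# every other workload on its node.
apiVersion: v1
kind: Pod
metadata:
  name: ci-worker                     # illustrative name
spec:
  containers:
    - name: build-agent               # illustrative name
      image: jenkins/inbound-agent    # illustrative image
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
        limits:
          cpu: "1"
          memory: 1Gi
```

Setting both requests and limits per container also makes scheduling and eviction behavior under node pressure far more predictable.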
After getting our feet wet with an internal application, we decided to bring Kubernetes to production. We deployed our first production Kubernetes cluster, running one of our simpler applications, in April 2018. Since it was too early for us to have a CD system, as part of onboarding this application to Kubernetes we built a general-purpose custom deployment pipeline in Jenkins that deploys our applications with a rolling update strategy using Helm (incidentally, that pipeline is still in use for all our deployments as we bring up Spinnaker). Based on our experience at that time, Helm became our technology of choice to continue building on. Over time it became the “API boundary” between the EngOps team and our application developers, with developers writing Helm charts for their services based on the blueprint created by the infrastructure team. Around this time (May/June 2018) we had a fairly good handle on navigating the Kubernetes landscape, but it always helps to validate your assumptions. Bhaskar Ghosh, one of the technical advisors at OpenGov, kindly connected us with folks in his network for exactly that. I would like to publicly acknowledge the time and guidance provided to us by John Chadwick (then at Benzinga), Paul Martinez at Affinity, and Sophie Haskins at GitHub.
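For readers unfamiliar with how a Helm-driven rolling update works: the strategy ultimately lives in the chart’s Deployment template, and Kubernetes performs the rollout natively once the pipeline applies a new release. A minimal sketch (names and numbers are illustrative, not our actual chart):

```yaml
# Hypothetical Deployment fragment as a Helm chart might render it.
# With this strategy, applying a new release replaces pods one at a
# time while keeping the full replica count serving traffic.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: simple-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # bring up at most one extra pod during rollout
      maxUnavailable: 0   # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: simple-app
    spec:
      containers:
        - name: web
          image: example/simple-app:1.0   # placeholder image tag
```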
The next big leap in our Kubernetes advancement came in July 2018, when we started deploying our legacy monolith and its companion services in production. That’s when the number of nodes in our cluster grew to more than 7x what we were running before. By that time EKS was generally available, but we did not want to disrupt our momentum on the application migration front. As that momentum continued, around August 2018 we started working on a key infrastructure project: building the autoscaling framework for our applications and infrastructure. From the outset, the key business value I expected our Kubernetes migration to provide was making our clusters more elastic, which would benefit us in two distinct ways: making them customer-friendly by dynamically scaling up with the seasonality of our production workloads, and making them pocket-friendly by dynamically scaling down during nights, weekends, and other off-season times. We want our clusters to fully leverage what I call tri-scale elasticity: node-level, provided by the Cluster Autoscaler; horizontal pod-level, provided by the Horizontal Pod Autoscaler (HPA); and vertical pod-level, to be provided by the Vertical Pod Autoscaler (VPA). By November, we had a solution for the first two of those vectors implemented and available for our applications. Since some of our flagship applications were not necessarily CPU- or RAM-bound, a key requirement for HPA was the ability to scale up/down using custom metrics (e.g., job queue backlog) in addition to standard resource metrics (CPU and RAM).
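To make the custom-metrics requirement concrete, here is a sketch of an HPA of that era (the autoscaling/v2beta1 API available around Kubernetes 1.10) that scales on an external queue-backlog metric alongside CPU; the metric and object names are hypothetical:

```yaml
# Hypothetical HPA: scale a worker Deployment between 2 and 20 replicas,
# targeting ~100 backlogged jobs per replica on an external metric served
# by an external metrics provider, with CPU utilization as a second signal.
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metricName: jobs.queue.backlog   # hypothetical metric name
        targetAverageValue: "100"
    - type: Resource
      resource:
        name: cpu
        targetAverageUtilization: 70
```

With multiple metrics listed, the HPA computes a desired replica count for each and takes the highest, so either signal can drive a scale-up.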
This autoscaling project was an interesting journey. Our initial design required introducing a new active component to implement the Kubernetes custom metrics API (e.g., Prometheus with a metrics adapter), but we pivoted to a different approach during development: Datadog, which we were already using for metrics, introduced support for an external metrics provider in the Datadog cluster agent, which simplified our design tremendously. It did, however, complicate our life in a different way. It forced us to bite the bullet we knew we had to bite one day but had been trying to avoid: switching from Kops to EKS. Support for autoscaling on external metrics was introduced in Kubernetes 1.10, while we were on 1.9. We could have upgraded our Kops clusters to 1.10 or switched to EKS. After much deliberation and spiking both alternatives, we opted for the latter, and in late October 2018 we began that massive effort. By mid-December, after a really solid day-and-night push by a few folks on the team, we had progressively moved all our Kubernetes clusters to EKS!
Along the way, a couple of other engineers introduced an interesting solution for autoscaling our single-tenant hosted application: we used kube-downscaler to implement time-of-day and day-of-week automation for those single-tenant instances that were not required to be always-on under our SLAs with the associated customers. With that solution, those instances were brought down automatically after the end of customary business hours and brought back up before they began.
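kube-downscaler drives this behavior with a per-workload annotation; a sketch, with a hypothetical instance name and schedule:

```yaml
# Hypothetical single-tenant Deployment annotated for kube-downscaler:
# the instance stays up only during the stated window and is scaled
# down outside it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tenant-acme-app                    # illustrative instance name
  annotations:
    downscaler/uptime: "Mon-Fri 07:00-19:00 America/New_York"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tenant-acme-app
  template:
    metadata:
      labels:
        app: tenant-acme-app
    spec:
      containers:
        - name: app
          image: example/tenant-app:1.0    # placeholder image
```

Because the schedule lives on each workload rather than in a central config, each instance can carry the uptime window its own SLA calls for.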
Between July and December, it was just one application after another onboarding to Kubernetes – BIG THANKS to our partner engineering teams for taking the lead on most of that work. Any applications that were not originally containerized now were. Architectural advancements were made to carve multi-tenant portions out of our single-tenant managed service. Convox was eliminated. Then Chef. Along the way, our new Kubernetes-based Jenkins system was evolving slowly but surely, especially for monorepo tooling, and we eliminated all of our legacy CI systems. All in all, we shaved around $65,000/year off our DevOps tool-chain cost.
We streamlined various other supporting aspects along the way. We switched from a team-based AWS account management strategy (there’s nothing wrong with that, it just wasn’t the best fit for our fluid team landscape) to a more functional one: we now have AWS accounts by environment (e.g., EngOps, Production, Development, Testing, Staging). We now deploy to only two regions: our production region is us-east-1 and our pre-production region is us-west-2. A huge win along the way was getting the required security and compliance practices in place, and we adopted a new and improved cloud networking architecture that lends itself well to our expansion to international clusters.

Predictability of our releases and communication with our field teams have improved tremendously: all of our applications are now deployed one way on one day, and we have started using Atlassian Statuspage for incident and maintenance-window notifications. This might sound counter-intuitive, but despite releasing on one day, we no longer have massive releases. By optimizing our release strategy, a failure in one application does not hold the release of another application hostage, which has improved our handling of customer commitments. We moved away from a centralized release model to a federated release model, which is microservice-friendly and also sustainable from a human perspective. Currently we have around 30 services that may need to be deployed individually across all our applications; they are deployed in a collaborative maintenance window by their service owners. We learned from our mistakes and corrected them at the earliest opportunity. For example, we started off with host-based routing for our services, but because of challenges around wildcard certificate management, we later switched to path-based routing. While unification wasn’t the only factor, it played a big role in reducing our AWS spend: our December 2018 spend was less than half of what it was in January 2018.
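The routing switch is easiest to see in an Ingress definition. With path-based routing, a single certificate for one hostname covers every service, whereas host-based routing (service-a.example.com, service-b.example.com, and so on) needs a wildcard certificate. A sketch using the current networking.k8s.io/v1 API, with made-up hostnames and service names:

```yaml
# Hypothetical path-based Ingress: both services share one hostname,
# so one non-wildcard TLS certificate covers them.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: apps
spec:
  tls:
    - hosts: [app.example.com]
      secretName: app-example-com-tls    # illustrative secret name
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /service-a
            pathType: Prefix
            backend:
              service:
                name: service-a
                port:
                  number: 80
          - path: /service-b
            pathType: Prefix
            backend:
              service:
                name: service-b
                port:
                  number: 80
```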
Interested in contributing to OpenGov’s Engineering culture of innovation, leading-edge technology adoption, and quality? Check out our current job openings.