Inside the OpenGov Cloud: Evolution of Infrastructure and Operations—The Why and What

May 9, 2019 - Ashwani Wason

This is a series of blog posts on The OpenGov Cloud™. The first post provides an overview of our application architecture and tech stack. The next three posts cover how our infrastructure and operations have evolved over the past year and where we are headed next.

OpenGov is a budgeting and performance company for the public sector. With its rapid organic growth and acquisition activity came significant planned and unplanned diversification and fragmentation in how its software was built and deployed and how its infrastructure was managed, scaled, and secured. A number of key business drivers required us to evolve our operations: delivering features more rapidly with an ever-rising bar on quality; accommodating the needs of a fast-growing, distributed development organization at greater scale; supporting our acquisition strategy; embracing our polyglot applications; and ensuring enterprise-grade security for our data and infrastructure. Over much of 2018, the engineering operations team at OpenGov focused heavily on streamlining our operations by unifying our infrastructure, standardizing our build, release, and security/compliance processes, and optimizing our spend on production and development resources. It has been an amazing journey that has put us on the path to accelerating value delivery in 2019: improving predictability in software delivery, integrating acquisitions efficiently, operating an elastic infrastructure, and up-leveling developer productivity. I will go over the 2018 journey and where we are headed in 2019 in a three-part series of posts. This is Part One, which focuses on the “why and what” (i.e., the problem space).

Rewind to mid-to-late 2017 and picture the continuous lifecycle of bits, from development to build to deployment. OpenGov, with its rich set of applications, had more or less diversified in all of those lifecycle stages. I’ll describe the challenges we had with that diversification.

Development: Much of our “legacy” code (monoliths, et al.) was organized in multiple Git repositories, with one repository per component service. Our newer microservices were organized in a single massive Git monorepo. Active development was ongoing in both. Further, the monorepo used a more CD-friendly, Master-centric branching workflow (inspired by GitHub Flow), while the legacy code used a custom workflow (almost but not quite GitFlow) with three long-lived branches: Master (of course), a feature release branch, and a hotfix branch. A lot of CI tooling had gone into supporting the custom legacy workflow, but we were only getting started with the tooling for the monorepo. One common theme across all teams was that they worked in two-week Sprints. Another interesting aspect of the diversification was in “local development” (pre-CI). Some teams used what were known as full-stacks, provisioned using custom tooling on AWS EC2 instances. Other teams ran the services under development on their laptops (using a VM, Minikube, or Docker Compose) and pointed them to one of the stable, always-on pre-prod environments. It is worth noting that all applications that are deployed to our cloud today are built for Linux.

Build: We had four distinct CI systems built on three distinct CI technologies. Jenkins and Travis CI were used for the legacy monolith component service builds. The monorepo used its own instance of Jenkins. Another team (from one of our acquisitions) used Semaphore CI.

Deployment: We were quite a bit ahead in our containerization journey, with most of our component service artifacts packaged in Docker containers. However, we used four distinct approaches to deploy our applications to production. First, our legacy applications used Chef. Their containers ran directly on AWS EC2 instances, with a fairly static (and rather over-provisioned, I must add) footprint and without any overarching orchestration. A Blue-Green deployment strategy was used at the level of the entire cluster (not at the service level). Second, the microservices used the Convox platform to deploy containers to AWS ECS, again with a fairly static footprint. Third, one of our applications (from an acquisition in 2017) was deployed directly on AWS EC2 instances using mostly manual (snowflake) processes. Its components were not containerized. Fourth, another application (from an acquisition in 2016) was also deployed directly on AWS EC2 instances and, again, was not containerized. This last one is an interesting application because it is deployed and operated as a managed single-tenant instance, one per customer. Each customer’s instance was deployed manually (for the most part) and was thus a snowflake. If that wasn’t complicated enough, we deployed to multiple AWS regions, driven by localized decisions made by the individual application teams. We had three pre-prod regions (us-west-2, ca-central-1, and us-east-1) and four prod regions (us-east-1, us-east-2, us-west-1, and ca-central-1). Multiple AWS accounts, one per team, were used for deployment, each with its own VPC.

This diversification posed unique challenges for running effective engineering operations.

Around that time, our applications were becoming more and more integrated with each other. Because our services were deployed across multiple accounts, our networking (VPC Peering, Load Balancing, request routing) grew increasingly complex, with multiple points of failure. Moving engineers between teams was inefficient because each team’s practices came with a relatively steep learning curve. System-level testing became more challenging. The infrastructure provisioning practices of the different teams were also prone to errors (e.g., some AWS RDS instances were provisioned without encryption and backups), exposing us to security and compliance risks.
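The kind of provisioning error mentioned above (RDS instances without encryption or backups) is the sort of thing a small audit script can catch. What follows is only a minimal sketch, not the tooling we actually used: the function name `find_noncompliant_rds` and the sample records are hypothetical, and the dictionary keys are assumed to mirror the fields returned by boto3's `describe_db_instances` call.

```python
def find_noncompliant_rds(instances, min_backup_days=1):
    """Return {instance identifier: [problems]} for instances that fail the checks.

    `instances` is an iterable of dicts assumed to be shaped like the entries
    in the "DBInstances" list of a boto3 describe_db_instances() response.
    """
    findings = {}
    for inst in instances:
        problems = []
        # A missing key is treated as "not encrypted" to stay on the safe side.
        if not inst.get("StorageEncrypted", False):
            problems.append("storage not encrypted")
        # A retention period of 0 means automated backups are turned off.
        if inst.get("BackupRetentionPeriod", 0) < min_backup_days:
            problems.append(f"backup retention below {min_backup_days} day(s)")
        if problems:
            findings[inst["DBInstanceIdentifier"]] = problems
    return findings


# Hypothetical sample data for illustration only.
sample_instances = [
    {"DBInstanceIdentifier": "team-a-db", "StorageEncrypted": True, "BackupRetentionPeriod": 7},
    {"DBInstanceIdentifier": "team-b-db", "StorageEncrypted": False, "BackupRetentionPeriod": 0},
    {"DBInstanceIdentifier": "team-c-db", "StorageEncrypted": True, "BackupRetentionPeriod": 0},
]

for name, problems in find_noncompliant_rds(sample_instances).items():
    print(f"{name}: {', '.join(problems)}")
```

Against a real account, the input could presumably come from `boto3.client("rds").describe_db_instances()["DBInstances"]`, repeated for every account and region, which is exactly where one account per team made this kind of check tedious.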

Another aspect of the diversification that was hurting us was the lack of consistent release practices: the applications were deployed on their own independent schedules, increasing the risk of impacting production. Cross-functional communication with our field teams was not ideal. Releases were done without empirical measures of deployment times, release variances, or failure/rollback rates. The Chef-based, cluster-level Blue-Green strategy for our legacy apps had a major side effect: if smoke tests failed for one of the applications in the newly provisioned cluster, we abandoned the entire release, thereby affecting the commitments of other applications that had tested out fine. (I, rather not-so-fondly, started referring to this as, “Oops from one team/application causing an Ouch to others.”) Teams using Convox were not too satisfied with its developer and deployment workflows either.
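The “Oops causing an Ouch” effect comes down to where the go/no-go decision is made. The toy model below is a sketch under stated assumptions, not our Chef tooling: the application names and both function names are hypothetical, and a release is reduced to a mapping from application to smoke test result.

```python
def cluster_level_release(smoke_results):
    """All-or-nothing: every app is promoted only if every smoke test passed."""
    all_passed = all(smoke_results.values())
    return {app: all_passed for app in smoke_results}


def service_level_release(smoke_results):
    """Independent: each app is promoted based on its own smoke test alone."""
    return {app: passed for app, passed in smoke_results.items()}


# Hypothetical smoke test outcomes for one release; True means the test passed.
smoke_results = {"budgeting": True, "reporting": True, "open_data": False}

print("cluster-level:", cluster_level_release(smoke_results))
print("service-level:", service_level_release(smoke_results))
```

With the cluster-level decision, the single failure in `open_data` blocks `budgeting` and `reporting` as well; with a service-level decision, only the failing application is held back.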

In early 2018 we were also getting serious about certifying our applications, infrastructure, and processes against compliance programs such as SOC 2, NIST, and FedRAMP. It very quickly became clear that doing so in such a diverse world was a non-starter. The level of effort would be multiplied many times over, the surface area to protect would be much larger, the cost of independent assessment would be much higher, and sustaining all of that year over year would be even harder to digest.

A big driver of running effective operations is the ability to extract metrics from our processes that can be used to surface valuable insights. It was untenable to do this across so many different systems, in an organization with such a rich set of applications delivered at such a fast pace.
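To make the idea of release metrics concrete, here is a minimal sketch of the kind of summary that becomes easy once every release is recorded in one consistent shape. The `Release` record, its fields, and the numbers are all hypothetical; they stand in for data that would otherwise have to be stitched together from several CI and deployment systems.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Release:
    app: str               # application that was released
    deploy_minutes: float  # wall-clock time the deployment took
    rolled_back: bool      # whether the release had to be rolled back


def summarize(releases):
    """Compute basic deployment-time and rollback statistics for a set of releases."""
    if not releases:
        raise ValueError("at least one release is required")
    times = [r.deploy_minutes for r in releases]
    return {
        "count": len(releases),
        "mean_deploy_minutes": mean(times),
        "deploy_minutes_stdev": pstdev(times),  # population standard deviation
        "rollback_rate": sum(r.rolled_back for r in releases) / len(releases),
    }


# Hypothetical release history for illustration only.
history = [
    Release("budgeting", 10.0, False),
    Release("reporting", 20.0, True),
    Release("open_data", 30.0, False),
]

print(summarize(history))
```

The arithmetic is trivial; the hard part in a diversified setup is getting every team's pipeline to emit records like these in the first place.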

Our “DevOps tool-chain” spend was also quite high for our scale, at around $85,000/year across Chef, Convox, Travis, and Semaphore, including the cost of running multiple Jenkins instances on AWS. Our production infrastructure was fairly static in footprint: it had been configured for peak demand but ran at that size all the time. One thing that cloud providers have done is make it easy to throw hardware (read: money) at the problem, move on, and forget about it. We fell into that trap all too often.

In the next article in this series I will go over how we reined in the diversification.

Interested in contributing to OpenGov’s Engineering culture of innovation, leading-edge technology adoption, and quality? Check out our current job openings.
