Architecting Ambassador for Availability: Operational Simplicity

Design decisions, testability, release, Kubernetes, and community

Published in Ambassador Labs

5 min read · Jun 22, 2020

Any API Gateway is mission-critical infrastructure. If the API Gateway fails, so does your infrastructure. Since the initial release of the Ambassador API Gateway in 2017, we’ve made conscious architectural choices recognizing that the most important characteristic for Ambassador is availability. (For more about our approach to security, see this post.)

Today, Ambassador is deployed in thousands of mission-critical production environments all over the world. We have a thriving community of contributors, including engineers from Active Campaign, AppDirect, HotStar/Disney, and Puppet Labs, who continue to push the project forward. We're also excited to be donating Ambassador to the CNCF in the near future.

How have we architected for availability?

Operational Simplicity

Ambassador is deployed as a single, integrated container that holds both the data and control planes for maximum operational simplicity. This approach provides:

Simple upgrades. There’s one piece of software, with no external dependencies to upgrade. Simple upgrades are an underrated aspect of reliable software.
Holistic health checking. An Ambassador container has a single set of integrated, end-to-end health checks that measure the full health of Ambassador. If the health check fails, the container is restarted by Kubernetes.
Simple(r) troubleshooting. Data center edge architecture is complex enough, with external load balancers, Ambassador, web application firewalls, and more. A single container for managing L7 traffic gives operators fewer places to search for issues and telemetry.
Simpler versioning. We use SemVer and have a single version that applies to the entire container.
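
To make the integrated health check idea concrete, here is a minimal Python sketch. It is not Ambassador's actual implementation: the `overall_health` function and the check names are hypothetical. The point it illustrates is that a single result is healthy only when every internal component is healthy, so Kubernetes needs just one probe to decide whether to restart the container.

```python
from typing import Callable, Dict, Tuple

# Hypothetical component check: returns True when the component is healthy.
Check = Callable[[], bool]


def overall_health(checks: Dict[str, Check]) -> Tuple[bool, Dict[str, bool]]:
    """Run every component check; report healthy only if all of them pass."""
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that crashes counts as a failure, not as "unknown".
            results[name] = False
    return all(results.values()), results


if __name__ == "__main__":
    # Illustrative check names only; a real gateway would probe its internals.
    healthy, detail = overall_health({
        "control_plane_config_loaded": lambda: True,
        "data_plane_accepting_traffic": lambda: True,
    })
    print(healthy, detail)
```

With separate control plane and data plane deployments, each would need its own probe, and an operator would have to correlate the two results by hand.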

The importance of operational simplicity cannot be overstated. The Uptime Institute reports that 70% of outages are caused by human error. Moreover, 60% of respondents believed that downtime can be prevented with better configuration or processes. Operational complexity is the enemy of uptime, so we continuously strive for simplicity.

Distributed monoliths: Control Plane and Data Plane

Some popular projects such as Istio separate the control plane from the data plane. This approach enables more fine-grained deployment control at the expense of operational complexity. Separating the control plane and data plane results in a distributed monolith.

Upgrading a distributed monolith is a complex operation, since care must be taken to ensure that the new version of the control plane works with the old version of the data plane (or vice versa). Rollbacks, in the event of an issue, are even more difficult. Istio’s upgrade documentation recommends a canary release of the control plane and incrementally migrating the data plane to the new control plane by manually restarting all deployments. The upgrade documentation does not cover a rollback process.
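
The compatibility burden described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a model of Istio's or Ambassador's real upgrade logic: the function names and version strings are hypothetical, and it assumes only one old and one new version are in flight.

```python
from itertools import product
from typing import List, Tuple

# Each pair is (control plane version, data plane version).
Pair = Tuple[str, str]


def pairs_with_split_planes(old: str, new: str) -> List[Pair]:
    """Planes upgrade independently, so any pairing can occur mid-upgrade."""
    return list(product([old, new], repeat=2))


def pairs_with_single_container(old: str, new: str) -> List[Pair]:
    """Both planes ship in one container, so their versions always match."""
    return [(old, old), (new, new)]


def mixed_pairs(pairs: List[Pair]) -> List[Pair]:
    """Pairs where the two planes run different versions."""
    return [p for p in pairs if p[0] != p[1]]


if __name__ == "__main__":
    print(mixed_pairs(pairs_with_split_planes("1.5", "1.6")))
    # [('1.5', '1.6'), ('1.6', '1.5')]
    print(mixed_pairs(pairs_with_single_container("1.5", "1.6")))
    # []
```

The mixed pairs are the combinations that must be verified before an upgrade or rollback; with a single container there are none.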

With Istio, it was necessary to separate the control plane and data plane into two independent components, since deploying the control plane on every pod in the mesh is impractical. As an API Gateway managing North/South traffic, Ambassador is not deployed on every pod in a cluster. In fact, a small number of Ambassador pods is usually sufficient for most workloads. In our real-world deployments (some of which exceed 500K RPS) and performance benchmarking, we have yet to encounter a situation where separating the control plane and data plane makes sense. Given our use case, we optimized for operational simplicity.

Testability

Infrastructure software is only as good as its test coverage. Over the years, testing has been a significant area of investment for the Ambassador team. We started with a series of simple tests that compared configuration generated by the Ambassador control plane to a static set of “known good” configuration files. Since then, we’ve extended our test framework in many different areas:
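
The "known good" comparison described above is often called a golden-file test. The following Python sketch shows the general shape under stated assumptions: `generate_config` is a hypothetical stand-in for a control plane's config generator, and the JSON layout is invented for illustration rather than taken from Ambassador or Envoy.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def generate_config(routes: Dict[str, str]) -> dict:
    """Hypothetical stand-in for a control plane's config generator."""
    return {
        "routes": [
            {"prefix": prefix, "cluster": service}
            for prefix, service in sorted(routes.items())
        ]
    }


def matches_golden(routes: Dict[str, str], golden_path) -> bool:
    """Compare freshly generated config against a stored known-good file."""
    expected = json.loads(Path(golden_path).read_text())
    return generate_config(routes) == expected


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # In a real suite the golden file would be checked into the repo.
        golden = Path(tmp) / "basic_routes.json"
        golden.write_text(json.dumps(
            {"routes": [{"prefix": "/api/", "cluster": "backend"}]}
        ))
        print(matches_golden({"/api/": "backend"}, golden))  # True
```

Tests like this are cheap to run and catch unintended changes in generated configuration, but they do not exercise live traffic, which is why end-to-end tests are needed as well.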

Parallelized end-to-end tests that exercise all of Ambassador in on-demand Kubernetes clusters
Automated performance tests around multiple different scenarios run in multiple cloud Kubernetes providers and different Kubernetes versions
Thousands of unit tests covering every component of the system
Documented release testing, review, and checklists

Release

No matter how much design and testing occurs, bugs inevitably happen. We’ve invested extensive engineering effort in our release processes so that we’re able to rapidly release critical updates to customers. Some examples of this include:

Redundant container registries that are hosted in both Amazon and Google data centers so that your uptime is not affected by the downtime of a single registry.
Use of Semantic Versioning and detailed changelogs, so our users know exactly what has changed. In addition, the Ambassador Operator uses SemVer to reduce operational overhead and let you choose how frequently to upgrade: on every incremental release, only on major changes, or anywhere in between.
In the classic CMM Level 5 approach, we continuously improve our engineering processes alongside continuously improving the Ambassador software. This includes activities such as blameless postmortems, metrics-driven refactoring, and hiring experienced enterprise-scale software engineers from companies such as Red Hat, Puppet, New Relic, Intel, and more.
Automated release processes, which run all of our tests, no matter how small the code change, because small changes could have large consequences.
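
As an illustration of how SemVer can drive an upgrade policy like the one mentioned above, here is a minimal Python sketch. It does not reflect the Ambassador Operator's actual configuration or behavior: the `should_upgrade` function and the policy names `"patch"`, `"minor"`, and `"major"` are hypothetical, and the sketch assumes the policy names the largest kind of change that is accepted automatically.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def should_upgrade(current: str, candidate: str, policy: str) -> bool:
    """Decide whether to move from `current` to `candidate`.

    Hypothetical policies:
      "patch": accept only newer patch releases of the same MAJOR.MINOR
      "minor": accept any newer release within the same MAJOR
      "major": accept any newer release
    """
    cur, cand = parse(current), parse(candidate)
    if cand <= cur:
        return False  # never downgrade or reinstall the same version
    if policy == "patch":
        return cand[:2] == cur[:2]
    if policy == "minor":
        return cand[0] == cur[0]
    if policy == "major":
        return True
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    print(should_upgrade("1.5.0", "1.5.3", "patch"))  # True
    print(should_upgrade("1.5.0", "1.6.0", "patch"))  # False
    print(should_upgrade("1.5.0", "2.0.0", "minor"))  # False
```

Because the version number encodes the size of a change, a rule this small is enough to express "how often do I want to upgrade" without reading every changelog.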

Kubernetes First

Ambassador has only supported Kubernetes for production deployment. This enables the Ambassador architecture to delegate some of the challenges of engineering reliable software to Kubernetes. For example, Ambassador relies on Kubernetes as a single source of truth for configuration, leaving the hard parts of distributed system engineering to a proven data store. Kubernetes also has a rich set of high availability features. Ambassador is able to rely on these features for automated health checking, deployment scheduling, and more.

In addition to simplifying the overall architecture of Ambassador, Kubernetes also enables greater operational simplicity in Ambassador itself, since Ambassador is deployed and configured like any other Kubernetes Deployment or DaemonSet.

Community

No amount of automated testing and architectural design can substitute for real-world testing. So we’re also incredibly grateful to our extensive community (currently, 3,400+ Slack members), who frequently test and give feedback on early builds of Ambassador. We’ve also had 144 people (and counting!) directly contribute code to Ambassador, covering all areas of functionality.

Summary

In designing for availability, we’ve consciously optimized for your operational simplicity. Backed by ever-improving testing and an ever-growing community, we’re continuously working to ensure that Ambassador is highly available and reliable.

Written by Richard Li
