Chasing a test failure in our gRPC-Web test
Debugging the Ambassador test suite
As we got ready to ship Ambassador 0.74.0, we ran into a bug where the date really was the thing! This is the story of how it happened, and how we figured out how to fix it.
It all started when Ambassador’s CI tests began to fail.
Unlike the test suites of many other proxies, Ambassador’s leans very heavily on actual end-to-end functionality checks: the test suite allocates a Kubernetes cluster, fires up a bunch of Ambassadors and upstream services, then sends thousands of requests through the system, making sure that the responses match what we expect. While this is a great way to be confident of what Ambassador is doing, it means we start dozens of Ambassadors nearly simultaneously, and that’s very hard on the cluster.
Speeding up CI
Since adding features means adding tests, and adding tests often means adding Ambassadors to the test cluster, our additions periodically push the cluster over the edge of the cliff, and the tests start dying during startup. Every time this happens, we put effort into making the tests easier on the cluster, which usually means making Ambassador start up faster, and that translates into benefits for customers running Ambassador in the wild.

This time, it wasn’t exactly clear where to put that effort. There didn’t seem to be any rhyme or reason as to which CI tests were failing, so we focused first on isolating the different Ambassadors from each other, to reduce reconfiguration churn during startup:
- We added label selectors to the test configurations so that each test Ambassador didn’t have to read and discard all the configurations for every other Ambassador. This works well enough to recommend it as a best practice going forward.
- We found and fixed some cases where tests were using resources created by other tests. This is a big no-no: tests can flake out depending on the exact sequence in which other tests start.
- Once we fixed the resource-creation thing, we split the tests into multiple groups that run sequentially. Paradoxically, this speeds up the overall run because each group can start much faster when we’re not stressing the cluster so much.
- Finally, we tuned the timing for the cluster’s liveness and readiness probes to avoid needless restarts during test startup.
These things helped quite a bit and walked us (hopefully) well back from the edge of the cliff. But after all of that, we found one test that was still failing.
The gRPC-Web Test
This test was the gRPC-Web test. The worst part about this test was that it wasn’t failing consistently. It was failing 100% of the time in CI, but when my coworker based in Vancouver tried things locally, it passed 100% of the time. When I tried locally, it was somewhere in between. This is deeply offensive: there is nothing in the tests that looks at who’s running them, so they should never change behavior in this way.
Underpinning everything we do in modern computing is the certainty that there is no magic in code. Things that seem to defy that idea are not acceptable.
About the time we realized how wildly inconsistent things were, I had a huge stroke of luck: the test started failing 100% of the time for me. On the face of it, this might seem like bad news, but no — being able to reproduce an issue goes a long, long way toward being able to understand it.
Debugging the error
First question: the error referred to an invalid tag. What was actually producing the error? The path here with all the tests is that our test client talks to Envoy, Envoy talks to the upstream service, the upstream answers back to Envoy, Envoy passes the answer back to the test client. But we knew that both gRPC and gRPC-Web make for a more complex transaction:
1. The test client constructs a gRPC request.
2. The test client encodes the request as a protobuf.
3. The test client then encodes the protobuf as a gRPC-Web message.
4. The test client sends the encoded message to Envoy.
5. Envoy decodes the gRPC-Web message, yielding the protobuf.
6. Envoy sends the protobuf to the upstream service.
7. The upstream service decodes the protobuf to process the request.
8. The upstream service generates a response.
9. The upstream service encodes the response as a protobuf.
10. The upstream service sends the protobuf to Envoy.
11. Envoy encodes the protobuf as a gRPC-Web message.
12. Envoy sends the gRPC-Web message to the test client.
13. The test client decodes the gRPC-Web message to get a protobuf.
14. The test client decodes the protobuf to get the response.
15. The test client validates the response.
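To make steps 2 and 14 a bit more concrete: a protobuf message on the wire is just a sequence of fields, each introduced by a tag that packs the field number and wire type into a varint. The sketch below hand-rolls one length-delimited string field in Python; the field number and value are invented purely for illustration and have nothing to do with the real test request.

def encode_varint(n):
    # Protobuf varints: 7 bits per byte, high bit set on every byte but the last.
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string_field(field_number, value):
    # The "tag" is (field_number << 3) | wire_type; wire type 2 means length-delimited.
    tag = encode_varint((field_number << 3) | 2)
    data = value.encode("utf-8")
    return tag + encode_varint(len(data)) + data

print(list(encode_string_field(1, "hello")))   # [10, 5, 104, 101, 108, 108, 111]

A decoder that is handed a stream with extra or missing bytes will land in the middle of a field and try to interpret whatever it finds there as a tag, which is exactly how an error like the “invalid tag” mentioned below gets reported.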

Searching the source code for the “invalid tag” message revealed that it was coming from the gRPC decoder generated in Go, which made us suspect step 7 or step 14. Checking the logs for the upstream service suggested that the upstream wasn’t generating the error; checking the logs for Envoy as well (in trace mode) let us conclusively rule out step 7. The error was coming from step 14.
Knowing what is generating the error, of course, raises the question of why. To answer that question, we needed to see the data passing through the various steps of encoding and decoding — this is why it’s so critical to have a reproducer! We could have done this a lot of ways, but in many cases the quick and dirty way is the best: we modified the test client to simply use log.Printf to dump the byte arrays it was working with at various points, then pasted the dumped array into a Python script:
import sys

dumped_str = '10 27 97 99 99 101 112 116 97 …'   # truncated dump pasted from the log.Printf output
dumped = [int(x) for x in dumped_str.split()]

for b in dumped:
    sys.stdout.buffer.write(bytes([b]))   # write raw bytes so the stream survives piping
Pipe that to hexdump and it’s easier to see what’s going on. Redirect it to a file and we have the raw stream to play with using Emacs keyboard macros or whatnot. The Unix philosophy of having many easily-composable tools was a huge win here.
Eventually we had our culprit. The gRPC call we’re making includes a set of headers, one of which is a Date header. This header worked:
Date: Wed, 77 Jul 23079 19:67:07 GMT
and this one failed:
Date: Wed, 77 Jul 23079 19:117:07 GMT
Ignore the horribly-broken values for the moment — of course it’s not the 77th of July in 23079, but that’s not what causes the problem here. What causes the problem is that the one that fails is one byte longer than the one that works.
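For the record, the two header strings really do differ by exactly one byte:

working = "Date: Wed, 77 Jul 23079 19:67:07 GMT"
failing = "Date: Wed, 77 Jul 23079 19:117:07 GMT"
print(len(working), len(failing))   # 36 37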
The culprit: a buggy gRPC-Web decoder
This is because of the way gRPC-Web is encoded for transport across HTTP/1.1 connections. Native gRPC uses HTTP/2 trailers to convey the status of the gRPC call, but of course that won’t work in HTTP/1.1. gRPC-Web deals with this by defining a simple tag-length frame format: each frame has a one-byte tag, a four-byte length, and then the raw data for the frame. The whole thing is then base64 encoded, rather than trusting the connection to carry binary data correctly. The simplest case for a gRPC-Web response has a frame containing the gRPC protobuf followed by a frame containing the HTTP/2 trailers.
The gRPC protobuf data for our success case ended up being one byte short of a multiple of three bytes, so the failure case was an exact multiple of three bytes long.
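Here is a rough sketch of that framing in Python, under the assumptions just described: each frame is a one-byte tag, a four-byte big-endian length, and the payload, with each frame base64-encoded on its own. The payloads and the 0x00/0x80 tag values are stand-ins for illustration (trailer frames conventionally set the high bit of the tag), not the exact bytes Envoy produced.

import base64
import struct

DATA_FRAME = 0x00      # assumed tag for the protobuf data frame
TRAILER_FRAME = 0x80   # assumed tag for the trailers frame

def grpc_web_frame(tag, payload):
    # One-byte tag, four-byte big-endian length, then the raw frame data.
    return bytes([tag]) + struct.pack(">I", len(payload)) + payload

protobuf_bytes = b"\x0a\x05hello"                       # stand-in for the response protobuf
trailer_bytes = b"grpc-status:0\r\ngrpc-message:\r\n"   # stand-in for the HTTP/2-style trailers

body = (base64.b64encode(grpc_web_frame(DATA_FRAME, protobuf_bytes)) +
        base64.b64encode(grpc_web_frame(TRAILER_FRAME, trailer_bytes)))
print(body)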
This is important because base64 works in groups of three bytes. If the message being encoded isn’t a multiple of three bytes, the encoded data will be padded with = to make everything line up… and there was a bug in our gRPC-Web decoder that assumed that it would see an = to mark the end of every gRPC-Web frame. When the protobuf frame was an exact multiple of three bytes long, there was no = at the end of the frame, the decoder lumped the trailer frame in with the protobuf frame, and everything came crashing down.
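Here is the pitfall in miniature: base64 only emits = when the input length isn’t a multiple of three, so a decoder that hunts for = to find the end of a frame works right up until a frame whose length is divisible by three comes along. This is a simplified sketch of the idea, not Ambassador’s actual decoder; the robust version ignores the text encoding entirely and walks the four-byte length prefixes in the decoded bytes.

import base64
import struct

def grpc_web_frame(tag, payload):
    return bytes([tag]) + struct.pack(">I", len(payload)) + payload

for payload_len in (6, 7):   # frames of 11 and 12 bytes respectively
    encoded = base64.b64encode(grpc_web_frame(0x00, b"x" * payload_len))
    print(payload_len + 5, encoded, encoded.endswith(b"="))
# The 12-byte frame (an exact multiple of three) encodes with no = at all, so a
# decoder using = as an end-of-frame marker runs straight on into the next frame.

def frames(raw):
    # Frame boundaries come from the length prefix, never from = in the encoding.
    offset = 0
    while offset < len(raw):
        tag = raw[offset]
        (length,) = struct.unpack_from(">I", raw, offset + 1)
        yield tag, raw[offset + 5 : offset + 5 + length]
        offset += 5 + length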
Sometimes a whole day of debugging comes down to one character.
Intermittent failures and Heisenbugs

And after we fixed the decoder and saw the tests passing, what then? First we had to satisfy ourselves that what we had found really explained the bug. Digging into the source code and carefully studying the broken call to Time.Format, we realized that the extra byte would be missing only for the first 10 minutes of every hour. The tests took between ten and fifteen minutes to run at that point, and the failing test ran about halfway through that time — so it was very, very easy to see how we could get a long run of failures indeed, and we felt pretty good about this bug explaining what we’d been seeing in the tests.
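We won’t reproduce the broken format string here (the mangled day and year above show it was wrong in more ways than one), but the length effect is easy to illustrate: if the minute is rendered without zero-padding, the header comes out one byte shorter whenever the minute is below ten. A hypothetical Python illustration of just that effect:

from datetime import datetime, timedelta

def broken_date_header(t):
    # Hypothetical stand-in for the broken format: everything zero-padded except the minute.
    return t.strftime("Date: %a, %d %b %Y %H:") + str(t.minute) + t.strftime(":%S GMT")

base = datetime(2019, 7, 17, 19, 0, 7)
for minute in (7, 17):
    header = broken_date_header(base + timedelta(minutes=minute))
    print(len(header), header)
# minute 7  -> one byte shorter (the case that passed)
# minute 17 -> one byte longer (the case that failed)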
Next, we had to convince ourselves that our fix to correctly decode the gRPC-Web message would really fix the bug. We did this by, well, repeatedly running the test suite which had been failing, and by bringing in other colleagues to review the fix. Things looked good on that front too.
Then, of course, we had to sit down and talk over how we might prevent future occurrences:
- How can we make sure the tests always run cleanly enough that we can easily see failures like this?
- How can we better track which tests are failing so we can more easily address them?
- How can we keep improving Ambassador’s startup so that the test cluster doesn’t fall over the edge of the cliff?
We’ll talk more about these things going forward — they’re very much ongoing.
Overall, the most important takeaway is simple: there is no magic in code. Any behavior you see, you can understand (though it might take a while); any behavior you can understand, you can change. This is the core of what we do, even though it can be easy to lose sight of it.
And, of course, it’s good to remember that every so often, the bug really is because of the time of day or the phase of the moon. 😀
Join Ambassador Labs
Take a look at the Ambassador Careers page and share your favorite debugging story with us on Twitter.