Chasing a test failure in our gRPC-Web test

Flynn · Published in Ambassador Labs · Aug 14, 2019 · 8 min read


As we got ready to ship Ambassador 0.74.0, we ran into a bug where the date really was the thing! This is the story of how it happened, and how we figured out how to fix it.

It all started with Ambassador’s CI tests starting to fail.

Ambassador’s test suite leans very heavily on actual end-to-end functionality checks rather than mocked-out components: the test suite allocates a Kubernetes cluster, fires up a bunch of Ambassadors and upstream services, then sends thousands of requests through the system, making sure that the responses match what we expect. While this is a great way to be confident of what Ambassador is doing, it means we start dozens of Ambassadors nearly simultaneously, and that’s very hard on the cluster.

Speeding up CI

Since adding features means adding tests, and adding tests often means adding Ambassadors to the test cluster, our additions periodically push the cluster over the edge of the cliff and the tests start dying during startup. Every time this happens, we put effort into making the tests easier on the cluster, which usually means making Ambassador start up faster, and that in turn benefits customers running Ambassador in the wild.

This time, it wasn’t exactly clear where to put that effort. There didn’t seem to be any rhyme or reason as to which CI tests were failing, so we focused first on isolating the different Ambassadors from each other, to reduce reconfiguration churn during startup:

  • We added label selectors to the test configurations so that each test Ambassador didn’t have to read and discard all the configurations for every other Ambassador. This works well enough to recommend it as a best practice going forward.
  • We found and fixed some cases where tests were using resources created by other tests. This is a big no-no: tests can flake out depending on the exact sequence in which other tests start.
  • Once we fixed the resource-creation thing, we split the tests into multiple groups that run sequentially. Paradoxically, this speeds up the overall run because each group can start much faster when we’re not stressing the cluster so much.
  • Finally, we tuned the timing for the cluster’s liveness and readiness probes to avoid needless restarts during test startup.

These things helped quite a bit and walked us (hopefully) well back from the edge of the cliff. But after all of that, we found one test that was still failing.

The gRPC-Web Test

This test was the gRPC-Web test. The worst part about this test was that it wasn’t failing consistently. It was failing 100% of the time in CI, but when my coworker based in Vancouver tried things locally, it passed 100% of the time. When I tried locally, it was somewhere in between. This is deeply offensive: nothing in the tests looks at who’s running them, so they should never change behavior in this way.

Underpinning everything we do in modern computing is the certainty that there is no magic in code. Things that seem to defy that idea are not acceptable.

About the time we realized how wildly inconsistent things were, I had a huge stroke of luck: the test started failing 100% of the time for me. On the face of it, this might seem like bad news, but no — being able to reproduce an issue goes a long, long way toward being able to understand it.

Debugging the error

First question: the error referred to an invalid tag. What was actually producing the error? The basic path for all of the tests is the same: our test client talks to Envoy, Envoy talks to the upstream service, the upstream answers back to Envoy, and Envoy passes the answer back to the test client. But we knew that both gRPC and gRPC-Web make for a more complex transaction:

  1. The test client constructs a gRPC request.
  2. The test client encodes the request as a protobuf.
  3. The test client then encodes the protobuf as a gRPC-Web message.
  4. The test client sends the encoded message to Envoy.
  5. Envoy decodes the gRPC-Web message, yielding the protobuf.
  6. Envoy sends the protobuf to the upstream service.
  7. The upstream service decodes the protobuf to process the request.
  8. The upstream service generates a response.
  9. The upstream service encodes the response as a protobuf.
  10. The upstream service sends the protobuf to Envoy.
  11. Envoy encodes the protobuf as a gRPC-Web message.
  12. Envoy sends the gRPC-Web message to the test client.
  13. The test client decodes the gRPC-Web message to get a protobuf.
  14. The test client decodes the protobuf to get the response.
  15. The test client validates the response.

Searching the source code for the “invalid tag” message revealed that it was coming from the gRPC decoder generated in Go, which made us suspect steps 7 or 14. Checking the logs for the upstream service made it seem that the upstream wasn’t generating the error; checking the logs for Envoy as well (in trace mode) let us conclusively rule out step 7. The error was coming from step 14.

Knowing what is generating the error, of course, raises the question of why. To answer that question, we needed to see the data passing through the various steps of encoding and decoding — this is why it’s so critical to have a reproducer! We could have done this a lot of ways, but in many cases the quick and dirty way is the best: we modified the test client to simply use log.Printf to dump the byte arrays it was working with at various points, then pasted the dumped array into a Python script:

import sys

# Byte values dumped by the test client's log.Printf (truncated here)
dumped_str = '10 27 97 99 99 101 112 116 97 …'
dumped = [int(x) for x in dumped_str.split() if x.isdigit()]
for b in dumped:
    sys.stdout.buffer.write(bytes([b]))  # raw bytes, ready to pipe to hexdump

Pipe that to hexdump and it’s easier to see what’s going on. Redirect it to a file and we have the raw stream to play with using Emacs keyboard macros or whatnot. The Unix philosophy of having many easily-composable tools was a huge win here.

Eventually we had our culprit. The gRPC call we’re making includes a set of headers, one of which is a Date header. This header worked:

Date: Wed, 77 Jul 23079 19:67:07 GMT

and this one failed:

Date: Wed, 77 Jul 23079 19:117:07 GMT

Ignore the horribly-broken values for the moment — of course it’s not the 77th of July in 23079, but that’s not what causes the problem here. What causes the problem is that the one that fails is one byte longer than the one that works.

The culprit: a buggy gRPC-Web decoder

This is because of the way gRPC-Web is encoded for transport across HTTP/1.1 connections. Native gRPC uses HTTP/2 trailers to convey the status of the gRPC call, but of course that won’t work in HTTP/1.1. gRPC-Web deals with this by defining a simple tag-length frame format: each frame has a one-byte tag, a four-byte length, and then the raw data for the frame. The whole thing is then base64 encoded, rather than trusting the connection to carry binary data correctly. The simplest case for a gRPC-Web response has a frame containing the gRPC protobuf followed by a frame containing the HTTP/2 trailers.
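As a rough sketch (not Envoy’s actual code, and with made-up payloads), building a gRPC-Web response body in Python might look like this; the 0x00 and 0x80 tag values follow the gRPC-Web convention of setting the tag’s high bit to mark the trailers frame:

import base64
import struct

def encode_frame(tag, payload):
    # One tag byte, a four-byte big-endian length, then the raw frame data.
    return bytes([tag]) + struct.pack('>I', len(payload)) + payload

# Hypothetical payloads: a protobuf-encoded response plus the gRPC trailers.
body = encode_frame(0x00, b'\x0a\x05hello') + encode_frame(0x80, b'grpc-status: 0\r\n')

# base64 the framed bytes so they survive the HTTP/1.1 connection.
print(base64.b64encode(body))

Decoding walks the same structure in reverse: base64 decode, then use each frame’s length field to find the next frame.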

The gRPC protobuf data for our success case ended up being one byte short of a multiple of three bytes, which meant that the failure case, one byte longer, was an exact multiple of three bytes long.

This is important because base64 works in groups of three bytes. If the message being encoded isn’t a multiple of three bytes, the encoded data will be padded with = to make everything line up… and there was a bug in our gRPC-Web decoder that assumed that it would see an = to mark the end of every gRPC-Web frame. When the protobuf frame was an exact multiple of three bytes long, there was no = at the end of the frame, the decoder lumped the trailer frame in with the protobuf frame, and everything came crashing down.
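To make the padding behavior concrete, here’s a minimal sketch (not the test client’s actual decoder) of why = is an unreliable frame delimiter:

import base64

# Padding appears only when the input isn't a multiple of three bytes long.
print(base64.b64encode(b'12345'))   # b'MTIzNDU='  -> trailing '='
print(base64.b64encode(b'123456'))  # b'MTIzNDU2'  -> no '=' at all

# A decoder that splits the stream on '=' to find frame boundaries works for
# the first payload but silently merges frames for the second.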

Sometimes a whole day of debugging comes down to one character.

Intermittent failures and Heisenbugs

And after we fixed the decoder and saw the tests passing, what then? First we had to satisfy ourselves that what we had found really explained the bug. Digging into the source code and carefully studying the broken call to Time.Format, we realized that the extra byte will be missing only for the first 10 minutes of every hour. The tests took between ten and fifteen minutes to run at that point, and the failing test ran about halfway through that time — so it was very, very easy to see how we could get a long run of failures indeed, and we felt pretty good about this bug explaining what we’d been seeing in the tests.
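As a purely hypothetical illustration of the length effect (the real bug was in a Go time.Format call, not reproduced here), an unpadded minutes field adds a byte to the header once the minute reaches 10:

# Hypothetical Date headers: one byte longer from minute 10 onward.
for minute in (7, 11):
    header = 'Date: Wed, 77 Jul 23079 19:%d:07 GMT' % minute
    print(len(header), header)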

Next, we had to convince ourselves that our fix to correctly decode the gRPC-Web message would really fix the bug. We did this by, well, repeatedly running the test suite which had been failing, and by bringing in other colleagues to review the fix. Things looked good on that front too.

Then, of course, we had to sit down and talk over how we might prevent future occurrences:

  • How can we make sure the tests always run cleanly enough that we can easily see failures like this?
  • How can we better track which tests are failing so we can more easily address them?
  • How can we keep improving Ambassador’s startup so that the test cluster doesn’t fall over the edge of the cliff?

We’ll talk more about these things going forward — they’re very much ongoing.

Overall, the most important takeaway is simple: there is no magic in code. Any behavior you see, you can understand (though it might take a while); any behavior you can understand, you can change. This is the core of what we do, even though it can be easy to lose sight of it.

And, of course, it’s good to remember that every so often, the bug really is because of the time of day or the phase of the moon. 😀

Join Ambassador Labs

Take a look at the Ambassador Careers page and share your favorite debugging story with us on Twitter.
