Canaries in Practice

In the context of distributed systems, canaries are used to limit the blast radius of software releases. Per the Google SRE book:

To conduct a canary test, a subset of servers is upgraded to a new version or configuration and then left in an incubation period. Should no unexpected variances occur, the release continues and the rest of the servers are upgraded in a progressive fashion. Should anything go awry, the modified servers can be quickly reverted to a known good state.

For instance, let’s say you have a service with 100 servers, all running the same version N, deployed in the cloud behind a load balancer and taking user traffic. Version N+1 of the service is ready, but simply creating a new cluster with 100 servers at version N+1 and switching all traffic over to it may result in a service disruption. So to run a canary, you may do this (see martinfowler.com for diagrams):

  1. Create a canary cluster with 1 server at version N+1 and put it behind the same load balancer as the prod cluster.
  2. Wait for an incubation period, let’s say 1h.
  3. After the incubation period, somehow compare the metrics (more on that later) between the canary and the prod clusters.
  4. If the metrics look good, proceed to the next step: roll version N+1 out to X% of traffic (e.g. increase the size of the canary cluster to X servers and disable X servers in the old cluster). This can be repeated, increasing X each time until it reaches 100%, to make the rollout more or less gradual.
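
Strung together, the four steps are essentially a loop. Here is a minimal sketch in Python, assuming hypothetical deployment and metrics clients; none of the method names below (create_canary_cluster, set_traffic_split, canary_looks_good, rollback, promote) refer to a real SDK:

    import time

    INCUBATION_SECONDS = 3600            # step 2: 1h incubation
    TRAFFIC_STEPS = [1, 5, 25, 50, 100]  # step 4: percentage of traffic at version N+1

    def progressive_rollout(deploy, metrics, new_version):
        # deploy and metrics are placeholders for your own deployment and
        # metrics tooling; this only illustrates the shape of the loop.
        canary = deploy.create_canary_cluster(version=new_version, size=1)    # step 1
        for percent in TRAFFIC_STEPS:
            deploy.set_traffic_split(canary, percent=percent)
            time.sleep(INCUBATION_SECONDS)                                    # step 2
            if not metrics.canary_looks_good(canary, deploy.prod_cluster()):  # step 3
                deploy.rollback(canary)     # revert to the known good version N
                return False
        deploy.promote(canary)              # version N+1 now takes 100% of traffic
        return True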

This process looks relatively simple, but there are a few important details to consider in practice:

  • testing in prod: it’s best not to think of canaries as testing, but you are still putting live customers in front of new software that you are not fully confident in. Canaries will fail, and real customers will get errors. This sounds obvious, but it has to be accepted and understood. There are also potential ethical implications, such as with sticky canaries (see below).

  • sticky canaries: some errors need multiple round-trips with the new version to manifest themselves, particularly if the interaction with the canary changes some state, such as cookies on the clients. If each request is independently assigned at random to a prod or canary instance, it is unlikely that customers will hit the canary multiple times in a row (and hence expose the issue). To work around this, it is possible to implement sticky canaries, assigning not just a percentage of traffic but a percentage of users to the canary (one way to do the assignment is sketched below). The potential ethical conundrum here is that you may have just completely broken the user experience of your poor group of guinea pig users who have been assigned to a canary “test cell” for however long your incubation period is. So be mindful of this, and ideally rotate users in and out of the canary cell quickly enough that they don’t experience obnoxious disruptions.
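
    One way to implement the sticky assignment, as a sketch (it assumes a user id is available at routing time; the hashing scheme and salt are illustrative):

      import hashlib

      def assigned_to_canary(user_id: str, canary_percent: float, salt: str) -> bool:
          # Hash the user id (plus a salt) into a stable bucket in [0, 100).
          # The same user always lands in the same bucket, so the assignment is
          # sticky; rotating the salt rotates which users are exposed to the canary.
          digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
          bucket = int(digest[:8], 16) / 0x100000000 * 100
          return bucket < canary_percent

      # Example: route ~1% of users (not requests) to the canary.
      print(assigned_to_canary("user-42", canary_percent=1.0, salt="2018-01-02"))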

  • client errors are in the blind spot: the method as described above only looks at server metrics, not client metrics. It is possible for a canary to pass with flying colors, yet once it is fully rolled out, client errors make a rollback necessary. There is no trivial workaround for this; trying to correlate client errors with canaries is going to be tricky.

  • beware small services: if the main prod cluster had only 3 servers instead of 100 in the example above and the load balancer just cycles randomly through all available instances, the canary instance would take 25% of user traffic instead of 1% (the arithmetic is spelled out below). Running canaries this way is now a major incident risk. Ideally your load balancer/service discovery mechanism would let you control the percentage of traffic sent to the canary cluster, rather than deriving it from the relative number of canary and prod instances.
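
    The arithmetic behind this, spelled out as a tiny sketch:

      def canary_traffic_share(prod_instances: int, canary_instances: int = 1) -> float:
          # Share of traffic the canary receives when the load balancer simply
          # cycles through every registered instance, prod and canary alike.
          return canary_instances / (prod_instances + canary_instances)

      print(canary_traffic_share(100))  # ~0.0099, about 1% of user traffic
      print(canary_traffic_share(3))    # 0.25, i.e. 25% of user traffic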

  • what metrics to look at: the example glosses over arguably the most important aspect, which is how to score the canaries. What metrics should be looked at, and how should they be compared? On one end of the spectrum, you would only look at the HTTP error rate over the incubation period [1], i.e. (num_http_4XX + num_http_5XX) / num_requests. This would prevent catastrophic rollouts (e.g. a service that returns 500 all the time) but might miss subtler issues related to business logic. On the other end of the spectrum, you would look at all the metrics exposed by the service (CPU, latency, memory usage, GC pauses, business logic errors, downstream dependencies, cache hits and misses, timeouts, connection errors…). The problem with this approach is that you may now be dealing with noisy signals.
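
    In code, the simple end of the spectrum is just the following (a sketch; the zero-traffic guard anticipates footnote [1]):

      def http_error_rate(num_4xx: int, num_5xx: int, num_requests: int) -> float:
          # (num_http_4XX + num_http_5XX) / num_requests over the incubation period.
          # Zero traffic is treated as "no signal" rather than a perfect 0% error
          # rate (see footnote [1]).
          if num_requests == 0:
              return float("nan")
          return (num_4xx + num_5xx) / num_requests

      # e.g. 3 4XXs and 1 5XX out of 2000 requests during the incubation period
      print(http_error_rate(3, 1, 2000))  # 0.002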

  • dealing with noisy signals: performance-related metrics (CPU, memory, latency, GC) have a natural variance that can cause false positives, such as a canary that fails because of a CPU or latency spike. The argument for including these is that we don’t want small performance regressions to pile up over time. While this can certainly happen and ongoing performance monitoring is necessary, I don’t believe a canary is the right place to check for performance regressions. Do you really want to delay a release because of a 2% increase in CPU or memory usage? There are some other downsides too (a sketch of a tolerance-based baseline comparison follows after these points):
    • if a deviation in any metric can cause the canary to fail, then canaries are likely to fail left and right and devs will get desensitized to bad canary results (sometimes causing incidents because of the decision to push despite poor canary results)
    • it may be tempting to include more than 1 canary instance to smooth out spikes in noisy metrics. If you’re behind a simple load balancer, see beware small services again: this is going to make the problem worse.
    • unless there is a resource leak issue that causes performance to degrade over time, warm instances in the prod cluster are likely to have better performance metrics than fresh canary instances. A workaround for this is to introduce a dedicated baseline cluster at version N that is created at the same time as the canary cluster. The canary is then compared against that new baseline cluster, not against the prod cluster. But again, beware small services.
    • to test whether your metrics are too noisy, consider running an A/A canary where the canary cluster runs the same version as the prod cluster. If you get bad canary results, you know you have a problem.
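
    Here is a sketch of what a canary-vs-baseline comparison and the A/A sanity check might look like; the metric names and tolerances are made up for illustration:

      # Per-metric relative tolerances; values are illustrative only.
      TOLERANCES = {
          "error_rate": 0.10,      # little slack for errors
          "latency_p99_ms": 0.25,  # more slack for naturally noisy performance metrics
      }

      def relative_increase(canary_value: float, baseline_value: float) -> float:
          if baseline_value == 0:
              return float("inf") if canary_value > 0 else 0.0
          return (canary_value - baseline_value) / baseline_value

      def failed_metrics(canary: dict, baseline: dict) -> list:
          # Compare the canary cluster against a baseline cluster created at the
          # same time, so both are equally "fresh". An A/A run (both clusters at
          # version N) should return an empty list; if it does not, the tolerances
          # are too tight or the metrics too noisy.
          return [name for name, tol in TOLERANCES.items()
                  if relative_increase(canary[name], baseline[name]) > tol]
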
  • weak signals: it is possible for a canary to look good but cause an incident after full rollout, even if the symptomatic metric was analyzed in the canary. For instance, you may have an error that breaks 0.1% of your customers very badly: very hard to catch in a canary, but potentially thousands of angry phone calls as a result. There is another scenario where you may need to explicitly exclude weak signals from the analysis: imagine the case where the baseline experienced 0 errors and the canary experienced N>0 errors. That’s an infinity percent deviation from the baseline, but also not a very strong signal (one way to filter this out is sketched below).
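
    One common way to exclude weak signals like this is to require a minimum absolute error count before the relative comparison even applies; a sketch, with arbitrary thresholds:

      def significant_error_increase(canary_errors: int, baseline_errors: int,
                                     min_errors: int = 20) -> bool:
          # Ignore deviations that are huge in relative terms but tiny in absolute
          # terms: baseline = 0 and canary = 2 is an "infinity percent" increase,
          # but not much of a signal. Both thresholds here are illustrative.
          if canary_errors < min_errors:
              return False
          return canary_errors > 2 * max(baseline_errors, 1)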

  • inconsistent configurations: including too many metrics results in canary configurations that are inconsistent across teams and services, so you end up with the known-flaky service where the canary always fails but you can push anyway (mostly), and the seemingly-robust service where the canary always looks great but doesn’t actually catch issues. If possible, look for a minimal (hence consistent) canary configuration across services in order to reduce the cognitive load associated with them. As a corollary, you should consider fighting the urge to add more and more metrics after each incident (e.g. when someone asks “why wasn’t this caught by the canary?”).

  • push delays: even with great automation in place, if changes take 4h or more to roll out, then you can most likely push only once a day. There is a trade-off between development velocity and risk to be mindful of.

  • beware the forgotten feature flag: it is possible to sneak features past a canary behind a disabled feature flag. Obviously not great when you turn it on and it blows up in prod (on the upside, only the feature needs to be reverted, not the whole rollout). As a workaround, consider having feature flags enabled by default and requiring an explicit configuration change to disable them (a minimal sketch of that convention follows below).
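
    A minimal sketch of that convention (the flag name is hypothetical):

      # New flags default to enabled, so the canary exercises them; shipping a
      # feature dark requires an explicit override rather than being the default.
      FLAG_DEFAULTS = {"new_checkout_flow": True}
      FLAG_OVERRIDES = {}   # e.g. {"new_checkout_flow": False} to explicitly disable

      def flag_enabled(name: str) -> bool:
          return FLAG_OVERRIDES.get(name, FLAG_DEFAULTS.get(name, True))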

  • poison pills: even a single canary instance can cause significant damage if you leave it unattended for the full incubation period (200 RPS of errors for 1h is a lot of errors). Consider having automation in place to kill really bad canaries early (sketched below); there is no need to keep throwing errors for more than a couple of minutes.
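
    A sketch of such an early-abort guard; the threshold, polling interval, and hooks (get_error_rate, kill_canary) are placeholders for your own tooling:

      import time

      HARD_ERROR_RATE = 0.5   # illustrative threshold: clearly broken, not just noisy
      CHECK_INTERVAL = 60     # seconds between checks

      def incubate(canary, incubation_seconds, get_error_rate, kill_canary):
          # Instead of letting a bad canary serve errors for the full hour, poll
          # its error rate and kill it early if it is obviously broken.
          deadline = time.time() + incubation_seconds
          while time.time() < deadline:
              if get_error_rate(canary) > HARD_ERROR_RATE:
                  kill_canary(canary)
                  return False
              time.sleep(CHECK_INTERVAL)
          return True   # survived incubation; proceed to the full metric comparison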

  • regions/DCs: if you deploy to multiple cloud regions (or data centers, cloud vendors…), you should probably run the canary in each, because the metrics indicate the health of code version N+1 in a particular environment and with a particular runtime configuration. A good canary result in AWS / us-east-1 does not tell you that version N+1 is good to go everywhere; it tells you that N+1 was fine in AWS / us-east-1 at that point in time, with the runtime configuration that was live at that time (including the specific set of enabled feature flags). This means that we’ll have to deal with inconsistent canary results that can be great in one region but disastrous in another. [2]
    • also, you can’t exactly canary in a region that has been evacuated (see Chaos Kong), so your automation has to know not to roll out to an evacuated region; otherwise you may get a surprise when traffic comes back (a rough sketch follows below).
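
    A rough sketch of region-aware gating; regions, is_evacuated, and run_canary_in are placeholders for your own deployment tooling:

      def region_aware_rollout(version, regions, is_evacuated, run_canary_in):
          # Run (and score) a canary per region, and skip regions that are
          # currently evacuated so that traffic failing back later doesn't land
          # on an unvetted version.
          skipped, failed = [], []
          for region in regions:
              if is_evacuated(region):
                  skipped.append(region)   # revisit once the region takes traffic again
                  continue
              if not run_canary_in(region, version):
                  failed.append(region)
          return skipped, failed
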
  • relation with alerts: it feels wrong to have good canary results, proceed to a rollout, and then get paged because of the rollout. It feels like there should be a tighter integration between canaries and alerts, but I don’t know exactly what it would look like. If you do, please talk to me.

For a different flavor of this, see how Facebook pushes changes.

Footnotes

[1]: You would probably also need to check that the magnitude of the signal is what you would expect, otherwise you may think a canary looks great (0% errors!) when it is not taking any customer traffic but just returning 200 for healthcheck requests. This one definitely happened to a friend.

[2]: For instance, you could have a misconfigured security group in eu-west-1 resulting in an unreachable critical dependency in prod for that region. Also happened to a friend.

Written on January 2, 2018