A canary release is a tactic for reducing deployment risk. The idea is to deploy a new release to one or two nodes in a service cluster, let them handle some portion of the workload, and see if any unexpected errors result. In effect, this is a kind of "testing" in production. The term comes from the practice of miners bringing canaries with them into the mines; if a canary died, the air wasn't safe to breathe and the miners should evacuate. Like all strategies, this approach has advantages and disadvantages. This blog entry attempts to outline the more important considerations.
The approach is attractive because it mitigates risk without slowing down developer velocity the way other forms of risk mitigation (e.g. additional testing before deployment) do. If you are adopting continuous delivery, where changes are deployed far more frequently, risk mitigation techniques like this are especially attractive.
Canary deployments mitigate risk by limiting the impact of a defect that accidentally makes it to production. Because the canary handles only a fraction of the total load, fewer customers or end users are affected by any unintended defect. For example, if the canary deployment handles only 2 percent of the load, the impact of a defect is drastically smaller than if the new release handled 100% of the load.
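To make that concrete, here is a minimal sketch of weighted routing. The 2% weight and the node names are made up for illustration; a real setup would typically do this in the load balancer or service mesh rather than in application code.

```python
import random

# Illustrative sketch only: route a small, fixed fraction of requests
# to the canary node(s) and the rest to the baseline nodes.
CANARY_WEIGHT = 0.02  # assumed canary share of traffic

BASELINE_NODES = ["app-01", "app-02", "app-03"]  # hypothetical node names
CANARY_NODES = ["app-canary-01"]

def pick_node() -> str:
    """Send roughly 2% of traffic to the canary, the rest to baseline."""
    if random.random() < CANARY_WEIGHT:
        return random.choice(CANARY_NODES)
    return random.choice(BASELINE_NODES)

if __name__ == "__main__":
    sample = [pick_node() for _ in range(10_000)]
    canary_share = sum(n.startswith("app-canary") for n in sample) / len(sample)
    print(f"canary share of traffic: {canary_share:.1%}")
```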
One problem is how to detect unintended defects. Defects that result in a logged exception are easy to detect. Silent defects are defects that don't result in an exception (e.g. they produce an incorrect answer or incorrect data). They are harder to catch, may cause a derivative error in some other service that processes the incorrect data, and might not be noticed until an end user or customer complains.
Canary deployments do not effectively mitigate the risk of silent defects. Such defects are often not caught until long after they occur, and they are often detected manually, which introduces a time lag in and of itself.
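To make the distinction concrete, here is a contrived example; the discount logic is invented purely for illustration. The first defect raises an exception that canary monitoring will surface almost immediately, while the second silently returns a wrong but plausible-looking number that nothing flags.

```python
# Contrived illustration of a logged defect versus a silent one.
# The pricing rules here are invented for the example.

def apply_discount_logged(price: float, discount_pct: float) -> float:
    # A defect that raises is caught quickly: the exception shows up
    # in logs and error-rate metrics on the canary almost immediately.
    if discount_pct < 0 or discount_pct > 100:
        raise ValueError(f"invalid discount: {discount_pct}")
    return price * (1 - discount_pct / 100)

def apply_discount_silent(price: float, discount_pct: float) -> float:
    # A silent defect: the discount is applied twice, producing a wrong
    # answer. No exception, nothing in the logs; downstream services
    # happily store and propagate the bad value.
    discounted = price * (1 - discount_pct / 100)
    return discounted * (1 - discount_pct / 100)  # bug: applied twice

print(apply_discount_logged(100.0, 20))  # 80.0, correct
print(apply_discount_silent(100.0, 20))  # 64.0, wrong but raises nothing
```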
Canary deployment capabilities increase the complexity of your automated deployment mechanism. There are a few strategies that can be used to manage canary deployments; they are a complex enough subject that they should be addressed in a separate blog post. All of these strategies introduce additional complexity to automated deployments and back-outs.
Canary deployments require the ability to measure metrics of the canary and compare them to metrics of the other nodes in the cluster. For those operating at larger scale, this comparison between canary and baseline is automated: if the canary's metrics differ substantially from the baseline, the release is automatically rolled back.
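A minimal sketch of what that automated comparison might look like is below, assuming error rate is the metric being compared; the threshold and the roll-back hook are assumptions for illustration, and a real system would pull these numbers from a metrics store and hand the decision to a deployment controller.

```python
# Minimal sketch of automated canary analysis. The tolerance value and
# the decision rule are assumptions, not any particular tool's defaults.

def should_roll_back(canary_error_rate: float,
                     baseline_error_rate: float,
                     tolerance: float = 0.005) -> bool:
    """Roll back if the canary's error rate exceeds the baseline's
    by more than the allowed tolerance (absolute, e.g. 0.5%)."""
    return canary_error_rate > baseline_error_rate + tolerance

# Example: baseline at 0.2% errors, canary at 1.5% errors.
if should_roll_back(canary_error_rate=0.015, baseline_error_rate=0.002):
    print("canary unhealthy: trigger automated roll-back")
else:
    print("canary within tolerance: continue the rollout")
```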
Some use canary deployments in place of capacity testing. The idea is that load is transferred gradually from one release to another, enabling administrators to spot resource consumption issues before 100% of the load is transferred over. With this approach, all deployments are canary deployments; it's just that each deployment happens gradually over a longer period of time.
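A rough sketch of such a gradual ramp-up follows. The step schedule, the health check, and the set_canary_weight hook are placeholders for whatever your router and monitoring system actually provide.

```python
import time

# Illustrative ramp-up schedule: fraction of traffic on the new release.
RAMP_STEPS = [0.02, 0.10, 0.25, 0.50, 1.00]  # assumed schedule
STEP_DURATION_SECONDS = 15 * 60               # assumed soak time per step

def set_canary_weight(weight: float) -> None:
    # Placeholder: in practice this would reconfigure the load balancer
    # or service mesh routing rules.
    print(f"routing {weight:.0%} of traffic to the new release")

def release_is_healthy() -> bool:
    # Placeholder: compare error rates, latency, CPU, and memory
    # against the baseline here.
    return True

def ramp_up() -> None:
    for weight in RAMP_STEPS:
        set_canary_weight(weight)
        time.sleep(STEP_DURATION_SECONDS)
        if not release_is_healthy():
            set_canary_weight(0.0)  # back out before taking full load
            print("resource or error regression detected: rolled back")
            return
    print("new release now handles 100% of the load")
```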
Database changes can present a problem for canary deployments if the schema changes in a way that's not backward compatible, for example if tables or columns are removed. This is a large enough topic that it should really be addressed in a separate blog post.
Canary deployments are only possible when there are no breaking changes to service contracts. In an age where we're breaking up monolithic applications into a larger number of smaller deployables, not every release will be a candidate for a canary deployment. Deployments that introduce a contract-breaking change will almost certainly require an all-or-nothing deployment of both the producing service and all consuming services.
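Here is an invented illustration of why. If a new producer release renames a field that an existing consumer depends on, old and new releases can't serve traffic side by side, so canarying the producer alone would break a slice of requests.

```python
# Invented example of a contract-breaking change: the new producer
# release splits "name" into "first_name" and "last_name".

old_response = {"id": 42, "name": "Ada Lovelace"}
new_response = {"id": 42, "first_name": "Ada", "last_name": "Lovelace"}

def old_consumer(payload: dict) -> str:
    # Written against the old contract; fails on the new payload.
    return payload["name"].upper()

print(old_consumer(old_response))   # works against the old release
try:
    print(old_consumer(new_response))
except KeyError as exc:
    print(f"contract break: missing field {exc}")
```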