Wednesday, January 27, 2016

Considerations for adopting Canary releases

A canary release is a tactic for reducing deployment risk. The idea is to deploy a new release to one or two nodes in a service cluster, let them handle some portion of the work load, and see if any unexpected errors result. In effect, this is a kind of "testing" in production. The term comes from the practice of miners to bring canaries with them into the mines; if the canaries died, then the air wasn't safe to breathe and they should evacuate. Like all strategies, there are advantages and disadvantages to using this approach. This blog entry attempts to outline the more important considerations.

The reason this is attractive is that it mitigates risk without slowing down developer velocity as other forms of risk mitigation (e.g. additional testing before deployment) does. If you are adopting continuous delivery where there are far more deployments and changes are deployed at a greatly increased rate, risk mitigation techniques like this are attractive.

Canary deployments seek to mitigate risk as you'll have less impact should a defect accidentally be deployed. Fewer customers or end-users will be impacted by unintended defects making it to production as the portion of the load a canary deployment handles is a fraction of the total. For example, if the canary deployment only handles 2 percent of the load, that drastically reduces the impact of an unintended defect when compared to if the new release handled 100% of the load..

One problem is how to evaluate unintended defects. Logged exceptions, where the defect results in an exception, are easy to defect. Silent defects are defects that don't result in an exception (e.g. produce an incorrect answer or incorrect data). They are harder to catch and in fact might result in a derivative error in some other service that processes incorrect data. In fact, it might not be caught until an end user or customer complains. 

Canary deployments do not effectively mitigate the risk of silent defects. The reason is that they are often not caught until far after the defect occurs. Furthermore, they often are detected manually, which introduces a time lag in and of itself.

Canary deployment capabilities increase complexity associated with your automated deployment mechanism. There are a few strategies that can be used to manage canary deployments. The available strategies are a comp[lex enough subject that they should be addressed in a separate blog post. All of these strategies introduce additional complexity to automated deployments and back-outs.

Canary deployments require the ability to measure and compare metrics of the canary to metrics of other nodes in the cluster. To those operating at a larger scale, this metric comparison between the canary and baseline is automated. Those metrics differing substantially compared to baseline are automatically rolled back.

Some use canary deployments can be utilized instead of capacity testing. The idea is that load is transferred gradually from one release to another enabling administrators to spot resource consumption issues before 100% of the load is transferred over. With this idea, all deployments are canary deployments; it's just that the deployment happens gradually over a longer period of time.

Database changes can present a problem for canary deployments. If the database changes in a way that's not backward compatible. As an example, if tables and or columns are removed. This can be a large topic and should really be addressed in a separate blog post. 

Canary deployments are only possible in cases where there are no service contract changes. In an age where we're breaking up monolithic applications into larger number of smaller deployables, not every release will be a candidate for a canary deployment. Those deployments that introduce a contract breaking change will almost certainly require an all-or-nothing deployment of both the producing service and all consuming services.

Saturday, January 16, 2016

Exception Throwing Etiquette

Nothing irritates me more than investigating reported exceptions where the developer didn't take the time to record the context with the exception.  That is, you get an error message like "Field XXX value not valid" or something like that.  What value was provided that was invalid?  What are the valid values for that field?  What were the other arguments to the method being thrown so that it's quick and easy to construct a test case for fixing the bug?  

These are all items of information that should have been reported with the exception.  Had they been reported with the exception, the developer assigned would have likely spent much less time diagnosing the issue and providing a fix.

You might ask why application architects should be concerned with code-level items such as these.  The reason is that architects are partly responsible for designing applications that are easy to support and help the organization keep support costs at bay.  Promoting and enforcing good exception etiquette is a cheap way to do that. 

Exceptions should contain context information to facilitate support.  This speeds support and problem resolution because it saves developers time.  That time savings allows developers to spend more time on feature development rather than support.  If your enterprise separates development and support, it allows the organization to spend less on support entirely.

Providing context with exceptions is easy for developers to do now.  Tooling exists now that makes this easy and safe.  I think the reason that developers don't take the time to include needed information with exceptions is laziness.  I'm going to present three options I see used most frequently.  That said, I'm sure that there are other tooling options.

Don't write custom formatting code for exceptions.  We've all see the issue of the exception reported was a null pointer from code that was in the process of formatting an exception message.  In other words, the exception that got reported was derivative; not the root exception that needs to be diagnosed.  The tooling options I'll tell you about reduce the risk of including context information with your exceptions as well as making it easier.

Apache Commons Lang Validation

A common reason for throwing exceptions is to validate arguments on public and protected methods.  I'll save a discussion on why all arguments for public and protected methods should be validated for another post. One of my favorite tools for validating arguments is the Validate utility from Apache Commons Lang.  Validate makes it very easy to both perform the validation and include context information along with the exception.  Examples follow which should make usage obvious.

Validate.notEmpty(firstName, "firstName can't be null or blank.  accountId=%s, 
    accountType=%s", accountId, accountType);

Validate.notNull(accountType, "accountType is required.  Allowed values are %s", 

Validate.isTrue(startingBalance >= minimumBalance, "Starting balance must be larger than %s.  balance=%s, accountId=%s, accountType=%s", 
    minimumBalance, startingBalance, accountId, accountType);

Exception messages produced by the above examples are:

java.lang.NullPointerException: firstName can't be null or blank.  accountId=12345, accountType=PERSONAL

java.lang.NullPointerException: accountType is required.  Allowed values are {PERSONAL,BUSINESS}

java.lang.IllegalArgumentException: Starting balance must be larger than 1000.0.  balance=100.0, accountId=12345, accountType=PERSONAL

Validate is null safe, so you don't need to worry about derivative null pointer exceptions.

Some developers prefer the Javax Validator framework as it's more annotation based.  I provide information on that later in the post.

Apache Commons Lang Contexted Exceptions

Exceptions that are not the result of a validation, but a part of something more complex, such as a remote call, I use the ContextedRuntimeException from Apache Commons Lang.  Note that there is a checked ContextedException that operates exactly the same way.

The idea is that when you create the exception you also add context values that will automatically be present in the log when the exception is reported.  the exception is null safe, so you don't need to worry about passing it null context values.  An example should make usage obvious:

throw new ContextedRuntimeException("Error adding customer", rEx)
    .addContextValue("accountId", accountId)
    .addContextValue("accountType", accountType)
    .addContextValue("customerName", customerName);

When the exception is logged, you'll see something like this:

org.apache.commons.lang3.exception.ContextedRuntimeException: Error adding customer
Exception Context:
 at examples.commons.lang.context.ContextedExceptionTest.test(

Javax Validator Framework

The Java Validator framework is popular in some organizations.  It's annotation based and is handy for validating inputs for published services automatically.  One of it's advantages is that since it's annotation based, it doesn't add code to the code base that needs to be tested.

One of the disadvantages to the Javax Validator framework is that it's much more difficult to include context information along with the exception thrown.  Yes, developers can provide explicit messages that will be used when the validation fails.  An example of including such a message follows:

public class CustomerAccount {
    @NotBlank(message="accountId can't be blank")
    private String accountId;
    @NotNull(message="accountType can't be blank")
    private AccountType accountType;
    @NotBlank(message="customerName can't be blank")
    private String customerName;
    @DecimalMin(value="1000.00", message="startingBalance must be at least $1000")
    private BigDecimal startingBalance;

Notice that in the examples above, all messages are explicit and hardcoded.  There's no easy ability to add in additional context information about other fields in the bean being validated so that a developer could get additional context to assist with diagnosing the issue.  Note that you also don't get the invalid value that violated the exception.

This presents an interesting question.  Would it be possible to use the Javax Validator framework and provide context information?  At first glance, you would have to write a custom class that implemented the MessageInterpolator interface (details here) and bootstrap it in (details here).  Maybe, I'll write one someday if I see enough of my clients utilizing this framework.  

Saturday, January 9, 2016

How Apache Mesos Adds Business Value

The Apache Mesos product adds business value.  An incredible amount of it.  But the source of that value isn't what most people think.  Most think that the business value is simplification of deployment for developers so that they spend more time developing features instead of worrying about the deployment infrastructure.  There is value for developers, but there's more.

Mesos eliminates cloud vendor lock in.  Mesos slaves run on virtual machines with prerequisite software on it.  Those slaves can exist just as easily on Amazon AWS or Google Cloud, Microsoft Azure, or any other cloud vendor.  All cloud vendors provide VPN capability so that hybrid cloud (a mixture of cloud and on premise) infrastructures are possible; everyone wants that.  You can't exist in today's marketplace without it.  There's nothing getting in the way of Mesos masters running slaves in on premise our cloud infrastructures in any combination.  Somebody makes a financial decision add to where slaves (or Mesos masters, for that matter) are hosted.  

Mesos reduces switching costs between cloud vendors.  One of the cloud "fears" I hear from clients is vendor lock-in.  That is, they start with a cloud vendor, that vendor changes its pricing structure or the quality isn't what they need, then they have a large project to switch cloud vendors.  Mesos makes switching cloud vendors or using multiple cloud vendors much cheaper and easier.  Adopting a new cloud vendor consists of establishing a VPN and creating Mesos slaves in your newly consumed cloud. Through the VPN, those newly created slaves can communicate with their Mesos masters wherever they are hosted. From a business perspective, this feature eliminates business risk. Your application runtime environment is effectively decoupled from your cloud vendor.  

Mesos vendor agnostic platform introduces a least common denominator problem.  One of the advantages of using the cloud is to capitalize on cloud vendor R&D, which is much larger and often much more focused than what enterprises can do individually.  Cloud R&D is realized by features such as dynamic scaling, which automatically scales your deployment to match the load it's currently servicing.  The web is repleat with postings about difficulties in making Mesos dynamically scalable and utilizing other cloud features.  Cloud vendor features will always be ahead of Mesos' ability to utilize them and will always have features that those adopting Mesos won't easily be able to use.  

It should be noted that Mesos is popular enough to get the attention of cloud vendors. For instance, AWS is devoting time and energy into providing a way to dynamically scale Mesos slaves through the ECS Scheduler Driver.  Azure and Microsoft have an effort going to bring Mesos (and Docker) to the windows world (here).  A quick google search reveals that there are many other Mesos integration efforts with cloud vendors and there's every reason to expect this to continue.

Saturday, January 2, 2016

Testing Private Methods in Unit Tests

Covering private code in unit tests can be problematic as they aren't directly executable from unit tests.  I've seen developers take one of two approaches.  One approach is to embed tests for private code in unit tests for protected or public methods.  Another is to escalate the declaration in the code being tested from private to protected, so that they can be more easily handled in unit tests.  Both approaches are problematic.

Testing private methods indirectly through protected or public methods is impractical.  If the private method has conditional logic or manipulates inaccessible fields, it can be hard to test each condition as it requires manipulating inputs to a protected or public method to do so.  In other words, testing of private methods becomes a "bank shot" more or less.  This takes additional time that developers don't always have.  It also makes test code more complex and more difficult to maintain.

Escalating methods and fields to protected status for testing creates a leakage of concerns issue.  Objects work like spies - on a "need to know basis".  If a method is going to be something other than private, then that escalated status should be needed in production code somehow.  If not, then the function of that private method wasn't really intended to be exposed.  The class might not be designed for that method to be overridden in an extension or called from outside the class in some other context.  Escalating that private method to protected in order to test it looses the documentation for other developers that it's dedicated to a specific purpose and really shouldn't be called from outside the class that defines it.

Apache Commons Lang and Reflection to the rescue.  Through reflection, access to public and private methods is more than possible.  Commons Lang makes that job easier.  In other words, you can have unit tests for private methods and check the value of private fields without escalating those items to protected status or attempting to test them indirectly.

Reading and Writing Private Fields

Interrogating and manipulating the values of private fields (without get and set methods) is fairly easy through the Commons Lang FieldUtils class.  Examples of reading and writing the values of private methods are provided in examples 1 and 2 respectively.  As you can see, FieldUtils gets this down to a one-liner.

Example 1:  Reading the value of a private field

Integer privateInt = (Integer) FieldUtils.readField(myClassInstance, "privateInt", true);
assertEquals(TEST_VALUE, privateInt);

Example 2:  Changing the value of a private field

FieldUtils.writeField(myClassInstance, "privateInt", Integer.valueOf(5), true);
// Run your tests and check for results

Executing Private Methods

Executing private methods *should* be just as easy from Commons Lang from the MethodUtils class.  In fact, I've proposed a minor enhancement to the Commons team to do just that. Until such a time, private methods can still be executed/tested directly. However, it's three lines of code and not just one.  I've got two illustrations of invoking private methods.  The first, example 3a, show executing a private method with no arguments.  The second, example 3b, is slightly more complex and assumes two arguments: a primitive int and a string.  Note that you do need to change the accessibility of the method so that it can be executed before you invoke it.

As I said, I hope that Commons Lang provides a one-line alternative for this at some point in the future.

Example 3a: Executing a private method (no arguments).

Method myPrivateMethod = MyClass.class.getDeclaredMethod("myPrivateMethod"); 
Object myResult = myPrivateMethod.invoke(myClassInstance);

Example 3b: Executing a private method (two arguments).

Method myPrivateMethod = MyClass.class.getDeclaredMethod("myPrivateMethod", Integer.TYPE, String.class); 
Object myResult = myPrivateMethod.invoke(myClassInstance, 5, "testValue");