Tuesday, July 17, 2018

Cloud Governance: Making DevOps Automation Effective

I see cloud automation of all types implemented to control and/or secure cloud assets. Examples of this type of automation using Amazon Web Services (AWS) include the following:
  • Preventing unauthorized entries in security groups allowing public ingress (security)
  • Alerting on the creation of IAM users (possible security risk)
  • Forwarding application logs to Splunk (operational effectiveness)
  • Scheduling up-time for non-production assets (cost savings)
While these examples are AWS-specific, the principles discussed in this article are equally applicable if you're using Azure, GCP, or another cloud. 

Some of these examples (all drawn from real life) were very effective; others were only marginally so. The first two were extremely effective. If I defined a security group entry allowing unauthorized public access, that entry was automatically removed and an alert was issued to the security team. There was a process to add an "authorized" security group allowing public access, but I had to justify the need. As for IAM user creation, if I created an IAM user, an alert was generated and sent to the security team. The new IAM user wasn't automatically removed, but I was asked to justify the action.

The example regarding forwarding application logs to Splunk was absolutely horrible. It took many hours to implement. It was very fragile; new version deployments to applications broke the automation. The coding needed to automatically fix forwarding after an application deployment was cost prohibitive. Consequently, logs were not reliably forwarded.

Scheduling up-time for non-production assets using the AWS Instance Scheduler is marginally effective. The solution works. However, it depends on instance tags: an instance must carry the required tags for its up-time to be scheduled. Developers often apply these tags inconsistently, and that dependency is what makes the solution only marginally effective for the customer.  This entire discussion leads to an obvious question.

What characteristics make DevOps Automation effective?

Effective DevOps automation shares the following characteristics:

  • DevOps automation must be resilient
  • DevOps automation must be easy to install
  • DevOps automation must minimize ongoing manual labor

DevOps automation must be resilient to be effective. In other words, automation isn't effective if it breaks or doesn't work as intended when seemingly unrelated changes are made; the benefit of the automation isn't realized. In the case of the fragile Splunk forwarding, broken forwarding meant that support could not trust the results of Splunk searches. The benefit of having all application logs in one place was lost, and developers had to go back to the CloudWatch source.

DevOps automation must be easy to install to be effective. Automation that's difficult to install is a classic "barrier to entry".  In other words, installation difficulty increases the effective price of the automation. Automation that's not implemented doesn't provide benefit.  The Splunk forwarding example was extremely difficult to install due to complexity as well as poor and inaccurate documentation. As a consequence, many teams didn't even attempt to implement the solution.

DevOps automation must minimize or eliminate ongoing manual labor to be effective. The purpose of automation is to eliminate manual labor. If the automation itself has manual labor requirements, that just negates some portion of the benefit. The AWS Instance Scheduler example requires manual labor in terms of tag maintenance. Consequently, the complete benefit of minimizing instance runtime costs is never realized. There's labor to pay for.

How do I use these principles to improve the automation I create?

Test your install documentation. Write your documentation and let somebody who is not familiar with your automation install it. If they have questions, it means that your documentation has a defect. It might be that your documentation wasn't clear. It might be that you missed documenting something. Whatever the reason, take those questions as defect reports.

Support your automation when it breaks or doesn't work as intended. If a consumer has trouble with your automation, help them fix it. After you fix the issue, do a root cause analysis. Figure out why the automation broke and improve it to prevent that problem from repeating. It could be that your automation makes invalid assumptions about the environment, in which case there is a code change to make. It could be that the user didn't understand how to properly use your automation, which can be fixed by improving your documentation. Whatever caused the issue, fix it.

Solicit feedback on the ongoing care and feeding your automation requires. Ongoing care and feeding required by your automation detracts from the benefits your automation provides. If there are changes you can make that further minimize or eliminate that ongoing work, you should consider them.

Let's consider the AWS Instance Scheduler example that requires ongoing work with regard to documenting instance schedules as tags on the instances themselves. If we wrote that automation, how could we further minimize that ongoing work? One way I can think of is to provide a way to specify a default schedule for an entire account. Tags on the individual instances would be optional.

Many companies use different AWS accounts for non-production and production resources. If I could provide a default schedule for all EC2 and RDS instances in the entire non-production account, individual tags on individual instances would be optional. There would still be a need for tagging options on the instances themselves for resources that need a custom schedule. But the default schedule would apply to all instances in an account without developers remembering to place scheduling tags. A large percentage of the ongoing work required by the AWS Instance Scheduler would go away. With that, the amount of money the organization saved using the automation would increase.
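
To make the idea concrete, here is a minimal sketch of how such a lookup might work. It is not the AWS Instance Scheduler's actual code; the class name, the "Schedule" tag key, and the schedule names are assumptions made for illustration only.

```java
import java.util.Map;
import java.util.Optional;

/**
 * Sketch only: decides which schedule applies to a single instance.
 * The tag key and schedule names are illustrative assumptions.
 */
final class ScheduleResolver {

    // Assumed tag key; use whatever key your scheduler is configured to read.
    static final String SCHEDULE_TAG = "Schedule";

    // Account-wide default schedule name; null means "no default configured".
    private final String accountDefault;

    ScheduleResolver(String accountDefault) {
        this.accountDefault = accountDefault;
    }

    /**
     * An instance-level tag wins; otherwise the account default applies.
     * An empty result means the instance is left unscheduled.
     */
    Optional<String> resolve(Map<String, String> instanceTags) {
        String tagged = instanceTags.get(SCHEDULE_TAG);
        if (tagged != null && !tagged.trim().isEmpty()) {
            return Optional.of(tagged.trim());
        }
        return Optional.ofNullable(accountDefault);
    }
}
```

With a resolver created as new ScheduleResolver("office-hours"), an untagged instance resolves to "office-hours", while an instance tagged Schedule=always-on keeps its custom schedule. Developers only need to think about tags for the exceptions.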

Consider leveraging the single installation model. Most custom automation I've seen is written to work in its default region in the account in which it is installed. For example, if I install the automation in us-east-1 for account 1111111, that's the only place the automation is available. If I want to use that automation in other regions or other accounts belonging to the same organization, I need to install separate copies of that automation.

It is possible to code automation so that it operates in multiple regions and multiple accounts. That said, the code within the automation would get more complicated and likely require cross-account roles. That automation would also require a centralized configuration where it runs. For example, the AWS Instance Scheduler supports the single installation model. If you use that feature, there's additional configuration to specify the accounts and regions the scheduler will operate in. Furthermore, cross-account roles are required to provide it access. Providing guidance on implementing the single installation model effectively is a separate topic and may be the subject of a future article.
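
As a rough illustration of what that centralized configuration implies, the sketch below expands a list of accounts and regions into the individual targets the automation would visit. The class, the account IDs, and the role name are hypothetical; only the IAM role ARN format (arn:aws:iam::ACCOUNT_ID:role/ROLE_NAME) is standard.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: expands a central configuration into per-account, per-region targets. */
final class AutomationTargets {

    /** One place the automation must operate, plus the role it would assume there. */
    static final class Target {
        final String accountId;
        final String region;
        final String roleArn;

        Target(String accountId, String region, String roleArn) {
            this.accountId = accountId;
            this.region = region;
            this.roleArn = roleArn;
        }
    }

    /** Builds the ARN of the cross-account role; the role name is a hypothetical placeholder. */
    static String roleArn(String accountId, String roleName) {
        return "arn:aws:iam::" + accountId + ":role/" + roleName;
    }

    /** Produces one target for every account and region combination. */
    static List<Target> expand(List<String> accountIds, List<String> regions, String roleName) {
        List<Target> targets = new ArrayList<>();
        for (String accountId : accountIds) {
            String arn = roleArn(accountId, roleName);
            for (String region : regions) {
                targets.add(new Target(accountId, region, arn));
            }
        }
        return targets;
    }
}
```

Two accounts and two regions yield four targets. The automation would assume each target's role before acting in that account and region, which is where the extra code complexity comes from.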

I hope this helps you with the automation you create. Comments and questions are welcome.

Sunday, May 27, 2018

Tips and Tactics for Passing the AWS Solution Architect Certification Exams

I've been using the AWS cloud platform since about 2010. When I embarked on the AWS certification path a couple of years ago, I knew it would be a challenge even with my experience. I knew professional level certs are some of the most challenging exams in IT. Having passed the AWS Solution Architect Professional and Associate certification exams, I've been asked by several people for tips on how to prepare and pass. The process is daunting given that the body of knowledge covered by the tests is incredibly broad. Many have blogged on this topic. This post describes tactics that worked for me; I hope they are of value to you.

Preparation and Study

Memory only gets you so far. The associate exam does have some questions on obscure service limits where memory skills help (e.g. How many VPCs can a user create by default?, What is the limit on the number of instances in a placement group?). The professional exam is purely story problems where each question is a scenario and you're expected to choose from different design options they present. You do need to learn how to apply the underlying principles AWS uses. You just can't memorize your way to success.

Networking skills should be considered a prerequisite. Most of my experience is in the application development realm. Fortunately, I had experience as a system administrator at different points and knew networking concepts. I have friends who went down the same path and had a very difficult time as they didn't have networking experience. You should be able to set up, from memory, a VPC with public and private subnets in different availability zones, including an internet gateway and NAT. Don't forget to launch a couple of instances to make sure the public/private access works.

Purchase one of the online courses. These courses whittle down the size of the haystack and in some cases include labs where you learn the different services by actually doing, not just reading. I've seen some bloggers recommend purchasing multiple courses, but I don't think this is wise. Not only does it cost more money, it also takes more time.

I used the A Cloud Guru courses for both the associate and professional exams. Honestly, the associate course better prepared me for the exam than the professional course did. I still consider the professional course worth the money. I did use some lectures in the associate course as prep for the professional exam, as it contained material on some services (like Redshift) that the professional course did not. The A Cloud Guru courses include VPC labs that give you a chance to learn networking if that part of your experience is light.
Note: I'm not affiliated with A Cloud Guru other than being a customer.

Purchase at least one set of practice exams. I used the tests from Whizlabs for the professional exam. For the associate cert, I used the quizzes supplied with the A Cloud Guru associate course and got to the point where I scored 100% on those tests most of the time. Some of those questions literally appear on the exam word-for-word. I also purchased a set of practice tests for the associate exam from a vendor I lost track of.

Practice exams not only give you confidence when you start doing well, they also optimize your time. They allow you to concentrate study on the portions you miss rather than the things you know well.

Take the AWS Practice Exam. While the Whizlabs tests are good, they aren't exactly like the AWS exams. The online experience with AWS is slightly different from that of Whizlabs. Also, AWS questions are wordier, particularly in the professional exam.

Research and verify your AWS Practice Exam answers. Capture screen shots of all questions and your answers. Rather than futz with screen shots during the timed test, I used Camtasia to make a movie of me taking the test and clipped the screen shots after. This exercise forces you to check your assumptions and better acquaints you with their exam tricks. 

For those taking the professional SA exam, take the free AWS sample questions test (English version). There are six questions. There's a trick here: the exact same test is published in Japanese, with answers on the lower right for each question. You're welcome.

Test Taking Strategies
This section details the strategies I used when taking the tests.  

Eliminate wrong answers first. Even if the question relies on facts you don't know, often you can eliminate at least one or two possible answers on the list. Even if you end up guessing, that can bump your chance of a correct answer to 50%. Sometimes, you even get to apply the Sherlock Holmes logic that if there's only one answer left after you eliminate the wrong answers, that answer must be correct even if you don't see why.

On the professional exam, an incorrect fact or tactic is often used in multiple answers.  If you spot the incorrect fact, it's easy to eliminate several answers in one shot. Sometimes, there's more than one incorrect fact buried in multiple answers...

Reread the one sentence that specifies how to choose among the various answers. For example, some questions have you choose which strategy "minimizes costs" without any mention of other requirements you expect like minimizing latency or achieving high availability. They will try to trick you by putting in a strategy that does minimize cost but also increases latency or the chance of data loss. Don't read into the question: select the strategy that minimizes costs even though other common objectives suffer.

Another tactic the exam uses is to put the word "not" in the sentence. For example: which of these strategies does not improve latency? That three-letter word can send you down the wrong path if you miss it.

There will be questions that rely on facts you don't know. Obsessing over these questions when you're obviously reduced to a guess just wastes time you don't have. Eliminate what look like obvious wrong answers, take your best guess, and learn to live with the fact you might have missed that question.

Don't overuse the Flag feature. AWS exams have a 'flag' feature where you can mark specific questions and revisit them later. In the professional exam the other day, I flagged three of the 77 questions. If you flag most questions, it's the same as not having the feature at all as you won't have time to revisit all of them.

For me to flag a question, it needed to pass the following criteria:
  1. This question can't center on facts or a service I simply don't know -- I have to guess anyway.
  2. Time will help improve my chances of getting the answer correct.  (Be honest with yourself)
If it doesn't pass both criteria, I don't flag it. Devoting more time to the question would be a waste. I take my best guess and move on.

Caffeine lovers -- Consider the espresso trick. If you're a highly caffeinated individual like I am, you want caffeine for the test. However, you do not want to spend time in the restroom as it's a timed test. Espresso, with a little cream to dilute the bitter taste, functions quite well for this without giving you unwanted liquid. I consumed four shots right before both tests. In all honesty, if I had it to do over again, I would have ordered six for the professional exam as it's so much longer.

I hope these thoughts help -- best of luck on the exams.

Sunday, August 6, 2017

Making Cloud Code Testable

Developers seem to assume that just because their application code may now interact with the cloud (e.g. read/write data from AWS S3 buckets, invoke AWS Lambda functions, email via AWS SES, send or receive AWS SMS messages, etc.) it's no longer reasonable or easy to structure their code in a testable way without access to the cloud. Yes, my examples are AWS centric. The principles in this blog post apply to Azure or Google Cloud code as well.

By "testable" I mean unit tests that don't require access to the cloud to run. The purpose of such tests is to test logic in your code, not to test that the code has the ability to connect to outside resources. Testing the ability of your code to connect to outside resources is the purpose of integration tests.

Inject into your application code any resource that can't exist without the cloud, a network, or an external resource of any type. When your application code creates/instantiates these resources, that code can no longer be run without those resources. A previous version of a class in listing 1 violates this advice and isn't testable.

Listing 1: AntiPattern Example -- Don't do this at home!
Listing 1 isn't unit testable because it requires access to AWS to exist. The way listing 1 is coded, any test of its logic is really an integration test. A much better way to write this class can be found in listing 2.

Listing 2: A more testable way to write listing 1

Listing 2 is much more testable than listing 1 as it no longer depends on the environment in which the AmazonS3 instance was created. That instance can easily be mocked, as I did in the unit test for this example class. I've a snippet of the Mockito code used to test this class in listing 3.  In fact, line, branch, and mutation coverage are all 100% for this class.

Listing 3: Mocking an AWS resource for unit tests.

Incidentally, there are numerous ways to inject cloud-dependent resources. You don't need to inject it on construction as I did in my example. For instance, the Spring Framework or Guice can also do that injection for you.
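
The post's own listings are not reproduced here, so as a stand-in, here is a minimal sketch of constructor injection. To keep it runnable without the AWS SDK or Mockito, it assumes a hypothetical ObjectStore interface in place of the AmazonS3 client and a hand-written stub in place of a mock. None of these names come from the original listings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a cloud client such as AmazonS3; not part of any SDK. */
interface ObjectStore {
    List<String> listKeys(String bucketName);
}

/** Logic under test: the client is injected, never created here. */
final class BucketContentLister {
    private final ObjectStore store;

    BucketContentLister(ObjectStore store) {
        if (store == null) {
            throw new IllegalArgumentException("store is required");
        }
        this.store = store;
    }

    /** Returns only the keys that start with the given prefix. */
    List<String> keysWithPrefix(String bucketName, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String key : store.listKeys(bucketName)) {
            if (key.startsWith(prefix)) {
                matches.add(key);
            }
        }
        return matches;
    }
}

/** Hand-written stub used in place of a mocking library; it needs no network. */
final class StubObjectStore implements ObjectStore {
    @Override
    public List<String> listKeys(String bucketName) {
        return Arrays.asList("logs/a.txt", "logs/b.txt", "data/c.csv");
    }
}
```

A unit test can then construct new BucketContentLister(new StubObjectStore()) and assert on keysWithPrefix("any-bucket", "logs/") without touching the cloud. In a real project, the production implementation of the interface would wrap the actual client.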

Another way to look at listing 2 is that it uses the architectural principle of "separation of concerns". Listing 2 separates the environmental concern of creating cloud dependent resources from the logic of listing the content of an AWS bucket. This makes the resulting classes much more focused and less complex. This concept raises a question: Where do you create the AmazonS3 client and can you do that in a unit testable way? The short answer is that you really can't. However, you can localize and minimize the amount of code that isn't unit testable.

Localize the creation of resources that are dependent on the cloud or any other external resource.  Generally, my projects end up with a "factory" class that handles instantiations that aren't unit testable.  To continue the S3 bucket list example, I'd envision a class like the factory presented in Listing 4.

Listing 4: AWS S3 Client Instantiation Example
This tactic minimizes the amount of code that isn't unit testable. Furthermore, this code doesn't generally contain complex logic leaving the more complex logic to classes that can be unit tested. 

Incidentally, I also wrote a unit test using JMockit, as a development team I'm working with had adopted it. As it happens, once JMockit decorates a class with stubbing code, it's not possible to use that class directly anymore (bug documented here).  I commonly want to stub an InputStream or OutputStream only for the purpose of testing how my code handles exceptions. It seems that with this bug, this isn't currently possible using JMockit. The bug is very irritating and applies to a significant percentage of the unit tests I write. I'm personally sticking to Mockito for now.

Friday, December 30, 2016

Making Technology Choices for Personal Growth.

When I started doing IT as a profession, the number of available, commercially used technologies was considerably smaller. Back then, most of IT ran on IBM mainframes. The number of tools you needed to learn to be marketable could be counted on one or two hands.  Today, there are far too many technologies/products/frameworks and not nearly enough time to learn them all.  Should I invest in AngularJS or React? Java8 or Scala? Amazon or Azure? Python or Node.js?  The list goes on and on.

If you're like me, your time to learn new products or frameworks is spare time. It's limited. There's just not enough spare time to learn everything, or even close. People often ask me whether they should learn this product or that framework. Very few ask how they should be making their time investment choices. Let's face it, with so much educational material available on the internet for free or very low cash outlay, it's really a time investment we're talking about.

I look at learning technologies the same way I look at investments. Most software vendors, be they open source or commercial, publish documentation or even let you download and install, for free, most products you might consider. Thanks to Udemy and Amazon, who have pushed prices for online courses and eBooks to unbelievably low levels, cheap resources are available for the more popular choices. There's very little cash outlay for materials. However, it's an investment all the same. The investment is more time than cash.

I believe time is money. Time invested is time I can't spend consulting, writing, or anything else I do to make money. Consequently, I try to make good decisions as to which technologies I invest time in.  That raises the important question: how do you choose?  Furthermore, there's another dimension to this choice: how much time do you invest in one of those choices?

The amount of time you spend learning a new product, technology, or framework has diminishing returns. That is, the more time you invest, the less incremental payback you'll get for the investment. Like stocks, you don't necessarily buy a large portion to start out. Often, you invest a little and as the market develops, sometimes you buy more. Time investments in new technologies are the same way. 

Like financial investments, investing your time has risk. Any time spent learning a new technology, product, or framework that never takes off is time you'll never get back. Your objective is to consciously manage that risk. 

Research Tactics

There are several tactics I use to help me decide which technologies to invest in. Here are some of mine.

Ask a mentor or person whose opinion you value about technology choices you're considering. It's quick and easy. That person may have thought of issues or concerns that you haven't thought of so far. Often verbalizing your thoughts to another person helps you crystallize and more thoroughly think through your line of thought. 

Look at what the market values. 
The best way to do this is to use the Job Trends site from indeed.com. Indeed is a job posting site where hunters post resumes and firms or recruiters post jobs. Indeed keeps history and lets you graph postings and hunter skills listed on their resume over time. As an example, I'm using this comparison that compares top cloud vendors, perhaps for people looking at learning cloud technologies.

You want technologies with an upward job posting trend. You want technologies that firms value and recruit for. Given the time you'll need to invest, you want a realistic chance of a payback for your investment. The market can change at any time and the landscape might look different in six months, but hard numbers are more attractive than gut feelings. I've posted the job posting comparison of cloud vendors (taken on Dec. 30, 2016). I'm using Amazon Web Services, Azure, and Google Cloud in my comparison.

You can see that Amazon AWS job postings are on a sharp uptrend, followed by a slightly gentler uptrend for Azure postings. It would be safe to infer that AWS skills have more of a market right now than either Azure or Google Cloud. If you believe I've left off vendors that you care about, you should surf to the site and change the vendors to your liking.

You want technologies with a postings-to-seekers ratio of at least 1.00. Anything less means that there might be oversupply in the market, and that can drive salaries and consulting rates down. I say "might" as not all job seekers are honest; some look for jobs with skills they don't have. Given the demand for AWS skills I hear about from recruiters, I believe the 1.02 seekers-per-posting ratio is overstated and that there are actually more postings per seeker. Note: Pay attention to the description: "seekers per posting" vs. "postings per seeker".

Test your search criteria. Toward the bottom of the comparison page, you're offered links that will list job postings for those particular search criteria. You should look at a sample to make sure your search term isn't picking up something you're not interested in. For example, if you enter a term that is too general, you could be including irrelevant postings and seekers in your comparison.

A few words of warning: Indeed data has limits. Here are a few:
  • This data doesn't account for current labor salaries/rates. For example, "HTML" looks hot until you figure out that market rates for that skill are really low.
  • The data presented is three months old and the current graph might be slightly different.
  • This data works best when you enter competing products or frameworks. Comparing AWS to Java isn't useful; market drivers for those two are completely different and you'll get more double counting of postings (posts might have both terms).

Look at what people are interested in

Before technologies are listed in job postings or on resumes, they are searched for on the web. Google maintains search term history that it graphs or allows you to download. Furthermore, as with the Indeed Job Trend site, you can compare search data for multiple searches. As an example, I've graphed the same search criteria we used above on the Indeed site.

Internet search results appear to be congruent with what we see on the Indeed Job Trends site. The difference between AWS and Azure does appear to be less stark than what we see on the Indeed site.  Here are some things to keep in mind:
  • It's important to test your search results just like with Indeed.
  • Trend data for technologies with common names will be meaningless.
  • Poor paying technologies might have rising interest too just like on the Indeed site.
Deciding How Much Time to Invest

Deciding how much time to invest is more difficult. Everyone has different experience levels, talents, and abilities. Some technologies might take more investment than others. The amount I need might differ from the amount you need. Here are some tactics I use.

You must invest in something all the time. Pick a commitment level: two hours a week, four hours a week, or even more. This should become a habit you don't even think about. If you don't, you'll become stagnant. Your skills will get dated over time and your marketability will gradually decrease. Furthermore, you won't develop tactics for learning new things quickly. When you wake up and realize that you're very out of date, it'll take a long time to catch up. Unlike financial investments where you can adopt a cash position, it's too dangerous in a technology world to sit on the sidelines.

Time-box your investment. That is, set a rough amount of time you'll spend on a technology choice up front and then re-assess what you're willing to spend after that learning period. I typically use two hours or four hours as an initial time frame limit. That said, over the past couple of decades, I've honed tactics that allow me to spend less time than many others. Two hours might not be enough for you or it might be too much. You'll need to decide the amount. It's important that you roughly track the time you spend. Also, feel free to quit early if you learn enough to make the decision that you're not going to invest additional time.

Time-boxing mitigates your investment risk. What you want to avoid is going down a rabbit hole and spending boat loads of time unproductively. It also keeps the time you're spending at the forefront of your mind. You'll never get this time back.

Distinguish between "exploratory" learning and "objective" learning. Objective learning has a defined purpose such as doing upcoming work at your current employer or gearing up for a new job search or interview. Exploratory learning is for general knowledge you can use with colleagues. Exploratory learning doesn't require as much depth. 

There's value in learning the basics and general advantages and disadvantages of a product. With general knowledge, you know whether or not you want to pursue work using this technology. You will know enough about a technology to participate intelligently in conversations with other developers or recruiters. You know enough to come back later and dig more deeply if the need arises (you decide to submit for a job that needs it). Some developers will resist this idea, but you don't need an advanced, in-depth knowledge of every product you become acquainted with.

Don't learn technologies in depth until there's a high probability you'll use them at work or need them to support a job search. Technology moves too quickly for that. Your newly learned in-depth knowledge becomes dated quickly. Looking at this in investment terms, don't invest until there's a reasonable probability of a payback. Another way to think about this is that classic YAGNI applies.

Thanks for taking time to read this post. I would like to hear your thoughts on this topic.

Sunday, December 11, 2016

Automated Integration Testing in a Microservices World.

Everyone's dabbling with microservices these days. It turns out that writing distributed applications is difficult. Microservices offer advantages to be sure. However, there is no free lunch. One of the difficulties is automated integration testing. That is, testing a microservice with all the external resources it needs including any other services it calls, the databases it uses, and any queues it uses. Just setting up and maintaining these resources can be a daunting task that often takes specialized labor.  All too often, integration testing is difficult enough that the task is sloughed off. Fortunately, Docker and its companion product Docker Compose can make integration testing much easier.

Standardizing on Docker images for deployment artifacts makes integration testing easier. Most organizations writing microservices seem to adopt Docker as a deployment artifact anyway as it greatly speeds up interaction between application developers and operations. It facilitates integration testing as well as you can deploy services you consume (or mocks for the services you consume) temporarily and run integration tests against them. Additionally, consumers for your service can temporarily deploy your docker image to perform their own integration tests. However, as most services also need databases, message queues, and possibly other resources to function properly, that isn't the end of the story. I've previously written about how to dockerize (is that a word?) your own services and applications here.

Docker images already exist for most database software and message software. It's possible to leverage these docker deployments for your own integration testing. In other words, the community has done part of your setup work for you. For example, if my service needs a PostgreSQL database to function, I leverage the official Docker deployment for my integration tests. As it turns out, the Postgres docker deployment makes their image very easy to consume for integration testing. All I need to do is mount the directory '/docker-entrypoint-initdb.d' and make sure that directory has any SQL files and/or shell scripts I need run to set the database up for use by my application.  The MySQL docker deployment does something similar. For messaging, similar docker distributions exist for RabbitMQ, ActiveMQ, and Kafka. Note that ActiveMQ and Kafka aren't yet "official" docker deployments.
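
As an illustration of that mount, the sketch below assembles a docker run command for the official postgres image. The class name, container name, password, and port are placeholders I made up for the example; the flags themselves are standard docker run options and POSTGRES_PASSWORD is the variable the official image reads.

```java
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Sketch only: builds the command that starts a throwaway PostgreSQL container. */
final class PostgresTestContainer {

    /** Directory the official postgres image scans for SQL files and shell scripts on first start. */
    static final String INIT_DIR = "/docker-entrypoint-initdb.d";

    /**
     * @param initScripts host directory holding the SQL files and/or shell scripts
     * @param password    superuser password handed to the image (placeholder value in tests)
     * @param hostPort    port on the host that maps to PostgreSQL's 5432 in the container
     */
    static List<String> runCommand(Path initScripts, String password, int hostPort) {
        return Arrays.asList(
                "docker", "run", "--detach", "--rm",
                "--name", "it-postgres", // hypothetical container name
                "--env", "POSTGRES_PASSWORD=" + password,
                "--volume", initScripts.toAbsolutePath() + ":" + INIT_DIR,
                "--publish", hostPort + ":5432",
                "postgres");
    }
}
```

Passing the resulting list to new ProcessBuilder(command).inheritIO().start() would launch the container. The init scripts run only when the container initializes an empty data directory, which is exactly what a fresh test container does.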

Docker Compose makes it very easy to assemble multiple images into a consistent and easily deployable environment. Docker-compose configurations are YAML files. Detailed documentation can be found here. It is out of scope for this blog entry to do a complete overview of Docker Compose, but I'll point you to an open source example and discuss a couple of snippets from the example as an illustration.

The screen shot on the left contains a snippet of a docker-compose configuration. The full source is here. Note that each section under services describes a docker image that's to be deployed and possibly built. In this snippet, images vote, redis, worker, and db are to be deployed. Note that vote and worker will be built (i.e. turned into a Docker image) before they are deployed. For images already built, it's only necessary to list the image name.

Other common compose directives are as follows:
  • volumes-- links a directory in the real world to a directory inside the container
  • ports-- links a port in the real world to a port on the inside of the container. For example, vote links port 5000 on the outside to port 80 on the inside.
  • command-- specifies the command within the Docker container that will be run at startup.
  • environment-- (not illustrated here) allows you to set environment variables within the Docker container

Assemble and maintain a Docker compose configuration for your services. This is for your own use in integration tests and so that your consumers can easily know what resources you require in case they want to run integration tests of their own. It's also possible for them to use that compose configuration directly and include it when they set up for their own integration tests.

The Docker environment for your integration tests should be started and shut down as part of the execution of the test. This has many advantages over maintaining the environment separately in an "always on" state. When integration tests aren't running, they don't consume resources. Those integration tests, along with their environment, can be easily run by developers locally if they need to debug issues; debugging separate environments is always more problematic. Furthermore, integration tests can be easily and painlessly hosted anywhere (e.g. on-premise, in the cloud) and are host agnostic.

An Integration Test Example

I would be remiss if I didn't pull these concepts together into an integration test example for you. For my example, I'm leveraging the integration test for a generic health check written to make sure that a RabbitMQ environment is up and functioning. The source for the check is here, but we're more interested in its integration test today.

This test uses the DockerProcessAPI toolset, as the environments I currently work in (Linux or Windows 10 Pro/Enterprise) don't require a docker-machine and the Docker Remote API. If your environment requires a docker-machine (e.g. it is a Mac or an earlier version of Windows), then I recommend the Spotify docker-client instead.

The integration test for the health check uses Docker to establish a RabbitMQ environment before the test and shut it down after the test. This part is written as a JUnit test using the @BeforeClass and @AfterClass annotations to bring the environment up once for the entire test and not for each test individually.
In this example, I first pull the latest RabbitMQ image (official distribution). I then map a port for RabbitMQ to use and start the container. I wait five seconds for the environment to initialize, then log the Docker containers that are currently running.

My log of what Docker containers are running isn't technically required. It does help sometimes if there are port conflicts where the test is running or other problems with a failed test that need to be investigated. As this test runs in a scheduled manner, I don't always know the execution context.

After the test completes, the @AfterClass method shuts down the RabbitMQ instance I started and once again logs a container listing, just in case something needs to be investigated.
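As a minimal sketch of the lifecycle just described, and not the DockerProcessAPI itself, the same steps can be driven through the docker command line from plain Java. The class name DockerCli, its method names, the container name, and the port used in the usage note are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that shells out to the docker CLI; not part of DockerProcessAPI. */
final class DockerCli {
    private DockerCli() {}

    /** docker pull IMAGE */
    static List<String> pullCommand(String image) {
        return Arrays.asList("docker", "pull", image);
    }

    /** docker run -d --rm --name NAME -p HOST:CONTAINER IMAGE */
    static List<String> runCommand(String name, String image, int hostPort, int containerPort) {
        return Arrays.asList("docker", "run", "-d", "--rm", "--name", name,
                "-p", hostPort + ":" + containerPort, image);
    }

    /** docker stop NAME */
    static List<String> stopCommand(String name) {
        return Arrays.asList("docker", "stop", name);
    }

    /** docker ps, used only to log what is running for later investigation. */
    static List<String> listCommand() {
        return Arrays.asList("docker", "ps");
    }

    /** Runs a command, forwards its output to this process, and returns the exit code. */
    static int exec(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

In a JUnit 4 test, a static @BeforeClass method would call DockerCli.exec with pullCommand("rabbitmq"), then runCommand("rabbit-it", "rabbitmq", 5672, 5672), pause for the broker to initialize, and finish with listCommand(). The @AfterClass method would call stopCommand("rabbit-it") followed by listCommand() again.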

That's a very short example. Had the integration test environment been more complicated and I needed Docker Compose, that would have been relatively simple with the DockerProcessAPI as well. Here's an example of bringing up a Docker Compose environment given a compose configuration YAML:

Here's an example after the test of bringing that same environment back down:

In addition, there are convenience methods on the DockerProcessAPI that can log which compose environments are running for later investigation.
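For readers not using the DockerProcessAPI, here is a rough equivalent that drives the docker-compose command line directly. It is a sketch under stated assumptions: the class ComposeCli and its method names are hypothetical, and it assumes the docker-compose executable is on the PATH.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that shells out to docker-compose; not part of DockerProcessAPI. */
final class ComposeCli {
    private ComposeCli() {}

    /** docker-compose -f FILE up -d (start every service in the background) */
    static List<String> upCommand(Path composeFile) {
        return Arrays.asList("docker-compose", "-f", composeFile.toString(), "up", "-d");
    }

    /** docker-compose -f FILE down (stop and remove the containers) */
    static List<String> downCommand(Path composeFile) {
        return Arrays.asList("docker-compose", "-f", composeFile.toString(), "down");
    }

    /** docker-compose -f FILE ps (list the environment's containers for the log) */
    static List<String> psCommand(Path composeFile) {
        return Arrays.asList("docker-compose", "-f", composeFile.toString(), "ps");
    }

    /** Runs a command, forwards its output to this process, and returns the exit code. */
    static int exec(List<String> command) throws IOException, InterruptedException {
        return new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

A test would run upCommand before its first test method and downCommand after its last one, treating a non-zero exit code from exec as a setup failure.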

Thanks for taking time to read this entry. Feel free to comment or contact me if you have questions.  


Monday, November 21, 2016

Book Review: Reactive Services Architecture - Design Principles for Distributed Applications

A friend of mine gave me a copy of the new book "Reactive Services Architecture - Design Principles for Distributed Applications" by Jonas Bonér. My friend was quite taken by the content and asked me what I thought of it. Having been working with microservices architecture and writing about them for a couple of years, I was intrigued and took up the gauntlet.

I was immediately struck by the word "Reactive" and wondered what was the difference between a "reactive" microservice and a non-reactive microservice. I started reading the book with this question nagging me along the way.

Book Summary

The book starts out with a traditional definition of the problem with monolithic applications: these are applications that have grown too large and complex to maintain and enhance with any reasonable ease and speed. It has a very well written introduction to the concept of microservices architecture and how it solves the problems presented by monolithic applications. The book describes basic principles of microservices architecture and also effectively compares microservices to traditional SOA in a manner that's very easy to read and understand. Principles described in this section are:
  • Microservices do one thing and do it well; they have a single functional (e.g. business) purpose.
  • Microservices most often evolve from monoliths; they are not often used for completely new applications.
The second chapter refines the single responsibility trait that microservices have and introduces us to several additional microservice principles. They are principles that are present in previous discussions of microservice architectures and aren't really new, but are explained very clearly and concisely.  These principles are:
  1. Microservices act autonomously; they are context independent.
  2. Microservices own their own state / data store.
  3. Microservices should have location transparency; they use a service discovery mechanism so they can scale effectively and have clustering / resilience.
  4. In a microservice world, what can be made asynchronous or non-blocking, should be made asynchronous (some term this "eventually consistent").
  5. In a microservice world, planning for failure is a necessity.
I did find the discussion on isolation at the beginning of chapter two extremely interesting. Isolation is presented as a coupling of a service to time (when execution occurs) and space (where the service is hosted). Removing this coupling (i.e. isolating services) is a prerequisite to adhering to principles 1, 3, 4, and 5 above. I had never explicitly identified this type of coupling previously in my writing and consider it insightful.

The author does discuss reliance on messaging technologies as a part of the definition of a "reactive" microservice. I look at messaging as one of many techniques for making service calls asynchronous. I see messaging as a technique for supporting principle 4 and mitigating 5 and not really fundamental to the architecture. Usually, architecture definitions are principle based and not reliant on a specific implementation tactic. I agree with the author in that designers usually "assume" real-time needs instead of challenging them and making more portions of a system asynchronous. The author seems to assume the opposite. To be fair on pp 36-8 (chapter 3), the author does acknowledge that there are portions of a system that must be synchronous.

There are reasons for using ("persistent") messaging to make service calls asynchronous rather than spawning the work in a separate thread. Work in a spawned thread will be lost if a service instance dies, whereas it won't be if persistent messaging is used. The author really doesn't discuss this point or address why "messaging" (persistent or not) is the only option he presents for making work asynchronous.

Chapter three discusses several cross-cutting concerns that frequently come up with people newly introduced to microservice architectures. They are:
  • Service discovery as a means to support location transparency.
  • API Gateways as a way to manage evolving contracts over time.
  • Messaging as a way to support several of the principles outlined above.
  • Security management 
  • Minimizing data coupling and coordination costs
The author does point to event driven architecture and conflict-free replicated data types (CRDTs) as being natural complements to microservice architectures. Essentially, the event-log becomes the source of truth for a microservice and the underlying database is simply a convenient "cache" of that truth. These concepts are touched on, but not really explored in depth. To be fair, these are weighty topics and likely deserve books of their own; declaring in-depth discussion of them "out of scope" for this book is reasonable.

Reviewer Summary -- Rates 4 Stars out of 5

This book is a great summary for those looking for an overview of microservices architecture. All concepts are explained concisely, clearly, and in an easy to understand writing style. As Lightbend is handing out free copies (at the time of this writing), it's certainly cost-effective.

This book respects the reader's time. Time is a scarce resource for me as I'm sure it is for many. At 47 pages, it can easily be read in one or two sittings.

This book doesn't try to sell products. When I saw the company name "Lightbend" on the cover, I was a little nervous that there would be plugs for Lightbend products. There aren't. Akka is mentioned as an implementation option, as are Apache Camel and other products, but nothing that's overt marketing.

I'm not convinced that a "reactive" microservice is any different than a normal microservice. While I agree that all service calls that can be non-blocking/asynchronous should be, that's not really new or different. It is a microservice design best practice to be sure, but not really a different architecture style. The book does contain mentions of reactive programming and the Reactive Manifesto, but not really enough to link the book's content specifically to those constructs for me. In the end, it's this point that kept my rating out of the five-star category.

You might wonder why I published this review on my blog instead of on Amazon. Because I'm a book author in this genre myself, Amazon will not publish my reviews of other computer-related books. Thanks for reading this article.

Friday, September 23, 2016

Using Java Thread Dumps to Diagnose Application Performance

On a holiday weekend last year, I got an emergency call from a client. One of their Java EE applications would freeze and stop servicing users within an hour after container start-up. I was called in to help investigate. I started off by requesting a thread dump and memory dump of the container once it had stopped accepting requests. It's the thread dump that I'm focusing on today.  That Java thread-dump is here (package scrubbed to protect the client).

During that exercise, I noticed that when I analyze thread dumps, I look for the same things. Whether it's a performance issue or some sort of freezing issue, I manually scan the thread dump for the same types of conditions. This year, working for another client, I'm faced with a performance tuning exercise that will likely require analysis of numerous thread dumps, and I wasn't looking forward to the busy work. That prospect got me to do some introspection, figure out exactly what I look for, and find or build a product that does this.

Threads that block other threads

Most developers know the syntax behind using Java's synchronized keyword and that it's used to ensure that one and only one thread executes a section of code or uses a given Java resource at one time. I'm not going to digress into a discussion of lock monitors and coding issues; if you need a refresher, please see this article. Most experienced developers use synchronization with extreme care as bugs with synchronization are intermittent and extremely hard to diagnose and fix.

Frequent symptoms of synchronization issues are performance problems or cases where applications freeze and no longer accept client requests. Essentially, synchronization causes other threads servicing other client requests to wait until the needed Java resource is available for use. Those waiting client threads are typically in a BLOCKED state. I immediately suspected this type of issue in the problem I was investigating over that holiday weekend. Here's a sample of a BLOCKED thread entry in a thread dump:

"http-bio-" daemon prio=6 tid=0x000000001cf24000 nid=0x2054 waiting for monitor entry [0x0000000022f9c000]
   java.lang.Thread.State: BLOCKED (on object monitor)
 at java.beans.Introspector.getBeanInfo(Introspector.java:160)
 - waiting to lock <0x0000000680440048> (a com.sun.beans.WeakCache)
 at org.apache.axis.utils.BeanUtils$1.run(BeanUtils.java:92)
 at java.security.AccessController.doPrivileged(Native Method)

Note that the dump explicitly lists that it's waiting on resource 0x0000000680440048 which is owned by this thread:

"http-bio-" daemon prio=6 tid=0x000000001bf06000 nid=0x21b0 runnable [0x000000002e0dd000]
   java.lang.Thread.State: RUNNABLE
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2521)
 at java.lang.Class.privateGetPublicMethods(Class.java:2641)
 at java.lang.Class.getMethods(Class.java:1457)
 at java.beans.Introspector.getPublicDeclaredMethods(Introspector.java:1280)
 - locked <0x0000000680440048> (a com.sun.beans.WeakCache)
 at java.beans.Introspector.internalFindMethod(Introspector.java:1309)

It turns out that more than one thread was waiting on this resource.
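To make that manual cross-referencing concrete, here is a minimal sketch that pairs each lock with its owner and its waiters. It assumes the HotSpot dump format shown above ("- locked <id>" and "- waiting to lock <id>" lines under a quoted thread header). The class name LockContention is hypothetical, and this is not necessarily how any existing tool does it. Threads are labelled by name plus tid because names alone may not be unique.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical analyzer: maps each monitor to its owning thread and to the threads blocked on it. */
final class LockContention {
    private static final Pattern HEADER = Pattern.compile("^\"(.*)\".*?tid=(0x[0-9a-f]+)");
    private static final Pattern LOCKED = Pattern.compile("- locked <(0x[0-9a-f]+)>");
    private static final Pattern WAITING = Pattern.compile("- waiting to lock <(0x[0-9a-f]+)>");

    /** Lock id -> label of the thread that holds it. */
    final Map<String, String> owners = new LinkedHashMap<>();
    /** Lock id -> labels of the threads blocked waiting for it. */
    final Map<String, List<String>> waiters = new LinkedHashMap<>();

    static LockContention parse(String dump) {
        LockContention result = new LockContention();
        String current = null;
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                // A new thread entry starts; remember which thread later lines belong to.
                current = header.group(1) + " tid=" + header.group(2);
                continue;
            }
            if (current == null) {
                continue; // preamble before the first thread entry
            }
            Matcher locked = LOCKED.matcher(line);
            Matcher waiting = WAITING.matcher(line);
            if (locked.find()) {
                result.owners.put(locked.group(1), current);
            } else if (waiting.find()) {
                result.waiters.computeIfAbsent(waiting.group(1), k -> new ArrayList<>()).add(current);
            }
        }
        return result;
    }
}
```

Any lock id that appears in waiters with a long list is a candidate bottleneck, and owners tells you which thread's stack to read to see what it is doing while holding the lock.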

IO bound threads

One frequent source of application performance issues is threads that are waiting on Input/Output to occur. This often takes the form of a database read or write or a service call of some type. Many developers assume that most performance issues are caused by slow database access and start looking at SQL queries. I do not make this assumption. Furthermore, if it is a database tuning issue, you need to identify the specific SQL that needs to be tuned. At any rate, if the source of your performance issue is IO, thread dumps can help you identify where in your code the issue is taking place.

Here is an example thread that's IO-bound:

"QuartzScheduler_Worker-2" prio=6 tid=0x000000001abc7000 nid=0x2208 runnable [0x000000001df3e000]
   java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:152)
 at java.net.SocketInputStream.read(SocketInputStream.java:122)
 at java.io.DataInputStream.readFully(DataInputStream.java:195)
 at java.io.DataInputStream.readFully(DataInputStream.java:169)
 at net.sourceforge.jtds.jdbc.SharedSocket.readPacket(SharedSocket.java:850)
 at net.sourceforge.jtds.jdbc.SharedSocket.getNetPacket(SharedSocket.java:731)
... (many thread stack entries omitted for brevity)
 at com.jmu.scholar.dao.FormRuleDAO.findByFormId(FormRuleDAO.java:50)

Note that within the thread stack, there's an explicit reference to application code that's initiating the IO. In this case, IO is being initiated by a database query in a specific application method.

Note that the dump catching this one occurrence of the query doesn't by itself mean it's a performance issue. If, however, a large percentage of running threads are IO-bound in the same method, then this database access would become a tuning target. To tune this specific database access, a developer can focus on this one query instead of looking at all queries within the application. How to tune the database access is out of scope for this blog entry.
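The check described above can be approximated in a few lines. The sketch below treats a thread as IO-bound when its top stack frame is one of a short, assumed list of JDK methods that block on IO, then reports the first frame belonging to the application. The class name IoBoundThreads, the frame list, and the package-prefix parameter are all illustrative assumptions. It also assumes thread entries are separated by blank lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical analyzer: finds the application frame behind each IO-bound thread. */
final class IoBoundThreads {
    /** Assumed markers of blocking IO (Java 8 era names); extend for your own stack. */
    private static final List<String> IO_FRAMES = Arrays.asList(
            "java.net.SocketInputStream.socketRead0",
            "java.net.SocketOutputStream.socketWrite0",
            "java.io.FileInputStream.readBytes");

    private IoBoundThreads() {}

    /** For each IO-bound thread, returns its first application frame ("" if it has none). */
    static List<String> applicationCallers(String dump, String appPackagePrefix) {
        List<String> callers = new ArrayList<>();
        for (String block : dump.split("\\R\\s*\\R")) { // one block per thread entry
            List<String> frames = frames(block);
            if (frames.isEmpty() || !isIoFrame(frames.get(0))) {
                continue; // the thread is not currently sitting in an IO call
            }
            String caller = "";
            for (String frame : frames) {
                if (frame.startsWith(appPackagePrefix)) {
                    caller = frame;
                    break;
                }
            }
            callers.add(caller);
        }
        return callers;
    }

    private static boolean isIoFrame(String frame) {
        for (String io : IO_FRAMES) {
            if (frame.startsWith(io + "(")) {
                return true;
            }
        }
        return false;
    }

    /** Extracts the "at ..." lines of one thread entry, top of stack first. */
    private static List<String> frames(String block) {
        List<String> frames = new ArrayList<>();
        for (String line : block.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("at ")) {
                frames.add(trimmed.substring(3));
            }
        }
        return frames;
    }
}
```

Dividing the size of the returned list by the number of running threads gives the share of threads that are IO-bound, and grouping the returned frames shows whether they cluster in one method.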

Performance Hot Spots

Most developers upon being asked how to tune an application in an interview will tell you to use a Java profiler. That answer misses the point. A profiler helps you tune a specific section of code after you've identified the section of code that needs to be tuned. Often performance issues show up in production and it's not possible to run a profiler on your container in production. 

A thread dump taken in production on an active application can help you identify which section of code needs to be tuned, perhaps with a profiler. Furthermore, thread dumps are unintrusive enough that you can take them in production without material impact to users.  

To see how dumps help, let's review how a sampling profiler works. It takes a thread dump periodically, perhaps every 5 milliseconds. Each thread dump shows where in your code you're spending time. For example, if your test causes the profiler to take 100 samples and method Work.do() appears in 33 of them, then you're spending 33% of your time in that method. If that's the method with the highest percentage, that is where you'll often start tuning.

In fact, thread dump data is better than profiler data in several ways:
  • It measures what's actually happening in production vs. a profile of a specific business process or unit test case.
  • It includes any synchronization issues between threads that won't show up in a profile of one thread in a unit test case (there are no other threads to contend with).

The problem is collecting the data. Yes, thread dumps are easy to collect, but counting occurrences of method references in running threads is laborious, tedious, and annoyingly time consuming.
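The counting itself is mechanical, which is what makes it a good candidate for automation. As a minimal sketch, assuming thread entries separated by blank lines and an application package prefix supplied by the caller, the tally might look like this. The class name HotSpots is hypothetical. Each method is counted at most once per RUNNABLE thread so that recursion doesn't inflate the numbers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical analyzer: counts how many RUNNABLE threads mention each application method. */
final class HotSpots {
    private HotSpots() {}

    /** Returns method name -> number of RUNNABLE threads containing it, highest count first. */
    static Map<String, Integer> count(String dump, String appPackagePrefix) {
        Map<String, Integer> counts = new HashMap<>();
        for (String block : dump.split("\\R\\s*\\R")) { // one block per thread entry
            if (!block.contains("java.lang.Thread.State: RUNNABLE")) {
                continue; // blocked and waiting threads are not consuming time in their methods
            }
            Set<String> seen = new HashSet<>(); // methods already counted for this thread
            for (String line : block.split("\\R")) {
                String trimmed = line.trim();
                int paren = trimmed.indexOf('(');
                if (!trimmed.startsWith("at " + appPackagePrefix) || paren < 0) {
                    continue;
                }
                String method = trimmed.substring(3, paren);
                if (seen.add(method)) {
                    counts.merge(method, 1, Integer::sum);
                }
            }
        }
        // Sort by descending count, then by name so the order is deterministic.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> sorted = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : entries) {
            sorted.put(entry.getKey(), entry.getValue());
        }
        return sorted;
    }
}
```

Dividing a method's count by the number of RUNNABLE threads gives the same kind of percentage a profiler reports, such as the 33 out of 100 samples in the earlier example.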

Introducing StackWise

The first thing I did was scan for products that analyze thread dumps for these types of issues. There are many products that analyze thread dumps, but they tend to be interactive tools that summarize threads and allow you to selectively expand and contract them. A couple categorized threads by state (e.g. RUNNABLE, BLOCKED, etc.). However, none of these really looked for the three items I look for. Hence, I created the product StackWise, which is open source and freely available for you to use.

StackWise will analyze a thread dump and report useful information on all three of these conditions.  A sample of StackWise output can be found here.  Note that you get the following items of information:
  • The percentage of threads that are IO bound and a summary of those IO-bound threads.
  • Threads that are locking resources needed by other threads.
  • A list of application method reference counts listed in descending order. 
In interpreting performance hot spots, StackWise will report application methods in which you're spending the most time. Methods belonging to ServletFilter classes can be ignored as they are often listed in all running threads.  Other method mentions, however, are possible tuning targets.

If you analyze thread dumps in ways other than what StackWise already covers, I'd like to hear your ideas.  Thanks for reading this entry.