Tuesday, July 17, 2018

Cloud Governance: Making DevOps Automation Effective

I see cloud automation of all types implemented to control and/or secure cloud assets. Examples of this type of automation using Amazon Web Services (AWS) include the following:
  • Preventing unauthorized entries in security groups allowing ingress from 0.0.0.0/0 (security)
  • Alerting when IAM users are created (possible security risk)
  • Forwarding application logs to Splunk (operational effectiveness)
  • Scheduling up-time for non-production assets (cost savings)
While these examples are AWS-specific, the principles discussed in this article are equally applicable if you're using Azure, GCP, or another cloud. 

Some of these examples (which are real-life examples, by the way) were very effective; others were only marginally so. The first two were extremely effective. If I defined a security group entry allowing unauthorized public access, that entry was automatically removed and an alert was sent to the security team. There was a process for adding an "authorized" security group allowing public access, but I had to justify the need. As for the second example, a newly created IAM user wasn't automatically removed, but an alert was sent to the security team and I was asked to justify the action.
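As an illustration of the first guardrail, here is a minimal sketch of how the remediation might be built: a Lambda handler triggered by a CloudWatch Events rule on CloudTrail's AuthorizeSecurityGroupIngress event. The handler shape and the SNS topic ARN are assumptions for illustration, not the actual implementation I encountered.

```python
# Minimal sketch: revoke public (0.0.0.0/0) ingress and alert the security
# team. Assumes a Lambda triggered by a CloudWatch Events rule on the
# AuthorizeSecurityGroupIngress CloudTrail event; the topic ARN is made up.
import boto3

ec2 = boto3.client('ec2')
sns = boto3.client('sns')
ALERT_TOPIC_ARN = 'arn:aws:sns:us-east-1:111111111111:security-alerts'

def handler(event, context):
    group_id = event['detail']['requestParameters']['groupId']
    group = ec2.describe_security_groups(GroupIds=[group_id])['SecurityGroups'][0]
    for permission in group['IpPermissions']:
        public_ranges = [r for r in permission.get('IpRanges', [])
                         if r.get('CidrIp') == '0.0.0.0/0']
        if not public_ranges:
            continue
        # Revoke only the offending CIDR, leaving authorized rules intact.
        revocation = {'IpProtocol': permission['IpProtocol'],
                      'IpRanges': public_ranges}
        if 'FromPort' in permission:
            revocation['FromPort'] = permission['FromPort']
            revocation['ToPort'] = permission['ToPort']
        ec2.revoke_security_group_ingress(GroupId=group_id,
                                          IpPermissions=[revocation])
        sns.publish(TopicArn=ALERT_TOPIC_ARN,
                    Subject='Public ingress revoked',
                    Message='Removed 0.0.0.0/0 ingress from ' + group_id)
```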

The example regarding forwarding application logs to Splunk was absolutely horrible. It took many hours to implement and was very fragile; new version deployments of applications broke the automation. The coding needed to automatically repair forwarding after an application deployment was cost-prohibitive. Consequently, logs were not reliably forwarded.

Scheduling up-time for non-production assets using the AWS Instance Scheduler is marginally effective. The solution works, but it depends on instance tags: an instance's up-time is scheduled only if the instance carries the required tags, and developers apply those tags inconsistently. That dependency keeps the customer from realizing the solution's full benefit.
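To make the dependency concrete, here is a minimal sketch of tag-driven scheduling with boto3. This is not the actual Instance Scheduler code, and the "Schedule" tag name is an assumption:

```python
# Sketch of tag-driven scheduling (not the actual Instance Scheduler code);
# the 'Schedule' tag name is assumed. Note the failure mode: instances
# missing the tag never match the filter and are silently skipped.
import boto3

ec2 = boto3.client('ec2')
pages = ec2.get_paginator('describe_instances').paginate(
    Filters=[{'Name': 'tag-key', 'Values': ['Schedule']}])
for page in pages:
    for reservation in page['Reservations']:
        for instance in reservation['Instances']:
            tags = {t['Key']: t['Value'] for t in instance['Tags']}
            print(instance['InstanceId'], 'uses schedule', tags['Schedule'])
```

This entire discussion leads to an obvious question.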

What characteristics make DevOps Automation effective?

The effectiveness of DevOps automation can be attributed to the following characteristics:

  • DevOps automation must be resilient
  • DevOps automation must be easy to install
  • DevOps automation must minimize ongoing manual labor
DevOps automation must be resilient to be effective. In other words, automation isn't effective if it breaks or stops working as intended when seemingly unrelated changes are made; the benefit of the automation simply isn't realized. In the case of the fragile Splunk forwarding, broken forwarding meant that support could not trust the results of Splunk searches. The benefit of having all application logs in one place was not realized, and developers had to go back to the CloudWatch source.

DevOps automation must be easy to install to be effective. Automation that's difficult to install is a classic "barrier to entry": installation difficulty increases the effective price of the automation, and automation that's never implemented provides no benefit. The Splunk forwarding example was extremely difficult to install due to its complexity as well as poor and inaccurate documentation. As a consequence, many teams didn't even attempt to implement the solution.

DevOps automation must minimize or eliminate ongoing manual labor to be effective. The purpose of automation is to eliminate manual labor. If the automation itself has manual labor requirements, that just negates some portion of the benefit. The AWS Instance Scheduler example requires manual labor in terms of tag maintenance. Consequently, the complete benefit of minimizing instance runtime costs is never realized. There's labor to pay for.

How do I use these principles to improve the automation I create?

Test your install documentation. Write your documentation, then have somebody who isn't familiar with your automation install it. If they have questions, your documentation has a defect. It might be that your documentation wasn't clear. It might be that you missed documenting something. Whatever the reason, treat those questions as defect reports.

Support your automation when it breaks or doesn't work as intended. If a consumer has trouble with your automation, help them fix it. After you fix the issue, do a root cause analysis. Figure out why the automation broke and improve it to prevent the problem from repeating. It could be that your automation makes invalid assumptions about the environment; in that case, there's a code change to make. It could be that the user didn't understand how to properly use your automation; that can be fixed by improving your documentation. Whatever caused the issue, fix it.

Solicit feedback on the ongoing care and feeding your automation requires. That ongoing care and feeding detracts from the benefits your automation provides. If there are changes you can make that further minimize or eliminate that ongoing work, you should consider them.

Let's consider the AWS Instance Scheduler example, which requires ongoing work to document instance schedules as tags on the instances themselves. If we wrote that automation, how could we further minimize that ongoing work? One way I can think of is to provide a way to specify a default schedule for an entire account. Tags on individual instances would then be optional.

Many companies use different AWS accounts for non-production and production resources. If I could provide a default schedule for all EC2 and RDS instances in the entire non-production account, individual tags on individual instances would be optional. There would still be a need for tagging options on the instances themselves for resources that need a custom schedule, but the default schedule would apply to all instances in an account without developers needing to remember to place scheduling tags. A large percentage of the ongoing work required by the AWS Instance Scheduler would go away. With that, the amount of money the organization saved using the automation would increase.
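Here's a sketch of that fallback logic. The DEFAULT_SCHEDULE value and the "Schedule" tag name are hypothetical configuration for illustration, not features I'm claiming the Instance Scheduler ships with:

```python
# Sketch: fall back to an account-wide default schedule when an instance
# carries no 'Schedule' tag. DEFAULT_SCHEDULE and the tag name are
# hypothetical configuration values.
import boto3

DEFAULT_SCHEDULE = 'office-hours'  # hypothetical account-wide default

ec2 = boto3.client('ec2')

def schedule_for(instance):
    # A per-instance tag still wins, so custom schedules remain possible.
    tags = {t['Key']: t['Value'] for t in instance.get('Tags', [])}
    return tags.get('Schedule', DEFAULT_SCHEDULE)

for page in ec2.get_paginator('describe_instances').paginate():
    for reservation in page['Reservations']:
        for instance in reservation['Instances']:
            print(instance['InstanceId'], 'uses schedule', schedule_for(instance))
```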

Consider leveraging the single installation model. Most custom automation I've seen is written to work only in the default region of the account in which it's installed. For example, if I install the automation in us-east-1 for account 1111111, that's the only place the automation is available. If I want to use that automation in other regions or other accounts belonging to the same organization, I need to install separate copies of it.

It is possible to code automation so that it operates in multiple regions and multiple accounts. That said, the code within the automation gets more complicated, likely requires cross-account roles, and requires a centralized configuration where it runs. For example, the AWS Instance Scheduler supports the single installation model. If you use that feature, there's additional configuration to specify the accounts and regions the scheduler will operate in, and cross-account roles are required to provide it access. Guidance on implementing the single installation model effectively is a separate topic and may be the subject of a future article.
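In the meantime, here's a rough sketch of the pattern: one central deployment assumes a cross-account role in each target account and then operates per region. The account list and role name are hypothetical centralized configuration:

```python
# Sketch of the single installation model: assume a cross-account role in
# each target account, then work through each region. The account IDs and
# role name are hypothetical configuration values.
import boto3

ACCOUNTS = ['111111111111', '222222222222']  # hypothetical
REGIONS = ['us-east-1', 'us-west-2']
ROLE_NAME = 'AutomationCrossAccountRole'     # hypothetical

sts = boto3.client('sts')

for account in ACCOUNTS:
    creds = sts.assume_role(
        RoleArn='arn:aws:iam::%s:role/%s' % (account, ROLE_NAME),
        RoleSessionName='automation')['Credentials']
    for region in REGIONS:
        ec2 = boto3.client(
            'ec2', region_name=region,
            aws_access_key_id=creds['AccessKeyId'],
            aws_secret_access_key=creds['SecretAccessKey'],
            aws_session_token=creds['SessionToken'])
        # Operate on this account/region pair; here, just count instances.
        count = sum(len(r['Instances'])
                    for page in ec2.get_paginator('describe_instances').paginate()
                    for r in page['Reservations'])
        print(account, region, count, 'instances')
```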

I hope this helps you with the automation you create. Comments and questions are welcome.




Sunday, May 27, 2018

Tips and Tactics for Passing the AWS Solution Architect Certification Exams

I've been using the AWS cloud platform since about 2010. When I embarked on the AWS certification path a couple of years ago, I knew it would be a challenge even with my experience; professional-level certs are some of the most challenging exams in IT. Having passed the AWS Solution Architect Professional and Associate certification exams, I've been asked by several people for tips on how to prepare and pass. The process is daunting given that the body of knowledge covered by the tests is incredibly broad. Many have blogged on this topic. This post describes tactics that worked for me; I hope they are of value to you.

Preparation and Study

Memory only gets you so far. The associate exam does have some questions on obscure service limits where memory skills help (e.g., How many VPCs can a user create by default? What is the limit on the number of instances in a placement group?). The professional exam is purely story problems: each question is a scenario, and you're expected to choose from the different design options presented. You need to learn how to apply the underlying principles AWS uses; you just can't memorize your way to success.

Networking skills should be considered a prerequisite. Most of my experience is in the application development realm. Fortunately, I had worked as a system administrator at different points and knew networking concepts. I have friends who went down the same path and had a very difficult time because they didn't have networking experience. You should be able to set up, from memory, a VPC with public and private subnets in different availability zones, including an internet gateway and a NAT gateway. Don't forget to install a couple of instances to make sure the public/private access works.
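As a self-check, here's a compressed boto3 sketch of that exercise. The CIDR blocks and us-east-1 availability zones are arbitrary choices, and error handling is omitted:

```python
# Build a VPC with a public and a private subnet in different AZs,
# an internet gateway for the public subnet, and a NAT gateway for
# outbound access from the private subnet.
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

vpc_id = ec2.create_vpc(CidrBlock='10.0.0.0/16')['Vpc']['VpcId']
public = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.1.0/24',
                           AvailabilityZone='us-east-1a')['Subnet']['SubnetId']
private = ec2.create_subnet(VpcId=vpc_id, CidrBlock='10.0.2.0/24',
                            AvailabilityZone='us-east-1b')['Subnet']['SubnetId']

# Internet gateway plus a public route table: the public subnet's path out.
igw = ec2.create_internet_gateway()['InternetGateway']['InternetGatewayId']
ec2.attach_internet_gateway(InternetGatewayId=igw, VpcId=vpc_id)
public_rt = ec2.create_route_table(VpcId=vpc_id)['RouteTable']['RouteTableId']
ec2.create_route(RouteTableId=public_rt, DestinationCidrBlock='0.0.0.0/0',
                 GatewayId=igw)
ec2.associate_route_table(RouteTableId=public_rt, SubnetId=public)

# NAT gateway in the public subnet gives the private subnet outbound access.
eip = ec2.allocate_address(Domain='vpc')['AllocationId']
nat = ec2.create_nat_gateway(SubnetId=public,
                             AllocationId=eip)['NatGateway']['NatGatewayId']
ec2.get_waiter('nat_gateway_available').wait(NatGatewayIds=[nat])
private_rt = ec2.create_route_table(VpcId=vpc_id)['RouteTable']['RouteTableId']
ec2.create_route(RouteTableId=private_rt, DestinationCidrBlock='0.0.0.0/0',
                 NatGatewayId=nat)
ec2.associate_route_table(RouteTableId=private_rt, SubnetId=private)
```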

Purchase one of the online courses. These courses whittle down the size of the haystack and in some cases include labs where you learn the different services by actually doing, not just reading. I've seen some bloggers recommend purchasing multiple courses, but I don't think this is wise: not only does it cost more money, it also takes more time.

I used the ACloud Guru courses for both the associate and professional exams. Honestly, the associate course prepared me better for the associate exam than the professional course did for the professional exam. I still consider the professional course worth the money. I did use some lectures from the associate course as prep for the professional exam, as it contained material on some services (like Redshift) that the professional course did not. The ACloud Guru courses include VPC labs that give you a chance to learn networking if that part of your experience is light.
Note: I'm not affiliated with ACloud Guru other than being a customer. 

Purchase at least one set of practice exams. I used the tests from Whizlabs for the professional exam. For the associate cert, I used the quizzes supplied with the ACloud Guru associate course and got to the point where I made 100% on those tests most of the time. Some of those questions literally appear on the exam word-for-word. I also purchased a set of practice tests for the associate exam from a vendor I lost track of. 

Practice exams not only give you confidence as you start doing well, they also optimize your time: they let you concentrate your study on the areas you miss rather than the things you already know well.

Take the AWS Practice Exam. While the Whizlabs tests are good, they aren't exactly like the AWS exams. The online experience with AWS is slightly different from Whizlabs', and AWS questions are wordier, particularly on the professional exam.

Research and verify your AWS Practice Exam answers. Capture screen shots of all questions and your answers. Rather than futz with screen shots during the timed test, I used Camtasia to make a movie of me taking the test and clipped the screen shots after. This exercise forces you to check your assumptions and better acquaints you with their exam tricks. 

For those taking the professional SA exam, take the free AWS sample questions test (English version). There are six questions. There's a trick here: the exact same test is published in Japanese, with the answers on the lower right for each question. You're welcome.

Test Taking Strategies
This section details the strategies I used when taking the tests.  

Eliminate wrong answers first. Even if the question relies on facts you don't know, often you can eliminate at least one or two possible answers on the list. Even if you end up guessing, that can bump your chance of a correct answer to 50%. Sometimes, you even get to apply the Sherlock Holmes logic that if there's only one answer left after you eliminate the wrong answers, that answer must be correct even if you don't see why.

On the professional exam, an incorrect fact or tactic is often used in multiple answers. If you spot the incorrect fact, it's easy to eliminate several answers in one shot. Sometimes there's more than one incorrect fact buried in multiple answers...

Reread the one sentence that specifies how to choose among the various answers. For example, some questions have you choose which strategy "minimizes costs" without any mention of other requirements you'd expect, like minimizing latency or achieving high availability. They will try to trick you by including a strategy that does minimize cost but also increases the chance of data loss or latency. Don't read into the question: select the strategy that minimizes costs even though other common objectives suffer.

Another tactic is to put the word "not" in the sentence: for example, which of these strategies does not improve latency? That three-letter word can send you down the wrong path if you miss it.

There will be questions that rely on facts you don't know. Obsessing over these questions when you're obviously reduced to a guess just wastes time you don't have. Eliminate what look like obvious wrong answers, take your best guess, and learn to live with the fact you might have missed that question.

Don't overuse the Flag feature. AWS exams have a 'flag' feature where you can mark specific questions and revisit them later. In the professional exam the other day, I flagged three of the 77 questions. If you flag most questions, it's the same as not having the feature at all, as you won't have time to revisit all of them.

For me to flag a question, it needed to pass both of the following criteria:
  1. The question doesn't center on facts or a service I simply don't know (if it does, I have to guess anyway).
  2. More time will genuinely improve my chances of getting the answer correct. (Be honest with yourself.)
If a question doesn't pass both criteria, I don't flag it. Devoting more time to the question would be a waste; I take my best guess and move on.

Caffeine lovers -- consider the espresso trick. If you're a highly caffeinated individual like I am, you'll want caffeine for the test. However, you do not want to spend time in the restroom during a timed test. Espresso, with a little cream to dilute the bitter taste, works quite well for this without giving you unwanted liquid. I consumed four shots right before both tests. In all honesty, if I had it to do over again, I would have ordered six for the professional exam, as it's so much longer.

I hope these thoughts help -- best of luck on the exams.