Saturday, November 14, 2020

When to execute ARM Templates with Terraform


ARM templates are the native automation mechanism for the Azure cloud platform. It is possible to execute ARM templates from Terraform using the azurerm_resource_group_template_deployment resource. To Azure professionals with less Terraform experience, this is appealing. It allows them to use their existing skills and provides some short-term productivity gains. While I understand the appeal, the tactic forfeits some of the benefits of using Terraform.

Don't use Terraform to run ARM templates unless you absolutely have to. The template deployment resource represents an Azure Deployment, not the resources that deployment creates. For example, if you execute an ARM template that creates a VNet, Terraform will only understand changes made to the ARM template itself. Executing a Terraform plan will *not* report changes to the underlying VNet. If somebody made a manual change to that VNet, Terraform will not detect the drift and re-apply the ARM template.
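
As a concrete illustration, here is a minimal sketch of the resource (the deployment name, resource group reference, and template file name are illustrative):

resource "azurerm_resource_group_template_deployment" "vnet" {
  # Terraform tracks this deployment object and its template text,
  # not the VNet the template creates.
  name                = "vnet-deployment"
  resource_group_name = azurerm_resource_group.example.name
  deployment_mode     = "Incremental"
  template_content    = file("${path.module}/vnet.json")
}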

Only use Terraform for ARM templates for new features that aren't in Terraform yet. This is rare, but it does happen. Microsoft enhancements are reflected in the REST APIs, and thus the ARM template schema, before they are incorporated in the SDK. Once new features are in the SDK, they are commonly reflected in Terraform very quickly. But there are enhancements (e.g., the VWAN additions) that take months to be completely incorporated in the SDKs.

For example, at the time of this writing, Terraform resources do not yet exist for Virtual WAN VPN sites and VWAN VPN site-to-site connections. I recently used the template deployment resource to manage that infrastructure because there was no other choice from a Terraform perspective.

Treat ARM template execution from Terraform as technical debt once Terraform resources exist for the features you need. That is, once Terraform formally supports the resources you need, you should enhance your Terraform to remove the ARM templates. This makes your Terraform more consistent and allows you to identify configuration drift. As with all technical debt, the work should be properly scheduled in light of the team's other priorities. To use my previous example, the ARM templates used to manage VWAN VPN sites and connections should be refactored once Terraform resources exist for those constructs.

When an ARM template execution fails in Terraform, Terraform doesn't record in the state file that the deployment was physically created. Consequently, to rerun the ARM template after corrections, you either need to manually delete the Azure deployment or do a Terraform import of that deployment before re-executing the Terraform configuration.
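
For example, either of the following clears the way for a rerun (the resource names and IDs are illustrative):

# Option 1: delete the orphaned deployment, then re-run terraform apply
az deployment group delete --resource-group my-rg --name vnet-deployment

# Option 2: import the deployment into state, then re-run terraform apply
terraform import azurerm_resource_group_template_deployment.vnet \
  /subscriptions/<subscription-id>/resourceGroups/my-rg/providers/Microsoft.Resources/deployments/vnet-deployment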

Some try to work around the deployment creation problem by generating unique deployment names. I consider this a kludge. It creates a large number of deployments to sift through if you want to review details on an error. It also means that Terraform will re-run ARM templates unnecessarily every time the configuration is executed.
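
To see why, consider this sketch of the anti-pattern (names are illustrative):

resource "azurerm_resource_group_template_deployment" "vnet" {
  # Anti-pattern: the name changes on every run, so Terraform
  # destroys and re-creates the deployment on every apply.
  name                = "vnet-deployment-${formatdate("YYYYMMDDhhmmss", timestamp())}"
  resource_group_name = azurerm_resource_group.example.name
  deployment_mode     = "Incremental"
  template_content    = file("${path.module}/vnet.json")
}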

Friday, October 23, 2020

Best Practices for Managing Feature Branches

Feature branches are a popular source code management tactic used to manage and coordinate changes made by development teams. A developer creates a feature branch from the main branch (typically master) and then merges the changes made on that feature branch back to the main branch when they are complete. This isolates changes made for a specific feature and limits the effect of feature enhancements on other team members until the change is ready.

When using feature branches, it's rare to directly develop using the master branch. In the example below, one developer might be working on a change called "feature 1" while another developer works on a separate enhancement "feature 2".  Each developer writes/commits code in isolation in separate branches.  When the enhancement is ready, that developer creates a pull request and merges the change back into master. The diagram below illustrates this example. Each bubble is a commit. 



Observations

The longer a feature branch lives, the higher the probability of integration problems when the feature branch is merged into master. In the example above, the developer of feature 2 might make changes that conflict with the changes made for feature 1; the longer the branches live, the more likely such conflicts become.

Feature branches work best if the branch contains one targeted enhancement. Including multiple changes in a feature branch often lengthens the time the feature branch lives. It also makes code reviews more difficult as the change is more complicated.

The more developers working on a codebase, the more discipline the team needs regarding source control and change management. Even with feature branches, the chance of code integration issues on merge increases with each developer added: the more developers making changes, the higher the probability that multiple developers are working on the same section of code at the same time. Yes, each developer is working in a separate branch, but those changes will be merged to master at some point.

Recommended Tactics

Feature branches should have a short life. Most of my feature branches live for less than one business day. They are narrow and targeted. If I need to make a "large change", I break it up into separate smaller changes using multiple feature branches. If a feature branch must live longer due to forces beyond my control, I rebase or merge in changes from master and address any merge issues as they arise.

Feature branches should represent changes from one and only one developer. When multiple developers make changes to the same feature branch, the chance of one developer's changes negatively impacting the others greatly increases.

Feature branches should be removed after they are merged into master. If you don't, the resulting branch pollution becomes confusing: the list of existing branches grows large, and it won't be obvious which branches are active and which are historical.
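
Cleanup is two commands (the branch name is illustrative):

git branch -d feature_branch              # delete the local branch
git push origin --delete feature_branch   # delete the remote branch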

Frequently rebase the feature branch against the master branch, and definitely rebase before merging back into the master branch (or creating a pull request, which will accomplish the merge when completed). It's common for developers to merge rather than rebase to incorporate new changes from master, and merge is more intuitive. That said, rebase keeps the feature branch's commits consolidated in git history, which makes that history much easier to interpret.

An example series of commands to accomplish this follows:

git checkout master
git pull
git checkout feature_branch
git rebase master



Some teams prefer to squash commits when merging the feature branch into the master branch. This consolidates log history on the master branch as feature branches typically have multiple commits. For example, feature 1 with three committed changes can optionally be merged into master as one change.  This makes git history more concise and easier to read. Squashing commits will lose commit history detail on the feature branch, however.
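
A sketch of a squash merge from the command line follows (the branch name and message are illustrative); many teams instead use the squash option offered in the pull request UI:

git checkout master
git merge --squash feature_branch
git commit -m "Feature 1: one-line summary of the change"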

Promptly respond to and participate in requested code reviews. Most teams use pull requests with code reviews as part of the process for merging feature branches into the master branch. It is common for developers to be slow to perform code reviews as they don't want the distraction. The trouble is that as long as the pull request is open, the feature branch it's associated with lives with it. The longer the feature branch lives, the greater the chance of integration issues.

Conclusion

Thanks for taking the time to read this article. I'd love to hear your thoughts.


Sunday, August 30, 2020

For Managers: DevOps Automation and Unintended Consequences

Most organizations adopting the cloud have adopted DevOps automation to some degree or another. The primary reason is that continued manual maintenance isn't possible with the same staffing level and an increased demand for a faster rate of change. Many aren't to the point of achieving 100% automation but are striving for it. By "automation", I refer to Infrastructure as Code (IaC), automated builds and deployments (CI/CD pipelines), machine image creation, security enforcement functions, etc. Most organizations struggle with the unexpected and unintended effects automation has on their technology silos. I've seen similar issues with most of my cloud adoption and DevOps/automation clients over the past few years.

Most organizations have several goals for consuming the cloud and adopting DevOps automation practices:
  • Increased speed to market for application capabilities
  • Increased productivity for IT staff
  • Increased scalability and performance of applications
  • Cost-effectiveness as footprint can dynamically scale to load
Steel Copy of a Wooden Bridge
Most organizations initially view cloud adoption and DevOps automation as just a technology change. Consequently, they adopt automation toolsets and keep all business management processes in place (e.g., request forms, manual approvals, the internal team structure that governs who does what, etc.). Unfortunately, the paradigm shift to cloud infrastructure and full automation doesn't really permit that with the same organization structure. The new world is just too different.

Using existing business processes without change will make it difficult to achieve increased speed to market and consistency between environments.

Pre-automation business processes don't fit the cloud or DevOps automation. DevOps automation is commonly introduced with cloud consumption. Typically, the business is looking for ways to provide additional business capabilities faster and more cost-effectively. Consequently, the number of applications and the amount of supported infrastructure increase. For many organizations, the business processes in place either can't easily support a larger software footprint or can't support the increased speed of change demanded by the business.

The structure of automation often doesn't match the existing organizational structure. For example, setting up a cloud landing pad usually involves not only defining cloud networks, but configuring on premises connectivity, defining and enforcing security policies, defining and enforcing cloud service usage, and much more. From a strictly technology/coding perspective, the automation for these items is tightly coupled and a large portion of it usually belongs in the same automated source code project. Most organizations will have broken responsibility for these items into several teams, usually in separate departments, with people who don't usually work closely together. 

As another example, it's typical for application developers to augment their responsibilities to include IaC automation to meet application needs. That is, virtual machines, application subnets, and allowed network ingress and egress for an application are managed by application development teams. Pre-automation, these items would have been managed by teams other than application development.

The implementation of infrastructure and application hosting drastically changes when consuming the cloud. New cloud consumers quickly find out that the new world is different and, consequently, existing business processes for allocating infrastructure and hosting applications no longer apply. For example, existing business processes don't accommodate cloud vendors' on-demand, self-service provisioning model.

Patching the Steel Copy
On realizing the problems and organizational friction created by automation described above, most organizations attempt to "patch" their existing organization and supporting business processes. That is, they adopt a series of minor changes to mitigate some of the problems described above. Examples I've seen are:
  • Establish a manual review for security changes by the security team
  • Assume the cloud is "untrusted" and establish cumbersome firewall rules to guard on premises networks
  • Establish silos for networking and security changes
  • Establish tight restrictions for use of cloud options and services
Any manual review will slow down velocity and productivity. The perception is that manual reviews increase safety, but they also slow everything down; to this extent, they throw the baby out with the bathwater. A major benefit of DevOps and cloud consumption is increased speed to market. That is, both should allow companies to make business capabilities available to end-users faster and increase competitive advantage. Manual reviews decrease if not eliminate this business benefit.

Organizational silos and restrictions create process bottlenecks and discourage innovation. The logic for silos is that they help companies achieve economies of scale for specialized skillsets. The trouble is that these silos can't keep up with application team demand. Application teams recognize the bottlenecks and adjust their designs to accommodate and streamline silo navigation rather than use the design they would like. In other words, they are discouraged from using new techniques that don't fit how the silos operate. While most companies provide an "exception" process that allows for a review of new tools, techniques, or procedures, exception processes are often cumbersome and time-consuming. In the end, organizational silos and restrictions depress productivity and slow the release of new business capabilities to end-users.

DevOps and cloud capabilities of companies often lag behind their needs. It takes time to get up to speed on cloud capabilities and DevOps practices. Consequently, the following often happens:
  • Initial environment set-ups and application deployments are much slower than expected.
  • Security vulnerabilities are discovered at an increasing rate due to staff inexperience.
  • The frequency of change for both management and staff is larger and more difficult than expected.
All the difficulties above depress productivity and reduce if not eliminate the benefits of DevOps and cloud consumption.

By now the reader might be second-guessing their decision to adopt DevOps practices and the cloud. That's not where I'm headed. They are definitely good decisions, but the management processes around them have to change as well.

Re-Write Management Processes from the Ground Up
By now, it should be obvious that patching existing management oversight and procedures has limitations. In fact, it won't really work to anyone's satisfaction. DevOps and cloud consumption require a management paradigm shift in many ways. Let's face it: management oversight methods and procedures that worked for a smaller on premises footprint simply don't work well for DevOps and the cloud. This section highlights many of the paradigm shifts managers face and the things that need to change.

Acknowledge that DevOps and cloud consumption require a change in the way you think about management and oversight. This is difficult for many to do and is resisted at first. Once the paradigm shift is recognized, it's much easier to objectively evaluate alternative means and methods. You won't achieve the benefits of consuming the cloud otherwise; you'll merely expand your footprint while keeping management and oversight processes that don't easily scale.

Automate management oversight for cloud assets. Since everything in the cloud is "software", management oversight policies can be automated so that they no longer require manual review. Automated enforcement, once established, is far more consistent and doesn't require labor in the same way. Yes, this automation will require enhancement and maintenance just like any other software, but it dramatically increases the productivity of your security and cloud specialists. This is a body of work that requires planning and implementation effort - it isn't a costless option. That said, in the long run, it is the most cost-effective option currently available.
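
As a hedged sketch of what automated oversight can look like, the following uses an AWS-managed Config rule to flag security groups that allow unrestricted inbound SSH. This is one of many possible approaches, and it assumes an AWS Config recorder is already enabled in the account:

resource "aws_config_config_rule" "restricted_ssh" {
  # AWS-managed rule: flags security groups permitting SSH from 0.0.0.0/0.
  # Assumes an AWS Config recorder is already running in this account.
  name = "restricted-ssh"

  source {
    owner             = "AWS"
    source_identifier = "INCOMING_SSH_DISABLED"
  }
}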

Management oversight automation will also allow the company to migrate to continuous delivery, and perhaps continuous deployment, someday. In fact, continuous deployment is not possible without automating approvals and eliminating manual steps.

Don't try to transition to DevOps and the cloud without help. Yes, you have smart people, and they will make the transition eventually. That said, it will take them a lot longer, and you will experience "rookie" mistakes and accrue technical debt along the way. Keep in mind that you need help from a strategy perspective at a management level in addition to ground-level skills. Companies that look at DevOps and cloud consumption as strictly a technology change run into the management troubles I've outlined above.

In Conclusion
This article comes from my experiences in the field. I help companies consume cloud technology and adopt DevOps tactics on a daily basis. That said, I'm always interested in hearing about your experiences. I hope that you find this entry useful and hope for many insightful comments. Thanks for your time.






Friday, May 29, 2020

Design Patterns for Cloud Management and DevSecOps

With the cloud (it doesn't matter which cloud vendor), truly all infrastructure and application management is software-based now. Consequently, most organizations manage their cloud footprint through code. Some organizations are further along that path, but most strive to achieve 100% infrastructure as code. Additionally, application infrastructure and releases are also managed as code. 

Having written code to manage cloud infrastructure, application infrastructure, and application build and release pipelines for years now, I frequently experience deja-vu. That is, I feel that I'm solving the same problem over and over again, sometimes with different technologies or cloud vendors, but really repeating the same patterns.

It's time we start thinking of infrastructure code and the various forms of CI/CD pipelines in terms of software design patterns. Patterns that are repeatable and don't need to be "re-invented" for every application, every cloud vendor, or every enterprise.

What is a Software Design Pattern?

This concept was invented and published in 1994 in a book entitled Design Patterns: Elements of Reusable Object-Oriented Software. The book was written by four authors usually referred to as the "Gang of Four" (GOF). While the book originally targeted object-oriented software languages, the "pattern" concept was incredibly successful and has gone on to be applied to many other types of technologies. 

Software design patterns usually have the following components:
  • Problem Statement -- a description of the problem being solved
  • An Example -- a real-world example to help explain the reason the pattern exists
  • Applicability Statement -- a description of when this pattern should be considered
  • Structure -- a description of the pattern in clear enough terms that somebody could implement it
  • Consequences -- A listing of the advantages and disadvantages of using the pattern, including any limitations.
The GOF book and many academic papers include some more sections and a more precise and detailed explanation for each component. I prefer a more practical approach.

What are the Design Patterns for Cloud Management and DevSecOps?

I'm currently dividing patterns into these categories:
  • Build Patterns
  • Application Release Patterns
  • Infrastructure Patterns


Build Patterns describe how source code is compiled, packaged, and made available for release. Additionally, many organizations apply automated testing as well as gather quality metrics. Build patterns currently identified are:
  • Packaging -- Includes any needed compilation. The output is something that can be included in a software release.
  • Automated Testing -- Includes any unit and/or integration testing needed to validate packaged software.
  • Metric Analysis -- Includes any static code analysis that analyzes code quality and complexity.


Application Release Patterns are patterns used to safely deploy packaged software produced by a build pattern. Application release patterns currently identified are: 
  • All at Once (Spray and Pray) -- Pattern to deploy software without concern for an outage
  • Rolling Deployment -- Pattern to deploy software incrementally to minimize user outage time.
  • Blue / Green -- Pattern to utilize cloud technologies to minimize user outage time and provide easy back-out.
  • Canary -- Variant of Blue/Green that incrementally directs users to a new version of software to minimize the impact of deployments with defects.
Infrastructure Patterns are patterns that create or update cloud infrastructure including networking, security policies, on premises connectivity, monitoring, logging, etc.  Infrastructure patterns currently identified are:
  • Infrastructure Maintenance -- Includes network, security, monitoring, and logging infrastructure, and much more
  • Image Production -- Create hardened virtual machine images often used by multiple applications or business units.
  • Mutable Infrastructure Maintenance -- Managing configuration updates for virtual machines that can't easily be destroyed and re-created at will.

Next Steps


Over the coming weeks, I'll document the patterns identified in this post. I'm always interested in patterns I might have missed.  Please feel free to contact me with questions, comments, and suggestions. Thanks for reading.

  

Saturday, November 16, 2019

Streamlining Tagging in Terraform Projects

Tagging resources in Azure or AWS Terraform projects used to be such a mind-numbing pain before the release of Terraform 0.12. For each resource in a Terraform project, the tag section was very verbose and very repetitive.  Now, with new variable functionality that comes with Terraform 0.12, I've fallen into a much more streamlined way of maintaining tags.

Tagging before Terraform 0.12

Before the new features in Terraform 0.12, a variable file often had a large section dedicated to tag values, and the resources assigned those tags had many lines dedicated to tagging as well. An example of what life used to be like is below. This ritual was repeated for any other resources in the project that needed tags.
resource "aws_instance" "webServer" {
 ami           = "${data.aws_ami.linux_ami.id}"
 instance_type = "t2.micro"
 subnet_id  = "${var.subnet_id}"
 key_name = "${var.key_pair}"

 tags {
   "Name" = "${var.instance_name}"
   "Environment" = "${var.environment}"
   "ChargebackDept" = "${var.chargeback}"
   "Business Priority" = "${var.priority}"
   # ... many more tags, ad nauseam!
 }

}

Tagging after Terraform 0.12

I've fallen into the habit of using Terraform's new variable typing construct along with the map variable type and the map and merge functions. Today, my tag references look like the following for all resources that need tags. The important part is the tags argument.

resource "aws_instance" "webServer" {
 ami           = "${data.aws_ami.linux_ami.id}"
 instance_type = "t2.micro"
 subnet_id  = "${var.subnet_id}"
 key_name = "${var.key_pair}"

 tags {
   tags = "${merge(var.project_tags, var.environment_tags, map("Name", var.instance_name))}"
 }
}

Notice I use the merge function to combine two maps together. The first is project-wide and the second contains environment-related tag names and values. I also use the map function to add an entry for this specific resource. When the same key appears in more than one argument, merge takes the value from the later argument, so resource-specific entries win over project-wide defaults.

In a variable file I typically call vars.tf, there are two variables declared. One has project-wide tag entries and another I reserve for environment-related entries.  Here's an example from vars.tf:
variable "project_tags" {
  type          = map
  default = {
    ChargebackDept         = "dan_haberer@vfc.com"
    BusinessPriority          = "VF Services"
 Include any number of tags in a format easy to maintain
  }
}

variable "environment_tags" {
  type          = map

}

Additionally, I typically use a tfvars file for environment-related tags. Here's an example:
environment_tags = {
  Environment = "Development"
}
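
The environment-specific values are then supplied at plan or apply time with the -var-file flag (the file name here is illustrative):

terraform apply -var-file="development.tfvars"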

Looking at the Pros and Cons

I like this approach over the previous approach for several reasons.

Tag names and values are easy to maintain. Most tag names and values are in one spot. 

There are not as many variables to declare. While the variables contain larger values in the form of maps, I don't need to handle nearly as many of them in my Terraform projects.

This solution is much less verbose. I've cut hundreds of lines from some of my Terraform projects with this trick alone.

I hope you find this trick useful. I'm always open to ways to streamline Terraform code even further.


Tuesday, July 17, 2018

Cloud Governance: Making DevOps Automation Effective

I see cloud automation of all types implemented to control and/or secure cloud assets. Examples of this type of automation using Amazon Web Services (AWS) include the following:
  • Preventing unauthorized entries in security groups allowing ingress from 0.0.0.0/0 (security)
  • Alerts for creating IAM users (possible security risk)
  • Forwarding application logs to Splunk (operational effectiveness)
  • Scheduling up-time for non-production assets (cost savings)
While these examples are AWS-specific, the principles discussed in this article are equally applicable if you're using Azure, GCP, or another cloud. 

Some of these examples (which are real-life examples, by the way) were very effective; some were only marginally effective. The first two examples were extremely effective. If I defined a security group allowing unauthorized public access, that entry in the security group was automatically removed and an alert issued to the security team. There was a process to add an "authorized" security group allowing public access, but I had to justify the need. As for the second example, if I created an IAM user, the user wasn't automatically removed, but an alert was generated and sent to the security team, and I was asked to justify the action.

The example regarding forwarding application logs to Splunk was absolutely horrible. It took many hours to implement. It was very fragile; new version deployments to applications broke the automation. The coding needed to automatically fix forwarding after an application deployment was cost prohibitive. Consequently, logs were not reliably forwarded.

Scheduling up-time for non-production assets using the AWS Instance Scheduler is marginally effective. The solution works. However, it depends on instance tags for functionality: instances must carry the required tags for their up-time to be scheduled, and developers are often inconsistent in applying those tags. That dependency makes the solution only marginally effective for the customer.
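
For illustration, here is a hedged sketch of the tagging the scheduler depends on (the tag key is configurable, "Schedule" is the solution's default, and the AMI variable and schedule name are illustrative):

resource "aws_instance" "dev_box" {
  ami           = "${var.ami_id}"
  instance_type = "t2.micro"

  tags {
    # The Instance Scheduler only manages this instance if the tag is present.
    "Schedule" = "office-hours"
  }
}

This entire discussion leads to an obvious question.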

What characteristics make DevOps Automation effective?

Effective DevOps automation shares the following characteristics:

  • DevOps automation must be resilient
  • DevOps automation must be easy to install
  • DevOps automation must minimize ongoing manual labor
DevOps automation must be resilient to be effective. In other words, automation isn't effective if it breaks or doesn't work as intended when seemingly unrelated changes are made. In the case of the fragile Splunk forwarding, broken forwarding meant that support could not trust the results of Splunk searches. The benefit of having all application logs in one place was not realized, and developers had to go to the CloudWatch source.

DevOps automation must be easy to install to be effective. Automation that's difficult to install is a classic "barrier to entry".  In other words, installation difficulty increases the effective price of the automation. Automation that's not implemented doesn't provide benefit.  The Splunk forwarding example was extremely difficult to install due to complexity as well as poor and inaccurate documentation. As a consequence, many teams didn't even attempt to implement the solution.

DevOps automation must minimize or eliminate ongoing manual labor to be effective. The purpose of automation is to eliminate manual labor. If the automation itself has manual labor requirements, that just negates some portion of the benefit. The AWS Instance Scheduler example requires manual labor in terms of tag maintenance. Consequently, the complete benefit of minimizing instance runtime costs is never realized. There's labor to pay for.

How do I use these principles to improve the automation I create?

Test your install documentation. Write your documentation and let somebody who is not familiar with your automation install it. If they have questions, it means that your documentation has a defect. It might be that your documentation wasn't clear. It might be that you missed documenting something. Whatever the reason, take those questions as defect reports.

Support your automation when it breaks or doesn't work as intended. If a consumer has trouble with your automation, help them fix it. After you fix the issue, do a root cause analysis. Figure out why the automation broke and improve it to prevent that problem from repeating. It could be that your automation makes invalid assumptions about the environment; in that case, there is a code change to make. It could be that the user didn't understand how to properly use your automation; that can be fixed by improving your documentation. Whatever caused the issue, fix it.

Solicit feedback on the ongoing care and feeding your automation requires. Ongoing care and feeding detracts from the benefits your automation provides. If there are changes you can make that further minimize or eliminate that ongoing work, you should consider them.

Let's consider the AWS Instance Scheduler example that requires ongoing work with regard to documenting instance schedules as tags on the instances themselves. If we wrote that automation, how could we further minimize that ongoing work? One way I can think of is to provide a way to specify a default schedule for an entire account. Tags on the individual instances would be optional.

Many companies use different AWS accounts for non-production and production resources. If I could provide a default schedule for all EC2 and RDS instances in the entire non-production account, individual tags on individual instances would be optional. There would still be a need for tagging options on the instances themselves for resources that need a custom schedule, but the default schedule would apply to all instances in an account without developers remembering to place scheduling tags. A large percentage of the ongoing work required by the AWS Instance Scheduler would go away. With that, the amount of money the organization saved using the automation would increase.

Consider leveraging the single installation model. Most custom automation I've seen is written to work in its default region in the account in which it is installed. For example, if I install the automation in us-east-1 for account 1111111, that's the only place the automation is available. If I want to use that automation in other regions or other accounts belonging to the same organization, I need to install separate copies of that automation.

It is possible to code automation so that it operates in multiple regions and multiple accounts. That said, the code within the automation would get more complicated and likely require cross-account roles. That automation would also require a centralized configuration where it runs. For example, the AWS Instance Scheduler supports the single installation model. If you use that feature, there's additional configuration to specify the accounts and regions the scheduler will operate in. Furthermore, cross-account roles are required to provide it access. Providing guidance on implementing the single installation model effectively is a separate topic and may be the subject of a future article.

I hope this helps you with the automation you create. Comments and questions are welcome.




Sunday, May 27, 2018

Tips and Tactics for Passing the AWS Solution Architect Certification Exams

I've been using the AWS cloud platform since about 2010. When I embarked on the AWS certification path a couple of years ago, I knew it would be a challenge even with my experience; professional-level certs are some of the most challenging exams in IT. Having passed the AWS Solution Architect Professional and Associate certification exams, I've been asked by several people for tips on how to prepare and pass. The process is daunting given that the body of knowledge covered by the tests is incredibly broad. Many have blogged on this topic. This post describes tactics that worked for me; I hope they are of value to you.

Preparation and Study

Memory only gets you so far. The associate exam does have some questions on obscure service limits where memory skills help (e.g., how many VPCs can a user create by default? What is the limit on the number of instances in a placement group?). The professional exam is purely story problems: each question is a scenario, and you're expected to choose among the design options presented. You need to learn how to apply the underlying principles AWS uses; you just can't memorize your way to success.

Networking skills should be considered a prerequisite. Most of my experience is in the application development realm. Fortunately, I had worked as a system administrator at different points and knew networking concepts. Friends of mine who went down the same path had a very difficult time because they didn't have networking experience. You should be able to set up, from memory, a VPC with public and private subnets in different availability zones, including an internet gateway and a NAT gateway, as sketched below. Don't forget to install a couple of instances to make sure the public/private access works.
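
As a hedged outline of that exercise (the CIDR ranges, availability zone, and all names are illustrative; a second availability zone and the test instances are omitted for brevity):

resource "aws_vpc" "lab" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "lab" {
  vpc_id = "${aws_vpc.lab.id}"
}

# The public subnet routes to the internet gateway.
resource "aws_subnet" "public" {
  vpc_id                  = "${aws_vpc.lab.id}"
  cidr_block              = "10.0.1.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true
}

# The private subnet reaches the internet only through the NAT gateway.
resource "aws_subnet" "private" {
  vpc_id            = "${aws_vpc.lab.id}"
  cidr_block        = "10.0.2.0/24"
  availability_zone = "us-east-1a"
}

resource "aws_eip" "nat" {
  vpc = true
}

# The NAT gateway lives in the public subnet.
resource "aws_nat_gateway" "lab" {
  allocation_id = "${aws_eip.nat.id}"
  subnet_id     = "${aws_subnet.public.id}"
}

resource "aws_route_table" "public" {
  vpc_id = "${aws_vpc.lab.id}"

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = "${aws_internet_gateway.lab.id}"
  }
}

resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.lab.id}"

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = "${aws_nat_gateway.lab.id}"
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = "${aws_subnet.public.id}"
  route_table_id = "${aws_route_table.public.id}"
}

resource "aws_route_table_association" "private" {
  subnet_id      = "${aws_subnet.private.id}"
  route_table_id = "${aws_route_table.private.id}"
}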

Purchase one of the online courses. These courses whittle down the size of the haystack and in some cases include labs where you learn the different services by actually doing, not just reading. I've seen some bloggers recommend purchasing multiple courses, but I don't think this is wise. Not only does it cost more money, it also takes more time.

I used the ACloud Guru courses for both the associate and professional exams. Honestly, the associate course better prepared me for the exam than the professional course did. I still consider the professional course worth the money. I did use some lectures in the associate course as prep for the professional as it contained material on some services (like Redshift) that the professional course did not. The ACloud Guru courses include VPC labs that give you a chance to learn networking if that part of your experience is light. 
Note: I'm not affiliated with ACloud Guru other than being a customer. 

Purchase at least one set of practice exams. I used the tests from Whizlabs for the professional exam. For the associate cert, I used the quizzes supplied with the ACloud Guru associate course and got to the point where I made 100% on those tests most of the time. Some of those questions literally appear on the exam word-for-word. I also purchased a set of practice tests for the associate exam from a vendor I lost track of. 

Practice exams not only give you confidence when you start doing well, they also optimize your time. They allow you to concentrate study on the portions you miss rather than the things you know well.

Take the AWS Practice Exam. While the Whizlab tests are good, they aren't exactly like the AWS exams. The online experience with AWS is slightly different than Whizlabs. Also, AWS questions are more wordy, particularly in the professional exam. 

Research and verify your AWS Practice Exam answers. Capture screen shots of all questions and your answers. Rather than futz with screen shots during the timed test, I used Camtasia to make a movie of me taking the test and clipped the screen shots after. This exercise forces you to check your assumptions and better acquaints you with their exam tricks. 

For those taking the professional SA exam, take the free AWS sample questions test (English version). There are six questions. There's a trick here: the exact same test is published in Japanese, with answers on the lower right for each question. You're welcome.

Test Taking Strategies
This section details the strategies I used when taking the tests.  

Eliminate wrong answers first. Even if the question relies on facts you don't know, often you can eliminate at least one or two possible answers on the list. Even if you end up guessing, that can bump your chance of a correct answer to 50%. Sometimes, you even get to apply the Sherlock Holmes logic that if there's only one answer left after you eliminate the wrong answers, that answer must be correct even if you don't see why.

On the professional exam, an incorrect fact or tactic is often used in multiple answers. If you spot the incorrect fact, it becomes easy to eliminate several answers in one shot. Sometimes, there's more than one incorrect fact buried in multiple answers...

Reread the one sentence that specifies how to choose among the various answers. For example, some questions have you choose which strategy "minimizes costs" without any mention of other requirements you'd expect, like minimizing latency or achieving high availability. They will try to trick you by including a strategy that does minimize cost but also increases the chance of data loss or latency. Don't read into the question: select the strategy that minimizes costs even though other common objectives suffer.

Another tactic is to put the word "not" in the sentence. For example: which of these strategies does not improve latency? That three-letter word can send you down the wrong path if you miss it.

There will be questions that rely on facts you don't know. Obsessing over these questions when you're obviously reduced to a guess just wastes time you don't have. Eliminate what look like obvious wrong answers, take your best guess, and learn to live with the fact you might have missed that question.

Don't overuse the flag feature. AWS exams have a 'flag' feature where you can mark specific questions and revisit them later. In the professional exam the other day, I flagged three of the 77 questions. If you flag most questions, it's the same as not having the feature at all, as you won't have time to revisit all of them.

For me to flag a question, it needed to pass the following criteria:
  1. The question doesn't center on facts or a service I simply don't know -- if it does, I have to guess anyway.
  2. Time will help improve my chances of getting the answer correct.  (Be honest with yourself)
If it doesn't pass both criteria, I don't flag it. Devoting more time to the question would be a waste. I take my best guess and move on.

Caffeine lovers -- consider the espresso trick. If you're a highly caffeinated individual like I am, you want caffeine for the test. However, you do not want to spend time in the restroom during a timed test. Espresso, with a little cream to dilute the bitter taste, works quite well for this without giving you unwanted liquid. I consumed four shots right before both tests. In all honesty, if I had it to do over again, I would have ordered six for the professional exam, as it's so much longer.

I hope these thoughts help -- best of luck on the exams.