Tuesday, March 8, 2022

Infrastructure Code and the Shifting Sand Problem

Infrastructure code (IaC) has a problem that application code does not: changes outside of infrastructure code impact its ability to function as intended. I call this the shifting sand problem. The goal is for infrastructure code executions (of the same version) to always produce the same result. In this sense, we want infrastructure code executions to be "repeatable" in the same way we strive to make application builds repeatable. When IaC executions aren't repeatable, unplanned work is the result. 

There are many sources of IaC shifting sand. I'll detail those sources and ways to mitigate the problem in this post. 

Common IaC shifting sand sources are below. They are caused by a mixture of technology change and organization management procedures. 
  • Automatic dependency updates
  • Cloud backend updates
  • Right-hand/Left-hand issues
  • Managing cloud assets from multiple sources
Configuring automatic dependency updates is the practice of using the latest version of IaC software or common IaC code revisions. Examples include Terraform and Cloud provider versions, common IaC code versions, virtual machine image versions, operating system updates,  and many more. Most automatic updates are put in place as a convenience: developers want to avoid the additional work of upgrading the versions used. The problem is that breaking changes from these dependencies will cause IaC code not to function. Then unplanned work results and somebody will need to fix the issue,  often at an inconvenient time with a looming deadline.

Some operating system and virtual machine image updates are security-related. Examples include anti-virus software updates,  etc. This type of update is often unavoidable. Rather, the potential cost of delaying these updates can be considerable. 

For avoidable automatic updates (not security-related), the best mitigation is to explicitly specify the versions of dependencies used.  Never use 'latest' or the newest available. Let upgrades of dependencies be driven by changes needed for planned enhancements for end-users.  Then the upgrade work is planned and scheduled as opposed to inconveniently timed.

For unavoidable automatic updates (e.g. security-related), early detection by automated testing is best. In addition, apply these updates in lower-level environments on a regular basis. Forcing such updates in lower-level environments first will increase the chance that issues not covered by automatic testing can be found before production. 

Cloud backend updates are breaking changes introduced by cloud vendor software. While Cloud vendors do attempt to make their changes backward compatible,  they don't always succeed. Without naming names, I've been in support calls with Cloud vendor technical support teams and seen them blackout product changes or frantically release product fixes. In addition, I've frequently had to frantically upgrade IaC code to accommodate Cloud vendor backend changes. As with the automatic update problem, unplanned (and inconveniently timed) work results.

Scheduled testing in a sandbox environment is the best mitigation strategy we've found. This strategy increases the chance of detection early before changes are actually needed in real environments. Depending on the components used, sandbox testing can be expensive to run. The more often sandbox testing is run, the earlier problems like this will be detected. Unfortunately, increased run frequency drives up costs. 

Right-hand/left-hand issues are created within the organization itself where one department makes changes that have ramifications they don't see in other departments. One example I've seen frequently is a group in charge of making security policy changes effectively "breaks" IaC code maintained by other departments. In essence,  tactics taken by that IaC code were no longer allowed. In this example, making the policy change is often necessary.  

Early detection through scheduled sandbox testing (as described above) is the best mitigation strategy for the team maintaining IaC code for specific applications. The same tradeoff between frequency and cost applies.

Managing cloud assets from multiple sources occurs when something besides one IaC pipeline manages a cloud asset. The most frequent example is manual change. When developers manually change assets that are managed by IaC, often drift results. That is, the cloud asset, in reality, differs from what is in the IaC pipeline. 

Another example is creating multiple IaC code bases to manage the same asset. I've seen this frequently in cloud asset tagging, which is frequently used for accounting charge-back purposes. Drift always results as multiple IaC code bases rarely come up with the same answer. As a result, all IaC code bases (except the one last executed) differ from reality. 

Bottom line: ensure that one and only one IaC code base manages each cloud asset. This prevents configuration drift. It saves aggregation and labor in staff who are often left with the mystery of figuring out how that drift happened. This is a topic I might explore more completely in another article.

I hope this helps.  I'm always interested in other types of problems you encounter maintaining IaC code.  Thanks for taking the time to read this article. 

Saturday, January 9, 2021

For DevOps Professionals: Evolutionary Terraform

Organizations that use Terraform to manage cloud infrastructure often create and maintain Terraform modules as the code base grows. Inevitably, complexity increases with the introduction of reusable code. DevOps teams, I've worked with struggle with the level of modularization they should use and how to more easily manage it. 

I think of the modularization of Terraform as an evolutionary process. The level of modularization needed when organizations first start out is different from what they need as they mature. This article will take you through a sensible evolutionary path that only increases code complexity when truly needed.

Just to clarify my terminology, a configuration is a Terraform project that is used to directly manage cloud infrastructure. That is, create a virtual network and subnets for a specific development environment. A module is a Terraform project that is designed for reuse and is used by configurations. For instance, I usually have a module that creates a configured virtual network, all component subnets. This functionality is used for multiple virtual networks in multiple environments.

In the Beginning

When new technologies are adopted, simplicity is and should always be the goal. Only accept complexity that is necessary and only when it becomes necessary. Terraform is no exception. Let's discuss some opening tactics.

Use source control for all Terraform code. This is easy and it should be used from the beginning. Repositories are easy and inexpensive these days.

Centrally manage Terraform state (e. g. back-end state). By default, Terraform will store the Terraform state on the device where the configuration is executed. All cloud platform Terraform providers provide a way to store state in the cloud instead of on the device doing the execution. This is generally easy to set up and reduces the risk of loss of the current Terraform state. Here are setup instructions for AWS, Azure, and GCP.

Adopt a standard Terraform project structure that incorporates configurations and modules. A typical directory structure for a Terraform repository looks like the following:

Note that a standard project structure separates configurations from modules.
This makes reusable code easy for developers to identify. How configurations should be structured would be a great topic for another article. Briefly, I generally separate network infrastructure, common services for all applications, and application infrastructure into separate configurations to make the blast radius more manageable.

Note that only configurations have environment tfvars files. As configurations are used to directly manage infrastructure, modules are more focused and do not need to be coupled with the concept of different environments. An illustration of where to place tfvars files follows:

When the Number of Coders Grows

When the number of DevOps professionals on the team grows, it's common for changes from one person to accidentally conflict with changes others are making. This lengthens the time associated with changes and slows the team's velocity. 

Use feature branches to organize changes. This allows developers to test their changes with less fear that another developer will accidentally interfere. I've addressed feature branch usage in detail in this post.

Test feature branch changes in a sandbox environment. A sandbox environment to me is an environment that can easily be destroyed and recreated if something goes awry. Do not run feature branches in any non-sandbox environment. This allows developers to test new code in isolation without fear of accidentally negatively impacting others.

Only apply changes from a CI/CD pipeline. This provides an execution history. If something unexpected happens, execution history can provide information as to what was run when. It also removes any differences between the environments and access executing from individual devices.

Schedule CI/CD pipeline plan or validate operations for each configuration. This will allow you to detect configuration drift. It also ensures that all configurations are at least correct as far as syntax and that a configuration hasn't been affected by a breaking change in one of the modules. 

Some organizations use TerraTest to automatically test Terraform configurations. While I support automated testing if you can do it, TerraTest requires GoLang knowledge that not all organizations have. Mandating TerraTest can be a big ask.

When the Number of Configurations Grows

As the number of Terraform configurations grows, typically the blast radius for changes to modules also grows. The reason is that module usage also grows. With a small number of configurations, it's easier to test each configuration that uses the module that is being changed. The test effort grows with a growing number of configurations. Either velocity slows to accommodate the larger blast radius, or testing isn't as thorough, and accidental defects are released.

Ensure that you adopt module coding best practices. This is a large topic and deserves its own article, but I summarize some key points in the Module Coding Best Practices section below. As the number of configurations grows, the opportunity for reuse increases, and the number of modules also grows.

Separate out all modules into a separate repository and formally release by version/tag. This allows consuming configurations to insulate themselves from module changes. If configurations consume the latest release, they run the risk of not working if a breaking change was made to the modules they consume. In essence, consuming specific versions/tags converts "unplanned" work to "planned" work. Module upgrades can be scheduled with time allowed for it if needed. 

Note that versioning modules reduce the risk of change for modules. I've seen slightly different versions of modules that do much the same thing occur because people fear accidentally breaking configurations they know nothing about. Versioning eliminates this risk as the modified code will be published with a new version.

Once a version/tag is released, never change its content. In this world, there should be no concept of forcing configurations to accept changes. Consuming configurations should always control and be able to plan for module upgrades. 

Only consume modules explicitly specifying a tag/version. Consuming the "latest" version increases the risk of unplanned work as discussed previously.

Module Coding Best Practices

These practices deserve their own article, but to summarize:

Only create a module that has at least two consuming configurations. Creating a module for use by only one configuration is classic YAGNI. It introduces complexity that isn't yet necessary.

Avoid data lookups in modules. Pass needed information as input variables. The reason is subtle. Data lookups will error if the target is not found. As we're talking about modules, they don't (and shouldn't) understand configuration context. If the target of the lookup doesn't exist, the first plan for a configuration using the module will error out. Using data lookups in configurations are perfectly fine as they understand execution context. This is subtle.

As an example, let's say the module virtual-machine, used by configuration app-fred, executes a data lookup for a specific subnet. Let's also say that configuration app-fred creates that subnet. Configuration app-fred will not successfully plan because the subnet module virtual-machine is looking for doesn't exist yet on the first run. Bottom line - modules should not do data lookups because they don't (and shouldn't) understand the execution context.

Ensure that all modules are documented in Markdown with a README. I usually include an example usage section with common input options. The objective is to make it quick and easy for developers to use the module. In my own README documentation for modules, I include the following sections:

  • A list of input variables and brief description if needed
  • A common usage example that consumers can copy/paste/change to their own configurations.

Parting Ideas

Only assume complexity needed. The later stages of evolution described here are not needed in the beginning. Avoid classic YAGNI (You Ain't Going to Need It).

You don't get away from change management. While the practices described here reduce friction as your Terraform usage grows, change management is still needed. Somebody or group still needs to organize changes in a way that recognizes and accommodates dependencies.

Thanks for reading this article. As always, please contact me or comment if you've alternative thoughts.


Saturday, December 26, 2020

For DevOps Professionals: Barriers to 100% Infrastructure as Code

I was asked the other day why a particular part of the cloud infrastructure was added manually and not automated. It was a very small manual part and a one-time setup, but none-the-less I experienced déjà vu. It occurred to me that I've been asked that question at every client I've had since I got heavily into infrastructure code. We use the phrase "100% infrastructure as code" often. In fact, the overwhelmingly vast majority of cloud infrastructure is implemented via code. However, there is always some very tiny portion of the infrastructure that seems to be provided manually. The percentage is probably closer to 99.x% in most organizations I've had the privilege to do work for. Why is that? Why does the percentage never seem to be 100%? Let's make this more concrete and list some examples of automation I've encountered that wasn't 100% automated and why.

Examples of Automation Barriers

High-bandwidth from the cloud to on premises data centers are rarely 100% automated. This is the case for both Azure Express Routes and AWS Direct Connect connections. The reason is that a 3rd party firm controls access to the on-ramp or colocation device (e.g. CoreSite, Equinix, etc.). If the organization has access, it's usually manually controlled by a separate network infrastructure team. In other words, these devices aren't completely available to automation engineers that are needed for the development of infrastructure code. In essence, the cloud connectivity to the express route circuit can be automated, but that circuit's connectivity to the on-ramp is usually not.

Automating DNS entries are problematic in many hybrid-cloud organizations. This is an organizational barrier and not a technical barrier. It is common for DNS entries to be controlled by a separate team in a manual fashion. DNS authority is tightly controlled as there is effectively one DNS environment for the entire organization most of the time. The fear is that automation defects could negatively affect non-cloud entries or resources. 

Automating security policies and the assignment of those policies is problematic in many organizations. Typically, security is handled by a separate team and usually a team without infrastructure automation skills. Consequently, I've seen automation engineers write code to establish security policies, but those policies are manually assigned by a separate team. In essence, the traditional test-edit cycle that automation engineers need for this type of development doesn't exist.

Frequently, the creation of AWS accounts or Azure subscriptions is not automated. The reason is that in most organizations, that creation and their placement in the organizational tree is controlled by a separate team without automation coding skills. Furthermore, a sandbox environment for this type of automation code development doesn't exist.

Organizations define some resources to be central to the entire enterprise and don't have environment segregation. Examples are Active Directory environments, DNS, and WANs. The problem with this is that changes to central resources such as this become "production" changes and are tightly controlled. When everything is a production change, the test-edit development cycle automation engineers need doesn't exist. 

After doing some introspection with these examples, I've identified several common barriers to implementing 100% infrastructure as code. It turns out that most of these limitations are not technology-based.

Environment limitations

Infrastructure code development requires support for test-edit loops. That is, automation engineers need to be able to run, edit, and re-run infrastructure code to correct bugs. Writing infrastructure as code is just like application development in many ways. Automation engineers need to be able to experience occasional failures without negatively impacting others. These requirements are usually accomplished by a "sandbox" environment that others are not using.

The app developer part of me wants those tests and the verification of the result automated just like other types of application code. That said, the tooling to support automated testing of infrastructure code is sketchy at best and definitely not comprehensive. Automated testing is worth doing, but it is definitely not comprehensive. There are definitely limits to automated test coverage for infrastructure code.

The sandbox environment used for infrastructure code development must support the add/change/delete of that environment without negative impact on others. Like other types of development, infrastructure code doesn't always work as intended the first time it runs. In fact, you should assume that infrastructure code development might actually damage the sandbox environment in some way.

Sandbox environments should be viewed as completely disposable. That is, they can be created or destroyed as needs require little effort. It needs to be easy for an automation engineer to create a new sandbox for infrastructure code development and destroy it afterward.

It's common for sandbox environments to have limitations. That is, it's difficult for sandbox environments to accommodate 100% of infrastructure code development. Dependency requirements (e.g. Active Directory, DNS, connectivity to SAS, or on premises environments) are primary examples. These limitations contribute to the small portion of the infrastructure that is at least partially maintained manually.

Organizational authority

Automation engineers and the service accounts used for automation must have 100% control of the infrastructure maintained by code. That is, one can't develop infrastructure code without the authority to do so. This code can't be completely developed and tested. Consequently, the portion of the infrastructure that directly interfaces such resources is often manual.

Earlier in the post, I provided several examples that fit this category. For example, organizations using proprietary DNS products (e.g. Infoblox) often don't want to pay for additional licenses to support infrastructure code testing. Additionally, as DNS is often implemented in a one-environment paradigm (only production without separate development environments), organizations are hesitant to allow automation engineers security credentials needed to support infrastructure code support.

Active Directory (A/D) environments also fit into this category in many organizations. As A/D is often used to grant security privileges, organizations are loath to grant automation engineers and automation service accounts needed privileges to create groups, edit group membership, and delete groups.

All too often, the solution to these types of issues is to do a portion of infrastructure manually.

Low benefit/cost ratio

For some types of infrastructure, organizations find the benefits obtained by a complete automated solution aren't worth the costs. In other words, third-party costs (e. g. software licensing) make the "juice isn't worth the squeeze". Some infrastructure dependencies cost too much in money, labor, or time to dependency set-up costs to make the automation practical. Sometimes the manual labor involved in maintaining some infrastructure items is very small, making the cost of infrastructure code for those items not worth the effort.

Resources that are rarely updated and take an extremely long time to create/destroy often aren't worth the cost of automation. As an example, the AWS Transit gateway is often an example.  

Lack of DevOps team discipline can increase the cost of infrastructure automation and lower the benefit/cost ratio. Without the good discipline to the development life-cycle for infrastructure code and good source control habits, it's common for development work by one automation engineer to negatively impact the work of others. This leads to an increase in manual work or a decrease in team velocity. 

The breadth of specialized skills needed for some types of infrastructure can lower the benefit/cost ratio. As an example, work with one client required specialized networking and A/D skills to set up a test RRAS VPN target. If I didn't have a team member with these skills, I could never have tested that the cloud-side VPN infrastructure code worked - it would have been untested until use in one of the non-sandbox environments. I've seen other examples with regard to relational database administration skills and other types of specialized labor. The breadth of knowledge often needed by automation engineers is daunting.

Concluding Message

My acknowledgment that there are barriers to implementing 100% infrastructure as code should not be used as an excuse not to automate. Infrastructure code has produced some of the best productivity gains since we embarked on adopting cloud technologies. I'll never give up pressing for higher levels of infrastructure code automation. That said, when I recognize some of these non-technology barriers to infrastructure code, I'll feel a little less guilty. Yes, I'll try to craft workarounds, but recognize that it isn't always possible in every organization.

Thanks for reading this post. I hope you find it useful.

Wednesday, December 16, 2020

For Managers: Cloud Governance through Automation

Cloud consumption and DevOps automation is not just a technology change. It is a paradigm shift that managers participate in as well, but don't always realize it. One of the paradigm shifts involves cloud governance. If managers apply governance tactics developed over the years, they risk many of the benefits obtained by cloud consumption including speed to market. Having seen this transformation at several organizations, I've some thoughts on the topic. Please take time to comment if you've thoughts that I haven't reflected here.

Place automated guardrails on cloud usage instead of manual review processes. In short, when new policies are needed or existing policies modified, work with a cloud engineering team instead of adding manual review points. The benefits are:

  • Fewer review meetings
  • Reduced manual labor with both management oversight and application team compliance
  • Added security as enforcement is more consistent and comprehensive
  • Evolves as your cloud usage grows and changes
  • Allows decentralized management of cloud resources which frees application teams to innovate more.

This is a paradigm shift over what was needed in data centers. Hardware infrastructure found on premises makes governance and its enforcement manual. This leads to long lead times to acquire and configure additional infrastructure and makes governance a constraint to bringing additional technical capabilities to application teams and users. Manual approvals and reviews are needed costing time and management labor.

In the cloud, infrastructure automation is possible because everything is now software. Networking, infrastructure build-outs, security privileges/policies, and much more are now completely software configuration and don't involve hardware. The software nature of the cloud makes the automation of governance in the cloud possible. Once automated, governance is no longer manual. Governance is enforced automatically that will provide enterprise safety. As a consequence, the need for manual approvals decreases if not entirely eliminated. This frees application development teams to innovate at a faster pace.

What types of automated guardrails are possible?

As the cloud is entirely software, the sky is the limit. That said, there are several guardrails that I see as implementation candidates.

Whitelist cloud services application teams can use. As an example, some organizations have legal requirements, such as HIPPA or FERPA, that need to be adhered to. These organizations usually have a need to whitelist services that are HIPPA or FERPA compliant. As another example, some organizations standardize on third-party CDN or security products. They commonly want to prohibit cloud-vendor based solutions that aren't a part of the standard solution.

Whitelist cloud geographic regions application teams can use. Some organizations don't operate world-wide and want cloud assets existing only in specific regions.

Automatically remediate or alert for security issues. Most organizations have specific plans for publishing cloud assets on the internet. As an example, one of my clients automatically removes non-authorized published ports to all internet addresses (CIDR within a few seconds after such a port is opened. Another example, a customer of mine provides alerts when people are provided security privileges in addition to non-security administration privileges.

Automatically report and alert on underutilized cloud resources. Underutilized resources often cost money to no benefit. These resources are generally computing resources such as virtual machines. Alerts like these provide ways to lower cloud spend as it's often possible to downsize the compute resources.

Automatically report and alert for unexpected cost increases. Alerts like these need sensible thresholds. This alert usually prompts a review and possible remediation of the application causing the cost increase. 

Schedule uptime for non-production resources to save money. Often, organizations don't schedule downtime for non-production environments off-hours. Enterprises operating worldwide might not have this option as effectively there aren't "off-hours".

How can automated guardrails avoid becoming a bottleneck?

Application teams do not like constraints of any type. Having been on application teams for many years, I understand their sentiment. There are ways to keep guardrail development from becoming a bottleneck.

Fund automated guardrail development and maintenance. Like any other software produced by the enterprise, automated guardrails need development and support resources. Without adequate funding, they won't react to changing needs on a timely basis. Additionally, recognize that inadequate funding for automated guardrails will result in productivity losses for individual application teams across the enterprise.

Work with application development teams to identify and prioritize needed enhancements. This provides visibility into the guardrail backlog. Additionally, application teams can participate in prioritizing enhancements. Make them part of the process.

As cloud platforms evolve and change, automated guardrail development and maintenance is an activity that never "ends". Cloud governance is a continually evolving feedback loop. There must be a reasonable process for application teams to propose modifications to existing guardrails. As cloud technology changes over time, advances are made in current cloud services and new services invented. as an example, one of my clients must restrict cloud services used to those that are HIPPA compliant. As advances are made, that list grows over time and needs to be revisited.

As a manager in charge of cloud governance, what does this change mean to me?

Declare "war" on manual approvals. Instead of adding manual review/approval processes to govern cloud usage, engage a DevOps or cloud engineering team to enforce your desired behavior. A colleague of mine calls these "meat-gates". They slow everything down, both for management and application teams. They hamper delivering new features to end-users by slowing down application teams.    

DevOps automation is your friend and ally. It allows you to set policy and not need to devote as much to enforcement. You specify "what" policies you want to be enforced. DevOps automation engineers construct and maintain the enforcement of the policies you choose.    


I hope you find these thoughts useful. I'm always open to additional thoughts. Thanks for reading this post and taking the time to comment.

Saturday, November 14, 2020

When to execute ARM Templates with Terraform


ARM templates are the native automation mechanism for the Azure cloud platform. It is possible to execute ARM templates from Terraform using resource azurerm_resource_group_template_deployment. To Azure professionals with less Terraform experience, this is appealing.  It allows them to use their existing skills and provides some short-term productivity gains. While I see the benefit, the tactic eliminates some of the benefits of using Terraform. 

Don't use Terraform to run ARM templates unless you absolutely have to. The template deployment resource represents an Azure Deployment, not the resources that deployment creates.  For example, if you execute an ARM template that creates a VNet, Terraform will only understand changes made to the ARM template. Executing a Terraform plan will *not* report changes that will be affected to the underlying VNet. If somebody made a manual change to that VNet, Terraform will not sense the change and re-apply the ARM template. 

Only use Terraform for ARM templates for new features that aren't in Terraform yet. This is rare, but it does happen. Microsoft enhancements are reflected in the REST APIs, and thus the ARM template schema, before enhancements are incorporated in the SDK. Once new features are in the SDK, they commonly are reflected in Terraform very quickly. But there are enhancements (e. g. the VWAN additions) that take months to be completely incorporated in the SDKs.

For example, at the time of this writing, Terraform resources do not yet exist for Virtual WAN VPN sites and VWAN VPN site-to-site connections. I recently used the template deployment resource to manage that infrastructure because there was no other choice from Terraform perspective. 

Consider Terraform execution of ARM templates after Terraform resources exist for new features as technical debt. That is, once Terraform formally supports the resources you need, you should enhance your Terraform to remove the ARM templates.  This makes your Terraform more consistent and allows you to identify configuration drift. As with all technical debt, the work should be properly scheduled in light of the team's other priorities. to use my previous example, the ARM templates used to manage VWAN VPN sites and connections should be refactored once Terraform resources exist for those constructs.

When an ARM template execution fails in Terraform, Terraform doesn't record the fact that the deployment was physically created in the state file. Consequently, to rerun the terraform ARM template after corrections, you either need to manually delete the Azure deployment, to do a Terraform import for that deployment to re-execute the Terraform configuration. 

Some try to work around the deployment creation problem by generating unique deployment names: I consider this kludge paste. It creates a large number of deployments to sift through if you want to review details on an error. It also means that Terraform will re-run ARM templates unnecessarily when the configuration is executed.

Friday, October 23, 2020

Best Practices for Managing Feature Branches

Feature branches are a popular source code management tactic used to manage and coordinate changes made by development teams. Developers create a feature branch is created from the main branch (typically master) and then merge the changes made to that feature branch back to the main branch when they are complete. This isolates changes made for a specific feature and limits the effect of feature enhancements on other team members until the change is ready.

When using feature branches, it's rare to directly develop using the master branch. In the example below, one developer might be working on a change called "feature 1" while another developer works on a separate enhancement "feature 2".  Each developer writes/commits code in isolation in separate branches.  When the enhancement is ready, that developer creates a pull request and merges the change back into master. The diagram below illustrates this example. Each bubble is a commit. 


The longer a feature branch lives, the higher the probability of integration problems when the feature branch is merged into master. In the example above, the developer for feature 2 might make changes that conflict with the changes made for feature 1. The longer a feature branch lives, the higher the likelihood that change conflicts occur.

Feature branches work best if the branch contains one targeted enhancement. Including multiple changes in a feature branch often lengthens the time the feature branch lives. It also makes code reviews more difficult as the change is more complicated.

The more developers working on a codebase, the more discipline the team needs regarding source control and change management. Even with feature branches, the chance of code integration issues on merge increases with each developer added. That is because the more developers making changes, the higher the probability of integration issues and code conflicts. The higher the probability that multiple developers are working on the same section of code at the same time. Yes, each developer is working in a separate branch, but those changes will be merged to master at some point.

Recommended Tactics

Feature branches should have a short life. Most of my feature branches live for less than one business day. They are narrow and targeted. If I need to make a "large change", I break it up into separate smaller changes using multiple feature branches.If a feature branch must live longer due to forces beyond my control, I rebase or merge in changes from master and address any needed merge issues.

Feature branches should represent changes from one and only one developer. When multiple developers make changes to the same feature branch, the chance of one developer of that feature branch negatively impacting other developers on that same feature branch greatly increases.

Feature branches should be removed after they are merged into master. If you don't, the resulting branch pollution will become confusing. The list of existing branches will grow to a large number. It won't be obvious which branches are active and which are historical.

Frequently rebase the feature branch against the master branch. Definitely rebase before merging back into the master branch (or creating a pull request which will accomplish the merge when completed). More importantly, it will consolidate changes made for the feature branch in git history. It's common for developers to merge rather than rebase to incorporate new changes from master.  Using merge is more intuitive.  That said, rebase makes git commit history easier to interpret as feature branch commits will be consolidated.  Additional information on rebasing can be found here.

An example series of commands to accomplish this follows:

git checkout master
git pull
git checkout feature_branch
git rebase master

Some teams prefer to squash commits when merging the feature branch into the master branch. This consolidates log history on the master branch as feature branches typically have multiple commits. For example, feature 1 with three committed changes can optionally be merged into master as one change.  This makes git history more concise and easier to read. Squashing commits will lose commit history detail on the feature branch, however.

Promptly respond and participate in requested code reviews. Most teams will use pull requests with code reviews as part of the process for merging feature branches into the master branch. It is common for developers to be slow to perform code reviews as they don't want the distraction. The trouble is that as long as the pull request is open, the feature branch it's associated with lives with it. The longer the feature branch lives, the greater the chance of integration issues.


Thanks for taking the time to read this article. I'd love to hear your thoughts.

Sunday, August 30, 2020

For Managers: DevOps Automation and Unintended Consequences

Most organizations adopting the cloud have adopted DevOps automation to some degree or another.  The primary reason is that continued manual maintenance isn't possible with the same staffing level and increased demand for a faster change rate. Many aren't to the point of achieving 100% automation but are striving for it. By "automation", I refer to Infrastructure as Code (IaC), automated builds and deployments (CI / CD Pipelines), machine image creation, security enforcement functions, etc. Most organizations struggle with the unexpected and unintended effect on the technology silos most have. I've seen similar issues with most of my cloud adoption and DevOps/automation clients for the past few years.

The goals most organizations have for consuming the cloud and adopting DevOps automation practices are several:
  • Increased speed to market for application capabilities
  • Increased productivity for IT staff
  • Increased scalability and performance of applications
  • Cost-effectiveness as footprint can dynamically scale to load
Steel Copy of a Wooden Bridge
All organizations initially view cloud adoption and DevOps automation as just a technology change. Consequently, they adopt automation toolsets and keep all business management processes in place (e. g. request forms, manual approvals, the internal team structure that governs who does what, etc.). Unfortunately, the paradigm shift to cloud infrastructure and full automation doesn't really permit that with the same organization structure. The new world is just too different.

Using existing business processes without change will make it difficult to achieve increased speed to market and consistency between environments.

Pre-automation business processes don't fit the cloud or DevOps automation. DevOps automation is commonly introduced with cloud consumption. Typically, the business is looking for ways to provide additional business capabilities faster and more cost-effectively. Consequently, the number of applications and the number of supported infrastructures increases. For many organizations, the business processes in place either can't easily support a larger software footprint. They don't support the increased speed of change demanded by the business.

The structure of automation often doesn't match the existing organizational structure. For example, setting up a cloud landing pad usually involves not only defining cloud networks, but configuring on premises connectivity, defining and enforcing security policies, defining and enforcing cloud service usage, and much more. From a strictly technology/coding perspective, the automation for these items is tightly coupled and a large portion of it usually belongs in the same automated source code project. Most organizations will have broken responsibility for these items into several teams, usually in separate departments, with people who don't usually work closely together. 

As another example, it's typical for application developers to augment their responsibilities to include IaC automation to meet application needs. That is, the management of virtual machines, application subnets, allowed network ingress and egress to an application is managed by application development teams. Pre-automation, these items would have been managed by different application teams.

The implementation of infrastructure and application hosting drastically changes when consuming the cloud. New cloud consumers quickly find out that the new world is different and consequently, existing business processes for allocating infrastructure and hosting applications in the cloud no longer apply. For example, existing business processes don't accommodate cloud vendors 

Patching the Steel Copy
On realizing the problems and organizational friction created by automation described above, most organizations attempt to "patch" their existing organization and supporting business processes. That is, they adopt a series of minor changes to mitigate some of the problems described above. Examples I've seen are:
  • Establish a manual review for security changes by the security team
  • Assume the cloud is "untrusted" and establish cumbersome firewall rules to guard on premises networks
  • Establish silos for networking and security changes
  • Establish tight restrictions for use of cloud options and services
Any manual review will slow down velocity and productivity. The perception is that this increases safety. However, manual reviews also slow everything down. To this extent, manual reviews through the baby out with the bathwater. A major benefit of DevOps and cloud consumption is increased speed to market. That is, both DevOps and cloud consumption should allow companies to make business capabilities available to end-users faster and increase competitive advantage. Manual reviews decrease if not eliminate this business benefit.

Organizational silos and restrictions create process bottlenecks and discourage innovation.  The logic for silos is that it helps companies achieve economies of scale for specialized skillsets. The trouble is that these silos can't keep up with application team demand. Application teams recognize the bottlenecks and adjust their designs to accommodate and streamline silo navigation rather than use the design they would like. In other words, they are discouraged from using new techniques that don't fit how the silos operate. While most companies provide an "exception" process that allows for a review of new tools, techniques, or procedures; exception processes are often cumbersome and time-consuming. In the end, organizational silos and restrictions depress productivity and slow the release of new business capabilities to end-users.

DevOps and cloud capabilities of companies often lag behind their needs. It takes time to get up to speed on cloud capabilities and DevOps practices. Consequently, the following often happens:
  • Initial environment set-ups and application deployments are much slower than expected.
  • Security vulnerabilities discovered at an increasing rate due to staff inexperience
  • The frequency of change for both management and staff is larger and more difficult than expected.
All the difficulties above depress productivity and reduce if not eliminate the benefits of DevOps and cloud consumption.

By now the reader might be second-guessing their decisions to adopt DevOps practices and the cloud. That's not where I'm headed. They are definitely good decisions, but 

Re-Write Management Processes from the Ground Up
By now, it should be obvious that patching existing management oversight and procedures has limitations. In fact, it won't really work for anyone's satisfaction. DevOps and cloud consumption requires a management paradigm shift in many ways. Let's face it. Management oversight methods and procedures that worked for a smaller on premises footprint simply don't work well for DevOps and the cloud. This section will highlight many paradigm shifts managers face and highlight things that need to change.

Acknowledge that DevOps and cloud consumption require a change in the way you think about management and oversight. This is difficult for many to do and is resisted at first. Once the paradigm shift is recognized, it's much easier to objectively evaluate alternative means and methods. You won't achieve the benefits of consuming the cloud otherwise. It expands your footprint with existing management and oversight processes that don't easily scale. 

Automate management oversight for cloud assets. Since everything in the cloud is "software", management oversight policies can be automated so that they no longer require manual oversight. Automated enforcement, once established, is much more consistent and doesn't require labor in the same way. Yes, this automation will require enhancement and maintenance just like any other software, but it increases the productivity of your security and cloud specialists exponentially. This is a body of work that will take planning and implementation effort - this isn't a costless option.  That said, in the long run, this is the most cost-effective option available currently.

Management oversight automation will also allow the company to migrate to continuous deployment and continuous delivery someday. In fact, continuous delivery is not possible without automating approvals and eliminating manual steps.

Don't try to transition to DevOps and the cloud without help. Yes, you retain smart people and they will get make the transition eventually. That said, it will take them a lot longer and you will experience "rookie" mistakes and accrue technical debt along the way. Keep in mind that you need help from a strategy perspective at a management level in addition to ground-level skills. Companies that look at DevOps and cloud consumption as strictly a technology change have trouble from a management perspective that I've outlined above.  

In Conclusion
This article comes from my experiences in the field. I help companies consume cloud technology and adopt DevOps tactics on a daily basis. That said, I'm always interested in hearing about your experiences. I hope that you find this entry useful and hope for many insightful comments. Thanks for your time.