Wednesday, September 14, 2022

Appropriate workloads for Kubernetes

I've been asked about hosting cloud applications in Kubernetes. People seem to assume that Kubernetes is the best practice for hosting containerized workloads in all cases. While I love Kubernetes and have used it successfully in numerous applications, It shouldn't be the default, out-of-hand, hosting solution for containerized workloads. 

Kubernetes is far from the simplest solution for hosting containerized workloads. Even with cloud vendors making Kubernetes clusters more integrated and easier to manage, they are still very complex and take highly specialized administration skillsets. If you're using Azure, Containerized Instances is a comparatively more straightforward method than AKS for deploying containerized workloads. In fact, Azure has several different ways to deploy containerized workloads. Most are easier than AKS/Kubernetes.

If you're using AWS, ECS or Lambda is comparatively easier than Kubernetes. In fact, AWS has at least 17 ways to deploy containerized workloads. Incidentally, with any cloud, it's possible to create a virtual machine and run a containerized workload on it: I don't recommend this. Bottom line: Kubernetes AWS ECS, and Azure Containerized Instances are application runners.

If you adopt Kubernetes as a hosting mechanism, the benefits should pay for the additional complexity. Otherwise, the additional maintenance headaches and costs of Kubernetes are not a wise investment. That begs the question, what types of applications are most appropriate for Kubernetes? Which application types should adopt a more straightforward hosting mechanism?

As an aside, containerizing your applications is the mechanism that separates the concern of hosting from the functionality of your application. Cloud vendors have numerous ways to deploy and run containerized applications. Vendor lock-in usually isn't an issue with containerization.

Workloads Well-Suited for Kubernetes

This section details common types of applications that may benefit from Kubernetes hosting.

Applications with dozens or hundreds of cloud-native services often benefit from Kubernetes. Kubernetes can handle autoscaling and availability concerns among the different services with less set-up per service. Additionally, cloud vendors have integrated their security and monitoring frameworks in a way that generally makes management of Kubernetes-hosted services homogenous with the rest of your cloud footprint.

Applications servicing multiple customers that require single-tenant deployments often benefit from Kubernetes. For these applications, there is one deployment of a set of services per "customer". For example, if there are 1000 customers, there will be 1000 deployments of the same set of services for each one. Given that the number of deployments grows and shrinks as customers come and go, Kubernetes streamlines the setup for each as well as provides constructs to keep each of the customer deployments separate.

Applications requiring custom Domain Name Services (DNS) resolution that can't be delegated to the enterprise custom DNS often benefit from Kubernetes. This is because Kubernetes has configurable internal networking capabilities. Often this happens more as a result of organizational structure than technical reasons. In many enterprises, DNS is managed by infrastructure teams and not application teams.

Applications in enterprises where IP address conservation is necessary can benefit from Kubernetes. For applications with internal services that only Kubernetes-hosted services needs access to, Kubernetes internal networking model called kubenet provides an internal network not visible outside the cluster. For example, an application with hundreds of microservices may only need to expose a small fraction of those services outside the cluster. The kubenet networking model conserves IP addresses as internal services don't need IP addressability outside the cluster.

In the cloud, we think of IP addresses as "free" and without charge. For firms with a large existing on premises network, IP address space is often not free. In most firms I've seen, on premises networks are tarballs without sensible IP address schemes. Often, nobody understands the entirety of what network CIDRs are in use and for what. Additionally, routing is often manually configured, making CIDR block additions labor-intensive.

Summary

Not every application should be hosted in Kubernetes. Simple applications with small transaction volume often don't require the additional complexity Kubernetes brings. 

Thanks for taking the time to read this post. I hope this helps.

Sunday, September 4, 2022

Policy-based Management Challenges and Solutions

One of the most common best practices for managing security in the cloud is policy-based management.  Policy-based management optimally prevents security breaches or at least alerts you to their presence. Additionally, it alieviates the need for as many manual reviews and approvals, which slow down development of new business capabilities. That said, policy-based management presents many challenges. This post details common challenges and tactics to overcome them. 

Challenge #1: Introducing New or Changed Policies

New or changes to existing policies often break existing infrastructure code (IaC) supporting existing applications. This occurs because at the time the IaC was constructed, the policy wasn't in place and the actions were allowed. This results in unplanned work for application teams and schedule disruptions. As policy makers are usually separate teams, they often don't pay the cost associated with the associated unplanned work.

Policy change announcements are often ineffective. Partially, this is due to volume of announcements in most organizations. The announcement of an individual policy gets lost in a sea of other announcements. Additionally, sometimes IaC developers do not completely understand or see the ramifications of the policy change. 

Challenge #2: Policies with Automatic Remediation

Installing policies that have automatic remediations in them can actually break existing infrastructure and the applications that rely on it. While automatic remediation for policies is appealing from a security perspective as it fixes an issue in a short time after a security hole is created, it really just kicks the can. Any resulting breakage will need to be repaired sometimes causing an outage for end users.

The IaC that produced the invalid infrastructure will no longer match the infrastructure that physically exists and needs to be changed. In other words, the automatic remediation causes unplanned work for other teams. Sometimes, new policies cause common IaC modules used by multiple teams to no longer work and not individual application infrastructure code.

Challenge #3: Adapting Policies to Advances in Technology 

Many policy makers only consider legacy mutable infrastructure. Mutable infrastructure is common on premises and consists of static virtual machines/servers that are created once and updated with new application releases when needed. Immutable infrastructure VMs are completely disposable. The VMs are still updated, but by updating the images they are created from and replacing the VMs in their entirety.

For example, it is common to place a policy that requires that automatic security updates be applied to virtual machines on a regular basis. The issue is that such policies assume that the VM has a long life as it would under a mutable infrastructure. Such a policy doesn't apply to immutable infrastructures. For immutable infrastructure, the base image needs security updates applied and any VMs built using it should be rebuilt and redeployed.

Cloud vendor technology changes at a rapid pace. Keeping cloud policies up to date with current advances is a challenge. In practice, policy makers are often out of date and make invalid assumptions. Effects of this I commonly see are:
  • Assuming that cloud vendor capabilities for securing network access remains the same.  Often,  these capabilities advance.
  • Assuming VM IP addresses are static can safely be used in firewall rules. In the cloud,  IP addresses can change quite frequently.
  • Assuming that VM images are changeable (vended provided images might not be)
  • Assuming that there will be no needed exceptions to security policies

Tactics to Mitigate Challenges

Always audit compliance to policies first before installing automatic remediation. That is alert teams of new compliance issues before changing anything automatically. This allows teams to accommodate a security policy change proactively before change is forced. Additionally, a reasonable lead time needs to be provided so that teams have the opportunity to mitigate the additional work.

Test new or changed policies with any related enterprise-wide common IaC modules. It is common for organizations with mature DevOps capabilities to centralize common IaC modules and reuse them for multiple applications. This allows organizations to leverage existing work instead of having multiple teams reinvent the wheel. For example, if a policy regarding AWS S3 buckets or Azure storage accounts is being changed, test any common IaC modules that use those constructs. Make policy compliance part of the test. Note that these tests should be automated so they can easily be rerun.

Any policies with automatic remediation must provide an exception capability. For example, if some VM images are purchased and not changeable according to the license. It is common for such images to be granted exceptions from related security policies. Additionally, I've seen exceptions granted for cloud vendor-provided Kubernetes clusters where underlying VMs don't and can't meet policy requirements. 

New policies should be deployed in lower environments first. This increases the chance that any errors or issues will be identified before the policy is applied to production. Be sure to allow a reasonable period of time in lower environments to increase the likelyhood that issues will be identified and addressed.

Summary

Policy-based management has challenges, but should still be considered best practice. Thanks for taking time to read this article. Please contact me if you have questions, concerns, or are experiencing challenges I've not listed here.


Monday, August 8, 2022

Move your Network to the Cloud Too!

Over the past year, I'm seeing indications of what will be a big trend in cloud consumption: let's move our network to the cloud along with data centers. I'm talking about the WAN network primarily which many enterprises maintain worldwide.  Local offices will still need connectivity to the WAN; it's just that they will increasingly become on-ramps to the worldwide WAN hosted in the cloud. In other words, data centers will no longer be the "center" for all network access. 

Graphically, the concept of moving the WAN to the cloud would look like figure 1 below. Notice how all data centers and offices are connected to the WAN that handles traffic between them. While the image doesn't describe it, the Cloud-based WAN is worldwide and can serve offices and data centers across the globe.

Figure 1: Cloud-based WAN Network




Let's contrast this with figure 2 which depicts the WAN network topology common in enterprises today. Note that public cloud access typically routes via data centers making enterprise application access data center centric. Worldwide connectivity is managed by a custom MPLS network.

Figure 2: Traditional Worldwide MPLS Network



I'm seeing several motivations for the change in thinking about how worldwide networks should be organized. I'll separate the reasoning into the following categories:
  • Complexity
  • Performance
  • Financial
  • Speed to Market

Complexity

The complexity of non-Cloud MPLS networks, the base for most enterprise worldwide WANs, is tremendous. MPLS networks typically require large amounts of hardware that needs to be upgraded and replaced regularly. They take a large networking staff. While some outsource that to an MSP provider, they are still necessary. Outsourcing a large portion of the network to cloud vendors outsources this complexity and associated maintenance to a large degree. They also tend to be replete with numerous vendor contracts.

The complexity increases the business risk of change. MPLS networks are rarely supported by testing sandbox environments and automation. Many still make changes manually leading to inevitable human error and outages for users. Utilizing cloud vendors makes it much easier to automate the WAN infrastructure and provides a sandbox environment to test networking-related changes. This decreases the business risk of changes to networking infrastructure. This is huge. For most enterprises, the WAN that integrates all data centers and offices is essential.

Simpler capacity planning requirements. Hardware and vendor contacts needed for worldwide MPLS networks require sophisticated capacity planning due to long lead time requirements. This requirement is much simpler with cloud WAN implementations. Capacity planning still exists,  but it is far simpler and is easily changeable and adaptable on the fly. 

Performance

Network latency is generally significantly lower (faster) using cloud-provided WAN networking than worldwide MPLS networks. While your mileage will vary depending on your MPLS implementation, so much R&D goes into cloud-provided WANs that the likelihood that an enterprise will keep up any network performance advantages over time is low. Face it, most firms just can't compete.

Network latency is higher (slower) accessing resources that require networking between on premises and the cloud. As more IT workloads move from on premises to the cloud, closer proximity to the cloud will yield better performance. To this end, I see more enterprises leveraging cloud VPN services, which are closer to most application workloads, yielding better performance.

Financial

Converting networking hardware and infrastructure from capital expense (CapEx) to operational expense (OpEx) is appealing to many enterprises from an accounting perspective. As with computing resources, you pay for what you use for cloud-based WANs without hardware expenditures and management.

Networking labor is expensive specialized labor. Outsourcing that labor to cloud providers is definitively cheaper. Some enterprises mitigate this cost by enlisting a managed services provider (MSP), but outsourcing that labor to cloud vendors is cheaper as it capitalizes on the cloud's economy of scale advantages.

Speed to Market

No more long lead times for MPLS network upgrades and capacity increases. Increasing capacity in a cloud-provided WAN is typically measured in hours, not months. Furthermore, cloud-provided WAN products benefit from the cloud's dynamic scaling capabilities. Increasing MPLS network capacity takes sophisticated capacity planning and typically long lead times due to additional hardware expenditures.

Additional Benefits

The firm gets access to research and development advances made by cloud providers. The R&D resources that cloud providers are investing in WAN technologies surpass what most enterprises are able or willing to invest in. This means that over time, any differences in functionality and performance are likely to appear in cloud vendors first.

A cloud-based WAN is a natural partner when combined with a cloud-based VPN capability. This makes sense especially if the cloud hosts a larger percentage of application compute resources. Consuming the cloud-providers VPN solution moves those compute resources closer to what users access. With that closer proximity, typically comes better performance.

A cloud-based WAN is a natural partner for integrating multiple cloud providers. That is, Your AWS footprint can be securely connected to your Azure or GCP footprint directly. This avoids the slower connection between the cloud providers through an on premises data center.

Concluding Remarks

I'm reporting what I'm seeing at clients. This idea made no sense when many had a small fraction of their IT footprint in the cloud. Now that most firms now have most of their footprint in the cloud,  thinking on how to provide worldwide access to internal users needs to evolve. And the time for that evolution has come.

If you have thoughts or feedback, please contact me directly via LinkedIn or Email. thanks for taking the time to read this article. 


Wednesday, August 3, 2022

Radical Idea: Let's do more Testing in Production

There are many different types of application testing. This article is entirely about system-level testing, the most outer-level user experience testing. System-level testing is also the most difficult to automate.

I'm talking about integration or testing of the application from an end-user perspective only. Many use the term system-level testing for this activity.   Other types of testing such as unit, performance, exploratory, and usability, are not a part of this article. Other forms of testing are essential, but not the focus here. 

System-level automated testing has too much friction.  It can't keep up.  It's the most challenging type of testing to automate. Because of that, many still perform system-level testing manually. The cost-benefit of automating these types of tests is elusive. This test automation certainly can't support high-performing DevOps teams' high-frequency change rates. 

System-level automated tests are fragile.  The slightest change or refactoring at the outer web layer breaks a large percentage of system-level tests. Automated testing at this level usually relies on labels programmers use for parameters and control identifiers. Programmers usually consider them free to refactor these labels for clarity without notice.

The lack of automated system-level testing impedes the firm's ability to implement continuous delivery. In turn, manual system-level testing lowers the lead-time, which is one of the DORA metrics, we seem to be using these days.

What's the Alternative?

Let's outsource system-level testing to end users. Rather, let's enlist a small percentage of end users to use a release candidate in production and measure their error rate. Additionally, those errors can be provided to the development team for remediation.

Instead of writing system-level tests, implement canary deployments and provide a release candidate version that is considered production and uses production databases and resources. The release candidate is production in every way, except that it's used by a small percentage of users. If the application is hosted in the cloud, it's possible to create a "sister" installation of an application in production that uses production resources in the same way the active version of the application does.

Remediate the release candidate until the error profile is acceptable for mainstream release. In other words, fail forward, don't roll back when errors are discovered. At some point, the release candidate will be considered stable and is made active for 100% of users. At this point, a new release candidate is created for new features and changes to be tested in the same way. 

This solution avoids the problem of automating system-level tests and all its problems in terms of friction and fragility. Sometimes, the winning move is not to play! What I propose doesn't skip testing. It just changes the paradigm under which that testing is conducted.

The testing that end users do is going to be more comprehensive than any test plan can provide. Moreover, testing by end users will concentrate on the most frequently performed tasks.  

There are diminishing returns to increasing the number of users directed to the release candidate. That is, you will discover more defects increasing the number of users on the release candidate from 0% to 2% than you will from 25% to 50%. 

If you monitor error rates on the application, automation can be built to support continuous delivery. In other words, if the release candidate reveals no increase in error rates over the current live version, automation can make the switch based on thresholds you configure.

This concept sounds scary,  but is it functionally different from what we experience today? We all see defects deployed to production despite our best testing efforts.  I'm just suggesting we use what we experience rather than pretend we can avoid it. 

Thanks for taking the time to read this article. I'm always eager for feedback.



Tuesday, March 8, 2022

Infrastructure Code and the Shifting Sand Problem

Infrastructure code (IaC) has a problem that application code does not: changes outside of infrastructure code impact its ability to function as intended. I call this the shifting sand problem. The goal is for infrastructure code executions (of the same version) to always produce the same result. In this sense, we want infrastructure code executions to be "repeatable" in the same way we strive to make application builds repeatable. When IaC executions aren't repeatable, unplanned work is the result. 

There are many sources of IaC shifting sand. I'll detail those sources and ways to mitigate the problem in this post. 

Common IaC shifting sand sources are below. They are caused by a mixture of technology change and organization management procedures. 
  • Automatic dependency updates
  • Cloud backend updates
  • Right-hand/Left-hand issues
  • Managing cloud assets from multiple sources
Configuring automatic dependency updates is the practice of using the latest version of IaC software or common IaC code revisions. Examples include Terraform and Cloud provider versions, common IaC code versions, virtual machine image versions, operating system updates,  and many more. Most automatic updates are put in place as a convenience: developers want to avoid the additional work of upgrading the versions used. The problem is that breaking changes from these dependencies will cause IaC code not to function. Then unplanned work results and somebody will need to fix the issue,  often at an inconvenient time with a looming deadline.

Some operating system and virtual machine image updates are security-related. Examples include anti-virus software updates,  etc. This type of update is often unavoidable. Rather, the potential cost of delaying these updates can be considerable. 

For avoidable automatic updates (not security-related), the best mitigation is to explicitly specify the versions of dependencies used.  Never use 'latest' or the newest available. Let upgrades of dependencies be driven by changes needed for planned enhancements for end-users.  Then the upgrade work is planned and scheduled as opposed to inconveniently timed.

For unavoidable automatic updates (e.g. security-related), early detection by automated testing is best. In addition, apply these updates in lower-level environments on a regular basis. Forcing such updates in lower-level environments first will increase the chance that issues not covered by automatic testing can be found before production. 

Cloud backend updates are breaking changes introduced by cloud vendor software. While Cloud vendors do attempt to make their changes backward compatible,  they don't always succeed. Without naming names, I've been in support calls with Cloud vendor technical support teams and seen them blackout product changes or frantically release product fixes. In addition, I've frequently had to frantically upgrade IaC code to accommodate Cloud vendor backend changes. As with the automatic update problem, unplanned (and inconveniently timed) work results.

Scheduled testing in a sandbox environment is the best mitigation strategy we've found. This strategy increases the chance of detection early before changes are actually needed in real environments. Depending on the components used, sandbox testing can be expensive to run. The more often sandbox testing is run, the earlier problems like this will be detected. Unfortunately, increased run frequency drives up costs. 

Right-hand/left-hand issues are created within the organization itself where one department makes changes that have ramifications they don't see in other departments. One example I've seen frequently is a group in charge of making security policy changes effectively "breaks" IaC code maintained by other departments. In essence,  tactics taken by that IaC code were no longer allowed. In this example, making the policy change is often necessary.  

Early detection through scheduled sandbox testing (as described above) is the best mitigation strategy for the team maintaining IaC code for specific applications. The same tradeoff between frequency and cost applies.

Managing cloud assets from multiple sources occurs when something besides one IaC pipeline manages a cloud asset. The most frequent example is manual change. When developers manually change assets that are managed by IaC, often drift results. That is, the cloud asset, in reality, differs from what is in the IaC pipeline. 

Another example is creating multiple IaC code bases to manage the same asset. I've seen this frequently in cloud asset tagging, which is frequently used for accounting charge-back purposes. Drift always results as multiple IaC code bases rarely come up with the same answer. As a result, all IaC code bases (except the one last executed) differ from reality. 

Bottom line: ensure that one and only one IaC code base manages each cloud asset. This prevents configuration drift. It saves aggregation and labor in staff who are often left with the mystery of figuring out how that drift happened. This is a topic I might explore more completely in another article.

I hope this helps.  I'm always interested in other types of problems you encounter maintaining IaC code.  Thanks for taking the time to read this article. 


Saturday, January 9, 2021

For DevOps Professionals: Evolutionary Terraform

Organizations that use Terraform to manage cloud infrastructure often create and maintain Terraform modules as the code base grows. Inevitably, complexity increases with the introduction of reusable code. DevOps teams, I've worked with struggle with the level of modularization they should use and how to more easily manage it. 

I think of the modularization of Terraform as an evolutionary process. The level of modularization needed when organizations first start out is different from what they need as they mature. This article will take you through a sensible evolutionary path that only increases code complexity when truly needed.

Just to clarify my terminology, a configuration is a Terraform project that is used to directly manage cloud infrastructure. That is, create a virtual network and subnets for a specific development environment. A module is a Terraform project that is designed for reuse and is used by configurations. For instance, I usually have a module that creates a configured virtual network, all component subnets. This functionality is used for multiple virtual networks in multiple environments.

In the Beginning

When new technologies are adopted, simplicity is and should always be the goal. Only accept complexity that is necessary and only when it becomes necessary. Terraform is no exception. Let's discuss some opening tactics.

Use source control for all Terraform code. This is easy and it should be used from the beginning. Repositories are easy and inexpensive these days.

Centrally manage Terraform state (e. g. back-end state). By default, Terraform will store the Terraform state on the device where the configuration is executed. All cloud platform Terraform providers provide a way to store state in the cloud instead of on the device doing the execution. This is generally easy to set up and reduces the risk of loss of the current Terraform state. Here are setup instructions for AWS, Azure, and GCP.

Adopt a standard Terraform project structure that incorporates configurations and modules. A typical directory structure for a Terraform repository looks like the following:


Note that a standard project structure separates configurations from modules.
This makes reusable code easy for developers to identify. How configurations should be structured would be a great topic for another article. Briefly, I generally separate network infrastructure, common services for all applications, and application infrastructure into separate configurations to make the blast radius more manageable.

Note that only configurations have environment tfvars files. As configurations are used to directly manage infrastructure, modules are more focused and do not need to be coupled with the concept of different environments. An illustration of where to place tfvars files follows:


When the Number of Coders Grows

When the number of DevOps professionals on the team grows, it's common for changes from one person to accidentally conflict with changes others are making. This lengthens the time associated with changes and slows the team's velocity. 

Use feature branches to organize changes. This allows developers to test their changes with less fear that another developer will accidentally interfere. I've addressed feature branch usage in detail in this post.

Test feature branch changes in a sandbox environment. A sandbox environment to me is an environment that can easily be destroyed and recreated if something goes awry. Do not run feature branches in any non-sandbox environment. This allows developers to test new code in isolation without fear of accidentally negatively impacting others.

Only apply changes from a CI/CD pipeline. This provides an execution history. If something unexpected happens, execution history can provide information as to what was run when. It also removes any differences between the environments and access executing from individual devices.

Schedule CI/CD pipeline plan or validate operations for each configuration. This will allow you to detect configuration drift. It also ensures that all configurations are at least correct as far as syntax and that a configuration hasn't been affected by a breaking change in one of the modules. 

Some organizations use TerraTest to automatically test Terraform configurations. While I support automated testing if you can do it, TerraTest requires GoLang knowledge that not all organizations have. Mandating TerraTest can be a big ask.

When the Number of Configurations Grows

As the number of Terraform configurations grows, typically the blast radius for changes to modules also grows. The reason is that module usage also grows. With a small number of configurations, it's easier to test each configuration that uses the module that is being changed. The test effort grows with a growing number of configurations. Either velocity slows to accommodate the larger blast radius, or testing isn't as thorough, and accidental defects are released.

Ensure that you adopt module coding best practices. This is a large topic and deserves its own article, but I summarize some key points in the Module Coding Best Practices section below. As the number of configurations grows, the opportunity for reuse increases, and the number of modules also grows.

Separate out all modules into a separate repository and formally release by version/tag. This allows consuming configurations to insulate themselves from module changes. If configurations consume the latest release, they run the risk of not working if a breaking change was made to the modules they consume. In essence, consuming specific versions/tags converts "unplanned" work to "planned" work. Module upgrades can be scheduled with time allowed for it if needed. 

Note that versioning modules reduce the risk of change for modules. I've seen slightly different versions of modules that do much the same thing occur because people fear accidentally breaking configurations they know nothing about. Versioning eliminates this risk as the modified code will be published with a new version.

Once a version/tag is released, never change its content. In this world, there should be no concept of forcing configurations to accept changes. Consuming configurations should always control and be able to plan for module upgrades. 

Only consume modules explicitly specifying a tag/version. Consuming the "latest" version increases the risk of unplanned work as discussed previously.

Module Coding Best Practices

These practices deserve their own article, but to summarize:

Only create a module that has at least two consuming configurations. Creating a module for use by only one configuration is classic YAGNI. It introduces complexity that isn't yet necessary.

Avoid data lookups in modules. Pass needed information as input variables. The reason is subtle. Data lookups will error if the target is not found. As we're talking about modules, they don't (and shouldn't) understand configuration context. If the target of the lookup doesn't exist, the first plan for a configuration using the module will error out. Using data lookups in configurations are perfectly fine as they understand execution context. This is subtle.

As an example, let's say the module virtual-machine, used by configuration app-fred, executes a data lookup for a specific subnet. Let's also say that configuration app-fred creates that subnet. Configuration app-fred will not successfully plan because the subnet module virtual-machine is looking for doesn't exist yet on the first run. Bottom line - modules should not do data lookups because they don't (and shouldn't) understand the execution context.

Ensure that all modules are documented in Markdown with a README. I usually include an example usage section with common input options. The objective is to make it quick and easy for developers to use the module. In my own README documentation for modules, I include the following sections:

  • A list of input variables and brief description if needed
  • A common usage example that consumers can copy/paste/change to their own configurations.

Parting Ideas

Only assume complexity needed. The later stages of evolution described here are not needed in the beginning. Avoid classic YAGNI (You Ain't Going to Need It).

You don't get away from change management. While the practices described here reduce friction as your Terraform usage grows, change management is still needed. Somebody or group still needs to organize changes in a way that recognizes and accommodates dependencies.

Thanks for reading this article. As always, please contact me or comment if you've alternative thoughts.


   

Saturday, December 26, 2020

For DevOps Professionals: Barriers to 100% Infrastructure as Code

I was asked the other day why a particular part of the cloud infrastructure was added manually and not automated. It was a very small manual part and a one-time setup, but none-the-less I experienced déjà vu. It occurred to me that I've been asked that question at every client I've had since I got heavily into infrastructure code. We use the phrase "100% infrastructure as code" often. In fact, the overwhelmingly vast majority of cloud infrastructure is implemented via code. However, there is always some very tiny portion of the infrastructure that seems to be provided manually. The percentage is probably closer to 99.x% in most organizations I've had the privilege to do work for. Why is that? Why does the percentage never seem to be 100%? Let's make this more concrete and list some examples of automation I've encountered that wasn't 100% automated and why.

Examples of Automation Barriers

High-bandwidth from the cloud to on premises data centers are rarely 100% automated. This is the case for both Azure Express Routes and AWS Direct Connect connections. The reason is that a 3rd party firm controls access to the on-ramp or colocation device (e.g. CoreSite, Equinix, etc.). If the organization has access, it's usually manually controlled by a separate network infrastructure team. In other words, these devices aren't completely available to automation engineers that are needed for the development of infrastructure code. In essence, the cloud connectivity to the express route circuit can be automated, but that circuit's connectivity to the on-ramp is usually not.

Automating DNS entries are problematic in many hybrid-cloud organizations. This is an organizational barrier and not a technical barrier. It is common for DNS entries to be controlled by a separate team in a manual fashion. DNS authority is tightly controlled as there is effectively one DNS environment for the entire organization most of the time. The fear is that automation defects could negatively affect non-cloud entries or resources. 

Automating security policies and the assignment of those policies is problematic in many organizations. Typically, security is handled by a separate team and usually a team without infrastructure automation skills. Consequently, I've seen automation engineers write code to establish security policies, but those policies are manually assigned by a separate team. In essence, the traditional test-edit cycle that automation engineers need for this type of development doesn't exist.

Frequently, the creation of AWS accounts or Azure subscriptions is not automated. The reason is that in most organizations, that creation and their placement in the organizational tree is controlled by a separate team without automation coding skills. Furthermore, a sandbox environment for this type of automation code development doesn't exist.

Organizations define some resources to be central to the entire enterprise and don't have environment segregation. Examples are Active Directory environments, DNS, and WANs. The problem with this is that changes to central resources such as this become "production" changes and are tightly controlled. When everything is a production change, the test-edit development cycle automation engineers need doesn't exist. 

After doing some introspection with these examples, I've identified several common barriers to implementing 100% infrastructure as code. It turns out that most of these limitations are not technology-based.

Environment limitations

Infrastructure code development requires support for test-edit loops. That is, automation engineers need to be able to run, edit, and re-run infrastructure code to correct bugs. Writing infrastructure as code is just like application development in many ways. Automation engineers need to be able to experience occasional failures without negatively impacting others. These requirements are usually accomplished by a "sandbox" environment that others are not using.

The app developer part of me wants those tests and the verification of the result automated just like other types of application code. That said, the tooling to support automated testing of infrastructure code is sketchy at best and definitely not comprehensive. Automated testing is worth doing, but it is definitely not comprehensive. There are definitely limits to automated test coverage for infrastructure code.

The sandbox environment used for infrastructure code development must support the add/change/delete of that environment without negative impact on others. Like other types of development, infrastructure code doesn't always work as intended the first time it runs. In fact, you should assume that infrastructure code development might actually damage the sandbox environment in some way.

Sandbox environments should be viewed as completely disposable. That is, they can be created or destroyed as needs require little effort. It needs to be easy for an automation engineer to create a new sandbox for infrastructure code development and destroy it afterward.

It's common for sandbox environments to have limitations. That is, it's difficult for sandbox environments to accommodate 100% of infrastructure code development. Dependency requirements (e.g. Active Directory, DNS, connectivity to SAS, or on premises environments) are primary examples. These limitations contribute to the small portion of the infrastructure that is at least partially maintained manually.

Organizational authority

Automation engineers and the service accounts used for automation must have 100% control of the infrastructure maintained by code. That is, one can't develop infrastructure code without the authority to do so. This code can't be completely developed and tested. Consequently, the portion of the infrastructure that directly interfaces such resources is often manual.

Earlier in the post, I provided several examples that fit this category. For example, organizations using proprietary DNS products (e.g. Infoblox) often don't want to pay for additional licenses to support infrastructure code testing. Additionally, as DNS is often implemented in a one-environment paradigm (only production without separate development environments), organizations are hesitant to allow automation engineers security credentials needed to support infrastructure code support.

Active Directory (A/D) environments also fit into this category in many organizations. As A/D is often used to grant security privileges, organizations are loath to grant automation engineers and automation service accounts needed privileges to create groups, edit group membership, and delete groups.

All too often, the solution to these types of issues is to do a portion of infrastructure manually.

Low benefit/cost ratio

For some types of infrastructure, organizations find the benefits obtained by a complete automated solution aren't worth the costs. In other words, third-party costs (e. g. software licensing) make the "juice isn't worth the squeeze". Some infrastructure dependencies cost too much in money, labor, or time to dependency set-up costs to make the automation practical. Sometimes the manual labor involved in maintaining some infrastructure items is very small, making the cost of infrastructure code for those items not worth the effort.

Resources that are rarely updated and take an extremely long time to create/destroy often aren't worth the cost of automation. As an example, the AWS Transit gateway is often an example.  

Lack of DevOps team discipline can increase the cost of infrastructure automation and lower the benefit/cost ratio. Without the good discipline to the development life-cycle for infrastructure code and good source control habits, it's common for development work by one automation engineer to negatively impact the work of others. This leads to an increase in manual work or a decrease in team velocity. 

The breadth of specialized skills needed for some types of infrastructure can lower the benefit/cost ratio. As an example, work with one client required specialized networking and A/D skills to set up a test RRAS VPN target. If I didn't have a team member with these skills, I could never have tested that the cloud-side VPN infrastructure code worked - it would have been untested until use in one of the non-sandbox environments. I've seen other examples with regard to relational database administration skills and other types of specialized labor. The breadth of knowledge often needed by automation engineers is daunting.

Concluding Message

My acknowledgment that there are barriers to implementing 100% infrastructure as code should not be used as an excuse not to automate. Infrastructure code has produced some of the best productivity gains since we embarked on adopting cloud technologies. I'll never give up pressing for higher levels of infrastructure code automation. That said, when I recognize some of these non-technology barriers to infrastructure code, I'll feel a little less guilty. Yes, I'll try to craft workarounds, but recognize that it isn't always possible in every organization.

Thanks for reading this post. I hope you find it useful.