Saturday, January 9, 2021

For DevOps Professionals: Evolutionary Terraform

Organizations that use Terraform to manage cloud infrastructure often create and maintain Terraform modules as the code base grows. Inevitably, complexity increases with the introduction of reusable code. DevOps teams I've worked with struggle with how much modularization to use and how to manage it more easily.

I think of the modularization of Terraform as an evolutionary process. The level of modularization needed when organizations first start out is different from what they need as they mature. This article will take you through a sensible evolutionary path that only increases code complexity when truly needed.

Just to clarify my terminology, a configuration is a Terraform project that is used to directly manage cloud infrastructure. For example, a configuration might create a virtual network and subnets for a specific development environment. A module is a Terraform project that is designed for reuse and is used by configurations. For instance, I usually have a module that creates a configured virtual network along with all of its component subnets. This functionality is used for multiple virtual networks in multiple environments.

In the Beginning

When new technologies are adopted, simplicity is and should always be the goal. Only accept complexity that is necessary and only when it becomes necessary. Terraform is no exception. Let's discuss some opening tactics.

Use source control for all Terraform code. This is easy to do and should be done from the beginning. Repositories are simple to create and inexpensive these days.

Centrally manage Terraform state (e.g., with a remote backend). By default, Terraform stores state on the device where the configuration is executed. Terraform ships with backends for the major cloud platforms that store state in the cloud instead of on the device doing the execution. This is generally easy to set up and reduces the risk of losing the current Terraform state. Here are setup instructions for AWS, Azure, and GCP.
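
As a minimal sketch of centrally managed state, assuming AWS and an S3 bucket that already exists, a backend block might look like the following. The bucket name, key, and region are hypothetical placeholders. Azure and GCP have their own backend types (azurerm and gcs) with different arguments.

```hcl
# Sketch only: bucket, key, and region are hypothetical placeholders.
terraform {
  backend "s3" {
    bucket  = "example-org-terraform-state"   # assumed to exist already
    key     = "network/dev/terraform.tfstate" # one state object per configuration and environment
    region  = "us-east-1"
    encrypt = true                            # encrypt the state object at rest
  }
}
```

After adding or changing a backend block, terraform init has to be run again so that Terraform can configure the backend.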

Adopt a standard Terraform project structure that incorporates configurations and modules. A typical directory structure for a Terraform repository looks like the following:
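
A minimal sketch of such a layout might look like this. All directory and file names are hypothetical and only illustrate the separation between configurations and modules:

```text
terraform-repo/
  configurations/
    network/
      main.tf
      variables.tf
    common-services/
      main.tf
      variables.tf
    app-fred/
      main.tf
      variables.tf
  modules/
    virtual-network/
      main.tf
      variables.tf
      outputs.tf
      README.md
    virtual-machine/
      main.tf
      variables.tf
      outputs.tf
      README.md
```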


Note that a standard project structure separates configurations from modules. This makes reusable code easy for developers to identify. How configurations should be structured would be a great topic for another article. Briefly, I generally separate network infrastructure, common services for all applications, and application infrastructure into separate configurations to make the blast radius more manageable.

Note that only configurations have environment tfvars files. Configurations directly manage infrastructure, so they need values for each environment. Modules are more focused and do not need to be coupled with the concept of different environments. An illustration of where to place tfvars files follows:
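
As a sketch of that placement, assume a hypothetical configuration at configurations/network that keeps one tfvars file per environment in an environments subfolder. The variable names and values are illustrative, not taken from a real project:

```hcl
# configurations/network/environments/dev.tfvars (hypothetical path and values)
# Only the configuration reads this file; modules receive values as inputs.
environment        = "dev"
vnet_address_space = ["10.10.0.0/16"]
```

A run for that environment would then select the file with terraform plan -var-file=environments/dev.tfvars. Each name in the file has to match a variable declared in the configuration.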


When the Number of Coders Grows

When the number of DevOps professionals on the team grows, it's common for changes from one person to accidentally conflict with changes others are making. This lengthens the time associated with changes and slows the team's velocity. 

Use feature branches to organize changes. This allows developers to test their changes with less fear that another developer will accidentally interfere. I've addressed feature branch usage in detail in this post.

Test feature branch changes in a sandbox environment. To me, a sandbox environment is one that can easily be destroyed and recreated if something goes awry. Do not run feature branches in any non-sandbox environment. This allows developers to test new code in isolation without fear of accidentally affecting others.

Only apply changes from a CI/CD pipeline. This provides an execution history. If something unexpected happens, the execution history can show what was run and when. It also removes the differences in tooling and access that arise when people execute from their individual devices.

Schedule CI/CD pipeline plan or validate operations for each configuration. A scheduled plan allows you to detect configuration drift. Either operation also ensures that all configurations are at least syntactically correct and that none has been affected by a breaking change in one of the modules it uses.

Some organizations use Terratest to automatically test Terraform configurations. While I support automated testing if you can do it, Terratest requires Go knowledge that not all organizations have. Mandating Terratest can be a big ask.

When the Number of Configurations Grows

As the number of Terraform configurations grows, typically the blast radius for changes to modules also grows. The reason is that module usage also grows. With a small number of configurations, it's easier to test each configuration that uses the module that is being changed. The test effort grows with a growing number of configurations. Either velocity slows to accommodate the larger blast radius, or testing isn't as thorough, and accidental defects are released.

Ensure that you adopt module coding best practices. This is a large topic and deserves its own article, but I summarize some key points in the Module Coding Best Practices section below. As the number of configurations grows, the opportunity for reuse increases, and the number of modules also grows.

Move all modules into a separate repository and formally release them by version/tag. This allows consuming configurations to insulate themselves from module changes. If configurations consume the latest release, they run the risk of not working if a breaking change was made to the modules they consume. In essence, consuming specific versions/tags converts "unplanned" work to "planned" work. Module upgrades can be scheduled, with time allotted for them if needed.

Note that versioning modules reduces the risk of changing them. I've seen near-duplicate modules that do much the same thing proliferate because people fear accidentally breaking configurations they know nothing about. Versioning eliminates this risk, as the modified code will be published with a new version.

Once a version/tag is released, never change its content. In this world, there should be no concept of forcing configurations to accept changes. Consuming configurations should always control and be able to plan for module upgrades. 

Only consume modules explicitly specifying a tag/version. Consuming the "latest" version increases the risk of unplanned work as discussed previously.
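
As a sketch of pinned consumption, assuming the modules live in a Git repository, a configuration could reference a specific tag as shown here. The repository URL, module subdirectory, tag, and input names are all hypothetical:

```hcl
# Sketch only: the repository URL, module path, tag, and inputs are hypothetical.
module "virtual_network" {
  # "//virtual-network" selects a subdirectory of the modules repository;
  # "?ref=v1.2.0" pins the module to the v1.2.0 tag.
  source = "git::https://example.com/example-org/terraform-modules.git//virtual-network?ref=v1.2.0"

  # Inputs are whatever the module declares; these two are illustrative.
  name          = "vnet-dev"
  address_space = ["10.10.0.0/16"]
}
```

Upgrading then becomes an explicit, reviewable change to the ref value. The version argument is only available for modules installed from a registry; Git sources are pinned through ref instead.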

Module Coding Best Practices

These practices deserve their own article, but to summarize:

Only create a module that has at least two consuming configurations. Creating a module for use by only one configuration is classic YAGNI. It introduces complexity that isn't yet necessary.

Avoid data lookups in modules. Pass needed information as input variables. The reason is subtle. Data lookups will error if the target is not found. As we're talking about modules, they don't (and shouldn't) understand configuration context. If the target of the lookup doesn't exist, the first plan for a configuration using the module will error out. Using data lookups in configurations is perfectly fine, as configurations understand the execution context.

As an example, let's say the module virtual-machine, used by configuration app-fred, executes a data lookup for a specific subnet. Let's also say that configuration app-fred creates that subnet. Configuration app-fred will not plan successfully because, on the first run, the subnet that module virtual-machine looks up doesn't exist yet. Bottom line: modules should not do data lookups because they don't (and shouldn't) understand the execution context.
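
The app-fred example can be sketched as follows, assuming the AWS provider purely for illustration. The file paths, variable names, instance type, and CIDR range are hypothetical, and the two files are shown in one block for brevity. The module takes the subnet ID as an input, and the configuration, which knows the subnet is created in the same run, passes it in:

```hcl
# --- modules/virtual-machine/main.tf (hypothetical module) ---
variable "subnet_id" {
  type        = string
  description = "ID of the subnet to place the instance in; supplied by the caller."
}

variable "ami_id" {
  type        = string
  description = "Machine image to launch; supplied by the caller."
}

# Avoided inside the module: a lookup such as
#   data "aws_subnet" "app" { ... }
# can fail at plan time if the subnet does not exist yet.
resource "aws_instance" "this" {
  ami           = var.ami_id
  instance_type = "t3.micro"
  subnet_id     = var.subnet_id
}

# --- configurations/app-fred/main.tf (hypothetical configuration) ---
variable "vpc_id" { type = string }
variable "ami_id" { type = string }

resource "aws_subnet" "app" {
  vpc_id     = var.vpc_id
  cidr_block = "10.10.1.0/24"
}

module "virtual_machine" {
  source    = "../../modules/virtual-machine" # local path keeps the sketch short
  ami_id    = var.ami_id
  subnet_id = aws_subnet.app.id # unknown until apply, which is fine for an input
}
```

Because the module references the subnet only through its input, Terraform can order the subnet before the instance, and the first plan does not depend on anything existing beforehand.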

Ensure that all modules are documented in Markdown with a README. The objective is to make it quick and easy for developers to use the module. In my own README documentation for modules, I include the following sections:

  • A list of input variables, with a brief description of each where needed.
  • A common usage example that consumers can copy/paste/change to their own configurations.

Parting Ideas

Only take on the complexity you need. The later stages of evolution described here are not needed in the beginning. Avoid classic YAGNI (You Ain't Going to Need It).

You don't get away from change management. While the practices described here reduce friction as your Terraform usage grows, change management is still needed. Some person or group still needs to organize changes in a way that recognizes and accommodates dependencies.

Thanks for reading this article. As always, please contact me or comment if you have alternative thoughts.