Tuesday, March 8, 2022

Infrastructure Code and the Shifting Sand Problem

Infrastructure as code (IaC) has a problem that application code does not: changes made outside the infrastructure code affect its ability to function as intended. I call this the shifting sand problem. The goal is for IaC executions (of the same version) to always produce the same result. In this sense, we want IaC executions to be "repeatable" in the same way we strive to make application builds repeatable. When IaC executions aren't repeatable, unplanned work is the result.

There are many sources of IaC shifting sand. I'll detail those sources and ways to mitigate the problem in this post. 

Common sources of IaC shifting sand are listed below. They are caused by a mixture of technology change and organizational management practices.
  • Automatic dependency updates
  • Cloud backend updates
  • Right-hand/left-hand issues
  • Managing cloud assets from multiple sources

Automatic dependency updates are the practice of always consuming the latest version of IaC software or shared IaC code. Examples include Terraform and cloud provider versions, shared IaC module versions, virtual machine image versions, operating system updates, and many more. Most automatic updates are put in place as a convenience: developers want to avoid the extra work of upgrading versions themselves. The problem is that a breaking change in any of these dependencies can stop IaC code from functioning. Unplanned work results, and somebody has to fix the issue, often at an inconvenient time with a looming deadline.

Some operating system and virtual machine image updates are security-related, such as anti-virus software updates. This type of update is often unavoidable; indeed, the potential cost of delaying it can be considerable.

For avoidable automatic updates (those that are not security-related), the best mitigation is to explicitly specify the versions of dependencies used. Never use 'latest' or whatever happens to be newest. Let dependency upgrades be driven by changes needed for planned end-user enhancements; then the upgrade work is planned and scheduled rather than inconveniently timed.
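
For example, a minimal Terraform sketch of explicit pinning might look like the following (the version numbers, provider, and module source are placeholders, not recommendations):

    terraform {
      # Pin the Terraform CLI version this configuration may run with.
      required_version = "= 1.1.7"     # hypothetical version, chosen deliberately

      required_providers {
        aws = {
          source  = "hashicorp/aws"
          version = "= 4.5.0"          # exact provider version, never 'latest'
        }
      }
    }

    module "network" {
      # Pin shared IaC code to a specific release as well.
      source  = "app.terraform.io/example-org/network/aws"   # hypothetical module
      version = "1.2.3"
    }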

For unavoidable automatic updates (e.g. security-related), early detection through automated testing is best. In addition, apply these updates in lower-level environments on a regular basis. Forcing such updates into lower-level environments first increases the chance that issues not covered by automated testing are found before they reach production.
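
As a sketch of "lower environments first" for virtual machine images (the image family and ID below are placeholders): non-production can deliberately track the newest patched image, while production pins an image that has already been exercised there.

    # Non-production: track the newest patched image so security updates land here first.
    data "aws_ami" "app_nonprod" {
      owners      = ["amazon"]
      most_recent = true

      filter {
        name   = "name"
        values = ["amzn2-ami-hvm-*-x86_64-gp2"]   # hypothetical image family
      }
    }

    # Production: pin the exact image that already passed non-production testing.
    variable "prod_ami_id" {
      type    = string
      default = "ami-0123456789abcdef0"           # placeholder ID promoted from non-prod
    }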

Cloud backend updates are breaking changes introduced by cloud vendor software. While cloud vendors do attempt to make their changes backward compatible, they don't always succeed. Without naming names, I've been on support calls with cloud vendor technical support teams and watched them back out product changes or frantically release product fixes. I've also had to scramble to upgrade IaC code to accommodate cloud vendor backend changes. As with the automatic update problem, unplanned (and inconveniently timed) work results.

Scheduled testing in a sandbox environment is the best mitigation strategy we've found. It increases the chance of detecting breakage early, before changes are actually needed in real environments. Depending on the components involved, sandbox testing can be expensive to run: the more often it runs, the earlier problems like this are detected, but increased run frequency drives up cost.
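
One way to drive that schedule from IaC itself is sketched below, assuming AWS EventBridge; the rule name and cron expression are placeholders, and the target that actually invokes the sandbox pipeline (a CodeBuild project, CI webhook, or similar) is assumed to be defined elsewhere.

    # Nightly trigger for the sandbox IaC test run. The event target that
    # invokes the pipeline is omitted here and assumed to exist elsewhere.
    resource "aws_cloudwatch_event_rule" "sandbox_iac_test" {
      name                = "sandbox-iac-nightly-test"   # hypothetical name
      description         = "Kick off the scheduled sandbox IaC test run"
      schedule_expression = "cron(0 5 * * ? *)"          # 05:00 UTC daily
    }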

Right-hand/left-hand issues are created within the organization itself, when one department makes changes whose ramifications it doesn't see in other departments. One example I've seen frequently is a group in charge of security policy making a change that effectively "breaks" IaC code maintained by other departments; in essence, tactics taken by that IaC code are no longer allowed. In cases like this, making the policy change is often necessary.

Early detection through scheduled sandbox testing (as described above) is the best mitigation strategy for the team maintaining IaC code for specific applications. The same tradeoff between frequency and cost applies.

Managing cloud assets from multiple sources occurs when something other than a single IaC pipeline manages a cloud asset. The most frequent example is manual change: when developers manually change assets that are managed by IaC, drift often results. That is, the cloud asset as it actually exists differs from what the IaC pipeline describes.

Another example is creating multiple IaC code bases that manage the same asset. I've seen this frequently with cloud asset tagging, which is often used for accounting charge-back purposes. Drift always results, because multiple IaC code bases rarely come up with the same answer; every code base except the one last executed differs from reality.
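
One way to keep tagging in a single place, assuming Terraform and the AWS provider (the tag values here are placeholders), is to declare the charge-back tags once, in the one code base that owns them, using the provider's default_tags block so every resource that code base creates is tagged consistently:

    # Define charge-back tags once, in the single code base that owns them.
    provider "aws" {
      region = "us-east-1"

      default_tags {
        tags = {
          CostCenter = "cc-12345"      # hypothetical charge-back code
          ManagedBy  = "terraform"
          OwnerTeam  = "platform"
        }
      }
    }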

Bottom line: ensure that one and only one IaC code base manages each cloud asset. This prevents configuration drift, and it saves aggravation and labor for staff who are otherwise left with the mystery of figuring out how the drift happened. This is a topic I might explore more completely in another article.

I hope this helps. I'm always interested in hearing about other problems you encounter maintaining IaC code. Thanks for taking the time to read this article.