Saturday, April 13, 2024

For DevOps Professionals: Automatically Generating Terraform Documentation

Today, I'm furthering my war on DevOps busy work. This is a continuation of a previous post: For DevOps Professionals: Analyzing and Eliminating Busy Work. For this post, I want to remove the tedium of properly documenting Terraform code.

Terraform code documentation should make it easy for other developers to consume and maintain a Terraform configuration or project. The bottom line is that high-quality, accurate documentation for Terraform code makes change easier and saves people time. The relevant quote from my previous post is:

Automatically generate Terraform documentation. Like many, we use the open-source product terraform-docs to generate Terraform documentation. We seem to be executing that generation locally. When pull requests are created or updated, a GitHub Workflow that automatically executes the documentation generation and pushes any changes would eliminate the manual step.

Terraform code documentation should have the following components. Without this information, anyone consuming or changing the project would be forced to audit the code line by line.  Here's an example from one of my Terraform modules.

  1. A concise description of what the Terraform project supports
  2. For reusable modules, a usage example
  3. Details about all inputs, including any defaults
  4. Details for all outputs
  5. Identification of all Terraform resources maintained
  6. A listing of any modules used by the Terraform project
  7. Provider requirements, including versions

Keeping these documentation items current is tedious and boring. Consequently, documentation usually goes out of date quickly. Fortunately, the open-source product terraform-docs can generate this information. My team uses this product religiously.

We go further and have Terraform documentation updated automatically for every pull request update. As we're largely using GitHub, our Terraform projects regenerate documentation whenever our developers create or update a pull request. We've encapsulated the regeneration logic into a GitHub action, terraform-docs, making it easy to reuse for all Terraform projects in all our repositories. This concept is usable with most CI/CD toolsets.
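To make the regeneration step concrete, here is a minimal Python sketch of the idea. It is not the code inside our action. It assumes terraform-docs is installed on the runner and that the README wraps the generated section in the BEGIN_TF_DOCS / END_TF_DOCS comment markers that terraform-docs uses by default. The function names are hypothetical. (terraform-docs also has its own inject output mode, which is what you'd normally use; the sketch just shows what that step amounts to.)

```python
import subprocess
from pathlib import Path

# Default markers terraform-docs uses when injecting into an existing file.
BEGIN = "<!-- BEGIN_TF_DOCS -->"
END = "<!-- END_TF_DOCS -->"


def inject_docs(readme: str, generated: str) -> str:
    """Replace the text between the markers with freshly generated docs."""
    start = readme.index(BEGIN) + len(BEGIN)
    end = readme.index(END, start)
    return readme[:start] + "\n" + generated.strip() + "\n" + readme[end:]


def regenerate(module_dir: str) -> bool:
    """Regenerate README.md for one module; return True if the file changed.

    Hypothetical helper: a pipeline would commit and push when this is True.
    """
    readme_path = Path(module_dir) / "README.md"
    generated = subprocess.run(
        ["terraform-docs", "markdown", "table", module_dir],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    old = readme_path.read_text()
    new = inject_docs(old, generated)
    if new != old:
        readme_path.write_text(new)
    return new != old
```

Because the injection is idempotent, running it on every pull request update only produces a commit when the inputs, outputs, or resources actually changed.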

Using the terraform-docs action is relatively straightforward. That said, I might take the step of creating a reusable GitHub workflow to further streamline usage. Here's an example of workflow usage.

The terraform-docs action is publicly available under the MIT open-source license. You're free to use it directly or fork it if you prefer.

If you use it directly, we recommend explicitly referencing a release tag. That way, you're not accidentally impacted by changes to the action going forward.

I'm always interested in your views on the topic. Especially if you've discovered additional ways to make documenting Terraform code higher quality and easier to maintain.

What I'm Currently Reading:

Accelerate DevOps with GitHub: Enhance software delivery performance with GitHub Issues, Projects, Actions, and Advanced Security. Kaufmann, Michael: Packt Publishing; 1st edition (September 9, 2022)

Sunday, April 7, 2024

For DevOps Professionals: Analyzing and Eliminating Busy Work

I periodically go through the exercise of identifying areas of busy work my team and I experience as DevOps professionals concentrating on infrastructure code development and maintenance. While I do have a specific area of focus for my introspection today, this is a good exercise for anyone in almost any vertical.

I seem to be concentrating on Terraform IaC these days and migrating to GitHub workflows for CI/CD capabilities. We're entirely GitOps and 100% Infrastructure as Code. The opportunities I identify center on my current areas of focus. Your areas of focus, should you do a similar introspection, might identify different opportunities.

I've identified several opportunities for reducing and eliminating busy work for me and my team. I expect this list to evolve as some of these items are implemented. This is an evolution for us.

Automatically format Terraform code. There's no real need for people to manually format Terraform code. Formatting aids and speeds your ability to read and understand code. It's reading speed that provides the productivity benefit. Terraform has a format command (terraform fmt) that does a reasonable job (I realize that it's not perfect). When pull requests are created or updated, a GitHub Workflow that automatically executes the formatting and pushes any changes would eliminate manual formatting.
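As a rough sketch of what such a workflow step does, the Python below runs terraform fmt across the repository and then asks git which Terraform files changed. The function names are hypothetical, and the parsing of git status output is simplified (it ignores renames and quoted paths).

```python
import subprocess


def changed_terraform_files(porcelain: str) -> list[str]:
    """Extract .tf/.tfvars paths from `git status --porcelain` output."""
    paths = []
    for line in porcelain.splitlines():
        path = line[3:]  # two status columns plus a space precede the path
        if path.endswith((".tf", ".tfvars")):
            paths.append(path)
    return paths


def format_and_report(repo_dir: str) -> list[str]:
    """Format in place, then list the Terraform files the formatter touched."""
    subprocess.run(["terraform", "fmt", "-recursive"], cwd=repo_dir, check=True)
    status = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return changed_terraform_files(status)
```

If the returned list is non-empty, the workflow commits and pushes those files back to the pull request branch.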

Automatically generate Terraform documentation. Like many, we use the open-source product terraform-docs to generate Terraform documentation. We seem to be executing that generation locally. When pull requests are created or updated, a GitHub Workflow that automatically executes the documentation generation and pushes any changes would eliminate the manual step. Update: This topic is addressed in more depth in the post "For DevOps Professionals: Automatically Generating Terraform Documentation."

Automatically tag Terraform applies. Terraform's backend state paradigm pushes most into trunk-based development. Yes, I realize that's where all people should be, but that topic is a discussion for another day. Trunk-based development for Terraform introduces a hassle managing changes as they are promoted through various environments. We've settled on tagging to track which version is applied to which environment. That tracking has been manual, but there's really no need for that. It's much better and more reliable to automatically tag the version of code that has been successfully applied. Not only would manual work be eliminated, but the quality of the tagging would improve.
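Here is one way the tagging could be sketched in Python. The tag naming scheme (applied/environment/timestamp) and the function names are my own assumptions for illustration; use whatever convention your team has settled on.

```python
import re
from datetime import datetime, timezone


def apply_tag(environment: str, applied_at: datetime) -> str:
    """Build a tag name such as applied/prod/20240413T101500Z (hypothetical scheme)."""
    if not re.fullmatch(r"[a-z0-9][a-z0-9-]*", environment):
        raise ValueError(f"unexpected environment name: {environment!r}")
    stamp = applied_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"applied/{environment}/{stamp}"


def tag_commands(environment: str, commit_sha: str, applied_at: datetime) -> list[list[str]]:
    """Git commands a pipeline would run after a successful apply."""
    tag = apply_tag(environment, applied_at)
    return [
        ["git", "tag", "-a", tag, commit_sha, "-m", f"Applied to {environment}"],
        ["git", "push", "origin", tag],
    ]
```

Running these commands only after the apply step succeeds is what makes the tags trustworthy: a tag exists if and only if that commit was applied to that environment.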

Make Terraform state manipulation easy. We've been incorporating Terraform state manipulation operations into Terraform plan pipelines for years.  Now that we're transitioning to GitHub, those capabilities should transition too. The need for Terraform state operations is relatively rare,  but is critical when the need arises. 

Automate Terraform drift detection. Terraform introduced a capability of reporting "drift" in Terraform plan output. This capability has incrementally improved over time and become more useful. I expect the quality of that drift output to continue to improve. It's now possible to schedule workflows that perform Terraform plans and alert on any drift detected. This saves unplanned work investigating drift encountered when making configuration changes.
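A scheduled drift check can lean on the -detailed-exitcode option of terraform plan, which exits with 0 when the plan is empty, 1 on error, and 2 when the plan contains changes. The Python below is a minimal sketch with hypothetical function names. It assumes the scheduled job checks out the same commit that was last applied, so that a non-empty plan really does indicate drift rather than unapplied code changes.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"  # plan succeeded and is empty
    if code == 2:
        return "drift"  # plan succeeded and contains changes
    return "error"  # exit code 1, or anything unexpected


def check_drift(project_dir: str) -> str:
    """Run a plan in an already-initialized project directory and classify it."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=project_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

The scheduled workflow would alert the owning team whenever the result is anything other than "in-sync".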

Automate Terraform module testing. We've encountered shifting sand with Terraform modules originating from cloud technology and security team policy changes. While automated testing on a scheduled basis doesn't eliminate the problem, it alerts platform engineering personnel to the issue. That alert means that it's possible to fix the issue before individual application teams are negatively impacted.

I need an easily customizable templating assistant. The amount of time I spend copying, pasting, and changing GitHub workflows and Terraform projects disturbs me and other members of my team. I really need a personal implementation that is easy to implement and maintain. Backstage implementations are complex and require more labor than most teams can afford. Additionally, templating is a small percentage of Backstage capabilities. A Visual Studio Code extension called Components Boilerplate seems close to this idea, but isn't quite there.

This list will be my areas of focus over the coming weeks.  I'd like to know your thoughts on these items and other opportunities for productivity improvement you see in your teams.

What I'm Currently Reading:
Accelerate DevOps with GitHub: Enhance software delivery performance with GitHub Issues, Projects, Actions, and Advanced Security. Kaufmann, Michael: Packt Publishing; 1st edition (September 9, 2022)

Monday, December 26, 2022

Improve New Feature Delivery Rate with Value Stream Mapping

 I constantly see a desire to deliver enhancements and additional business capabilities to end users at a faster rate. What I don't see is a methodical and data-driven approach to achieving a faster delivery rate. I typically use a tactic called value stream mapping to improve clients' speed to market. That tactic seems obvious to me but isn't used as widely as I think it should be.

I'm going to define and illustrate value stream mapping for you in hopes that you see the value and understand the tactic well enough to apply it to your existing production processes and procedures. The concept applies to application features, infrastructure features, DevOps automation capabilities, and just about any type of information technology process I can think of. In fact, it applies to any business process I can think of, including those that aren't IT-related.

Example Application Delivery Value Stream

This is an example of the delivery process of a highly customized purchased application (COTS). As customizations were delivered by the vendor quite frequently, updates needed to be tested and deployed quite frequently. On average, the vendor supplied one to two updates per week. Each update required significant manual labor to test and deploy. The time and effort involved were costly and required tuning. We elected to do a value stream analysis. 

Below are components of the value stream, along with how much manual time was spent testing and deploying each of them. Note that given the length of outage required for deployments, significant coordination with the testing team and business users was necessary and often extended the lag between receiving updates from the vendor and getting those updates into the hands of end users.

We decided to automate the deployment process. The procedure given to us by the vendor was entirely manual. While the deployment process was only 32 clock-time hours of the total, decreasing that time to 4 hours allowed much greater flexibility in scheduling updates in the test environment as well as production. Now, updating installations for the test environment could be done off-hours without putting the testing team out of service. Additionally, production updates could be deployed off-hours and no longer require the weekend.

Lessons Learned

Automating testing would be the logical next tuning step. That said, automating tests for a COTS application is easier said than done. This particular application did not lend itself to easy UI testing in an automated fashion. As we didn't have access to product source code, testing service APIs wasn't an option either.

The value stream tuning effort illustrated here works for custom applications just as well as it did for this COTS example. The value stream tactic applies the tuning principle of optimizing the largest targets (those that take the most time) first. This principle is used when we tune CPU or memory consumption in applications. In fact, the principle can be used for non-IT processes as well, like budgets.
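The "largest targets first" principle is easy to express in code. The Python sketch below ranks the steps of a value stream by elapsed hours and reports each step's share of the total. The step names and most of the numbers are placeholders I invented for illustration; only the 32-hour deployment figure comes from the example above.

```python
def rank_steps(hours_by_step: dict[str, float]) -> list[tuple[str, float, float]]:
    """Return (step, hours, percent of total) sorted with the largest step first."""
    total = sum(hours_by_step.values())
    ranked = sorted(hours_by_step.items(), key=lambda item: item[1], reverse=True)
    return [(name, hours, round(100 * hours / total, 1)) for name, hours in ranked]


# Placeholder figures for illustration only, except the 32-hour deployment step.
example_stream = {
    "Regression testing": 80.0,
    "Deployment": 32.0,
    "Scheduling and coordination": 16.0,
}

for step, hours, share in rank_steps(example_stream):
    print(f"{step}: {hours:.0f} h ({share}%)")
```

Rerunning the ranking after each improvement shows where the next tuning effort should go, and whether you've hit the target you set.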

Value stream analysis should be an ongoing effort repeated periodically. Over time, your deployment process changes. When that happens, the value stream will also change.

As with other types of tuning efforts, it's important to identify a specific target. Like other tuning efforts, it's always possible to keep improving. Having a target allows you to know when the tuning effort is "done" and effort can be directed elsewhere.

Value stream analysis often reveals business procedures and processes that are not optimal. In another example from the field, I've seen DNS entries and firewall rule entries that are forced by organizational procedures to be manual and take significant amounts of lead time. It's important to track these activities in the value stream process as well. You need accurate information to make effective tuning decisions.

Thanks for taking the time to read my article. I'm always interested in comments and suggestions. 

Sunday, December 18, 2022

The Journey toward Continuous Delivery and Deployment: How to Start

I work with teams who are nowhere close to achieving continuous delivery, let alone continuous deployment. Those teams are falling further and further behind teams in other companies on similar journeys. Often, they don't have automated testing. If they do, often they are unit tests and not integration tests. Often, they don't have application infrastructure code and can't easily create additional environments. Often, they work in long-lived feature branches and have teams working on disparate versions of the code. Consequently, the speed of new feature delivery to end users is abysmal. Hope is not lost. The journey is difficult but possible. Here are some initial steps to take.

Establish a basic integration test suite if you don't already have one. There is not going to be continuous anything without automated integration testing. With automated integration testing, you can have confidence that changes, whether they are bug fixes or feature enhancements, don't accidentally introduce new defects. Often legacy code bases aren't written to be easily testable at a unit level. Concentrate on integration testing the application at its consumption points. For web UIs, that means automated functional testing of the UI or at least the REST web service resources. For APIs consumed by other applications, it means integration testing the service endpoints.

Unit testing is always important, and I never discourage it.  Integration testing is more important as it tests end-user functionality,  not just small sections of code. If you have no automated tests,  start with integration testing first.  You can implement unit tests for new features along the way. 

Ideally,  the environment for integration testing should be established by infrastructure code before the tests and eliminated after.  That said,  I'm concentrating on the initial steps in this post and don't want to deviate.

Management support is needed for funding automated testing. That funding is both labor and tooling. Initial setup for automated testing has an upfront cost.  Remember that there is a broad range between 0% and 100% coverage.  The higher the percentage,  the better.  That said,  higher percentages have diminishing returns. Don't let perfection be the enemy of progress. 

Establish a continuous integration (CI) pipeline for the main/master source code branch if you don't already have one. Continuous integration is a firm requirement for continuous delivery. The CI pipeline should run automatically on check-in of code changes. Continuous integration identifies defects immediately after changes are checked in by executing all available unit and integration tests. The objective is to identify defects as early in the development process as possible.

If continuous integration (CI) reveals an error,  fixing the error is the highest priority. Many pundits would say that this type of breakage should result in an "all hands on deck" type of emergency and should enlist all members of the team to fix it. As a practical matter,  the developer who checked in the change that broke CI is usually the best person to fix it.  They know what they did and are closer to the problem. The developer's tech lead can follow up with the developer at some point to identify any additional resources needed to fix the issue.  

Adopt trunk-based development and eliminate long-lived feature branches.  Trunk-based code management is a requirement for continuous anything. Trunk-based development ensures that developers are using the newest and most current code base.  This minimizes the chances of "merge hell" and keeps all team members up to date and working on the current code.

Depending on your CI/CD tooling,  it might be easier to use a short-term feature branch and initiate CI on merge.  These feature branches must be short-lived or you're not really adopting trunk-based development. As long as the short-term branch gets deleted after the merge,  this isn't a bad tactic. 

Only start changes that can be completed in four hours or less.  Two hours is better.  Break the change up into smaller pieces if it's longer than that. This reduces merge issues later.  This also reduces the chance that another team member introduces a conflicting change.  This practice provides an incentive to keep changes small and only make one change at a time.  This is in keeping with the objective of delivering new features to users faster. All good.

Eliminate the practice of reviewing pull requests. Allow automated testing to catch defects. If a junior developer checks in code that isn't optimal, a more senior member of the team can refactor that change later and hopefully take the opportunity to educate the junior team member on the issue. Either way, if it didn't break CI, not much damage was done.

Encouraging automated test coverage will change developer behavior. If developers "know" they will be expected to produce automated tests for changes they make, they will code to make testing easier. It's enlightened self-interest. That also makes the code base cleaner and increases its quality.

Continuous delivery is an ongoing, never-ending process. The beginning tactics I list here are just a start.

Wednesday, September 14, 2022

Appropriate workloads for Kubernetes

I've been asked about hosting cloud applications in Kubernetes. People seem to assume that Kubernetes is the best practice for hosting containerized workloads in all cases. While I love Kubernetes and have used it successfully in numerous applications, it shouldn't be the default, out-of-hand hosting solution for containerized workloads.

Kubernetes is far from the simplest solution for hosting containerized workloads. Even with cloud vendors making Kubernetes clusters more integrated and easier to manage, they are still very complex and take highly specialized administration skill sets. If you're using Azure, Azure Container Instances is a comparatively more straightforward method than AKS for deploying containerized workloads. In fact, Azure has several different ways to deploy containerized workloads. Most are easier than AKS/Kubernetes.

If you're using AWS, ECS or Lambda is comparatively easier than Kubernetes. In fact, AWS has at least 17 ways to deploy containerized workloads. Incidentally, with any cloud, it's possible to create a virtual machine and run a containerized workload on it; I don't recommend this. Bottom line: Kubernetes, AWS ECS, and Azure Container Instances are all application runners.

If you adopt Kubernetes as a hosting mechanism, the benefits should pay for the additional complexity. Otherwise, the additional maintenance headaches and costs of Kubernetes are not a wise investment. That raises the question: what types of applications are most appropriate for Kubernetes? Which application types should adopt a more straightforward hosting mechanism?

As an aside, containerizing your applications is the mechanism that separates the concern of hosting from the functionality of your application. Cloud vendors have numerous ways to deploy and run containerized applications. Vendor lock-in usually isn't an issue with containerization.

Workloads Well-Suited for Kubernetes

This section details common types of applications that may benefit from Kubernetes hosting.

Applications with dozens or hundreds of cloud-native services often benefit from Kubernetes. Kubernetes can handle autoscaling and availability concerns among the different services with less set-up per service. Additionally, cloud vendors have integrated their security and monitoring frameworks in a way that generally makes management of Kubernetes-hosted services homogenous with the rest of your cloud footprint.

Applications servicing multiple customers that require single-tenant deployments often benefit from Kubernetes. For these applications, there is one deployment of a set of services per "customer". For example, if there are 1000 customers, there will be 1000 deployments of the same set of services, one for each customer. Given that the number of deployments grows and shrinks as customers come and go, Kubernetes streamlines the setup for each and provides constructs to keep the customer deployments separate.

Applications requiring custom Domain Name System (DNS) resolution that can't be delegated to the enterprise DNS often benefit from Kubernetes. This is because Kubernetes has configurable internal networking capabilities. Often this need arises more as a result of organizational structure than technical reasons. In many enterprises, DNS is managed by infrastructure teams and not application teams.

Applications in enterprises where IP address conservation is necessary can benefit from Kubernetes. For applications with internal services that only other Kubernetes-hosted services need to access, the Kubernetes networking model called kubenet provides an internal network not visible outside the cluster. For example, an application with hundreds of microservices may only need to expose a small fraction of those services outside the cluster. The kubenet networking model conserves IP addresses as internal services don't need IP addressability outside the cluster.
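To put rough numbers on the conservation argument, here is a small Python sketch. It uses a deliberately simplified model: under a kubenet-style network only the cluster nodes consume addresses from the enterprise address space, while under a flat model every pod consumes one too. The function names and the example figures are hypothetical, and real cloud subnets reserve a few more addresses than this accounts for.

```python
import ipaddress


def routable_ips_needed(nodes: int, pods: int, pods_share_node_network: bool) -> int:
    """Addresses consumed from the enterprise (routable) address space."""
    return nodes + pods if pods_share_node_network else nodes


def fits(cidr: str, needed: int) -> bool:
    """True if a subnet has enough usable addresses for the requirement."""
    network = ipaddress.ip_network(cidr)
    usable = network.num_addresses - 2  # exclude network and broadcast addresses
    return needed <= usable


# Hypothetical cluster: 10 nodes hosting 300 pods, offered a single /24.
print(fits("10.0.0.0/24", routable_ips_needed(10, 300, pods_share_node_network=False)))
print(fits("10.0.0.0/24", routable_ips_needed(10, 300, pods_share_node_network=True)))
```

In this model the internal-network cluster fits comfortably in the /24, while the flat model needs a larger block carved out of the enterprise network.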

In the cloud, we think of IP addresses as "free" and without charge. For firms with a large existing on-premises network, IP address space is often not free. In most firms I've seen, on-premises networks are tarballs without sensible IP address schemes. Often, nobody understands the entirety of what network CIDRs are in use and for what. Additionally, routing is often manually configured, making CIDR block additions labor-intensive.


Not every application should be hosted in Kubernetes. Simple applications with small transaction volume often don't require the additional complexity Kubernetes brings. 

Thanks for taking the time to read this post. I hope this helps.

Sunday, September 4, 2022

Policy-based Management Challenges and Solutions

One of the most common best practices for managing security in the cloud is policy-based management. Policy-based management optimally prevents security breaches or at least alerts you to their presence. Additionally, it alleviates the need for as many manual reviews and approvals, which slow down development of new business capabilities. That said, policy-based management presents many challenges. This post details common challenges and tactics to overcome them.

Challenge #1: Introducing New or Changed Policies

New policies or changes to existing policies often break existing infrastructure code (IaC) supporting existing applications. This occurs because at the time the IaC was constructed, the policy wasn't in place and the actions were allowed. This results in unplanned work for application teams and schedule disruptions. As policy makers are usually on separate teams, they often don't bear the cost of the resulting unplanned work.

Policy change announcements are often ineffective. Partially, this is due to volume of announcements in most organizations. The announcement of an individual policy gets lost in a sea of other announcements. Additionally, sometimes IaC developers do not completely understand or see the ramifications of the policy change. 

Challenge #2: Policies with Automatic Remediation

Installing policies that have automatic remediation can actually break existing infrastructure and the applications that rely on it. While automatic remediation is appealing from a security perspective, as it fixes an issue shortly after a security hole is created, it really just kicks the can down the road. Any resulting breakage will need to be repaired, sometimes causing an outage for end users.

The IaC that produced the invalid infrastructure will no longer match the infrastructure that physically exists and needs to be changed. In other words, the automatic remediation causes unplanned work for other teams. Sometimes, new policies break common IaC modules used by multiple teams rather than individual application infrastructure code.

Challenge #3: Adapting Policies to Advances in Technology 

Many policy makers only consider legacy mutable infrastructure. Mutable infrastructure is common on premises and consists of static virtual machines/servers that are created once and updated with new application releases when needed. Immutable infrastructure VMs are completely disposable. The VMs are still updated, but by updating the images they are created from and replacing the VMs in their entirety.

For example, it is common to place a policy that requires that automatic security updates be applied to virtual machines on a regular basis. The issue is that such policies assume that the VM has a long life as it would under a mutable infrastructure. Such a policy doesn't apply to immutable infrastructures. For immutable infrastructure, the base image needs security updates applied and any VMs built using it should be rebuilt and redeployed.

Cloud vendor technology changes at a rapid pace. Keeping cloud policies up to date with current advances is a challenge. In practice, policy makers are often out of date and make invalid assumptions. Effects of this I commonly see are:
  • Assuming that cloud vendor capabilities for securing network access remain the same. Often, these capabilities advance.
  • Assuming VM IP addresses are static and can safely be used in firewall rules. In the cloud, IP addresses can change quite frequently.
  • Assuming that VM images are changeable (vendor-provided images might not be)
  • Assuming that there will be no needed exceptions to security policies

Tactics to Mitigate Challenges

Always audit compliance with policies first before installing automatic remediation. That is, alert teams of new compliance issues before changing anything automatically. This allows teams to accommodate a security policy change proactively before change is forced. Additionally, a reasonable lead time needs to be provided so that teams have the opportunity to plan for the additional work.
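The audit-first idea can be sketched in a few lines of Python. This is not any cloud vendor's policy engine. The Finding type, the evaluate function, and the resource dictionaries are hypothetical stand-ins. The point is that the same compliance check runs in both modes, and only "enforce" mode is allowed to change anything.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    resource_id: str
    policy: str
    action: str  # "reported" in audit mode, "remediated" in enforce mode


def evaluate(
    resources: list[dict],
    policy_name: str,
    is_compliant: Callable[[dict], bool],
    remediate: Callable[[dict], None],
    mode: str = "audit",
) -> list[Finding]:
    """Check every resource; only change anything when mode is 'enforce'."""
    if mode not in ("audit", "enforce"):
        raise ValueError(f"unknown mode: {mode!r}")
    findings = []
    for resource in resources:
        if is_compliant(resource):
            continue
        if mode == "enforce":
            remediate(resource)
            findings.append(Finding(resource["id"], policy_name, "remediated"))
        else:
            findings.append(Finding(resource["id"], policy_name, "reported"))
    return findings
```

Rolling a policy out in audit mode first gives teams the list of findings, and the lead time, before the switch to enforce mode.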

Test new or changed policies with any related enterprise-wide common IaC modules. It is common for organizations with mature DevOps capabilities to centralize common IaC modules and reuse them for multiple applications. This allows organizations to leverage existing work instead of having multiple teams reinvent the wheel. For example, if a policy regarding AWS S3 buckets or Azure storage accounts is being changed, test any common IaC modules that use those constructs. Make policy compliance part of the test. Note that these tests should be automated so they can easily be rerun.

Any policies with automatic remediation must provide an exception capability. For example, some VM images are purchased and not changeable according to the license. It is common for such images to be granted exceptions from related security policies. Additionally, I've seen exceptions granted for cloud vendor-provided Kubernetes clusters where underlying VMs don't and can't meet policy requirements.
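An exception capability can be as simple as a per-policy list of resource name patterns that remediation must skip. The Python sketch below illustrates the idea. The policy names, resource names, and function names are hypothetical, and real policy engines have their own exemption mechanisms.

```python
from fnmatch import fnmatchcase


def is_exempt(resource_id: str, policy: str, exceptions: dict[str, list[str]]) -> bool:
    """True if the resource matches any exception pattern registered for the policy."""
    return any(fnmatchcase(resource_id, pattern) for pattern in exceptions.get(policy, []))


def resources_to_remediate(
    non_compliant_ids: list[str], policy: str, exceptions: dict[str, list[str]]
) -> list[str]:
    """Filter out exempt resources before any automatic remediation runs."""
    return [rid for rid in non_compliant_ids if not is_exempt(rid, policy, exceptions)]


# Hypothetical exceptions: licensed vendor images and managed Kubernetes node pools.
exceptions = {"auto-patching": ["vm-vendor-*", "aks-nodepool-*"]}
print(resources_to_remediate(
    ["vm-app-01", "vm-vendor-01", "aks-nodepool-3"], "auto-patching", exceptions
))
```

Exceptions are scoped to a single policy, so a resource exempted from one policy is still subject to all the others.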

New policies should be deployed in lower environments first. This increases the chance that any errors or issues will be identified before the policy is applied to production. Be sure to allow a reasonable period of time in lower environments to increase the likelihood that issues will be identified and addressed.


Policy-based management has challenges, but should still be considered best practice. Thanks for taking time to read this article. Please contact me if you have questions, concerns, or are experiencing challenges I've not listed here.

Monday, August 8, 2022

Move your Network to the Cloud Too!

Over the past year, I've been seeing indications of what will be a big trend in cloud consumption: let's move our network to the cloud along with our data centers. I'm talking primarily about the WAN, which many enterprises maintain worldwide. Local offices will still need connectivity to the WAN; it's just that they will increasingly become on-ramps to the worldwide WAN hosted in the cloud. In other words, data centers will no longer be the "center" for all network access.

Graphically, the concept of moving the WAN to the cloud would look like figure 1 below. Notice how all data centers and offices are connected to the WAN that handles traffic between them. While the image doesn't describe it, the Cloud-based WAN is worldwide and can serve offices and data centers across the globe.

Figure 1: Cloud-based WAN Network

Let's contrast this with figure 2 which depicts the WAN network topology common in enterprises today. Note that public cloud access typically routes via data centers making enterprise application access data center centric. Worldwide connectivity is managed by a custom MPLS network.

Figure 2: Traditional Worldwide MPLS Network

I'm seeing several motivations for the change in thinking about how worldwide networks should be organized. I'll separate the reasoning into the following categories:
  • Complexity
  • Performance
  • Financial
  • Speed to Market


Complexity

The complexity of non-cloud MPLS networks, the base for most enterprise worldwide WANs, is tremendous. MPLS networks typically require large amounts of hardware that needs to be upgraded and replaced regularly. They also tend to be replete with numerous vendor contracts. They take a large networking staff. While some outsource that staffing to a managed services provider (MSP), the staff is still necessary. Outsourcing a large portion of the network to cloud vendors outsources this complexity and associated maintenance to a large degree.

The complexity increases the business risk of change. MPLS networks are rarely supported by testing sandbox environments and automation. Many still make changes manually leading to inevitable human error and outages for users. Utilizing cloud vendors makes it much easier to automate the WAN infrastructure and provides a sandbox environment to test networking-related changes. This decreases the business risk of changes to networking infrastructure. This is huge. For most enterprises, the WAN that integrates all data centers and offices is essential.

Simpler capacity planning requirements. Hardware and vendor contracts needed for worldwide MPLS networks require sophisticated capacity planning due to long lead time requirements. This requirement is much simpler with cloud WAN implementations. Capacity planning still exists, but it is far simpler and is easily changeable and adaptable on the fly.


Performance

Network latency is generally significantly lower (faster) using cloud-provided WAN networking than worldwide MPLS networks. While your mileage will vary depending on your MPLS implementation, so much R&D goes into cloud-provided WANs that the likelihood that an enterprise will keep up any network performance advantages over time is low. Face it, most firms just can't compete.

Network latency is higher (slower) accessing resources that require networking between on premises and the cloud. As more IT workloads move from on premises to the cloud, closer proximity to the cloud will yield better performance. To this end, I see more enterprises leveraging cloud VPN services, which are closer to most application workloads, yielding better performance.


Financial

Converting networking hardware and infrastructure from capital expense (CapEx) to operational expense (OpEx) is appealing to many enterprises from an accounting perspective. As with computing resources, you pay for what you use for cloud-based WANs without hardware expenditures and management.

Networking labor is expensive, specialized labor. Outsourcing that labor to cloud providers is definitely cheaper. Some enterprises mitigate this cost by enlisting a managed services provider (MSP), but outsourcing that labor to cloud vendors is cheaper as it capitalizes on the cloud's economy of scale advantages.

Speed to Market

No more long lead times for MPLS network upgrades and capacity increases. Increasing capacity in a cloud-provided WAN is typically measured in hours, not months. Furthermore, cloud-provided WAN products benefit from the cloud's dynamic scaling capabilities. Increasing MPLS network capacity takes sophisticated capacity planning and typically long lead times due to additional hardware expenditures.

Additional Benefits

The firm gets access to research and development advances made by cloud providers. The R&D resources that cloud providers are investing in WAN technologies surpass what most enterprises are able or willing to invest in. This means that over time, any differences in functionality and performance are likely to appear in cloud vendors first.

A cloud-based WAN is a natural partner when combined with a cloud-based VPN capability. This makes sense especially if the cloud hosts a larger percentage of application compute resources. Consuming the cloud provider's VPN solution moves users' point of entry closer to the compute resources they access. With that closer proximity typically comes better performance.

A cloud-based WAN is a natural partner for integrating multiple cloud providers. That is, your AWS footprint can be securely connected to your Azure or GCP footprint directly. This avoids the slower connection between the cloud providers through an on-premises data center.

Concluding Remarks

I'm reporting what I'm seeing at clients. This idea made no sense when many had a small fraction of their IT footprint in the cloud. Now that most firms have most of their footprint in the cloud, thinking on how to provide worldwide access to internal users needs to evolve. And the time for that evolution has come.

If you have thoughts or feedback, please contact me directly via LinkedIn or email. Thanks for taking the time to read this article.