Saturday, December 26, 2020

For DevOps Professionals: Barriers to 100% Infrastructure as Code

I was asked the other day why a particular part of the cloud infrastructure was added manually and not automated. It was a very small manual part and a one-time setup, but nonetheless I experienced déjà vu. It occurred to me that I've been asked that question at every client I've had since I got heavily into infrastructure code. We use the phrase "100% infrastructure as code" often. In fact, the vast majority of cloud infrastructure is implemented via code. However, there is always some very tiny portion of the infrastructure that seems to be provided manually. The percentage is probably closer to 99.x% in most organizations I've had the privilege to do work for. Why is that? Why does the percentage never seem to be 100%? Let's make this more concrete and list some examples I've encountered of infrastructure that wasn't 100% automated and why.

Examples of Automation Barriers

High-bandwidth connections from the cloud to on-premises data centers are rarely 100% automated. This is the case for both Azure ExpressRoute circuits and AWS Direct Connect connections. The reason is that a third-party firm controls access to the on-ramp or colocation device (e.g. CoreSite, Equinix, etc.). If the organization has access, it's usually manually controlled by a separate network infrastructure team. In other words, automation engineers don't have the full access to these devices that they need to develop infrastructure code. In essence, the cloud connectivity to the ExpressRoute circuit can be automated, but that circuit's connectivity to the on-ramp is usually not.

Automating DNS entries is problematic in many hybrid-cloud organizations. This is an organizational barrier and not a technical barrier. It is common for DNS entries to be controlled by a separate team in a manual fashion. DNS authority is tightly controlled as there is effectively one DNS environment for the entire organization most of the time. The fear is that automation defects could negatively affect non-cloud entries or resources.

Automating security policies and the assignment of those policies is problematic in many organizations. Typically, security is handled by a separate team and usually a team without infrastructure automation skills. Consequently, I've seen automation engineers write code to establish security policies, but those policies are manually assigned by a separate team. In essence, the traditional test-edit cycle that automation engineers need for this type of development doesn't exist.

Frequently, the creation of AWS accounts or Azure subscriptions is not automated. The reason is that in most organizations, their creation and their placement in the organizational tree are controlled by a separate team without automation coding skills. Furthermore, a sandbox environment for this type of automation code development doesn't exist.

Organizations define some resources to be central to the entire enterprise and don't have environment segregation. Examples are Active Directory environments, DNS, and WANs. The problem with this is that changes to central resources such as these become "production" changes and are tightly controlled. When everything is a production change, the test-edit development cycle automation engineers need doesn't exist.

After doing some introspection with these examples, I've identified several common barriers to implementing 100% infrastructure as code. It turns out that most of these limitations are not technology-based.

Environment limitations

Infrastructure code development requires support for test-edit loops. That is, automation engineers need to be able to run, edit, and re-run infrastructure code to correct bugs. Writing infrastructure as code is just like application development in many ways. Automation engineers need to be able to experience occasional failures without negatively impacting others. These requirements are usually accomplished by a "sandbox" environment that others are not using.

The app developer part of me wants those tests and the verification of the result automated just like other types of application code. That said, the tooling to support automated testing of infrastructure code is sketchy at best. Automated testing is worth doing, but there are definite limits to the test coverage it can provide for infrastructure code.

The sandbox environment used for infrastructure code development must support the add/change/delete of that environment without negative impact on others. Like other types of development, infrastructure code doesn't always work as intended the first time it runs. In fact, you should assume that infrastructure code development might actually damage the sandbox environment in some way.

Sandbox environments should be viewed as completely disposable. That is, they can be created or destroyed as needs require with little effort. It needs to be easy for an automation engineer to create a new sandbox for infrastructure code development and destroy it afterward.
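The "create, use, destroy" lifecycle described above can be sketched in Python as a context manager. This is only an illustration under stated assumptions: `disposable_sandbox` and the `create` and `destroy` callables are hypothetical names, standing in for whatever provisioning tooling an organization actually uses.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


@contextmanager
def disposable_sandbox(
    create: Callable[[], T],
    destroy: Callable[[T], None],
) -> Iterator[T]:
    """Create a sandbox, hand it to the caller, and always destroy it afterward.

    `create` and `destroy` are hypothetical callables supplied by the caller;
    they stand in for real provisioning and teardown steps.
    """
    sandbox = create()
    try:
        yield sandbox
    finally:
        # Runs even when the infrastructure code under development fails,
        # so a damaged sandbox never lingers.
        destroy(sandbox)
```

An engineer would wrap a development run in `with disposable_sandbox(create, destroy) as sandbox:` so that teardown happens whether or not the run succeeds.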

It's common for sandbox environments to have limitations. That is, it's difficult for sandbox environments to accommodate 100% of infrastructure code development. Dependency requirements (e.g. Active Directory, DNS, connectivity to SaaS or on-premises environments) are primary examples. These limitations contribute to the small portion of the infrastructure that is at least partially maintained manually.

Organizational authority

Automation engineers and the service accounts used for automation must have 100% control of the infrastructure maintained by code. That is, one can't develop infrastructure code without the authority to do so. Where that authority is withheld, the code can't be completely developed and tested. Consequently, the portion of the infrastructure that directly interfaces with such resources is often manual.

Earlier in the post, I provided several examples that fit this category. For example, organizations using proprietary DNS products (e.g. Infoblox) often don't want to pay for additional licenses to support infrastructure code testing. Additionally, as DNS is often implemented in a one-environment paradigm (only production without separate development environments), organizations are hesitant to grant automation engineers the security credentials needed to support infrastructure code development.

Active Directory (A/D) environments also fit into this category in many organizations. As A/D is often used to grant security privileges, organizations are loath to grant automation engineers and automation service accounts the privileges needed to create groups, edit group membership, and delete groups.

All too often, the solution to these types of issues is to do a portion of infrastructure manually.

Low benefit/cost ratio

For some types of infrastructure, organizations find the benefits obtained by a completely automated solution aren't worth the costs. In other words, third-party costs (e.g. software licensing) mean the "juice isn't worth the squeeze". Some infrastructure dependencies cost too much in money, labor, or time to set up to make the automation practical. Sometimes the manual labor involved in maintaining some infrastructure items is very small, making the cost of infrastructure code for those items not worth the effort.

Resources that are rarely updated and take an extremely long time to create/destroy often aren't worth the cost of automation. The AWS Transit Gateway is often an example.

Lack of DevOps team discipline can increase the cost of infrastructure automation and lower the benefit/cost ratio. Without good discipline in the development life-cycle for infrastructure code and good source control habits, it's common for development work by one automation engineer to negatively impact the work of others. This leads to an increase in manual work or a decrease in team velocity.

The breadth of specialized skills needed for some types of infrastructure can lower the benefit/cost ratio. As an example, work with one client required specialized networking and A/D skills to set up a test RRAS VPN target. If I didn't have a team member with these skills, I could never have tested that the cloud-side VPN infrastructure code worked - it would have been untested until use in one of the non-sandbox environments. I've seen other examples with regard to relational database administration skills and other types of specialized labor. The breadth of knowledge often needed by automation engineers is daunting.

Concluding Message

My acknowledgment that there are barriers to implementing 100% infrastructure as code should not be used as an excuse not to automate. Infrastructure code has produced some of the best productivity gains since we embarked on adopting cloud technologies. I'll never give up pressing for higher levels of infrastructure code automation. That said, when I recognize some of these non-technology barriers to infrastructure code, I'll feel a little less guilty. Yes, I'll try to craft workarounds, but recognize that it isn't always possible in every organization.

Thanks for reading this post. I hope you find it useful.

Wednesday, December 16, 2020

For Managers: Cloud Governance through Automation

Cloud consumption and DevOps automation are not just a technology change. They are a paradigm shift that managers participate in as well, though they don't always realize it. One of the paradigm shifts involves cloud governance. If managers apply governance tactics developed over the years, they risk losing many of the benefits obtained by cloud consumption, including speed to market. Having seen this transformation at several organizations, I have some thoughts on the topic. Please take time to comment if you have thoughts that I haven't reflected here.

Place automated guardrails on cloud usage instead of manual review processes. In short, when new policies are needed or existing policies modified, work with a cloud engineering team instead of adding manual review points. The benefits are:

  • Fewer review meetings
  • Reduced manual labor with both management oversight and application team compliance
  • Added security as enforcement is more consistent and comprehensive
  • Evolves as your cloud usage grows and changes
  • Allows decentralized management of cloud resources which frees application teams to innovate more.

This is a paradigm shift from what was needed in data centers. Hardware infrastructure found on premises makes governance and its enforcement manual. This leads to long lead times to acquire and configure additional infrastructure and makes governance a constraint to bringing additional technical capabilities to application teams and users. Manual approvals and reviews are needed, costing time and management labor.

In the cloud, infrastructure automation is possible because everything is now software. Networking, infrastructure build-outs, security privileges/policies, and much more are now completely software configuration and don't involve hardware. The software nature of the cloud makes the automation of governance in the cloud possible. Once automated, governance is no longer manual. Governance is enforced automatically, which provides enterprise safety. As a consequence, the need for manual approvals decreases, if it isn't eliminated entirely. This frees application development teams to innovate at a faster pace.

What types of automated guardrails are possible?

As the cloud is entirely software, the sky is the limit. That said, there are several guardrails that I see as implementation candidates.

Whitelist cloud services application teams can use. As an example, some organizations have legal requirements, such as HIPAA or FERPA, that need to be adhered to. These organizations usually have a need to whitelist services that are HIPAA or FERPA compliant. As another example, some organizations standardize on third-party CDN or security products. They commonly want to prohibit cloud-vendor based solutions that aren't a part of the standard solution.

Whitelist cloud geographic regions application teams can use. Some organizations don't operate world-wide and want cloud assets existing only in specific regions.
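The two whitelists above can be sketched as a simple check in Python. This is an illustration only: the whitelist contents, the `ResourceRequest` type, and the `check_request` function are hypothetical names, and in practice a guardrail like this would more likely be enforced through the cloud vendor's own policy features than through custom code.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical whitelists; each organization would maintain its own.
ALLOWED_SERVICES = {"s3", "ec2", "rds"}
ALLOWED_REGIONS = {"us-east-1", "us-west-2"}


@dataclass(frozen=True)
class ResourceRequest:
    """A request by an application team to create a cloud resource."""
    service: str
    region: str


def check_request(request: ResourceRequest) -> List[str]:
    """Return a list of policy violations; an empty list means allowed."""
    violations = []
    if request.service.lower() not in ALLOWED_SERVICES:
        violations.append(f"service '{request.service}' is not whitelisted")
    if request.region.lower() not in ALLOWED_REGIONS:
        violations.append(f"region '{request.region}' is not whitelisted")
    return violations
```

Returning every violation, rather than stopping at the first, gives the application team the full picture in one pass.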

Automatically remediate or alert for security issues. Most organizations have specific plans for publishing cloud assets on the internet. As an example, one of my clients automatically removes unauthorized ports opened to all internet addresses (CIDR 0.0.0.0/0) within a few seconds after such a port is opened. As another example, a customer of mine raises alerts when people are granted security privileges in addition to non-security administration privileges.
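The decision logic behind the port remediation example might look something like the Python sketch below. It is not my client's implementation: `IngressRule`, `split_rules`, and the set of authorized public ports are hypothetical, and the step that actually revokes a rule through the cloud vendor's API is deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

OPEN_TO_INTERNET = "0.0.0.0/0"


@dataclass(frozen=True)
class IngressRule:
    """A simplified inbound firewall rule (hypothetical shape)."""
    port: int
    cidr: str


def split_rules(
    rules: List[IngressRule],
    authorized_public_ports: Set[int],
) -> Tuple[List[IngressRule], List[IngressRule]]:
    """Split rules into (kept, revoked).

    A rule is revoked when it is open to all internet addresses and its
    port is not in the authorized set. All other rules are kept.
    """
    kept: List[IngressRule] = []
    revoked: List[IngressRule] = []
    for rule in rules:
        exposed = rule.cidr == OPEN_TO_INTERNET
        if exposed and rule.port not in authorized_public_ports:
            revoked.append(rule)
        else:
            kept.append(rule)
    return kept, revoked
```

With `{443}` as the authorized set, a rule opening port 22 to 0.0.0.0/0 lands in the revoked list, while port 22 restricted to an internal CIDR is kept.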

Automatically report and alert on underutilized cloud resources. Underutilized resources often cost money to no benefit. These resources are generally computing resources such as virtual machines. Alerts like these provide ways to lower cloud spend as it's often possible to downsize the compute resources.
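As a rough Python sketch of the idea, the function below flags machines whose average CPU utilization falls under a threshold. The function name, the 10% default, and the input format (machine name mapped to CPU percentage samples) are assumptions for illustration; real utilization data would come from the cloud vendor's monitoring service.

```python
from statistics import mean
from typing import Dict, List


def find_underutilized(
    cpu_samples: Dict[str, List[float]],
    threshold_percent: float = 10.0,
) -> List[str]:
    """Return the names of machines whose average CPU use is below the threshold.

    Machines with no samples are skipped rather than flagged, since there
    is no evidence either way.
    """
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and mean(samples) < threshold_percent
    )
```

CPU alone is a crude signal; memory, network, and disk activity would normally be considered before downsizing anything.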

Automatically report and alert for unexpected cost increases. Alerts like these need sensible thresholds. This alert usually prompts a review and possible remediation of the application causing the cost increase. 
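One simple way to express a threshold is as a percentage increase over a trailing average, sketched below in Python. The function name and the 25% default are assumptions, not a recommendation; each organization would tune the threshold to its own spending patterns.

```python
from statistics import mean
from typing import List


def cost_spike(
    recent_daily_costs: List[float],
    today_cost: float,
    max_increase: float = 0.25,
) -> bool:
    """Return True when today's cost exceeds the trailing average by more
    than `max_increase` (expressed as a fraction, so 0.25 means 25%)."""
    if not recent_daily_costs:
        return False  # no baseline yet, so nothing to compare against
    baseline = mean(recent_daily_costs)
    return today_cost > baseline * (1 + max_increase)
```

A threshold that is too low produces alert fatigue, and one that is too high hides real problems, which is why it needs periodic review.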

Schedule uptime for non-production resources to save money. Often, organizations don't schedule downtime for non-production environments off-hours. Enterprises operating worldwide might not have this option as effectively there aren't "off-hours".
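The scheduling decision itself is small, as the Python sketch below suggests. The business hours and the function name are hypothetical, time zones are ignored for brevity, and the code that would actually start or stop resources is omitted.

```python
from datetime import datetime, time

# Hypothetical business hours for a single-time-zone organization.
BUSINESS_START = time(7, 0)
BUSINESS_END = time(19, 0)


def should_be_running(now: datetime) -> bool:
    """Return True on weekdays during business hours, False otherwise."""
    is_weekday = now.weekday() < 5  # Monday is 0, Friday is 4
    return is_weekday and BUSINESS_START <= now.time() < BUSINESS_END
```

A scheduled job could call this every few minutes and start or stop non-production resources to match the result.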

How can automated guardrails avoid becoming a bottleneck?

Application teams do not like constraints of any type. Having been on application teams for many years, I understand their sentiment. There are ways to keep guardrail development from becoming a bottleneck.

Fund automated guardrail development and maintenance. Like any other software produced by the enterprise, automated guardrails need development and support resources. Without adequate funding, they won't react to changing needs on a timely basis. Additionally, recognize that inadequate funding for automated guardrails will result in productivity losses for individual application teams across the enterprise.

Work with application development teams to identify and prioritize needed enhancements. This provides visibility into the guardrail backlog. Additionally, application teams can participate in prioritizing enhancements. Make them part of the process.

As cloud platforms evolve and change, automated guardrail development and maintenance is an activity that never "ends". Cloud governance is a continually evolving feedback loop. There must be a reasonable process for application teams to propose modifications to existing guardrails. As cloud technology changes over time, advances are made in current cloud services and new services are invented. As an example, one of my clients must restrict cloud services used to those that are HIPAA compliant. As advances are made, that list grows over time and needs to be revisited.

As a manager in charge of cloud governance, what does this change mean to me?

Declare "war" on manual approvals. A colleague of mine calls manual review/approval processes "meat-gates". They slow everything down, both for management and application teams, and they hamper delivering new features to end-users. Instead of adding them to govern cloud usage, engage a DevOps or cloud engineering team to enforce your desired behavior.

DevOps automation is your friend and ally. It allows you to set policy and not need to devote as much to enforcement. You specify "what" policies you want to be enforced. DevOps automation engineers construct and maintain the enforcement of the policies you choose.    

Conclusion

I hope you find these thoughts useful. I'm always open to additional thoughts. Thanks for reading this post and taking the time to comment.