Wednesday, September 14, 2022

Appropriate workloads for Kubernetes

I've been asked about hosting cloud applications in Kubernetes. People seem to assume that Kubernetes is the best practice for hosting containerized workloads in all cases. While I love Kubernetes and have used it successfully in numerous applications, It shouldn't be the default, out-of-hand, hosting solution for containerized workloads. 

Kubernetes is far from the simplest solution for hosting containerized workloads. Even with cloud vendors making Kubernetes clusters more integrated and easier to manage, they are still very complex and take highly specialized administration skillsets. If you're using Azure, Containerized Instances is a comparatively more straightforward method than AKS for deploying containerized workloads. In fact, Azure has several different ways to deploy containerized workloads. Most are easier than AKS/Kubernetes.

If you're using AWS, ECS or Lambda is comparatively easier than Kubernetes. In fact, AWS has at least 17 ways to deploy containerized workloads. Incidentally, with any cloud, it's possible to create a virtual machine and run a containerized workload on it: I don't recommend this. Bottom line: Kubernetes AWS ECS, and Azure Containerized Instances are application runners.

If you adopt Kubernetes as a hosting mechanism, the benefits should pay for the additional complexity. Otherwise, the additional maintenance headaches and costs of Kubernetes are not a wise investment. That begs the question, what types of applications are most appropriate for Kubernetes? Which application types should adopt a more straightforward hosting mechanism?

As an aside, containerizing your applications is the mechanism that separates the concern of hosting from the functionality of your application. Cloud vendors have numerous ways to deploy and run containerized applications. Vendor lock-in usually isn't an issue with containerization.

Workloads Well-Suited for Kubernetes

This section details common types of applications that may benefit from Kubernetes hosting.

Applications with dozens or hundreds of cloud-native services often benefit from Kubernetes. Kubernetes can handle autoscaling and availability concerns among the different services with less set-up per service. Additionally, cloud vendors have integrated their security and monitoring frameworks in a way that generally makes management of Kubernetes-hosted services homogenous with the rest of your cloud footprint.

Applications servicing multiple customers that require single-tenant deployments often benefit from Kubernetes. For these applications, there is one deployment of a set of services per "customer". For example, if there are 1000 customers, there will be 1000 deployments of the same set of services for each one. Given that the number of deployments grows and shrinks as customers come and go, Kubernetes streamlines the setup for each as well as provides constructs to keep each of the customer deployments separate.

Applications requiring custom Domain Name Services (DNS) resolution that can't be delegated to the enterprise custom DNS often benefit from Kubernetes. This is because Kubernetes has configurable internal networking capabilities. Often this happens more as a result of organizational structure than technical reasons. In many enterprises, DNS is managed by infrastructure teams and not application teams.

Applications in enterprises where IP address conservation is necessary can benefit from Kubernetes. For applications with internal services that only Kubernetes-hosted services needs access to, Kubernetes internal networking model called kubenet provides an internal network not visible outside the cluster. For example, an application with hundreds of microservices may only need to expose a small fraction of those services outside the cluster. The kubenet networking model conserves IP addresses as internal services don't need IP addressability outside the cluster.

In the cloud, we think of IP addresses as "free" and without charge. For firms with a large existing on premises network, IP address space is often not free. In most firms I've seen, on premises networks are tarballs without sensible IP address schemes. Often, nobody understands the entirety of what network CIDRs are in use and for what. Additionally, routing is often manually configured, making CIDR block additions labor-intensive.

Summary

Not every application should be hosted in Kubernetes. Simple applications with small transaction volume often don't require the additional complexity Kubernetes brings. 

Thanks for taking the time to read this post. I hope this helps.

Sunday, September 4, 2022

Policy-based Management Challenges and Solutions

One of the most common best practices for managing security in the cloud is policy-based management.  Policy-based management optimally prevents security breaches or at least alerts you to their presence. Additionally, it alieviates the need for as many manual reviews and approvals, which slow down development of new business capabilities. That said, policy-based management presents many challenges. This post details common challenges and tactics to overcome them. 

Challenge #1: Introducing New or Changed Policies

New or changes to existing policies often break existing infrastructure code (IaC) supporting existing applications. This occurs because at the time the IaC was constructed, the policy wasn't in place and the actions were allowed. This results in unplanned work for application teams and schedule disruptions. As policy makers are usually separate teams, they often don't pay the cost associated with the associated unplanned work.

Policy change announcements are often ineffective. Partially, this is due to volume of announcements in most organizations. The announcement of an individual policy gets lost in a sea of other announcements. Additionally, sometimes IaC developers do not completely understand or see the ramifications of the policy change. 

Challenge #2: Policies with Automatic Remediation

Installing policies that have automatic remediations in them can actually break existing infrastructure and the applications that rely on it. While automatic remediation for policies is appealing from a security perspective as it fixes an issue in a short time after a security hole is created, it really just kicks the can. Any resulting breakage will need to be repaired sometimes causing an outage for end users.

The IaC that produced the invalid infrastructure will no longer match the infrastructure that physically exists and needs to be changed. In other words, the automatic remediation causes unplanned work for other teams. Sometimes, new policies cause common IaC modules used by multiple teams to no longer work and not individual application infrastructure code.

Challenge #3: Adapting Policies to Advances in Technology 

Many policy makers only consider legacy mutable infrastructure. Mutable infrastructure is common on premises and consists of static virtual machines/servers that are created once and updated with new application releases when needed. Immutable infrastructure VMs are completely disposable. The VMs are still updated, but by updating the images they are created from and replacing the VMs in their entirety.

For example, it is common to place a policy that requires that automatic security updates be applied to virtual machines on a regular basis. The issue is that such policies assume that the VM has a long life as it would under a mutable infrastructure. Such a policy doesn't apply to immutable infrastructures. For immutable infrastructure, the base image needs security updates applied and any VMs built using it should be rebuilt and redeployed.

Cloud vendor technology changes at a rapid pace. Keeping cloud policies up to date with current advances is a challenge. In practice, policy makers are often out of date and make invalid assumptions. Effects of this I commonly see are:
  • Assuming that cloud vendor capabilities for securing network access remains the same.  Often,  these capabilities advance.
  • Assuming VM IP addresses are static can safely be used in firewall rules. In the cloud,  IP addresses can change quite frequently.
  • Assuming that VM images are changeable (vended provided images might not be)
  • Assuming that there will be no needed exceptions to security policies

Tactics to Mitigate Challenges

Always audit compliance to policies first before installing automatic remediation. That is alert teams of new compliance issues before changing anything automatically. This allows teams to accommodate a security policy change proactively before change is forced. Additionally, a reasonable lead time needs to be provided so that teams have the opportunity to mitigate the additional work.

Test new or changed policies with any related enterprise-wide common IaC modules. It is common for organizations with mature DevOps capabilities to centralize common IaC modules and reuse them for multiple applications. This allows organizations to leverage existing work instead of having multiple teams reinvent the wheel. For example, if a policy regarding AWS S3 buckets or Azure storage accounts is being changed, test any common IaC modules that use those constructs. Make policy compliance part of the test. Note that these tests should be automated so they can easily be rerun.

Any policies with automatic remediation must provide an exception capability. For example, if some VM images are purchased and not changeable according to the license. It is common for such images to be granted exceptions from related security policies. Additionally, I've seen exceptions granted for cloud vendor-provided Kubernetes clusters where underlying VMs don't and can't meet policy requirements. 

New policies should be deployed in lower environments first. This increases the chance that any errors or issues will be identified before the policy is applied to production. Be sure to allow a reasonable period of time in lower environments to increase the likelyhood that issues will be identified and addressed.

Summary

Policy-based management has challenges, but should still be considered best practice. Thanks for taking time to read this article. Please contact me if you have questions, concerns, or are experiencing challenges I've not listed here.