Take advantage of Portworx PX-Enterprise to simplify management of data-rich workloads on Kubernetes

Kubernetes has many core abstractions, sometimes called primitives, that make the experience of deploying and managing applications so much better than what came before. Understanding these abstractions helps you take full advantage of Kubernetes and avoid complexity—especially when running stateful applications like databases, data analytics, big data applications, streaming engines, machine learning, and AI apps. In this article, I’ll review some of the fundamental abstractions in Kubernetes storage, and walk through how Portworx PX-Enterprise helps solve important challenges that arise with the need for persistent storage in Kubernetes.

Kubernetes abstractions and Kubernetes storage

The Pod is a great example of a core Kubernetes abstraction. It’s actually the first example—the starting point. Back in 2015, other container orchestration systems started with a single container as the fundamental abstraction; Kubernetes started with Pods. A Pod is a group of one or more containers that need to run together to be useful. One simple analogy is that a Pod is like an outfit of clothing. It’s great to have a shirt and socks on, but let’s not walk out the door without pants! Pods are like that—they let us focus on what’s needed to be useful (a running outfit) and not overload us with bookkeeping minutiae (a shoelace, one sock). Don’t get me wrong, the minutiae are still tracked by the scheduler and the Kubelet (the Kubernetes agent). But it’s this abstraction that allows the ecosystem to build on Kubernetes and administrators to automate their infrastructure. And today, we see that most other schedulers have adopted the Pod concept, a sure sign of its usefulness.

The world of storage in Kubernetes has its primitives too, some of which may sound complex at first glance. These abstractions come together to take a complex problem—how to schedule efficiently when application demand is unpredictable—and provide a reliable solution. At the end of the day, you wouldn’t want to run in production without these abstractions.

Here are the Kubernetes abstractions that describe and control storage:

PersistentVolume (PV) – the representation for where data is held. Your infrastructure provider or storage vendor implements this. PVs are what you protect through standard means like backup, replication, and encryption.

PersistentVolumeClaim (PVC) – how a Pod requests a PersistentVolume, including the size of PV needed. After the request, the PVC becomes the reference between a Pod and its PersistentVolume. Now, you might ask, why not skip PVCs and have Pods use PersistentVolumes directly? Without the PVC concept, applications would be less portable, as we’ll explain later.

StorageClass (SC) – describes the types of storage that your infrastructure offers. For example, your provider may offer two flavors: fast SSD with encryption and slow HDD without encryption. Just like with the PVC, you might ask why this is needed. Again, these abstractions help with portability and let administrators prevent abuse by sloppy applications. We’ll also explain this point below.

Here are the Kubernetes abstractions that describe and control applications:

Pod – one or more containers that run on the same server, work together, and together form a basic unit of work.

Deployment – a controller that ensures the desired number of application Pods are running and that manages each Pod’s lifecycle. A lifecycle event might be adding more Pods or updating the version. A Pod definition is included within the written specification of a Deployment. A common question from customers is when to use a Deployment versus a StatefulSet. This is a good question that we’ll expand upon.

StatefulSet – manages the entirety of the database, instead of individual Pods and their PVCs. It’s important to remember that a horizontally scaling database, like Cassandra, runs with multiple Pods that work together. With a StatefulSet, you don’t have to think about how each Cassandra node (instance) relates to another node. Kubernetes does that for you. A minimal sketch of a StatefulSet appears just below this list.
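To make the StatefulSet idea concrete, here is a minimal sketch of what a three-node Cassandra ring might look like. It is illustrative only; the names (cassandra, cassandra-data, px-cassandra-sc) and the Cassandra version are assumptions, not part of the PostgreSQL example that follows. The key difference from a Deployment is the volumeClaimTemplates section: Kubernetes stamps out one PVC per Pod and gives each Pod a stable identity (cassandra-0, cassandra-1, cassandra-2).

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra        # headless Service that gives each Pod a stable network identity
  replicas: 3                   # three Cassandra Pods forming one ring
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: "cassandra:3.11"
        ports:
        - containerPort: 9042
          name: cql
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
  volumeClaimTemplates:
  # Kubernetes creates one PVC from this template for each Pod in the set
  - metadata:
      name: cassandra-data
      annotations:
        volume.beta.kubernetes.io/storage-class: px-cassandra-sc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 5Gi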
These are the fundamental Kubernetes primitives that enable portability and scalability. There is a subtle and powerful beauty to how these abstractions work together. Since StatefulSets wrap a lot of the underlying primitives, let’s start with a more basic example using PostgreSQL. Then we will directly touch on some of the primitives, starting with a PVC, and build upward.

Deploying PostgreSQL on Kubernetes

Customers love how Kubernetes manages applications on their infrastructure. As we walk through an example with PostgreSQL, we’ll see that the Kubernetes primitives were designed to be as portable as possible. Even before we equate portability with multi-cloud, portability means that the proper primitives enable apps to run, re-run, and re-re-run across servers. Portability and robustness are thus two sides of the same coin, which makes sense if we think about it: apps have to be portable across servers if they are to survive failures.

Back to our PostgreSQL example. The Pod identifies the container image and a PVC. Here, we will run the PostgreSQL database, so the container image is for version 10.1 of PostgreSQL. The Pod is written as a section within a Deployment specification in this example. Had we chosen Cassandra, we would have written our Pod as part of a StatefulSet. The Deployment not only holds a Pod definition, but also allows us to make updates to that Pod as it runs. Let’s look at all of this within the Deployment specification. I’ve added comments for explanation.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: postgres
spec:
  template:
    # Pod definition portion of this deployment specification
    metadata:
      labels:
        app: postgres
    spec:
      # Container to use the application image PostgreSQL 10.1
      containers:
      - image: "postgres:10.1"
        name: postgres
        envFrom:
        - configMapRef:
            name: example-config
        ports:
        - containerPort: 5432
          name: postgres
        volumeMounts:
        # Container to use the PVC below called 'postgres-data'
        - name: postgres-data
          # Container sees itself as writing to the directory below
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: postgres-data
        # PVC available to any containers in this Pod spec
        persistentVolumeClaim:
          claimName: postgres-data-claim
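The Deployment above pulls the container’s environment from a ConfigMap named example-config, which isn’t shown in the article. As a rough sketch of what it might contain (the values are placeholders, and the PGDATA subdirectory is a common convention for the official postgres image rather than something the example requires), it could look like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: example-config
data:
  POSTGRES_DB: appdb                        # database created on first startup
  POSTGRES_USER: appuser                    # owner of that database
  POSTGRES_PASSWORD: changeme               # in production, keep credentials in a Secret instead
  PGDATA: /var/lib/postgresql/data/pgdata   # write under a subdirectory of the mounted volume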
In the Deployment above, the Pod knows about its PVC but does not—and need not—know about the PersistentVolume. This part may feel a little roundabout, so bear with me. The PVC requests an amount of storage capacity and the type of storage to use. The PVC looks like this:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data-claim
  # Create a PersistentVolume using this StorageClass definition
  annotations:
    volume.beta.kubernetes.io/storage-class: px-postgres-sc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    # Create a PersistentVolume with 5 GB of storage capacity
    requests:
      storage: 5Gi

Up to this point, the application owner has been describing the app and its requirements. Now the infrastructure administrator gets involved, defining the types of storage available by publishing StorageClasses. It’s important to separate the concerns of the application from those of the infrastructure: the infrastructure admin needs to define what is sustainable in a shared cluster. Without such primitives, applications could trash each other—a problem that some Kubernetes alternatives are susceptible to.

In the StorageClass specification below, all PVCs that specify this StorageClass will have replication and encryption and will be configured for database I/O workloads. StorageClasses are storage-provider specific. Under the covers, the vendor who provides the PersistentVolume implements these features.

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: px-postgres-sc
provisioner: kubernetes.io/portworx-volume
parameters:
  # Replicate three copies using Portworx
  repl: "3"
  # Tune the I/O for the volume for databases
  io_profile: "db"
  # Encrypt the data using a key from a key management system
  secure: "true"

Running PostgreSQL on Kubernetes

To install all of the above primitives, the administrator starts with the StorageClass. Typically, administrators will design several StorageClasses, allowing for trade-offs between what different apps require and what the infrastructure can support. To publish the first StorageClass, the administrator runs the following command with the corresponding YAML file:

$ kubectl create -f px-storage-class.yaml
storageclass.storage.k8s.io "px-postgres-sc" created

The application owner can now use storage as defined by the StorageClass. An application Pod will use a PVC to request storage. Since this is a new application, a new PersistentVolume will be created that satisfies the PVC. To create the PVC, we run the following command with our PVC file:

$ kubectl create -f pvc.yaml
persistentvolumeclaim "postgres-data-claim" created

Now we are ready to deploy the PostgreSQL database. Our database will run as a Pod with a PostgreSQL container inside it. Since the Pod was defined within a Deployment specification, we create all of this by running the command on the Deployment YAML file:

$ kubectl create -f postgres-deployment.yaml
deployment.extensions "postgres" created

We can look backwards to see what was created. First, we ask for all PVCs by running the following command. Below, we see that the PVC we created is in a Bound status, meaning that it’s using a PersistentVolume. In other words, the storage primitives are ready for use.

$ kubectl get pvc
NAME          STATUS   VOLUME     CAPACITY   STORAGE CLASS   AGE
postgres...   Bound    pvc-3...   5Gi        px-...          17s

Next, we can look at the Pod that is running our PostgreSQL container. Below, we see that our PostgreSQL Pod is ready to serve requests.

$ kubectl get pods
NAME                    READY   STATUS    RESTARTS   AGE
postgres-dff54d66d...   1/1     Running   0          6s

Taking a step back, we see that we created all of the primitives needed to run a database. More than that, we standardized on the storage our infrastructure offers to applications, which improves the experience for all subsequent Pods. And we can now control the database Pod using our Deployment object, automating parts of the upgrade process.
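For example, a later upgrade could be as simple as pointing the Deployment at a newer image and letting Kubernetes roll the Pod forward. The commands below are a sketch, not part of the original walkthrough; the target version (10.2) is arbitrary, and for a ReadWriteOnce volume you would typically set the Deployment’s strategy to Recreate so that two Pods never mount the same volume at once. As discussed later, in production you would also snapshot the data first.

$ kubectl set image deployment/postgres postgres=postgres:10.2
$ kubectl rollout status deployment/postgres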
The entire stack and set of primitives are shown in the figure below.

[Figure: the entire stack of primitives. Image: Portworx]

The net result is that we have a language for expressing our desired state in production, we have Kubernetes managing the applications to meet that intent, and we have a way to share our storage infrastructure with other applications.

How Portworx addresses Kubernetes storage challenges

We just walked through the deployment of a stateful service using Kubernetes. Now it’s important to look at the production requirements for handling data-rich workloads. How do we resize a PersistentVolume? How do we encrypt microservices data while preserving the portability benefits? Let’s delve into these topics that matter in production.

The Kubernetes primitives are powerful because each application can now scale out (handle more requests) easily and independently of other applications. Moreover, you can update particular components with similar fine-grained control. But as with any application platform, you need an infrastructure that supports this flexibility. For stateful workloads that seek the benefits of Kubernetes, there are a number of common (and vendor-neutral) gaps in the storage infrastructure that present challenges.

Infrastructure limitations that impact stateful workloads on Kubernetes:

Application discrimination — isolate and tune I/O behavior based on applications, especially as servers are now shared among apps. Example: control over when Elasticsearch deletes or Cassandra compacts.

Clustered operations — allow for data access as Pods scale out across servers and across availability zones (when in public clouds). Oftentimes, the slowness of accessing storage becomes the concern.

Monitoring and visibility — understand performance as applications share infrastructure. Here, we can benefit from how labels can be used to tag across Pods and then down to disks.

Protection — ensure that backup, snapshots, and data protection mechanisms handle applications that are now dozens of Pods instead of a few large VMs.

Portability — help teams move their data as they move their compute, whether for development clusters getting promoted to test environments or for multi-cloud workloads.

There are many ways storage and infrastructure solutions can make Kubernetes the best way to run stateful applications. At Portworx, we have been working on making stateful workloads as easy and resilient as stateless workloads with Kubernetes. Below are some of the ways we have been investing in that.

Microservices first

Unlike past enterprise storage systems, PX-Enterprise is designed from the ground up for microservice applications. Much of our work has gone into extending the experience for Kubernetes users, and Portworx itself can be installed, extended, and controlled using Kubernetes. As a result, Portworx benefits from the same advantages that are driving microservices elsewhere in the enterprise: fast to deploy, free of hardware and vendor lock-in, easy to update, and highly available with managed uptime.

Kubernetes-driven

Because PX-Enterprise runs as a Pod, it can be installed directly via a container scheduler like Kubernetes, using the standard kubectl apply -f "[configuration spec]" command. Portworx has a spec generator that customers can use to automatically generate the configuration based on their own environment.
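As a rough sketch of that flow, the generated spec is applied like any other Kubernetes object, and the Portworx Pods can then be inspected with kubectl. The file name below is a placeholder, and the namespace and label selector assume a default installation; check the spec generator’s output for your environment.

$ kubectl apply -f px-spec.yaml
$ kubectl get pods -n kube-system -l name=portworx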
Hardware-independent

As a software-only storage and data management solution, it’s important that Portworx be able to run in any environment on any hardware. Our customers run in the public cloud, on premises, and in hybrid deployments. Portworx supports all of these configurations because enterprises require this flexibility.

Application-aware

Historically, storage products focused on providing storage capacity and performance (such as bandwidth and IOPS) from a centralized set of storage hardware, without getting involved in understanding the application. Container schedulers radically change the demands on storage. We now need to understand how data for dozens to thousands of application Pods needs to be prioritized, managed, and snapshotted — all on a Kubernetes-based infrastructure. The solution also needs to provide automation and data protection to be usable in production.

One challenge is that a microservices architecture encourages applications to operate independently in some cases and as a group in others. As an example of operating independently, a relational database like PostgreSQL or MySQL will often be deployed as a single application Pod. Before upgrading the database version, teams need to take a snapshot and a backup so that a failsafe exists. One concern is how to make these operations fast, safe, and automatable. From a Portworx perspective, this is handled by making sure applications send (flush) their contents to the data volume before taking a snapshot (see the left panel of the figure below).

[Figure: flushing application data to the volume before taking a snapshot. Image: Portworx]

Without such application-level tooling, a data volume is only crash-consistent, not application-consistent. This means that the application (MySQL in this example) must run recovery steps, and admins must often do manual verification before allowing the app to serve workloads again. This all takes more time and keeps us from realizing the automation that we seek from schedulers.

In other cases, scale-out applications like Cassandra run Pods across many servers. Together, the Pods form a single Cassandra ring and work together to provide higher throughput. In these scale-out cases it becomes important to be able to handle all of the Cassandra data volumes as a single group. Acting on each volume independently would otherwise introduce unwanted rebalancing that reduces predictability in production. In this scale-out case, the steps start the same (by flushing memory) but now end with a snapshot of all the data volumes as a group, as shown below.

[Figure: taking a snapshot of all data volumes as a single group. Image: Portworx]

Unlike legacy storage approaches, this distributed set of operations represents new data management functionality that needs to discriminate based on the use case (MySQL, Cassandra) while running on a shared infrastructure, just as Kubernetes does. At the same time, the experience needs to be integrated with Kubernetes in order to provide both automation and the intended data protection. For Portworx, we provide this functionality and a Kubernetes-native experience by integrating through Kubernetes scheduler extensions and a set of storage custom resources.
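To give a flavor of those custom resources, the sketch below shows a group snapshot of all PVCs labeled app: cassandra, with a pre-snapshot rule that flushes each Cassandra node first. This is illustrative only: the resource names are made up, and the fields shown (pvcSelector, preExecRule) are my reading of the Stork GroupVolumeSnapshot API rather than anything taken from this article, so consult the current Portworx documentation before relying on them.

apiVersion: stork.libopenstorage.org/v1alpha1
kind: GroupVolumeSnapshot
metadata:
  name: cassandra-group-snapshot
spec:
  # Snapshot every PVC carrying this label as one application-consistent group
  pvcSelector:
    matchLabels:
      app: cassandra
  # Name of a rule that flushes Cassandra memtables in each Pod before the snapshot is cut
  preExecRule: cassandra-flush-rule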
Multi-cloud and hybrid-cloud ready

Portworx installs itself as a Pod, can be managed by Kubernetes, deploys on almost any hardware, and is application-aware. It is a natural fit to support multi-cloud and hybrid-cloud workloads.

The key to multi-cloud operations for stateful services is overcoming data gravity, the idea that stateless components like load balancers and app containers are trivial to “move,” while stateful components like data volumes are difficult because data has mass (figuratively). Portworx overcomes data gravity, in part, by giving users the ability to snapshot application data with full application consistency, even across multiple nodes, and to move that data to a secondary environment along with its configuration. With the ability to move data and configuration, Portworx supports multi-environment workloads such as burst-to-cloud, blue-green deployments of stateful applications, and copy-data management for the purposes of reproducibility and debugging, as well as more traditional backup and recovery.

Data is as important as ever. If containers are to become as popular in the enterprise as VMs have been in the previous decade, then a solid storage and data management solution will be a requirement. Just as I couldn’t imagine a world in which VMware couldn’t run a database, I can’t imagine a world in which databases and other stateful services don’t run on Kubernetes. But containers, which are more dynamic and numerous than VMs by an order of magnitude, create problems for stateful services that traditional storage and data management solutions don’t solve. I’m excited to be working at Portworx to tackle these problems head-on. It’s an important mission.

Eric Han is vice president of product management at Portworx, the cloud-native storage company. He previously worked at Google, where he was the first product manager for Kubernetes.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.