Machine learning adoption has lagged in software development and operations (DevOps) due to the complexity of orchestration. Kubernetes simplifies operations by providing a single platform for managing ML workloads efficiently. This guide covers deploying ML models, configuring autoscaling, managing resources, and monitoring ML workloads.
In this guide, the term server refers to a dedicated GPU server that is suitable for running ML workloads, including training and inference. GPU containers require physical access to the graphics accelerators, which is unavailable on a VPS, for example.
Kubernetes has already proven its effectiveness in solving key challenges such as workload orchestration, scaling, and resource management, and these advantages apply directly to ML tasks.
Kubernetes has become a powerful tool for managing machine learning workloads at scale. Before diving into deployment and scaling strategies, let’s start with the basics of setup.
Setting up a Kubernetes cluster for machine learning involves configuring worker nodes, networking, and resource management for the workload. The result is an independent, dynamic, and cost-effective environment that lets ML engineers spend their time on model building rather than technical upkeep.
Deploying Kubernetes on a GPU server, or on cloud platforms such as AWS, GCP, and Azure, further enhances the experience.
The sections below walk through the default configuration steps for setting up Kubernetes for ML workloads.
Kubernetes makes resource management easier by reducing manual infrastructure setup for ML model deployment.
Before proceeding, ensure you have at least two dedicated servers (one master and one worker node) with SSH access, prepared with the settings below. The rest of this section walks through configuring the servers and deploying a Kubernetes cluster with Kubeadm for ML workloads.
Select Ubuntu 24.04 LTS as the operating system.
Choose servers that meet at least the Kubernetes minimums of 2 CPU cores and 2 GB of RAM per node; for ML workloads, the worker node should also have a supported NVIDIA GPU and sufficient disk space for models and datasets.
Next, configure the firewall rules to open the ports Kubernetes requires (an example using ufw follows below).
For the key pair, either create a new SSH key pair or use an existing one for server access.
Launch at least two servers (one master and one worker node).
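For the firewall step, these are the ports Kubernetes typically requires. A minimal sketch using ufw; adjust it to your own firewall, and note that the network plugin may need additional ports (Calico with BGP, for example, also uses 179/tcp):
sudo ufw allow 22/tcp            # SSH
sudo ufw allow 6443/tcp          # Kubernetes API server (master)
sudo ufw allow 2379:2380/tcp     # etcd (master)
sudo ufw allow 10250/tcp         # kubelet API (all nodes)
sudo ufw allow 10257/tcp         # kube-controller-manager (master)
sudo ufw allow 10259/tcp         # kube-scheduler (master)
sudo ufw allow 30000:32767/tcp   # NodePort services (workers)
sudo ufw enable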
Considering cloud hosting? Check out our tutorial to choose the best fit for your ML workloads.
Set the correct permissions for your private SSH key to ensure secure access:
chmod 600 /path/to/private_key
Connect to your server via SSH:
ssh -i /path/to/private_key ubuntu@server-ip
Update the system and install dependencies:
sudo apt update && sudo apt upgrade -y
sudo apt install -y apt-transport-https ca-certificates curl
Install Docker:
sudo apt install -y docker.io
sudo systemctl enable docker
sudo systemctl start docker
sudo systemctl status docker
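Note: kubeadm also requires swap to be disabled, and since Kubernetes 1.24 removed dockershim, the kubelet needs a CRI runtime such as containerd (installed together with docker.io) or cri-dockerd. A minimal sketch of the extra host preparation to run on every node, assuming containerd is used as the runtime:
# Disable swap (required by kubeadm)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Enable bridged traffic and IP forwarding for Kubernetes networking
echo -e "net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 1" | sudo tee /etc/sysctl.d/k8s.conf
sudo modprobe br_netfilter
sudo sysctl --system
# Use containerd as the CRI with the systemd cgroup driver
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd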
Install the Kubernetes packages. Add the Kubernetes APT repository:
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt-get install -y kubeadm=1.24.0-00 kubelet=1.24.0-00 kubectl=1.24.0-00
sudo apt-mark hold kubelet kubeadm kubectl
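Note that the legacy apt.kubernetes.io repository used above has been deprecated and taken offline; current installs use the community-owned pkgs.k8s.io repositories, where the package revision suffix also differs from the old -00 scheme. A hedged sketch of the replacement repository setup; verify the exact minor-version path and package versions against the current Kubernetes documentation:
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.24/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.24/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubeadm kubelet kubectl
sudo apt-mark hold kubelet kubeadm kubectl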
First, set up the master node by running:
sudo kubeadm init --pod-network-cidr 192.168.0.0/16 --kubernetes-version 1.24.0
Configure kubectl on the master node:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Install the network plugin (Calico):
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Generate the join command on the master node:
kubeadm token create --print-join-command
Copy the output and run it on the worker node:
sudo kubeadm join <master-node-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
Verify the nodes in the cluster. On the master node, check that the worker has joined:
kubectl get nodes
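If the worker node has a GPU, that node also needs the NVIDIA driver and NVIDIA Container Toolkit installed, plus the NVIDIA device plugin running in the cluster so that pods can request nvidia.com/gpu resources. A hedged sketch follows; the release tag and manifest path below are illustrative, so check the NVIDIA k8s-device-plugin README for the current command:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Confirm the GPU is advertised as an allocatable resource
kubectl describe node <worker-node-name> | grep nvidia.com/gpu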
Here are the YAML files to deploy an ML workload (TensorFlow Serving) on your Kubernetes cluster.
1. Create a deployment (ml-model-deployment.yaml).
This deployment runs a TensorFlow Serving container with a pre-trained ML model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: ml-model
          image: tensorflow/serving:latest
          ports:
            - containerPort: 8501
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ml-model-pvc
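To serve the model on the GPU instead of the CPU, a hedged variation of the container entry above swaps in the GPU image and requests a GPU through the device plugin (this assumes the NVIDIA device plugin from the cluster setup is running on that node):
      containers:
        - name: ml-model
          image: tensorflow/serving:latest-gpu
          ports:
            - containerPort: 8501
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models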
2. Expose the deployment as a service (ml-model-service.yaml).
This exposes the ML model via a NodePort service.
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
    - protocol: TCP
      port: 8501
      targetPort: 8501
      nodePort: 30050
  type: NodePort
3. Persistent volume for model storage (ml-model-pv.yaml).
Stores the ML model on a persistent volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ml-model-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/ml-models"
4. Persistent volume claim (ml-model-pvc.yaml).
Define a persistent volume claim that connects your deployment to the volume:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
5. Apply the configurations.
Run the following commands to deploy the ML model:
kubectl apply -f ml-model-pv.yaml
kubectl apply -f ml-model-pvc.yaml
kubectl apply -f ml-model-deployment.yaml
kubectl apply -f ml-model-service.yaml
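Because the persistent volume uses a hostPath, the model files must actually exist on the worker node that runs the pod, and TensorFlow Serving expects a versioned SavedModel layout. A hedged sketch, with my_model as a placeholder directory name:
sudo mkdir -p /mnt/ml-models/my_model/1
# Copy saved_model.pb and the variables/ directory of your exported SavedModel into /mnt/ml-models/my_model/1/
With the stock tensorflow/serving image, either name the directory model (the image's default MODEL_NAME) or set a MODEL_NAME environment variable in the deployment so Serving loads the right model.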
After deployment, check the running pods and services:
kubectl get pods
kubectl get services
You can access the model at:
http://<server-public-ip>:30050/v1/models/<model-name>
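Once a model is loaded, the REST endpoint can be checked and queried with curl. A hedged example, since the model name and input shape depend on your model:
curl http://<server-public-ip>:30050/v1/models/<model-name>
curl -X POST http://<server-public-ip>:30050/v1/models/<model-name>:predict -d '{"instances": [[1.0, 2.0, 5.0]]}'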
This setup ensures that your TensorFlow Serving model runs on Kubernetes with persistent storage. For reliable deployments, Kubernetes supports a variety of deployment strategies, including blue-green deployments, A/B testing, and rolling updates. Learn about the various Kubernetes deployment options and their advantages.
ML workloads have fluctuating resource requirements, so scaling is important. For this purpose, Kubernetes provides the Horizontal Pod Autoscaler (HPA), which adjusts the number of running pods based on CPU or memory usage, and the Vertical Pod Autoscaler (VPA), which adjusts the resource requests of individual pods. Together, these mechanisms keep resources available without being overprovisioned.
Autoscaling relies on resource monitoring to scale in or out based on defined metrics. This Kubernetes for machine learning feature is especially valuable for ML services that can see sudden spikes in request volume.
By automating these decisions, Kubernetes helps keep costs down while still meeting the performance needs of ML workloads under diverse load conditions.
ML workloads tend to require changing amounts of resources. Kubernetes' Horizontal Pod Autoscaler (HPA) scales the number of replicas based on CPU or other custom metrics.
Refer to this guide on horizontal vs. vertical scaling to learn about the main distinctions and advantages of these strategies.
Below is an example HPA configuration that scales the ml-model deployment when average CPU utilization exceeds 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
kubectl apply -f ml-model-hpa.yaml
Check the status:
kubectl get hpa
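Note that the HPA only functions when the Kubernetes metrics API is available and the target pods declare CPU requests (see the resource management section below). If metrics-server is not already installed, it can be deployed from its official release manifest; on kubeadm clusters you may additionally need to add the --kubelet-insecure-tls flag to its container args:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get deployment metrics-server -n kube-system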
Machine learning workloads are resource-intensive, often consuming large amounts of CPU, memory, and GPU. Resource requests and limits in Kubernetes help optimize ML execution, ensure fair distribution across multiple ML processes, and avoid both resource starvation and overprovisioning.
Following resource management best practices, such as setting explicit requests and limits for every container and reserving GPUs for the workloads that need them, keeps ML models running with low latency and high performance and prevents inefficient GPU usage from overloading nodes.
Example configuration with requests and resource limits:
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "2Gi"
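Beyond per-container requests and limits, resource budgets can also be enforced per namespace with a ResourceQuota. A hedged sketch for a hypothetical ml namespace, with illustrative figures:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-quota
  namespace: ml
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    requests.nvidia.com/gpu: "2"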
With automated scaling and resource management built in, Kubernetes stands out as a containerized deployment environment. It supports complex ML workflows, including model serving, distributed training, and data processing pipelines.
Monitoring ML workloads ensures models run efficiently and any failures are quickly detected. We can use Prometheus and Grafana for monitoring and Fluentd, Elasticsearch, and Kibana (EFK) for logging.
Monitoring and logging are critical to the performance, reliability, and troubleshooting of machine learning workloads. Kubernetes offers both built-in capabilities and integrations with external tools that provide observability across ML models, training jobs, and inference services.
Kubernetes ML observability techniques:
Kubernetes provides resource usage metrics such as CPU, memory, and GPU that are critical to the efficient monitoring of ML workloads within the system.
Real-time collection of data on pod and node performance is usually done using Prometheus.
Trends and performance issues over time can be visualized using interactive dashboards provided by Grafana.
Logs from multiple pods and nodes can be collected and analyzed with EFK (Elasticsearch, Fluentd, Kibana) stack.
Cloud logging services such as AWS CloudWatch, GCP Cloud Logging, and Azure Monitor centralize and analyze logs across the infrastructure.
Prometheus integrated with Alertmanager sends alerts for spikes in CPU usage, memory leaks, or failing models when previously defined thresholds are crossed.
Problems can be escalated in real time through notifications sent via Slack, PagerDuty, or emails.
Step 1: Deploy Prometheus and Grafana:
git clone https://github.com/prometheus-operator/kube-prometheus.git && cd kube-prometheus
kubectl apply --server-side -f manifests/setup
kubectl apply -f manifests/
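kube-prometheus deploys everything into the monitoring namespace; once the pods are running, the Grafana and Prometheus UIs can be reached locally through port-forwarding, for example:
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090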
Step 2: Deploy EFK stack for logging:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/fluentd-elasticsearch/es-statefulset.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/fluentd-elasticsearch/fluentd-es-ds.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/fluentd-elasticsearch/kibana-deployment.yaml
Step 3: After running these commands, verify that Prometheus components are running:
kubectl get pods -n monitoring
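With the Prometheus Operator in place, alert thresholds such as those mentioned above are expressed as PrometheusRule resources and routed through Alertmanager. A hedged sketch of a high-CPU alert for the ml-model pods; the selector, threshold, and labels depend on your setup:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-model-alerts
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: ml-model.rules
      rules:
        - alert: MLModelHighCPU
          expr: sum(rate(container_cpu_usage_seconds_total{pod=~"ml-model.*"}[5m])) > 1.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: ml-model pods are using more CPU than expected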
You will then be able to monitor the status of ML workloads, identify bottlenecks, and respond to critical incidents in a timely manner.
Kubernetes offers a powerful and flexible way to manage machine learning workloads — from automation to scaling and resource optimization. ML teams can leverage Kubernetes for machine learning to efficiently deploy, scale, and monitor complex workflows with minimum downtime and optimal resource utilization.
With Prometheus, HPA, and GPU-aware scheduling for monitoring, dynamic scaling, and ML acceleration, respectively, Kubernetes provides a seamless infrastructure for ML workloads.
As the complexity of ML applications grows and workloads and business needs evolve, organizations that want fully automated, computationally efficient machine learning systems will increasingly turn to Kubernetes for machine learning. The recommendations in this guide will help teams build efficient, scalable, and robust ML systems.