Machine learning adoption has lagged in software development and operations (DevOps) due to the complexity of orchestration. Kubernetes simplifies operations by providing a single platform for managing ML workloads efficiently. This guide covers deploying ML models, configuring autoscaling, managing resources, and monitoring ML workloads.
In this guide, the term server refers to a dedicated GPU server that is suitable for running ML workloads, including training and inference. GPU containers require physical access to the graphics accelerators, which is unavailable on a VPS, for example.
Kubernetes has already proven its effectiveness in solving key challenges such as workload orchestration, scaling, and resource management, and these advantages apply directly to ML tasks.
Kubernetes has become a powerful tool for managing machine learning workloads at scale. Before diving into deployment and scaling strategies, let’s start with the basics of setup.
Setting up a Kubernetes cluster for machine learning involves configuring worker nodes, networking, and resource management for the workload. The result is an independent, dynamic, and cost-effective environment that lets ML engineers spend their time on model building rather than technical upkeep.
Deploying Kubernetes on a GPU server, or on cloud platforms such as AWS, GCP, and Azure, further enhances the experience.
The sections below walk through the default configuration steps for setting up Kubernetes for ML workloads.
Kubernetes makes resource management easier by reducing manual infrastructure setup for ML model deployment.
Before proceeding, ensure you have at least two dedicated servers (one master and one worker node) with SSH access, prepared with the settings below. The rest of this section walks through configuring the servers and deploying a Kubernetes cluster with Kubeadm for ML workloads.
Select Ubuntu 24.04 LTS as the operating system.
Choose servers that meet at least the Kubernetes minimums of 2 CPU cores and 2 GB of RAM per node; for ML workloads, the worker node should also have a supported NVIDIA GPU and sufficient disk space for models and datasets.
Next, configure the firewall rules to open the ports Kubernetes requires (an example using ufw follows below).
For the key pair, either create a new SSH key pair or use an existing one for server access.
Launch at least two servers (one master and one worker node).
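For the firewall step, these are the ports Kubernetes typically requires. A minimal sketch using ufw; adjust it to your own firewall, and note that the network plugin may need additional ports (Calico with BGP, for example, also uses 179/tcp):
sudo ufw allow 22/tcp            # SSH
sudo ufw allow 6443/tcp          # Kubernetes API server (master)
sudo ufw allow 2379:2380/tcp     # etcd (master)
sudo ufw allow 10250/tcp         # kubelet API (all nodes)
sudo ufw allow 10257/tcp         # kube-controller-manager (master)
sudo ufw allow 10259/tcp         # kube-scheduler (master)
sudo ufw allow 30000:32767/tcp   # NodePort services (workers)
sudo ufw enable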
Considering cloud hosting? Check out our tutorial to choose the best fit for your ML workloads.
Set the correct permissions for your private SSH key to ensure secure access:
chmod 600 /path/to/private_key
Connect to your server via SSH:
ssh -i /path/to/private_key ubuntu@server-ip
Update the system and install dependencies:
sudo apt update && sudo apt upgrade -y
sudo apt install -y apt-transport-https ca-certificates curl
Install Docker:
sudo apt install -y docker.io
sudo systemctl enable docker
sudo systemctl start docker
sudo systemctl status docker
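Note: kubeadm also requires swap to be disabled, and since Kubernetes 1.24 removed dockershim, the kubelet needs a CRI runtime such as containerd (installed together with docker.io) or cri-dockerd. A minimal sketch of the extra host preparation to run on every node, assuming containerd is used as the runtime:
# Disable swap (required by kubeadm)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab
# Enable bridged traffic and IP forwarding for Kubernetes networking
echo -e "net.bridge.bridge-nf-call-iptables = 1\nnet.ipv4.ip_forward = 1" | sudo tee /etc/sysctl.d/k8s.conf
sudo modprobe br_netfilter
sudo sysctl --system
# Use containerd as the CRI with the systemd cgroup driver
containerd config default | sudo tee /etc/containerd/config.toml
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
sudo systemctl restart containerd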
Install the Kubernetes packages. Add the Kubernetes APT repository:
echo "deb http://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt-get install -y kubeadm=1.24.0-00 kubelet=1.24.0-00 kubectl=1.24.0-00
sudo apt-mark hold kubelet kubeadm kubectl
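Note that the legacy apt.kubernetes.io repository used above has been deprecated and taken offline; current installs use the community-owned pkgs.k8s.io repositories, where the package revision suffix also differs from the old -00 scheme. A hedged sketch of the replacement repository setup; verify the exact minor-version path and package versions against the current Kubernetes documentation:
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.24/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.24/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubeadm kubelet kubectl
sudo apt-mark hold kubelet kubeadm kubectl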
First, set up the master node by running:
sudo kubeadm init --pod-network-cidr 192.168.0.0/16 --kubernetes-version 1.24.0
Configure kubectl on the master node:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Install the network plugin (Calico):
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
Generate the join command on the master node:
kubeadm token create --print-join-command
Copy the output and run it on the worker node:
sudo kubeadm join <master-node-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
Verify the nodes in the cluster. On the master node, check that the worker has joined:
kubectl get nodes
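If the worker node has a GPU, that node also needs the NVIDIA driver and NVIDIA Container Toolkit installed, plus the NVIDIA device plugin running in the cluster so that pods can request nvidia.com/gpu resources. A hedged sketch follows; the release tag and manifest path below are illustrative, so check the NVIDIA k8s-device-plugin README for the current command:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
# Confirm the GPU is advertised as an allocatable resource
kubectl describe node <worker-node-name> | grep nvidia.com/gpu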
Here are the YAML files to deploy an ML workload (TensorFlow Serving) on your Kubernetes cluster.
1. Create a deployment (ml-model-deployment.yaml).
This deployment runs a TensorFlow Serving container with a pre-trained ML model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: ml-model
          image: tensorflow/serving:latest
          ports:
            - containerPort: 8501
          volumeMounts:
            - name: model-storage
              mountPath: /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: ml-model-pvc
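To serve the model on the GPU instead of the CPU, a hedged variation of the container entry above swaps in the GPU image and requests a GPU through the device plugin (this assumes the NVIDIA device plugin from the cluster setup is running on that node):
      containers:
        - name: ml-model
          image: tensorflow/serving:latest-gpu
          ports:
            - containerPort: 8501
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: model-storage
              mountPath: /models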
2. Expose the deployment as a service (ml-model-service.yaml).
This exposes the ML model via a NodePort service.
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
    - protocol: TCP
      port: 8501
      targetPort: 8501
      nodePort: 30050
  type: NodePort
3. Persistent volume for model storage (ml-model-pv.yaml).
Stores the ML model on a persistent volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ml-model-pv
spec:
  capacity:
    storage: 5Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/ml-models"
4. Persistent volume claim (ml-model-pvc.yaml).
Define a persistent volume claim that connects your deployment to the volume:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
5. Apply the configurations.
Run the following commands to deploy the ML model:
kubectl apply -f ml-model-pv.yaml
kubectl apply -f ml-model-pvc.yaml
kubectl apply -f ml-model-deployment.yaml
kubectl apply -f ml-model-service.yaml
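Because the persistent volume uses a hostPath, the model files must actually exist on the worker node that runs the pod, and TensorFlow Serving expects a versioned SavedModel layout. A hedged sketch, with my_model as a placeholder directory name:
sudo mkdir -p /mnt/ml-models/my_model/1
# Copy saved_model.pb and the variables/ directory of your exported SavedModel into /mnt/ml-models/my_model/1/
With the stock tensorflow/serving image, either name the directory model (the image's default MODEL_NAME) or set a MODEL_NAME environment variable in the deployment so Serving loads the right model.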
After deployment, check the running pods and services:
kubectl get pods
kubectl get services
You can access the model at:
http://<server-public-ip>:30050/v1/models/<model-name>
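Once a model is loaded, the REST endpoint can be checked and queried with curl. A hedged example, since the model name and input shape depend on your model:
curl http://<server-public-ip>:30050/v1/models/<model-name>
curl -X POST http://<server-public-ip>:30050/v1/models/<model-name>:predict -d '{"instances": [[1.0, 2.0, 5.0]]}'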
This setup ensures that your TensorFlow Serving model runs on Kubernetes with persistent storage. For reliable deployments, Kubernetes supports a variety of deployment strategies, including blue-green deployments, A/B testing, and rolling updates. Learn about the various Kubernetes deployment options and their advantages.
ML workloads have fluctuating resource requirements, so scaling is important. For this purpose, Kubernetes provides the Horizontal Pod Autoscaler (HPA), which adjusts the number of running pods based on CPU or memory usage, and the Vertical Pod Autoscaler (VPA), which adjusts the resource requests of individual pods. Together, these mechanisms keep resources available without being overprovisioned.
Autoscaling relies on resource monitoring to scale in or out based on defined metrics. This Kubernetes for machine learning feature is especially valuable for ML services that can see sudden spikes in request volume.
By automating these decisions, Kubernetes helps keep costs down while still meeting the performance needs of ML workloads under diverse load conditions.
ML workloads tend to require changing amounts of resources. Kubernetes' Horizontal Pod Autoscaler (HPA) scales the number of replicas based on CPU or other custom metrics.
Refer to this guide on horizontal vs. vertical scaling to learn about the main distinctions and advantages of these strategies.
Below is an example HPA configuration that scales the ml-model deployment when average CPU utilization exceeds 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
kubectl apply -f ml-model-hpa.yaml
Check the status:
kubectl get hpa
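Note that the HPA only functions when the Kubernetes metrics API is available and the target pods declare CPU requests (see the resource management section below). If metrics-server is not already installed, it can be deployed from its official release manifest; on kubeadm clusters you may additionally need to add the --kubelet-insecure-tls flag to its container args:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get deployment metrics-server -n kube-system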
Machine learning workloads are resource-intensive, often consuming large amounts of CPU, memory, and GPU. Resource requests and limits in Kubernetes help optimize ML execution, ensure fair distribution across multiple ML processes, and avoid both resource starvation and overprovisioning.
Following resource management best practices, such as setting explicit requests and limits for every container and reserving GPUs for the workloads that need them, keeps ML models running with low latency and high performance and prevents inefficient GPU usage from overloading nodes.
Example configuration with requests and resource limits:
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "2"
    memory: "2Gi"
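Beyond per-container requests and limits, resource budgets can also be enforced per namespace with a ResourceQuota. A hedged sketch for a hypothetical ml namespace, with illustrative figures:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-quota
  namespace: ml
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    requests.nvidia.com/gpu: "2"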
With automated scaling and resource management built in, Kubernetes stands out as a containerized deployment environment. It supports complex ML workflows, including model serving, distributed training, and data processing pipelines.
Monitoring ML workloads ensures models run efficiently and any failures are quickly detected. We can use Prometheus and Grafana for monitoring and Fluentd, Elasticsearch, and Kibana (EFK) for logging.
Monitoring and logging are critical to the performance, reliability, and troubleshooting of machine learning workloads. Kubernetes offers both built-in capabilities and integrations with external tools that provide observability across ML models, training jobs, and inference services.
Kubernetes ML observability techniques:
Kubernetes provides resource usage metrics such as CPU, memory, and GPU that are critical to the efficient monitoring of ML workloads within the system.
Real-time collection of data on pod and node performance is usually done using Prometheus.
Trends and performance issues over time can be visualized using interactive dashboards provided by Grafana.
Logs from multiple pods and nodes can be collected and analyzed with EFK (Elasticsearch, Fluentd, Kibana) stack.
Cloud logging services such as AWS CloudWatch, GCP Cloud Logging, and Azure Monitor centralize and analyze logs across the infrastructure.
Prometheus integrated with Alertmanager sends alerts for spikes in CPU usage, memory leaks, or failing models when previously defined thresholds are crossed.
Problems can be escalated in real time through notifications sent via Slack, PagerDuty, or emails.
Step 1: Deploy Prometheus and Grafana:
git clone https://github.com/prometheus-operator/kube-prometheus.git && cd kube-prometheus
kubectl apply --server-side -f manifests/setup
kubectl apply -f manifests/
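kube-prometheus deploys everything into the monitoring namespace; once the pods are running, the Grafana and Prometheus UIs can be reached locally through port-forwarding, for example:
kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090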
Step 2: Deploy EFK stack for logging:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/fluentd-elasticsearch/es-statefulset.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/fluentd-elasticsearch/fluentd-es-ds.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/fluentd-elasticsearch/kibana-deployment.yaml
Step 3: After running these commands, verify that Prometheus components are running:
kubectl get pods -n monitoring
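With the Prometheus Operator in place, alert thresholds such as those mentioned above are expressed as PrometheusRule resources and routed through Alertmanager. A hedged sketch of a high-CPU alert for the ml-model pods; the selector, threshold, and labels depend on your setup:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-model-alerts
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: ml-model.rules
      rules:
        - alert: MLModelHighCPU
          expr: sum(rate(container_cpu_usage_seconds_total{pod=~"ml-model.*"}[5m])) > 1.5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: ml-model pods are using more CPU than expected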
You will then be able to monitor the status of ML workloads, identify bottlenecks, and respond to critical incidents in a timely manner.
Kubernetes offers a powerful and flexible way to manage machine learning workloads — from automation to scaling and resource optimization. ML teams can leverage Kubernetes for machine learning to efficiently deploy, scale, and monitor complex workflows with minimum downtime and optimal resource utilization.
With Prometheus, HPA, and GPU-aware scheduling for monitoring, dynamic scaling, and ML acceleration, respectively, Kubernetes provides a seamless infrastructure for ML workloads.
As the complexity of ML applications grows and workloads and business needs evolve, organizations that want fully automated, computationally efficient machine learning systems will increasingly turn to Kubernetes for machine learning. The recommendations in this guide will help teams build efficient, scalable, and robust ML systems.