Kubernetes HPA using GPU metrics

Ajay R
Mar 15, 2021


Auto Scaling Pods based on GPU metrics

Horizontal Pod Autoscaler (HPA)

The Horizontal Pod Autoscaler (HPA) automatically scales the number of Pods in a replication controller, deployment, replica set, or stateful set based on observed CPU utilization or, with custom metrics support, on other metrics.


The Horizontal Pod Autoscaler is implemented as a Kubernetes API resource and a controller. The controller periodically adjusts the number of replicas in a replication controller or deployment so that the observed average CPU utilization matches the target specified by the user.
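
The adjustment described above follows the formula given in the Kubernetes HPA documentation: desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The snippet below is a minimal sketch of that arithmetic only. The function name desired_replicas is made up for illustration, and the real controller also applies a tolerance (0.1 by default), the min/max replica bounds, and stabilization behavior, all of which are left out here.

```shell
#!/bin/sh
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Hypothetical helper: prints ceil(CURRENT_REPLICAS * CURRENT_METRIC / TARGET_METRIC)
# using integer arithmetic only.
desired_replicas() {
  current_replicas=$1
  current_metric=$2
  target_metric=$3
  # Adding (target - 1) before the integer division rounds the quotient up.
  echo $(( (current_replicas * current_metric + target_metric - 1) / target_metric ))
}

# Example: 4 replicas averaging 90% CPU against a 50% target.
desired_replicas 4 90 50
```

With 4 replicas at 90% average utilization and a 50% target, the ratio is 1.8, so the sketch prints 8.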

Deploy the Metrics Server

# Deploying the metrics server
$ kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/download/v0.4.1/components.yaml
# Verifying metrics-server status (it might take a few minutes)
$ kubectl get apiservice v1beta1.metrics.k8s.io -o json | jq '.status'

Testing HPA

Deploying sample application

$ kubectl create deployment php-apache --image=us.gcr.io/k8s-artifacts-prod/hpa-example
$ kubectl set resources deploy php-apache --requests=cpu=200m
$ kubectl expose deploy php-apache --port 80
$ kubectl get pod -l app=php-apache

Creating HPA resource

$ kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
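
The same autoscaler can also be written declaratively, which is easier to keep in version control. This is a minimal sketch, assuming the autoscaling/v1 API; the file name php-apache-hpa.yaml is a made-up example, and the values mirror the flags passed to kubectl autoscale.

```shell
#!/bin/sh
# Write a manifest equivalent to:
#   kubectl autoscale deployment php-apache --cpu-percent=50 --min=1 --max=10
# The file name is a hypothetical example.
cat > php-apache-hpa.yaml <<'EOF'
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50
EOF

# Apply it against a cluster with:
#   kubectl apply -f php-apache-hpa.yaml
```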

Creating load to test autoscaling

$ kubectl run -i --tty load-generator --image=busybox --restart=Never -- /bin/sh
# Execute this while loop from inside the busybox pod
$ while true; do wget -q -O - http://php-apache; done

Once average CPU utilization exceeds the 50% target, the number of php-apache replicas increases.

HPA based on GPU metrics

The Kubernetes metrics server only reports CPU and memory usage, so autoscaling pods on GPU usage requires fetching GPU metrics from a separate exporter.

Setting up DCGM (Data Center GPU Manager)

To gather GPU metrics in Kubernetes, it's recommended to use dcgm-exporter. Built on DCGM, dcgm-exporter exposes GPU metrics in a format Prometheus can scrape, and the metrics can be visualized with Grafana.

Deploying nvidia dcgm exporter

$ helm repo add gpu-helm-charts https://nvidia.github.io/gpu-monitoring-tools/helm-charts
$ helm repo update
$ helm install --generate-name gpu-helm-charts/dcgm-exporter

Setting up Prometheus

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
$ helm install prometheus-community/kube-prometheus-stack \
--create-namespace --namespace prometheus \
--generate-name

Installing the Prometheus adapter to expose custom metrics

# get the svc name using the following command
$ kubectl get svc -n prometheus -lapp=kube-prometheus-stack-prometheus
$ helm install prometheus-adapter prometheus-community/prometheus-adapter \
--set rbac.create=true,prometheus.url=http://<SVC_NAME>.prometheus.svc.cluster.local,prometheus.port=9090

Wait a few seconds, and you should be able to get custom metrics from the API:

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_MEM_COPY_UTIL

Note: The DCGM_FI_DEV_GPU_UTIL metric has been deprecated. To get this metric, update the image tag of dcgm-exporter to '2.0.13-2.1.2-ubuntu18.04'.
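
To read the current value of a single metric rather than grepping the full list, the custom metrics API can be queried per object. The sketch below only builds the request path, following the /namespaces/<namespace>/services/<name>/<metric> layout of custom.metrics.k8s.io/v1beta1. The helper name custom_metric_path is made up for illustration, and the service name remains a placeholder that depends on your dcgm-exporter release.

```shell
#!/bin/sh
# custom_metric_path NAMESPACE SERVICE METRIC
# Hypothetical helper: prints the custom metrics API path for a metric
# attached to a Service object.
custom_metric_path() {
  printf '/apis/custom.metrics.k8s.io/v1beta1/namespaces/%s/services/%s/%s\n' "$1" "$2" "$3"
}

# Usage against a cluster (replace the placeholder with the real service name):
#   kubectl get --raw "$(custom_metric_path default <DCGM-EXPORTER SVC NAME> DCGM_FI_DEV_GPU_UTIL)" | jq .
```

Whether the query returns a value depends on the adapter's rules associating the metric with the Service, so an empty or NotFound response usually points at the adapter configuration rather than at the HPA.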

Creating HPA resource for GPU

# gpu-hpa.yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-gpu
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: <DEPLOYMENT_NAME> # on which autoscaling has to be applied
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      target:
        kind: Service
        name: <DCGM-EXPORTER SVC NAME> # kubectl get svc | grep dcgm
      metricName: DCGM_FI_DEV_GPU_UTIL # assumed: the GPU utilization metric named in the note above
      targetValue: 80

Whenever GPU utilization rises above 80%, the deployment is scaled up, to at most 3 replicas.
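
To see how the 80 target and the replica bounds interact, here is a rough sketch of the replica arithmetic for an Object metric with a plain target value. It assumes the metric reported for the Service is a single utilization figure in percent; how the adapter aggregates readings across GPUs is not covered. The function name bounded_replicas is made up for illustration, and the controller's tolerance and stabilization windows are left out.

```shell
#!/bin/sh
# bounded_replicas CURRENT_REPLICAS METRIC_VALUE TARGET_VALUE MIN_REPLICAS MAX_REPLICAS
# Hypothetical helper: ceil(CURRENT_REPLICAS * METRIC_VALUE / TARGET_VALUE),
# then clamped to the [MIN_REPLICAS, MAX_REPLICAS] range from the HPA spec.
bounded_replicas() {
  current=$1
  metric=$2
  target=$3
  min=$4
  max=$5
  # Integer ceiling division.
  desired=$(( (current * metric + target - 1) / target ))
  if [ "$desired" -lt "$min" ]; then
    desired=$min
  fi
  if [ "$desired" -gt "$max" ]; then
    desired=$max
  fi
  echo "$desired"
}

# One replica at 95% utilization against a target of 80, bounds 1..3.
bounded_replicas 1 95 80 1 3
```

In this example the ratio 95/80 rounds up to 2 replicas. With 3 replicas already running at 100%, the unbounded result would be 4, but maxReplicas caps it at 3.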