How to Correctly Use GPUs with Docker and K8s
This article documents installing Docker and Kubernetes on Ubuntu Server 22.04 and using GPUs inside containers.
Environment
- K8s-Controller (V100 GPU)
  - OS: Ubuntu Server 22.04
  - IP: 192.168.137.154
  - Hostname: k8s-controller.com
- K8s-Node1 (T4 GPU)
  - OS: Ubuntu Server 22.04
  - IP: 192.168.137.168
  - Hostname: k8s-node1.com
- K8s-Node2 (T4 GPU)
  - OS: Ubuntu Server 22.04
  - IP: 192.168.137.249
  - Hostname: k8s-node2.com
If you are not yet familiar with the overall architecture for using GPUs on K8s, read this article first to get a rough idea of the components we will use. K8s GPU architecture overview: https://zhuanlan.zhihu.com/p/670798727
GPU Setup and Installation (required on every host)
Environment setup
- Disable nouveau, the open-source GPU driver
  When the OS is installed it ships with the open-source NVIDIA driver, called nouveau.
- Create the /etc/modprobe.d/blacklist-nouveau.conf file
sudo vim /etc/modprobe.d/blacklist-nouveau.conf
Add the following content:
blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off
- Create the /etc/modprobe.d/nouveau-kms.conf file and add `options nouveau modeset=0` to it:
echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
- Update the initramfs:
sudo update-initramfs -u
- Reboot the server:
sudo reboot
- Check whether nouveau is loaded; no output means it has been disabled:
sudo lsmod | grep nouveau
Package installation
There are two approaches: install everything at once (the CUDA Toolkit) or install each component separately. I recommend the all-in-one version.
Install the NVIDIA CUDA Toolkit (all-in-one version)
This uses the local deb installer, the Ubuntu 22.04 "local deb" variant.
- Remove old NVIDIA drivers
sudo apt-get --purge remove nvidia*
sudo apt-get --purge remove libnvidia*
- Install the CUDA Toolkit (it includes the driver, CUDA, NVCC, etc.)
- Base Installer:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.2-545.23.08-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.2-545.23.08-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
- Driver Installer: pick either one; I installed the first (legacy) option and that was enough.
- To install the legacy kernel module flavor:
sudo apt-get install -y cuda-drivers
- To install the open kernel module flavor:
sudo apt-get install -y nvidia-kernel-open-545
sudo apt-get install -y cuda-drivers-545
Additional notes:
You can go to the official site, pick your OS and version, and get the appropriate packages and steps:
Nvidia CUDA Toolkit:https://developer.nvidia.com/cuda-downloads
Installing only the NVIDIA GPU driver (CUDA still has to be installed afterwards):
In practice I hit an error that requires pinning the OS kernel version, so I did not dig into it!!
If you want to try installing the components individually, debug it yourself; if I ever find the right way I will add it here!
- wget the NVIDIA GPU driver
  - Find the driver for your GPU model and copy the download link:
    https://www.nvidia.com/Download/index.aspx?lang=en-us
  - Download the driver onto the machine:
    wget "Driver-Url"
- Install the GPU driver
chmod 777 "GPU-Driver"
./"GPU-Driver"
After running the installer, this error appeared:
ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.
ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See /var/log/nvidia-installer.log for details.
Adjust the kernel headers to match `uname -r` (or downgrade the kernel to a matching version), then install linux-kernel-headers and kernel-package:
- sudo apt-get install linux-kernel-headers kernel-package
Workaround reference: https://www.linuxprobe.com/ubuntu-nvidia-v100-gpu.html
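If you hit the kernel-module build failure above, a common cause is missing or mismatched kernel headers. A minimal check, assuming the stock Ubuntu header packages are what you want (re-run the driver installer afterwards):
```
# Show the running kernel version
uname -r
# Install headers matching the running kernel
sudo apt-get install -y linux-headers-$(uname -r)
```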
Note:
To use nvcc after installation:
- sudo nano ~/.bashrc
# Add at the bottom
export PATH="/usr/local/<cuda-version-folder>/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/<cuda-version-folder>/lib64:$LD_LIBRARY_PATH"

# Or use this instead
export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"

# Or use this (the one I use)
# ref: https://blog.csdn.net/qq_41094058/article/details/116207333
if [ $LD_LIBRARY_PATH ]; then
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:'/usr/local/cuda/lib64'
else
    export LD_LIBRARY_PATH='/usr/local/cuda/lib64'
fi
if [ $PATH ]; then
    export PATH=$PATH:'/usr/local/cuda/bin'
else
    export PATH='/usr/local/cuda/bin'
fi
if [ $CUDA_HOME ]; then
    export CUDA_HOME=$CUDA_HOME:'/usr/local/cuda'
else
    export CUDA_HOME='/usr/local/cuda'
fi
- source ~/.bashrc
- nvcc --version
To switch between multiple CUDA versions you can also just recreate the symlink (run this in /usr/local):
sudo rm -rf cuda
sudo ln -s /usr/local/cuda-11.1/ /usr/local/cuda
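To confirm the switch took effect, a quick check (cuda-11.1 above is just whichever version you linked):
```
ls -l /usr/local/cuda   # the symlink should point at the version you just linked
nvcc -V                 # nvcc should now report that version
```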
- Check Driver, CUDA, NVCC
nvidia-smi
nvcc -V
References:
ref:https://zhuanlan.zhihu.com/p/338507526
GPU-related resources:
- Nvidia Download Website: https://developer.nvidia.com/downloads
- Nvidia CUDA Toolkit : https://developer.nvidia.com/cuda-downloads
- Nvidia Driver : https://www.nvidia.com.tw/Download/index.aspx?lang=tw
GPU Check Command:
nvidia-smi
nvcc -V
GPU notes:
- Compatibility: match CUDA to the GPU driver; a lower CUDA version is fine as long as the GPU driver supports it.
- nvidia-smi shows the highest CUDA version the current driver supports, not the CUDA version that is actually installed.
- Component versions bundled in each CUDA Toolkit release: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
- CUDA: the compute platform built for general-purpose GPU computing
- cuDNN: a library designed for deep-learning computation
- CUDA Toolkit (NVIDIA): the complete CUDA installation package. It offers installable options such as the NVIDIA driver and the tools for developing CUDA programs, including the CUDA compiler, IDE, debugger, the libraries a CUDA program links against, and their header files.
- CUDA Toolkit (PyTorch): an incomplete CUDA package that mainly contains the dynamic libraries CUDA features depend on at runtime. It does not install the driver!!
- NVCC is the CUDA compiler and is just one part of the CUDA Toolkit
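A quick way to see both numbers side by side (a simple check using standard nvidia-smi and nvcc output):
```
# CUDA version printed by nvidia-smi = highest version the driver supports
nvidia-smi | grep "CUDA Version"
# CUDA version actually installed by the toolkit
nvcc -V | grep release
```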
K8s Installation (each step notes which node it applies to; if nothing is specified, run it on every node)
- set hostname (All nodes)
Set the hostname you want:
sudo hostnamectl set-hostname <k8smaster.example.net>
exec bash
- set /etc/hosts (All nodes)
# IP              Hostname   ServerName
192.168.137.154   k8sc.net   k8sc
192.168.137.168   k8sn1.net  k8sn1
192.168.137.249   k8sn2.net  k8sn2
- disable swap, selinux, firewall
- swap:
swapon --show
sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
sudo swapoff -a
- selinux:
sudo selinux-config-enforcing
or
sudo vim /etc/selinux/config
Modify => SELINUX=disabled
Reboot the system
- firewall:
sudo ufw status
sudo ufw disable
Reference: https://www.zhihu.com/question/374752553/answer/1052244227
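To double-check that swap really is off after the steps above (plain verification, nothing environment-specific):
```
free -h          # the Swap line should show 0B
swapon --show    # should print nothing
```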
- Load the kernel modules needed by the container runtime
sudo tee /etc/modules-load.d/containerd.conf <<EOF
overlay
br_netfilter
EOF
- reload modules
sudo modprobe overlay
sudo modprobe br_netfilter
- Configure the kernel sysctl parameters
sudo tee /etc/sysctl.d/kubernetes.conf <<EOF
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1
EOF
- reload the system settings
sudo sysctl --system
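To verify the new kernel parameters are active (plain sysctl reads of the keys set above):
```
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward
# all three should print "= 1"
```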
- install CRI prerequisites
sudo apt install -y curl gnupg2 software-properties-common apt-transport-https ca-certificates
- add the Docker apt repository
- Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
- Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
- install Docker
sudo apt-get update
# Install the Docker packages
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Create the docker group
sudo groupadd docker
# Add the current $USER to the docker group
sudo usermod -aG docker $USER
# Switch to the docker group
newgrp docker
- Configure Docker to use cgroupdriver=systemd
Because it has to work with K8s, which uses systemd, the Docker daemon needs this setting:
mkdir -p /etc/docker
vim /etc/docker/daemon.json
{ "exec-opts": ["native.cgroupdriver=systemd"] }
systemctl enable docker && systemctl restart docker
systemctl status docker
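To confirm Docker picked up the systemd cgroup driver, a quick check via docker info:
```
docker info | grep -i "cgroup driver"
# expected output: Cgroup Driver: systemd
```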
- install cri-dockerd
- V0.3.9:
wget https://github.com/Mirantis/cri-dockerd/releases/download/v0.3.9/cri-dockerd_0.3.9.3-0.ubuntu-jammy_amd64.deb
sudo dpkg -i cri-dockerd_0.3.9.3-0.ubuntu-jammy_amd64.deb
systemctl daemon-reload
systemctl enable cri-docker && systemctl start cri-docker && systemctl status cri-docker
- V0.3.10:
wget https://github.com/Mirantis/cri-dockerd/releases/download/v0.3.10/cri-dockerd_0.3.10.3-0.ubuntu-jammy_amd64.deb
sudo dpkg -i cri-dockerd_0.3.10.3-0.ubuntu-jammy_amd64.deb
systemctl daemon-reload
systemctl enable cri-docker && systemctl start cri-docker && systemctl status cri-docker
- Docker smoke test:
Run a container      => docker run --name hello-world hello-world
List containers      => docker ps -a
Remove the container => docker rm hello-world
- Add the Kubernetes apt repository:
(Recommended) Official installation using the native package manager:
sudo apt-get update
# apt-transport-https may be a dummy package; if so, you can skip it
sudo apt-get install -y apt-transport-https ca-certificates curl
# If the /etc/apt/keyrings directory does not exist, create it before running the curl command (see the note below)
sudo mkdir -p -m 755 /etc/apt/keyrings
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
!! Note: change the v1.29 part of the repository URL if you want a different version. To upgrade kubectl to another minor version, first update the version in /etc/apt/sources.list.d/kubernetes.list, then run apt-get update and apt-get upgrade.
ref: https://kubernetes.io/zh-cn/docs/tasks/tools/install-kubectl-linux/
- install the Kubernetes components: kubectl, kubeadm, kubelet
sudo apt update
sudo apt install -y kubelet kubeadm kubectl
# (optional) pin the versions
sudo apt-mark hold kubelet kubeadm kubectl
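To confirm the three components were installed and report the expected v1.29.x release, plain version checks:
```
kubeadm version
kubectl version --client
kubelet --version
```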
- Kubeadm init (Controller Node)
Important!! Add or adjust the parameters according to your own environment!!
Change --pod-network-cidr <cidr_ip> to the CIDR you plan to use later.
- Normal command:
sudo kubeadm init --control-plane-endpoint=k8sc.net --pod-network-cidr=172.168.0.0/16
- Docker cri-dockerd command:
sudo kubeadm init --control-plane-endpoint=k8sc.net --pod-network-cidr=172.168.0.0/16 --cri-socket unix:///run/cri-dockerd.sock
- After a successful init you will see:
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

Alternatively, if you are the root user, you can run:

  export KUBECONFIG=/etc/kubernetes/admin.conf

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of control-plane nodes by copying certificate authorities
and service account keys on each node and then running the following as root:

  kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
        --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095 \
        --control-plane

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
        --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095
- Worker nodes join the cluster (Worker Nodes)
Adjust the kubeadm join command to match the cluster you just created.
- Normal command:
```
kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
        --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095
```
- Docker cri-dockerd command:
```
kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
        --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095 --cri-socket unix:///run/cri-dockerd.sock
```
- Check that all nodes have joined (Master Node)
kubectl get nodes
- install the Calico Pod network add-on (Master Node)
This uses v3.27.0. Because we ran kubeadm init with --pod-network-cidr=172.168.0.0/16, the config file has to be modified.
```
# Install the operator Calico needs
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml
# Download the configuration file
wget https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/custom-resources.yaml
```
Edit the configuration file:
```
# The default is 192.168.0.0, which collides with our IPs, so change it to 172.168.0.0
nano custom-resources.yaml

# This section includes base Calico installation configuration.
# For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.Installation
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  # Configures Calico networking.
  calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 172.168.0.0/16        # <= change this
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
---
```
```
# Deploy the CNI
kubectl create -f custom-resources.yaml
```
- Wait for all pods to be Running
```
watch kubectl get pods -n calico-system
or
watch kubectl get pods --all-namespaces

# Remove the taint on the control-plane nodes
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
# Remove the taint on the master nodes
kubectl taint nodes --all node-role.kubernetes.io/master-
```
References:
https://docs.tigera.io/calico/latest/getting-started/kubernetes/quickstart
https://www.cnblogs.com/khtt/p/16563088.html
- Check the STATUS of all cluster nodes
kubectl get nodes -o wide
If every node shows Ready as below, K8s is ready to use!!
NAME            STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
k8smaster.net   Ready    control-plane   52m   v1.29.1   192.8.1.66    <none>        Ubuntu 22.04.3 LTS   5.15.0-78-generic   containerd://1.6.27
k8snode1.net    Ready    <none>          23m   v1.29.1   192.8.1.65    <none>        Ubuntu 22.04.3 LTS   5.15.0-78-generic   containerd://1.6.27
k8snode2.net    Ready    <none>          23m   v1.29.1   192.8.1.69    <none>        Ubuntu 22.04.3 LTS   5.15.0-78-generic   containerd://1.6.27
Troubleshooting:
- Rebuilding K8s after an "init error":
# kubeadm reset
(using containerd.io)
kubeadm reset
or
(using cri-dockerd)
kubeadm reset --cri-socket unix:///run/cri-dockerd.sock
# Remove the k8s config files
rm -rf $HOME/.kube
# Restart the k8s services
systemctl daemon-reload && systemctl restart kubelet
Then go back to the "Kubeadm init (Controller Node)" step.
- Rebuilding K8s "after the CNI has been created":
# kubeadm reset
(using containerd.io)
kubeadm reset
or
(using cri-dockerd)
kubeadm reset --cri-socket unix:///run/cri-dockerd.sock
# Remove the CNI config files
rm -rf /etc/cni/net.d
# Remove the iptables rules the previous CNI created
sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
# Remove the k8s config file
sudo rm -f $HOME/.kube/config
- If kubectl get nodes shows the following:
root@k8sc:~/K8s# kubectl get node
E0201 16:03:23.299872   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0201 16:03:23.300534   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0201 16:03:23.302286   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0201 16:03:23.302901   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
E0201 16:03:23.304646   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Fix:
=> mkdir ~/.kube
=> cp /etc/kubernetes/admin.conf ~/.kube/config
Reference: https://www.gbase8.cn/12320
Using GPUs with Docker
Every host that has a GPU needs this.
- install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
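Newer toolkit releases also ship the nvidia-ctk helper, which can write the runtime entry into /etc/docker/daemon.json for you. A sketch of that alternative; note the manual daemon.json in the next step additionally sets "default-runtime": "nvidia", which you would still add by hand unless your nvidia-ctk version supports a set-as-default option:
```
# Register the "nvidia" runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```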
- Set up /etc/docker/daemon.json
- nano /etc/docker/daemon.json
{ "exec-opts": ["native.cgroupdriver=systemd"], "default-runtime": "nvidia", "runtimes": { "nvidia": { "path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": [] } } }
- sudo systemctl restart docker
- Check that a container can use the GPU
docker run --rm -it nvcr.io/nvidia/cuda:10.2-base nvidia-smi
If you see an nvidia-smi screen like the one below, your Docker containers can use the GPU!!
Thu Feb  1 09:55:47 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:18:00.0 Off |                    0 |
| N/A   28C    P8              9W /  70W  |      7MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
References:
ref:https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes
Install NGC
To let K8s discover GPU resources, we need images from NVIDIA NGC.
- Install the NGC command line
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.37.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
chmod u+x ngc-cli/ngc
echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
ref: https://ngc.nvidia.com/setup/installers/cli
- Set the NGC config
- ngc.nvidia.com
- Create an account
- Click your avatar in the top-right corner, then Setup -> API Key
- Generate API Key
- ngc config set
- Enter your API key and the other requested information
- docker login nvcr.io
When done it will look like this:
Username: enter => $oauthtoken
Password: enter your token
Confirm that you can connect and pull images:
root@k8sc:~/NVD# docker login nvcr.io
Username: $oauthtoken
Password:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning.
See https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
docker pull nvcr.io/nvidia/k8s-device-plugin:v0.14.4
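The nvidia-device-plugin DaemonSet that the next check expects still has to be deployed on the cluster. A minimal sketch, assuming the static manifest published in the NVIDIA/k8s-device-plugin repo for v0.14.4 (run on the control plane):
```
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.4/nvidia-device-plugin.yml
# Confirm the DaemonSet rolled out on every GPU node
kubectl get daemonset -n kube-system | grep nvidia-device-plugin
```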
- Check that the pods are Running
root@k8sc:~/NVD/test# kubectl get pods -A
NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE
........      ........                               ......
kube-system   nvidia-device-plugin-daemonset-g7pn5   1/1     Running   0          11m
kube-system   nvidia-device-plugin-daemonset-j5t79   1/1     Running   0          11m
kube-system   nvidia-device-plugin-daemonset-pjjr9   1/1     Running   0          11m
........      ........                               ......
- Check kubectl describe nodes
If nvidia.com/gpu is listed, it worked:
Addresses:
  InternalIP:  192.168.137.249
  Hostname:    k8sn2.net
Capacity:
  cpu:                96
  ephemeral-storage:  3843514416Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395820888Ki
  nvidia.com/gpu:     5
  pods:               110
Allocatable:
  cpu:                96
  ephemeral-storage:  3542182879921
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             395718488Ki
  nvidia.com/gpu:     5
  pods:               110
System Info:
  Machine ID:                 99a1ca0433d9443aafba35201ede1a9b
  System UUID:                d8c50c1b-e0ef-2445-bc45-140d4f639386
  Boot ID:                    5a8a1c1f-f155-44af-9e70-105e821bd24c
  Kernel Version:             6.5.0-15-generic
  OS Image:                   Ubuntu 22.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://25.0.2
  Kubelet Version:            v1.29.1
  Kube-Proxy Version:         v1.29.1
PodCIDR:                      172.168.2.0/24
PodCIDRs:                     172.168.2.0/24
Non-terminated Pods:          (5 in total)
  Namespace      Name                                  CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------      ----                                  ------------  ----------  ---------------  -------------  ---
  calico-system  calico-node-rvhj8                     0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  calico-system  calico-typha-5f87879b7d-tjwld         0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  calico-system  csi-node-driver-b58b7                 0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  kube-system    kube-proxy-xcf67                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
  kube-system    nvidia-device-plugin-daemonset-pjjr9  0 (0%)        0 (0%)      0 (0%)           0 (0%)         14m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests  Limits
  --------           --------  ------
  cpu                0 (0%)    0 (0%)
  memory             0 (0%)    0 (0%)
  ephemeral-storage  0 (0%)    0 (0%)
  hugepages-1Gi      0 (0%)    0 (0%)
  hugepages-2Mi      0 (0%)    0 (0%)
  nvidia.com/gpu     0         0
Events:              <none>
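To list the GPU capacity of every node at once instead of reading through each kubectl describe, a custom-columns query can help (a sketch; the escaped key is the same nvidia.com/gpu resource shown above):
```
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```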
References:
ref:
https://github.com/NVIDIA/k8s-device-plugin
https://bluesmilery.github.io/blogs/afcb1072/
Troubleshooting:
- nvidia-smi reports an error after a reboot:
sudo ubuntu-drivers devices
# Install the driver version marked "recommended"
sudo apt-get install nvidia-driver-<*>
ref:
https://zhuanlan.zhihu.com/p/337013545
https://www.zhihu.com/question/474222642
- After create -f for the k8s device plugin, the control-plane node gets no work scheduled:
The taint-removal commands from the Calico install step were skipped; run them and it will work.
# Remove the taint on the control-plane nodes
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
# Remove the taint on the master nodes
kubectl taint nodes --all node-role.kubernetes.io/master-
GPU Burn
We use wilicc/gpu-burn as the GPU burn program and package it into an image.
- Build wilicc/gpu-burn into an image
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
nano Dockerfile
# Adjust the CUDA version and how it runs so it is easier to use with k8s later
ARG CUDA_VERSION=12.3.1
ARG IMAGE_DISTRO=ubuntu22.04
FROM nvidia/cuda:${CUDA_VERSION}-devel-${IMAGE_DISTRO} AS builder
WORKDIR /build
COPY . /build/
RUN make
FROM nvidia/cuda:${CUDA_VERSION}-runtime-${IMAGE_DISTRO}
COPY --from=builder /build/gpu_burn /app/
COPY --from=builder /build/compare.ptx /app/
WORKDIR /app
# Create a /app/result directory and link it to the local ./result directory
RUN mkdir /app/result && ln -s /app/result /result
# Build the image
docker build -t gpu_burn .
# Confirm the image was built
docker images
Private Docker Registry Server
Because the other nodes need to pull the image we just built and we don't want to push it to Docker Hub, we run our own private Docker registry.
# Start the registry with Docker
docker run -d --restart always -p 5000:5000 -v /root/K8s/registry:/var/lib/registry --name registry registry:2
# Tag the image we just built
docker tag <images_name> <registries_ip>:5000/<images_name>
# On every server that will pull the image, add the following to the Docker config file
# (a complete merged daemon.json example is shown after this command sequence)
nano /etc/docker/daemon.json (all nodes)
"live-restore": true,
"group": "dockerroot",
"insecure-registries": ["<registries_ip>:5000"]
# Restart Docker so it reads the new config (all nodes)
systemctl restart docker
# Push the tagged image
docker push <registries_IP>:5000/<images_name>
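For reference, a sketch of what the merged /etc/docker/daemon.json could look like on these hosts after all the edits in this guide. live-restore is optional, and "group": "dockerroot" is only meaningful if that group exists on your system (the group created earlier here was docker), so it is left out of this sketch:
```
{
    "exec-opts": ["native.cgroupdriver=systemd"],
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "live-restore": true,
    "insecure-registries": ["<registries_ip>:5000"]
}
```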
Running GPU-Burn on K8s
gpu-burn-CN5c.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-controller
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: k8sc.net
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 5 # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-node1
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: k8sn1.net
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 5 # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-node2
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: k8sn2.net
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 5 # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
gpu-burn-R4c-3pod.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-random
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  completions: 3  # how many Pods must complete
  parallelism: 3  # how many Pods run at the same time
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 4 # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
Create the job:
=> kubectl create -f <yaml>
Delete the job:
=> kubectl delete -f <yaml>
List the job pods:
=> kubectl get pods
Show detailed job/pod information:
=> kubectl describe pods
or
=> kubectl describe pod <pod_name>
View the execution logs:
=> kubectl logs <pod_name>
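After a job finishes, the burn output lands in the hostPath defined in the YAML above, since the container redirects gpu_burn's output into /app/result/output.txt. A quick way to check the results (output.txt and /root/gpu-result come from the manifests above):
```
# On the node that ran the pod: read the redirected gpu_burn output
cat /root/gpu-result/output.txt
# From the control plane: confirm the jobs completed
kubectl get jobs
kubectl get pods -l app=gpu-burn
```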