These are my notes on installing Docker and Kubernetes on Ubuntu Server 22.04 and using GPUs inside containers

Environment

  • K8s-Controller(V100 GPU)

    • OS:Ubuntu Server 22.04
    • IP:192.168.137.154
    • Hostname:k8s-controller.com
  • K8s-Node1(T4 GPU)

    • OS:Ubuntu Server 22.04
    • IP:192.168.137.168
    • Hostname:k8s-node1.com
  • K8s-Node2(T4 GPU)

    • OS:Ubuntu Server 22.04
    • IP:192.168.137.249
    • Hostname:k8s-node2.com

If you are not familiar with the overall architecture for using GPUs on K8s, read this article first to get a rough picture of the components we will use. K8s GPU architecture overview: https://zhuanlan.zhihu.com/p/670798727

GPU Setup and Installation (required on every machine)

Environment Setup

  1. Disable nouveau, the open-source GPU driver
    After the OS is installed, the open-source NVIDIA driver, named nouveau, is in use by default
    • Create the /etc/modprobe.d/blacklist-nouveau.conf file
    • sudo vim /etc/modprobe.d/blacklist-nouveau.conf and add the following:
      blacklist nouveau
      blacklist lbm-nouveau
      options nouveau modeset=0
      alias nouveau off
      alias lbm-nouveau off
      
    • Create the /etc/modprobe.d/nouveau-kms.conf file and add options nouveau modeset=0 to it:
      echo options nouveau modeset=0 | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      
    • Update the initramfs:
      sudo update-initramfs -u
      
    • Reboot the server:
      sudo reboot
      
    • Check whether nouveau is still loaded; no output means it has been disabled:
      sudo lsmod | grep nouveau
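
    The steps above can also be bundled into one small script; a minimal sketch that simply repeats the same commands (reboot manually afterwards):
      #!/usr/bin/env bash
      # Blacklist the nouveau driver and rebuild the initramfs; reboot afterwards.
      set -euo pipefail
      printf '%s\n' \
        'blacklist nouveau' \
        'blacklist lbm-nouveau' \
        'options nouveau modeset=0' \
        'alias nouveau off' \
        'alias lbm-nouveau off' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
      echo 'options nouveau modeset=0' | sudo tee -a /etc/modprobe.d/nouveau-kms.conf
      sudo update-initramfs -u
      echo 'Done. Reboot, then verify with: lsmod | grep nouveau'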
      

Package Installation

There are two approaches: installing everything at once (the CUDA Toolkit) or installing each small component separately. I recommend the all-in-one version

Install the Nvidia CUDA Toolkit (all-in-one version)

This uses the local deb install, i.e. the Ubuntu 22.04 local deb package

  1. Remove old NVIDIA drivers
    sudo apt-get --purge remove nvidia*
    sudo apt-get --purge remove libnvidia*
    
  2. Install the CUDA Toolkit (it bundles the driver, CUDA, NVCC, etc.)
    • Base Installer:
      wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
      
      sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
      
      wget https://developer.download.nvidia.com/compute/cuda/12.3.2/local_installers/cuda-repo-ubuntu2204-12-3-local_12.3.2-545.23.08-1_amd64.deb
      
      sudo dpkg -i cuda-repo-ubuntu2204-12-3-local_12.3.2-545.23.08-1_amd64.deb
      
      sudo cp /var/cuda-repo-ubuntu2204-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
      
      sudo apt-get update
      
      sudo apt-get -y install cuda-toolkit-12-3
      
    • Driver Installer: pick one of the two; I just installed the first (legacy) one and that was enough
      • To install the legacy kernel module flavor:
        sudo apt-get install -y cuda-drivers
        
      • To install the open kernel module flavor:
        sudo apt-get install -y nvidia-kernel-open-545
        
        sudo apt-get install -y cuda-drivers-545
        
Note:
You can go to the official site, pick your OS and version, and follow the matching packages and steps:
Nvidia CUDA Toolkit:https://developer.nvidia.com/cuda-downloads
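
After installing the toolkit and driver (and rebooting), it is worth confirming the driver actually loaded before moving on; a minimal check sketch using standard nvidia-smi query flags:

  # the driver should answer and list every GPU with its version and memory
  nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv
  # the proprietary kernel module should be loaded instead of nouveau
  lsmod | grep nvidia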

Installing only the Nvidia GPU driver (CUDA still has to be installed afterwards):

In practice this hit an error that requires pinning the OS kernel version, and I did not dig into it!!
So if you want to install the components one by one, you will have to debug it yourself; if I ever find the correct steps I will add them here!
  1. wget NVIDIA GPU Driver

  2. install GPU Driver

    chmod 777 "GPU-Driver"
    ./"GPU-Driver"
    

    After running the installer, it showed this error

    ERROR: An error occurred while performing the step: "Building kernel modules". See /var/log/nvidia-installer.log for details.
    ERROR: An error occurred while performing the step: "Checking to see whether the nvidia kernel module was successfully built". See
         /var/log/nvidia-installer.log for details.
    

    Workaround: make the kernel headers match uname -r (or downgrade the kernel to a matching version), then install linux-kernel-headers and kernel-package:

    • sudo apt-get install linux-kernel-headers kernel-package
    ref: https://www.linuxprobe.com/ubuntu-nvidia-v100-gpu.html
    

Note:

To use nvcc after installation:

  1. sudo nano ~/.bashrc

    # add at the bottom
    export PATH="/usr/local/<cuda-version-folder>/bin:$PATH"
    export LD_LIBRARY_PATH="/usr/local/<cuda-version-folder>/lib64:$LD_LIBRARY_PATH"
    
    # or this also works
    export PATH="/usr/local/cuda/bin:$PATH"
    export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"
    
    # or this also works (this is what I use)
    # ref: https://blog.csdn.net/qq_41094058/article/details/116207333
    if [ $LD_LIBRARY_PATH ]; then
        export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:'/usr/local/cuda/lib64'
    else
        export LD_LIBRARY_PATH='/usr/local/cuda/lib64'
    fi
    
    if [ $PATH ]; then
        export PATH=$PATH:'/usr/local/cuda/bin'
    else
        export PATH='/usr/local/cuda/bin'
    fi
    
    if [ $CUDA_HOME ]; then
        export CUDA_HOME=$CUDA_HOME:'/usr/local/cuda'
    else
        export CUDA_HOME='/usr/local/cuda'
    fi
    
  2. source ~/.bashrc

  3. nvcc --version — to switch between multiple installed CUDA versions, you can also recreate the symlink:

    • sudo rm -rf cuda
    • sudo ln -s /usr/local/cuda-11.1/ /usr/local/cuda
  4. Check Driver, CUDA, NVCC

    • nvidia-smi
    • nvcc -V
References:
ref: https://zhuanlan.zhihu.com/p/338507526

GPU-related resources:

GPU check commands:

  • nvidia-smi
  • nvcc -V

GPU notes:

  • Compatibility: the GPU driver is what CUDA is matched against; you can install a somewhat older CUDA version as long as the GPU driver is compatible with it
  • nvidia-smi shows the highest CUDA version the current driver supports, not the CUDA version that is actually installed.
  • Component versions bundled in each CUDA Toolkit release: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
  • CUDA: the compute platform built for general-purpose GPU computing
  • cuDNN: a library designed for deep-learning computation
  • CUDA Toolkit (NVIDIA): the complete CUDA installer package; it offers the NVIDIA driver, the tools for developing CUDA programs, and more as installable options, including the CUDA compiler, IDE, debugger, the libraries CUDA programs link against, and their header files
  • CUDA Toolkit (PyTorch): an incomplete CUDA package; it mainly contains the dynamic libraries that CUDA features depend on at runtime. It does not install the driver!!
  • NVCC is the CUDA compiler and is only one part of the CUDA Toolkit

K8s Installation (each step notes which node it runs on; if nothing is specified, install it on every node)

  1. set hostname (All nodes)
    Set each machine to the hostname you want for it

    • sudo hostnamectl set-hostname <k8smaster.example.net>
    • exec bash
  2. set /etc/hosts (All nodes)

    //    IP          Hostname  ServerName
    192.168.137.154   k8sc.net  k8sc
    192.168.137.168   k8sn1.net k8sn1
    192.168.137.249   k8sn2.net k8sn2
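    
    If you prefer not to edit the file by hand, the same entries can be appended with tee; a minimal sketch (adjust the IPs and hostnames to your own environment):
      printf '%s\n' \
        '192.168.137.154   k8sc.net  k8sc' \
        '192.168.137.168   k8sn1.net k8sn1' \
        '192.168.137.249   k8sn2.net k8sn2' | sudo tee -a /etc/hosts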
    
  3. disable swap, selinux, firewall

    • swap:

      swapon --show
      sudo sed -i '/ swap / s/^\(.*\)$/#\1/g' /etc/fstab
      sudo swapoff -a
      
    • selinux:

      sudo selinux-config-enforcing
        or
      sudo vim /etc/selinux/config
      
      Modify => SELINUX=disabled
      
      Reboot system
      
    • firewall:

      sudo ufw status
      sudo ufw disable
      
    References:
    ref: https://www.zhihu.com/question/374752553/answer/1052244227
    
  4. configure the required kernel modules

    sudo tee /etc/modules-load.d/containerd.conf <<EOF
    overlay
    br_netfilter
    EOF
    
  5. load the modules

    sudo modprobe overlay
    sudo modprobe br_netfilter
    
  6. configure the kernel parameters (sysctl)

    sudo tee /etc/sysctl.d/kubernetes.conf <<EOF
    net.bridge.bridge-nf-call-ip6tables = 1
    net.bridge.bridge-nf-call-iptables = 1
    net.ipv4.ip_forward = 1
    EOF
    
  7. apply the sysctl settings

    sudo sysctl --system
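    
    A quick check that the modules are loaded and the settings took effect (each sysctl value should print 1); a minimal sketch:
      lsmod | grep -E 'overlay|br_netfilter'
      sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward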
    
  8. install prerequisite packages

    sudo apt install -y curl gnupg2 software-properties-common apt-transport-https ca-certificates
    
  9. add docker apt repository

    • Add Docker’s official GPG key:

      sudo apt-get update
      
      sudo apt-get install ca-certificates curl gnupg
      
      sudo install -m 0755 -d /etc/apt/keyrings
      
      curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
      
      sudo chmod a+r /etc/apt/keyrings/docker.gpg
      
    • Add the repository to Apt sources:

      echo \
        "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
        $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
        sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
      
  10. install Docker

    sudo apt-get update
    
    # install the Docker packages
    sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
    
    # create the docker group
    sudo groupadd docker
    
    # add the current $USER to the docker group
    sudo usermod -aG docker $USER
    
    # switch to the docker group
    newgrp docker
    
  11. Configure Docker to use cgroupdriver=systemd
    Since K8s uses systemd, the Docker daemon has to be configured to match

    • mkdir -p /etc/docker
    • vim /etc/docker/daemon.json
      {
        "exec-opts": ["native.cgroupdriver=systemd"]
      }
      
    • systemctl enable docker && systemctl restart docker
    • systemctl status docker
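    To confirm Docker actually picked up the new cgroup driver, you can grep docker info; a minimal check sketch:
      # should print "Cgroup Driver: systemd"
      docker info 2>/dev/null | grep -i 'cgroup driver'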
  12. install cri-dockerd

    • V0.3.9:
      wget https://github.com/Mirantis/cri-dockerd/releases/download/v0.3.9/cri-dockerd_0.3.9.3-0.ubuntu-jammy_amd64.deb
      
      sudo dpkg -i cri-dockerd_0.3.9.3-0.ubuntu-jammy_amd64.deb
      
      systemctl daemon-reload
      
      systemctl enable cri-docker && systemctl start cri-docker && systemctl status cri-docker
      
    • V0.3.10:
      wget https://github.com/Mirantis/cri-dockerd/releases/download/v0.3.10/cri-dockerd_0.3.10.3-0.ubuntu-jammy_amd64.deb
      
      sudo dpkg -i cri-dockerd_0.3.10.3-0.ubuntu-jammy_amd64.deb
      
      systemctl daemon-reload
      
      systemctl enable cri-docker && systemctl start cri-docker && systemctl status cri-docker
      
    • Docker smoke test:
      Run a container => docker run --name hello-world hello-world
      Check resources => docker ps -a
      Remove the container => docker rm hello-world
      
  13. Add Kubernetes apt repository:

    (Recommended) the official install using the native package manager
    sudo apt-get update
    
    # apt-transport-https may be a dummy package; if so, you can skip it
    sudo apt-get install -y apt-transport-https ca-certificates curl
    
    # if the /etc/apt/keyrings directory does not exist, create it before running the curl command; see the note below.
    sudo mkdir -p -m 755 /etc/apt/keyrings
    
    curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.29/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
    
    echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.29/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list
    
    !!Note:
    To use a different version, change the v?.?? part of the URLs
    To upgrade kubectl to another minor version, first bump the version in /etc/apt/sources.list.d/kubernetes.list, then run apt-get update and apt-get upgrade.
    ref: https://kubernetes.io/zh-cn/docs/tasks/tools/install-kubectl-linux/
    
  14. install the Kubernetes tools: kubelet, kubeadm, kubectl

    sudo apt update
    sudo apt install -y kubelet kubeadm kubectl
    (option) sudo apt-mark hold kubelet kubeadm kubectl
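    
    To confirm the tools are installed at the expected version, a quick check sketch:
    kubectl version --client
    kubeadm version
    kubelet --version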
    
  15. Kubeadm init (Controller Node)

    Important!!
    Add or adjust the flags to match your own environment!!
    Change --pod-network-cidr <cidr_ip> to the CIDR you actually plan to use later
    
    • Normal command:
      sudo kubeadm init --control-plane-endpoint=k8sc.net --pod-network-cidr=172.168.0.0/16
      
    • Docker cri-dockerd command:
      sudo kubeadm init --control-plane-endpoint=k8sc.net --pod-network-cidr=172.168.0.0/16 --cri-socket unix:///run/cri-dockerd.sock
      
    • When initialization succeeds it prints
      Your Kubernetes control-plane has initialized successfully!
      
      To start using your cluster, you need to run the following as a regular user:
      
        mkdir -p $HOME/.kube
        sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
        sudo chown $(id -u):$(id -g) $HOME/.kube/config
      
      Alternatively, if you are the root user, you can run:
      
        export KUBECONFIG=/etc/kubernetes/admin.conf
      
      You should now deploy a pod network to the cluster.
      Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
        https://kubernetes.io/docs/concepts/cluster-administration/addons/
      
      You can now join any number of control-plane nodes by copying certificate authorities
      and service account keys on each node and then running the following as root:
      
        kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
        --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095 \
        --control-plane
      
      Then you can join any number of worker nodes by running the following on each as root:
      
      kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
        --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095
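      
    The token in that output expires (24 hours by default), so if you need to join a node later, regenerate a fresh join command on the controller; a minimal sketch:
      # prints a ready-to-use "kubeadm join ..." command with a new token
      kubeadm token create --print-join-command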
      
  16. Worker nodes join the cluster (WorkNode)

    Adjust the kubeadm join command according to the output of the cluster you just initialized
    
    - Normal command:
      ```
      kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
      --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095
      ```
    - Docker cri-dockerd command:
      ```
      kubeadm join k8sc.net:6443 --token m4mgvk.ds9gbxubeelkyg1d \
      --discovery-token-ca-cert-hash sha256:967a99a31596e6d6ad9b40dabf69813b8c605f9fe1c8590ddbe68fa23d58e095 --cri-socket unix:///run/cri-dockerd.sock
      ```
    
  17. Check that all nodes have joined (MasterNode) - kubectl get nodes

  18. install the Calico Pod network plugin (MasterNode)

    This uses v3.27.0. Because we initialized with --pod-network-cidr=172.168.0.0/16, the config file has to be modified
    
    ```
    # create the operator it needs
    kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/tigera-operator.yaml
    
    # download the config file
    wget https://raw.githubusercontent.com/projectcalico/calico/v3.27.0/manifests/custom-resources.yaml
    ```
    Modify the config file:
    ```
    # the default 192.168.0.0 collides with our host IPs, so change it to 172.168.0.0
    nano custom-resources.yaml
    
    # This section includes base Calico installation configuration.
    # For more information, see: https://docs.tigera.io/calico/latest/reference/installation/api#operator.tigera.io/v1.Installation
    apiVersion: operator.tigera.io/v1
    kind: Installation
    metadata:
      name: default
    spec:
      # Configures Calico networking.
      calicoNetwork:
        # Note: The ipPools section cannot be modified post-install.
        ipPools:
        - blockSize: 26
          cidr: 172.168.0.0/16   # <= change this
          encapsulation: VXLANCrossSubnet
          natOutgoing: Enabled
          nodeSelector: all()
    ---
    ```
    
    ```
    # deploy the CNI
    kubectl create -f custom-resources.yaml
    ```
    
    - wait for all pods to be Running
        ```
        watch kubectl get pods -n calico-system
        or
        watch kubectl get pods --all-namespaces
    
        # remove the taint on the control-plane
        kubectl taint nodes --all node-role.kubernetes.io/control-plane-
    
        # remove the taint on the master
        kubectl taint nodes --all node-role.kubernetes.io/master-
        ```
    References:
    ref:
    https://docs.tigera.io/calico/latest/getting-started/kubernetes/quickstart
    https://www.cnblogs.com/khtt/p/16563088.html
    
  19. Check All Cluster Node STATUS

    kubectl get nodes -o wide
    

    When every node shows Ready like below, the K8s cluster is ready to use!!

    NAME            STATUS   ROLES           AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
    k8smaster.net   Ready    control-plane   52m   v1.29.1   192.8.1.66    <none>        Ubuntu 22.04.3 LTS   5.15.0-78-generic   containerd://1.6.27
    k8snode1.net    Ready    <none>          23m   v1.29.1   192.8.1.65    <none>        Ubuntu 22.04.3 LTS   5.15.0-78-generic   containerd://1.6.27
    k8snode2.net    Ready    <none>          23m   v1.29.1   192.8.1.69    <none>        Ubuntu 22.04.3 LTS   5.15.0-78-generic   containerd://1.6.27
    

Troubleshooting:

  • Rebuilding K8s after an "initialization error":
    # run kubeadm reset
    (using containerd.io)
    kubeadm reset
      or
    (using cri-dockerd)
    kubeadm reset --cri-socket unix:///run/cri-dockerd.sock
    # remove the k8s config files
    rm -rf $HOME/.kube
    # restart the k8s services
    systemctl daemon-reload && systemctl restart kubelet
    Then go back to step 15
    
  • Rebuilding K8s after the CNI has already been created:
    # run kubeadm reset
    (using containerd.io)
    kubeadm reset
      or
    (using cri-dockerd)
    kubeadm reset --cri-socket unix:///run/cri-dockerd.sock
    # delete the CNI config files
    rm -rf /etc/cni/net.d
    # flush the iptables rules the CNI created
    sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
    # delete the k8s config file
    sudo rm -f $HOME/.kube/config
    
  • If kubectl get nodes shows
    root@k8sc:~/K8s# kubectl get node
    E0201 16:03:23.299872   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
    E0201 16:03:23.300534   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
    E0201 16:03:23.302286   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
    E0201 16:03:23.302901   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
    E0201 16:03:23.304646   18786 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp 127.0.0.1:8080: connect: connection refused
    The connection to the server localhost:8080 was refused - did you specify the right host or port?
    
    Solution:
    => mkdir ~/.kube
    => cp /etc/kubernetes/admin.conf ~/.kube/config
    References:
    ref: https://www.gbase8.cn/12320
    

Docker use GPU

Required on every host that has a GPU

  1. install nvidia-container-toolkit

    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    
    curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add -
    
    curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/libnvidia-container.list
    
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
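    
    Recent nvidia-container-toolkit releases also ship the nvidia-ctk helper, which can write the runtime entry from the next step into /etc/docker/daemon.json for you while keeping existing keys such as exec-opts; a sketch, assuming your toolkit version provides nvidia-ctk with the --set-as-default flag:
      sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
      sudo systemctl restart docker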
    
  2. Setting /etc/docker/daemon.json

    • nano /etc/docker/daemon.json
      {
      "exec-opts": ["native.cgroupdriver=systemd"],
      "default-runtime": "nvidia",
      "runtimes": {
          "nvidia": {
                  "path": "/usr/bin/nvidia-container-runtime",
                  "runtimeArgs": []
              }
          }
      }
      
    • sudo systemctl restart docker
  3. check container can use GPU

    docker run --rm -it nvcr.io/nvidia/cuda:10.2-base nvidia-smi
    

    If you see an nvidia-smi screen like the one below, your Docker containers can use the GPU!!

    Thu Feb  1 09:55:47 2024
    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  Tesla T4                       On  | 00000000:18:00.0 Off |                    0 |
    | N/A   28C    P8               9W /  70W |      7MiB / 15360MiB |      0%      Default |
    |                                         |                      |                  N/A |
    +-----------------------------------------+----------------------+----------------------+
    
References:
ref:https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#enabling-gpu-support-in-kubernetes

Install NGC

To let K8s discover GPU resources, we need images from Nvidia NGC

  1. Install NGC Command Line

    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.37.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
    
    find ngc-cli/ -type f -exec md5sum {} + | LC_ALL=C sort | md5sum -c ngc-cli.md5
    
    chmod u+x ngc-cli/ngc
    
    echo "export PATH=\"\$PATH:$(pwd)/ngc-cli\"" >> ~/.bash_profile && source ~/.bash_profile
    
    ref: https://ngc.nvidia.com/setup/installers/cli
    
  2. Set up the NGC config

    1. ngc.nvidia.com
    2. Create an account
    3. Top-right avatar: Setup -> API Key
    4. Generate API Key
    5. ngc config set
    6. Enter your API key and the requested settings
    7. docker login nvcr.io
      Username:
      enter => $oauthtoken
      
      Password:
      enter your token
      
      When it succeeds it looks like this
      root@k8sc:~/NVD# docker login nvcr.io
      Username: $oauthtoken
      Password:
      WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
      Configure a credential helper to remove this warning. See
      https://docs.docker.com/engine/reference/commandline/login/#credentials-store
      
      Login Succeeded
      
      Confirm you can reach the registry and pull the image back
      docker pull nvcr.io/nvidia/k8s-device-plugin:v0.14.4
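      
      Before the pods in the next step can show up, the device-plugin DaemonSet itself has to be deployed from the controller; a minimal sketch, assuming the static manifest URL from the NVIDIA/k8s-device-plugin README for v0.14.4:
      kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.4/nvidia-device-plugin.yml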
      
  3. Check that the pods are Running

    root@k8sc:~/NVD/test# kubectl get pods -A
    NAMESPACE          NAME                                       READY   STATUS    RESTARTS       AGE
    ........           ........                                   ......
    kube-system        nvidia-device-plugin-daemonset-g7pn5       1/1     Running   0              11m
    kube-system        nvidia-device-plugin-daemonset-j5t79       1/1     Running   0              11m
    kube-system        nvidia-device-plugin-daemonset-pjjr9       1/1     Running   0              11m
    ........           ........                                   ......
    
  4. Check kubectl describe nodes
    If nvidia.com/gpu shows up, it worked

    Addresses:
      InternalIP:  192.168.137.249
      Hostname:    k8sn2.net
    Capacity:
      cpu:                96
      ephemeral-storage:  3843514416Ki
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             395820888Ki
      nvidia.com/gpu:     5
      pods:               110
    Allocatable:
      cpu:                96
      ephemeral-storage:  3542182879921
      hugepages-1Gi:      0
      hugepages-2Mi:      0
      memory:             395718488Ki
      nvidia.com/gpu:     5
      pods:               110
    System Info:
      Machine ID:                 99a1ca0433d9443aafba35201ede1a9b
      System UUID:                d8c50c1b-e0ef-2445-bc45-140d4f639386
      Boot ID:                    5a8a1c1f-f155-44af-9e70-105e821bd24c
      Kernel Version:             6.5.0-15-generic
      OS Image:                   Ubuntu 22.04.3 LTS
      Operating System:           linux
      Architecture:               amd64
      Container Runtime Version:  docker://25.0.2
      Kubelet Version:            v1.29.1
      Kube-Proxy Version:         v1.29.1
    PodCIDR:                      172.168.2.0/24
    PodCIDRs:                     172.168.2.0/24
    Non-terminated Pods:          (5 in total)
      Namespace                   Name                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
      ---------                   ----                                    ------------  ----------  ---------------  -------------  ---
      calico-system               calico-node-rvhj8                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
      calico-system               calico-typha-5f87879b7d-tjwld           0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
      calico-system               csi-node-driver-b58b7                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
      kube-system                 kube-proxy-xcf67                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         17h
      kube-system                 nvidia-device-plugin-daemonset-pjjr9    0 (0%)        0 (0%)      0 (0%)           0 (0%)         14m
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests  Limits
      --------           --------  ------
      cpu                0 (0%)    0 (0%)
      memory             0 (0%)    0 (0%)
      ephemeral-storage  0 (0%)    0 (0%)
      hugepages-1Gi      0 (0%)    0 (0%)
      hugepages-2Mi      0 (0%)    0 (0%)
      nvidia.com/gpu     0         0
    Events:              <none>
    
References:
ref:
https://github.com/NVIDIA/k8s-device-plugin
https://bluesmilery.github.io/blogs/afcb1072/

Troubleshooting:

  • nvidia-smi errors out after a reboot:
    sudo ubuntu-drivers devices
    # install the driver version marked "recommended"
    sudo apt-get install nvidia-driver-<*>
    
    ref:
    https://zhuanlan.zhihu.com/p/337013545
    https://www.zhihu.com/question/474222642
    
  • After create -f for the k8s device plugin, the control-plane is not assigned any work: the taint removal from step 18 was skipped; run it and it will work
    # remove the taint on the control-plane
    kubectl taint nodes --all node-role.kubernetes.io/control-plane-
    
    # remove the taint on the master
    kubectl taint nodes --all node-role.kubernetes.io/master-
    

GPU Burn

Use wilicc/gpu-burn as the GPU burn-in program and package it into an image

  1. wilicc/gpu-burn to image
git clone https://github.com/wilicc/gpu-burn
cd gpu-burn
nano Dockerfile
# tweak the version and the way it runs to suit how we will use it with K8s later
ARG CUDA_VERSION=12.3.1
ARG IMAGE_DISTRO=ubuntu22.04

FROM nvidia/cuda:${CUDA_VERSION}-devel-${IMAGE_DISTRO} AS builder

WORKDIR /build

COPY . /build/

RUN make

FROM nvidia/cuda:${CUDA_VERSION}-runtime-${IMAGE_DISTRO}

COPY --from=builder /build/gpu_burn /app/
COPY --from=builder /build/compare.ptx /app/

WORKDIR /app

# Create a /app/result directory and link it to the local ./result directory
RUN mkdir /app/result && ln -s /app/result /result
# build the image
docker build -t gpu_burn .

# confirm the image was built
docker images
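
Before pushing it anywhere, you can give the freshly built image a quick local run; a minimal sketch, assuming the nvidia default runtime configured earlier (otherwise add --gpus all):

# run a short 30-second burn; results print to stdout
docker run --rm gpu_burn ./gpu_burn 30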

Private Docker Registry Server

Because the other nodes need to pull the image we just built and I do not want to push it to Docker Hub, I set up a private Docker registry

# run the registry with Docker
docker run -d --restart always -p 5000:5000 -v /root/K8s/registry:/var/lib/registry --name registry registry:2

# tag the image we just built
docker tag <images_name> <registries_ip>:5000/<images_name>

# on every server that needs to pull the image, add the following to the Docker config (all nodes)
nano /etc/docker/daemon.json
  "live-restore": true,
  "group": "dockerroot",
  "insecure-registries": ["<registries_ip>:5000"]
# restart Docker so it reads the new config (all nodes)
systemctl restart docker

# push the tagged image
docker push <registries_IP>:5000/<images_name>
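
To confirm the push actually landed in the registry, you can query the registry v2 API; a minimal sketch (replace the placeholder with your registry IP):

# lists the repositories stored in the registry
curl http://<registries_ip>:5000/v2/_catalog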

K8s exec GPU-Burn

gpu-burn-CN5c.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-controller
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: k8sc.net
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 5  # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-node1
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: k8sn1.net
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 5  # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
---
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-node2
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      nodeSelector:
        kubernetes.io/hostname: k8sn2.net
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 5  # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory

gpu-burn-R4c-3pod.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-burn-job-random
  labels:
    app: gpu-burn
spec:
  ttlSecondsAfterFinished: 100
  completions: 3 # how many Pods must complete
  parallelism: 3 # how many run at the same time
  template:
    metadata:
      labels:
        app: gpu-burn
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-burn
        image: 192.168.137.154:5000/gpu_burn
        imagePullPolicy: Never
        command: [ "sh", "-c", "./gpu_burn 60 > /app/result/output.txt && exit" ]
        resources:
          limits:
            nvidia.com/gpu: 4  # number of GPUs each Pod uses
        volumeMounts:
        - name: result-volume
          mountPath: /app/result
      volumes:
      - name: result-volume
        hostPath:
          path: /root/gpu-result
          type: Directory
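
Note that a hostPath volume with type: Directory requires the directory to already exist on whichever node the Pod is scheduled to, otherwise the Pod gets stuck in ContainerCreating; a minimal sketch (run on every node that can receive these jobs):

sudo mkdir -p /root/gpu-result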
Create the jobs:
=> kubectl create -f <yaml>

Delete the jobs:
=> kubectl delete -f <yaml>

Check the jobs:
=> kubectl get pods

Check job details:
=> kubectl describe pods
or
=> kubectl describe pod <pod_name>

Check the execution logs:
=> kubectl logs <pod_name>
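
Once a job finishes, the burn output lands in the hostPath on the node where that Pod ran; a minimal sketch for checking completion and reading the result (paths as defined in the YAML above):

# confirm the jobs have completed
kubectl get jobs
# on the node that ran the Pod, read the burn result
cat /root/gpu-result/output.txt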