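I was lucky enough to get access to an A100 machine with 8 × 80 GB GPUs. I assumed the setup would be the same as on my previous machines, but I ran into problems when trying to run an LLM on CUDA. This post records how I solved them.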

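If you are using GPUs such as the V100, A100, or A30 in a machine where the GPUs are connected through NVSwitch, you also need to follow steps 3 onward before the NVIDIA GPUs will work properly.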

  1. Install CUDA

    Follow: CUDA Toolkit 12.6 Downloads | NVIDIA Developer

    Base Installer:

    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
    sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
    wget https://developer.download.nvidia.com/compute/cuda/12.6.0/local_installers/cuda-repo-ubuntu2204-12-6-local_12.6.0-560.28.03-1_amd64.deb
    sudo dpkg -i cuda-repo-ubuntu2204-12-6-local_12.6.0-560.28.03-1_amd64.deb
    sudo cp /var/cuda-repo-ubuntu2204-12-6-local/cuda-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-12-6
    

    Driver Installer:

    sudo apt-get install -y nvidia-open
    

    Setting NVCC:

    Replace cuda-12 in the commands below with the version you actually installed.

    vim ~/.bashrc
    # Add the following two lines to the end of ~/.bashrc:
    export PATH=/usr/local/cuda-12/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-12/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    
  2. Install cuDNN

    Follow: cuDNN 9.3.0 Downloads | NVIDIA Developer

    Base Installer:

    wget https://developer.download.nvidia.com/compute/cudnn/9.3.0/local_installers/cudnn-local-repo-ubuntu2204-9.3.0_1.0-1_amd64.deb
    sudo dpkg -i cudnn-local-repo-ubuntu2204-9.3.0_1.0-1_amd64.deb
    sudo cp /var/cudnn-local-repo-ubuntu2204-9.3.0/cudnn-*-keyring.gpg /usr/share/keyrings/
    sudo apt-get update
    sudo apt-get -y install cudnn
    

    To install the package for a specific CUDA version:

    sudo apt-get -y install cudnn-cuda-<CUDA-Version>
    

    Install libfreeimage:

    sudo apt install libfreeimage-dev
    

    Test cuDNN:

    git clone https://github.com/NVIDIA/cuda-samples.git
    cd cuda-samples/Samples/bandwidthTest
    make
    ./bandwidthTest
    

    If the test fails, continue with the following steps.

  3. Install DCGM

    Follow: NVIDIA DCGM | NVIDIA Developer

    sudo apt-get update
    sudo apt-get install -y datacenter-gpu-manager
    
  4. Install nvidia-fabricmanager

    Follow: fabric-manager-user-guide.pdf (nvidia.com) - Chapter 2.6

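    version=<your-gpu-Driver Version>   # the "Driver Version" shown by nvidia-smi
    main_version=$(echo "$version" | awk -F '.' '{print $1}')   # major version, the part before the first dot
    sudo apt-get update
    sudo apt-get -y install nvidia-fabricmanager-${main_version}=${version}-*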
    
    Nvidia-smi:
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
    |-----------------------------------------+------------------------+----------------------+
    In this example, version = 560.35.03
    
  5. Disable nv-hostengine

    sudo nv-hostengine -t
    
  6. Start the fabricmanager

    sudo service nvidia-fabricmanager start
    
  7. Test cuDNN again

    git clone https://github.com/NVIDIA/cuda-samples.git
    cd cuda-samples/Samples/bandwidthTest
    make
    ./bandwidthTest
    
    root@test-ORACLE-SERVER-E4-2c:~/cudnn_samples_v9/mnistCUDNN# ./mnistCUDNN
    Executing: mnistCUDNN
    cudnnGetVersion() : 90400 , CUDNN_VERSION from cudnn.h : 90400 (9.4.0)
    Host compiler version : GCC 11.4.0
    
    There are 8 CUDA capable devices on your machine :
    device 0 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=0
    device 1 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=1
    device 2 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=2
    device 3 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=3
    device 4 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=4
    device 5 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=5
    device 6 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=6
    device 7 : sms 108  Capabilities 8.0, SmClock 1410.0 Mhz, MemSize (Mb) 81155, MemClock 1593.0 Mhz, Ecc=1, boardGroupID=7
    Using device 0
    
    Testing single precision
    Loading binary file data/conv1.bin
    Loading binary file data/conv1.bias.bin
    Loading binary file data/conv2.bin
    Loading binary file data/conv2.bias.bin
    Loading binary file data/ip1.bin
    Loading binary file data/ip1.bias.bin
    Loading binary file data/ip2.bin
    Loading binary file data/ip2.bias.bin
    Loading image data/one_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.027648 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.036864 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.062464 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.070656 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.091136 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.092160 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 129072 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.055296 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.064512 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.090112 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.093184 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.098304 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.189440 time requiring 129072 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Resulting weights from Softmax:
    0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
    Loading image data/three_28x28.pgm
    Performing forward propagation ...
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.023552 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.026624 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.028672 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.057344 time requiring 184784 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.057344 time requiring 178432 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.064512 time requiring 2057744 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 129072 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Testing cudnnFindConvolutionForwardAlgorithm ...
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.053248 time requiring 2450080 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.055296 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.063488 time requiring 1433120 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.064512 time requiring 4656640 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.092160 time requiring 0 memory
    ^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.102400 time requiring 129072 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
    ^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
    Resulting weights from Softmax:
    0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
    Loading image data/five_28x28.pgm
    Performing forward propagation ...
    Resulting weights from Softmax:
    0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
    
    Result of classification: 1 3 5
    
    Test passed!
    

    If you see "Test passed!", congratulations: the GPUs are now working properly!
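
The version lookup in step 4 can also be scripted instead of being copied by hand from the nvidia-smi banner. The following is only a minimal sketch: `driver_major` and `fabricmanager_pkg` are hypothetical helper names introduced here for illustration, and the script assumes the driver is already installed so that `nvidia-smi --query-gpu=driver_version` can report it. It only prints the package spec and does not install anything.

```shell
#!/bin/sh
# Hypothetical helper: print the major part of a driver version string,
# e.g. 560.35.03 -> 560 (everything before the first dot).
driver_major() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: build the apt package spec used in step 4
# from a full driver version string.
fabricmanager_pkg() {
    printf 'nvidia-fabricmanager-%s=%s-*\n' "$(driver_major "$1")" "$1"
}

# Only query the driver when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    # All GPUs share one driver, so the first line is enough.
    version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
    echo "Package to install: $(fabricmanager_pkg "$version")"
else
    echo "nvidia-smi not found; set the version by hand as in step 4"
fi
```

The printed spec can then be passed to `sudo apt-get -y install`, as in step 4. Keeping the lookup and the install separate makes it easy to check that the Fabric Manager version matches the driver before changing anything.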

Additional notes:

NVLink Topology Command:

nvidia-smi topo -m
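
In the matrix printed by this command, a cell such as `NV12` means the two GPUs are connected through NVLink, while values such as `SYS` or `PIX` mean the traffic goes over PCIe. As a rough sketch of how to check this automatically, the hypothetical helper `count_nvlink_cells` below counts the `NV<number>` cells in the GPU rows. It assumes the usual whitespace-separated layout of the matrix.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi topo -m` output on stdin and count
# the matrix cells of the form NV<number> (GPU pairs joined by NVLink).
# The matrix is symmetric, so every linked pair is counted twice.
count_nvlink_cells() {
    awk '
        $1 ~ /^GPU[0-9]+$/ {                  # only look at rows that describe a GPU
            for (i = 2; i <= NF; i++)
                if ($i ~ /^NV[0-9]+$/) n++    # e.g. NV12
        }
        END { print n + 0 }                   # print 0 when nothing matched
    '
}

# Only query the topology when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi topo -m | count_nvlink_cells
else
    echo "nvidia-smi not found"
fi
```

A result of 0 on a machine that should have NVLink suggests that the links are not up.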

NVLink Status Command:

nvidia-smi nvlink --status
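
On a machine with many links the status output is long, so a small filter can help. The sketch below uses a hypothetical helper, `count_inactive_links`, and assumes that links which are down are printed with the marker `<inactive>`. Please check this against the output of your own driver version.

```shell
#!/bin/sh
# Hypothetical helper: read `nvidia-smi nvlink --status` output on stdin and
# count the lines that report a link as inactive.
# Assumption: inactive links are printed with the marker "<inactive>".
count_inactive_links() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the exit status clean.
    grep -c '<inactive>' || true
}

# Only query the link status when nvidia-smi is actually available.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Inactive NVLink links: $(nvidia-smi nvlink --status | count_inactive_links)"
else
    echo "nvidia-smi not found"
fi
```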

More nvidia-smi commands for inspecting NVLink:
Checking NVIDIA NVLink with the nvidia-smi tool - Docs

Reference

cuda runtime error (802) : system not yet initialized …/THCGeneral.cpp:50 · Issue #35710 · pytorch/pytorch · GitHub

How to Configure NVLink on Machines - digitalocean