NVIDIA drivers

Install CUDA

# install dependencies
sudo apt update
sudo apt install -y gcc linux-headers-$(uname -r) make dkms
# disable nouveau driver (runfile will fail if it detects that nouveau is loaded)
sudo tee /etc/modprobe.d/blacklist-nouveau.conf << EOF
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
# reboot after creating blacklist-nouveau.conf
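After the reboot it may be worth confirming that nouveau is really gone before starting the runfile. A minimal sketch: the helper name `is_nouveau_loaded` is made up for these notes, and it assumes the usual `lsmod` / `/proc/modules` layout where every line starts with the module name.

```shell
# Hypothetical helper: exits 0 if the module list on stdin contains nouveau.
# Accepts `lsmod` output or /proc/modules (each line starts with the module name).
is_nouveau_loaded() {
    grep -q '^nouveau[[:space:]]'
}

# Usage after the reboot:
#   if lsmod | is_nouveau_loaded; then echo "nouveau is still loaded"; fi
```

If the check still reports nouveau, the blacklist file or the initramfs update probably did not take effect.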

Look for available CUDA downloads:

# download and run the installer
wget https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run &&
chmod +x ./cuda_12.1.1_530.30.02_linux.run &&
sudo sh cuda_12.1.1_530.30.02_linux.run
# wait for it to load and type `accept`
# you can leave installation options on default
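When switching to another release, the URL has to change in three places. The sketch below builds it from the toolkit and driver versions. `cuda_runfile_url` is a hypothetical helper, and the URL layout is only inferred from the single link above, so it may not hold for every release.

```shell
# Hypothetical helper: build the runfile URL from toolkit and driver versions.
# Assumption: the URL layout matches the one link used in these notes.
cuda_runfile_url() {
    cuda_version="$1"    # e.g. 12.1.1
    driver_version="$2"  # e.g. 530.30.02
    printf 'https://developer.download.nvidia.com/compute/cuda/%s/local_installers/cuda_%s_%s_linux.run\n' \
        "$cuda_version" "$cuda_version" "$driver_version"
}

# Usage:
#   wget "$(cuda_runfile_url 12.1.1 530.30.02)"
```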

Upon successful installation you should see something like this:

===========
= Summary =
===========

Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-12.1/

Please make sure that
 -   PATH includes /usr/local/cuda-12.1/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-12.1/lib64, or, add /usr/local/cuda-12.1/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-12.1/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Logfile is /var/log/cuda-installer.log

Update environment:

# adjust the CUDA version to match the installer summary before executing
echo 'export PATH=$PATH:/usr/local/cuda-12.1/bin' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12.1/lib64' >> ~/.bashrc
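The version in these paths has to match the directory the installer reported. One way to avoid typing it twice is to generate both lines from a single argument. A small sketch; `cuda_env_lines` is a hypothetical name, not something the installer provides.

```shell
# Hypothetical helper: print the two export lines for a given toolkit version (X.Y).
# $PATH and $LD_LIBRARY_PATH are printed literally so they expand when ~/.bashrc is read.
cuda_env_lines() {
    version="$1"
    printf 'export PATH=$PATH:/usr/local/cuda-%s/bin\n' "$version"
    printf 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-%s/lib64\n' "$version"
}

# Usage:
#   cuda_env_lines 12.1 >> ~/.bashrc
```

After opening a new shell, `nvcc --version` should resolve if the paths are right.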

Install NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey |
sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg &&
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list |
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' |
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list &&
sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit

sudo nvidia-ctk runtime configure --runtime=containerd && sudo systemctl restart containerd

# additionally set the containerd default runtime to nvidia in /etc/containerd/config.toml:
# [plugins]
#   [plugins."io.containerd.grpc.v1.cri"]
#     [plugins."io.containerd.grpc.v1.cri".containerd]
#       default_runtime_name = "nvidia"
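To check whether the default runtime setting actually landed in the config, the value can be read back. A rough sketch: `containerd_default_runtime` is a hypothetical helper, and it uses a plain text match rather than a TOML parser, so it only reports the first uncommented `default_runtime_name` line it sees.

```shell
# Hypothetical helper: print the first uncommented default_runtime_name value
# found in a containerd config given on stdin. Plain text match, not a TOML parser.
containerd_default_runtime() {
    sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (assuming the default config location):
#   containerd_default_runtime < /etc/containerd/config.toml
```

Empty output means the key is missing or commented out.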

Uninstall

# replace X.Y with your values
sudo /usr/local/cuda-X.Y/bin/cuda-uninstaller
sudo /usr/bin/nvidia-uninstall

Control

# show all
nvidia-smi -q

# keep the NVIDIA driver loaded (persistence mode)
# this gives better monitoring and faster application startup
# the setting does not survive a reboot
sudo nvidia-smi --id 0 --persistence-mode ENABLED

# Tesla P40 doesn't support a target temperature,
# but it seems to throttle around 90°C
# even though the reported slowdown temperature is 96°C
# sudo nvidia-smi --id 0 --gpu-target-temp 75

# My cooling solution is very limited, so cap the power draw
# Min and max power limit values can be found with: nvidia-smi -q | grep 'Power Limit'
sudo nvidia-smi --id 0 --power-limit 125
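The requested limit has to fall inside the card's min/max range, so it can help to clamp it first. A minimal sketch: `clamp_power_limit` is a hypothetical helper. The usage comment assumes the `power.min_limit` and `power.max_limit` query fields are available on your driver (see `nvidia-smi --help-query-gpu`).

```shell
# Hypothetical helper: clamp a requested power limit (watts) into [min, max].
# Arguments: requested min max. Fractional limits such as 125.00 are accepted;
# the result is printed as a whole number of watts.
clamp_power_limit() {
    awk -v want="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        v = want + 0
        if (v < lo + 0) v = lo + 0
        if (v > hi + 0) v = hi + 0
        printf "%d\n", v
    }'
}

# Usage (assumes the query fields exist on this driver):
#   limits=$(nvidia-smi --id 0 --query-gpu=power.min_limit,power.max_limit --format=csv,noheader,nounits)
#   sudo nvidia-smi --id 0 --power-limit "$(clamp_power_limit 125 "${limits%%,*}" "${limits##*, }")"
```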

Fan control

sudo apt update
sudo apt-get install -y lm-sensors fancontrol
# https://www.binarytides.com/monitor-cpu-power-consumption-on-ubuntu/
sudo apt install -y linux-cpupower

sudo sensors-detect
sudo service kmod restart

sudo pwmconfig

sudo systemctl restart fancontrol.service

sensors

# turbostat doesn't seem to work properly on modern systems
sudo turbostat --Summary --quiet --show Busy%,Avg_MHz,PkgTmp,PkgWatt --interval 1
# s-tui supposedly works better, but I have yet to check this

sudo nano /etc/fancontrol

# Configuration file generated by pwmconfig, changes will be lost
INTERVAL=10
DEVPATH=hwmon1=devices/pci0000:00/0000:00:01.3/0000:02:00.2/0000:03:04.0/0000:0c:00.0/nvme/nvme2 hwmon3=devices/platform/nct6775.656
DEVNAME=hwmon1=nvme hwmon3=nct6779
FCTEMPS=hwmon3/pwm5=hwmon1/temp1_input
FCFANS= hwmon3/pwm5=hwmon3/fan5_input
MINTEMP= hwmon3/pwm5=20
MAXTEMP= hwmon3/pwm5=60
MINSTART= hwmon3/pwm5=199
MINSTOP= hwmon3/pwm5=198
MINPWM= hwmon3/pwm5=197
MAXPWM=hwmon3/pwm5=199
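To see what PWM value a config like this produces at a given temperature, the mapping can be sketched by hand. This is my reading of the fancontrol script, not an authoritative description: below MINTEMP it uses MINPWM, above MAXTEMP it uses MAXPWM, and in between it scales linearly from MINSTOP to MAXPWM. `fancontrol_pwm` is a hypothetical helper that works in whole degrees.

```shell
# Hypothetical helper: approximate the PWM value fancontrol would pick.
# Arguments: temp mintemp maxtemp minstop minpwm maxpwm (whole degrees, PWM 0-255).
# Assumption: linear scaling from MINSTOP to MAXPWM between MINTEMP and MAXTEMP.
fancontrol_pwm() {
    temp=$1 mintemp=$2 maxtemp=$3 minstop=$4 minpwm=$5 maxpwm=$6
    if [ "$temp" -le "$mintemp" ]; then
        echo "$minpwm"
    elif [ "$temp" -ge "$maxtemp" ]; then
        echo "$maxpwm"
    else
        echo $(( (temp - mintemp) * (maxpwm - minstop) / (maxtemp - mintemp) + minstop ))
    fi
}

# Usage with the values from the config above:
#   fancontrol_pwm 40 20 60 198 197 199
```

With the values above the result stays between 197 and 199 at any temperature, so this config effectively pins the fan at a near-constant speed.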

Monitor

watch -d -n 1 sh -c "nvidia-smi && echo && nvidia-smi --query-gpu=index,pstate,power.draw,clocks.sm,clocks.mem --format=csv"
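For unattended checks, the same CSV query output can be filtered instead of watched. A small sketch: `hot_gpus` is a hypothetical helper that expects `index, temperature` lines on stdin, as produced with `--format=csv,noheader,nounits`. The 85 in the usage line is an arbitrary example threshold.

```shell
# Hypothetical helper: print the index of every GPU at or above a temperature limit.
# Expects "index, temperature" CSV lines on stdin (no header, no units).
hot_gpus() {
    awk -F', *' -v limit="$1" '$2 + 0 >= limit + 0 { print $1 }'
}

# Usage:
#   nvidia-smi --query-gpu=index,temperature.gpu --format=csv,noheader,nounits | hot_gpus 85
```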

vGPU drivers

References:

Tyan BIOS via internet

curl -k --request GET 'https://10.0.4.18/redfish/v1/'
curl -k -u 'Administrator:superuser' \
        --request GET 'https://10.0.4.18/redfish/v1/AccountService/Accounts/1' \
        --header 'If-Match: W/"1713558119"'
curl -k -u 'danil:Qqwe123!' \
        --request GET 'https://10.0.4.18/redfish/v1/AccountService/Accounts/1' \
        --header 'If-Match: W/"1713558119"'