/etc/docker/daemon.json changes automatically and minikube not working on GPU NVIDIA RTX 3090 with --driver=none #423

Closed
joaquinfdez opened this issue Jul 11, 2023 · 3 comments
joaquinfdez commented Jul 11, 2023

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

I have successfully set up and launched minikube; however, I cannot detect the GPU in my setup.

The official Minikube tutorials page suggests using either the KVM2 driver or the 'none' driver. However, after several trials, I've noticed that the KVM2 driver does not seem to support my NVIDIA GPU, specifically the RTX 3090.

To overcome this, I decided to use the 'none' driver for my Minikube setup, which does seem to work with the RTX 3090. I followed the process below, but I still could not run minikube with the GPU.

2. Steps to reproduce the issue

  1. Disabled file system protections:
sudo sysctl fs.protected_regular=0
fs.protected_regular = 0
  2. Reloaded the systemd manager configuration:
sudo systemctl daemon-reload
  3. Enabled and started cri-docker.service and cri-docker.socket:
sudo systemctl enable cri-docker.service
sudo systemctl enable --now cri-docker.socket
  4. Started Minikube using the none driver:
minikube start --driver=none --apiserver-ips 127.0.0.1 --apiserver-name localhost
  5. Checked Minikube status, which was running correctly:
minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
  6. Created a daemonset using the NVIDIA k8s device plugin:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
daemonset.apps/nvidia-device-plugin-daemonset created
  7. Checked node status:
kubectl get nodes -ojson | jq .items[].status.capacity
{
  "cpu": "64",
  "ephemeral-storage": "1921221768Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "131704932Ki",
  "pods": "110"
}

Despite these steps, minikube does not detect the RTX 3090 GPU (no nvidia.com/gpu resource appears in the node capacity above). Can anyone provide some guidance on what the issue could be?
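
To narrow the failure down, these are the checks I would run next; a minimal sketch, assuming the device-plugin pods land in kube-system with the label name=nvidia-device-plugin-ds used by the upstream nvidia-device-plugin.yml manifest:

# Does the node advertise the GPU resource? A null/empty result means the
# device plugin never registered nvidia.com/gpu with the kubelet.
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
# Inspect the device-plugin pods and their logs (namespace and label are
# assumptions based on the upstream manifest).
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds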

Attach the log file

$ minikube start --driver=none --apiserver-ips 127.0.0.1 --apiserver-name localhost
😄  minikube v1.30.1 on Ubuntu 20.04
✨  Using the none driver based on user configuration
👍  Starting control plane node minikube in cluster minikube
🤹  Running on localhost (CPUs=64, Memory=128618MB, Disk=1876193MB) ...
ℹ️  OS release is Ubuntu 20.04.6 LTS
🐳  Preparing Kubernetes v1.26.3 on Docker 24.0.2 ...
    ▪ kubelet.resolv-conf=/run/systemd/resolve/resolv.conf
    ▪ Generating certificates and keys ...
💢  initialization failed, will try again: wait: /bin/bash -c "sudo env PATH="/var/lib/minikube/binaries/v1.26.3:$PATH" kubeadm init --config /var/tmp/minikube/kubeadm.yaml  --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests,DirAvailable--var-lib-minikube,DirAvailable--var-lib-minikube-etcd,FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml,FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml,FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml,FileAvailable--etc-kubernetes-manifests-etcd.yaml,Port-10250,Swap,NumCPU,Mem": exit status 1
stdout:
[init] Using Kubernetes version: v1.26.3
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/var/lib/minikube/certs"
[certs] Using existing ca certificate authority
[certs] Using existing apiserver certificate and key on disk
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Generating "etcd/ca" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [localhost PC_RTX3090] and IPs [10.10.68.61 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [localhost PC_RTX3090] and IPs [10.10.68.61 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"

stderr:
W0706 10:23:45.430412   58489 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/cri-dockerd.sock". Please update your configuration!
        [WARNING Swap]: swap is enabled; production deployments should disable swap unless testing the NodeSwap feature gate of the kubelet
error execution phase kubeconfig/admin: a kubeconfig file "/etc/kubernetes/admin.conf" exists already but has got the wrong CA cert
To see the stack trace of this error execute with --v=5 or higher

    ▪ Generating certificates and keys ...
    ▪ Booting up control plane ...
    ▪ Configuring RBAC rules ...
🔗  Configuring bridge CNI (Container Networking Interface) ...
🤹  Configuring local host environment ...

❗  The 'none' driver is designed for experts who need to integrate with an existing VM
💡  Most users should use the newer 'docker' driver instead, which does not require root!
📘  For more information, see: https://minikube.sigs.k8s.io/docs/reference/drivers/none/

❗  The kubectl and minikube configuration will be stored in /home/PC_RTX3090
❗  To use kubectl or minikube commands as your own user, you may need to relocate them. For example, to overwrite your own settings, run:

    ▪ sudo mv /home/PC_RTX3090/.kube /home/PC_RTX3090/.minikube $HOME
    ▪ sudo chown -R $USER $HOME/.kube $HOME/.minikube

💡  This can also be done automatically by setting the env var CHANGE_MINIKUBE_NONE_USER=true
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🔎  Verifying Kubernetes components...
🌟  Enabled addons: default-storageclass, storage-provisioner
🏄  Done! kubectl is now configured to use "minikube" cluster and "default" namespace by default
$ minikube status
minikube
type: Control Plane
host: Running
kubelet: Running
apiserver: Running
kubeconfig: Configured
$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
daemonset.apps/nvidia-device-plugin-daemonset created
$ kubectl get nodes -ojson | jq .items[].status.capacity
{
  "cpu": "64",
  "ephemeral-storage": "1921221768Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "131704932Ki",
  "pods": "110"
}

Operating System

Ubuntu

Driver

None

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
Tue Jul 11 12:07:16 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:41:00.0 Off |                  N/A |
| 67%   65C    P2   341W / 350W |  17214MiB / 24576MiB |     76%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:61:00.0 Off |                  N/A |
| 38%   41C    P8    14W / 350W |  15956MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   1569631      C   /usr/bin/python3                17212MiB |
|    1   N/A  N/A   1501058      C   /bin/python3                    15954MiB |
+-----------------------------------------------------------------------------+
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
    I don't know why, but every time I rewrite this file, it automatically reverts to the following default values (see the sketch after this list for the kind of runtime entry that keeps getting reverted):
{
   "exec-opts": ["native.cgroupdriver=cgroupfs"],
   "log-driver": "json-file",
   "log-opts": {
           "max-size": "100m"
   },
   "storage-driver": "overlay2"
}
  • The k8s-device-plugin container logs
    I cannot show them
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
    I cannot show them

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
docker version
Client: Docker Engine - Community
Version:           24.0.2
API version:       1.43
Go version:        go1.20.4
Git commit:        cb74dfc
Built:             Thu May 25 21:52:22 2023
OS/Arch:           linux/amd64
Context:           default

Server: Docker Engine - Community
Engine:
 Version:          24.0.2
 API version:      1.43 (minimum version 1.12)
 Go version:       go1.20.4
 Git commit:       659604f
 Built:            Thu May 25 21:52:22 2023
 OS/Arch:          linux/amd64
 Experimental:     false
containerd:
 Version:          1.6.21
 GitCommit:        3dce8eb055cbb6872793272b4f20ed16117344f8
runc:
 Version:          1.1.7
 GitCommit:        v1.1.7-0-g860f061
docker-init:
 Version:          0.19.0
 GitCommit:        de40ad0
  • Docker command, image and tag used
  • Kernel version from uname -a
uname -a
Linux WS5 5.15.0-76-generic #83~20.04.1-Ubuntu SMP Wed Jun 21 20:23:31 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version                     Architecture Description
+++-================================-===========================-============-=========================================================
un  libgldispatch0-nvidia            <none>                      <none>       (no description available)
ii  libnvidia-cfg1-525:amd64         525.125.06-0ubuntu0.20.04.3 amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any               <none>                      <none>       (no description available)
un  libnvidia-common                 <none>                      <none>       (no description available)
ii  libnvidia-common-525             525.125.06-0ubuntu0.20.04.3 all          Shared files used by the NVIDIA libraries
un  libnvidia-compute                <none>                      <none>       (no description available)
rc  libnvidia-compute-510:amd64      525.125.06-0ubuntu0.20.04.1 amd64        Transitional package for libnvidia-compute-525
ii  libnvidia-compute-525:amd64      525.125.06-0ubuntu0.20.04.3 amd64        NVIDIA libcompute package
ii  libnvidia-compute-525:i386       525.125.06-0ubuntu0.20.04.3 i386         NVIDIA libcompute package
ii  libnvidia-container-tools        1.13.2-1                    amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64       1.13.2-1                    amd64        NVIDIA container runtime library
un  libnvidia-decode                 <none>                      <none>       (no description available)
ii  libnvidia-decode-525:amd64       525.125.06-0ubuntu0.20.04.3 amd64        NVIDIA Video Decoding runtime libraries
ii  libnvidia-decode-525:i386        525.125.06-0ubuntu0.20.04.3 i386         NVIDIA Video Decoding runtime libraries
un  libnvidia-encode                 <none>                      <none>       (no description available)
ii  libnvidia-encode-525:amd64       525.125.06-0ubuntu0.20.04.3 amd64        NVENC Video Encoding runtime library
ii  libnvidia-encode-525:i386        525.125.06-0ubuntu0.20.04.3 i386         NVENC Video Encoding runtime library
  • NVIDIA container library version from nvidia-container-cli -V
cli-version: 1.13.2
lib-version: 1.13.2
build date: 2023-06-06T20:27+00:00
build revision: f9624c1879fac71012a750b63a14a06dc7c8e345
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
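
As referenced in the daemon.json item above, the change that keeps being reverted is the NVIDIA runtime registration. A minimal sketch of how that entry is typically added alongside the existing defaults (the binary path assumes a standard nvidia-container-toolkit install; adjust as needed):

# Sketch only: register the NVIDIA runtime while keeping the current defaults,
# then restart Docker to pick up the change.
sudo tee /etc/docker/daemon.json <<'EOF'
{
   "default-runtime": "nvidia",
   "runtimes": {
      "nvidia": {
         "path": "/usr/bin/nvidia-container-runtime",
         "runtimeArgs": []
      }
   },
   "exec-opts": ["native.cgroupdriver=cgroupfs"],
   "log-driver": "json-file",
   "log-opts": {
           "max-size": "100m"
   },
   "storage-driver": "overlay2"
}
EOF
sudo systemctl restart docker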
joysn71 commented Aug 26, 2023

I am experiencing the exact same issue. Every time I add the nvidia runtime to /etc/docker/daemon.json, it is changed back to its previous version after I start minikube, which is rather strange. It is even stranger because yesterday I was able to provide the GPU to minikube and verified it working in a JupyterLab notebook on Kubeflow.

I have no idea anymore what is going wrong or why it is such a pain to get the GPU running in Minikube.
I already tried the kvm2 driver before and was not able to get it working.

L1ght94 commented Aug 31, 2023

Hi! This issue is caused by minikube itself, as pointed out here. I faced the same problem and resolved it by installing minikube version 1.28.0.
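
In case it helps anyone trying the same workaround, a minimal sketch of pinning the minikube binary to v1.28.0 on Linux amd64 (the download URL follows minikube's usual release pattern; verify it against the project's releases page):

# Download and install a specific minikube release, then confirm the version.
curl -LO https://storage.googleapis.com/minikube/releases/v1.28.0/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
minikube version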

joysn71 commented Aug 31, 2023

Sorry for being late, but I managed to get the GPU working in Minikube.
I am now using Minikube with the Docker driver; Docker has the nvidia runtime configured and daemon.json is no longer overwritten.
Minikube v1.31.2 with the nvidia-gpu-device-plugin addon.
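
For anyone following the same route, a rough sketch of the sequence this corresponds to (addon name as listed by minikube addons list; Docker on the host must already have the NVIDIA runtime configured):

# Start minikube on the Docker driver and enable the GPU device-plugin addon.
minikube start --driver=docker
minikube addons enable nvidia-gpu-device-plugin
# The node should now report the GPU resource.
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'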
