docker devicerequests/nvidia support #57

Open
robertgzr opened this issue Mar 31, 2020 · 39 comments
@robertgzr

robertgzr commented Mar 31, 2020

we want to support exposing gpu resources to user containers via the new DeviceRequests API introduced in docker 19.03.x

To enable this we need to have the nvidia driver, userland driver-support libraries and libnvidia-container in the host os

WIP branch: https://github.com/balena-os/balena-jetson/tree/rgz/cuda_libs_test
Depends-on: balena-os/meta-balena#1824 [merged 🎊]


arch call notes (internal): https://docs.google.com/document/d/1tFaDKyTsdi1TUfxfAjAAGJCfUVCwmPxIstdrYaOJ-I0
arch call item (internal): https://app.frontapp.com/open/cnv_5bqfytf

@robertgzr robertgzr self-assigned this Mar 31, 2020
@robertgzr
Author

@acostach
Contributor

acostach commented Apr 1, 2020

Looks like they might @robertgzr, but I need to check with a yocto build that includes these 2 packages. It will take a while because the cuda packages first need to be downloaded locally with the nvidia sdk manager; yocto can't pull these and their dependencies automatically. I'll get back to you.

@robertgzr
Author

robertgzr commented Apr 1, 2020

I think we will still run out of space. I already have trouble fitting the new balena-engine binary onto some devices because of its size increase, and the cuda libraries will add at least another 15 MB or so.

@acostach
Contributor

acostach commented Apr 1, 2020

@robertgzr I built an image with those and it's available on dev 3d612ed56aaa2ba22cf73ba7a2021cb7 if you want to test with the patched engine.

A couple notes would be:

  • libcuda appears to come from tegra-libraries, which is a package with ~130 MB worth of nvidia libraries (libnv*, libnvidia*). Not sure whether only some of them or all are tied together; for instance, cuda-drivers adds a dependency on tegra-libraries. But if you get it to work, we can probably try removing them one by one to see if anything breaks.
  • I increased the rootfs size to allow for lots of space, for testing with these packages and the new engine

@robertgzr
Author

@acostach do you have a branch on here I can use, I would like to pull in the engine using balena-os/meta-balena#1824 rather than copying the binary around...

@robertgzr
Author

@acostach I'm trying to figure out why it won't work out of the box...

root@3d612ed:~# nvidia-container-cli info
NVRM version:   (null)
CUDA version:   10.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA Tegra X2
Brand:          (null)
GPU UUID:       (null)
Bus Location:   (null)
Architecture:   6.2

root@3d612ed:~# balena run --rm -it --gpus all nvcr.io/nvidia/l4t-base:r32.3.1 bash
balena: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

I feel like the info command should return a UUID (unfortunately haven't tested this on the playground box last week)
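To make the missing values easier to spot when comparing boards, here is a minimal Python sketch that parses `nvidia-container-cli info` style output and lists the fields reported as `(null)`. The helper names `parse_cli_info` and `null_fields` are hypothetical, and the sample text is just the output pasted above.

```python
def parse_cli_info(text):
    """Parse 'Key:   value' lines (as printed by `nvidia-container-cli info`) into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines and anything without a colon
        fields[key.strip()] = value.strip()
    return fields


def null_fields(fields):
    """Return the sorted names of fields whose value is the literal '(null)'."""
    return sorted(key for key, value in fields.items() if value == "(null)")


# Sample copied from the TX2 output in this comment.
SAMPLE = """\
NVRM version:   (null)
CUDA version:   10.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA Tegra X2
Brand:          (null)
GPU UUID:       (null)
Bus Location:   (null)
Architecture:   6.2
"""

if __name__ == "__main__":
    print(null_fields(parse_cli_info(SAMPLE)))
```

Whether a null UUID is actually a problem on Tegra is not settled by this sketch; it only makes the comparison mechanical.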


also the container-cli can be used to ask what components of the driver are required:

root@3d612ed:~# nvidia-container-cli list
/usr/lib/libcuda.so.1.1
/usr/lib/libnvidia-ptxjitcompiler.so.32.3.1
/usr/lib/libnvidia-fatbinaryloader.so.32.3.1
/usr/lib/libnvidia-eglcore.so.32.3.1
/usr/lib/libnvidia-glcore.so.32.3.1
/usr/lib/libnvidia-tls.so.32.3.1
/usr/lib/libnvidia-glsi.so.32.3.1
/usr/lib/libGLX_nvidia.so.0
/usr/lib/libEGL_nvidia.so.0
/usr/lib/libGLESv2_nvidia.so.2
/usr/lib/libGLESv1_CM_nvidia.so.1

@robertgzr
Author

where can I see which version of the driver is installed on the tx2?

@acostach
Contributor

acostach commented Apr 6, 2020

@robertgzr it's the 32.3.1 driver from l4t 32.3.1, if that's what you are referring to.

root@3d612ed:~# modinfo /lib/modules/4.9.140-l4t-r32.3.1/kernel/drivers/gpu/nvgpu/nvgpu.ko
filename: /lib/modules/4.9.140-l4t-r32.3.1/kernel/drivers/gpu/nvgpu/nvgpu.ko
alias: of:NTCnvidia,gv11bC*
alias: of:NTCnvidia,gv11b
alias: of:NTCnvidia,tegra186-gp10bC*
alias: of:NTCnvidia,tegra186-gp10b
alias: of:NTCnvidia,tegra210-gm20bC*
alias: of:NTCnvidia,tegra210-gm20b
depends:
intree: Y
vermagic: 4.9.140-l4t-r32.3.1 SMP preempt mod_unload modversions aarch64

I wonder if it does so because the GPU isn't initialized: the firmware blobs that are usually extracted into the container weren't loaded by the driver, since they aren't in the hostOS. I'm referring to (BSP Archive) Tegra186_Linux_R32.3.1.tbz2/Linux_for_tegra/nv_tegra/nvidia_drivers.tbz2/lib/firmware/tegra18x, gp10b. Not sure this is the issue, but can you try to initialize it first, maybe from a container, and then shut down the container but leave the driver loaded? Or unpack nv_drivers directly in the hostOS.

https://github.com/balena-io-playground/tx2-container-contracts-sample/blob/16d3ad09f0615956389f04105e3b533be9620388/tx2_32_2/Dockerfile.template#L7 but use the 32.3.1 BSP archive for the TX2 from here: https://developer.nvidia.com/embedded/linux-tegra

I haven't got time just yet to look into or release a 32.3.1 based BalenaOS for the tx2, but if you are having issues with unpacking the BSP archive in the container, here's how it works for the Nano on 32.3.1: https://github.com/acostach/jetson-nano-container-contracts/blob/51e9bfa97a91692c6b806ed32c9e96e656f5b088/nano_32_3_1/Dockerfile.template#L7

@robertgzr
Author

I think we're fine in the driver department. It looks like docker only loads its internal compat layer for the nvidia stuff if nvidia-container-runtime-hook is present on the hostOS.

I'm going to see if I can find where this is supposed to come from but I think it's usually installed as part of the libnvidia-container package
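A quick way to check for that precondition on a host is to look the hook up on `PATH`. This is a minimal Python sketch; the function name `gpu_hook_status` and the wording of the returned messages are hypothetical, and the `which` parameter exists only so the lookup can be replaced in a test.

```python
import shutil

# Binary name taken from the comment above.
HOOK_NAME = "nvidia-container-runtime-hook"


def gpu_hook_status(which=shutil.which):
    """Report whether the nvidia hook binary can be found on PATH."""
    path = which(HOOK_NAME)
    if path is None:
        return ("missing: '--gpus' requests are likely to fail with "
                "'could not select device driver'")
    return f"found at {path}"


if __name__ == "__main__":
    print(gpu_hook_status())
```

This only confirms that the binary is discoverable. It says nothing about whether the hook itself runs.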

@acostach
Contributor

acostach commented Apr 7, 2020

@robertgzr thanks, I've updated https://github.com/balena-os/balena-jetson/commits/cuda_libs_test with this package, let me know if it works with it

@robertgzr
Author

@acostach any idea why the runtime-hook is complaining about missing libraries:

root@3d612ed:~# nvidia-container-runtime-hook
nvidia-container-runtime-hook: error while loading shared libraries: libstd.so: cannot open shared object file: No such file or directory

is libstd not a rust thing?

@acostach
Contributor

acostach commented Apr 9, 2020

not sure @robertgzr , I see this libstd is provided by rust in the rootfs, but probably the hook binary comes pre-compiled and was built against a different version of the library?

root@3d612ed:~# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/rust/
root@3d612ed:~# nvidia-container-runtime-hook
nvidia-container-runtime-hook: symbol lookup error: nvidia-container-runtime-hook: undefined symbol: main

@robertgzr
Author

robertgzr commented Apr 9, 2020

the thing is, there is no rust dependency? and the hook binary should be built from source by https://github.com/madisongh/meta-tegra/blob/master/recipes-containers/nvidia-container-toolkit/nvidia-container-toolkit_1.0.5.bb from https://github.com/NVIDIA/container-toolkit/tree/60f165ad6901f85b0c3acbf7ce2c66cd759c4fb8/nvidia-container-toolkit
no?

something is wrong here... but I don't understand

@acostach
Contributor

acostach commented Apr 9, 2020

@robertgzr It doesn't look like a rust dependency, unless I'm mistaken somewhere. And that's right, the hook binary is built from sources, but they are go sources.

So it appears there are 2 libstds, one from rust as you said, which isn't good for us, and another one from go. The go version that we currently have in the image comes from meta-balena and is at version 1.10.

I think the hook was built against some newer go 'headers', although I'm not familiar with the go workflow or build process.

root@3d612ed:~# export LD_LIBRARY_PATH=/home/root/ # this is where I copied libstd.so provided by go on the shared board
root@3d612ed:~# nvidia-container-runtime-hook 
nvidia-container-runtime-hook: symbol lookup error: nvidia-container-runtime-hook: undefined symbol: runtime.arm64HasATOMICS

Looking at this: https://github.com/golang/go/blob/a1550d3ca3a6a90b8bbb610950d1b30649411243/src/cmd/internal/goobj2/builtinlist.go#L185
I see the symbol 'runtime.arm64HasATOMICS' is present starting from go version ~1.14
so I updated manually to go 1.14, updated the poky class to zeus, re-built go and nvidia-container-toolkit, then uploaded the runtime-container-toolkit and libstd.so binaries to the shared board and it appears to work:

root@3d612ed:~# export LD_LIBRARY_PATH=/home/root/
root@3d612ed:~# nvidia-container-runtime-hook 
Usage of nvidia-container-runtime-hook:
  -config string
    configuration file
  -debug
    enable debug output

Please try to run it again now on the shared device, check if this unblocks.

@robertgzr
Author

robertgzr commented Apr 10, 2020

@acostach oh ok that makes more sense now... sounds to me like something is still up with our go integration in meta-balena. if you check my pr here: balena-os/meta-balena#1824

This provides the nvidia enabled balena-engine. part of it is a bump to go 1.12.12

sounds like the build of nvidia-container-toolkit uses a different go toolchain than that one? how is that possible? I thought we could enforce it via the GOVERSION env from meta-balena


Looking at this: golang/go:src/cmd/internal/goobj2/builtinlist.go@a1550d3#L185
I see the symbol 'runtime.arm64HasATOMICS' is present starting from go version ~1.14

the toolkit recipe shouldn't have a dependency on any version of go btw. if it gets built by 1.12.12 it should just work.

I have actually never encountered something like that. I didn't even know the go stdlib could be loaded as shared

@robertgzr
Author

I looked at the documentation a little bit:

yocto compiles the go runtime (which includes the stdlib) as a shared library:
http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/recipes-devtools/go/go-runtime.inc?h=zeus#n40

the go.bbclass in poky has a switch to link the recipe to that shared lib, GO_DYNLINK:
http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/classes/go.bbclass?h=zeus#n35

that is enabled for supported platforms by default I think:
http://git.yoctoproject.org/cgit/cgit.cgi/poky/tree/meta/classes/goarch.bbclass?h=zeus#n26

https://golang.org/cmd/go/#hdr-Compile_packages_and_dependencies (ctrl-f "linkshared")
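To confirm whether a given Go binary was linked against the shared runtime, one option is to inspect its dynamic section for a `libstd.so` entry. The Python sketch below parses text in the format printed by `readelf -d`; the helper names are hypothetical, and the sample output is illustrative rather than captured from the device.

```python
import re

# Matches the NEEDED rows that `readelf -d <binary>` prints.
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(?P<name>[^\]]+)\]")


def needed_libraries(readelf_output):
    """Return the shared libraries listed as NEEDED in `readelf -d` output."""
    return [match.group("name") for match in NEEDED_RE.finditer(readelf_output)]


def links_go_runtime_dynamically(readelf_output):
    """True if the binary depends on the shared Go runtime/stdlib (libstd.so)."""
    return any(name.startswith("libstd.so")
               for name in needed_libraries(readelf_output))


# Illustrative sample, not taken from the board.
SAMPLE = """\
Dynamic section at offset 0x2d0e10 contains 24 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libstd.so]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
"""

if __name__ == "__main__":
    print(needed_libraries(SAMPLE))
    print(links_go_runtime_dynamically(SAMPLE))
```

A binary built without the linkshared flags should not list `libstd.so` at all, which would make it independent of the go-runtime version in the rootfs.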

@robertgzr
Author

I manually compiled it without those flags and now it works:

root@3d612ed:~# balena run --gpus all -it balenalib/jetson-tx2-ubuntu:bionic-run bash
root@71f92d3822b7:/#

@dremsol
Contributor

dremsol commented Apr 23, 2020

Hi @robertgzr, @acostach. I'm happy to see progress on the subject. We have been asking for this feature for a long time. Will this be supported in BalenaOS anytime soon?

@acostach
Contributor

Hi @dremsol currently it's something we're considering and investigating, we don't have a timeline as the final conclusions were not reached yet.

@dremsol
Contributor

dremsol commented May 8, 2020

Hi @robertgzr & @acostach,

I've taken a deeper look at this issue and I would like to share our experiences. I also have a couple of questions which I hope you can answer. First of all, our custom OS indeed shows similar output:

root@photon-nano:~# nvidia-container-cli info
NVRM version:   (null)
CUDA version:   10.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA Tegra X1
Brand:          (null)
GPU UUID:       (null)
Bus Location:   (null)
Architecture:   5.3

root@photon-nano:~# nvidia-container-cli list
/usr/lib/libcuda.so.1.1
/usr/lib/libnvidia-ptxjitcompiler.so.32.3.1
/usr/lib/libnvidia-fatbinaryloader.so.32.3.1
/usr/lib/libnvidia-eglcore.so.32.3.1
/usr/lib/libnvidia-glcore.so.32.3.1
/usr/lib/libnvidia-tls.so.32.3.1
/usr/lib/libnvidia-glsi.so.32.3.1
/usr/lib/libGLX_nvidia.so.0
/usr/lib/libEGL_nvidia.so.0
/usr/lib/libGLESv2_nvidia.so.2
/usr/lib/libGLESv1_CM_nvidia.so.1

This allows us to run the CUDA samples by pulling nvcr.io/nvidia/l4t-base:r32.4.2 just fine, under the assumption that the CUDA libs are installed in the HostOS. So far so good, and probably the goal you want to achieve in this issue.

The first thing I would like to point out is that mounting a CSI camera into the docker container requires a daemon (tegra-argus-daemon) to run in the HostOS. The additional argument to add to the run command for accessing a CSI camera from within the container is: -v /tmp/argus_socket:/tmp/argus_socket. For a USB camera, the additional argument is --device /dev/video0:/dev/video0.
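As a rough illustration of the two cases, this Python sketch assembles the run command for either camera type. The function names and the choice of `--gpus all` in the command are assumptions for the example; only the `-v` and `--device` arguments come from the text above.

```python
def camera_run_args(kind, video_device="/dev/video0"):
    """Return the extra run arguments needed for a CSI or USB camera."""
    if kind == "csi":
        # CSI cameras talk to tegra-argus-daemon on the host through its socket.
        return ["-v", "/tmp/argus_socket:/tmp/argus_socket"]
    if kind == "usb":
        # USB cameras are exposed as a plain V4L2 device node.
        return ["--device", f"{video_device}:{video_device}"]
    raise ValueError(f"unknown camera kind: {kind!r}")


def run_command(image, kind, engine="balena"):
    """Build the full argument list for starting a container with camera access."""
    return [engine, "run", "--rm", "-it", "--gpus", "all",
            *camera_run_args(kind), image]


if __name__ == "__main__":
    print(" ".join(run_command("nvcr.io/nvidia/l4t-base:r32.4.2", "csi")))
```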

Now we are building our application using the deepstream-l4t container, and this is where it gets interesting, as the required HostOS packages become application dependent. Besides CUDA, it requires cuDNN and TensorRT. While this is still feasible to include somehow (either static or configurable through BalenaCloud), it becomes a mess once you need to include the application-specific gstreamer plugins in the HostOS. To give a small snippet (not optimized):

# NVIDIA
IMAGE_INSTALL_append = " cuda-driver cuda-toolkit nvidia-container-runtime cuda-samples nvidia-docker cudnn tensorrt libvisionworks libvisionworks-sfm libvisionworks-tracking tegra-tools tegra-argus-daemon"
 
# gstreamer and plugins
## nvidia specific packages
IMAGE_INSTALL_append = " gstreamer1.0-omx-tegra gstreamer1.0-plugins-nveglgles gstreamer1.0-plugins-nvvideo4linux2 gstreamer1.0-plugins-nvvideosinks"
## most of these are pulled in as dependencies of the nvidia specific packages
## specify them explicitly as dependencies here to ensure they are included
## TODO: check depends and cleanup
IMAGE_INSTALL_append = " gstreamer1.0 gstreamer1.0-meta-base gstreamer1.0-plugins-base gstreamer1.0-plugins-bad"
IMAGE_INSTALL_append = " gstreamer1.0-plugins-good gstreamer1.0-python gstreamer1.0-rtsp-server gstreamer1.0-vaapi"

Since our goal is clear, how do you see this fitting into the Balena ecosystem? A 'one image to rule them all' approach would not work for all applications, I guess.

@jellyfish-bot

[robertgzr] This issue has attached support thread https://jel.ly.fish/#/de9ddbf3-0b65-4cba-a2e2-38e43855f1bd

@robertgzr
Author

robertgzr commented May 15, 2020

@dremsol how difficult do you think it would be to run the tegra-argus-daemon itself in a container as well?
then you can just share the socket with your app container and it gives you full control over the dependencies too

@dremsol
Contributor

dremsol commented May 18, 2020

Hi @robertgzr,

Good suggestion, didn't test that so far. We forked balena-jetson and got nvidia-container-runtime working with balena-engine. Besides CUDA, we included cuDNN, TensorRT, and Visionworks (jetson-nano) as required by NGC l4t containers with some minor changes in nvidia-container-runtime (runc vs. balena-runc).

root@balena:~# balena run -it --rm --net=host --runtime nvidia nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-base
Unable to find image 'nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-base' locally
4.0.2-19.12-base: Pulling from nvidia/deepstream-l4t
8aaa03d29a6e: Pull complete
......
bcac47627c16: Pull complete
Total:  [==================================================>]  559.6MB/559.6MB
Digest: sha256:58c0e19332824da544b72c5eae063d1f1a0ea876af76a8e519dd71aeb023d1de
Status: Downloaded newer image for nvcr.io/nvidia/deepstream-l4t:4.0.2-19.12-base
root@balena:~# 

Depending on the application, the following packages may be installed on HostOS where the container-runtime-csv bbclass makes the appropriate nvidia runtime links.

./external/openembedded-layer/recipes-multimedia/v4l2apps/v4l-utils_%.bbappend:inherit container-runtime-csv
./recipes-devtools/visionworks/libvisionworks-sfm_0.90.4.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/visionworks/libvisionworks_1.6.0.500n.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/visionworks/libvisionworks-tracking_0.88.2.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/gie/tensorrt_6.0.1-1.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/cudnn/cudnn_7.6.3.28-1.bb:inherit nvidia_devnet_downloads container-runtime-csv
./recipes-devtools/cuda/cuda-shared-binaries-10.0.326-1.inc:inherit container-runtime-csv
./recipes-devtools/cuda/cuda-cudart_10.0.326-1.bb:inherit container-runtime-csv siteinfo
./recipes-bsp/tegra-binaries/gstreamer1.0-plugins-tegra_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/tegra-libraries_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/tegra-firmware_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/libdrm-nvdc_32.3.1.bb:inherit container-runtime-csv
./recipes-bsp/tegra-binaries/tegra-nvphs-base_32.3.1.bb:inherit container-runtime-csv
./recipes-multimedia/libv4l2/libv4l2-minimal_1.18.0.bb:inherit autotools gettext pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nvjpeg_1.14.0-r32.3.1.bb:inherit autotools gtk-doc gettext pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-omx-tegra_1.0.0-r32.3.1.bb:inherit autotools pkgconfig gettext container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nveglgles_1.2.3-r32.3.1.bb:inherit autotools gettext gobject-introspection pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nvvideo4linux2_1.14.0-r32.3.1.bb:inherit gettext pkgconfig container-runtime-csv
./recipes-multimedia/gstreamer/gstreamer1.0-plugins-nvvideosinks_1.14.0-r32.3.1.bb:inherit gettext pkgconfig container-runtime-csv

As nvidia-container-runtime expects JetPack as the HostOS, it's not yet clear to me which packages are really necessary besides the ones already included. Anyway, we had to include the nvidia-specific gstreamer packages in our custom OS to get our application running within the deepstream container. I've tried to include them with Balena but haven't succeeded so far, as balena-jetson depends on warrior (vs zeus in meta-tegra to support nvidia-container-runtime).

➜  resin-image git:(master) cat installed-package-sizes.txt | head -n 10
436914  KiB     libcudnn7
218999  KiB     tensorrt
106699  KiB     tegra-libraries
91930   KiB     cuda-cublas
52629   KiB     balena
39126   KiB     kernel-image-initramfs
35853   KiB     go-runtime
27149   KiB     libvisionworks
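To put those numbers in perspective, here is a small Python sketch that parses rows in the `installed-package-sizes.txt` format and totals them. The helper name `parse_package_sizes` and the grouping of packages into an "NVIDIA" set are assumptions made for illustration; the sample contains only the rows shown above, not the whole file.

```python
def parse_package_sizes(text):
    """Parse '<size> KiB <package>' rows into a {package: size_in_kib} dict."""
    sizes = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[1] != "KiB":
            continue  # ignore anything that is not a size row
        sizes[parts[2]] = int(parts[0])
    return sizes


# Rows copied from the listing above.
SAMPLE = """\
436914  KiB     libcudnn7
218999  KiB     tensorrt
106699  KiB     tegra-libraries
91930   KiB     cuda-cublas
52629   KiB     balena
39126   KiB     kernel-image-initramfs
35853   KiB     go-runtime
27149   KiB     libvisionworks
"""

# Assumed grouping, chosen only for this example.
NVIDIA_PACKAGES = {"libcudnn7", "tensorrt", "tegra-libraries",
                   "cuda-cublas", "libvisionworks"}

if __name__ == "__main__":
    sizes = parse_package_sizes(SAMPLE)
    total = sum(sizes.values())
    nvidia = sum(size for name, size in sizes.items() if name in NVIDIA_PACKAGES)
    print(f"listed total: {total / 1024:.0f} MiB, NVIDIA share: {nvidia / total:.0%}")
```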

@dremsol
Contributor

dremsol commented May 19, 2020

Hi @robertgzr & @acostach,

Had a good talk with Joe today and he asked me to keep you updated. It seems that the nvidia runtime is working nicely with balena-engine and the Host packages are being mapped accordingly by using the mount plugin.

Running the deviceQuery sample returns a PASS;

root@balena:/tmp/deviceQuery# balena run -it --runtime nvidia devicequery
./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3962 MBytes (4154109952 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             1600 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

@robertgzr
Author

@dremsol this sounds amazing. with balena-engine 19.03 finally merged in its main repo we're one step closer to making all of this happen in vanilla balenaOS. I have lifted the meta-balena PR out of draft status here: balena-os/meta-balena#1824 and it's under review right now.
Once we merge the new engine there, work on this issue should pick up again.

You're using nvidia-container-runtime (the previous iteration of gpu support) while I mostly tried to make this work through https://github.com/NVIDIA/container-toolkit/ which is the approach that docker "blessed"
I don't see why the mount plugin work shouldn't be possible there, as long as libnvidia-container has the changes on its jetson branch

@dremsol
Contributor

dremsol commented May 22, 2020

Hi @robertgzr, we are very happy to hear that GPU support is moving to production. We will keep an eye on the PR.

Thanks for the suggestion, and it seems you are right. It's a bit hard to follow the footsteps of NVIDIA sometimes, but we managed to drop the dependency on the runtime. This also drops the inclusion of l4t.csv, but that has been solved in nvidia-container-toolkit, as libnvidia-container parses the .csv files.

  • Why is NVIDIA referring to --runtime nvidia everywhere as this is obsolete?

Based on the work of @acostach in jetson-nano-sample-app we would like to run all cuda samples in a stripped-down version of the Dockerfile to test the --gpus all flag and the plugin mounts. We managed to get ./clock and ./deviceQuery working. However, the remaining samples, which involve OpenGL, fail with OpenGL-related errors after building and starting the container as follows:

balena build -t cudasamples -f Dockerfile.cudesamples .
balena run -it --rm --privileged --gpus all cudasamples bash

And setting DISPLAY and running X

    $ export DISPLAY=:0
    $ X &
    $ ./clock               <PASS>
    $ ./deviceQuery         <PASS>
    $ ./postProcessGL       <FAIL>
    $ ./simpleGL            <FAIL>
    $ ./simpleTexture3D     <FAIL>
    $ ./smokeParticles      <FAIL>

Failed sample outputs look like;

simpleTexture3D Starting...

GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3

CUDA error at simpleTexture3D.cpp:247 code=30(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(&cuda_pbo_resource, pbo, cudaGraphicsMapFlagsWriteDiscard)"

simpleGL (VBO) starting...

GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3

CUDA error at simpleGL.cu:422 code=30(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(vbo_res, *vbo, vbo_res_flags)"
CUDA error at simpleGL.cu:434 code=33(cudaErrorInvalidResourceHandle) "cudaGraphicsUnregisterResource(vbo_res)"
./postProcessGL Starting...

(Interactive OpenGL Demo)
GPU Device 0: "NVIDIA Tegra X1" with compute capability 5.3

CUDA error at main.cpp:243 code=30(cudaErrorUnknown) "cudaGraphicsGLRegisterBuffer(pbo_resource, *pbo, cudaGraphicsMapFlagsNone)"
CUDA Smoke Particles Starting...

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

The following required OpenGL extensions missing:
	GL_ARB_multitexture
	GL_ARB_vertex_buffer_object
	GL_EXT_geometry_shader4.

@acostach, have you seen these errors before? It seems like it has something to do with X but I can't figure out the cause. dmesg shows the HDMI being plugged and unplugged, and when running the samples the display blinks briefly before the crash. Do you have a clue?

@dremsol
Contributor

dremsol commented May 22, 2020

@robertgzr I think I answered my own question, as it seems compose doesn't support the --gpus all flag yet, as seen in the following issue.

Anyway it shouldn't be a problem to install runtime (--runtime nvidia) alongside toolkit as both flags will probably work.

@robertgzr
Author

@dremsol

Why is NVIDIA referring to --runtime nvidia everywhere as this is obsolete?

I know, that has been a major pain when researching this topic. I guess plenty of people out there are still using the old approaches... but there are just so many repos that claim to be the one and the container-toolkit for example doesn't even come with a README but is essential for the whole thing to work.

You are right, upstream composefile support isn't progressing much: docker/compose#7124

but that will not really be an issue I hope because you can basically communicate the same using env vars, check out their base images here
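As a hedged sketch of what "communicating the same using env vars" could look like, the Python snippet below builds a compose-style service definition. `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES` are the variables NVIDIA's container runtime documentation describes; whether balena-engine and the hook honor them on a Jetson is an assumption this thread does not confirm. The function names and the default capability list are hypothetical.

```python
def nvidia_env(devices="all", capabilities=("compute", "utility")):
    """Build the environment mapping that requests GPU access via env vars."""
    return {
        "NVIDIA_VISIBLE_DEVICES": devices,
        "NVIDIA_DRIVER_CAPABILITIES": ",".join(capabilities),
    }


def service_definition(image, **env_options):
    """Return a compose-style service dict carrying the NVIDIA variables."""
    return {"image": image, "environment": nvidia_env(**env_options)}


if __name__ == "__main__":
    print(service_definition("nvcr.io/nvidia/l4t-base:r32.3.1"))
```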

@robertgzr
Author

@acostach should we try to cut down the set of commits on the wip branch? We should only need to unmask the cuda recipe, include the container-toolkit and bump meta-balena, no? I guess the rootfs size needs to be investigated, but I would leave this until the very end...

@robertgzr robertgzr mentioned this issue Jun 3, 2020
3 tasks
@acostach
Contributor

acostach commented Jun 8, 2020

@robertgzr I've been investigating what can be cut down while still being able to start a container with the '--gpus all' functionality.

What I found is that we need at least 69M worth of libraries. The container hook needs all of them and will fail if any is missing. It also loads and mounts them by default, without the need to have them in the csv.

These dependencies are:

/usr/lib/libcuda.so.1.1 16M
/usr/lib/libnvidia-ptxjitcompiler.so.32.4.2  8.0M
/usr/lib/libnvidia-fatbinaryloader.so.32.4.2 362K
/usr/lib/libnvidia-eglcore.so.32.4.2 21M
/usr/lib/libnvidia-glcore.so.32.4.2  22M
/usr/lib/libnvidia-tls.so.32.4.2 5.6K
/usr/lib/libnvidia-glsi.so.32.4.2 598K
/usr/lib/libGLX_nvidia.so.0 1.2M
/usr/lib/libEGL_nvidia.so.0 1.1M
/usr/lib/libGLESv2_nvidia.so.2 115K
/usr/lib/libGLESv1_CM_nvidia.so.1 67K

These come from the tegra-libraries package in the meta-tegra repository. The currently available rootfs space is not sufficient for them. Also, tegra-libraries adds other libnv libraries apart from these, which brings the total to around 162 MB.
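The total for the required libraries can be cross-checked with a few lines of Python. This sketch simply sums the human-readable sizes listed above, so it inherits their rounding; the helper name `to_bytes` and the interpretation of `K`/`M` as powers of 1024 are assumptions.

```python
# Assumed meaning of the size suffixes used in the listing.
UNITS = {"K": 1024, "M": 1024 ** 2}


def to_bytes(size):
    """Convert a size such as '5.6K' or '16M' to a number of bytes."""
    return float(size[:-1]) * UNITS[size[-1]]


# Sizes copied from the dependency list above.
REQUIRED_LIBS = {
    "/usr/lib/libcuda.so.1.1": "16M",
    "/usr/lib/libnvidia-ptxjitcompiler.so.32.4.2": "8.0M",
    "/usr/lib/libnvidia-fatbinaryloader.so.32.4.2": "362K",
    "/usr/lib/libnvidia-eglcore.so.32.4.2": "21M",
    "/usr/lib/libnvidia-glcore.so.32.4.2": "22M",
    "/usr/lib/libnvidia-tls.so.32.4.2": "5.6K",
    "/usr/lib/libnvidia-glsi.so.32.4.2": "598K",
    "/usr/lib/libGLX_nvidia.so.0": "1.2M",
    "/usr/lib/libEGL_nvidia.so.0": "1.1M",
    "/usr/lib/libGLESv2_nvidia.so.2": "115K",
    "/usr/lib/libGLESv1_CM_nvidia.so.1": "67K",
}


def total_mib(libs):
    """Sum the listed sizes and return the total in MiB."""
    return sum(to_bytes(size) for size in libs.values()) / UNITS["M"]


if __name__ == "__main__":
    print(f"{total_mib(REQUIRED_LIBS):.1f} MiB")
```

With the rounded figures this lands at roughly 70 MiB, which is in line with the "at least 69M" estimate.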

It's not necessary to have the cuda deb, nor devnet mirror set to install these libs and start the container. However, this would be the minimum package list necessary in the hostOS image:

    libnvidia-container-tools \
    nvidia-container-toolkit \
    nvidia-container-runtime \
    go-runtime \
    tegra-libraries \ 

Adding these to #74 and increasing the rootfs size is one step. To successfully run ./deviceQuery from inside an l4t base image, the following entries also need to be appended to the csv.

lib, /usr/lib/libnvos.so
lib, /usr/lib/libnvrm_gpu.so
lib, /usr/lib/libnvrm.so
lib, /usr/lib/libnvrm_graphics.so

Depending on each particular use-case, other libs may need to be added too.
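Appending such entries can be scripted so that existing lines are not duplicated. The Python sketch below takes the current csv content as text and returns only the `lib, <path>` lines that are still missing. The function name `missing_csv_entries` is hypothetical, and the sketch assumes the simple `kind, path` line format shown above.

```python
# The four libraries named above for ./deviceQuery.
DEVICE_QUERY_LIBS = [
    "/usr/lib/libnvos.so",
    "/usr/lib/libnvrm_gpu.so",
    "/usr/lib/libnvrm.so",
    "/usr/lib/libnvrm_graphics.so",
]


def missing_csv_entries(existing_csv, required_paths, kind="lib"):
    """Return 'kind, path' lines for required paths not yet present in the csv text."""
    present = set()
    for line in existing_csv.splitlines():
        entry_kind, sep, path = line.partition(",")
        if sep and entry_kind.strip() == kind:
            present.add(path.strip())
    return [f"{kind}, {path}" for path in required_paths if path not in present]


if __name__ == "__main__":
    current = "lib, /usr/lib/libnvos.so\n"
    for entry in missing_csv_entries(current, DEVICE_QUERY_LIBS):
        print(entry)
```

Other use cases would extend `DEVICE_QUERY_LIBS` with whatever libraries their containers need mounted from the host.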

Now, since considerably more rootfs space is necessary for running the hook, maybe we can reopen your original arch item, which according to the docs assumed only libcuda was necessary, and discuss these findings and the way forward?

@robertgzr
Author

@acostach yeah an arch call discussion for this sounds good.

what irritates me is that I would expect some subset of the libraries here https://github.com/NVIDIA/libnvidia-container/blob/jetson/src/nvc_info.c to be the ones that it requires regardless of mount plugin csv

@dremsol
Contributor

dremsol commented Jun 26, 2020

Hi @acostach & @robertgzr, any progress on the above related to PR #75? Anything we can do from our side?

@acostach
Contributor

Hi @dremsol, we are currently investigating internally how we could increase the space available on existing devices to allow for including the libraries necessary for the nvidia hook to run.

Now, on what device are you interested in running this? If it's just the photon-xavier-nx, I think it could have a larger rootfs from the very beginning, before it gets deployed to the dashboard.

@dremsol
Contributor

dremsol commented Jun 26, 2020

Hi @acostach, that's what we understood from the call we had last week with @robertgzr. As he pointed out, with a private device type this wouldn't be an issue. Anyway before moving to private device type, we will be running two devices in production;

  • photon-nano
  • photon-xavier-nx

For both devices this requires an increased rootfs. Currently @pgils is working on your suggestions in #76. After getting this PR merged he will update the photon-nano BSPs to l4t 32.4.2. What would be a reasonable rootfs size for both of these devices?

@acostach
Contributor

acostach commented Jun 26, 2020

@dremsol I think it shouldn't be extremely large, but at the same time it should allow for installing the libs needed for the hook to run. At a minimum I found around 70MB just for the hook to start. To be able to run deviceQuery in a container, another few MB for around 4 libraries would be needed. However, to ensure further compatibility, probably the whole tegra-libraries package of around 170MB would be needed. These libraries are not part of the restricted downloads. On top of them would come any other libraries or executables that specifically need to reside inside the root filesystem, to be run or mounted from there.

@openedhardware

@acostach
Any update on this issue?

@acostach
Contributor

Hi @openedhardware , this is being handled in balena-os/meta-balena#1987

@jellyfish-bot

[robertgzr] This issue has attached support thread https://jel.ly.fish/db2f3928-f02c-4ec8-96e6-0747b83227dd
