docker devicerequests/nvidia support #57
@acostach I think we will need: https://github.com/madisongh/meta-tegra/blob/master/recipes-devtools/cuda/cuda-driver_10.0.326-1.bb which would give us the driver libs, right? And we will also need https://github.com/madisongh/meta-tegra/tree/master/recipes-containers/libnvidia-container-tools
Looks like they might, @robertgzr. I need to check with a yocto build that includes these 2 packages. It will take a bit, because the cuda packages first need to be downloaded locally with the nvidia sdk manager; they and their dependencies can't be pulled by yocto automatically. I'll get back to you.
I think we will still run out of space. I already have trouble getting the new balena-engine binary onto some devices because of the size increase of the binary, and the cuda stuff is going to add at least another 15 MB or so.
@robertgzr I built an image with those and it's available on dev 3d612ed56aaa2ba22cf73ba7a2021cb7 if you want to test with the patched engine. A couple of notes:
@acostach do you have a branch on here I can use? I would like to pull in the engine using balena-os/meta-balena#1824 rather than copying the binary around...
@acostach I'm trying to figure out why it won't work out of the box...
I also feel like the container-cli can be used to ask which components of the driver are required:
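For reference, the jetson branch of libnvidia-container ships a CLI that can enumerate what the driver stack needs. A sketch of the queries, assuming libnvidia-container-tools is installed on the device (this cannot run off-device):

```
# Illustrative only: requires the nvidia userland stack on a Jetson device.
nvidia-container-cli info   # print the driver/device information it detected
nvidia-container-cli list   # list the libraries, binaries and device nodes
                            # it would expose to a container
```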
where can I see which version of the driver is installed on the tx2? |
@robertgzr it's the 32.3.1 driver from L4T 32.3.1, if that's what you are referring to: `root@3d612ed:~# modinfo /lib/modules/4.9.140-l4t-r32.3.1/kernel/drivers/gpu/nvgpu/nvgpu.ko` I'm wondering if it does so because the driver isn't initialized: the firmware blobs that usually get extracted into the container weren't loaded by the driver, as they aren't in the hostOS. I'm referring to (BSP archive) Tegra186_Linux_R32.3.1.tbz2/Linux_for_tegra/nv_tegra/nvidia_drivers.tbz2/lib/firmware/tegra18x and gp10b. Not sure this is the issue, but can you try to initialize it first? Maybe from a container that you then shut down while leaving the driver loaded, or by unpacking nv_drivers directly in the hostOS: https://github.com/balena-io-playground/tx2-container-contracts-sample/blob/16d3ad09f0615956389f04105e3b533be9620388/tx2_32_2/Dockerfile.template#L7 but use the 32.3.1 BSP archive for the TX2 from here: https://developer.nvidia.com/embedded/linux-tegra I haven't had time yet to look into or release a 32.3.1-based balenaOS for the TX2, but if you are having issues unpacking the BSP archive in the container, here's how it works for the Nano on 32.3.1: https://github.com/acostach/jetson-nano-container-contracts/blob/51e9bfa97a91692c6b806ed32c9e96e656f5b088/nano_32_3_1/Dockerfile.template#L7
I think we're fine in the driver department. It looks like docker is only loading its internal compat layer for the nvidia stuff. I'm going to see if I can find where this is supposed to come from, but I think it's usually installed as part of the libnvidia-container package
@robertgzr thanks, I've updated https://github.com/balena-os/balena-jetson/commits/cuda_libs_test with this package, let me know if it works with it |
@acostach any idea why the runtime-hook is complaining about missing libraries:
is libstd not a rust thing? |
Not sure, @robertgzr. I see this libstd is provided by rust in the rootfs, but probably the hook binary comes pre-compiled and was built against a different version of the library?
The thing is, there is no rust dependency. And the hook binary should be built from source by https://github.com/madisongh/meta-tegra/blob/master/recipes-containers/nvidia-container-toolkit/nvidia-container-toolkit_1.0.5.bb from https://github.com/NVIDIA/container-toolkit/tree/60f165ad6901f85b0c3acbf7ce2c66cd759c4fb8/nvidia-container-toolkit Something is wrong here... but I don't understand what
@robertgzr It doesn't look like a rust dependency, unless I'm mistaken somewhere. And that's right, the hook binary is built from sources, but they are go sources. So it appears there are 2 libstds: one from rust, as you said, which isn't good for us, and another one from go. The go version we currently have in the image comes from meta-balena and is at version 1.10. I think the hook was built against some newer go 'headers', although I'm not familiar with the go workflow or build process.
Looking at this: https://github.com/golang/go/blob/a1550d3ca3a6a90b8bbb610950d1b30649411243/src/cmd/internal/goobj2/builtinlist.go#L185
Please try to run it again now on the shared device and check if this unblocks you.
@acostach oh ok, that makes more sense now... Sounds to me like something is still up with our go integration in meta-balena. If you check my PR here: balena-os/meta-balena#1824, it provides the nvidia-enabled balena-engine; part of it is a bump to go 1.12.12. Sounds like the build of nvidia-container-toolkit uses a different go toolchain than that one? How is that possible? I thought we could enforce it via the GOVERSION env from meta-balena.
The toolkit recipe shouldn't have a dependency on any version of go, btw. If it gets built by 1.12.12 it should just work. I have actually never encountered something like that. I didn't even know the go stdlib could be loaded as shared
I looked at the documentation a little bit: yocto compiles the go runtime (which includes the stdlib) as a shared library, and the go.bbclass in poky has a switch to link the recipe against that shared lib, which I think is enabled by default on supported platforms: https://golang.org/cmd/go/#hdr-Compile_packages_and_dependencies (ctrl-f "linkshared")
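If that shared-stdlib linkage is the culprit, the switch can presumably be turned off per-recipe. A sketch of a bbappend, assuming the GO_DYNLINK variable name from poky's go classes (worth verifying against the go.bbclass/goarch.bbclass revision actually in use):

```
# nvidia-container-toolkit_%.bbappend (sketch, variable name assumed)
# Assumption: poky's goarch.bbclass sets GO_DYNLINK on supported targets,
# and go.bbclass turns that into the -linkshared build flag. Clearing it
# should make the hook binary statically link the go stdlib instead of
# depending on the shared go runtime library at run time.
GO_DYNLINK = ""
```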
I manually compiled it without those flags and now it works:
Hi @robertgzr, @acostach. I'm happy to see progress on the subject. We have been asking for this feature for a long time. Will this be supported in BalenaOS anytime soon? |
Hi @dremsol, currently this is something we're considering and investigating; we don't have a timeline, as final conclusions haven't been reached yet.
Hi @robertgzr & @acostach, I've taken a deeper look at this issue and I would like to share our experiences. Besides, I have a couple of questions which I hope you can answer. First of all, our custom OS indeed shows similar output
This allows us to run the CUDA samples. The first thing I would like to point out is that mounting a CSI camera into the docker container requires a daemon running in the HostOS (tegra-argus-daemon). The additional argument to add to the run command for accessing the CSI camera from within the container is: Now we are building our application using the deepstream-l4t container, and this is where it gets interesting, as the required HostOS packages become application-dependent. Besides CUDA, it requires cuDNN and TensorRT. While this is still feasible to include somehow (either static or configurable through balenaCloud), it becomes a mess once you need to include the application-specific gstreamer plugins in the HostOS. To give a small snippet (not optimized):
As our goal is clear, how do you see this fitting into the balena ecosystem? A 'one image to rule them all' approach would not work for all applications, I guess.
[robertgzr] This issue has attached support thread https://jel.ly.fish/#/de9ddbf3-0b65-4cba-a2e2-38e43855f1bd |
@dremsol how difficult do you think it would be to run the |
Hi @robertgzr, Good suggestion, didn't test that so far. We forked balena-jetson and got nvidia-container-runtime working with balena-engine. Besides CUDA, we included cuDNN, TensorRT, and Visionworks (jetson-nano) as required by NGC l4t containers with some minor changes in nvidia-container-runtime (runc vs. balena-runc).
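For context, with stock docker, nvidia-container-runtime is wired in via daemon.json. A sketch of that wiring (whether balena-engine reads the same file, and at which path, is an assumption to verify against the fork):

```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
```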
Depending on the application, the following packages may be installed on the HostOS where the
As nvidia-container-runtime expects JetPack as the HostOS, it's not yet clear to me which packages are really necessary besides the ones already included. Anyway, we had to include the nvidia-specific gstreamer packages in our custom OS to get our application running within the deepstream container.
Hi @robertgzr & @acostach, Had a good talk with Joe today and he asked me to keep you updated. It seems that the nvidia runtime is working nicely with balena-engine and the Host packages are being mapped accordingly by using the mount plugin. Running the deviceQuery sample returns a PASS:
@dremsol this sounds amazing. With balena-engine 19.03 finally merged into its main repo, we're one step closer to making all of this happen in vanilla balenaOS. I have lifted the meta-balena PR out of draft status here: balena-os/meta-balena#1824 and it's under review right now. You're using nvidia-container-runtime (the previous iteration of gpu support) while I mostly tried to make this work through https://github.com/NVIDIA/container-toolkit/ which is the approach that docker "blessed"
Hi @robertgzr, we are very happy to hear that GPU support is moving toward production. We will keep an eye on the PR. Thanks for the suggestion; it seems you are right. It's a bit hard to follow in NVIDIA's footsteps sometimes, but we managed to drop the dependency on the runtime. However, this also drops the inclusion of
Based on the work of @acostach in jetson-nano-sample-app, we would like to run all cuda samples in a stripped-down version of the Dockerfile to test the
And setting DISPLAY and running X
Failed sample outputs look like:
@acostach, have you seen these errors before? It seems like it has something to do with X, but I can't figure out the cause. dmesg shows the HDMI plugging and unplugging, and when running the samples the display blinks briefly before the crash. Do you have a clue?
@robertgzr I think I answered my own question, as it seems compose doesn't support it. Anyway, it shouldn't be a problem to install the runtime (
I know, that has been a major pain when researching this topic. I guess plenty of people out there are still using the old approaches... but there are just so many repos that claim to be the one, and the container-toolkit, for example, doesn't even come with a README but is essential for the whole thing to work. You are right that upstream composefile support isn't progressing much: docker/compose#7124, but I hope that won't really be an issue, because you can basically communicate the same thing using env vars; check out their base images here
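A sketch of how that looks in a composefile, using the env vars the NVIDIA base images document (service and image names are placeholders):

```yaml
version: "2.4"
services:
  cuda-app:
    image: some-org/l4t-app:latest   # placeholder
    environment:
      # picked up by the nvidia hook/runtime instead of a `gpus:` key
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
```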
@acostach should we try to cut down the set of commits on the wip branch? We should only need to unmask the cuda recipe, include the container-toolkit, and bump meta-balena, no? I guess the rootfs size needs to be investigated, but I would leave that until the very end...
@robertgzr I've been investigating what can be cut down while still being able to load a container with the 'gpus --all' functionality. What I found is that we need at least 69M worth of libraries. The container-hook needs these and will fail if any of them is missing; it also loads and mounts them by default, without the need to have them in the csv. These dependencies are:
and they come from the tegra-libraries package in the meta-tegra repository. The currently available rootfs space is not sufficient for these. Also, tegra-libraries adds other libnv libraries apart from them, which total around 162 MB. It's not necessary to have the cuda deb, nor the devnet mirror set, to install these libs and start the container. However, this would be the minimum package list necessary in the hostOS image:
Adding these to #74 and increasing the rootfs size is one step. To successfully run ./deviceQuery from inside an l4t base image, the following entries also need to be appended to the csv.
Depending on each particular use-case, other libs may need to be added too. Now, since considerably more rootfs space is necessary for running the hook, maybe we can reopen your original arch item, which according to the docs assumed only libcuda was necessary, and discuss these findings and the way forward?
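For anyone following along: the csv files consumed by the jetson mount plugin are lists of `type, path` pairs. A sketch with illustrative paths (these are examples of the format, not the measured minimal set discussed above):

```
# host-files-for-container.d/*.csv sketch; paths depend on the L4T release
lib, /usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1.1
sym, /usr/lib/aarch64-linux-gnu/tegra/libcuda.so
dir, /usr/local/cuda-10.0/targets/aarch64-linux/include
dev, /dev/nvhost-ctrl-gpu
```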
@acostach yeah, an arch call discussion for this sounds good. What irritates me is that I would expect some subset of the libraries here https://github.com/NVIDIA/libnvidia-container/blob/jetson/src/nvc_info.c to be the ones that it requires, regardless of the mount plugin csv
Hi @acostach & @robertgzr, any progress on the above related to PR #75? Anything we can do from our side?
Hi @dremsol, we are currently investigating internally how we could increase the currently available space on existing devices, to allow for including the libraries necessary for the nvidia hook to run. Now, on what device are you interested in running this? If it's just the photon-xavier-nx, I think it could have a larger rootfs from the very beginning, before it gets deployed to the dashboard.
Hi @acostach, that's what we understood from the call we had last week with @robertgzr. As he pointed out, with a private device type this wouldn't be an issue. Anyway, before moving to a private device type, we will be running two devices in production:
For both devices this requires an increased rootfs. Currently @pgils is working on your suggestions in #76. After getting this PR merged he will update the photon-nano BSPs to l4t 32.4.2. What would be a reasonable rootfs size for both of these devices?
@dremsol I think it shouldn't be extremely large, but at the same time it should allow for installing the libs needed for the hook to run. At a minimum I found around 70MB just for the hook to start. To be able to run deviceQuery in a container, another few MB for around 4 libraries would be needed. However, to ensure further compatibility, probably the whole tegra-libraries package of around 170MB would be needed. These libraries are not part of the restricted downloads. Apart from them would come any other libraries or executables that you specifically need to reside inside the root filesystem, to be run or mounted from there.
@acostach |
Hi @openedhardware , this is being handled in balena-os/meta-balena#1987 |
[robertgzr] This issue has attached support thread https://jel.ly.fish/db2f3928-f02c-4ec8-96e6-0747b83227dd |
we want to support exposing gpu resources to user containers via the new DeviceRequests API introduced in docker 19.03.x
To enable this we need to have the nvidia driver, userland driver-support libraries and libnvidia-container in the host os
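For reference, this is roughly the payload shape that `docker run --gpus all` produces for `POST /containers/create` since API 1.40 (engine 19.03). Field names follow the Docker Engine API; the image name is a placeholder:

```python
import json

def gpu_container_body(image):
    """Build a container-create request body asking for all GPUs."""
    return {
        "Image": image,
        "HostConfig": {
            "DeviceRequests": [{
                "Driver": "nvidia",        # which device driver handles the request
                "Count": -1,               # -1 means "all GPUs", like --gpus all
                "DeviceIDs": [],           # alternatively, specific GPU UUIDs/indices
                "Capabilities": [["gpu"]], # OR-of-ANDs capability sets
                "Options": {},
            }]
        },
    }

body = gpu_container_body("nvcr.io/nvidia/l4t-base:r32.3.1")
print(json.dumps(body["HostConfig"]["DeviceRequests"][0], indent=2))
```

The engine matches `Driver`/`Capabilities` against installed device plugins (the nvidia hook, in our case) before starting the container.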
WIP branch: https://github.com/balena-os/balena-jetson/tree/rgz/cuda_libs_test
Depends-on: balena-os/meta-balena#1824 [merged 🎊]
arch call notes (internal): https://docs.google.com/document/d/1tFaDKyTsdi1TUfxfAjAAGJCfUVCwmPxIstdrYaOJ-I0
arch call item (internal): https://app.frontapp.com/open/cnv_5bqfytf