Need help with gds_partner Docker build error #43

Open
gaowayne opened this issue Jun 19, 2024 · 2 comments

@gaowayne

Hello experts,
I have installed GDS on Ubuntu 22.04 and everything works fine. gdsio shows that direct write is 3x better than the CPU-copy GPU write.

However, when I try to build the gds_partner Docker image to run the test suite, I keep getting blocked by the error below. Could you please help?

root@smcx12svr01:~/wayne/gds/usr/local/gds/docker# ./build_docker.sh -v 1.10.0 -c 12.5.0 -m 24.04-0.6.6.0
building for MOFED version 24.04-0.6.6.0
/usr/bin/7z
Saving: MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
--2024-06-19 08:36:55--  http://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
Resolving content.mellanox.com (content.mellanox.com)... 107.178.241.102
Connecting to content.mellanox.com (content.mellanox.com)|107.178.241.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325040157 (310M) [application/x-tar]
Saving to: ‘MLNX_INSTALLER.tgz’

MLNX_INSTALLER.tgz              100%[====================================================>] 309.98M  14.9MB/s    in 23s     

2024-06-19 08:37:19 (13.2 MB/s) - ‘MLNX_INSTALLER.tgz’ saved [325040157/325040157]

ubuntu22.04
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  662.6MB
Step 1/36 : FROM ubuntu:22.04
 ---> 67c845845b7d
Step 2/36 : ARG CUDA_PATH
 ---> Using cache
 ---> 0a500ae6e7f8
Step 3/36 : ARG CUDA_REPO
 ---> Using cache
 ---> bd784abb4415
Step 4/36 : ARG USE_CUSTOM_CUFILE
 ---> Using cache
 ---> 90a7109edc8c
Step 5/36 : ARG USE_LOCAL_REPO
 ---> Using cache
 ---> 56d738920fc3
Step 6/36 : ARG CUDA_VERS_PART_ONE
 ---> Using cache
 ---> ca61d5480a5b
Step 7/36 : ARG CUDA_VERS_PART_TWO
 ---> Using cache
 ---> 318ee253b9ae
Step 8/36 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 1dec7fc3a39d
Step 9/36 : ENV CUDA_PATH="/usr/local/cuda-${CUDA_PATH}"
 ---> Using cache
 ---> eb484076b884
Step 10/36 : RUN echo "cuda path: ${CUDA_PATH}"
 ---> Using cache
 ---> ee37c6f66202
Step 11/36 : RUN  apt-get update && apt-get install -y --no-install-recommends      gnupg2 curl ca-certificates software-properties-common      wget libpci3 libssl-dev
 ---> Running in 906a18424f84
Get:1 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy InRelease [270 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/universe amd64 Packages [17.5 MB]
Get:6 http://archive.ubuntu.com/ubuntu jammy/restricted amd64 Packages [164 kB]
Get:7 http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages [1792 kB]
Get:8 http://archive.ubuntu.com/ubuntu jammy/multiverse amd64 Packages [266 kB]
Get:9 http://archive.ubuntu.com/ubuntu jammy-backports/universe amd64 Packages [31.8 kB]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports/main amd64 Packages [81.0 kB]
Reading package lists...
E: Release file for http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease is not valid yet (invalid for another 6h 17min 6s). Updates for this repository will not be applied.
E: Release file for http://archive.ubuntu.com/ubuntu/dists/jammy-updates/InRelease is not valid yet (invalid for another 6h 18min 11s). Updates for this repository will not be applied.
The command '/bin/sh -c apt-get update && apt-get install -y --no-install-recommends      gnupg2 curl ca-certificates software-properties-common      wget libpci3 libssl-dev' returned a non-zero code: 100
failed to build docker for cuda ver 12.5.0 with MOFED: 24.04-0.6.6.0
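
A note on the error above: apt prints "is not valid yet" when the date in a repository's Release file is ahead of the clock that apt sees, which usually means the clock is behind (here by at least six hours). The helper below is only a sketch for reading that skew out of the log line; the name `apt_skew_seconds` is made up for illustration. The `Acquire::Check-Valid-Until=false` and `Acquire::Check-Date=false` options that appear in step 11 of the later build log in this thread are one way to skip the check; correcting the host clock would be another.

```shell
#!/bin/sh
# Sketch: convert the "invalid for another 6h 17min 6s" part of an apt
# error line into seconds. apt_skew_seconds is a hypothetical helper name.
apt_skew_seconds() {
  printf '%s\n' "$1" |
    sed -n 's/.*invalid for another \([^)]*\)).*/\1/p' |
    awk '{
      total = 0
      for (i = 1; i <= NF; i++) {
        n = $i + 0                      # numeric prefix, e.g. "17min" -> 17
        if ($i ~ /d$/)        total += n * 86400
        else if ($i ~ /h$/)   total += n * 3600
        else if ($i ~ /min$/) total += n * 60
        else if ($i ~ /s$/)   total += n
      }
      print total
    }'
}

# Example input copied from the log above.
line='E: Release file for http://security.ubuntu.com/ubuntu/dists/jammy-security/InRelease is not valid yet (invalid for another 6h 17min 6s). Updates for this repository will not be applied.'
echo "clock appears to be behind by at least $(apt_skew_seconds "$line") seconds"
```

If the reported skew is large, comparing `date -u` on the host against a trusted time source is a quick way to confirm that the clock, rather than the repository, is the problem.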
@gaowayne
Author

I have already fixed the issue above. Now the build cannot find the samples folder. I manually created one and copied the files into the source, but it is still not working.

root@smcx12svr01:/usr/local/gds/docker# ./build_docker.sh -v 1.10.0 -c 12.5 -m 24.04-0.6.6.0
building for MOFED version 24.04-0.6.6.0
/usr/bin/7z
Saving: MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
--2024-06-20 01:50:20--  http://content.mellanox.com/ofed/MLNX_OFED-24.04-0.6.6.0/MLNX_OFED_LINUX-24.04-0.6.6.0-ubuntu22.04-x86_64.tgz
Resolving content.mellanox.com (content.mellanox.com)... 107.178.241.102
Connecting to content.mellanox.com (content.mellanox.com)|107.178.241.102|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 325040157 (310M) [application/x-tar]
Saving to: ‘MLNX_INSTALLER.tgz’

MLNX_INSTALLER.tgz              100%[====================================================>] 309.98M  65.4MB/s    in 4.6s    

2024-06-20 01:50:25 (66.8 MB/s) - ‘MLNX_INSTALLER.tgz’ saved [325040157/325040157]

ubuntu22.04
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            Install the buildx component to build images with BuildKit:
            https://docs.docker.com/go/buildx/

Sending build context to Docker daemon  662.6MB
Step 1/36 : FROM ubuntu:22.04
 ---> 67c845845b7d
Step 2/36 : ARG CUDA_PATH
 ---> Using cache
 ---> 0a500ae6e7f8
Step 3/36 : ARG CUDA_REPO
 ---> Using cache
 ---> bd784abb4415
Step 4/36 : ARG USE_CUSTOM_CUFILE
 ---> Using cache
 ---> 90a7109edc8c
Step 5/36 : ARG USE_LOCAL_REPO
 ---> Using cache
 ---> 56d738920fc3
Step 6/36 : ARG CUDA_VERS_PART_ONE
 ---> Using cache
 ---> ca61d5480a5b
Step 7/36 : ARG CUDA_VERS_PART_TWO
 ---> Using cache
 ---> 318ee253b9ae
Step 8/36 : ARG DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> 1dec7fc3a39d
Step 9/36 : ENV CUDA_PATH="/usr/local/cuda-${CUDA_PATH}"
 ---> Using cache
 ---> eb484076b884
Step 10/36 : RUN echo "cuda path: ${CUDA_PATH}"
 ---> Using cache
 ---> ee37c6f66202
Step 11/36 : RUN  apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get install -y --no-install-recommends      gnupg2 curl ca-certificates software-properties-common      wget libpci3 libssl-dev
 ---> Using cache
 ---> 4f053d0b3398
Step 12/36 : ADD /cuda_repo /cuda_repo
 ---> Using cache
 ---> 69d4dcf5d0bf
Step 13/36 : ADD /custom_cufile /custom_cufile
 ---> Using cache
 ---> 75a1d99c5409
Step 14/36 : RUN if [ "$USE_LOCAL_REPO" = "1" ]; then      dpkg -i /cuda_repo/cuda_local.deb &&      cp /var/cuda-repo*/cuda-*-keyring.gpg /usr/share/keyrings;      else      curl -fsSL ${CUDA_REPO}/3bf863cc.pub | apt-key add - &&      add-apt-repository "deb ${CUDA_REPO} /"; fi
 ---> Using cache
 ---> 4c2bdedd3961
Step 15/36 : RUN  apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update && apt-get install -y --no-install-recommends      nvidia-fs      gds-tools-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-cudart-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-cudart-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-nvcc-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcufile-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-nvrtc-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcurand-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libnpp-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-nvtx-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      cuda-compat-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libtinfo5 libncursesw5      cuda-command-line-tools-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcufile-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcurand-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libnpp-dev-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}      libcusparse-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}-      libcublas-${CUDA_VERS_PART_ONE}-${CUDA_VERS_PART_TWO}-      && ln -s cuda-${CUDA_VERS_PART_ONE}.${CUDA_VERS_PART_TWO} /usr/local/cuda &&      rm -rf /var/lib/apt/lists/*      echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf      && echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf &&      apt-get -o Acquire::Check-Valid-Until=false -o Acquire::Check-Date=false update &&      apt-get upgrade -y &&      apt-get install -y --no-install-recommends      lsb-core      apt-utils      sysstat      nfs-common      iotop      sudo      kmod      binutils      gcc g++      numactl      netbase      net-tools      iproute2      iputils-ping      libnl-3-dev libnl-route-3-dev udev      p7zip-full p7zip-rar      dpkg-dev libudev-dev liburcu-dev libmount-dev libnuma-dev libjsoncpp-dev python3 libelf-dev
 ---> Using cache
 ---> f21ec859d274
Step 16/36 : ADD /mlnx_install /usr/local/mlnx_install
 ---> Using cache
 ---> 0433cbd300a9
Step 17/36 : RUN /usr/local/mlnx_install/mlnxofedinstall --user-space-only --without-fw-update --basic -q --force
 ---> Using cache
 ---> f2d8de5d3ea1
Step 18/36 : RUN apt-get install dkms -y
 ---> Using cache
 ---> 46bb23af469e
Step 19/36 : RUN sed -i 's/"allow_compat_mode": false,/"allow_compat_mode": true,/' /etc/cufile.json
 ---> Using cache
 ---> 587f4541450d
Step 20/36 : RUN  echo "${CUDA_PATH}/targets/x86_64-linux/lib/" > /etc/ld.so.conf.d/cufile.conf
 ---> Using cache
 ---> 617845530a74
Step 21/36 : RUN  ldconfig
 ---> Using cache
 ---> dfa9c7e2336f
Step 22/36 : RUN mkdir -p /usr/local/gds/tools && cp ${CUDA_PATH}/gds/tools/README /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gds_stats /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck.py /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscp /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio_verify /usr/local/gds/tools/ && cp -rf /usr/local/gds/samples/ /usr/local/gds/tools/samples/
 ---> Running in f4c360d04970
cp: cannot stat '/usr/local/gds/samples/': No such file or directory
The command '/bin/sh -c mkdir -p /usr/local/gds/tools && cp ${CUDA_PATH}/gds/tools/README /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gds_stats /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscheck.py /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdscp /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio /usr/local/gds/tools/ && cp ${CUDA_PATH}/gds/tools/gdsio_verify /usr/local/gds/tools/ && cp -rf /usr/local/gds/samples/ /usr/local/gds/tools/samples/' returned a non-zero code: 1
failed to build docker for cuda ver 12.5 with MOFED: 24.04-0.6.6.0
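
One thing worth checking for the error above: the `cp` in step 22 runs inside the image being built, so `/usr/local/gds/samples/` has to exist in the image's filesystem. A directory created on the host is only visible to the build if an `ADD` or `COPY` step brings it in. The sketch below looks for the first directory that exists among a list of candidates; `first_existing_dir` is a hypothetical helper, and `${CUDA_PATH}/gds/samples` is an assumed location that this thread does not confirm.

```shell
#!/bin/sh
# Sketch: print the first argument that is an existing directory.
# first_existing_dir is a hypothetical helper, not part of the GDS scripts.
first_existing_dir() {
  for d in "$@"; do
    if [ -d "$d" ]; then
      printf '%s\n' "$d"
      return 0
    fi
  done
  return 1
}

# The first candidate is an assumption; the second is the path from the log.
first_existing_dir \
  "${CUDA_PATH:-/usr/local/cuda}/gds/samples" \
  /usr/local/gds/samples ||
  echo "no samples directory found in the candidates"
```

Run inside the image (for example from a temporary `RUN` step), this shows whether any samples directory is present before the `cp` is attempted.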

@gaowayne
Author

I fixed the issue above and can now build the container. However, when I run the GDS container, it shows the kernel build error below inside the container. Could you please help?

I ran this: /usr/local/gds/docker# ./gds_docker.sh -p /mnt/nvme -v 1.10.0 -c 12.5.0 -m -t sanity

Here is the resulting log, which contains errors.

rm -rf *.o *.ko* *.mod.* .*.cmd nv.symvers Module.symvers modules.order .tmp_versions/ *~ core .depend TAGS .cache.mk *.o.ur-safe
rm -f config-host.h
rm -f nvidia-fs.mod
Getting symbol versions from /lib/modules/5.15.0-112-generic/updates/dkms/nvidia.ko ...
Created: /usr/src/nvidia-fs/nv.symvers
checking if uaccess.h access_ok has 3 parameters... no
checking if uaccess.h access_ok has 2 parameters... no
Checking if blkdev.h has blk_rq_payload_bytes... no
Checking if fs.h has call_read_iter and call_write_iter... no
Checking if fs.h has filemap_range_has_page... no
Checking if kiocb structue has ki_complete field... no
Checking if vm_fault_t exist in mm_types.h... no
Checking if enum PCIE_SPEED_32_0GT exists in pci.h... no
Checking if enum PCIE_SPEED_64_0GT exists in pci.h... no
Checking if atomic64_t counter is of type long... no
Checking if RQF_COPY_USER is present or not... no
Checking if dma_drain_size and dma_drain_needed are present in struct request_queue... no
Checking if struct proc_ops is present or not ... no
Checking if split is present in vm_operations_struct or not ... no
Checking if mremap in vm_operations_struct has one parameter... no
Checking if mremap in vm_operations_struct has two parameters... no
Checking if symbol module_mutex is present... no
Checking if blk-integrity.h is present... no
Checking if KI_COMPLETE has 3 parameters ... no
Checking if pin_user_pages_fast symbol is present in kernel or not ... no
Checking if prandom_u32 symbol is present in kernel or not ... no
Checking if devnode in class has doesn't have const device or not ... no
Checking if class_create has two parameters or not ... no
Checking if vma_flags are modifiable directly ... no
make[1]: Entering directory '/usr/src/linux-headers-5.15.0-112-generic'
make[1]: Makefile: No such file or directory
make[1]: *** No rule to make target 'Makefile'.  Stop.
make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-112-generic'
make: *** [Makefile:107: module] Error 2
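
The `Makefile: No such file or directory` message above suggests that the kernel headers directory is visible inside the container but incomplete. On Ubuntu, the `Makefile` in `/usr/src/linux-headers-<version>-generic` is normally a symlink into the sibling `/usr/src/linux-headers-<version>` directory, so the link dangles if only the `-generic` directory is available in the container. Whether that is what happened here is an assumption. The sketch below classifies a headers directory; `check_kernel_headers` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether a kernel headers directory has a usable Makefile.
# check_kernel_headers is a hypothetical helper, not part of nvidia-fs.
check_kernel_headers() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing-dir"
  elif [ -e "$dir/Makefile" ]; then
    echo "ok"                      # -e follows symlinks, so the target exists
  elif [ -L "$dir/Makefile" ]; then
    echo "dangling-symlink"        # link is present but its target is not
  else
    echo "missing-makefile"
  fi
}

check_kernel_headers "/usr/src/linux-headers-$(uname -r)"
```

A `dangling-symlink` result inside the container would point to the non-`-generic` headers directory not being available there.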
