RHEL SBSA TF2 Backend Build #106

Merged 2 commits on Sep 11, 2024
8 changes: 6 additions & 2 deletions CMakeLists.txt
@@ -134,7 +134,11 @@ configure_file(src/libtriton_tensorflow.ldscript libtriton_tensorflow.ldscript C

 if (${TRITON_TENSORFLOW_DOCKER_BUILD})
   if (CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "aarch64")
-    set(LIBS_ARCH "aarch64")
+    if(${RHEL_BUILD})
+      set(LIBS_ARCH "sbsa")
+    else()
+      set(LIBS_ARCH "aarch64")
+    endif()
   else()
     set(LIBS_ARCH "x86_64")
   endif()
@@ -155,7 +159,7 @@ if (${TRITON_TENSORFLOW_DOCKER_BUILD})
COMMAND docker stop tensorflow_backend_deps || echo "error ignored..." || true
COMMAND docker rm tensorflow_backend_deps || echo "error ignored..." || true
COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then mkdir tf_backend_deps && docker run -it -d --name tensorflow_backend_deps ${TRITON_TENSORFLOW_DOCKER_IMAGE} \; fi \;
- COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/x86_64-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
+ COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
Contributor

Can't the find utility make this more transparent?

Suggested change
- COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*" | tar --strip-components=3 -xf - -C ./tf_backend_deps \; fi
+ COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker exec tensorflow_backend_deps sh -c "find /usr/local/cuda/targets/*-linux/lib -name libnccl.so* -exec tar -cvf archive.tar {} \; fi "

Contributor Author

This is a great suggestion! Let me examine the tar commands for equivalency first.

Contributor

If you can use find, then please remove this block:

    if(${RHEL_BUILD})
      set(LIBS_ARCH "sbsa")
    else()
      set(LIBS_ARCH "aarch64")
    endif()

Contributor Author

@fpetrini15, Sep 5, 2024

I spent yesterday afternoon trying to find the right incantation for this, so far without success, but I've learned a lot more about what's going on here. As I understand it, we tar every libnccl.so match inside the external tf container and write the archive to standard output (note the "-" passed to -f; see https://www.gnu.org/software/tar/manual/html_node/file.html):

"tar -cf - $<IF:$<BOOL:${RHEL_BUILD}>,/usr/local/cuda/targets/${LIBS_ARCH}-linux/lib/,/usr/lib/${LIBS_ARCH}-linux-gnu/>libnccl.so*"

The subsequent command then picks up that stream and expands it in the build container, stripping the leading path (/path/to/lib/libnccl.so --> libnccl.so):

tar --strip-components=3 -xf - -C ./tf_backend_deps
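
The two halves of that pipe can be tried locally without Docker. The following is a minimal sketch, assuming a throwaway directory tree that stands in for the container filesystem; the fixture paths and file contents are made up for illustration, and the relative path mimics how tar drops the leading "/" from absolute member names.

```shell
#!/bin/sh
# Local stand-in for the container-to-host copy; no Docker required.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical fixture mimicking usr/lib/<arch>-linux-gnu/ (three directory levels).
mkdir -p "$work/src/usr/lib/aarch64-linux-gnu" "$work/dest"
echo "fake nccl" > "$work/src/usr/lib/aarch64-linux-gnu/libnccl.so.2"
ln -s libnccl.so.2 "$work/src/usr/lib/aarch64-linux-gnu/libnccl.so"

# Producer: write one archive containing every match to stdout ("-f -").
# Consumer: read the archive from stdin, drop the 3 leading path
# components of each member, and unpack into the destination directory.
(cd "$work/src" && tar -cf - usr/lib/aarch64-linux-gnu/libnccl.so*) \
  | tar --strip-components=3 -xf - -C "$work/dest"

ls -l "$work/dest"
```

--strip-components counts path elements, so the value 3 matches a source directory that is exactly three levels deep. With a deeper directory such as usr/local/cuda/targets/<arch>-linux/lib/, stripping three elements would leave targets/<arch>-linux/lib/ in front of each file name.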

Nothing I've tried with the find command matches this behavior. That said, I also think this is a really odd way to copy these files. I've tried a few alternative strategies, such as mounting the tf_backend_deps folder and copying the libs in, which I thought would be more straightforward, but that hasn't worked either.
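
For reference, one possible shape for a find-based variant is sketched below. It has only been reasoned through against a local fixture, not against the TensorFlow container or the CMake COMMAND, and every path in it is hypothetical. The idea is to let find discover the library directory and then run tar from inside that directory, so the member names are bare file names and no --strip-components value is needed.

```shell
#!/bin/sh
# Sketch only: find locates the directory, tar streams the files out of it.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Hypothetical fixture; the arch directory name is not hard-coded below.
fixture="$work/root/usr/local/cuda/targets/sbsa-linux/lib"
mkdir -p "$fixture" "$work/dest"
echo "fake nccl" > "$fixture/libnccl.so.2"
ln -s libnccl.so.2 "$fixture/libnccl.so"

# Step 1: let find discover which directory holds the library.
first_match=$(find "$work/root" -name 'libnccl.so*' | head -n 1)
if [ -z "$first_match" ]; then
  echo "libnccl.so not found" >&2
  exit 1
fi
libdir=$(dirname "$first_match")

# Step 2: archive from inside that directory so member names carry no
# leading path, then unpack the stream on the receiving side.
(cd "$libdir" && tar -cf - libnccl.so*) | tar -xf - -C "$work/dest"

ls -l "$work/dest"
```

In the PR's setting, step 1 and the producer half of step 2 would have to run inside docker exec tensorflow_backend_deps sh -c '...', with the consumer half on the build side as it is today. The extra escaping of "$", ";" and quotes that a CMake COMMAND requires is left out here.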

I will continue trying to find a way to resolve this, but since this is not tested by default in our CI, I will prioritize getting the EA3 image out for now.

COMMAND if [ "${TRITON_TENSORFLOW_INSTALL_EXTRA_DEPS}" = "ON" ] \; then docker stop tensorflow_backend_deps && docker rm tensorflow_backend_deps \; fi \;

COMMENT "Extracting ${TRITON_TENSORFLOW_CC_LIBNAME}.${TRITON_TENSORFLOW_VERSION} and ${TRITON_TENSORFLOW_FW_LIBNAME}.${TRITON_TENSORFLOW_VERSION} from ${TRITON_TENSORFLOW_DOCKER_IMAGE}"