
Issues in building from source #20

Open
divyam3897 opened this issue Jan 17, 2019 · 10 comments

@divyam3897

Configuration:
Operating System: Linux Ubuntu 16.04
Python version: 3.5.2
Tensorflow version: 1.12.0
Cuda version: 9.0
GPU: TITAN X (Pascal)

Command to Reproduce
make compile

Problem:
The build command fails with the following errors:

ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 252; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 713; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 1186; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 1661; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas fatal   : Ptx assembly aborted due to errors
Makefile:106: recipe for target 'build/blocksparse_hgemm_cn_op_gpu.cu.o' failed
make: *** [build/blocksparse_hgemm_cn_op_gpu.cu.o] Error 255

pip install blocksparse fails too, resulting in #7.

@ThomasHagebols

Hi Divyam,

First, let me ask some details: are you working in a virtualenv or a conda env? If so, activate that environment before compiling, then run a make clean followed by a make compile.

I can confirm that in many cases the pip install doesn't work; it seems to be highly dependent on your specific setup, so compiling yourself is necessary in most cases.

Now some general advice: the README is currently outdated. There are some extra requirements you need to install that are not mentioned there.

Python requirements:

  • networkx

System requirements:

  • mpich (apt-get install mpich)
  • Nvidia cuDNN
  • Nvidia NCCL
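The requirements above can be sanity-checked before compiling. This is only a sketch for the Ubuntu-style setup described in this thread; the package and library names are the common ones and may differ on your system:

```shell
# Pre-flight check (a sketch; assumes Ubuntu-style tooling as in this thread):
# verify the extra requirements are visible before running make compile.
python -c "import networkx" 2>/dev/null && echo "networkx ok" \
  || echo "networkx MISSING (pip install networkx)"
command -v mpicc >/dev/null && echo "mpich ok" \
  || echo "mpich MISSING (apt-get install mpich)"
ldconfig -p 2>/dev/null | grep -q libcudnn && echo "cudnn ok" || echo "cudnn not found"
ldconfig -p 2>/dev/null | grep -q libnccl  && echo "nccl ok"  || echo "nccl not found"
```

Run it inside the same virtualenv or conda env you plan to compile in, since that determines which python and which libraries are found.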

Your issue seems to be related to the fact that you are using compute_70 with CUDA 9. For some reason this didn't work in my case either. If you comment out the following lines in the Makefile, your issue might be solved:

 	-gencode=arch=compute_70,code=sm_70 \
 	-gencode=arch=compute_70,code=compute_70
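If you'd rather script that edit, here is a minimal sketch that removes the compute_70 gencode lines instead of commenting them (assumptions: GNU sed, the Makefile in the current directory). Check the result afterwards, since line continuations in the recipe may need adjusting:

```shell
# Sketch: drop every compute_70 gencode line so ptxas under CUDA 9.0 is never
# asked to assemble sm_70 code. A .bak backup of the Makefile is kept.
sed -i.bak '/-gencode=arch=compute_70/d' Makefile
grep -n 'compute_70' Makefile || echo "compute_70 lines removed"
```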

Good luck

@divyam3897
Author

divyam3897 commented Jan 18, 2019

Hi @ThomasHagebols

Thank you for your valuable comment; I agree that the README is outdated. To answer your question: I am working in a virtualenv, I already had the system requirements fulfilled, and commenting out those lines did let the build succeed.

However, if I run test/blocksparse_matmul_test.py after the build, I am back to #7, which the discussion in #7 suggests was fixed in the source, but it looks like it still exists?

@ThomasHagebols

I'll make a pull request for the README and update the requirements in the setup file.

I have the same issue with failing tests. Unfortunately I don't have the expertise to fix those issues.

@scott-gray
Contributor

The m8n32k16 error is just a matter of not having CUDA >= 9.2. All the breaking PTX changes Nvidia has been making lately are kind of annoying; they could easily have been avoided with a small amount of foresight.
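That version requirement can be checked before compiling. A sketch, assuming nvcc is on PATH and a sort that supports -V (GNU coreutils); the cuda_ok helper is hypothetical, not part of the repo:

```shell
# Sketch (assumptions: nvcc on PATH, GNU sort -V): the m8n32k16 wmma shapes
# only assemble with ptxas from CUDA 9.2 or newer.
cuda_ok() {   # succeeds if version "$1" is >= 9.2
  [ "$(printf '%s\n9.2\n' "$1" | sort -V | head -n1)" = "9.2" ]
}
ver=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p')
if cuda_ok "${ver:-0}"; then
  echo "CUDA ${ver}: new enough for the sm_70 kernels"
else
  echo "CUDA ${ver:-not found}: too old, comment out the compute_70 gencode lines"
fi
```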

Anyway, we have a paper going out soon covering the blocksparse transformer ops. I plan to clean things up and fully document everything before then, and I'll have some new conv kernels as well. We're pushing hard now on learned sparsity in a variety of architectures, so this code is changing quickly internally. Though I should warn you that a lot of the new development mostly targets tensorcore-capable hardware.

@divyam3897
Author

Hi @scott-gray

Thank you for your comment; looking forward to the changes!
The build succeeds with the changes in the Makefile, but even then the import fails with tensorflow.python.framework.errors_impl.NotFoundError: ...../blocksparse_ops.so: undefined symbol: _ZN3MPI8Datatype4FreeEv, as also raised in #7. It seems you committed a fix for this before, but it still persists, which is why the pip install fails to work as well.

@scott-gray
Contributor

No idea what that error could be. Something must be off with your build env. I put some comments at the bottom of the Makefile showing the env I use:

https://github.com/openai/blocksparse/blob/master/Makefile

Note that you no longer need to patch TensorFlow to support batched matmul in fp16, but you will still likely need to build it from source to get CUDA >= 9.2 support.

@ruiwang2uber

Same configuration as above; I have exactly the same problem.

When I comment out

 	-gencode=arch=compute_70,code=sm_70 \
 	-gencode=arch=compute_70,code=compute_70

make compile finishes, but when I try test/blocksparse_matmul_test.py it fails with:

(spinningup) ruiwang@ubuntu-ruiwang:~/blocksparse$ python test/blocksparse_matmul_test.py
Traceback (most recent call last):
  File "test/blocksparse_matmul_test.py", line 12, in <module>
    from blocksparse.matmul import BlocksparseMatMul, SparseProj, group_param_grads
  File "/home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/matmul.py", line 13, in <module>
    import blocksparse.ewops as ew
  File "/home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/ewops.py", line 17, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: _ZN10tensorflow15OpKernelContext10input_listENS_11StringPieceEPNS_11OpInputListE
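One way to see what that undefined symbol actually is: demangling it shows a TensorFlow C++ method, which suggests the .so was compiled against a different TensorFlow version (and ABI) than the wheel installed in the env. A sketch using c++filt, which ships with binutils:

```shell
# Demangle the missing symbol. It resolves to a TensorFlow C++ method,
# i.e. the compiled blocksparse_ops.so expects a TensorFlow ABI that the
# installed TensorFlow shared library does not export.
c++filt _ZN10tensorflow15OpKernelContext10input_listENS_11StringPieceEPNS_11OpInputListE
# -> tensorflow::OpKernelContext::input_list(tensorflow::StringPiece, tensorflow::OpInputList*)
```

If the demangled name points into TensorFlow like this, rebuilding blocksparse against the exact TensorFlow version installed in the active env is the usual remedy.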

@ruiwang2uber

@scott-gray @divyam3897 did you ever get a chance to resolve this? Thanks!

@rohitg1594

Same issue as @ruiwang2uber and @divyam3897, any updates? Would appreciate it. Thanks!

@jlozano

jlozano commented Dec 30, 2020

In case this helps anyone, I created the following Dockerfile and instructions that worked for me:

Dockerfile (place this in root of the blocksparse repo):

FROM tensorflow/tensorflow:1.15.2-gpu-py3
RUN pip install --upgrade pip
RUN pip3 install tensorflow-gpu==1.13.1

# Need this to run the tests
RUN pip3 install networkx==2.5

ENV NCCL_VERSION=2.4.8-1+cuda10.0
RUN apt-get update && apt-get install -y --no-install-recommends \
  mpich \
  libmpich-dev \
  libnccl2=${NCCL_VERSION} \
  libnccl-dev=${NCCL_VERSION} \
  curl

# Make sure the linker knows where to look for things
ENV LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"

Instructions (you might need to run these commands with sudo):
NOTE:

  • commands prefixed by $ should be run in a shell on the host machine
  • commands prefixed by # should be run in an interactive shell in the docker container
  1. Build the image
$ docker image build -f Dockerfile --rm -t blocksparse:local .
  2. Start a docker container with an interactive terminal; choose the relevant CPU or GPU option below

CPU

  • the tests below will fail if you try to run them without GPU support
  • the ln command should be run inside the docker container
$ docker run -it --privileged -w /working_dir -v ${PWD}:/working_dir --rm blocksparse:local
# ln -s /usr/local/cuda/compat/libcuda.so /usr/lib/libcuda.so

GPU

$ docker run -it --gpus all --privileged -w /working_dir -v ${PWD}:/working_dir --rm blocksparse:local
  3. Compile (inside the docker container)
# make compile
  4. Install the compiled version (inside the docker container)
# pip3 install dist/*.whl
  5. Test the compiled version (inside the docker container)
# python test/blocksparse_matmul_test.py
# python test/blocksparse_conv_test.py
