
Issues in building from source #20

Open
divyam3897 opened this issue Jan 17, 2019 · 10 comments

@divyam3897

Configuration:
Operating System: Linux Ubuntu 16.04
Python version: 3.5.2
Tensorflow version: 1.12.0
Cuda version: 9.0
GPU: TITAN X (Pascal)

Command to Reproduce
make compile

Problem:
The build command fails with the following errors:

ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 252; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 713; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 1186; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 1661; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas fatal   : Ptx assembly aborted due to errors
Makefile:106: recipe for target 'build/blocksparse_hgemm_cn_op_gpu.cu.o' failed
make: *** [build/blocksparse_hgemm_cn_op_gpu.cu.o] Error 255

pip install blocksparse fails too, resulting in #7.

@ThomasHagebols

Hi Divyam,

First, let me ask some details: are you working in a virtualenv or a conda env? If so, activate that environment before compiling, then run a make clean followed by a make compile.

I can confirm that in many cases the pip install doesn't work; it seems to be highly dependent on your specific setup, so compiling yourself is necessary in most cases.

Now some general advice: the README is currently outdated. There are some extra requirements you need to install that are not mentioned there.

Python requirements:

  • networkx

System requirements:

  • mpich (apt-get install mpich)
  • Nvidia cuDNN
  • Nvidia NCCL
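The requirements above can be sanity-checked before compiling. This is only a sketch for the Ubuntu-style setup described in this thread; the package and library names are the common ones and may differ on your system:

```shell
# Pre-flight check (a sketch; assumes Ubuntu-style tooling as in this thread):
# verify the extra requirements are visible before running make compile.
python -c "import networkx" 2>/dev/null && echo "networkx ok" \
  || echo "networkx MISSING (pip install networkx)"
command -v mpicc >/dev/null && echo "mpich ok" \
  || echo "mpich MISSING (apt-get install mpich)"
ldconfig -p 2>/dev/null | grep -q libcudnn && echo "cudnn ok" || echo "cudnn not found"
ldconfig -p 2>/dev/null | grep -q libnccl  && echo "nccl ok"  || echo "nccl not found"
```

Run it inside the same virtualenv or conda env you plan to compile in, since that determines which python and which libraries are found.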

Your issue seems to be related to the fact that you are using compute_70 with CUDA 9. For some reason this didn't work in my case either. If you comment out the following lines in the Makefile, your issue might be solved:

 	-gencode=arch=compute_70,code=sm_70 \
 	-gencode=arch=compute_70,code=compute_70
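If you'd rather script that edit, here is a minimal sketch that removes the compute_70 gencode lines instead of commenting them (assumptions: GNU sed, the Makefile in the current directory). Check the result afterwards, since line continuations in the recipe may need adjusting:

```shell
# Sketch: drop every compute_70 gencode line so ptxas under CUDA 9.0 is never
# asked to assemble sm_70 code. A .bak backup of the Makefile is kept.
sed -i.bak '/-gencode=arch=compute_70/d' Makefile
grep -n 'compute_70' Makefile || echo "compute_70 lines removed"
```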

Good luck

@divyam3897
Author

divyam3897 commented Jan 18, 2019

Hi @ThomasHagebols

Thank you for your valuable comment; I agree that the README is outdated. To answer your question: I am working in a virtualenv, I already had the system requirements fulfilled, and commenting out those lines did let the build succeed.

However, if I run test/blocksparse_matmul_test.py after the build, I am back to #7, which the discussion in #7 suggests was fixed in the source, but it looks like it still exists?

@ThomasHagebols

I'll make a pull request for the README and update the requirements in the setup file.

I have the same issue with failing tests. Unfortunately I don't have the expertise to fix those issues.

@scott-gray
Contributor

The m8n32k16 error is just a matter of not having CUDA >= 9.2. All the breaking PTX changes Nvidia has been making lately are kind of annoying; they could easily have been avoided with a small amount of foresight.
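That version requirement can be checked before compiling. A sketch, assuming nvcc is on PATH and a sort that supports -V (GNU coreutils); the cuda_ok helper is hypothetical, not part of the repo:

```shell
# Sketch (assumptions: nvcc on PATH, GNU sort -V): the m8n32k16 wmma shapes
# only assemble with ptxas from CUDA 9.2 or newer.
cuda_ok() {   # succeeds if version "$1" is >= 9.2
  [ "$(printf '%s\n9.2\n' "$1" | sort -V | head -n1)" = "9.2" ]
}
ver=$(nvcc --version 2>/dev/null | sed -n 's/.*release \([0-9.]*\).*/\1/p')
if cuda_ok "${ver:-0}"; then
  echo "CUDA ${ver}: new enough for the sm_70 kernels"
else
  echo "CUDA ${ver:-not found}: too old, comment out the compute_70 gencode lines"
fi
```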

Anyway, we have a paper going out soon covering the blocksparse transformer ops. I plan to clean things up and fully document everything before then, and I'll have some new conv kernels as well. We're pushing hard now on learned sparsity in a variety of architectures, so this code is changing quickly internally. Though I should warn you that a lot of the new development mostly targets tensorcore-capable hardware.

@divyam3897
Author

Hi @scott-gray

Thank you for your comment; looking forward to the changes!
The build succeeds with the changes in the Makefile, but even then the import fails with tensorflow.python.framework.errors_impl.NotFoundError: ...../blocksparse_ops.so: undefined symbol: _ZN3MPI8Datatype4FreeEv, as also raised in #7. It seems you committed a fix for this before, but it still persists, which is why the pip install fails to work as well.

@scott-gray
Contributor

No idea what that error could be. Something must be off with your build env. I put some comments at the bottom of the Makefile showing the env I use:

https://github.com/openai/blocksparse/blob/master/Makefile

Note that you no longer need to patch TensorFlow to support batched matmul in fp16, but you will still likely need to build it from source to get CUDA >= 9.2 support.

@ruiwang2uber

Same configuration as above; I have exactly the same problem.

When I comment out

 	-gencode=arch=compute_70,code=sm_70 \
 	-gencode=arch=compute_70,code=compute_70

make compile finishes, but when I try test/blocksparse_matmul_test.py it fails with:

(spinningup) ruiwang@ubuntu-ruiwang:~/blocksparse$ python test/blocksparse_matmul_test.py
Traceback (most recent call last):
  File "test/blocksparse_matmul_test.py", line 12, in <module>
    from blocksparse.matmul import BlocksparseMatMul, SparseProj, group_param_grads
  File "/home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/matmul.py", line 13, in <module>
    import blocksparse.ewops as ew
  File "/home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/ewops.py", line 17, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: _ZN10tensorflow15OpKernelContext10input_listENS_11StringPieceEPNS_11OpInputListE
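One way to see what that undefined symbol actually is: demangling it shows a TensorFlow C++ method, which suggests the .so was compiled against a different TensorFlow version (and ABI) than the wheel installed in the env. A sketch using c++filt, which ships with binutils:

```shell
# Demangle the missing symbol. It resolves to a TensorFlow C++ method,
# i.e. the compiled blocksparse_ops.so expects a TensorFlow ABI that the
# installed TensorFlow shared library does not export.
c++filt _ZN10tensorflow15OpKernelContext10input_listENS_11StringPieceEPNS_11OpInputListE
# -> tensorflow::OpKernelContext::input_list(tensorflow::StringPiece, tensorflow::OpInputList*)
```

If the demangled name points into TensorFlow like this, rebuilding blocksparse against the exact TensorFlow version installed in the active env is the usual remedy.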

@ruiwang2uber

@scott-gray @divyam3897 did you ever get a chance to resolve this? Thanks!

@rohitg1594

Same issue as @ruiwang2uber and @divyam3897, any updates? Would appreciate it. Thanks!

@jlozano

jlozano commented Dec 30, 2020

In case this helps anyone, I created the following Dockerfile and instructions that worked for me:

Dockerfile (place this in root of the blocksparse repo):

FROM tensorflow/tensorflow:1.15.2-gpu-py3
RUN pip install --upgrade pip
RUN pip3 install tensorflow-gpu==1.13.1

# Need this to run the tests
RUN pip3 install networkx==2.5

ENV NCCL_VERSION=2.4.8-1+cuda10.0
RUN apt-get update && apt-get install -y --no-install-recommends \
  mpich \
  libmpich-dev \
  libnccl2=${NCCL_VERSION} \
  libnccl-dev=${NCCL_VERSION} \
  curl

# Make sure the linker knows where to look for things
ENV LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}"

Instructions (you might need to run these commands with sudo):
NOTE:

  • commands prefixed by $ should be run in a shell on the host machine
  • commands prefixed by # should be run in an interactive shell in the docker container
  1. Build the image
$ docker image build -f Dockerfile --rm -t blocksparse:local .
  2. Start a docker container with an interactive terminal; choose the relevant CPU or GPU option below

CPU

  • the tests below will fail if you try to run them without GPU support
  • the ln command should be run inside the docker container
$ docker run -it --privileged -w /working_dir -v ${PWD}:/working_dir --rm blocksparse:local
# ln -s /usr/local/cuda/compat/libcuda.so /usr/lib/libcuda.so

GPU

$ docker run -it --gpus all --privileged -w /working_dir -v ${PWD}:/working_dir --rm blocksparse:local
  3. Compile (inside the docker container)
# make compile
  4. Install the compiled version (inside the docker container)
# pip3 install dist/*.whl
  5. Test the compiled version (inside the docker container)
# python test/blocksparse_matmul_test.py
# python test/blocksparse_conv_test.py
