spspmm lead to error: PyTorch CUDA error: an illegal memory access was encountered. #314

GooLiang · 2023-02-28T12:32:38Z

Hi, I'm having the same problem with #174.
I have two large adjacency matrices, the details are as follows
adj_l
SparseTensor(row=tensor([ 0, 0, 0, ..., 736388, 736388, 736388], device='cuda:2'),
col=tensor([ 145, 2215, 3205, ..., 21458, 22283, 31934], device='cuda:2'),
val=tensor([0.0909, 0.0909, 0.0909, ..., 0.1000, 0.1000, 0.1000], device='cuda:2'),
size=(736389, 59965), nnz=7505078, density=0.02%)
adj_r
SparseTensor(row=tensor([ 0, 0, 0, ..., 59962, 59963, 59964], device='cuda:2'),
col=tensor([222683, 370067, 430465, ..., 38176, 514545, 334613], device='cuda:2'),
val=tensor([0.1429, 0.1429, 0.1429, ..., 0.5000, 1.0000, 1.0000], device='cuda:2'),
size=(59965, 736389), nnz=7505078, density=0.02%)

Convert them to sparse format and use the following code,
rowA, colA, _ = adj_l.coo()
rowB, colB, _ = adj_r.coo()
indexA = torch. stack((rowA,colA))
indexB = torch. stack((rowB,colB))
valueA = adj_l.storage._value
valueB = adj_r.storage._value
indexC, valueC = spspmm(indexA, valueA, indexB, valueB, adj_l.size(0), adj_l.size(1), adj_r.size(1), coalesced=True)
Then an error will be reported. CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Even with CUDA_LAUNCH_BLOCKING=1. There is no more information, I believe this is caused by too much memory for the two sparse matrices. Is there any way to run it on gpu?

rusty1s · 2023-03-01T08:04:07Z

What version of torch-sparse are you working on? 0.6.16 had some fixes to the spspmm routine, see https://github.com/rusty1s/pytorch_sparse/releases.

GooLiang · 2023-03-11T09:07:30Z

Thank you for your reply.
But my cuda version is 10.1 torch version is 1.12. Can I solve this problem without upgrading?

rusty1s · 2023-03-11T10:16:42Z

You mean without upgrading CUDA? You should be able to install from wheels via pip -f ... command, while manual compilation will probably fail due to CUDA version conflict. Let me know if I am missing something.

GooLiang · 2023-03-11T10:33:27Z

I mean if I want to install torch-sparse 0.6.16, I need the following dependency ring: torch1.13->cuda11.6.
However, my cuda only works with version 10.2 due to some utility restrictions, so I don't think I can install torch-sparse 6.16. Is there a way to support torch-sparse 6.16 under cuda10.1?

rusty1s · 2023-03-11T11:53:50Z

The CUDA version needs to match with the one installed by PyTorch, not necessarily your system CUDA.

GooLiang · 2023-03-12T01:32:25Z

I tried to upgrade torch-sparse 0.6.16, however, I got a new error when running the previous code, is there any solution?

RuntimeError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
CUDA error: insufficient resources when calling cusparseSpGEMM_workEstimation( handle, opA, opB, &alpha, matA, matB, &beta, matC, computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc, &bufferSize1, dBuffer1)
File "/root/anaconda3/envs/lyx/lib/python3.7/site-packages/torch_sparse/matmul.py", line 96, in spspmm_sum
C = torch.sparse.mm(A, B)
File "/root/anaconda3/envs/lyx/lib/python3.7/site-packages/torch_sparse/matmul.py", line 120, in spspmm
return spspmm_sum(src, other)
File "/root/anaconda3/envs/lyx/lib/python3.7/site-packages/torch_sparse/matmul.py", line 143, in matmul
return spspmm(src, other, reduce)
File "/root/anaconda3/envs/lyx/lib/python3.7/site-packages/torch_sparse/matmul.py", line 151, in
self, other, reduce)
File "/home/public/lyx/Nars_ensemble/ogbn/data_ogbn.py", line 251, in hg_propagate_sparse_pyg_freebase
new_adjs[name] = adj_l.matmul(adj_r.to(prop_device)).to(store_device)
File "/home/public/lyx/Nars_ensemble/ogbn/train_ogbn.py", line 99, in main
features_list_dict, extra_features_buffer = hg_propagate_sparse_pyg_freebase(adjs, threshold_metalen, tgt_type, args.num_hops, max_length, extra_metapath, prop_device, prop_feats=True, echo=True)
File "/home/public/lyx/Nars_ensemble/ogbn/train_ogbn.py", line 261, in
main(args)
File "/root/anaconda3/envs/lyx/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/lyx/lib/python3.7/runpy.py", line 193, in _run_module_as_main (Current frame)
"main", mod_spec)
RuntimeError: CUDA error: insufficient resources when calling cusparseSpGEMM_workEstimation( handle, opA, opB, &alpha, matA, matB, &beta, matC, computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc, &bufferSize1, dBuffer1)

rusty1s · 2023-03-12T08:23:40Z

What does torch_sparse.__version__ return? The new routine in 0.6.16 doesn't make any use of cusparse anymore, so this error is a bit irritating to me.

GooLiang · 2023-03-12T08:28:11Z

torch_sparse.version return
'0.6.16+pt113cu116'

GooLiang · 2023-03-14T13:55:18Z

Hi, do you have any idea about this problem?
thanks for help

rusty1s · 2023-03-15T11:17:01Z

Mh, can you show me the content of /root/anaconda3/envs/lyx/lib/python3.7/site-packages/torch_sparse/matmul.py?

GooLiang · 2023-03-15T11:31:58Z

matmul.txt

rusty1s · 2023-03-15T12:11:59Z

Mh, looks like this is an issue with PyTorch then, not with torch_sparse. I assume that

import torch

A = torch.randn(5, 5).to_torch_coo_tensor().cuda()
torch.sparse.mm(A, A)

also fails for you?

GooLiang · 2023-03-15T12:21:41Z

Yes, running the above code will report the following error.
AttributeError: 'Tensor' object has no attribute 'to_torch_coo_tensor'
However, my torch has been upgraded to 1.13.1.
use torch.version return: '1.13.1+cu116'

rusty1s · 2023-03-15T12:24:57Z

Needs to be A = torch.randn(5, 5).to_sparse().cuda(), sorry for the confusion.

GooLiang · 2023-03-15T12:27:25Z

Running the above code is successful :(

rusty1s · 2023-03-15T12:57:34Z

Then I am at a loss :(

What happens if you run

adj_l @ adj_r

in your code above?

GooLiang · 2023-03-15T13:03:48Z

Suppose there are five different data adj_l, adj_r. After running the first four without any problem, running the fifth one will report the error mentioned at the beginning of my question.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

After I upgraded the torch and torch_sparse, I ran it again and got the second error:
RuntimeError: CUDA error: insufficient resources when calling cusparseSpGEMM_workEstimation( handle, opA, opB, &alpha, matA, matB, &beta, matC, computeType, CUSPARSE_SPGEMM_DEFAULT, spgemmDesc, &bufferSize1, dBuffer1)

rusty1s · 2023-03-15T13:50:33Z

Do you have a reproducible example? Happy to look into it.

GooLiang · 2023-03-18T15:48:49Z

I have uploaded adl_l and adj_r to google cloud. You can download these two data and run:
adj_l.to('cuda:0').matmul(adj_l.to('cuda:0'))
https://drive.google.com/drive/folders/1JPjktlEP-mdacuGiBDbdWluBTA32Oiqn?usp=share_link
Thanks for your help!

rusty1s · 2023-03-21T20:02:08Z

Thanks. Will look into it.

GooLiang · 2023-03-27T11:16:49Z

Hi, Sorry to bother you.
Is there any update on this issue?

rusty1s · 2023-03-28T13:42:52Z

I can reproduce this :( I assume that your matrices are too large for torch.sparse.mm. You can see that

adj_l = adj_l.to_torch_sparse_csr_tensor()
adj_r = adj_r.to_torch_sparse_csr_tensor()
out = adj_l @ adj_r

also fails, while something like

adj_l = adj_l[:10000]
adj_r = adj_r[:, :10000]
out = adj_l @ adj_r

works. I suggest to create a similar issue in https://github.com/pytorch/pytorch.

GooLiang · 2023-03-30T01:16:05Z

This dataset is actually data from ogbn-mag's PF and FP relationship, I noticed that your work also appears on the mag early list, maybe I'll study your previous work to see how to use torch_sparse to support the mag dataset.
Anyway, thanks for your help!

rusty1s · 2023-03-30T05:09:52Z

We never used sparse-sparse matrix multiplication in our benchmarks, so this issue never appeared to us.

github-actions · 2023-09-27T01:10:09Z

This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?

GooLiang · 2024-01-12T13:09:42Z

Hi, Sorry to bother you.
Is there any update on this issue?

rusty1s · 2024-01-12T13:40:27Z

We are using the PyTorch routine now for SpSpMM, so this is either no longer an issue or needs to be routed to the PyTorch team directly.

github-actions bot added the stale label Sep 27, 2023

rusty1s added bug Something isn't working and removed stale labels Sep 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spspmm lead to error: PyTorch CUDA error: an illegal memory access was encountered. #314

spspmm lead to error: PyTorch CUDA error: an illegal memory access was encountered. #314

GooLiang commented Feb 28, 2023

rusty1s commented Mar 1, 2023

GooLiang commented Mar 11, 2023

rusty1s commented Mar 11, 2023

GooLiang commented Mar 11, 2023

rusty1s commented Mar 11, 2023

GooLiang commented Mar 12, 2023

rusty1s commented Mar 12, 2023

GooLiang commented Mar 12, 2023

GooLiang commented Mar 14, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 18, 2023

rusty1s commented Mar 21, 2023

GooLiang commented Mar 27, 2023

rusty1s commented Mar 28, 2023 •

edited

Loading

GooLiang commented Mar 30, 2023

rusty1s commented Mar 30, 2023

github-actions bot commented Sep 27, 2023

GooLiang commented Jan 12, 2024

rusty1s commented Jan 12, 2024

spspmm lead to error: PyTorch CUDA error: an illegal memory access was encountered. #314

spspmm lead to error: PyTorch CUDA error: an illegal memory access was encountered. #314

Comments

GooLiang commented Feb 28, 2023

rusty1s commented Mar 1, 2023

GooLiang commented Mar 11, 2023

rusty1s commented Mar 11, 2023

GooLiang commented Mar 11, 2023

rusty1s commented Mar 11, 2023

GooLiang commented Mar 12, 2023

rusty1s commented Mar 12, 2023

GooLiang commented Mar 12, 2023

GooLiang commented Mar 14, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 15, 2023

rusty1s commented Mar 15, 2023

GooLiang commented Mar 18, 2023

rusty1s commented Mar 21, 2023

GooLiang commented Mar 27, 2023

rusty1s commented Mar 28, 2023 • edited Loading

GooLiang commented Mar 30, 2023

rusty1s commented Mar 30, 2023

github-actions bot commented Sep 27, 2023

GooLiang commented Jan 12, 2024

rusty1s commented Jan 12, 2024

rusty1s commented Mar 28, 2023 •

edited

Loading