
Dependencies conflict in running Llama-2-13b autoregressive sampling on Inf2 #47

Open
mahendra-paranjpe opened this issue Sep 26, 2023 · 4 comments


mahendra-paranjpe commented Sep 26, 2023

Running the notebook https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/transformers-neuronx/inference/meta-llama-2-13b-sampling.ipynb on an inf2.48xlarge instance.

Running the last cell fails at line 4:
from transformers_neuronx.llama.model import LlamaForSampling

which results in:

>>> from transformers_neuronx.llama.model import LlamaForSampling
2023-Sep-27 06:59:32.0474 22340:22340 ERROR  TDRV:tdrv_get_dev_info                       No neuron device available
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/llama/model.py", line 17, in <module>
    from transformers_neuronx import decoder
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/decoder.py", line 18, in <module>
    from transformers_neuronx import compiler
  File "/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/transformers_neuronx/compiler.py", line 33, in <module>
    from libneuronxla import neuron_xla_compile
ImportError: cannot import name 'neuron_xla_compile' from 'libneuronxla' (/root/aws_neuron_venv_pytorch/lib64/python3.7/site-packages/libneuronxla/__init__.py)
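The traceback shows two separate symptoms: the TDRV "No neuron device available" message and the ImportError. The ImportError on its own means the installed libneuronxla does not define `neuron_xla_compile`, which `transformers_neuronx/compiler.py` imports at line 33. After changing package versions, a standard-library check like the one below can confirm whether that symbol is present without going through the whole `transformers_neuronx` import chain. This is a minimal diagnostic sketch; the helper `module_exports` is hypothetical and not part of any Neuron package.

```python
import importlib


def module_exports(module_name, attr_name):
    """Report whether module_name imports and whether it defines attr_name.

    Returns an (importable, has_attr) tuple instead of raising, so several
    module/attribute pairs can be checked in a row.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Also covers ModuleNotFoundError, which subclasses ImportError.
        return False, False
    return True, hasattr(module, attr_name)


if __name__ == "__main__":
    # The module/attribute pair from the traceback above.
    importable, has_attr = module_exports("libneuronxla", "neuron_xla_compile")
    print(f"libneuronxla importable: {importable}; "
          f"defines neuron_xla_compile: {has_attr}")
```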

huggingface/optimum-neuron#213 suggests updating torch-neuronx to the latest version, while aws-neuron/transformers-neuronx#33 suggests using torch-neuronx 1.13.1.1.10.0 specifically.

Installing that specific version failed with the following error:

python -m pip install torch-neuronx==1.13.1.1.10.0 -U
Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com
Collecting torch-neuronx==1.13.1.1.10.0
  Using cached https://pip.repos.neuron.amazonaws.com/torch-neuronx/torch_neuronx-1.13.1.1.10.0-py3-none-any.whl (2.4 MB)
Requirement already satisfied: torch==1.13.* in ./aws_neuron_venv_pytorch/lib/python3.7/site-packages (from torch-neuronx==1.13.1.1.10.0) (1.13.1)
INFO: pip is looking at multiple versions of torch-neuronx to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement torch-xla==1.13.1+torchneurona (from torch-neuronx) (from versions: 1.0, 1.11.0+torchneuron2, 1.11.0+torchneuron3, 1.12.0+torchneuron3, 1.13.0+torchneuron3, 1.13.0+torchneuron4, 1.13.0+torchneuron5, 1.13.1+torchneuron6, 1.13.1+torchneuron7, 1.13.1+torchneuron8)
ERROR: No matching distribution found for torch-xla==1.13.1+torchneurona
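Per the output above, torch-neuronx 1.13.1.1.10.0 requires `torch-xla==1.13.1+torchneurona`, and that version is not among the ones pip found, the newest of which is `1.13.1+torchneuron8`. The sketch below makes that comparison explicit by parsing pip's `(from versions: ...)` clause. It is a minimal illustration, assuming pip keeps this message format; `available_versions` is a hypothetical helper.

```python
import re


def available_versions(pip_error_line):
    """Extract the version list from pip's '(from versions: ...)' clause.

    Returns an empty list when the line has no such clause.
    """
    match = re.search(r"\(from versions: ([^)]*)\)", pip_error_line)
    if match is None:
        return []
    return [v.strip() for v in match.group(1).split(",") if v.strip()]


# The error line from the failed install above, split for readability.
ERROR_LINE = (
    "ERROR: Could not find a version that satisfies the requirement "
    "torch-xla==1.13.1+torchneurona (from torch-neuronx) "
    "(from versions: 1.0, 1.11.0+torchneuron2, 1.11.0+torchneuron3, "
    "1.12.0+torchneuron3, 1.13.0+torchneuron3, 1.13.0+torchneuron4, "
    "1.13.0+torchneuron5, 1.13.1+torchneuron6, 1.13.1+torchneuron7, "
    "1.13.1+torchneuron8)"
)

if __name__ == "__main__":
    versions = available_versions(ERROR_LINE)
    pinned = "1.13.1+torchneurona"  # what torch-neuronx 1.13.1.1.10.0 requires
    print(pinned in versions)  # False: no listed version matches the pin
    print(versions[-1])        # 1.13.1+torchneuron8, the newest one listed
```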

For reference, these are the versions currently available:

pip index versions torch-neuronx
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-neuronx (1.13.1.1.11.0)
Available versions: 1.13.1.1.11.0, 1.13.1.1.10.1, 1.13.1.1.10.0, 1.13.1.1.9.1, 1.13.1.1.9.0, 1.13.1.1.8.0, 1.13.1.1.7.0, 1.13.0.1.6.1, 1.13.0.1.6.0, 1.13.0.1.5.0, 1.13.0.1.4.0, 1.12.0.1.4.0, 1.11.0.1.2.0, 1.11.0.1.1.1, 1.0
  INSTALLED: 1.13.1.1.9.1
  LATEST:    1.13.1.1.11.0


pip index versions torch-xla
WARNING: pip index is currently an experimental command. It may be removed/changed in a future release without prior warning.
torch-xla (1.13.1+torchneuron8)
Available versions: 1.13.1+torchneuron8, 1.13.1+torchneuron7, 1.13.1+torchneuron6, 1.13.0+torchneuron5, 1.13.0+torchneuron4, 1.13.0+torchneuron3, 1.12.0+torchneuron3, 1.11.0+torchneuron3, 1.11.0+torchneuron2, 1.0
  INSTALLED: 1.13.1+torchneuron8
  LATEST:    1.13.1+torchneuron8
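The two listings show that the installed torch-xla (1.13.1+torchneuron8) is already the latest, while the installed torch-neuronx (1.13.1.1.9.1) is behind the latest (1.13.1.1.11.0). Since `pip index` is experimental and its output may change, the following is only a sketch that pulls the INSTALLED and LATEST lines out of output in the format shown above; `installed_and_latest` is a hypothetical helper.

```python
def installed_and_latest(pip_index_output):
    """Return (installed, latest) from `pip index versions` output.

    Either value is None if the corresponding line is missing. Lines
    without a colon, or with any other key, are ignored.
    """
    installed = latest = None
    for line in pip_index_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "INSTALLED":
            installed = value.strip()
        elif key == "LATEST":
            latest = value.strip()
    return installed, latest


# Header plus the INSTALLED/LATEST lines of the torch-neuronx output above.
TORCH_NEURONX_OUTPUT = """\
torch-neuronx (1.13.1.1.11.0)
  INSTALLED: 1.13.1.1.9.1
  LATEST:    1.13.1.1.11.0
"""

if __name__ == "__main__":
    installed, latest = installed_and_latest(TORCH_NEURONX_OUTPUT)
    if installed != latest:
        print(f"torch-neuronx {installed} is behind the latest {latest}")
```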

The following packages are installed:

anyio==3.7.1
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
attrs==23.1.0
aws-neuronx-runtime-discovery==2.9
awscli==1.29.54
backcall==0.2.0
beautifulsoup4==4.12.2
bleach==6.0.0
boto3==1.28.54
botocore==1.31.54
cached-property==1.5.2
cachetools==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
cloud-tpu-client==0.10
colorama==0.4.4
comm==0.1.4
debugpy==1.7.0
decorator==5.1.1
defusedxml==0.7.1
docutils==0.16
ec2-metadata==2.10.0
entrypoints==0.4
environment-kernels==1.2.0
exceptiongroup==1.1.3
fastjsonschema==2.18.0
filelock==3.12.2
fsspec==2023.1.0
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.23.0
google-auth-httplib2==0.1.1
googleapis-common-protos==1.60.0
httplib2==0.22.0
huggingface-hub==0.16.4
idna==3.4
importlib-metadata==6.7.0
importlib-resources==5.12.0
iniconfig==2.0.0
ipykernel==6.16.2
ipython==7.34.0
ipython-genutils==0.2.0
ipywidgets==8.1.1
islpy==2022.2.1
jedi==0.19.0
Jinja2==3.1.2
jmespath==1.0.1
jsonschema==4.17.3
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-server==1.24.0
jupyter_client==7.4.9
jupyter_core==4.12.0
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.9
libneuronxla==0.5.413
lockfile==0.12.2
MarkupSafe==2.1.3
matplotlib-inline==0.1.6
mistune==3.0.1
nbclassic==1.0.0
nbclient==0.7.4
nbconvert==7.6.0
nbformat==5.8.0
nest-asyncio==1.5.8
networkx==2.6.3
neuronx-cc==2.9.0.16+fa12ba55a
neuronx-hwm==2.9.0.1+f79d59e7b
notebook==6.5.6
notebook_shim==0.2.3
numpy==1.21.6
nvidia-cublas-cu11==11.10.3.66
nvidia-cuda-nvrtc-cu11==11.7.99
nvidia-cuda-runtime-cu11==11.7.99
nvidia-cudnn-cu11==8.5.0.96
oauth2client==4.1.3
packaging==23.1
pandocfilters==1.5.0
parso==0.8.3
pexpect==4.8.0
pgzip==0.3.5
pickleshare==0.7.5
Pillow==9.5.0
pkgutil_resolve_name==1.3.10
pluggy==1.2.0
prometheus-client==0.17.1
prompt-toolkit==3.0.39
protobuf==3.20.3
psutil==5.9.5
ptyprocess==0.7.0
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
Pygments==2.16.1
pyparsing==3.1.1
pyrsistent==0.19.3
pytest==7.4.2
python-daemon==3.0.1
python-dateutil==2.8.2
PyYAML==6.0.1
pyzmq==24.0.1
qtconsole==5.4.4
QtPy==2.4.0
regex==2023.8.8
requests==2.31.0
requests-unixsocket==0.3.0
rsa==4.7.2
s3transfer==0.6.2
safetensors==0.3.3
scipy==1.7.3
Send2Trash==1.8.2
sentencepiece==0.1.99
six==1.16.0
sniffio==1.3.0
soupsieve==2.4.1
terminado==0.17.1
tinycss2==1.2.1
tokenizers==0.13.3
tomli==2.0.1
torch==1.13.1
torch-neuronx==1.13.1.1.9.1
torch-xla==1.13.1+torchneuron8
torchvision==0.14.1
tornado==6.2
tqdm==4.66.1
traitlets==5.9.0
transformers==4.30.2
transformers-neuronx==0.7.84
typing_extensions==4.7.1
uritemplate==3.0.1
urllib3==1.26.16
wcwidth==0.2.6
webencodings==0.5.1
websocket-client==1.6.1
wget==3.2
widgetsnbextension==4.0.9
zipp==3.15.0
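Most of this list is unrelated to the failure. A shorter report of just the packages that appear in the traceback and the failed install can be produced with `importlib.metadata`, as in the sketch below. The package selection is an assumption made for illustration; note that `importlib.metadata` is only in the standard library from Python 3.8, so on the Python 3.7 venv above the sketch falls back to the `importlib-metadata` backport, which appears in this list.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7: the importlib-metadata backport
    import importlib_metadata as metadata

# Assumed selection: packages named in the traceback and the pip errors.
NEURON_PACKAGES = [
    "torch", "torch-neuronx", "torch-xla", "libneuronxla",
    "neuronx-cc", "transformers-neuronx", "transformers",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(NEURON_PACKAGES).items():
        print(f"{name:24} {version or 'not installed'}")
```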

awsilya commented Oct 1, 2023

@mahendra-paranjpe the most common reason for this error:

tdrv_get_dev_info No neuron device available

is not running on the right instance type. Are you running on inf2?

mahendra-paranjpe (author) commented

Yes, it is an inf2.48xlarge.

mrnikwaws commented

Hi @mahendra-paranjpe,

This can indicate that the installation is incomplete, e.g. missing drivers. Please check https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/torch-neuronx.html. Note in particular the system packages (rpm/dpkg) needed for installation (see, e.g., the "Drivers and Tools" step in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22), and make sure you are running one of the supported OS versions.
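The setup guides are organized per OS, so the first thing to confirm is which distribution and release the instance runs. The sketch below reads `/etc/os-release` and reports `ID` and `VERSION_ID` for comparison against the supported list in the Neuron docs; it deliberately does not hard-code that list. It is a minimal, assumption-laden parser: Python 3.10+ offers `platform.freedesktop_os_release()`, but the venv above is Python 3.7, so the file is parsed by hand here.

```python
import shlex


def parse_os_release(text):
    """Parse os-release style KEY=value lines into a dict."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            # Values may be shell-quoted, e.g. VERSION_ID="22.04".
            parts = shlex.split(value)
            info[key] = parts[0] if parts else ""
        except ValueError:
            # Unbalanced quotes: fall back to stripping quote characters.
            info[key] = value.strip("\"'")
    return info


def read_os_release(path="/etc/os-release"):
    with open(path) as fh:
        return parse_os_release(fh.read())


if __name__ == "__main__":
    try:
        info = read_os_release()
        print(info.get("ID"), info.get("VERSION_ID"), info.get("PRETTY_NAME"))
    except OSError:
        print("/etc/os-release not found (not a Linux host?)")
```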

If you think everything is installed correctly, it is possible that the driver is not loaded for some reason. Try:

sudo modprobe neuron

...then retry your test. If neither of these works, please post back here.
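Before and after running `modprobe`, it helps to confirm whether the module is actually loaded. `lsmod | grep neuron` is the usual shell check; `lsmod` formats the contents of `/proc/modules`, which the sketch below reads directly. It assumes the module name is `neuron`, matching the `modprobe` command above; the helper names are hypothetical.

```python
def module_loaded(name, modules_text):
    """True if `name` is the first field of any line in /proc/modules-style text."""
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def neuron_module_loaded(path="/proc/modules"):
    """Check the live kernel module list for the `neuron` module."""
    with open(path) as fh:
        return module_loaded("neuron", fh.read())


if __name__ == "__main__":
    try:
        print("neuron module loaded:", neuron_module_loaded())
    except OSError:
        print("/proc/modules not readable (not a Linux host?)")
```

Matching on the whole first field, rather than a substring, avoids false positives from any other module whose name merely contains "neuron".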

aws-donkrets commented

Hi @mahendra-paranjpe, we haven't heard back on whether mrnikwaws's comments solved your issue, so we are closing this out for now. If you are still encountering a problem, please reopen this issue or create a new one.
