
torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file #196

Open
yandachen opened this issue Mar 7, 2023 · 6 comments

@yandachen

Hello, I installed your package using setup/setup.sh. The single-GPU command in the tutorial works fine, but when I run the multi-GPU command

```shell
deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py \
    --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 \
    --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 \
    --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json \
    --run_id tutorial-gpt2-micro-multi-node
```

I received an error message saying that

```
File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
  module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1166, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: ~/.cache/torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory.
```
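This error usually means a previous JIT build of the fused_adam extension was interrupted, leaving a cache directory without a usable `.so`. A common workaround (a sketch, not an official fix) is to delete the stale build directory so PyTorch rebuilds the extension on the next launch; the path below is taken from the traceback and should be adjusted to your Python/CUDA versions:

```shell
# Remove the stale fused_adam build so PyTorch's JIT extension loader
# recompiles it on the next deepspeed launch.
# py38_cu113 reflects Python 3.8 + CUDA 11.3; adjust to match your setup.
rm -rf ~/.cache/torch_extensions/py38_cu113/fused_adam
```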

I also tried running the same code in the same environment but on a different machine, and this time got the error message

```
File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1494, in verify_ninja_availability
  raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
```
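This second error is more self-explanatory: PyTorch needs the `ninja` build tool on PATH before it can JIT-compile C++ extensions such as fused_adam. A minimal sketch of the check torch performs (the helper name `ninja_available` is my own, not torch's API):

```python
import shutil

def ninja_available() -> bool:
    # torch.utils.cpp_extension.verify_ninja_availability raises when the
    # `ninja` binary cannot be found on PATH; this mirrors that lookup.
    return shutil.which("ninja") is not None

if not ninja_available():
    print("ninja not found on PATH -- `pip install ninja` should fix this")
```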

Do you have any idea how to resolve this issue? I installed all packages using setup/setup.sh, so my package versions should match what you included in the requirements files. Thanks!

@J38
Contributor

J38 commented Mar 7, 2023

This worked for me today:

```shell
# create new conda environment
conda create -n mistral-march-2023 python=3.8.12 pytorch=1.11.0 torchdata cudatoolkit=11.3 -c pytorch
conda activate mistral-march-2023
pip install -r setup/pip-requirements.txt

# install flash attention
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install

# clone mistral
cd ..
git clone https://github.com/stanford-crfm/mistral.git
cd mistral
git checkout mistral-flash-dec-2022

# install modified transformers
cd ..
git clone https://github.com/huggingface/transformers.git
cd transformers
# copy the modified modeling_gpt2.py into the transformers repo before installing
cp ../mistral/transformers/models/gpt2/modeling_gpt2.py src/transformers/models/gpt2/modeling_gpt2.py
pip install -e .

# run demo
# note: some of the default configurations are probably broken and will need
# modifying for your experiment, but that is easy to do
cd ..
cd mistral
deepspeed --hostfile hostfile --num_gpus 8 --num_nodes 1 --master_addr sphinx6 train.py \
    --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 8 \
    --training_arguments.per_device_train_batch_size 4 \
    --training_arguments.deepspeed conf/deepspeed/z2-small-bf16-conf.json \
    --run_id mistral-w-flash-demo
```
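The demo command reads worker addresses from a `hostfile`. DeepSpeed hostfiles list one hostname per line with a `slots=N` GPU count; for the single-node sphinx6 run above it might look like this (hostname and slot count are illustrative, matching the command's flags):

```
sphinx6 slots=8
```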

@yandachen
Author

Thanks for your prompt response. I ran the above code but received this error message:

```
CerberusError: config could not be validated against schema. The errors are,
{'training_arguments': [{'gradient_checkpointing': ['unknown field']}]}
```

Can you let me know how to fix this?

By the way, what command are you using to install the Python packages needed for mistral? Are you using `pip install -r setup/pip-requirements.txt`? Just want to confirm so that I'm using the same versions.

Thanks.

@J38
Contributor

J38 commented Mar 12, 2023

I updated the branch to fix those configuration issues.

And yes, I ran that pip install as well and forgot to include it in the install steps above.

@J38
Contributor

J38 commented Mar 12, 2023

I'm probably going to update the main branch to match this and move the current main changes into a separate branch.

@J38
Contributor

J38 commented Mar 12, 2023

So main should be sort of like mistral-flash-dec-2022 ...

@yandachen
Author

Hello, thanks so much for working on this! The code you provided above works!
