
torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file #196

Open
yandachen opened this issue Mar 7, 2023 · 6 comments

@yandachen

Hello, I installed your package using setup/setup.sh. The single-GPU command in the tutorial works fine, but when I run the multi-GPU command

```shell
deepspeed --num_gpus 8 --num_nodes 2 --master_addr machine1 train.py \
    --config conf/tutorial-gpt2-micro.yaml --nnodes 2 --nproc_per_node 8 \
    --training_arguments.fp16 true --training_arguments.per_device_train_batch_size 4 \
    --training_arguments.deepspeed conf/deepspeed/z2-small-conf.json \
    --run_id tutorial-gpt2-micro-multi-node
```

I received an error message saying that

```
File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1775, in _import_module_from_library
  module = importlib.util.module_from_spec(spec)
File "", line 556, in module_from_spec
File "", line 1166, in create_module
File "", line 219, in _call_with_frames_removed
ImportError: ~/.cache/torch_extensions/py38_cu113/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory.
```
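This error usually means a previous JIT build of the fused_adam extension was interrupted, leaving a cache directory without a usable `.so`. A common workaround (a sketch, not an official fix) is to delete the stale build directory so PyTorch rebuilds the extension on the next launch; the path below is taken from the traceback and should be adjusted to your Python/CUDA versions:

```shell
# Remove the stale fused_adam build so PyTorch's JIT extension loader
# recompiles it on the next deepspeed launch.
# py38_cu113 reflects Python 3.8 + CUDA 11.3; adjust to match your setup.
rm -rf ~/.cache/torch_extensions/py38_cu113/fused_adam
```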

I also tried running the same code in the same environment but on a different machine, and this time got the error message

```
File "miniconda3/envs/mistral/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1494, in verify_ninja_availability
  raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
```
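This second error is more self-explanatory: PyTorch needs the `ninja` build tool on PATH before it can JIT-compile C++ extensions such as fused_adam. A minimal sketch of the check torch performs (the helper name `ninja_available` is my own, not torch's API):

```python
import shutil

def ninja_available() -> bool:
    # torch.utils.cpp_extension.verify_ninja_availability raises when the
    # `ninja` binary cannot be found on PATH; this mirrors that lookup.
    return shutil.which("ninja") is not None

if not ninja_available():
    print("ninja not found on PATH -- `pip install ninja` should fix this")
```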

Do you have any idea how to resolve this issue? I installed all packages using setup/setup.sh, so my package versions should match what you included in the requirements files. Thanks!

@J38
Contributor

J38 commented Mar 7, 2023

This worked for me today:

```shell
# create new conda environment
conda create -n mistral-march-2023 python=3.8.12 pytorch=1.11.0 torchdata cudatoolkit=11.3 -c pytorch
conda activate mistral-march-2023
pip install -r setup/pip-requirements.txt

# install flash attention
git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install

# clone mistral
cd ..
git clone https://github.com/stanford-crfm/mistral.git
cd mistral
git checkout mistral-flash-dec-2022

# install modified transformers
cd ..
git clone https://github.com/huggingface/transformers.git
cd transformers
# copy the modified modeling_gpt2.py into the transformers repo before installing
cp ../mistral/transformers/models/gpt2/modeling_gpt2.py src/transformers/models/gpt2/modeling_gpt2.py
pip install -e .

# run demo
# note: some of the default configurations are probably broken and will need
# modifying for your experiment, but that is easy to do
cd ..
cd mistral
deepspeed --hostfile hostfile --num_gpus 8 --num_nodes 1 --master_addr sphinx6 train.py \
    --config conf/mistral-micro.yaml --nnodes 1 --nproc_per_node 8 \
    --training_arguments.per_device_train_batch_size 4 \
    --training_arguments.deepspeed conf/deepspeed/z2-small-bf16-conf.json \
    --run_id mistral-w-flash-demo
```
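The demo command reads worker addresses from a `hostfile`. DeepSpeed hostfiles list one hostname per line with a `slots=N` GPU count; for the single-node sphinx6 run above it might look like this (hostname and slot count are illustrative, matching the command's flags):

```
sphinx6 slots=8
```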

@yandachen
Author

Thanks for your prompt response. I ran the above code but received this error message:

```
CerberusError: config could not be validated against schema. The errors are,
{'training_arguments': [{'gradient_checkpointing': ['unknown field']}]}
```

Can you let me know how to fix this?

By the way, what command are you using to install the Python packages needed for mistral? Are you using `pip install -r setup/pip-requirements.txt`? Just want to confirm so that I'm using the same versions.

Thanks.

@J38
Contributor

J38 commented Mar 12, 2023

I updated the branch to fix those configuration issues.

And yes, I ran that pip install as well and forgot to include it in the install steps above.

@J38
Contributor

J38 commented Mar 12, 2023

I'm probably going to update the main branch to match this and move the current main changes into a separate branch.

@J38
Contributor

J38 commented Mar 12, 2023

So main should be sort of like mistral-flash-dec-2022 ...

@yandachen
Author

Hello, thanks so much for working on this! The code you provided above works!
