
"RuntimeError: Invalid device argument : did you call init? "When setting CUDA_VISIBLE_DEVICES #2199

Open
zhanghanxing2022 opened this issue Dec 18, 2024 · 2 comments
Labels: bug (Something isn't working), waiting for reporter

Comments

@zhanghanxing2022

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Generally, there is no need to actively call torch.cuda.init().
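
For context, a minimal sketch of my own (assuming a working CUDA install) of why explicit initialization should normally be unnecessary: torch.cuda initializes lazily on the first CUDA call.

```python
# Minimal sketch, assuming a working CUDA setup: torch.cuda initializes
# lazily on the first CUDA call, so calling torch.cuda.init() explicitly
# is normally not required.
import torch

if torch.cuda.is_available():
    # Each of these calls triggers lazy CUDA initialization internally.
    print("device count:", torch.cuda.device_count())
    print("current device:", torch.cuda.current_device())
    print("device name:", torch.cuda.get_device_name(0))
```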

Current behaviour

When CUDA_VISIBLE_DEVICES is set, a RuntimeError occurs: `Invalid device argument : did you call init?`
[screenshot: error traceback]
When I manually initialize CUDA before `component_trace = _Fire(component, args, parsed_flag_args, context, name)`, the problem still persists.
[screenshot: manual initialization attempt]
When CUDA_VISIBLE_DEVICES is not specified, this issue does not occur.
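
To help narrow this down, a standalone diagnostic along these lines (hypothetical script name check_cuda.py, run with the same CUDA_VISIBLE_DEVICES value but outside accelerate) shows what PyTorch itself sees once the devices are masked:

```python
# check_cuda.py -- hypothetical diagnostic, run as:
#   CUDA_VISIBLE_DEVICES=2,3 python check_cuda.py
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available() =", torch.cuda.is_available())
print("torch.cuda.device_count() =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # After masking, the visible GPUs are renumbered 0..N-1.
    print(f"device {i}:", torch.cuda.get_device_name(i))
```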

Steps to reproduce

CUDA_VISIBLE_DEVICES=2,3 accelerate launch -m axolotl.cli.train Mistral-7B-Instruct-v0.3.yml

Config yaml

base_model: ./mistralai
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

strict: false

chat_template: llama3

datasets:
  - path: /tmp/5014fa1bc5b682bb_train_data.json
    type: 
      # Adjust these field names as needed to match the actual JSON fields
      field_instruction: question_statement  # explicitly specify the instruction field
      field_output: text            # explicitly specify the output field

      # Define the format
      format: |-
        User: {question_statement} 
        Assistant: {text}
dataset_prepared_path:
val_set_size: 0.05
output_dir: id_24

sequence_len: 4056
sample_packing: false
pad_to_sequence_len: true
# trust_remote_code: true
# [2024-12-15 16:08:07,077] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:403] [PID:1119962] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
adapter: lora
# # If you already have a lora model trained that you want to load, put that here.
# lora_model_dir: ./mistralai-lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out: false

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: false
# Stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 
# Resume from a specific checkpoint dir
# resume_from_checkpoint: ./mistralai-resume_from_checkpoint
# Don't mess with this, it's here for accelerate and torchrun
local_rank:
logging_steps: 1
# xformers_attention:
flash_attention: true
# s2_attention:

wandb_project: Gradients-On-Demand
wandb_entity:
wandb_mode: offline
wandb_run: your_name
wandb_runid: default

# hub_model_id:
# hub_repo:
# hub_strategy: checkpoint
# hub_token: false


saves_per_epoch: 4
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
max_steps: 10
debug:
deepspeed: /data/deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

Python 3.11.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
zhanghanxing2022 added the bug (Something isn't working) label on Dec 18, 2024
@NanoCode012 (Collaborator)

Which GPUs are you using? I used CUDA_VISIBLE_DEVICES just yesterday, and it did not seem to have this issue.

@zhanghanxing2022 (Author)

H200
