
"RuntimeError: Invalid device argument : did you call init? "When setting CUDA_VISIBLE_DEVICES #2199

Open
zhanghanxing2022 opened this issue Dec 18, 2024 · 2 comments
Labels: bug (Something isn't working), waiting for reporter

Comments

@zhanghanxing2022

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Generally, there is no need to actively call torch.cuda.init().
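
For context, a minimal sketch of my own (assuming a working CUDA install) of why explicit initialization should normally be unnecessary: torch.cuda initializes lazily on the first CUDA call.

```python
# Minimal sketch, assuming a working CUDA setup: torch.cuda initializes
# lazily on the first CUDA call, so calling torch.cuda.init() explicitly
# is normally not required.
import torch

if torch.cuda.is_available():
    # Each of these calls triggers lazy CUDA initialization internally.
    print("device count:", torch.cuda.device_count())
    print("current device:", torch.cuda.current_device())
    print("device name:", torch.cuda.get_device_name(0))
```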

Current behaviour

When CUDA_VISIBLE_DEVICES is set, a RuntimeError occurs: `Invalid device argument : did you call init?`
[screenshot: error traceback]
When I manually initialize CUDA before `component_trace = _Fire(component, args, parsed_flag_args, context, name)`, the problem still persists.
[screenshot: manual initialization attempt]
When CUDA_VISIBLE_DEVICES is not specified, this issue does not occur.
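
To help narrow this down, a standalone diagnostic along these lines (hypothetical script name check_cuda.py, run with the same CUDA_VISIBLE_DEVICES value but outside accelerate) shows what PyTorch itself sees once the devices are masked:

```python
# check_cuda.py -- hypothetical diagnostic, run as:
#   CUDA_VISIBLE_DEVICES=2,3 python check_cuda.py
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.cuda.is_available() =", torch.cuda.is_available())
print("torch.cuda.device_count() =", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # After masking, the visible GPUs are renumbered 0..N-1.
    print(f"device {i}:", torch.cuda.get_device_name(i))
```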

Steps to reproduce

CUDA_VISIBLE_DEVICES=2,3 accelerate launch -m axolotl.cli.train Mistral-7B-Instruct-v0.3.yml

Config yaml

base_model: ./mistralai
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

strict: false

chat_template: llama3

datasets:
  - path: /tmp/5014fa1bc5b682bb_train_data.json
    type: 
      # Adjust these field names as needed to match the actual JSON fields
      field_instruction: question_statement  # explicitly specify the instruction field
      field_output: text            # explicitly specify the output field

      # Define the format
      format: |-
        User: {question_statement} 
        Assistant: {text}
dataset_prepared_path:
val_set_size: 0.05
output_dir: id_24

sequence_len: 4056
sample_packing: false
pad_to_sequence_len: true
# trust_remote_code: true
# [2024-12-15 16:08:07,077] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:403] [PID:1119962] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
adapter: lora
# # If you already have a lora model trained that you want to load, put that here.
# lora_model_dir: ./mistralai-lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out: false

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: false
# Stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience: 
# Resume from a specific checkpoint dir
# resume_from_checkpoint: ./mistralai-resume_from_checkpoint
# Don't mess with this, it's here for accelerate and torchrun
local_rank:
logging_steps: 1
# xformers_attention:
flash_attention: true
# s2_attention:

wandb_project: Gradients-On-Demand
wandb_entity:
wandb_mode: offline
wandb_run: your_name
wandb_runid: default

# hub_model_id:
# hub_repo:
# hub_strategy: checkpoint
# hub_token: false


saves_per_epoch: 4
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
max_steps: 10
debug:
deepspeed: /data/deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

Python 3.11.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
zhanghanxing2022 added the bug (Something isn't working) label on Dec 18, 2024
@NanoCode012 (Collaborator)

Which GPUs are you using? I used CUDA_VISIBLE_DEVICES just yesterday, and it did not seem to have this issue.

@zhanghanxing2022 (Author)

H200
