Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
Generally, there should be no need to call torch.cuda.init() explicitly; PyTorch initializes CUDA lazily on first use.
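For context, a minimal sketch of the lazy-initialization behavior I would normally expect (assumes a machine with at least one visible CUDA device):

import torch

# CUDA is not initialized at import time
assert not torch.cuda.is_initialized()
# The first CUDA operation triggers initialization automatically
x = torch.zeros(1, device="cuda")
assert torch.cuda.is_initialized()  # no explicit torch.cuda.init() was needed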
Current behaviour
When CUDA_VISIBLE_DEVICES is set, a RuntimeError occurs: `Invalid device argument : did you call init?`
Manually calling torch.cuda.init() before `component_trace = _Fire(component, args, parsed_flag_args, context, name)` does not help; the problem still persists.
When CUDA_VISIBLE_DEVICES is not set, the issue does not occur.
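For reference, this may be related to how CUDA_VISIBLE_DEVICES renumbers devices: PyTorch only sees the listed GPUs and renumbers them from zero, so any code still addressing the physical indices gets an invalid device argument. A minimal standalone sketch of that behavior (not axolotl code; assumes a machine with at least four GPUs):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"  # must be set before CUDA is initialized

import torch

print(torch.cuda.device_count())   # 2: physical GPUs 2 and 3, renumbered as 0 and 1
torch.zeros(1, device="cuda:0")    # OK, runs on physical GPU 2
# torch.zeros(1, device="cuda:2")  # RuntimeError: invalid device ordinal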
Steps to reproduce
CUDA_VISIBLE_DEVICES=2,3 accelerate launch -m axolotl.cli.train Mistral-7B-Instruct-v0.3.yml
Config yaml
base_model: ./mistralai
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
strict: false
chat_template: llama3
datasets:
  - path: /tmp/5014fa1bc5b682bb_train_data.json
    type:
      # Adjust these field names as needed to match the actual JSON fields
      field_instruction: question_statement  # explicitly set the instruction field
      field_output: text  # explicitly set the output field
      # Define the prompt format
      format: |-
        User: {question_statement}
        Assistant: {text}
dataset_prepared_path:
val_set_size: 0.05
output_dir: id_24
sequence_len: 4056
sample_packing: false
pad_to_sequence_len: true
# trust_remote_code: true
# [2024-12-15 16:08:07,077] [WARNING] [axolotl.utils.config.models.input.hint_trust_remote_code:403] [PID:1119962] [RANK:1] `trust_remote_code` is set to true. Please make sure that you reviewed the remote code/model.
adapter: lora
# If you already have a lora model trained that you want to load, put that here.
# lora_model_dir: ./mistralai-lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out: false
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false
gradient_checkpointing: false
# Stop training after this many evaluation losses have increased in a row
# https://huggingface.co/transformers/v4.2.2/_modules/transformers/trainer_callback.html#EarlyStoppingCallback
early_stopping_patience:
# Resume from a specific checkpoint dir
# resume_from_checkpoint: ./mistralai-resume_from_checkpoint
# Don't mess with this, it's here for accelerate and torchrun
local_rank:
logging_steps: 1
# xformers_attention:
flash_attention: true
# s2_attention:
wandb_project: Gradients-On-Demand
wandb_entity:
wandb_mode: offline
wandb_run: your_name
wandb_runid: default
# hub_model_id:
# hub_repo:
# hub_strategy: checkpoint
# hub_token: false
saves_per_epoch: 4
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
max_steps: 10
debug:
deepspeed: /data/deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
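For clarity, with the format block above each JSON record should render roughly as the following prompt (assuming question_statement and text hold plain strings):

User: <value of question_statement>
Assistant: <value of text>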
Possible solution
No response
Which Operating Systems are you using?
Linux
macOS
Windows
Python Version
Python 3.11.11
axolotl branch-commit
main
Acknowledgements
My issue title is concise, descriptive, and in title casing.
I have searched the existing issues to make sure this bug has not been reported yet.
I am using the latest version of axolotl.
I have provided enough information for the maintainers to reproduce and diagnose the issue.