This guide lists the arguments you can adjust in the training scripts.
Since VL-RLHF is built on Transformers, the standard Transformers training arguments are also available to every VL-RLHF Trainer; this guide only lists the arguments added by VL-RLHF.
These are arguments shared by all types of VL-RLHF Trainers (an example invocation is shown after the list):
`--model_name_or_path`
: Path of the pretrained model weights.

`--max_length`
: Max length of each sample.

`--max_prompt_length`
: Max length of the prompt.

`--max_target_length`
: Max length of the target text.

`--dataset_name`
: Name of the dataset. Can be `vlfeedback_paired` for the VLFeedback dataset, `rlhfv` for the RLHF-V dataset, `vlquery_json` for customized multimodal conversation data stored in JSON format, or `plain_dpo` for customized multimodal comparison data stored in JSON format.

`--data_path`
: Path to the JSON file. Only needed for customized datasets. If you use VLFeedback or RLHF-V, the dataset is automatically downloaded from Hugging Face and loaded via the `datasets` package.

`--image_root`
: Root directory of the images. Only needed for customized datasets. It is joined with the image path of each sample in the JSON file.

`--data_ratio`
: Ratio between the number of training samples and evaluation samples.

`--dataset_num_proc`
: Number of processes used for data preprocessing.

`--freeze_vision_tower`
: Whether to freeze the vision encoder of the model. Defaults to `True`.

`--lora_r`
: LoRA rank.

`--lora_alpha`
: LoRA alpha.

`--lora_dropout`
: LoRA dropout.

`--lora_target_modules`
: LoRA target modules, separated by `,`, e.g. `"c_attn,attn.c_proj,w1,w2"`. You can set it to `auto` to use the default LoRA target modules.

`--lora_bias`
: LoRA bias.

`--use_lora`
: Whether to use LoRA. Defaults to `False`.

`--q_lora`
: Whether to use QLoRA. Defaults to `False`.

`--bits`
: Number of bits for QLoRA quantization.

`--modules_to_save`
: Additional modules that should be saved in the checkpoint, separated by `,`.

`--use_flash_attention_2`
: Whether to use FlashAttention-2 for efficient training.

`--project_name`
: Name of the project. Used by wandb.

`--group_name`
: Group name of this experiment. Used by wandb.
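For orientation, the sketch below shows how these shared arguments can be combined on the command line. The entry-point script `scripts/train.py` and all paths and values are hypothetical placeholders; only the argument names come from the list above, and standard Transformers training arguments such as `--output_dir` can be passed alongside them.

```bash
# A minimal sketch of a training launch, assuming a hypothetical entry point
# scripts/train.py and placeholder paths/values; only the argument names are
# taken from the list above.
python scripts/train.py \
    --model_name_or_path /path/to/pretrained-vl-model \
    --dataset_name vlquery_json \
    --data_path data/conversations.json \
    --image_root data/images \
    --max_length 2048 \
    --max_prompt_length 1024 \
    --dataset_num_proc 8 \
    --freeze_vision_tower True \
    --use_lora True \
    --lora_r 64 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --lora_target_modules auto \
    --use_flash_attention_2 True \
    --project_name my-vl-rlhf \
    --group_name lora-baseline \
    --output_dir outputs/lora-baseline
```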
These arguments are specific to the DPO Trainer (see the example command after the notes below):

`--beta`
: The beta in the DPO loss.

`--score_margin`
: Currently only used for the VLFeedback dataset. A pair of responses is selected as a training sample only when the difference between their scores is larger than `score_margin`. Defaults to `-1`, which uses all pairs.

`--loss_type`
: Same as the `loss_type` argument of the TRL `DPOTrainer`. Can be one of `["sigmoid", "hinge", "ipo", "kto_pair", "ddpo"]`.
DDPO is a variant of DPO, where the `loss_type` is set to `ddpo`.

KTO (paired) is a variant of DPO, where the `loss_type` is set to `kto_pair`.
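Putting the DPO-specific arguments together with the shared ones, a launch command might look like the following sketch. As before, the script path and all values are hypothetical placeholders rather than the actual VL-RLHF entry point.

```bash
# A minimal DPO sketch using the same hypothetical entry point as above.
python scripts/train.py \
    --model_name_or_path /path/to/pretrained-vl-model \
    --dataset_name vlfeedback_paired \
    --score_margin 2 \
    --beta 0.1 \
    --loss_type sigmoid \
    --use_lora True \
    --lora_target_modules auto \
    --output_dir outputs/dpo-vlfeedback

# Switching to the DDPO or KTO (paired) variants only changes the loss type:
#   --loss_type ddpo
#   --loss_type kto_pair
```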
There are currently no additional arguments for SFT.