First of all, I would like to thank you for adding this wonderful training method to the literature. I have some questions regarding the method:
Is it possible to use Flash Attention, xFormers, or torch.compile with this method?
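For context, this is the kind of setup I have in mind (a minimal sketch using the standard transformers flags; the model id is just a placeholder, and whether the GaLore v2 optimizer path is compatible with these options is exactly what I am asking):

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical setup: load a model with FlashAttention-2 via the standard
# transformers flag, then wrap it with torch.compile. Whether this works
# together with GaLore v2 training is the open question.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",                # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa"
)
model = torch.compile(model)
```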
Does the model's VRAM usage scale quadratically or sub-quadratically as the maximum sequence length increases? Although quantization and LoRA-style optimizations reduce VRAM usage, the inability to use Flash Attention poses a challenge for pretraining or fine-tuning on long texts. Are you planning any optimizations to address this?
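To make the concern concrete, here is the back-of-the-envelope arithmetic behind the question (my own rough illustration, not a measurement of this implementation): without FlashAttention the attention scores are materialized as a seq_len × seq_len matrix per head, so activation memory grows quadratically with context length.

```python
# Rough arithmetic: memory for one layer's attention-score matrix in
# vanilla (non-fused) attention, before any recomputation tricks.
def naive_attn_score_bytes(batch, heads, seq_len, bytes_per_el=2):
    """batch x heads x seq_len^2 elements, bytes_per_el=2 for bf16."""
    return batch * heads * seq_len**2 * bytes_per_el

# e.g. batch=1, 32 heads, 32k context, bf16:
print(naive_attn_score_bytes(1, 32, 32_768) / 2**30, "GiB")  # -> 64.0 GiB per layer
```

With FlashAttention the scores are never fully materialized, so this term drops to roughly linear in sequence length, which is why its availability matters so much for long-context training.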
Will the model weights saved with this method be in bf16 format, and can they be used in other training software (e.g., TRL for SFT after pretraining) without any problems?
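Concretely, the workflow I have in mind is something like the following (a sketch assuming the checkpoint is saved as a standard Hugging Face directory; the path is a hypothetical placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Reload a checkpoint produced by GaLore v2 pretraining in bf16
# (assuming it is a standard safetensors/HF directory).
model = AutoModelForCausalLM.from_pretrained(
    "path/to/galore-v2-checkpoint",  # hypothetical checkpoint directory
    torch_dtype=torch.bfloat16,
)
# ...then hand the model to TRL's SFTTrainer (or any other HF-compatible
# trainer) for the SFT stage, ideally without extra conversion steps.
```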
Finally, I have a suggestion. Integrating GaLore v2 into LLaMA Factory, as was done for GaLore v1, would allow it to be combined with pretraining methods such as LLaMA Pro. Please consider this integration.
Thank you.