First of all, I would like to thank you for adding this wonderful training method to the literature. I have some questions regarding the method:
Is it possible to use Flash Attention, xFormers, or torch.compile with this method?
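For context, this is the kind of setup I have in mind (a minimal sketch using the standard transformers flags; the model id is just a placeholder, and whether the GaLore v2 optimizer path is compatible with these options is exactly what I am asking):

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical setup: load a model with FlashAttention-2 via the standard
# transformers flag, then wrap it with torch.compile. Whether this works
# together with GaLore v2 training is the open question.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",                # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa"
)
model = torch.compile(model)
```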
Does the model's VRAM usage scale quadratically or sub-quadratically as the maximum sequence length increases? Although quantization and LoRA-style optimizations reduce VRAM usage, the inability to use Flash Attention poses a challenge for pretraining or fine-tuning on long texts. Are you planning any optimizations to address this?
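To make the concern concrete, here is the back-of-the-envelope arithmetic behind the question (my own rough illustration, not a measurement of this implementation): without FlashAttention the attention scores are materialized as a seq_len × seq_len matrix per head, so activation memory grows quadratically with context length.

```python
# Rough arithmetic: memory for one layer's attention-score matrix in
# vanilla (non-fused) attention, before any recomputation tricks.
def naive_attn_score_bytes(batch, heads, seq_len, bytes_per_el=2):
    """batch x heads x seq_len^2 elements, bytes_per_el=2 for bf16."""
    return batch * heads * seq_len**2 * bytes_per_el

# e.g. batch=1, 32 heads, 32k context, bf16:
print(naive_attn_score_bytes(1, 32, 32_768) / 2**30, "GiB")  # -> 64.0 GiB per layer
```

With FlashAttention the scores are never fully materialized, so this term drops to roughly linear in sequence length, which is why its availability matters so much for long-context training.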
Will the model weights saved with this method be in bf16 format, and can they be used in other training software (e.g., TRL for SFT after pretraining) without any problems?
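Concretely, the workflow I have in mind is something like the following (a sketch assuming the checkpoint is saved as a standard Hugging Face directory; the path is a hypothetical placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Reload a checkpoint produced by GaLore v2 pretraining in bf16
# (assuming it is a standard safetensors/HF directory).
model = AutoModelForCausalLM.from_pretrained(
    "path/to/galore-v2-checkpoint",  # hypothetical checkpoint directory
    torch_dtype=torch.bfloat16,
)
# ...then hand the model to TRL's SFTTrainer (or any other HF-compatible
# trainer) for the SFT stage, ideally without extra conversion steps.
```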
Finally, I have a suggestion. Integrating GaLore v2 into LLaMA Factory, as was done for GaLore v1, would allow it to be combined with pretraining methods such as LLaMA Pro. Please consider this integration.
Thank you.