First of all, congratulations on your fantastic work, and thank you for open-sourcing it!
I encountered an issue while fine-tuning the 384p miniFLUX model on the OpenVid HD subset (~0.4M items). After 40K steps, I noticed a degradation in the quality of the generated text-to-video results.
Details of my setup:
Dataset: The prompts for the OpenVid HD subset were generated using VILA.
Issue:
Attached, you can find a comparison of the original 384p miniFLUX (on the left) and the fine-tuned version (on the right). The prompt used was:
"A fat rabbit wearing a purple robe walking through a fantasy landscape."
Is this degradation expected, or could there be an issue with the fine-tuning process? I would greatly appreciate any insights or recommendations for debugging and improving the results.
1.mp4
Thank you in advance for your time and support!
Best regards,
Levon
Thank you for your response. I am currently using 8 GPUs (8x A100) with a learning rate of 1e-5.
One observation: the published training code does not mix image and video data. The paper mentions that image data makes up 12.5% of each batch during training, but the published code relies solely on video data. We have been training with the published video-only code, and the missing image data might be contributing to the quality degradation we're observing.
Please let me know if you require any additional details or have recommendations on how to proceed.
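To make the image/video mixing concrete, here is a minimal sketch of how a batch could be composed with a fixed 12.5% image share. This is only my assumption of what such mixing might look like, not the paper's or the repository's implementation; `mixed_batch_indices` and all of its parameters are hypothetical names, and it works on plain dataset indices rather than on a real data loader.

```python
import random


def mixed_batch_indices(num_videos, num_images, batch_size,
                        image_fraction=0.125, seed=0):
    """Yield (video_indices, image_indices) pairs, one pair per batch.

    Hypothetical helper: each batch holds a fixed share of image samples
    (12.5% by default) and fills the rest with video samples.
    """
    n_img = batch_size * image_fraction
    if n_img != int(n_img):
        # Assumption: we insist on an exact split, e.g. batch_size=8 -> 1 image.
        raise ValueError("batch_size * image_fraction must be a whole number")
    n_img = int(n_img)
    n_vid = batch_size - n_img

    rng = random.Random(seed)
    video_order = list(range(num_videos))
    rng.shuffle(video_order)

    # One pass over the videos; leftover videos that cannot fill a batch are dropped.
    for start in range(0, num_videos - n_vid + 1, n_vid):
        video_ids = video_order[start:start + n_vid]
        # Images are drawn with replacement from the image dataset.
        image_ids = [rng.randrange(num_images) for _ in range(n_img)]
        yield video_ids, image_ids


if __name__ == "__main__":
    for video_ids, image_ids in mixed_batch_indices(70, 1000, batch_size=8):
        print(len(video_ids), "videos +", len(image_ids), "image")
```

With a per-GPU batch size that is not a multiple of 8, the split would have to be done across the global batch instead; the sketch does not cover that case.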
I have also found that the loss barely drops: it starts at 0.0458 and ends at 0.0423. Is that normal? @jy0205 @levon-khachatryan
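For reference, the two loss values quoted above work out to a relative drop of roughly 7.6%. Since the raw per-step loss is typically noisy, it may be easier to compare runs on a smoothed curve. The snippet below is a small sketch of both calculations; `relative_drop` and `ema` are hypothetical helper names, and the smoothing factor of 0.99 is an arbitrary choice, not something taken from the training code.

```python
def relative_drop(start, end):
    """Fractional decrease from the first to the last loss value."""
    return (start - end) / start


def ema(values, beta=0.99):
    """Exponential moving average of a loss curve (beta is an assumed default)."""
    smoothed, avg = [], None
    for v in values:
        # The first value seeds the average; later values are blended in.
        avg = v if avg is None else beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed


if __name__ == "__main__":
    print(f"{relative_drop(0.0458, 0.0423):.1%}")  # prints 7.6%
```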