Multimodal Large Language Model Support? #15
Comments
That is a good idea!
We added some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress
Thanks! Awesome! These early results seem to have IP-Adapter-like capabilities. It probably also has strong in-context learning ability: given a pair of (original image, target image) as an example, it could learn to modify another image the same way.
Hello, may I ask where this open-source work can be found? Is there a paper?
I reproduced a version of ELLA based on SDXL, and the results are a real improvement over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?
Wow! Can you show some comparisons between the ELLA-SDXL results you reproduced and the original SDXL results?
EMMA actually uses both text and image embeddings as input for the Connector. The current method is still quite simple and not sufficient for an 8-page paper, so we are conducting more experiments.
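For illustration, a connector that consumes both modalities might look roughly like the sketch below. This is only a minimal guess at the shape of the idea; the class name, dimensions, and resampler design are assumptions, not the actual EMMA implementation.

```python
# Hypothetical sketch of a connector that takes both text and image
# embeddings as input. All names and dimensions are assumptions, not
# the actual EMMA code.
import torch
import torch.nn as nn

class MultiModalConnector(nn.Module):
    def __init__(self, text_dim=4096, image_dim=1024, out_dim=2048,
                 num_queries=64, num_heads=8):
        super().__init__()
        # Learnable queries become a fixed-length set of conditioning tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, out_dim))
        self.text_proj = nn.Linear(text_dim, out_dim)
        self.image_proj = nn.Linear(image_dim, out_dim)
        self.attn = nn.MultiheadAttention(out_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(out_dim)

    def forward(self, text_emb, image_emb):
        # text_emb: (B, T, text_dim) from the LLM;
        # image_emb: (B, I, image_dim) from a vision encoder.
        # Project both into a shared space and concatenate as keys/values.
        kv = torch.cat([self.text_proj(text_emb),
                        self.image_proj(image_emb)], dim=1)
        q = self.queries.unsqueeze(0).expand(text_emb.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        # (B, num_queries, out_dim): used as cross-attention context for the UNet.
        return self.norm(out)
```

The learned-query resampler over concatenated modality tokens is just one plausible reading of "both text and image embeddings as input for the Connector".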
We improved the basic ELLA structure and used GenEval for model evaluation. According to our reproduction results, the improvement on Two Objects and Color Attribution is significant. Is this consistent with your conclusions? In addition, based on your description of EMMA, it feels similar to the idea of M2Chat; perhaps we can stay in frequent communication in future work.
@plastic0313 Can you please share your training scripts? I think it would really unlock a lot of space for the community to explore, say by taking advantage of LLaMA 3.
This depends on the data you use. Generally speaking, VLM-annotated captions contain a lot of accurate color descriptions, so the performance on color is much better.
Of course, you can contact me through the contact information on my personal website.
Is there any version of EMMA (even in beta) that we can play with? I would love to learn more about the preliminary image-generation results.
Any chance you could share your work, @plastic0313? Sincerely, best regards.
Maybe I'm the only one, but the lack of weights, experiments, or training logs really makes the SDXL claims hard to believe. It seems like most people feel the weights don't actually exist, or don't behave the way the paper claims. Releasing the weights and training data would help your project here.
Would you mind sharing what kind of datasets you used? A 34M dataset is really challenging for me.
I am not sure they actually did what they claim. We have been trying to train it for ~2 months; it just doesn't work for SDXL, since it has two text encoders.
The author said they used attention pooling to transform the ELLA embedding into SDXL's pooled embedding. I have implemented this and it works; you can give it a try.
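For anyone curious, that attention pooling might look something like the following minimal sketch. The names and sizes are illustrative assumptions (SDXL's pooled `text_embeds` are 1280-dimensional), not the poster's actual code.

```python
# Minimal sketch of attention pooling that maps a sequence of ELLA
# connector tokens to a single pooled vector for SDXL's `text_embeds`.
# Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, token_dim=2048, pooled_dim=1280, num_heads=8):
        super().__init__()
        # One learnable query aggregates the whole token sequence.
        self.query = nn.Parameter(torch.randn(1, 1, token_dim))
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.out = nn.Linear(token_dim, pooled_dim)

    def forward(self, tokens):
        # tokens: (B, N, token_dim), e.g. the ELLA output that replaces
        # SDXL's per-token text-encoder hidden states.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (B, 1, token_dim)
        return self.out(pooled.squeeze(1))        # (B, pooled_dim)
```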
Just publish your results instead.
@George0726 Could you provide details or an implementation, please? Is it self-attention pooling?
I don't think George is being honest about it. The issues he opened on my project do not indicate a deep understanding of this type of work or the code required to write it.
Why are you so rude to me? I just kindly asked a question about bugs in your extended SD3 network. I can share my ELLA code here.

Maybe I am not as good as you, but you are so rude to judge my integrity. I cannot get results better than the SDXL baseline; I am not sure whether it is the size of the dataset I use or ELLA itself. But at least I know that using average pooling makes it generate images ...
My mistake, George. Your question was about an input type mismatch, which, as everybody knows, is due to a gradient-checkpointing bug that was resolved on the diffusers main branch. Please accept my apologies. We also didn't have good luck with SDXL ELLA, probably because of the two text encoders.
Hi! Great work!
Have you tried leveraging an MLLM as the prompt encoder? We have open-source MLLMs now, and I think this would be an easy but very powerful extension. For example, we could give image prompts without ControlNet or other mechanisms for injecting image information: we just tell the MLLM what we want with text and images, and SD generates it for us.
Update: I see this mentioned in the Conclusion and Limitations. If you can release the training code, the community can probably also explore this direction and adapt various LLMs.
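As a rough sketch of that idea: take a frozen open MLLM, grab its last hidden states for the prompt, and project them to the UNet's cross-attention width. The model name and projection size below are placeholders, not a tested recipe.

```python
# Rough sketch: use an MLLM's last hidden states as the prompt embedding
# for Stable Diffusion instead of CLIP. "some/open-mllm" is a placeholder.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

mllm_name = "some/open-mllm"  # placeholder: any model exposing hidden states
tokenizer = AutoTokenizer.from_pretrained(mllm_name)
mllm = AutoModelForCausalLM.from_pretrained(mllm_name).eval()

# Frozen MLLM encodes the prompt; a small trainable projection maps its
# hidden size to the UNet cross-attention dimension (2048 for SDXL).
proj = nn.Linear(mllm.config.hidden_size, 2048)

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    inputs = tokenizer(prompt, return_tensors="pt")
    out = mllm(**inputs, output_hidden_states=True)
    return out.hidden_states[-1]  # (1, seq_len, hidden_size)

# `cond` would be passed to the UNet as encoder_hidden_states; image
# inputs would go through the MLLM's own vision pathway the same way.
cond = proj(encode_prompt("a red cube on a blue sphere"))
```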