Image + Audio + Text input using Llama 3.2 [DO NOT MERGE] #127

farzadab · 2024-10-01T18:20:18Z

This PR is not in a state to be merged, but it shows how Llama 3.2 can be used to combine image, text, and audio inputs together and get the correct response.

Take a look at llama32_script.py to see how this is done:

- The Llama 3.2 11B Vision Instruct model is loaded.
- Weights from ultravox-v0_4 (trained on Llama 3.1 8B) are loaded without modification.
- The input consists of an image and audio, together with text.
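
The steps above can be pictured with a small, dependency-free sketch. It assumes that the audio reaches the language model as embeddings written over placeholder positions in the text embedding sequence, which is then passed as `inputs_embeds`; that is an assumption about the approach, not a copy of llama32_script.py. The function name `splice_audio_embeddings` and the toy vectors are hypothetical, and real code would operate on tensors rather than lists.

```python
# Illustrative sketch only: plain Python lists stand in for embedding tensors.
# `splice_audio_embeddings` is a hypothetical helper, not part of the PR.

def splice_audio_embeddings(text_embeds, audio_embeds, audio_start):
    """Return a copy of text_embeds with audio_embeds written over the
    placeholder positions that begin at index audio_start."""
    end = audio_start + len(audio_embeds)
    if audio_start < 0 or end > len(text_embeds):
        raise ValueError("audio span does not fit inside the text sequence")
    # Copy every row so the caller's text embeddings are left untouched.
    merged = [row[:] for row in text_embeds]
    merged[audio_start:end] = [row[:] for row in audio_embeds]
    return merged


# Four text positions with 2-dimensional embeddings; positions 1 and 2 are
# assumed to be audio placeholders.
text_embeds = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
audio_embeds = [[9.0, 9.0], [8.0, 8.0]]

inputs_embeds = splice_audio_embeddings(text_embeds, audio_embeds, audio_start=1)
```

In this picture the image would still travel through the model's usual vision input, while the merged sequence replaces the token ids for the text and audio portion.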

Note: before using the script, a few lines in the transformers library need to be manually commented out to allow for this approach:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/mllama/modeling_mllama.py#L2152-L2155
These lines don't allow you to specify inputs_embeds when vision input is present. Hopefully we can upstream this change in the future.
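
To make the restriction concrete, here is a minimal stand-in for the kind of check described above. It is not the transformers source: the function `check_inputs`, its error message, and the `allow_both` flag are all hypothetical, and the flag only mimics the effect of commenting the check out.

```python
# Hypothetical stand-in for an input validation guard; not the library code.

def check_inputs(pixel_values=None, inputs_embeds=None, allow_both=False):
    """Reject calls that pass vision input together with precomputed embeddings,
    unless allow_both is set (mimicking the check being commented out)."""
    if pixel_values is not None and inputs_embeds is not None and not allow_both:
        raise ValueError("pixel_values and inputs_embeds cannot both be specified")


# With the guard active, image input plus inputs_embeds is rejected.
try:
    check_inputs(pixel_values=[0.5], inputs_embeds=[[0.1, 0.2]])
    rejected = False
except ValueError:
    rejected = True

# With the guard relaxed, the same call passes validation.
check_inputs(pixel_values=[0.5], inputs_embeds=[[0.1, 0.2]], allow_both=True)
```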

farzadab added 3 commits September 30, 2024 14:58

partial 3.2 support

f5b7715

fix

8957b85

working script of combining audio + image + text

50094bb
