Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any plan on upadating the code for LLaMA models? #128

Open
iBibek opened this issue Jan 25, 2024 · 11 comments
Open

Any plan on upadating the code for LLaMA models? #128

iBibek opened this issue Jan 25, 2024 · 11 comments

Comments

@iBibek
Copy link

iBibek commented Jan 25, 2024

Thank you for the great repo.

Is there any plan from your side to update the code for LLaMA model? or is there anything I can do to update the codes to visualize LLaMA model?

@Bachstelze
Copy link

Doesn't it work as decoder model?
I have successfully run Mistral (with lots of redundant shortcuts). The architecture should be similar.

@iBibek
Copy link
Author

iBibek commented Jan 28, 2024

@Bachstelze , this is good news.
Can you please share the code (if its possible then )?

@Bachstelze
Copy link

The code is similar to the GPT example in this repo:

from transformers import AutoTokenizer, AutoModel
from bertviz import head_view
from bertviz import model_view

# load the model
# Vicuna is an instruction-model based on Llama
model_name = "lmsys/vicuna-7b-delta-v1.1" # mistralai/Mistral-7B-Instruct-v0.1
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

input_sentence = """The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words.\n
Input: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know.\n\nOutput:"""
input_sentence = "Generate a positive review for a place."
inputs = tokenizer.encode(input_sentence, return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1]  # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0]) 

# save the complete model view
html_head_view = head_view(attention, tokens, html_action='return')
with open("all_head_view.html", 'w') as file:
    file.write(html_head_view.data)

html_model_view = model_view(attention, tokens, html_action='return')
with open("all_model_view.html", 'w') as file:
    file.write(html_model_view.data)

# save the view just for certain layers if the browser can't display the whole
# shorter inputs are easier to display
layers = [1]
html_head_view = head_view(attention, tokens, html_action='return', include_layers=layers)

with open("short_head_view.html", 'w') as file:
    file.write(html_head_view.data)

html_model_view = model_view(attention, tokens, html_action='return', include_layers=layers)
with open("short_model_view.html", 'w') as file:
    file.write(html_model_view.data)

The loading and processing already take 30 GB of RAM. My machine starts to swap at this point and i just save the html to visualize it after the RAM is free again.

The output looks very repetitive.
model_view_vicuna_small_instruction
head_view_vicuna_small_instruction
long_head_view

In the case of Vicuna (lmsys/vicuna-7b-delta-v1.1 From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning) all heads consist of the same weight shape. Every token can only attend to its previous tokens due to the one-directional objective of GPTS, e.g. the first token can only attend to the start of the sentence token and itself. Interestingly, every token equally shares its weights to all possible tokens. Therefore, the attention weights are strongest for the tokens at the beginning and decrease towards the end. This entails a "L" shape like the positive multiplicative inverse.

Let me know if you find other patterns or have a good explanation for this phenomenon

@iBibek
Copy link
Author

iBibek commented Jan 30, 2024

@Bachstelze Thank you so much <3

@MarioRicoIbanez
Copy link

Hi! I am trying to use also bertviz with LLMs. But have you manage to see not only self-attentions of the first iteration but the attention of the genearted word too? Using model.generate method.

@Icamd
Copy link

Icamd commented Apr 24, 2024

Hi! I am trying to use also bertviz with LLMs. But have you manage to see not only self-attentions of the first iteration but the attention of the genearted word too? Using model.generate method.

Have you solve the problem? Thank you!

@MarioRicoIbanez
Copy link

Hi, I finally ended up using captum and it works perfectly!

https://captum.ai/tutorials/Llama2_LLM_Attribution

@Icamd
Copy link

Icamd commented Apr 27, 2024

Hi, I finally ended up using captum and it works perfectly!

https://captum.ai/tutorials/Llama2_LLM_Attribution

Thank you for the information! I find this works as well: https://github.com/mattneary/attention. I will try using captum, thank you!

@Bachstelze
Copy link

@Icamd Does https://github.com/mattneary/attention work well with bigger GPTs? Do you know how the attention weights are aggregated into one view?

@MarioRicoIbanez Can we use captum to view the attention pattern?

@Bachstelze
Copy link

The Llama 3 model sinks most of the time all attention to the "begin of the text" token.
It is possible to load the model in 4 or 8 quantizations and run BertViz, e.g. in google colab: https://colab.research.google.com/drive/1Fcgug4a6rv9F-Wej0rNveiM_SMNZOtrr?usp=sharing

@iBibek iBibek closed this as completed Jul 18, 2024
@iBibek iBibek reopened this Jul 18, 2024
@iBibek
Copy link
Author

iBibek commented Jul 18, 2024

@Bachstelze , can you please clarify on the part where you said :

The Llama 3 model sinks most of the time all attention to the "begin of the text" token.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants