Add support for grouped-query attention #9
base: main
Conversation
# This is a no-op for normal attention where ng == np. When using grouped-query attention this
# creates a view that has the keys and values virtually repeated along their dimension to
# match the number of queries.
key_layer = key_layer.repeat_interleave(
How is this repeat_interleaving done? Does it use an explicit torch.view() operation?
@amithrm repeat_interleave does not reshape a tensor the way torch.view does; it repeats elements of the tensor.
Unlike MHA, GQA has no one-to-one correspondence between query heads and key/value heads: multiple query heads share a single key/value head.
By virtually repeating the shared key/value heads until the number of heads equals num_attention_heads, core_attention can treat MHA and GQA identically.
Through this operation, the middle and right illustrations in Figure 2 are transformed to have the same shape as the left one.
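As a minimal sketch of what the call does (the tensor layout and dimension names here are my assumptions for illustration, not necessarily the exact shapes used in this repository):

import torch

# Hypothetical shapes for illustration only; the actual layout in this repo may differ.
seq_len, batch, head_dim = 4, 2, 8
num_attention_heads = 8   # np: number of query heads
num_query_groups = 2      # ng: number of key/value heads under GQA

key_layer = torch.randn(seq_len, batch, num_query_groups, head_dim)

# Each key/value head is shared by np // ng query heads, so repeat every
# KV head that many times along the head dimension (dim=2 in this layout).
key_layer = key_layer.repeat_interleave(
    num_attention_heads // num_query_groups, dim=2
)

print(key_layer.shape)  # torch.Size([4, 2, 8, 8]) -- now matches the number of query heads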
Wouldn't this cause calculations to be duplicated/redundant across GQA groups?
No, this operation does not increase the time complexity.
GQA reduces the cost by shrinking the output dimension of the projection layer that transforms hidden states into key/value heads.
The complexity of the dot product in the core attention is the same for GQA and MHA.
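As a rough illustration of where the savings come from (hypothetical dimensions, not the values used in this repository):

# Hypothetical dimensions for illustration; not taken from this repository's config.
hidden_size = 8192
head_dim = 128
num_attention_heads = 64   # query heads (same for MHA and GQA)
num_query_groups = 8       # key/value heads under GQA

# Parameter count of the key (or value) projection from the hidden states:
kv_proj_params_mha = hidden_size * num_attention_heads * head_dim  # 8192 * 8192
kv_proj_params_gqa = hidden_size * num_query_groups * head_dim     # 8192 * 1024 -> 8x smaller

# After repeat_interleave, the Q.K^T and attention.V matmuls have identical shapes
# in both cases, so the core-attention FLOPs are the same for MHA and GQA.
print(kv_proj_params_mha, kv_proj_params_gqa)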
@@ -5,6 +5,7 @@ export TP=8
 export PP=4
 export N_LAYERS=40
 export N_AH=40
+export N_QG=40
Can you post performance numbers and convergence curves for pretraining with your changes? Can you use the config below?
tensor_parallel: 8
pipeline_parallel: 8
data_parallel: 1
global_batch_size: 256
activation_checkpointing: full
precision: bf16+SR
dataset: bookcorpus
lrscheduler: cosineannealing
I have confirmed that the time per step decreases on the 7B model when num_query_groups is set smaller than the standard setting.
We plan to start training the 70B model next week and will share the results through our contact at AWS Japan.
In preparation, we plan to run some experiments with smaller settings; I will share those results as soon as they are done.
Add support for grouped-query attention, required for compatibility with Llama 2 70B and Code Llama 34B.
References: