Makes Llama checkpoint conversion compatible with fused up/gate projection #26

evellasques · 2024-04-17T14:47:50Z

Issue #24

Description of changes:

The recent merging of the up/gate projections in Llama requires the equivalent merging in the HF to NeMo conversion scripts (and the corresponding splitting in the NeMo to HF script).

This change fixes that for the following converters:

convert_nemo_checkpoint_to_hf_llama.py
convert_hf_checkpoint_to_nemo_llama.py
convert_hf_checkpoint_to_nemo_llama_70b.py
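To illustrate the HF to NeMo direction described above, here is a minimal sketch of fusing the two projections per tensor-parallel rank. It is not the code from this PR: the helper name `fuse_gate_up_per_rank` is hypothetical, NumPy arrays stand in for torch tensors, and the per-rank layout `[gate_shard; up_shard]` along axis 0 is an assumption.

```python
import numpy as np

def fuse_gate_up_per_rank(gate, up, tp_size):
    """Return one fused array per TP rank (hypothetical helper).

    Assumption: each rank stores its gate shard followed by its up shard
    along axis 0 (the output dimension of the weight).
    """
    # Shard both projections along the output dimension, one piece per rank.
    gate_shards = np.split(gate, tp_size, axis=0)
    up_shards = np.split(up, tp_size, axis=0)
    # Fuse shard by shard, so every rank holds matching gate and up rows.
    return [np.concatenate([g, u], axis=0) for g, u in zip(gate_shards, up_shards)]

# Toy sizes: intermediate size 4, hidden size 2, two TP ranks.
gate = np.arange(8, dtype=np.float32).reshape(4, 2)
up = 100 + np.arange(8, dtype=np.float32).reshape(4, 2)
fused = fuse_gate_up_per_rank(gate, up, tp_size=2)
```

With these toy inputs, each rank ends up with a 4x2 array whose first two rows come from `gate` and whose last two rows come from `up`.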

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.


amithrm · 2024-04-22T18:25:07Z

Thanks @evellasques for the PR. Going over it!

HolySahar · 2024-04-22T20:21:19Z

.../examples/nlp/language_modeling/checkpoint_conversion/convert_nemo_checkpoint_to_hf_llama.py

@@ -125,8 +126,7 @@ def convert_checkpoint(config_file,
        "self_attention.dense.weight": (1, "self_attn.o_proj.weight", 1, 0),
        "post_attention_layernorm.weight": (0, "post_attention_layernorm.weight", None, 0),
        "self_attention.core_attention.rotary_emb.inv_freq": (0, "self_attn.rotary_emb.inv_freq", None, 0),
-        "mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj.weight", 0, 0),
-        "mlp.dense_h_to_4h_2.weight": (1, "mlp.up_proj.weight", 0, 0),
+        "mlp.dense_h_to_4h.weight": (1, "mlp.gate_proj_up_proj.weight", 0, 0),


Why treat "gate" and "up" proj as fused in the HF checkpoint? Shouldn't you split them from the NeMo checkpoint instead and then save them as separate "gate" and "up" params for HF?

HolySahar · 2024-04-22T20:25:02Z

.../examples/nlp/language_modeling/checkpoint_conversion/convert_nemo_checkpoint_to_hf_llama.py

@@ -217,6 +217,13 @@ def convert_checkpoint(config_file,
                hf_model[hf_key_q], hf_model[hf_key_k], hf_model[hf_key_v] = torch.split(hf_model[hf_key], size_per_seg, dim=0)
                hf_model.pop(hf_key)

+            if "dense_h_to_4h" in k:


I think this is not accurate. The "gate" and "up" fusion is per TP rank in the NeMo checkpoint, so you can't first concatenate all TP ranks and then split into "gate" and "up". Instead, you should split them for each TP rank.
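A small sketch of the point above, using NumPy arrays as a stand-in for torch tensors. The function names are hypothetical, and the per-rank layout `[gate_shard; up_shard]` along axis 0 is an assumption, not something confirmed by the diff. It compares splitting each rank's fused weight before concatenating with concatenating all ranks first.

```python
import numpy as np

def split_fused_per_rank(rank_weights):
    """Split each rank's fused weight, then concatenate gate and up separately."""
    gate_parts, up_parts = [], []
    for w in rank_weights:
        # Assumed layout: first half of the rows is gate, second half is up.
        g, u = np.split(w, 2, axis=0)
        gate_parts.append(g)
        up_parts.append(u)
    return np.concatenate(gate_parts, axis=0), np.concatenate(up_parts, axis=0)

def split_after_concat(rank_weights):
    """Concatenate all ranks first, then split in half (interleaves gate and up)."""
    full = np.concatenate(rank_weights, axis=0)
    gate, up = np.split(full, 2, axis=0)
    return gate, up

# Toy reference weights: intermediate size 4, hidden size 2, two TP ranks.
gate = np.arange(8, dtype=np.float32).reshape(4, 2)
up = 100 + np.arange(8, dtype=np.float32).reshape(4, 2)
rank_weights = [
    np.concatenate([gate[:2], up[:2]], axis=0),  # rank 0
    np.concatenate([gate[2:], up[2:]], axis=0),  # rank 1
]

good_gate, good_up = split_fused_per_rank(rank_weights)
bad_gate, bad_up = split_after_concat(rank_weights)
```

Under this layout, the per-rank split recovers the original `gate` and `up`, while the concatenate-then-split variant returns a "gate" whose second half is actually rank 0's up rows.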

Makes Llama checkpoint conversion compatible with fused up/gate projection

c7dfdff

evellasques requested review from aws-maens and musunita as code owners April 17, 2024 14:47

This was referenced Apr 17, 2024

Issue with Llama conversion for new release #24

Closed

bugfix: inv_freq buffer in Llama RotaryEmbedding shouldn't be persistent #21

Closed

Typo in nemo.collections.nlp.parts.serialization.py #25

Closed

HolySahar reviewed Apr 22, 2024
