
move to lit gpt llama impl #3

Open · samsja wants to merge 5 commits into main from feat-llama-impl

Conversation

samsja (Collaborator) commented Jul 16, 2024

add llama

samsja requested a review from Jackmin801 on July 16, 2024 at 16:48
samsja force-pushed the feat-llama-impl branch 2 times, most recently from f2a515e to 682d365, on July 17, 2024 at 11:54
Comment on lines +184 to +193
 def get_model(config: Config) -> GPT:
-    # Load model
-    config_model = LlamaConfig.from_pretrained(config.path_model, attn_implementation=config.attn_implementation)
-    return LlamaForCausalLM.from_pretrained(pretrained_model_name_or_path=config.path_model, config=config_model)
+    if isinstance(config.llama_config, ModelConfig):
+        llama_config = config.llama_config
+    else:
+        with open(config.llama_config) as f:
+            llama_config = ModelConfig(**json.load(f))
+
+    llama_config.attention_impl = config.attention_impl
+    return GPT(llama_config)
Jackmin801 (Member):
Hrmm, would the initialisation be the same across DiLoCo workers?

samsja (Collaborator, Author):
Yes indeed. Is there a way to use the HF Hub to push the init ckpt like we did before?

Jackmin801 (Member):
Seems like GPT inherits from nn.Module instead of transformers' PreTrainedModel, so it doesn't have the from_pretrained function. Maybe we can change it to allow loading from the Hub.
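
For illustration, a minimal sketch of what a hub-based init-checkpoint helper could look like while GPT stays a plain nn.Module; the repo id, filename, and helper names below are placeholders, not existing code:

    import torch
    from huggingface_hub import HfApi, hf_hub_download

    def push_init_ckpt(model: torch.nn.Module, repo_id: str) -> None:
        # Save the randomly initialised weights once and upload them, so every
        # DiLoCo worker can start from the exact same checkpoint.
        torch.save(model.state_dict(), "init_ckpt.pt")
        HfApi().upload_file(path_or_fileobj="init_ckpt.pt", path_in_repo="init_ckpt.pt", repo_id=repo_id)

    def load_init_ckpt(model: torch.nn.Module, repo_id: str) -> torch.nn.Module:
        # Each worker pulls the same checkpoint back from the hub at startup.
        path = hf_hub_download(repo_id=repo_id, filename="init_ckpt.pt")
        model.load_state_dict(torch.load(path, map_location="cpu"))
        return model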

    input_ids = inputs_ids[:, :-1]
    target = inputs_ids[:, 1:]

    output = model(input_ids)
Jackmin801 (Member):
missing seqlens?

samsja (Collaborator, Author):
good catch
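
For reference, a rough sketch of the corrected snippet, assuming the model's forward accepts per-sample sequence lengths through a seqlens argument and the data collator provides them; both names are hypothetical here:

    input_ids = inputs_ids[:, :-1]
    target = inputs_ids[:, 1:]
    seqlens = batch["seqlens"]  # hypothetical collator output: per-document lengths in the packed batch

    # Passing the lengths lets attention stay block-diagonal across document boundaries.
    output = model(input_ids, seqlens=seqlens)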

samsja (Collaborator, Author) commented Jul 31, 2024

FYI @Jackmin801, this PR was not meant to be merged for now; it was more to compare. That said, we get 20% better MFU with it (even with torch compile), so we might want it.

I will address your comments

Jackmin801 (Member) commented Jul 31, 2024

Where are the MFU speedups coming from? If it's from better layer implementations, I think it might be better to just modify (copy over to our repo and modify) the HF modeling_llama.py to use those layers and keep the more general HF PreTrainedModel API that the training script already had (the model loading semantics and loss calculation would then live in the model implementation rather than the training script).
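
Roughly, the training-script side would then shrink to the standard HF causal-LM pattern (a sketch only, assuming a stock PreTrainedModel forward; the custom layers would live inside the copied modeling_llama.py):

    # Loading keeps the usual HF semantics ...
    model = LlamaForCausalLM.from_pretrained(config.path_model, config=config_model)

    # ... and the label shift plus cross-entropy happen inside the model's forward.
    outputs = model(input_ids=inputs_ids, labels=inputs_ids)
    outputs.loss.backward()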

It's also easier to have FP8 support this way (I have an implementation using Transformer Engine that uses this approach: https://github.com/PrimeIntellect-ai/OpenDiLoCo_internal/pull/71).
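
Not the code from that PR, but roughly what the FP8 path looks like with Transformer Engine; it only helps once the Linear/LayerNorm modules inside the copied modeling_llama.py are swapped for TE layers:

    import transformer_engine.pytorch as te
    from transformer_engine.common.recipe import DelayedScaling, Format

    # Delayed-scaling recipe: E4M3 forward / E5M2 backward, amax tracked over a short history.
    fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")

    # Run the forward pass under FP8 autocast; backward is taken outside the context.
    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        outputs = model(input_ids=inputs_ids, labels=inputs_ids)
    outputs.loss.backward()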
