llama3 8B support, tiktoken tokenizer #158
Conversation
Compare: 50e45b0 to 171a7f3
Note: while I was testing int4 quantization with the llama3-8B model, I found this bug: #159
Great job! How many tokens per second are you getting? Mind sharing some stats, @Artyom17?
It is a bit slower than Mistral or Llama 2. I got 165 t/s on H100 with Llama 3, while Llama 2 gave me 185 t/s and Mistral-7B 175 t/s.
Ping?
cc: @yanboliang
I'll review it tomorrow!
Looks good to me! Can you update the corresponding benchmark numbers in the README? Thank you!
Unfortunately, all the benchmarks in README.md were measured on 8xA100, but I only have access to 8xH100.
@Artyom17 I was looking into how to deal with the 70B and found this old script: https://github.com/tloen/llama-int8/blob/main/example.py. Is this useful?
@yanboliang Could we run the A100 numbers and add them to the README?
Yes, I can run and update the A100 numbers. We could also make a small update to the README to split the benchmarks into A100/H100/AMD.
Perf numbers on A100: #166
Hey @Artyom17, I've generalized the support for Llama 3; I'm able to convert both Llama 3 8B and 70B. Please see the PR to your fork here. It works by pre-converting the safetensors format to the PyTorch .bin format, so the HF conversion script works as-is, and all I needed to add were the model configs.
Some performance numbers on 8xA10:
Nice! Looking at it, thanks a lot!
Yeah, I've tested it and it works (with some misspelling caveats I mentioned in the PR). I am not sure we can integrate these changes at the moment, since it would create a dependency on third-party models (the eastwind/* ones), but the gpt-fast owners may correct me if I am wrong. The best outcome would be if HF adopted your conversion and released those .bin files properly. Alternatively, gpt-fast users should be able to convert .safetensors to .bin themselves (I am not super familiar with this process; how hard is it?). The right flow of events, IMO, should be as follows:
@Artyom17 all I did was load the model into memory and save it back out in the .bin format. This is doable, but you would need enough memory to load the HF model and then save it as the "unsafe" version. If someone is running these models, I guess it's safe to assume they have the resources to do this themselves, or we could have code to do it in here. But I don't want to clutter this repo and add a dependency on transformers, lol. IMO a better solution is to modify the existing code to work with safetensors, but I don't know how difficult that is.
Well, you can't use the 70B model anyway, unless you have a beefy machine with an A100/H100 80GB.
Hello, I managed to brute force my way to convert from .safetensors to .bin for Meta-Llama-3-70B by loading the model and using:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint_dir)
```

I wonder if I may ask a few questions:

Thank you for your efforts!
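For reference, here is a rough sketch of the brute-force conversion described in the comments above, assuming the transformers API is available. The path is a placeholder, and I am using AutoModelForCausalLM (rather than AutoModel) on the assumption that the lm_head weights should be preserved for the conversion; this is not the exact code from either commenter.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder path: wherever the HF safetensors checkpoint was downloaded.
checkpoint_dir = "checkpoints/meta-llama/Meta-Llama-3-70B"

# Load the full model into memory (needs enough RAM to hold it).
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.bfloat16,   # avoid upcasting to fp32 while loading
    low_cpu_mem_usage=True,
)

# Re-save as the "unsafe" pickle format: this writes the
# pytorch_model-xxxxx-of-xxxxx.bin shards that the existing
# convert_hf_checkpoint.py script expects.
model.save_pretrained(checkpoint_dir, safe_serialization=False)
```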
Hey @danieltmeta, that's exactly what I did. Eastwind is my Hugging Face account, btw 🤣. Not sure why you are getting different results; we should get exactly the same weights. Are you on the latest transformers and safetensors libraries? Also, if you want to use that, please use my version of the Llama 3 integration in this PR: #169
I think supporting safetensors is the simplest way to do this (minimal code changes).
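As a minimal sketch (not from this PR) of what that could look like: the conversion script could read .safetensors shards directly with the safetensors library instead of requiring .bin files. The glob pattern below is a placeholder.

```python
import glob
from safetensors.torch import load_file

merged_state_dict = {}
for shard in sorted(glob.glob("checkpoints/meta-llama/Meta-Llama-3-70B/model-*.safetensors")):
    # load_file returns a plain dict of tensor name -> torch.Tensor,
    # so the downstream weight-remapping logic would not need to change.
    merged_state_dict.update(load_file(shard, device="cpu"))
```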
@jerrymannil thank you! I will make this change and test it out on my other PR tomorrow. |
Surprisingly, Llama 3 switched to the Tiktoken tokenizer from SentencePiece. This PR implements wrappers for both the Tiktoken and SentencePiece tokenizers, as well as adding params for the Llama-3-8B* and -70B* models.
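A minimal sketch of such a wrapper pair is below; this is not the PR's actual code, the class names are illustrative, and the regex pattern and special-token set passed to tiktoken are simplified examples rather than the exact ones Llama 3 ships with.

```python
from abc import ABC, abstractmethod
from pathlib import Path

import tiktoken
from tiktoken.load import load_tiktoken_bpe
from sentencepiece import SentencePieceProcessor


class TokenizerInterface(ABC):
    """Common interface so generate.py can stay tokenizer-agnostic."""

    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...


class SentencePieceWrapper(TokenizerInterface):
    def __init__(self, model_path: Path):
        self.sp = SentencePieceProcessor(model_file=str(model_path))

    def encode(self, text):
        return self.sp.encode(text)

    def decode(self, ids):
        return self.sp.decode(ids)


class TiktokenWrapper(TokenizerInterface):
    def __init__(self, model_path: Path):
        # tokenizer.model here is a tiktoken BPE ranks file, not a
        # SentencePiece model.
        mergeable_ranks = load_tiktoken_bpe(str(model_path))
        self.model = tiktoken.Encoding(
            name=model_path.name,
            # Illustrative split pattern; the real Llama 3 pattern and its
            # reserved special tokens should be taken from Meta's release.
            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
            mergeable_ranks=mergeable_ranks,
            special_tokens={"<|begin_of_text|>": len(mergeable_ranks)},
        )

    def encode(self, text):
        return self.model.encode(text)

    def decode(self, ids):
        return self.model.decode(ids)
```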
As to scripts/convert_hf_checkpoint.py: Llama 3 on HF no longer ships the pytorch_model-xxxx-xxxx.bin files; instead, the PyTorch model lives in the 'original' sub-directory with a different naming pattern ('consolidated.XX.pth'). For the 8B models it is a single file that just needs to be copied as model.pth into the parent directory; there is no need to mess with the names of the weights.
The original/tokenizer.model file also just needs to be copied into its parent directory (and the Tiktoken tokenizer must be used instead of SentencePieceProcessor).
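Spelled out as a small sketch (paths are illustrative), the 8B flow described above amounts to:

```python
import shutil
from pathlib import Path

# Placeholder path: adjust to wherever the HF checkpoint was downloaded.
checkpoint_dir = Path("checkpoints/meta-llama/Meta-Llama-3-8B")

# The 8B model ships a single consolidated shard: copy it up as model.pth.
shutil.copy(checkpoint_dir / "original" / "consolidated.00.pth",
            checkpoint_dir / "model.pth")

# The tiktoken vocabulary file also just moves up one level.
shutil.copy(checkpoint_dir / "original" / "tokenizer.model",
            checkpoint_dir / "tokenizer.model")
```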
As to the 70B model: it is not covered by this PR, since it is not clear to me how to handle multiple consolidated.XX.pth files with THE SAME weight names in each (unlike the pytorch_model_XXXX-of-XXXXX.bin files, where each .bin contains a distinct subset of the weights).
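For illustration only, here is roughly what merging those shards would involve if they are tensor-parallel splits of the same tensors. Which dimension each weight is split along depends on the layer type, which is exactly the open question above, so the suffix checks below are hypothetical examples, not a working converter.

```python
import glob
import torch

# Load every consolidated.XX.pth shard (placeholder path).
shards = [torch.load(p, map_location="cpu")
          for p in sorted(glob.glob("original/consolidated.*.pth"))]

merged = {}
for name in shards[0]:
    pieces = [shard[name] for shard in shards]
    if name.endswith("wq.weight"):        # example of a column-parallel split
        merged[name] = torch.cat(pieces, dim=0)
    elif name.endswith("wo.weight"):      # example of a row-parallel split
        merged[name] = torch.cat(pieces, dim=1)
    else:                                 # replicated tensors, e.g. norm weights
        merged[name] = pieces[0]
```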