llama3 8B support, tiktoken tokenizer #158
Conversation
Compare: 50e45b0 to 171a7f3
Note: while I was testing int4 quantization with the llama3-8B model, I found this bug: #159
Great job! How many tokens per second are you getting? Mind sharing some stats, @Artyom17?
It is a bit slower than Mistral or Llama 2. I got 165 t/s on H100 with Llama 3, while Llama 2 gave me 185 t/s and Mistral-7B 175 t/s.
Ping?
cc: @yanboliang
I'll review it tomorrow!
Looks good to me! Can you update the corresponding benchmark numbers in the README? Thank you!
Unfortunately, all the benchmarks in README.md were measured on 8xA100, but I only have access to 8xH100.
@Artyom17 I was looking into how to deal with the 70B and found this old script: https://github.com/tloen/llama-int8/blob/main/example.py. Is this useful?
@yanboliang Could we run the A100 numbers and add them to the README?
Yes, I can run and update the A100 numbers. We could also make a small update to the README to split the benchmarks into A100/H100/AMD.
Perf numbers on A100: #166
Hey @Artyom17, I've generalized the support for Llama 3; I'm able to convert both Llama 3 8B and 70B. Please see the PR to your fork here. It works by pre-converting the safetensors format to the PyTorch .bin format, so the HF conversion script works as-is, and all I needed to add were the model configs.
Some performance numbers on 8xA10:
Nice! Looking at it, thanks a lot!
Yeah, I've tested it and it works (with some misspelling caveats I mentioned in the PR). I am not sure we can integrate these changes at the moment, since it would create a dependency on third-party models (the eastwind/* ones), but the gpt-fast owners may correct me if I am wrong. The best outcome would be if HF adopted your conversion and released those .bin files properly. Alternatively, gpt-fast users should be able to convert .safetensors to .bin themselves (I am not super familiar with this process; how hard is it?). The right flow of events, IMO, should be as follows:
@Artyom17 all I did was load the model into memory and save it back out in the .bin format. This is doable, but you would need enough memory to load the HF model and then save it as the "unsafe" version. If someone is running these models, I guess it's safe to assume they have the resources to do this themselves, or we could have code to do it in here. But I don't want to clutter this repo and add a dependency on transformers, lol. IMO a better solution is to modify the existing code to work with safetensors, but I don't know how difficult that is.
Well, you can't use the 70B model anyway, unless you have a beefy machine with an A100/H100 80GB.
Hello, I managed to brute force my way to convert from .safetensors to .bin for Meta-Llama-3-70B by loading the model and using:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained(checkpoint_dir)
```

I wonder if I may ask a few questions:

Thank you for your efforts!
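For reference, here is a rough sketch of the brute-force conversion described in the comments above, assuming the transformers API is available. The path is a placeholder, and I am using AutoModelForCausalLM (rather than AutoModel) on the assumption that the lm_head weights should be preserved for the conversion; this is not the exact code from either commenter.

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder path: wherever the HF safetensors checkpoint was downloaded.
checkpoint_dir = "checkpoints/meta-llama/Meta-Llama-3-70B"

# Load the full model into memory (needs enough RAM to hold it).
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_dir,
    torch_dtype=torch.bfloat16,   # avoid upcasting to fp32 while loading
    low_cpu_mem_usage=True,
)

# Re-save as the "unsafe" pickle format: this writes the
# pytorch_model-xxxxx-of-xxxxx.bin shards that the existing
# convert_hf_checkpoint.py script expects.
model.save_pretrained(checkpoint_dir, safe_serialization=False)
```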
Hey @danieltmeta, that's exactly what I did. Eastwind is my Hugging Face account, btw 🤣. Not sure why you are getting different results; we should get exactly the same weights. Are you on the latest transformers and safetensors libraries? Also, if you want to use that, please use my version of the Llama 3 integration in this PR: #169
I think supporting safetensors is the simplest way to do this (minimal code changes).
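As a minimal sketch (not from this PR) of what that could look like: the conversion script could read .safetensors shards directly with the safetensors library instead of requiring .bin files. The glob pattern below is a placeholder.

```python
import glob
from safetensors.torch import load_file

merged_state_dict = {}
for shard in sorted(glob.glob("checkpoints/meta-llama/Meta-Llama-3-70B/model-*.safetensors")):
    # load_file returns a plain dict of tensor name -> torch.Tensor,
    # so the downstream weight-remapping logic would not need to change.
    merged_state_dict.update(load_file(shard, device="cpu"))
```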
@jerrymannil thank you! I will make this change and test it out on my other PR tomorrow. |
Surprisingly, Llama 3 switched to the Tiktoken tokenizer from SentencePiece. This PR implements wrappers for both the Tiktoken and SentencePiece tokenizers, as well as adding params for the Llama-3-8B* and -70B* models.
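A minimal sketch of such a wrapper pair is below; this is not the PR's actual code, the class names are illustrative, and the regex pattern and special-token set passed to tiktoken are simplified examples rather than the exact ones Llama 3 ships with.

```python
from abc import ABC, abstractmethod
from pathlib import Path

import tiktoken
from tiktoken.load import load_tiktoken_bpe
from sentencepiece import SentencePieceProcessor


class TokenizerInterface(ABC):
    """Common interface so generate.py can stay tokenizer-agnostic."""

    @abstractmethod
    def encode(self, text: str) -> list[int]: ...

    @abstractmethod
    def decode(self, ids: list[int]) -> str: ...


class SentencePieceWrapper(TokenizerInterface):
    def __init__(self, model_path: Path):
        self.sp = SentencePieceProcessor(model_file=str(model_path))

    def encode(self, text):
        return self.sp.encode(text)

    def decode(self, ids):
        return self.sp.decode(ids)


class TiktokenWrapper(TokenizerInterface):
    def __init__(self, model_path: Path):
        # tokenizer.model here is a tiktoken BPE ranks file, not a
        # SentencePiece model.
        mergeable_ranks = load_tiktoken_bpe(str(model_path))
        self.model = tiktoken.Encoding(
            name=model_path.name,
            # Illustrative split pattern; the real Llama 3 pattern and its
            # reserved special tokens should be taken from Meta's release.
            pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
            mergeable_ranks=mergeable_ranks,
            special_tokens={"<|begin_of_text|>": len(mergeable_ranks)},
        )

    def encode(self, text):
        return self.model.encode(text)

    def decode(self, ids):
        return self.model.decode(ids)
```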
As to scripts/convert_hf_checkpoint.py: Llama 3 on HF no longer ships the pytorch_model-xxxx-xxxx.bin files; instead, the PyTorch model lives in the 'original' sub-directory with a different naming pattern ('consolidated.XX.pth'). For the 8B models it is a single file that just needs to be copied as model.pth into the parent directory; there is no need to mess with the names of the weights.
The original/tokenizer.model file also just needs to be copied into its parent directory (and the Tiktoken tokenizer must be used instead of SentencePieceProcessor).
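Spelled out as a small sketch (paths are illustrative), the 8B flow described above amounts to:

```python
import shutil
from pathlib import Path

# Placeholder path: adjust to wherever the HF checkpoint was downloaded.
checkpoint_dir = Path("checkpoints/meta-llama/Meta-Llama-3-8B")

# The 8B model ships a single consolidated shard: copy it up as model.pth.
shutil.copy(checkpoint_dir / "original" / "consolidated.00.pth",
            checkpoint_dir / "model.pth")

# The tiktoken vocabulary file also just moves up one level.
shutil.copy(checkpoint_dir / "original" / "tokenizer.model",
            checkpoint_dir / "tokenizer.model")
```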
As to the 70B model: it is not covered by this PR, since it is not clear to me how to handle multiple consolidated.XX.pth files with THE SAME weight names in each (unlike the pytorch_model_XXXX-of-XXXXX.bin files, where each .bin contains a distinct subset of the weights).
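For illustration only, here is roughly what merging those shards would involve if they are tensor-parallel splits of the same tensors. Which dimension each weight is split along depends on the layer type, which is exactly the open question above, so the suffix checks below are hypothetical examples, not a working converter.

```python
import glob
import torch

# Load every consolidated.XX.pth shard (placeholder path).
shards = [torch.load(p, map_location="cpu")
          for p in sorted(glob.glob("original/consolidated.*.pth"))]

merged = {}
for name in shards[0]:
    pieces = [shard[name] for shard in shards]
    if name.endswith("wq.weight"):        # example of a column-parallel split
        merged[name] = torch.cat(pieces, dim=0)
    elif name.endswith("wo.weight"):      # example of a row-parallel split
        merged[name] = torch.cat(pieces, dim=1)
    else:                                 # replicated tensors, e.g. norm weights
        merged[name] = pieces[0]
```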