From 6e00f492758fdb350df14cb19870e32c7b1ac788 Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Sun, 28 Apr 2024 23:10:58 -0700
Subject: [PATCH 1/3] Add Llama3-8B perf numbers

---
 README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/README.md b/README.md
index b1210f2..bd8fbfa 100644
--- a/README.md
+++ b/README.md
@@ -89,6 +89,8 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | Llama-2-70B | Base | OOM ||
 | | 8-bit | 19.13 | 1322.58 |
 | | 4-bit (G=32) | 25.25 | 1097.66 |
+| Llama-3-8B | Base | 93.95 | 1508.18 |
+| | 8-bit | 114.35 | 978.02 |
 
 ### Speculative Sampling
 [Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](./scripts/speculate_70B_int4.sh): 48.4 tok/s
@@ -104,6 +106,10 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | | 2 | 21.32 | 1481.87 |
 | | 4 | 38.01 | 1340.76 |
 | | 8 | 62.50 | 1135.29 |
+| Llama-3-8B | 1 | 93.97 | 1508.46 |
+| | 2 | 149.44 | 1358.63 |
+| | 4 | 217.80 | 1218.76 |
+| | 8 | 271.03 | 1041.99 |
 
 ### Tensor Parallelism + Quantization
 | Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |

From 0fb6914ade98123e0cfa56a59a7eb30c6af6c1ad Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Sat, 15 Jun 2024 22:30:58 -0700
Subject: [PATCH 2/3] Update

---
 README.md | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index bd8fbfa..9ad8bc8 100644
--- a/README.md
+++ b/README.md
@@ -89,8 +89,8 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | Llama-2-70B | Base | OOM ||
 | | 8-bit | 19.13 | 1322.58 |
 | | 4-bit (G=32) | 25.25 | 1097.66 |
-| Llama-3-8B | Base | 93.95 | 1508.18 |
-| | 8-bit | 114.35 | 978.02 |
+| Llama-3-8B | Base | 94.25 | 1411.95 |
+| | 8-bit | 139.55 | 1047.23 |
 
 ### Speculative Sampling
 [Verifier: Llama-70B (int4), Draft: Llama-7B (int4)](./scripts/speculate_70B_int4.sh): 48.4 tok/s
@@ -106,10 +106,10 @@ Benchmarks run on an 8xA100-80GB, power limited to 330W with a hybrid cube mesh
 | | 2 | 21.32 | 1481.87 |
 | | 4 | 38.01 | 1340.76 |
 | | 8 | 62.50 | 1135.29 |
-| Llama-3-8B | 1 | 93.97 | 1508.46 |
-| | 2 | 149.44 | 1358.63 |
-| | 4 | 217.80 | 1218.76 |
-| | 8 | 271.03 | 1041.99 |
+| Llama-3-8B | 1 | 94.19 | 1411.76 |
+| | 2 | 150.48 | 1208.80 |
+| | 4 | 219.77 | 991.63 |
+| | 8 | 274.65 | 768.55 |
 
 ### Tensor Parallelism + Quantization
 | Model | Technique | Tokens/Second | Memory Bandwidth (GB/s) |

From 744c9279a856428901613e39feef4270f2d2156c Mon Sep 17 00:00:00 2001
From: Yanbo Liang
Date: Sun, 16 Jun 2024 19:48:29 -0700
Subject: [PATCH 3/3] Update

---
 README.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/README.md b/README.md
index 9ad8bc8..900651a 100644
--- a/README.md
+++ b/README.md
@@ -70,6 +70,7 @@ codellama/CodeLlama-34b-Python-hf
 mistralai/Mistral-7B-v0.1
 mistralai/Mistral-7B-Instruct-v0.1
 mistralai/Mistral-7B-Instruct-v0.2
+meta-llama/Meta-Llama-3-8B
 ```
 
 For example, to convert Llama-2-7b-chat-hf