
Add llama 2 model #2262

Closed
tikikun opened this issue Jul 18, 2023 · 95 comments

Labels
model Model specific 🦙. llama

Comments

@tikikun
Contributor

tikikun commented Jul 18, 2023

Meta just released the Llama 2 model, allowing commercial usage:

https://ai.meta.com/resources/models-and-libraries/llama/

I have checked the model implementation and it seems different from llama_v1; it may need a re-implementation.

@Green-Sky
Collaborator

link to paper: https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

Interesting to note that the model evaluation section in their paper lists a 34B model even though the site doesn't talk about it. I wonder if it'll be available.

Does anyone have access to the models yet? I signed up but haven't received an e-mail. It's not super clear to me if it's meant to be instant or not.

@Green-Sky
Collaborator

Green-Sky commented Jul 18, 2023

Interestingly, the paper talks about a 34B model, which is missing from the model card.
edit: @Azeirah was faster lol

@slaren
Collaborator

slaren commented Jul 18, 2023

The paper implies that they are planning to release the 34B model later.
image

@Green-Sky
Collaborator

@Azeirah no, I did not hear back yet either.

Once your request is approved, you will receive a signed URL over email. Then run the download.sh script, passing the URL provided when prompted to start the download.

Keep in mind that the links expire after 24 hours and a certain amount of downloads. If you start seeing errors such as 403: Forbidden, you can always re-request a link.

Also, they are available on HF if your email is the same: https://huggingface.co/meta-llama

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

I was really hopeful for an alternative to gpt-4 for coding assistance, but the evaluation states their 70B model is about equivalent in performance to gpt-3.5.

Not bad, but the jump in quality from 3.5 to 4 is what made it really useful in day-to-day coding tasks. ;(

Screenshot 2023-07-18 at 19 05 26

At the very least, it does look like the 7B and 13B variants will be amazing local chatbots for low perf devices.

@dmadisetti

I just got access, but the download is flaky, checksums are not matching, and the auth is hit or miss.
Notable are the chat-specific models:

https://github.com/facebookresearch/llama/blob/main/download.sh#L24C1-L43C7

Will update if I am actually able to download these weights

@goranmoomin

The updated model code for Llama 2 is at the same facebookresearch/llama repo, diff here: meta-llama/llama@6d4c0c2

Seems code-wise the only difference is the addition of GQA on the large models, i.e. the repeat_kv part that repeats the same K/V attention heads on larger models so the K/V cache requires less memory.

According to the paper, the smaller models (i.e. the 7B/13B ones) don't use GQA, so in theory they should run unmodified.
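
For illustration only, here is a rough C++ sketch of the idea (not the actual code in either repo): repeat_kv duplicates each of the n_kv_head K/V heads n_rep = n_head / n_kv_head times, so the usual per-head attention can then run unchanged over the expanded heads.

#include <cstddef>
#include <vector>

// Sketch only: kv is [n_kv_head][seq_len][head_dim], flattened row-major.
// Returns [n_kv_head * n_rep][seq_len][head_dim] with each head repeated n_rep times.
std::vector<float> repeat_kv(const std::vector<float> & kv,
                             size_t n_kv_head, size_t seq_len, size_t head_dim,
                             size_t n_rep) {
    std::vector<float> out;
    out.reserve(kv.size() * n_rep);
    for (size_t h = 0; h < n_kv_head; ++h) {
        const size_t off = h * seq_len * head_dim;
        for (size_t r = 0; r < n_rep; ++r) {
            // copy the same head n_rep times
            out.insert(out.end(), kv.begin() + off, kv.begin() + off + seq_len * head_dim);
        }
    }
    return out;
}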

@dmadisetti

Email below with tracking links stripped. Same as llama-1 for the most part. Now if it would actually download.....


You’re all set to start building with Llama 2.

The models listed below are now available to you as a commercial license holder. By downloading a model, you are agreeing to the terms and conditions of the license, acceptable use policy and Meta’s privacy policy.

Model weights available:

Llama-2-7b
Llama-2-7b-chat
Llama-2-13b
Llama-2-13b-chat
Llama-2-70b
Llama-2-70b-chat

With each model download, you’ll receive a copy of the Llama 2 Community License and Acceptable Use Policy, and can find all other information on the model and code on GitHub.

How to download the models:

Visit GitHub and clone [the Llama repository](https://github.com/facebookresearch/llama) from there in order to download the model code
Run the download.sh script and follow the prompts for downloading the models.
When asked for your unique custom URL, please insert the following:
<redacted for legal reasons>
Select which model weights to download

The unique custom URL provided will remain valid for model downloads for 24 hours, and requests can be submitted multiple times.
Now you’re ready to start building with Llama 2.

Helpful tips:
Please read the instructions in the GitHub repo and use the provided code examples to understand how to best interact with the models. In particular, for the fine-tuned chat models you must use appropriate formatting and correct system/instruction tokens to get the best results from the model.

You can find additional information about how to responsibly deploy Llama models in our Responsible Use Guide.

If you need to report issues:
If you or any Llama 2 user becomes aware of any violation of our license or acceptable use policies - or any bug or issues with Llama 2 that could lead to any such violations - please report it through one of the following means:

Reporting issues with the model: Llama GitHub
Giving feedback about potentially problematic output generated by the model: [Llama output feedback](https://developers.facebook.com/llama_output_feedback)
Reporting bugs and security concerns: [Bug Bounty Program](https://facebook.com/whitehat/info)
Reporting violations of the Acceptable Use Policy: [[email protected]](mailto:[email protected])

Subscribe to get the latest updates on Llama and Meta AI.

Meta’s GenAI Team

@swyxio

swyxio commented Jul 18, 2023

anyone else also randomly getting

Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.

for the small files? but /llama-2-7b-chat/consolidated.00.pth is downloading fine it seems. will share checksums when i have them

@BetaDoggo

I tried the 7B and it seems to be working fine, with cuda acceleration as well.

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

anyone else also randomly getting

Resolving download.llamameta.net (download.llamameta.net)... 13.33.88.72, 13.33.88.62, 13.33.88.45, ...
Connecting to download.llamameta.net (download.llamameta.net)|13.33.88.72|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-07-19 01:24:43 ERROR 403: Forbidden.

for the small files? but /llama-2-7b-chat/consolidated.00.pth is downloading fine it seems. will share checksums when i have them

I genuinely just think their servers are a bit overloaded given what I see posted here. It's a big release

@trrahul

trrahul commented Jul 18, 2023

Yeah the GGML models are on hf now.
https://huggingface.co/TheBloke/Llama-2-7B-GGML
https://huggingface.co/TheBloke/Llama-2-13B-GGML

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

Yeah the GGML models are on hf now.
https://huggingface.co/TheBloke/Llama-2-7B-GGML
https://huggingface.co/TheBloke/Llama-2-13B-GGML

Thebloke is a wizard O_O

@Johnhersh

Yeah the GGML models are on hf now.
https://huggingface.co/TheBloke/Llama-2-7B-GGML
https://huggingface.co/TheBloke/Llama-2-13B-GGML

These worked as-is for me

@LoganDark
Contributor

LoganDark commented Jul 18, 2023

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck, what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck, what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

@Johnhersh

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

image

@LoganDark
Contributor

LoganDark commented Jul 18, 2023

Yeah the GGML models are on hf now. https://huggingface.co/TheBloke/Llama-2-7B-GGML https://huggingface.co/TheBloke/Llama-2-13B-GGML

Holy heck, what is this dude's upload speed? I'm watching https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/tree/main fill in live, they're uploading gigabytes of model per minute!

Wouldn't be surprised if he's uploading from a service like AWS or Azure, those have insane bandwidth available.

As in, renting a VPS or dedicated server just to quantize + upload? (actually, come to think of it, that is an official recommendation by huggingface, wouldn't be surprised...)

@LoganDark
Contributor

LoganDark commented Jul 18, 2023

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.

image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

@Azeirah
Contributor

Azeirah commented Jul 18, 2023

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

@Johnhersh

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin

@LoganDark
Contributor

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

Depends on if you're using the quantised or non-quantised version as well, neither of you two posted which model you're using so comparing doesn't make sense :p

Quantized. I'm using llama-2-13b.ggmlv3.q4_1.bin

q4_0 should be even faster for only slightly less accuracy

@Green-Sky
Collaborator

iirc q4_1 has an outdated perf/size tradeoff, use one of the kquants instead. (or q4_0)
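
(For anyone re-quantizing themselves: assuming you have the f16 GGML file, the quantize tool in this repo is invoked roughly as ./quantize <f16-model> <output> <type>, e.g.

./quantize llama-2-13b.ggmlv3.f16.bin llama-2-13b.ggmlv3.q4_K_M.bin q4_K_M

The file names here are illustrative.)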

@nullhook

nullhook commented Jul 18, 2023

image

inferencing with q4_1 on M1 Max (64GB)

2.99 ms per token is slow

@LoganDark
Contributor

It works, but it is veeeeery slow in silicon macs.

Hmm really? On the 13B one I get crazy-good speed.
image

Woah, apple silicon is literally god, I don't get anywhere near those speeds with my 3060 pulling hundreds of watts (:/)

huh nevermind

image

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)
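
(For reference, layer offloading is controlled by the -ngl / --n-gpu-layers flag of main on a GPU build; an invocation that offloads everything looks roughly like

./main -m llama-2-13b-chat.ggmlv3.q4_0.bin -ngl 43 -p "..."

where the layer count is illustrative — a value at or above the model's layer count should offload all layers.)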

@Johnhersh

huh nevermind

image

(llama-2-13b-chat.ggmlv3.q4_0 with all layers offloaded)

How do you offload the layers?

@SlyEcho
Collaborator

SlyEcho commented Jul 19, 2023

I was using @TheBloke's quantized 7B model.

Just passed -c 4096 with no scaling, and a big file (>3000 tokens) with -f, and it was generating coherent text.

@ggerganov
Owner

I think I have a 70B prototype here: #2276

Needs some more work and not 100% sure it is correct, but text generation looks coherent.

@wizzard0
Contributor

Note #2276 breaks non-GQA models:

error loading model: llama.cpp: tensor 'layers.0.attention.wk.weight' has wrong shape; expected  4096 x   512, got  4096 x  4096
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'llama-2-7b.ggmlv3.q2_K.bin'
main: error: unable to load model

@TikkunCreation

TikkunCreation commented Jul 19, 2023

So the chat model uses something like

{BOS}[INST] <<SYS>>
{system}
<</SYS>>

{instruct-0} [/INST] {response-0} {EOS}{BOS}[INST] {instruct-1} [/INST] {response-1} {EOS}{BOS}[INST] {instruct-N} [/INST]

The model generates EOS automatically, but there's no way to insert BOS with the current code in this repo, neither in main nor in server.

For clarity, it uses <s> for BOS and </s> for EOS (I checked with a Python script using tokenizer.model).
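
To make it concrete, a two-turn conversation would be rendered roughly like this (the system prompt and messages are placeholders; <s>/</s> mark where the BOS/EOS tokens go and are tokens, not literal text):

<s>[INST] <<SYS>>
You are a helpful assistant.
<</SYS>>

Hello! [/INST] Hi, how can I help? </s><s>[INST] Tell me a joke. [/INST]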

@jxy
Contributor

jxy commented Jul 19, 2023

I made a simple change to main to add BOS.

diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index bcbcf12..5906cde 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -605,6 +605,8 @@ int main(int argc, char ** argv) {
             // replace end of text token with newline token when in interactive mode
             if (id == llama_token_eos() && params.interactive && !params.instruct) {
                 id = llama_token_newline.front();
+                embd_inp.push_back(llama_token_bos());
+                is_interacting = true;
                 if (params.antiprompt.size() != 0) {
                     // tokenize and inject first reverse prompt
                     const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false);

and run it like

./main -m "$MODEL" -c 4096 -n -1 --in-prefix ' [INST] ' --in-suffix ' [/INST]' -i -p \
"[INST] <<SYS>>
$SYSTEM
<</SYS>>

$FIRST_MESSAGE [/INST]"

I don't know if we want an argument like --insert-bos-after-eos to main.

Regarding <s> and </s>, main or server cannot encode those to BOS or EOS.

@SlyEcho
Collaborator

SlyEcho commented Jul 19, 2023

I think inp_pfx and inp_sfx should also be changed?

@XiongjieDai

Hi! I'm sorry, I'm new on GitHub. I tried to download Llama 2 but it's not working; the cmd program closes without downloading anything after I enter the model name (I downloaded and installed "wget" beforehand, and I don't know how to get "md5sum" on Windows). Can anybody help me please?

If you have Git Bash installed, you can run the .sh file from the Git Bash command line with: bash path/to/script.sh

@jxy
Contributor

jxy commented Jul 20, 2023

I think inp_pfx and inp_sfx should also be changed?

Those are hard coded for the instruct mode

  -ins, --instruct      run in instruction mode (use with Alpaca models)

@ziwang-com

ziwang-com commented Jul 20, 2023

Global first release [2023-07-20]: llama2-map module library architecture diagrams
https://github.com/ziwang-com/AGI-MAP

llama2_generation

@Green-Sky
Collaborator

Green-Sky commented Jul 20, 2023

@ziwang-com those are just call graphs for the Python code. I'm sorry, but the Python code is already simple to read as is; we don't really need those images. (Also, IMHO they feel harder to read than the Python code.)

@sowa705

sowa705 commented Jul 20, 2023

I think inp_pfx and inp_sfx should also be changed?

Those are hard coded for the instruct mode

  -ins, --instruct      run in instruction mode (use with Alpaca models)

Would it be possible to move them into the model file? That would solve the issue of different models having different prompt formats

@viniciusarruda

Is the Meta tokenizer identical to the llama.cpp tokenizer? I think it should be, but I'm having an issue while decoding/encoding.
This is also related to the chat completion format already mentioned above by @kharvd @jxy @TikkunCreation.
You can see the issue in detail and also replicate it here. I'm comparing Meta's original tokenizer with a model from @TheBloke.

@jxy
Contributor

jxy commented Jul 21, 2023

for llama-2-chat, #2304

@jxy
Contributor

jxy commented Jul 21, 2023

and server, #2306

@ggerganov
Owner

70B support should be ready to merge in #2276

Btw, I did some tests with 7Bv2 and the generated texts from short prompts using Q4_0 and Q5_0 definitely feel weird. I wrote more about it in the PR description. Would be nice if other people could confirm the observations.

@kurnevsky
Contributor

It doesn't work with the following input:

llama-cpp -c 4096 -gqa 8 -t 16 -m llama-2-70b.ggmlv3.q4_K_M.bin -p "### HUMAN:\na\n\n### RESPONSE:\nb\n\n### HUMAN:\nb\n\n### RESPONSE:"

The error is GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12.

@WiSaGaN

WiSaGaN commented Aug 18, 2023

The error is GGML_ASSERT: /build/source/ggml.c:10648: ne02 == ne12.

It worked in the vanilla case for me, but I got a similar error when I ran the binary built with "make LLAMA_CLBLAST=1". "-gqa 8" was added in both cases.

@kurnevsky
Contributor

I actually do use LLAMA_CLBLAST, but tested without GPU offloading - I didn't know it would affect execution :)
And I got this error on the model from https://huggingface.co/TheBloke/Llama-2-70B-GGML

@Nyceane

Nyceane commented Sep 14, 2023

@kurnevsky I am having the same problem, were you able to fix it?

@cebtenzzre
Collaborator

cebtenzzre commented Sep 15, 2023

I am having the same problem, were you able to fix it?

See #3002. Known workarounds are to not use the OpenCL backend with LLaMA 2, or to not use k-quants (Q*_K).

@kleenkanteen

@tikikun What do you mean by adding the Llama 2 model when this repo is about the llama model? Also, on the main page, why does it say "Supported models:" and then list a bunch of other LLMs when this repo is just about llama?

@ggerganov
Owner

LLaMA v2 and many other models are currently supported by llama.cpp.
See the status page for more info

@kleenkanteen

kleenkanteen commented Oct 18, 2023 via email

@ggerganov
Owner

No, llama.cpp can run inference for all model architectures listed in the status page. It started just with LLaMA v1, but since then there has been a lot of progress and it now supports a variety of models.
