
Fix #8, fix #41 Add llamacpp support #11

Open
wants to merge 13 commits into main

Conversation

@mmyjona (Contributor) commented Apr 4, 2023

Add basic llama.cpp support via abetlen/llama-cpp-python.
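
For context, here is a minimal sketch of what calling abetlen/llama-cpp-python looks like (the model path and values are illustrative, not the exact code in this branch):

```python
# Minimal llama-cpp-python usage sketch; illustrative, not the exact code in this branch.
from llama_cpp import Llama

# Illustrative path; point this at your own GGML model file.
llm = Llama(model_path="./models/ggml-model-q4_1.bin", n_ctx=512)

# Simple one-shot completion.
output = llm("What is the capital of France?", max_tokens=32)
print(output["choices"][0]["text"])
```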

@mmyjona changed the title from "Feat llama mmyjona" to "Add llamacpp support" on Apr 4, 2023
@mmyjona changed the title from "Add llamacpp support" to "fix #8 Add llamacpp support" on Apr 4, 2023
@mmyjona (Contributor, Author) commented Apr 13, 2023

[screenshot]

@ggerganov commented:

+1 for using abetlen/llama-cpp-python

I think the devs are doing a very good job of supporting the latest llama.cpp code base and providing Python bindings.

@mmyjona changed the title from "fix #8 Add llamacpp support" to "fix #8, #41 Add llamacpp support" on Apr 13, 2023
@mmyjona changed the title from "fix #8, #41 Add llamacpp support" to "Fix #8, #41 Add llamacpp support" on Apr 13, 2023
@mmyjona changed the title from "Fix #8, #41 Add llamacpp support" to "Fix #8, fix #41 Add llamacpp support" on Apr 13, 2023
@ThatcherC commented:

I'm unable to get inference going with the branch - any tips? I've managed to get it to load the alpaca models I have, and it clearly starts up llama.cpp in the background, but the only tokens I get back are single "\n" characters. Here's a bit of the log that shows inference occurring, in case that helps!

INFO:server.lib.api.inference:Path: /api/inference/text/stream, Request: {'prompt': 'What is the capital of france?\n\n', 'models': [{'name': 'llama-local:alpaca-13b', 'tag': 'llama-local:alpaca-13b', 'capabilities': [], 'provider': 'llama-local', 'parameters': {'temperature': 0.95, 'maximumLength': 58, 'topP': 1, 'repetitionPenalty': 1, 'stopSequences': ['Question:', 'User:', 'Bob:', 'Joke:', '### ']}, 'enabled': True, 'selected': True}]}
INFO:server.lib.sseserver:LISTENING TO: inferences
INFO:server.lib.sseserver:LISTENING
INFO:server.app:Received inference request llama-local
INFO:server.lib.inference:Requesting inference from alpaca-13b on llama-local
INFO:werkzeug:192.168.1.243 - - [16/Apr/2023 20:30:24] "POST /api/inference/text/stream HTTP/1.1" 200 -
llama.cpp: loading model from /local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 11359.03 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:server.lib.inference:Completed inference for alpaca-13b on llama-local
INFO:server.lib.api.inference:Done streaming SSE

@mmyjona (Contributor, Author) commented Apr 17, 2023

> I'm unable to get inference going with the branch - any tips? [...]

I can't find anything unusual in the log. Maybe you should add some logging at server/lib/inference/__init__.py line 623, and check whether there is any problem with llama-cpp-python itself.
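
If it helps, one way to rule out llama-cpp-python itself is to call it standalone with the same model and parameters from the logged request above. A rough sanity-check sketch (values copied from that log; not code from this branch):

```python
# Standalone sanity check: drive llama-cpp-python directly with the logged parameters.
from llama_cpp import Llama

llm = Llama(
    model_path="/local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin",
    n_ctx=512,
)

# Stream tokens and print their raw repr so "\n"-only output is easy to spot.
for chunk in llm(
    "What is the capital of france?\n\n",
    max_tokens=58,       # maximumLength
    temperature=0.95,
    top_p=1.0,           # topP
    repeat_penalty=1.0,  # repetitionPenalty
    stop=["Question:", "User:", "Bob:", "Joke:", "### "],
    stream=True,
):
    print(repr(chunk["choices"][0]["text"]))
```

If this standalone call also returns only newlines, the problem is in the prompt or the bindings rather than in this branch's server code.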

@mmyjona (Contributor, Author) commented Apr 17, 2023

> I'm unable to get inference going with the branch - any tips? [...]

Did you set the prompt template file for alpaca? Here is mine as an example, with a short sketch after it showing how the `{prompt}` placeholder gets filled in:

### Instruction:
{prompt}

### Response:
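
To make the `{prompt}` placeholder concrete, here is how a template like this would typically be filled in before the text goes to llama.cpp (the formatting call is illustrative; the branch may wire this up differently):

```python
# Illustrative: fill the alpaca-style template before handing the text to the model.
template = "### Instruction:\n{prompt}\n\n### Response:\n"
full_prompt = template.format(prompt="What is the capital of France?")
print(full_prompt)
```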

@ThatcherC commented:

Aha, that prompt did the trick! Thanks! Do you have example prompts for other models? I didn't realize they used {} formatting, so I was just using the prompts that ship with llama.cpp.

@mmyjona (Contributor, Author) commented Apr 17, 2023

> Aha, that prompt did the trick! [...]

Sure, I added some more in the README file.

4 participants