
Fix #8, fix #41 Add llamacpp support #11

Open
wants to merge 13 commits into main

Conversation

@mmyjona (Contributor) commented Apr 4, 2023

Add basic llama.cpp support via abetlen/llama-cpp-python.
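
For context, here is a minimal sketch of what calling abetlen/llama-cpp-python looks like (the model path and values are illustrative, not the exact code in this branch):

```python
# Minimal llama-cpp-python usage sketch; illustrative, not the exact code in this branch.
from llama_cpp import Llama

# Illustrative path; point this at your own GGML model file.
llm = Llama(model_path="./models/ggml-model-q4_1.bin", n_ctx=512)

# Simple one-shot completion.
output = llm("What is the capital of France?", max_tokens=32)
print(output["choices"][0]["text"])
```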

@mmyjona changed the title from "Feat llama mmyjona" to "Add llamacpp support" on Apr 4, 2023
@mmyjona changed the title from "Add llamacpp support" to "fix #8 Add llamacpp support" on Apr 4, 2023
@mmyjona (Contributor, Author) commented Apr 13, 2023

[screenshot]

@ggerganov commented:

+1 for using abetlen/llama-cpp-python

I think the devs are doing a very good job of supporting the latest llama.cpp code base and providing Python bindings.

@mmyjona changed the title from "fix #8 Add llamacpp support" to "fix #8, #41 Add llamacpp support" on Apr 13, 2023
@mmyjona changed the title from "fix #8, #41 Add llamacpp support" to "Fix #8, #41 Add llamacpp support" on Apr 13, 2023
@mmyjona changed the title from "Fix #8, #41 Add llamacpp support" to "Fix #8, fix #41 Add llamacpp support" on Apr 13, 2023
@ThatcherC commented:

I'm unable to get inference going with the branch - any tips? I've managed to get it to load the alpaca models I have, and it clearly starts up llama.cpp in the background, but the only tokens I get back are single "\n" characters. Here's a bit of the log that shows inference occurring, in case that helps!

INFO:server.lib.api.inference:Path: /api/inference/text/stream, Request: {'prompt': 'What is the capital of france?\n\n', 'models': [{'name': 'llama-local:alpaca-13b', 'tag': 'llama-local:alpaca-13b', 'capabilities': [], 'provider': 'llama-local', 'parameters': {'temperature': 0.95, 'maximumLength': 58, 'topP': 1, 'repetitionPenalty': 1, 'stopSequences': ['Question:', 'User:', 'Bob:', 'Joke:', '### ']}, 'enabled': True, 'selected': True}]}
INFO:server.lib.sseserver:LISTENING TO: inferences
INFO:server.lib.sseserver:LISTENING
INFO:server.app:Received inference request llama-local
INFO:server.lib.inference:Requesting inference from alpaca-13b on llama-local
INFO:werkzeug:192.168.1.243 - - [16/Apr/2023 20:30:24] "POST /api/inference/text/stream HTTP/1.1" 200 -
llama.cpp: loading model from /local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 11359.03 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  =  400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:server.lib.inference:Completed inference for alpaca-13b on llama-local
INFO:server.lib.api.inference:Done streaming SSE

@mmyjona (Contributor, Author) commented Apr 17, 2023

> I'm unable to get inference going with the branch - any tips? [...]

I can't find anything unusual in the log. Maybe you should add some logging at server/lib/inference/__init__.py line 623, and check whether there is any problem with llama-cpp-python itself.
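
If it helps, one way to rule out llama-cpp-python itself is to call it standalone with the same model and parameters from the logged request above. A rough sanity-check sketch (values copied from that log; not code from this branch):

```python
# Standalone sanity check: drive llama-cpp-python directly with the logged parameters.
from llama_cpp import Llama

llm = Llama(
    model_path="/local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin",
    n_ctx=512,
)

# Stream tokens and print their raw repr so "\n"-only output is easy to spot.
for chunk in llm(
    "What is the capital of france?\n\n",
    max_tokens=58,       # maximumLength
    temperature=0.95,
    top_p=1.0,           # topP
    repeat_penalty=1.0,  # repetitionPenalty
    stop=["Question:", "User:", "Bob:", "Joke:", "### "],
    stream=True,
):
    print(repr(chunk["choices"][0]["text"]))
```

If this standalone call also returns only newlines, the problem is in the prompt or the bindings rather than in this branch's server code.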

@mmyjona (Contributor, Author) commented Apr 17, 2023

> I'm unable to get inference going with the branch - any tips? [...]

Did you set the prompt template file for alpaca? Here is mine as an example, with a short sketch after it showing how the `{prompt}` placeholder gets filled in:

### Instruction:
{prompt}

### Response:
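
To make the `{prompt}` placeholder concrete, here is how a template like this would typically be filled in before the text goes to llama.cpp (the formatting call is illustrative; the branch may wire this up differently):

```python
# Illustrative: fill the alpaca-style template before handing the text to the model.
template = "### Instruction:\n{prompt}\n\n### Response:\n"
full_prompt = template.format(prompt="What is the capital of France?")
print(full_prompt)
```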

@ThatcherC commented:

Aha, that prompt did the trick! Thanks! Do you have example prompts for other models? I didn't realize they used {} formatting, so I was just using the prompts that ship with llama.cpp.

@mmyjona (Contributor, Author) commented Apr 17, 2023

> Aha, that prompt did the trick! [...]

Sure, I added some more in the README file.

4 participants