-
Works fine on my end:

```console
$ curl -s --request POST --url http://127.0.0.1:7020/v1/chat/completions \
    --header "Content-Type: application/json" \
    --data '{"messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "Hello, how are you today?" } ], "n_predict": 64}' | jq
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm doing well, thank you for asking! I'm a large language model, so I don't have emotions like humans do, but I'm always ready to help and assist with any questions or tasks you may have. How about you? How's your day going so far?",
        "role": "assistant"
      }
    }
  ],
  "created": 1727176285,
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 58,
    "prompt_tokens": 28,
    "total_tokens": 86
  },
  "id": "chatcmpl-vB0PElZDtrMn2PXwBDhF7tkNl54IFFQl"
}
```

Can you show the output of the command above?
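For reference, the same request made through the OpenAI Python client instead of curl would look roughly like this. This is a minimal sketch, assuming openai >= 1.0 and the same 127.0.0.1:7020 endpoint as the curl example above; the dummy API key and placeholder model name are assumptions, not values from the original comment.

```python
from openai import OpenAI

# Point the client at the local llama.cpp server from the curl example above.
# The API key is a dummy value; it is only checked if the server was started with --api-key.
client = OpenAI(base_url="http://127.0.0.1:7020/v1", api_key="sk-no-key-required")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder; the server answers with whichever model it has loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you today?"},
    ],
    max_tokens=64,  # roughly analogous to "n_predict": 64 in the curl payload
)

print(response.choices[0].message.content)
```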
-
I'm using the following GGUF model: https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/blob/main/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf

I'm starting the llama.cpp server by running the following command:

When I now try to do a completion with the OpenAI client, it times out (see the sketch at the bottom of this thread):

And here's the log from the llama.cpp server corresponding to the completion request that just times out:

It doesn't return anything. I've tried setting `--n-predict -2` according to #3969 (comment), but that makes the model produce only a single token.

-

I have the same issue with Llama 3: the OpenAI API server just does not work...
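Since the client script that times out is not shown above, here is a hedged sketch of the kind of call that would hit this behaviour, with an explicit client-side timeout so it fails fast instead of hanging. The port 8080, the model name, and the dummy API key are invented placeholders, not values taken from the report.

```python
from openai import OpenAI, APITimeoutError

# Hypothetical reproduction of the hanging request; host, port, API key, and
# model name are placeholders, since the original script is not shown above.
client = OpenAI(
    base_url="http://127.0.0.1:8080/v1",
    api_key="sk-no-key-required",
    timeout=30.0,  # fail after 30 s instead of waiting on the default timeout
)

try:
    response = client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct-Q6_K_L",  # placeholder
        messages=[{"role": "user", "content": "Hello, how are you today?"}],
        max_tokens=64,  # cap generation so an endless reply cannot stall the request
    )
    print(response.choices[0].message.content)
except APITimeoutError:
    print("request timed out; check the server log for the matching request")
```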