Hello everybody,
I'm new to llama-cpp-python, and I'm trying to work out how to collect statistics such as speed in tokens/second (for both prompt processing and generation) after a call to llm.create_completion(...). I can measure the elapsed times myself with time.perf_counter(), and, at least in streaming mode, count the generated tokens by counting the chunks yielded by create_completion(), but I don't see how to get the number of tokens in the prompt.
Inspecting the source code of Llama.py, I found that in one case (at line 1718) _create_completion() yields a dict containing an item with key "usage", whose value is a dictionary holding the lengths of prompt_tokens[] and completion_tokens[] plus the sum of the two. But it's not clear to me how to make that yield() path run, or why "usage" is absent from the other calls to yield() (at least in the last chunk yielded: that's what the ollama-python library does, so it shouldn't be difficult).
Is there any other way to find out the number of tokens in the prompt?