How to generate longer token streams? #27
I have some relevant questions here. Thanks in advance for reading and answering them :)

Q2: If the above understanding is correct (i.e., response length is still limited by the context length for a single prompt when using StreamingLLM), would it help to use multiple follow-up questions to prolong the response or split the input? For example, is StreamingLLM potentially a good tool for generating long code that exceeds the context length across multiple prompts?

Q3: This question relates to the long-term memory of inputs (FAQ 3). When extending output with follow-up questions, how many follow-up questions is a reasonable threshold for the first prompt still being considered in generating the response (if long-term memory is not applicable, as in FAQ 3)?

Q4: Similarly, if the input is split into several follow-up questions, would StreamingLLM be capable of achieving something like RAG, e.g., considering earlier prompts when generating the latest response? If so, how many earlier prompts would it consider?

Q5: Are there thresholds to consider when choosing the number of follow-up prompts? I noticed that mt_bench.jsonl mostly has two turns. Does that mean StreamingLLM mostly remembers only the (n-1)th prompt as the earliest input when generating the nth response?
Hello,

You're observing this because our text generation function terminates once an EOS (end-of-sequence) token is produced. You can see this behavior in run_streaming_llama.py, line 54. Given the nature of our questions, the model doesn't always need to produce extensive answers. For generating longer texts, I recommend referring to our perplexity evaluation code in eval_long_ppl.py.

Guangxuan
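For readers who want to see where that stop condition lives, here is a simplified sketch of a greedy decoding loop in the style of run_streaming_llama.py. This is a paraphrase under assumptions (function and variable names are illustrative, not the exact repository code); the key point is the break on the EOS token, which can end generation well before max_gen_len is reached.

```python
import torch

@torch.no_grad()
def greedy_generate_sketch(model, tokenizer, input_ids, past_key_values, max_gen_len):
    # Prefill: run the prompt through the model once to populate the KV cache.
    outputs = model(input_ids=input_ids, past_key_values=past_key_values, use_cache=True)
    past_key_values = outputs.past_key_values
    pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
    generated_ids = [pred_token_idx.item()]

    # Decode one token at a time, up to max_gen_len tokens.
    for _ in range(max_gen_len - 1):
        outputs = model(input_ids=pred_token_idx, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        pred_token_idx = outputs.logits[:, -1, :].argmax(dim=-1).unsqueeze(1)
        generated_ids.append(pred_token_idx.item())

        # The early-stop condition discussed above: as soon as the model emits its
        # end-of-sequence token, generation stops, no matter how large max_gen_len is.
        # Raising max_gen_len alone therefore does not lengthen the output.
        if pred_token_idx.item() == tokenizer.eos_token_id:
            break

    return tokenizer.decode(generated_ids, skip_special_tokens=True)
```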
Hello Guangxuan,

Thank you so much for the helpful answer! Assuming the cache has enough space to support a very large recent_size, so that the amount of chat history in the cache can keep growing, do you think there is any upper limit on StreamingLLM's ability to use the cached chat history as "long-term memory" when generating the current response?

Many thanks!
Zhaoxin
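For context on why the cached history is bounded: StreamingLLM keeps only a few initial "attention sink" tokens plus a rolling window of the most recent tokens in the KV cache, and everything in between is evicted, so it can no longer influence later responses. Below is a minimal sketch of that sink-plus-recent eviction with hypothetical names (start_size and recent_size mirror the parameter quoted later in this thread); it is an illustration under assumptions, not the repository's exact kv_cache.py implementation, and it assumes the legacy per-layer (key, value) tuple format.

```python
import torch

def evict_for_space_sketch(past_key_values, start_size=4, recent_size=512):
    """Keep the first `start_size` tokens (attention sinks) and the last
    `recent_size` tokens; drop everything in between. Hypothetical sketch."""
    trimmed = []
    for k, v in past_key_values:          # one (key, value) pair per layer
        seq_len = k.size(2)               # shape: [batch, heads, seq_len, head_dim]
        if seq_len <= start_size + recent_size:
            trimmed.append((k, v))        # nothing to evict yet
            continue
        keep_k = torch.cat([k[:, :, :start_size], k[:, :, seq_len - recent_size:]], dim=2)
        keep_v = torch.cat([v[:, :, :start_size], v[:, :, seq_len - recent_size:]], dim=2)
        trimmed.append((keep_k, keep_v))
    return trimmed
```

In other words, the effective "long-term memory" is capped at roughly start_size + recent_size tokens of history, regardless of how many turns the conversation has.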
I have everything running on Python 3.10 under Ubuntu 22.04 with 2x 24 GB GPUs.
I tested the original and revised versions of mt_bench.jsonl, and the output is good with a 70B 4-bit GPTQ model.
I am trying to increase the number of tokens streamed, but it appears to be fixed for each generation.
I edited run_streaming_llama.py, line 61:
def streaming_inference(model, tokenizer, prompts, kv_cache=None, max_gen_len=10000):
but the output length is similar to the default of 2000.
I also edited recent_size=512 in kv_cache.py to similar values, but the output length remains the same.
I would appreciate any options and/or edits required to generate 10000+ tokens.
Cheers
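To make the symptom concrete: neither of the two edits above changes the stop condition. max_gen_len is only an upper bound on the number of generated tokens, and recent_size only controls how much history the KV cache retains; generation still ends as soon as the model emits an EOS token, as explained in the maintainer's reply. A tiny self-contained toy (not the repository code) illustrating the effect:

```python
import random

EOS_TOKEN = 0
max_gen_len = 10000   # the knob edited in streaming_inference above
# recent_size affects cache eviction (see the earlier sketch), not response length.

random.seed(0)
generated = []
for _ in range(max_gen_len):
    token = random.randint(0, 50)   # stand-in for the model's argmax over logits
    generated.append(token)
    if token == EOS_TOKEN:          # the early-stop condition from run_streaming_llama.py
        break

print(f"requested up to {max_gen_len} tokens, generated {len(generated)}")
```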