
Inference drops characters when using a ggmlv3 q6_K model #30

Closed
wennycooper opened this issue Sep 21, 2023 · 3 comments

@wennycooper

wennycooper commented Sep 21, 2023

Hello,
I used ggml to quantize the model into q6_K format, then ran inference with the following code:

```python
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# load Llama-2 model
llm = LlamaCpp(
    model_path="/workspace/test/TaiwanLLama_v1.0/Taiwan-LLaMa-13b-1.0.ggmlv3.q6_K.bin",
    n_gpu_layers=16,
    n_batch=8,
    n_ctx=2048,
    temperature=0.1,
    max_tokens=512,
    callback_manager=callback_manager,
)

# response = run_simple_qa(llm, query)
prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")
response = llm(prompt)

```
The output drops characters, as shown below:

深度學是機器學的一子集,基人工神經結。使得計算機能通別模式大量中學,而不需要明編程。深度學算法用分、進行和別模式

@PenutChen

PenutChen commented Sep 22, 2023

Judging from the comment in your code, "Callbacks support token-wise streaming", this is very likely an issue with StreamingStdOutCallbackHandler. You could search for existing issues relating StreamingStdOutCallbackHandler to UTF-8 CJK characters, or simply skip streaming and print the response after inference finishes.
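A minimal sketch of that workaround, reusing the model path and parameters from the snippet above; the only change is dropping the streaming callback and printing the full response once inference completes:

```python
from langchain.llms import LlamaCpp

# Same configuration as the original snippet, minus the streaming callback:
# the whole response is decoded in one pass, so multi-byte CJK characters
# are never split across partial prints.
llm = LlamaCpp(
    model_path="/workspace/test/TaiwanLLama_v1.0/Taiwan-LLaMa-13b-1.0.ggmlv3.q6_K.bin",
    n_gpu_layers=16,
    n_batch=8,
    n_ctx=2048,
    temperature=0.1,
    max_tokens=512,
)

prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")
response = llm(prompt)
print(response)  # print once at the end instead of streaming token by token
```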

This kind of streamed output runs into similar problems very easily when a BPE tokenizer meets CJK characters.
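To illustrate the failure mode (a toy demonstration, not the library's actual decoding path): a single CJK character is three UTF-8 bytes, and a byte-level BPE tokenizer can emit those bytes across separate tokens, so decoding each fragment on its own drops the character:

```python
# Toy demonstration of why streaming byte-level tokens can drop CJK text.
text = "深"                                     # one CJK character
data = text.encode("utf-8")                     # b'\xe6\xb7\xb1' -- three bytes
first, rest = data[:1], data[1:]                # pretend the tokenizer split it here
print(first.decode("utf-8", errors="ignore"))   # "" -- incomplete sequence is dropped
print(rest.decode("utf-8", errors="ignore"))    # "" -- also dropped
print((first + rest).decode("utf-8"))           # "深" -- only whole sequences decode
```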

@PenutChen

PenutChen commented Sep 22, 2023

@wennycooper I just tested this myself and found that the model you are using is in the ggml v3 format, which has already been marked as deprecated by the ggml project.
So I used the gguf format instead; please refer to this repo from Audrey Tang.
I tested with q4_0 here and the output looks fine, so the issue may be related to package versions.

```python
# CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama_cpp_python==0.2.6
# pip install langchain==0.0.298

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.llms import LlamaCpp

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# load Llama-2 model
llm = LlamaCpp(
    model_path="/path/to/Taiwan-LLaMa-13b-1.0.Q4_0.gguf",
    n_gpu_layers=100,
    n_batch=8,
    n_ctx=512,
    temperature=0.1,
    max_tokens=512,
    callback_manager=callback_manager,
)

# response = run_simple_qa(llm, query)
prompt_template = """A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"""
prompt = prompt_template.format("什麼是深度學習?")
response = llm(prompt)
print(response)
```

[screenshot: model output from the q4_0 gguf test]

Edit: I just tested gguf q6_k as well, and the result is also fine.

[screenshot: model output from the gguf q6_k test]
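If you would rather keep using the original ggmlv3 file, llama.cpp also ships a converter from the deprecated ggml v3 format to gguf. A sketch of the invocation follows; the script name and flags have changed across llama.cpp versions, so treat them as assumptions and check --help first:

```python
# Hypothetical conversion of the deprecated ggml v3 file to gguf using
# llama.cpp's converter script; the script name and flags vary by
# llama.cpp version, so verify with --help before running.
#
# python convert-llama-ggmlv3-to-gguf.py \
#     --input  Taiwan-LLaMa-13b-1.0.ggmlv3.q6_K.bin \
#     --output Taiwan-LLaMa-13b-1.0.q6_K.gguf
```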

@wennycooper
Author

Thanks!
