
Getting 'internal server error' while running the ollama_demo file for LightRAG with various smaller models. NEED HELP. #498

shalini-agarwal opened this issue Dec 20, 2024 · 1 comment


@shalini-agarwal

INFO:httpx:HTTP Request: POST http://127.0.0.1:11434/api/chat "HTTP/1.1 500 Internal Server Error" ....... ollama._types.ResponseError: POST predict: Post "http://127.0.0.1:52582/completion": EOF

I am trying to run the lightrag_ollama_demo.py file from the examples folder in the GitHub repository. I keep hitting this error: Ollama returns an internal server error and stops midway through entity extraction. I have tried Llama3.2:1b, TinyLlama, Phi, and Qwen2.5:0.5b as LLMs, with nomic-embed-text, mxbai-embed-large, and snowflake-arctic-embed:22m as embedding models. I have tried different combinations of LLM and embedding model, but I get the same error with all of them. With Qwen it did work a few times, but other times I got the error again.

I saw that others have hit this error too; some suggested changing OLLAMA_KV_CACHE_TYPE to q8_0, and others said it has been fixed by recent changes. I tried setting the KV cache type with launchctl setenv OLLAMA_KV_CACHE_TYPE q8_0 in my terminal, but even that didn't help. I also pulled all the recent changes only two days ago, yet I am still getting this error.
Here is my Ollama log, if that helps:
[screenshot of the Ollama server log]


blakkd commented Dec 25, 2024

msg="truncating input prompt" limit=2048 prompt=4157 keep=5 new=2048

It looks like Ollama is loading your TinyLlama with its default context size, which is 2048 tokens.
You should verify your model with ollama show --modelfile [your model name].
Maybe you created a model from a Modelfile (e.g. with a larger num_ctx) but pointed your LightRAG script at the base Ollama model (the one you got via ollama run or ollama pull)?
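
For what it's worth, here's a minimal sketch of the client-side equivalent using the ollama Python package directly (the model name and context size are placeholders, not from the demo): passing options={"num_ctx": ...} overrides the 2048-token default for that request, which has the same effect as baking PARAMETER num_ctx into a custom Modelfile. Your LightRAG script needs to forward an equivalent option, or use such a custom model, otherwise the entity-extraction prompts get truncated exactly as in your log.

```python
import ollama

# Sanity check outside LightRAG: request a context window large enough for the
# ~4.2K-token prompt shown in the truncation log above (values are placeholders).
resp = ollama.chat(
    model="tinyllama",
    messages=[{"role": "user", "content": "Say hello."}],
    options={"num_ctx": 8192},  # overrides Ollama's 2048-token default for this call
)
print(resp["message"]["content"])
```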

Another thing: I'm not sure yet, as I'd need to experiment more with this, but I'm wondering whether the OLLAMA_NUM_PARALLEL env variable affects embedding speed. Just in case, I personally set it to 1.

Last thing: you can set OLLAMA_FLASH_ATTENTION to save a bit of RAM/VRAM.
With that alone you can already get more context length (e.g. in my case going from 11K to 15K) without touching the KV cache quantization level.
I personally had a bad experience reducing the KV cache to q8_0 on small models (8b), which led to incoherent, repetitive answers or collapses... fp16 seemed fine.
Larger models (32b in my case) seem less affected, but I'd still rather not touch it.
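
Note that these are server-side variables (same for OLLAMA_NUM_PARALLEL and OLLAMA_KV_CACHE_TYPE above): they are read when the Ollama server starts, so you need to restart Ollama after changing them. Purely as an illustration (not part of the demo), one way to be sure they are picked up is to launch the server yourself with the desired environment:

```python
import os
import subprocess

# Illustration only: start "ollama serve" with the variables discussed above
# set in its environment. They take effect at server startup, not per request.
env = dict(
    os.environ,
    OLLAMA_NUM_PARALLEL="1",      # serve one request at a time
    OLLAMA_FLASH_ATTENTION="1",   # enable flash attention to save RAM/VRAM
    OLLAMA_KV_CACHE_TYPE="f16",   # keep the KV cache at fp16 (see caveat above)
)
server = subprocess.Popen(["ollama", "serve"], env=env)
```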

Also make sure you set embedding_dim in your LightRAG script, as per my other comment here: #503 (comment)
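
For completeness, a rough sketch of what that looks like, assuming the helper names from the examples folder at the time (EmbeddingFunc, ollama_embedding) — double-check them against your copy. The dimension has to match the embedding model you actually pull: nomic-embed-text is 768-dimensional and mxbai-embed-large is 1024-dimensional.

```python
from lightrag.utils import EmbeddingFunc
from lightrag.llm import ollama_embedding  # helper name may differ in your version

# embedding_dim must match the model actually served by Ollama:
#   nomic-embed-text  -> 768
#   mxbai-embed-large -> 1024
embedding_func = EmbeddingFunc(
    embedding_dim=768,
    max_token_size=8192,
    func=lambda texts: ollama_embedding(texts, embed_model="nomic-embed-text"),
)
```

This then goes into the LightRAG(...) constructor as embedding_func in the demo script.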

I hope this already helps, but share more details if you need.
