Running model from a GGUF file, only #326
Comments
Did you add the HuggingFace token? I got the same error. Here are the ways you can add it: https://github.com/EricLBuehler/mistral.rs?tab=readme-ov-file#getting-models-from-hf-hub. This thread also helped me, as I was getting a 403 error after that: https://discuss.huggingface.co/t/error-403-what-to-do-about-it/12983. I had to accept the Llama license. |
@joshpopelka20 I want to run a model from a local GGUF file only - exactly the same way as in llama.cpp. Communication with HF (or any other) servers shouldn't ever be required for that. |
A recent issue also showed UX problems with this: #295 (comment). UPDATE: This local model support seems to be a very new feature, which might explain the current UX issues: #308. I found the README a bit confusing too vs llama.cpp for local GGUF (it doesn't help that it refers to terms you need to configure but then uses short option names, and the linked CLI args output also appears outdated compared to what a git build shows). I was not able to use absolute or relative paths in a way that worked. Like you, it still fails, but here is the extra output showing why:
$ RUST_BACKTRACE=1 ./mistralrs-server --token-source none gguf -m . -t . -f model.gguf
2024-05-18T01:41:28.388727Z INFO mistralrs_server: avx: true, neon: false, simd128: false, f16c: true
2024-05-18T01:41:28.388775Z INFO mistralrs_server: Sampling method: penalties -> temperature -> topk -> topp -> multinomial
2024-05-18T01:41:28.388781Z INFO mistralrs_server: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-18T01:41:28.388828Z INFO mistralrs_server: Model kind is: quantized from gguf (no adapters)
2024-05-18T01:41:28.388869Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:28.658484Z INFO mistralrs_core::pipeline::gguf: Loading `"tokenizer.json"` locally at `"./tokenizer.json"`
2024-05-18T01:41:29.024145Z INFO mistralrs_core::pipeline::gguf: Loading `"config.json"` locally at `"./config.json"`
2024-05-18T01:41:29.024256Z INFO hf_hub: Token file not found "/root/.cache/huggingface/token"
2024-05-18T01:41:29.333151Z INFO mistralrs_core::pipeline: Loading `"model.gguf"` locally at `"./model.gguf"`
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:290:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: tokio::runtime::runtime::Runtime::block_on
5: mistralrs_server::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

Note the relevant code paths:
mistral.rs/mistralrs-core/src/pipeline/mod.rs Lines 180 to 187 in ca9bf7d
mistral.rs/mistralrs-core/src/pipeline/mod.rs Line 210 in ca9bf7d
mistral.rs/mistralrs-server/src/main.rs Lines 94 to 95 in ca9bf7d
Initial attempt
Let's follow the problem from the CLI to the Hugging Face API call with the token. EDIT: Collapsed for brevity (not relevant).
mistral.rs/mistralrs-server/src/main.rs Line 245 in ca9bf7d
mistral.rs/mistralrs-server/src/main.rs Lines 284 to 296 in ca9bf7d
mistral.rs/mistralrs-core/src/pipeline/gguf.rs Lines 278 to 303 in ca9bf7d
mistral.rs/mistralrs-core/src/pipeline/macros.rs Lines 133 to 138 in ca9bf7d
mistral.rs/mistralrs-core/src/utils/tokens.rs Lines 15 to 18 in ca9bf7d
The macro calls the Hugging Face API and adds the token via https://docs.rs/hf-hub/latest/hf_hub/api/sync/struct.ApiBuilder.html#method.with_token:

/// Sets the token to be used in the API
pub fn with_token(mut self, token: Option<String>) -> Self {
    self.token = token;
    self
}

fn build_headers(&self) -> HeaderMap {
    let mut headers = HeaderMap::new();
    let user_agent = format!("unkown/None; {NAME}/{VERSION}; rust/unknown");
    headers.insert(USER_AGENT, user_agent);
    if let Some(token) = &self.token {
        headers.insert(AUTHORIZATION, format!("Bearer {token}"));
    }
    headers
}

Because an empty string was passed in, it passes that conditional and we add the Authorization HTTP header with an empty bearer token.

Next up, back in:
mistral.rs/mistralrs-core/src/pipeline/macros.rs Lines 147 to 154 in ca9bf7d
Workaround
I'm terrible at debugging, so I sprinkled a bunch of debug statements around:
mistral.rs/mistralrs-core/src/pipeline/macros.rs Lines 180 to 191 in ca9bf7d
The mistral.rs/mistralrs-core/src/pipeline/macros.rs Lines 2 to 18 in ca9bf7d
I'm not familiar with what this part of the code is trying to do, but for local/offline use the HF API shouldn't be queried at all... yet it seems to be enforced? I bypassed the panic with a check along these lines:

Example:
if resp.into_response().is_some_and(|r| !(matches!(r.status(), 401 | 404))) {

The proper solution is probably to opt out of the HF API entirely, though? |
Hi @MoonRide303! Our close integration with the HF hub is intentional, as generally it is better to use the official tokenizer. However, I agree that it would be nice to enable loading from only a GGUF file. I'll begin work on this, and it shouldn't be too hard.
I think this behavior can be improved, I'll make a modification. |
Agree with #326 (comment). The prior PR was the minimal change needed to load a known HF model locally. It is an awkward UX for a local-only model. |
I think there is a strong use case for loading from a file without access to Hugging Face. HF is good! But if you're trying to use an LLM in production, it's another failure point if your access to HF goes down. Also, there is always the risk that the creators of the LLM might deny access to the repo at some point in the future. Anyway, trying to get this to work locally now with the Rust library. |
Yes, especially when using a GGUF file, since otherwise there is always ISQ. I'm working on adding this in #345.
Ah, sorry, that was an oversight. I just merged #348, which exposes those, along with Device, DType, and a few other useful types, so that you do not need to explicitly depend on our Candle branch. |
@MoonRide303, @polarathene, @Jeadie, @joshpopelka20, @ShelbyJenkins I just merged #345, which enables using the GGUF tokenizer. The implementation is tested against the HF tokenizer in CI, so you have a guarantee that it is correct. This is the applicable README section. Here is an example:
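For illustration, a command of the shape that ends up working later in this thread loads a local GGUF file and builds the tokenizer from the file's own metadata (the GGUF file name and chat template path below are placeholders):

$ ./mistralrs-server -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f model.gguf

The intent is that, run from the directory containing the GGUF file, no separate tokenizer.json is required.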
I would appreciate your thoughts on how this can be improved! |
@EricLBuehler Not strictly related to this issue, but I updated to the current CUDA version (12.5) a few days ago, and mistral.rs (as of v0.1.11) no longer compiles. Not blocking compilation with newer (and possibly backward-compatible) versions of CUDA would definitely be an improvement, allowing me to verify if / how the fix works ^^ (alternative: provide binary releases).
|
@MoonRide303, yes, but unfortunately that's a problem higher up in the dependency graph. There's a PR for that here: coreylowman/cudarc#238, and I'll let you know when it gets merged. Alternatively, could you try out one of our docker containers: https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs |
I am currently working on a project where I need to run a GGUF model purely locally. Could you please provide an example of how to invoke this? Thank you for your assistance! |
Absolutely, here is a simple example of running a GGUF model purely locally:
Please feel free to let me know if you have any questions! |
@EricLBuehler I tried running it, but I encountered the following error:
Since this is a local example, I assumed that the HuggingFace Token wouldn't be necessary. Is this not the case? |
Hi @solaoi, that should be fixed now. Can you please try it again after pulling the latest changes? |
|
@EricLBuehler Pros: it compiles. Cons: doesn't work.
Not sure what might be causing this - but llama.cpp compiles and works without issues, so I'd assume my env is fine. |
@MoonRide303, it looks like you are using Windows. This issue has been reported here (coreylowman/cudarc#219) and here (huggingface/candle#2175). Can you add the path to your LD_LIBRARY_PATH? |
@EricLBuehler .so ELFs and LD_LIBRARY_PATH won't work on Windows. I am compiling and using dynamically linked CUDA-accelerated llama.cpp builds without issues, so CUDA .dlls should be in my path already.
|
Right, sorry, my mistake. On Windows, do you know if you have multiple CUDA installations? Can you run: |
@EricLBuehler Got some, but those were just empty dirs from old versions:
I removed all except 12.5, and it didn't help in any way. But it shouldn't matter as long as the necessary .dlls are in the path (and they are). |
I did a fresh build. @EricLBuehler I assume the reason you're not experiencing that is because of your default token source. I pointed out the 401 issue earlier; it can be bypassed with a patch, but the proper solution would be to skip calling out to HF in the first place? The loader's HF method isn't doing much beyond getting the paths and then implicitly calling the local method with that extra data?:
mistral.rs/mistralrs-core/src/pipeline/gguf.rs Lines 282 to 291 in 527e7f5
What is the actual minimum set of paths needed? Can that whole method be skipped if the paths are provided locally? Is the chat template required, or can it fall back to a default (perhaps with a warning)? Otherwise this macro is presumably juggling the conditions of an API call vs a fallback/alternative? (Also, I'm not too fond of duplicating the macro to adjust for local GGUF.)
mistral.rs/mistralrs-core/src/pipeline/macros.rs Lines 156 to 244 in 527e7f5
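As a rough illustration of the local-first lookup being asked about here (the helper below is hypothetical and not part of the mistral.rs codebase): use a file if it already sits next to the model, and only otherwise fall back to the HF hub or a default.

use std::path::{Path, PathBuf};

// Hypothetical helper: prefer a file that already exists in the local model
// directory; a None result leaves the caller free to query the HF hub or
// fall back to a default (e.g. a default chat template, with a warning).
fn resolve_local(model_dir: &Path, filename: &str) -> Option<PathBuf> {
    let candidate = model_dir.join(filename);
    candidate.exists().then_some(candidate)
}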
Ah I see the paths struct here: mistral.rs/mistralrs-core/src/pipeline/mod.rs Lines 98 to 111 in 527e7f5
How about this?:
I am a bit more familiar with this area of the project now; I might be able to take a shot at it once my active PR is merged 😅

Original response
Perhaps I am not using the command correctly:

Attempts
You can ignore most of this. From the repo directory:
$ RUST_BACKTRACE=1 target/release/mistralrs-server gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-30T00:06:47.470010Z INFO mistralrs_core::pipeline::gguf: Loading model `/models/Hermes-2-Pro-Mistral-7B.Q4_K_M` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T00:06:47.508433Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_display
3: core::option::expect_failed
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: tokio::runtime::context::runtime::enter_runtime
7: mistralrs_server::main

Error: "no entry found for key".

From the model directory, using an absolute path to the server binary:
$ RUST_BACKTRACE=1 /mist/target/release/mistralrs-server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main

401 Unauthorized. Just to double-check, I copied the server binary into the model directory:
$ RUST_BACKTRACE=1 ./server gguf -m . -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main

401 Unauthorized again.
$ RUST_BACKTRACE=1 ./server --token-source none gguf -m /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -t /models/Hermes-2-Pro-Mistral-7B.Q4_K_M -f Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(404, Response[status: 404, status_text: Not Found, url: https://huggingface.co//models/Hermes-2-Pro-Mistral-7B.Q4_K_M/resolve/main/tokenizer.json]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf::{{closure}}
3: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
4: tokio::runtime::context::runtime::enter_runtime
5: mistralrs_server::main

So it's still trying to connect to HF 🤷‍♂️ (because of the mandatory tokenizer lookup). This model was one that you mentioned had a duplicate field (that error isn't being encountered here, although previously I had to add a patch to bypass a 401 panic, which you can see above). |
@MoonRide303, @polarathene, the following command works on my machine after I merged #362:
Note: as documented in the README here, you need to specify the model id, file, and chat template when loading a local GGUF model without using the HF tokenizer. If you are using the HF tokenizer, you may specify the tokenizer model ID (-t) instead.
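For instance, the two invocations would look roughly like this (the flags mirror commands used elsewhere in this thread; file names, paths, and the tokenizer model ID are placeholders):

# GGUF tokenizer: model id, file, and chat template are given explicitly
$ ./mistralrs-server -i --token-source none --chat-template chat_templates/mistral.json gguf -m . -f model.gguf

# HF tokenizer: -t names the repo that tokenizer.json is resolved from
$ ./mistralrs-server -i gguf -m . -t owner/original-model-repo -f model.gguf

Depending on the model, the --chat-template flag may still be needed in the second form as well.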
Yes, it just queries the HTTP side and, if that fails, treats them as local paths. My thinking was that we should always try HTTP first, but maybe you can flip that in a future PR?
Not really, the
That seems like a great idea, perhaps |
Just realized you were referencing a change from the past hour; I built again and your example works properly now 🎉

Original response
$ RUST_BACKTRACE=1 ./server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/mistral-7b-instruct-v0.1.Q4_K_M.gguf
thread 'main' panicked at mistralrs-core/src/pipeline/gguf.rs:282:58:
RequestError(Status(401, Response[status: 401, status_text: Unauthorized, url: https://huggingface.co/api/models/revision/main]))
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
3: tokio::runtime::context::runtime::enter_runtime
4: mistralrs_server::main

That is the same model as you used? I saw you link to it the other day (it might be worth having the link next to the README example if troubleshooting with a common model is advised).

$ ./server --version
mistralrs-server 0.1.11
$ git log
commit 527e7f5282c991d399110e21ddbef6c51bba607c (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Eric Buehler <[email protected]>
Date: Wed May 29 10:12:24 2024 -0400
Merge pull request #360 from EricLBuehler/fix_unauth
Fix no auth token for local loading

Oh... I mistook the PR you referenced for an older one; I see that's new. However, the same command with only the GGUF file changed fails:
$ RUST_BACKTRACE=1 target/release/mistralrs-server -i --token-source none --chat-template /mist/chat_templates/mistral.json gguf -m . -f /models/Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf
2024-05-30T11:06:44.051393Z INFO mistralrs_core::pipeline::gguf: Loading model `.` on Cuda(CudaDevice(DeviceId(1)))...
2024-05-30T11:06:44.099117Z INFO mistralrs_core::pipeline::gguf: Model config:
general.architecture: llama
general.file_type: 15
general.name: jeffq
general.quantization_version: 2
llama.attention.head_count: 32
llama.attention.head_count_kv: 8
llama.attention.layer_norm_rms_epsilon: 0.00001
llama.block_count: 32
llama.context_length: 32768
llama.embedding_length: 4096
llama.feed_forward_length: 14336
llama.rope.dimension_count: 128
llama.rope.freq_base: 10000
thread 'main' panicked at mistralrs-core/src/pipeline/gguf_tokenizer.rs:65:31:
no entry found for key
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::panicking::panic_display
3: core::option::expect_failed
4: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_path
5: <mistralrs_core::pipeline::gguf::GGUFLoader as mistralrs_core::pipeline::Loader>::load_model_from_hf
6: tokio::runtime::context::runtime::enter_runtime
7: mistralrs_server::main

I tried another model and it was loading, then panicked because I mistyped a different chat template filename; it probably should verify the file exists before it begins loading the model. I tried a few other GGUF models from HF and some also failed with similar tokenizer panics. |
@polarathene I think you should be able to run the Hermes model now. I just merged #363, which allows the default unigram UNK token (0) in case it is missing.
Yeah, we only support the unigram GGUF tokenizer for now.
|
@MoonRide303, coreylowman/cudarc#240 should fix this. |
@MoonRide303, I think it should be fixed now: coreylowman/cudarc#240 was merged, and there are reports that it works for others. Can you please try it again after updating and rebuilding? |
I don't know much about these, but after a |
Great! For reference, see these docs: https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#ggml |
@EricLBuehler tested (full rebuild) on current master (1d21c5f), result:
Proper name for this one (as of CUDA 12.5 on Windows) should be:
|
@MoonRide303 I opened an issue: coreylowman/cudarc#242. I'll let you know when a fix gets merged. |
@EricLBuehler I've noticed the change in cudarc was merged, so I tried to rebuild - it seems the problems with the CUDA DLLs are now solved. But it still asks me for a separate tokenizer config file:
When I copied this config from the original HF repo and added the port parameter, I managed to launch the server:
Though it seems it didn't read the chat template at all (not from the GGUF, and not from the separate config file). Trying to chat after starting the server:
|
Great work! I just re-started implementing this for my crate. For reference, I'm offering two options for loading GGUFs, both designed to be as easy as possible (for the Python immigrants). Option 1: from presets with pre-downloaded tokenizer.json, tokenizer_config.json, and config.json. Given the user's VRAM, it then downloads the largest quant that will fit. The tokenizer.json is no longer required since you've implemented the tokenizer from GGUF (legend), and presumably I can use that interface or the code in my crate for my tokenizer needs. I then plan to pass those paths after loading to mistral.rs, just like I currently do with llama.cpp. |
I would implement this and submit a PR, but I haven't looked at the downstream code enough to understand the implications. If you think it's a good idea, I can make an attempt. On chat templates: if we could implement loading the chat template from the GGUF, I'm not sure we'd need anything else. The reason llama.cpp doesn't do this is that they don't/won't implement a Jinja parser, but as most models now include the chat template in the GGUF, I'm not sure there is a reason to load it manually. I don't mind manually adding a chat_template.json to the presets I have, but it makes loading a model from a single file more difficult. Another option might be to accept the chat template as a string. |
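On the "accept the chat template as a string" idea: Rust does have Jinja implementations, so rendering a template supplied as a plain string (for example, one read from GGUF metadata) is straightforward. A small sketch using the minijinja crate, with a made-up ChatML-style template rather than one taken from any particular GGUF:

use minijinja::{context, Environment};

fn main() -> Result<(), minijinja::Error> {
    // A chat template provided as a string, e.g. pulled from GGUF metadata.
    let template =
        "{% for m in messages %}<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n{% endfor %}";
    let mut env = Environment::new();
    env.add_template("chat", template)?;
    let rendered = env.get_template("chat")?.render(context! {
        messages => vec![context! { role => "user", content => "Hello!" }],
    })?;
    println!("{rendered}");
    Ok(())
}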
@MoonRide303 that was a bug and should be fixed now. |
Hi @ShelbyJenkins! That's exciting, I look forward to seeing future developments.
That sounds good. If you could also be sure to implement it for the Normal and GGML loaders too, that would be great.
Yes, that is something which will be easy to do. I'll add support for that in the coming days and will write here when I do. |
@ShelbyJenkins I just merged support for multiple GGUF files in #379. |
I'll be refactoring this, it's on my todo.
Again, heads up that this is something I'll be tackling 😅 My interest is in simplifying the existing code for maintenance, and I'm also likely to change the surrounding loader code. After that's been handled, it would be a better time for you to PR your own UX improvements. Just mentioning this to avoid us both working on conflicting changes to this portion of the codebase at the same time. |
I'm having trouble following this requirement. Where is the example of a sharded GGUF? Was there a misunderstanding about what @ShelbyJenkins was describing (multiple GGUF files that are distinct by their quantization, to support lower VRAM requirements)? Your PR changed quite a bit, and I'm lacking clarity as to why it was needed, as it seems like otherwise redundant complexity. |
Doesn't look fixed as of v0.1.15 (9712da6) - it still asks for the tokenizer config in a separate file:
And even when it is provided in a separate file (which shouldn't be needed) - yes, the server starts without warnings:
But the chat template still doesn't work:
|
@polarathene, great, looking forward to those refactors. Regarding the sharded GGUF files, here is one such example. That PR also reduced GGUF code duplication by centralizing the GGUF handling. |
Hi @MoonRide303, I just merged #385 which adds more verbose and hopefully more helpful logging during loading. I think there are a few things going on here.
|
@EricLBuehler The model itself knows how to handle the system role; it's just a limitation of the default template. But you're right, that's a separate issue, and mistral.rs behaviour was okay (the exception is defined in the template) - and it can also be worked around by providing a custom template that accepts the system role. As for this issue - it seems that #386 is the last missing part, then. |
I was actually going to handle that (and do it better, IMO), but your changes would have caused heavy conflicts to resolve, so I'm somewhat glad I was checking development activity, as I was about to start around the time I discovered the merged PR 😓 @ShelbyJenkins, if you want to tackle your own improvements to the loader, go ahead. The pace of development in this area is a bit too rapid for me to want to touch it for the time being. |
@polarathene as mentioned in #380, I will roll back those changes. Looking forward to seeing your implementation!
I don't plan on working much in that area for the foreseeable future, aside from the rollback, which I'll do shortly. The pace of development there should not be excessive. |
@polarathene @EricLBuehler OK, I'll take a look at this over the week and weekend. It's not a high priority for my project, but I think it's valuable to people consuming this library 👍 |
Hi @MoonRide303! I just merged #416, which enables loading the chat template from the GGUF file, as well as #397, which adds support for the GPT2 (BPE) tokenizer type and extends tokenizer coverage. This command now works:
Output
|
Something is wrong with the tokenizer - it fails at
|
@MoonRide303, I'm not sure about that; please open another issue. Closing this as the feature is complete - please feel free to reopen! |
Describe the bug
Running a model from a GGUF file using llama.cpp is very straightforward, like this:
server -v -ngl 99 -m Phi-3-mini-4k-instruct-Q6_K.gguf
and if the model is supported, it just works.
I tried to do the same using mistral.rs, and I got this:
Why does it ask me for a tokenizer file when it's already included in the GGUF file? I understand having this as an option (if I wanted to try out a different tokenizer / configuration), but by default it should just use the information provided in the GGUF file itself.
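For illustration, the tokenizer metadata really is embedded in the GGUF header and can be listed with a few lines of Rust. This sketch assumes the candle-core crate (which mistral.rs builds on) and a file named model.gguf:

use candle_core::quantized::gguf_file;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = std::fs::File::open("model.gguf")?;
    let content = gguf_file::Content::read(&mut file)?;
    // Keys such as tokenizer.ggml.model, tokenizer.ggml.tokens, and
    // tokenizer.chat_template live in the file's own metadata.
    for key in content.metadata.keys().filter(|k| k.starts_with("tokenizer.")) {
        println!("{key}");
    }
    Ok(())
}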
Next attempt, when I copied tokenizer.json from the original model repo:
And another attempt, after copying config.json (which I think is also unnecessary, as llama.cpp works fine without it):
I wanted to give mistral.rs a shot, but it's a really painful experience for now.
Latest commit
ca9bf7d (v0.1.8)