Update vllama docs for uqff
EricLBuehler committed Sep 30, 2024
1 parent e449543 commit ce02618
Showing 2 changed files with 19 additions and 3 deletions.
docs/UQFF.md (3 changes: 2 additions & 1 deletion)
@@ -174,4 +174,5 @@ Have you created a UQFF model on Hugging Face? If so, please [create an issue](h

| Name | Base model | UQFF model |
| -- | -- | -- |
-| Phi 3.5 Mini Instruct | microsoft/Phi-3.5-mini-instruct | EricB/Phi-3.5-mini-instruct-UQFF |
+| Phi 3.5 Mini Instruct | microsoft/Phi-3.5-mini-instruct | [EricB/Phi-3.5-mini-instruct-UQFF](https://huggingface.co/EricB/Phi-3.5-mini-instruct-UQFF) |
+| Llama 3.2 Vision | meta-llama/Llama-3.2-11B-Vision-Instruct | [EricB/Llama-3.2-11B-Vision-Instruct-UQFF](https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF) |
docs/VLLAMA.md (19 changes: 17 additions & 2 deletions)
@@ -2,7 +2,7 @@

Mistral.rs supports the Llama 3.2 Vision model, with examples for the Rust, Python, and HTTP APIs. ISQ quantization is supported, allowing the model to run with reduced memory requirements.
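
For instance, a minimal sketch of an ISQ load via the Python API (assuming the `mistralrs` package's `Runner` accepts an `in_situ_quant` argument and that `Q4K` is an available level):

```
from mistralrs import Runner, Which, VisionArchitecture

# ISQ sketch: load Llama 3.2 Vision and quantize the weights at load time.
# `in_situ_quant="Q4K"` is an assumption; substitute any supported ISQ level.
runner = Runner(
    which=Which.VisionPlain(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
        arch=VisionArchitecture.VLlama,
    ),
    in_situ_quant="Q4K",
)
```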

-UQFF quantizations will be released shortly.
+UQFF quantizations are also available.

The Python and HTTP APIs support sending images as:
- URL
@@ -13,11 +13,14 @@ The Rust API takes an image from the [image](https://docs.rs/image/latest/image/) crate.

> Note: Some examples use the [Cephalo Llama 3.2 model](https://huggingface.co/lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k), a member of the [Cephalo](https://huggingface.co/collections/lamm-mit/cephalo-664f3342267c4890d2f46b33) model collection. This model is a finetune of Llama 3.2 with enhanced capabilities for scientific images. To use the base Llama 3.2 Vision model, simply use the [associated model ID](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).
> Note: When using device mapping or model topology, only the text model and its layers are managed, because the text model contains most of the model's parameters. *The text model has 40 layers*.
## ToC
- [Interactive mode](#interactive-mode)
- [HTTP server](#http-server)
- [Rust API](#rust)
- [Python API](#python)
- [UQFF models](#uqff-models)

## Interactive mode

@@ -247,4 +250,16 @@ print(res.usage)
```

- You can find an example of encoding the [image via base64 here](../examples/python/phi3v_base64.py).
- You can find an example of loading an [image locally here](../examples/python/phi3v_local_img.py).
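
As a quick sketch of the base64 route linked above (the file name is hypothetical; the message format assumes the OpenAI-style `image_url` content parts used by the linked examples):

```
import base64

# Read a local image (hypothetical path) and wrap it in a data URL so it
# can be sent as an `image_url` content part in a chat message.
with open("demo.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded}"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```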

## UQFF models
[UQFF](UQFF.md) is a quantized file format, similar to GGUF, built on ISQ. It removes the memory and compute costs of running ISQ at load time by providing ready-made quantizations. The key advantage over GGUF is the flexibility to store multiple quantizations in one file.

We provide UQFF files ([EricB/Llama-3.2-11B-Vision-Instruct-UQFF](https://huggingface.co/EricB/Llama-3.2-11B-Vision-Instruct-UQFF)) for this Llama 3.2 Vision model.

You can use these UQFF files to easily run quantized versions of Llama 3.2 Vision without paying ISQ's load-time cost.

For example:
```
./mistralrs-server -i vision-plain -m meta-llama/Llama-3.2-11B-Vision-Instruct -a vllama --from-uqff EricB/Llama-3.2-11B-Vision-Instruct-UQFF/llama-3.2-11b-vision-q4k.uqff
```
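
The same load, sketched for the Python API. The `from_uqff` keyword is an assumption mirroring the `--from-uqff` flag above; treat this as a sketch, not confirmed API:

```
from mistralrs import Runner, Which, VisionArchitecture, ChatCompletionRequest

# Sketch: load the ready-made Q4K UQFF file instead of quantizing at load time.
# `from_uqff` is assumed to mirror the CLI's --from-uqff flag.
runner = Runner(
    which=Which.VisionPlain(
        model_id="meta-llama/Llama-3.2-11B-Vision-Instruct",
        arch=VisionArchitecture.VLlama,
        from_uqff="EricB/Llama-3.2-11B-Vision-Instruct-UQFF/llama-3.2-11b-vision-q4k.uqff",
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="llama-vision",
        messages=[{"role": "user", "content": "Why does UQFF load faster than ISQ?"}],
        max_tokens=64,
    )
)
print(res.choices[0].message.content)
```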
