Refactor the llama.cpp interface #1298

Merged
10 commits merged on Dec 16, 2024
2 changes: 1 addition & 1 deletion .github/workflows/tests.yml
@@ -2,7 +2,7 @@ name: Tests

on:
  pull_request:
    branches: [main]
    branches: [main,v1.0]
  push:
    branches: [main]

45 changes: 45 additions & 0 deletions docs/reference/models/anthropic.md
@@ -0,0 +1,45 @@
# Anthropic

!!! Installation

You need to install the `anthropic` library to use the Anthropic API in Outlines. Alternatively, you can run:

```bash
pip install "outlines[anthropic]"
```

## Anthropic models

Outlines supports models available via the Anthropic API, e.g. Claude 3.5 Haiku or Claude 3.5 Sonnet. You can initialize the model by passing the model name to `outlines.models.Anthropic`:

```python
from outlines import models

model = models.Anthropic("claude-3-5-haiku-latest")
model = models.Anthropic("claude-3-5-sonnet-latest")
```

Check the [Anthropic documentation](https://docs.anthropic.com/en/docs/about-claude/models) for an up-to-date list of available models. You can pass any parameter you would pass to the Anthropic SDK as a keyword argument:

```python
model = models.Anthropic(
    "claude-3-5-haiku-latest",
    api_key="<my api key>"
)
```

## Text generation

To generate text using an Anthropic model, you first need to build a `Generator` object, possibly with the desired output type. You can then call the model by calling the `Generator`. It accepts every argument that you could pass to the `client.messages.create` function as keyword arguments:

```python
from outlines import models, Generator

model = models.Anthropic("claude-3-5-haiku-latest")
generator = Generator(model)
result = generator("Prompt", max_tokens=1024)
```

See the [Anthropic SDK documentation](https://github.com/anthropics/anthropic-sdk-python/blob/main/src/anthropic/resources/messages.py) for the list of available arguments.
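For instance, assuming extra keyword arguments are forwarded unchanged to `client.messages.create`, a minimal sketch that sets a system prompt and a sampling temperature might look like this:

```python
from outlines import models, Generator

model = models.Anthropic("claude-3-5-haiku-latest")
generator = Generator(model)

# `system`, `temperature` and `max_tokens` are standard `client.messages.create`
# arguments; passing them here assumes the generator forwards them unchanged.
result = generator(
    "Summarize the French Revolution in one sentence.",
    system="You are a concise assistant.",
    temperature=0.2,
    max_tokens=1024,
)
```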

The Anthropic API currently does not support structured generation.
88 changes: 88 additions & 0 deletions docs/reference/models/gemini.md
@@ -0,0 +1,88 @@
# Gemini

!!! Installation

You need to install the `google-generativeai` library to use the Gemini API in Outlines. Alternatively, you can run:

```bash
pip install "outlines[gemini]"
```

## Gemini models

Outlines supports models available via the Gemini API, e.g. Gemini 1.5. You can initialize the model by passing the model name to `outlines.models.Gemini`:

```python
from outlines import models

model = models.Gemini("gemini-1.5-flash")
model = models.Gemini("gemini-1.5-pro")
```

Check the [Gemini documentation](https://ai.google.dev/gemini-api/docs/models/gemini) for an up-to-date list of available models.

## Text generation

To generate text using a Gemini model, you first need to build a `Generator` object, possibly with the desired output type. You can then call the model by calling the `Generator`. It accepts every argument that you could pass to the `GenerativeModel.generate_content` method as keyword arguments:

```python
from outlines import models, Generator

model = models.Gemini("gemini-1.5-flash")
generator = Generator(model)
result = generator("Prompt", max_tokens=1024)
```

### Structured generation

Gemini provides support for structured outputs.

#### Json Schema

Outlines provides support for JSON Schema-based structured generation with the Gemini models:

```python
from typing import TypedDict
from outlines import Generator, models
from outlines.types import Json

model = models.Gemini("gemini-1.5-flash")

class Person(TypedDict):
    first_name: str
    last_name: str
    age: int

generator = Generator(model, Json(Person))
generator("current indian prime minister on january 1st 2023")
# Person(first_name='Narendra', last_name='Modi', age=72)
```

Because of the current limitations of the Gemini SDK, only the following objects can be used to define the structure of the JSON object:
- A Pydantic model (see the sketch after this list)
- A TypedDict
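
For example, here is a minimal sketch of the same `Person` structure defined with a Pydantic model instead of a TypedDict, assuming `Json` accepts Pydantic models as listed above:

```python
from pydantic import BaseModel
from outlines import Generator, models
from outlines.types import Json

model = models.Gemini("gemini-1.5-flash")

# Same fields as the TypedDict example above, expressed as a Pydantic model
class Person(BaseModel):
    first_name: str
    last_name: str
    age: int

generator = Generator(model, Json(Person))
generator("current indian prime minister on january 1st 2023")
```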

#### Multiple choices

Outlines provides support for multiple-choice structured generation. Enums and lists of choices are supported:

```python
from enum import Enum
from outlines import Generator, models
from outlines.types import Choice

model = models.Gemini("gemini-1.5-flash")

class Foo(Enum):
    foo = "Foo"
    fizz = "Fizz"
    fuzz = "Fuzz"

generator = Generator(model, Choice(Foo))
generator("current indian prime minister on january 1st 2023")
# Person(first_name='Narendra', last_name='Modi', age=72)
```

The following objects can be used to define the choices:
- An Enum object
- A Python list (see the sketch below)
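
For instance, a minimal sketch using a plain Python list of strings instead of an Enum, assuming `Choice` accepts lists as stated above:

```python
from outlines import Generator, models
from outlines.types import Choice

model = models.Gemini("gemini-1.5-flash")

# The same three options as the Enum example, given as a list of strings
generator = Generator(model, Choice(["Foo", "Fizz", "Fuzz"]))
generator("Pick one of Foo, Fizz or Fuzz")
```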
100 changes: 19 additions & 81 deletions docs/reference/models/llamacpp.md
@@ -12,50 +12,38 @@ Outlines provides an integration with [Llama.cpp](https://github.com/ggerganov/l

## Load the model

You can initialize the model by passing the name of the repository on the HuggingFace Hub, and the filenames (or glob pattern):
To load a model you can use the same interface as you would use with `llama-cpp-python` directly. The default method is to initialize the model by passing the path to the weights on your machine. Assuming [Phi2's weights](https://huggingface.co/TheBloke/phi-2-GGUF) are in the current directory:

```python
from outlines import models

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
llm = models.LlamaCpp("./phi-2.Q4_K_M.gguf")
```

This will download the model files to the hub cache folder and load the weights in memory.
You can initialize the model by passing the name of the repository on the HuggingFace Hub, and the filenames (or glob pattern):

You can also initialize the model by passing the path to the weights on your machine. Assuming [Phi2's weights](https://huggingface.co/TheBloke/phi-2-GGUF) are in the current directory:

```python
from outlines import models
from llama_cpp import Llama

llm = Llama("./phi-2.Q4_K_M.gguf")
model = models.LlamaCpp(llm)
model = models.LlamaCpp.from_pretrained("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
```

If you need more control, you can pass the same keyword arguments to the model as you would pass to the [llama-cpp-python library][llamacpp]:
This will download the model files to the hub cache folder and load the weights in memory.


You can pass the same keyword arguments to the model as you would pass to the [llama-cpp-python library][llamacpp]:

```python
from outlines import models

model = models.llamacpp(
model = models.LlamaCpp(
"TheBloke/phi-2-GGUF",
"phi-2.Q4_K_M.gguf"
n_ctx=512, # to set the context length value
)
```

**Main parameters:**

| Parameters | Type | Description | Default |
|------------|------|-------------|---------|
| `n_gpu_layers`| `int` | Number of layers to offload to GPU. If -1, all layers are offloaded | `0` |
| `split_mode` | `int` | How to split the model across GPUs. `1` for layer-wise split, `2` for row-wise split | `1` |
| `main_gpu` | `int` | Main GPU | `0` |
| `tensor_split` | `Optional[List[float]]` | How split tensors should be distributed across GPUs. If `None` the model is not split. | `None` |
| `n_ctx` | `int` | Text context. Inferred from the model if set to `0` | `0` |
| `n_threads` | `Optional[int]` | Number of threads to use for generation. All available threads if set to `None`.| `None` |
| `verbose` | `bool` | Print verbose outputs to `stderr` | `False` |

See the [llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for the full list of parameters.

### Load the model on GPU
@@ -69,87 +57,39 @@ See the [llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io
```python
from outlines import models

model = models.llamacpp(
model = models.LlamaCpp(
"TheBloke/phi-2-GGUF",
"phi-2.Q4_K_M.gguf",
n_gpu_layers=-1, # to use GPU acceleration
)
```

This also works with generators built with `generate.regex`, `generate.json`, `generate.cfg`, `generate.format` and `generate.choice`.

### Load LoRA adapters
## Generate text


You can load LoRA adapters dynamically:
To generate text you must first create a `Generator` object by passing the model instance and, possibly, the expected output type:

```python
from outlines import models, generate

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)
answer_1 = generator("prompt")

model.load_lora("./path/to/adapter.gguf")
answer_2 = generator("prompt")

from outlines import models, Generator

model = models.LlamaCpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = Generator(model)
```

To load another adapter you need to re-initialize the model. Otherwise the adapter will be added on top of the previous one:
You can pass the generator the same keyword arguments you would pass to `llama-cpp-python`:

```python
from outlines import models

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter1.gguf") # Load first adapter

model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
model.load_lora("./path/to/adapter2.gguf") # Load second adapter
answer = generator("A prompt", presence_penalty=0.8)
```

## Generate text

In addition to the parameters described in the [text generation section](../text.md) you can pass extra keyword arguments, for instance to set sampling parameters not exposed in Outlines' public API:
You can also stream the tokens:

```python
from outlines import models, generate


model = models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
generator = generate.text(model)

answer = generator("A prompt", presence_penalty=0.8)
tokens = generator.stream("A prompt")
```
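
Assuming `stream` returns an iterator over the generated tokens, mirroring `llama-cpp-python`'s streaming behaviour, you can then consume them as they are produced:

```python
# Print each token as soon as it is generated
for token in generator.stream("Write a haiku about llamas."):
    print(token, end="")
```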

**Extra keyword arguments:**

The values of the keyword arguments you pass to the generator supersede the values set when initializing the sampler or generator. All extra sampling methods and repetition penalties are disabled by default.

| Parameters | Type | Description | Default |
|------------|------|-------------|---------|
| `suffix` | `Optional[str]` | A suffix to append to the generated text. If `None` no suffix is added. | `None` |
| `echo` | `bool` | Whether to prepend the prompt to the completion. | `False` |
| `seed` | `int` | The random seed to use for sampling. | `None` |
| `max_tokens` | `Optional[int]` | The maximum number of tokens to generate. If `None` the maximum number of tokens depends on `n_ctx`. | `16` |
| `frequency_penalty` | `float` | The penalty to apply to tokens based on their frequency in the past 64 tokens. | `0.0` |
| `presence_penalty` | `float` | The penalty to apply to tokens based on their presence in the past 64 tokens. | `0.0` |
| `repeat_penalty` | `float` | The penalty to apply to repeated tokens in the past 64 tokens. | `1.` |
| `stopping_criteria` | `Optional[StoppingCriteriaList]` | A list of stopping criteria to use. | `None`
| `logits_processor` | `Optional[LogitsProcessorList]` | A list of logits processors to use. The logits processor used for structured generation will be added to this list. | `None`
| `temperature` | `float` | The temperature to use for sampling | `1.0` |
| `top_p` | `float` | The top-p value to use for [nucleus sampling][degeneration]. | `1.` |
| `min_p` | `float` | The min-p value to use for [minimum-p sampling][minimum-p]. | `0.` |
| `typical_p` | `float` | The p value to use for [locally typical sampling][locally-typical]. | `1.0` |
| `stop` | `Optional[Union[str, List[str]]]` | A list of strings that stop generation when encountered. | `[]` |
| `top_k` | `int` | The top-k value used for [top-k sampling][top-k]. Negative value to consider all logit values. | `-1.` |
| `tfs_z` | `float` | The [tail-free sampling][tail-free] parameter. | `1.0` |
| `mirostat_mode` | `int` | The [mirostat sampling][mirostat] mode. | `0` |
| `mirostat_tau` | `float` | The target cross-entropy for [mirostat sampling][mirostat].| `5.0` |
| `mirostat_eta` | `float` | The learning rate used to update `mu` in [mirostat sampling][mirostat]. | `0.1` |

See the [llama-cpp-python documentation][llama-cpp-python-call] for the full and up-to-date list of parameters and the [llama.cpp code][llama-cpp-sampling-params] for the default values of other
sampling parameters.

### Streaming


## Installation

@@ -216,8 +156,6 @@ CMAKE_ARGS="-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS" pip install llama-cpp-
- SYCL




[llamacpp]: https://github.com/abetlen/llama-cpp-python
[llama-cpp-python-call]: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__call__
[llama-cpp-python-install]: https://github.com/abetlen/llama-cpp-python/tree/08b16afe11e7b42adec2fed0a781123383476045?tab=readme-ov-file#supported-backends