Chore/56 option to include large unused dependencies (#60)
* chore: Add new `e5` and `e5-cpu` extras

* feat: Make OpenAIEmbedder the default, over E5Embedder

* chore: Change torch version to +cpu

* chore: Pin torch-cpu version

* chore: Remove +cpu

* feat: Add `is_installed` and `raise_if_not_installed` convenience functions

* chore: Remove all extras

* docs: Remove extras from readme

* feat: Add `RagSystem.from_config` class method

* feat: More extras

* feat: Add Demo.from_config

* style: Rename script commands

* feat: Add CLI module

* style: Type hints

* feat: Allow JSON config

* feat: Loading of config

* fix: Rename Openai to OpenAI, load from config correctly

* fix: `from_config` takes in a file name

* docs: Update changelog

* docs: Update changelog

* chore: Add cpu extra to CI

* fix: Deal with torch CI bug

* chore: Do not install vllm if 'cpu' extra is enabled

* chore: Remove vllm during CI

* chore: Update torch in CI

* chore: Update deps for CI

* chore: CI

* chore: Torch CI

* fix: psycopg2 vs psycopg2-binary

* tests: Skip tests if generator can't be initialised

* chore: Add ExtraMissing

* style: MissingExtra and MissingPackage messages

* tests: Skip on MissingExtra

* fix: Change psycopg2-binary to psycopg2

* fix: Psycopg2
saattrupdan authored Aug 21, 2024
1 parent 33d3413 commit cdbaf0e
Showing 18 changed files with 1,196 additions and 340 deletions.
5 changes: 4 additions & 1 deletion .github/workflows/ci.yaml
@@ -54,7 +54,10 @@ jobs:
- name: Install Dependencies
run: |
poetry env use "${{ matrix.python-version }}"
poetry install --extras all
poetry install --extras all --extras cpu
- name: Fix PyTorch bug
run: poetry remove vllm sentence_transformers && poetry add torch==2.0.0 transformers==4.36.0 sentence_transformers

- name: Test with pytest
run: poetry run pytest
16 changes: 16 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,22 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.


## [Unreleased]
### Added
- Added new `e5` and `cpu` extras, where `e5` installs the `sentence-transformers`
dependency required for the `E5Embedder`, and you can add `cpu` to install the
CPU-version of `torch` to save disk space (note that this is not available on MacOS,
however).
- Added new `from_config` class methods to `RagSystem` and `Demo` to create instances
from a configuration file (YAML or JSON). See the readme for more information.
- Added new `ragger-demo` and `ragger-compile` command line interfaces to run the demo
and compile the RAG system, respectively. Compilation is useful in cases where you
want to ensure that all components have everything downloaded and installed before
use. Both of these take a single `--config-file` argument to specify a configuration
file. See the readme for more information.

### Changed
- Changed default embedder in `RagSystem` to `OpenAIEmbedder` from `E5Embedder`.

### Fixed
- Raise `ImportError` when initialising `OpenAIEmbedder` without the `openai` package
installed.
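
The `from_config` entries above say a configuration file may be YAML or JSON, and `demo.py` in this commit calls `load_config(config_file=config_file)`, whose body is not shown on this page. A minimal sketch of what such a loader could look like follows. The name `load_config_sketch`, the choice to return an empty dictionary for `None`, and the extension-based dispatch are all assumptions, not the project's implementation. YAML parsing assumes the third-party PyYAML package.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_config_sketch(config_file: str | Path | None) -> dict[str, Any]:
    """Load a JSON or YAML configuration file into a dictionary (hypothetical)."""
    # Assumption: no file means "use the defaults", so return an empty config.
    if config_file is None:
        return {}

    path = Path(config_file)
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")

    if suffix == ".json":
        config = json.loads(text)
    elif suffix in {".yaml", ".yml"}:
        # PyYAML is third-party; import lazily so JSON-only users do not need it.
        import yaml

        config = yaml.safe_load(text) or {}
    else:
        raise ValueError(f"Unsupported config file extension: {suffix!r}")

    if not isinstance(config, dict):
        raise ValueError("The top level of the config file must be a mapping.")
    return config
```

Dispatching on the file extension keeps the JSON path free of extra dependencies, which fits this commit's goal of making large dependencies optional.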
23 changes: 8 additions & 15 deletions README.md
@@ -26,26 +26,19 @@ Developer(s):
Installation with `pip`:

```bash
pip install ragger[all]@git+ssh://git@github.com/alexandrainst/ragger.git
pip install ragger[default]@git+ssh://git@github.com/alexandrainst/ragger.git
```

Installation with `poetry`:

```bash
poetry add git+ssh://git@github.com/alexandrainst/ragger.git --extras all
poetry add git+ssh://git@github.com/alexandrainst/ragger.git --extras default
```

You can replace the `all` extra with any combination of the following, to install only
the components you need:

- `postgres`
- `vllm`
- `openai`
- `demo`

For `pip`, this is done by comma-separating the extras (e.g., `ragger[vllm,demo]`),
while for `poetry`, you add multiple `--extras` flags (e.g., `--extras vllm --extras
demo`).
The `default` extra will make sure that you have all the necessary dependencies for
the default components (see below). If you want to use other components, you usually
need to install additional dependencies - these will be listed to you when you try to
use these components.
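
The "listed to you when you try to use these components" behaviour corresponds to the `is_installed` and `raise_if_not_installed` helpers that this commit calls in `demo.py` and `document_store.py`. Their implementation in `utils.py` is not shown on this page, so the following is only a sketch built from the keyword arguments visible at the call sites (`package_names`, `extras_mapping`, `installation_alias_mapping`). The `_sketch` names, the message wording, and the use of a plain `ImportError` are assumptions; the commit message mentions dedicated `MissingExtra` and `MissingPackage` exceptions whose definitions are not visible here.

```python
from __future__ import annotations

import importlib.util


def is_installed_sketch(package_name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


def raise_if_not_installed_sketch(
    package_names: list[str],
    extras_mapping: dict[str, str] | None = None,
    installation_alias_mapping: dict[str, str] | None = None,
) -> None:
    """Raise an ImportError that lists every missing package and how to get it."""
    extras_mapping = extras_mapping or {}
    installation_alias_mapping = installation_alias_mapping or {}

    missing = [name for name in package_names if not is_installed_sketch(name)]
    if not missing:
        return

    hints = []
    for name in missing:
        # The name used for installation may differ from the import name,
        # e.g. `psycopg2` is imported but `psycopg2-binary` is installed.
        install_name = installation_alias_mapping.get(name, name)
        hint = f"`{install_name}`"
        extra = extras_mapping.get(name)
        if extra is not None:
            hint += f" (included in the `{extra}` extra)"
        hints.append(hint)

    raise ImportError("Missing optional dependencies: " + ", ".join(hints))
```

Collecting all missing packages before raising means a user sees the full list in one error rather than fixing them one at a time.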


## Quick Start
@@ -99,8 +92,8 @@ imported from `ragger.document_store`.

Embedders are used to embed documents. These can all be imported from `ragger.embedder`.

- `E5Embedder`: An embedder that uses an E5 model. (default)
- `OpenAIEmbedder`: An embedder that uses the OpenAI Embeddings API.
- `OpenAIEmbedder`: An embedder that uses the OpenAI Embeddings API. (default)
- `E5Embedder`: An embedder that uses an E5 model.


### Embedding Stores
7 changes: 6 additions & 1 deletion makefile
@@ -83,7 +83,12 @@ format: ## Format the code
@poetry run ruff format .

type-check: ## Run type checking
@poetry run mypy . --install-types --non-interactive --ignore-missing-imports --show-error-codes --check-untyped-defs
@poetry run mypy . \
--install-types \
--non-interactive \
--ignore-missing-imports \
--show-error-codes \
--check-untyped-defs

setup-environment-variables:
@poetry run python src/scripts/fix_dot_env_file.py
1,010 changes: 771 additions & 239 deletions poetry.lock

Large diffs are not rendered by default.

42 changes: 31 additions & 11 deletions pyproject.toml
@@ -14,49 +14,69 @@ python = ">=3.10,<4.0"
numpy = "^1.26.4"
python-dotenv = "^1.0.1"
pydantic = "^2.8.2"
sentence-transformers = "^2.7.0"
gradio = { version = "^4.27.0", optional = true }
click = "^8.1.7"
tiktoken = { version = ">=0.7.0,<1.0.0", optional = true }
openai = { version = "^1.23.2", optional = true }
vllm = { markers = "sys_platform != 'darwin'", version = "^0.4.0", optional = true }
vllm = { version = "^0.5.4", optional = true, markers = "sys_platform != 'darwin' and extra != 'cpu'" }
torch = [
{ version = "^2.3.0", optional = true, source = "pypi", markers = "sys_platform == 'darwin' or extra != 'cpu'" },
{ version = "^2.3.0+cpu", optional = true, source = "torch_cpu", markers = "sys_platform != 'darwin' and extra == 'cpu'" },
]
psycopg2-binary = { version = "^2.9.9", optional = true }
sentence_transformers = { version = "^2.7.0", optional = true }
gradio = { version = "^4.27.0", optional = true }

[tool.poetry.group.dev.dependencies]
pytest = ">=8.1.1"
pytest-cov = ">=4.1.0"
pre-commit = ">=3.6.2"
readme-coverage-badger = ">=0.1.2"
click = ">=8.1.7"
ruff = ">=0.3.2"
mypy = ">=1.9.0"
nbstripout = ">=0.7.1"

[tool.poetry.extras]
default = [
"openai",
"tiktoken",
]
postgres = [
"psycopg2-binary",
]
vllm = [
"vllm",
]
openai = [
"openai",
"tiktoken",
e5 = [
"sentence-transformers",
]
cpu = [
"torch",
]
demo = [
"gradio"
"gradio",
]
all = [
"psycopg2-binary",
"vllm",
"openai",
"tiktoken",
"torch",
"vllm",
"psycopg2-binary",
"sentence-transformers",
"gradio",
]


[[tool.poetry.source]]
name = "pypi"

[[tool.poetry.source]]
name = "torch_cpu"
url = "https://download.pytorch.org/whl/cpu"
priority = "explicit"

[tool.poetry.scripts]
ragger-demo = "ragger.cli:run_demo"
ragger-compile = "ragger.cli:compile"

[tool.ruff]
target-version = "py311"
line-length = 88
1 change: 1 addition & 0 deletions src/ragger/__init__.py
@@ -20,6 +20,7 @@
format="%(asctime)s ⋅ %(name)s ⋅ %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
logging.getLogger("httpx").setLevel(logging.CRITICAL)

# Fetches the version of the package as defined in pyproject.toml
__version__ = importlib.metadata.version(__package__)
50 changes: 50 additions & 0 deletions src/ragger/cli.py
@@ -0,0 +1,50 @@
"""Command-line interface for the `ragger` package."""

import logging
from pathlib import Path

import click

from .demo import Demo
from .rag_system import RagSystem

logger = logging.getLogger(__package__)


@click.command()
@click.option(
    "--config_file",
    "-c",
    default=None,
    type=click.Path(exists=True, dir_okay=False),
    help="Path to the configuration file, which should be a JSON or YAML file.",
)
def run_demo(config_file: Path | None) -> None:
    """Run a RAG demo.

    Args:
        config_file:
            Path to the configuration file.
    """
    rag_system = RagSystem.from_config(config_file=config_file)
    demo = Demo.from_config(rag_system=rag_system, config_file=config_file)
    demo.launch()


@click.command()
@click.option(
    "--config_file",
    "-c",
    default=None,
    type=click.Path(exists=True, dir_okay=False),
    help="Path to the configuration file, which should be a JSON or YAML file.",
)
def compile(config_file: Path) -> None:
    """Compile a RAG system.

    Args:
        config_file:
            Path to the configuration file.
    """
    RagSystem.from_config(config_file=config_file)
    logger.info("RAG system compiled successfully.")
67 changes: 59 additions & 8 deletions src/ragger/demo.py
@@ -1,6 +1,5 @@
"""A Gradio demo of the RAG system."""

import importlib.util
import json
import logging
import os
@@ -9,9 +8,6 @@
import warnings
from pathlib import Path

from huggingface_hub import CommitScheduler, HfApi
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError

from .constants import (
DANISH_DEMO_TITLE,
DANISH_DESCRIPTION,
@@ -29,15 +25,21 @@
ENGLISH_THANK_YOU_FEEDBACK,
)
from .data_models import Document, PersistentSharingConfig
from .generator import OpenaiGenerator
from .generator import OpenAIGenerator
from .rag_system import RagSystem
from .utils import format_answer
from .utils import format_answer, is_installed, load_config, raise_if_not_installed

if importlib.util.find_spec("gradio") is not None:
if is_installed(package_name="gradio"):
import gradio as gr

if is_installed(package_name="huggingface_hub"):
from huggingface_hub import CommitScheduler, HfApi
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError

if typing.TYPE_CHECKING:
import gradio as gr
from huggingface_hub import CommitScheduler, HfApi
from huggingface_hub.utils import EntryNotFoundError, RepositoryNotFoundError

Message = str | None
Exchange = tuple[Message, Message]
@@ -107,6 +109,11 @@ def __init__(
The configuration for persistent sharing of the demo. If None then no
persistent sharing is used. Defaults to None.
"""
raise_if_not_installed(
package_names=["gradio", "huggingface_hub"],
extras_mapping=dict(gradio="demo", huggingface_hub="demo"),
)

title_mapping = dict(da=DANISH_DEMO_TITLE, en=ENGLISH_DEMO_TITLE)
description_mapping = dict(da=DANISH_DESCRIPTION, en=ENGLISH_DESCRIPTION)
feedback_instruction_mapping = dict(
@@ -209,6 +216,50 @@ def __init__(
        self.retrieved_documents: list[Document] = list()
        self.blocks: gr.Blocks | None = None

    @classmethod
    def from_config(
        cls, rag_system: RagSystem, config_file: str | Path | None
    ) -> "Demo":
        """Create a demo from a configuration.

        Args:
            rag_system:
                The RAG system.
            config_file:
                Path to the configuration file.

        Returns:
            The demo.
        """
        config = load_config(config_file=config_file)

        kwargs: dict[str, typing.Any] = dict(rag_system=rag_system)
        if "feedback_db_path" in config:
            kwargs["feedback_db_path"] = Path(config["feedback_db_path"])
        if "feedback_mode" in config:
            kwargs["feedback_mode"] = config["feedback_mode"]
        if "gradio_theme" in config:
            kwargs["gradio_theme"] = config["gradio_theme"]
        if "title" in config:
            kwargs["title"] = config["title"]
        if "description" in config:
            kwargs["description"] = config["description"]
        if "feedback_instruction" in config:
            kwargs["feedback_instruction"] = config["feedback_instruction"]
        if "thank_you_feedback" in config:
            kwargs["thank_you_feedback"] = config["thank_you_feedback"]
        if "input_box_placeholder" in config:
            kwargs["input_box_placeholder"] = config["input_box_placeholder"]
        if "submit_button_value" in config:
            kwargs["submit_button_value"] = config["submit_button_value"]
        if "no_documents_reply" in config:
            kwargs["no_documents_reply"] = config["no_documents_reply"]
        if "persistent_sharing_config" in config:
            kwargs["persistent_sharing_config"] = PersistentSharingConfig(
                **config["persistent_sharing_config"]
            )
        return cls(**kwargs)

    def build_demo(self) -> "gr.Blocks":
        """Build the demo.
@@ -343,7 +394,7 @@ def push_to_hub(self) -> None:
key=self.persistent_sharing_config.hf_token_variable_name,
value=os.environ[self.persistent_sharing_config.hf_token_variable_name],
)
if isinstance(self.rag_system.generator, OpenaiGenerator):
if isinstance(self.rag_system.generator, OpenAIGenerator):
api.add_space_secret(
repo_id=self.persistent_sharing_config.space_repo_id,
key="OPENAI_API_KEY",
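
`Demo.from_config` in the `demo.py` diff above copies each recognised key from the config into `kwargs` with one `if` per key. As an illustration only, the same mapping can be expressed as a table of keys and converters. The key names are taken from the diff; `DEMO_CONFIG_KEYS`, `_identity`, and `build_demo_kwargs` are hypothetical names, and `persistent_sharing_config` is left out because converting it needs the project's `PersistentSharingConfig` model.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable


def _identity(value: Any) -> Any:
    """Pass a config value through unchanged."""
    return value


# Config keys recognised by `Demo.from_config` in this commit, each paired with
# the conversion applied to the raw value (only the database path is converted).
DEMO_CONFIG_KEYS: dict[str, Callable[[Any], Any]] = {
    "feedback_db_path": Path,
    "feedback_mode": _identity,
    "gradio_theme": _identity,
    "title": _identity,
    "description": _identity,
    "feedback_instruction": _identity,
    "thank_you_feedback": _identity,
    "input_box_placeholder": _identity,
    "submit_button_value": _identity,
    "no_documents_reply": _identity,
}


def build_demo_kwargs(config: dict[str, Any]) -> dict[str, Any]:
    """Keep only recognised keys that are present, converting values as needed."""
    return {
        key: convert(config[key])
        for key, convert in DEMO_CONFIG_KEYS.items()
        if key in config
    }
```

Keys absent from the config are simply omitted, so the constructor's own defaults still apply, which matches the behaviour of the `if` chain in the diff. Unrecognised keys are ignored in both versions.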
20 changes: 10 additions & 10 deletions src/ragger/document_store.py
@@ -1,15 +1,15 @@
"""Store and fetch documents from a database."""

import importlib.util
import json
import sqlite3
import typing
from contextlib import contextmanager
from pathlib import Path

from .data_models import Document, DocumentStore, Index
from .utils import is_installed, raise_if_not_installed

if importlib.util.find_spec("psycopg2") is not None:
if is_installed(package_name="psycopg2"):
import psycopg2

if typing.TYPE_CHECKING:
@@ -300,13 +300,11 @@ def __init__(
The name of the column in the table that stores the document text.
Defaults to "text".
"""
psycopg2_not_installed = importlib.util.find_spec("psycopg2") is None
if psycopg2_not_installed:
raise ImportError(
"The `postgres` extra is required to use the `PostgresDocumentStore`. "
"Please install it by running `pip install ragger[postgres]@"
"git+ssh://git@github.com/alexandrainst/ragger.git` and try again."
)
raise_if_not_installed(
package_names=["psycopg2"],
extras_mapping=dict(psycopg2="postgres"),
installation_alias_mapping=dict(psycopg2="psycopg2-binary"),
)

self.host = host
self.port = port
@@ -333,7 +331,9 @@ def __init__(
)

@contextmanager
def _connect(self) -> typing.Generator[psycopg2.extensions.connection, None, None]:
def _connect(
self,
) -> "typing.Generator[psycopg2.extensions.connection, None, None]":
"""Connect to the PostgreSQL database.
Yields:
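
The last hunk quotes the return annotation of `_connect`. The likely reason is that `psycopg2` is now imported only when it is installed, so an unquoted `psycopg2.extensions.connection` would be evaluated when the method is defined and fail with a `NameError` on machines without the package. The sketch below reproduces the pattern with a made-up optional module. `fancy_db_driver`, its `connect()` function, and its `Connection` type are all hypothetical stand-ins.

```python
import importlib.util
import typing
from contextlib import contextmanager

# Import the optional dependency only when it is actually available.
if importlib.util.find_spec("fancy_db_driver") is not None:
    import fancy_db_driver

# Type checkers always see the import, regardless of what is installed.
if typing.TYPE_CHECKING:
    import fancy_db_driver


@contextmanager
def connect() -> "typing.Generator[fancy_db_driver.Connection, None, None]":
    """Yield a connection, failing with a clear error if the driver is missing."""
    # The quoted annotation above is never evaluated at definition time, so this
    # module imports cleanly even without the optional package.
    if importlib.util.find_spec("fancy_db_driver") is None:
        raise ImportError("`fancy_db_driver` is required to open a connection.")
    connection = fancy_db_driver.connect()
    try:
        yield connection
    finally:
        connection.close()
```

The error is raised only when `connect()` is entered, so code paths that never touch the optional backend are unaffected.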