consolidate custom embeddings content

ccurme committed Dec 16, 2024
1 parent d44af41 commit 58c6163
Showing 2 changed files with 96 additions and 130 deletions.
33 changes: 2 additions & 31 deletions docs/docs/contributing/how_to/integrations/package.mdx
@@ -291,37 +291,8 @@ import VectorstoreSource from '../../../../src/theme/integration_template/integr
Embeddings are used to convert `str` objects from `Document.page_content` fields
into a vector representation (a list of floats).

Your embeddings class must inherit from the [Embeddings](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html#langchain_core.embeddings.embeddings.Embeddings)
base class. This interface has 5 methods that can be implemented.

| Method/Property | Description |
|------------------------ |------------------------------------------------------|
| `__init__` | Initialize the embeddings object. (optional) |
| `embed_query`           | Embed a single query text. (required)                  |
| `embed_documents`       | Embed a list of document texts. (required)             |
| `aembed_query`          | Asynchronously embed a single query text. (optional)   |
| `aembed_documents`      | Asynchronously embed a list of document texts. (optional) |

### Constructor

The `__init__` constructor is optional but common. It can be used to set up any necessary attributes
that a user can pass in when initializing the embeddings object. Common attributes include:

- `model` - the id of the model to use for embeddings

### Embedding queries vs documents

The `embed_query` and `embed_documents` methods are required. Both methods operate
on string inputs; the accessing of `Document.page_content` attributes is handled
by the `VectorStore` using the embedding model, for legacy reasons.

`embed_query` takes in a single string and returns a single embedding as a list of floats.
If your model has different modes for embedding queries vs the underlying documents, you can
implement this method to handle that.

`embed_documents` takes in a list of strings and returns a list of embeddings as a list of lists of floats.
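
For reference, a minimal sketch of these two required methods might look like the following. The class name, constructor argument, and placeholder vectors below are purely illustrative and not part of any real integration:

```python
from typing import List

from langchain_core.embeddings import Embeddings


class ExampleEmbeddings(Embeddings):
    """Toy embeddings model used only to illustrate the interface."""

    def __init__(self, model: str = "example-model") -> None:
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of document texts."""
        # A real implementation would call the underlying model or API here.
        return [[0.0, 0.0, 0.0] for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        """Embed a single query text."""
        # Delegating to embed_documents is a common pattern when the model
        # does not distinguish between queries and documents.
        return self.embed_documents([text])[0]
```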

### Implementation
Refer to the [Custom Embeddings Guide](/docs/how_to/custom_embeddings) for
details on a starter embeddings [implementation](/docs/how_to/custom_embeddings/#implementation).

You can start from the following template or langchain-cli command:

193 changes: 94 additions & 99 deletions docs/docs/how_to/custom_embeddings.ipynb
@@ -1,15 +1,5 @@
{
"cells": [
{
"cell_type": "raw",
"id": "a6c3a6e0-a94f-4d40-9022-2c7ac2380f6d",
"metadata": {},
"source": [
"---\n",
"sidebar_position: 0\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -20,7 +10,7 @@
"\n",
"We'll explore how to create a custom embedding model using LangChain's Embeddings interface. Embeddings are critical in natural language processing applications as they convert text into a numerical form that algorithms can understand, thereby enabling a wide range of applications such as similarity search, text classification, and clustering.\n",
"\n",
"Implementing embeddings using the standard `Embeddings` interface will allow your embeddings to be utilized in existing `LangChain` abstractions (e.g., as the embeddings for a particular `Vectorstore` or cached using `CacheBackedEmbeddings`).\n",
"Implementing embeddings using the standard [Embeddings](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html) interface will allow your embeddings to be utilized in existing `LangChain` abstractions (e.g., as the embeddings for a particular [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) or cached using [CacheBackedEmbeddings](/docs/how_to/caching_embeddings/)).\n",
"\n",
"## Interface\n",
"\n",
@@ -38,14 +28,26 @@
"\n",
"These methods ensure that your embedding model can be integrated seamlessly into the LangChain framework, providing both synchronous and asynchronous capabilities for scalability and performance optimization.\n",
"\n",
"\n",
":::{.callout-note}\n",
"`embed_documents` takes in a list of plain text, not a list of LangChain `Document` objects. The name of this method\n",
"may change in future versions of LangChain.\n",
"`Embeddings` do not currently implement the [Runnable](/docs/concepts/runnables/) interface and are also **not** instances of pydantic `BaseModel`.\n",
":::\n",
"\n",
"### Embedding queries vs documents\n",
"\n",
"The `embed_query` and `embed_documents` methods are required. These methods both operate\n",
"on string inputs (the accessing of `Document.page_content` attributes) is handled\n",
"by the vector store using the embedding model for legacy reasons.\n",
"\n",
"`embed_query` takes in a single string and returns a single embedding as a list of floats.\n",
"If your model has different modes for embedding queries vs the underlying documents, you can\n",
"implement this method to handle that. \n",
"\n",
"`embed_documents` takes in a list of strings and returns a list of embeddings as a list of lists of floats.\n",
"\n",
":::{.callout-important}\n",
"`Embeddings` do not currently implement the `Runnable` interface and are also **not** instances of pydantic `BaseModel`.\n",
":::{.callout-note}\n",
"`embed_documents` takes in a list of plain text, not a list of LangChain `Document` objects. The name of this method\n",
"may change in future versions of LangChain.\n",
":::"
]
},
@@ -56,7 +58,7 @@
"source": [
"## Implementation\n",
"\n",
"As an example, we'll implement a simple embeddings model that will count the characters in the text and generate a fixed size vector containing the character counts. The model will be case insensitive, and either count the characters from a-z or only the vowels (a, e, i, o, u). This model is for illustrative purposes only."
"As an example, we'll implement a simple embeddings model that returns a constant vector. The model will be case insensitive, and either count the characters from a-z or only the vowels (a, e, i, o, u). This model is for illustrative purposes only."
]
},
{
@@ -66,83 +68,92 @@
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from typing import List\n",
"\n",
"from langchain_core.embeddings import Embeddings\n",
"\n",
"\n",
"class CharCountEmbeddings(Embeddings):\n",
" \"\"\"Embedding model that counts occurrences of characters in text.\n",
"class ParrotLinkEmbeddings(Embeddings):\n",
" \"\"\"ParrotLink embedding model integration.\n",
"\n",
" When contributing an implementation to LangChain, carefully document\n",
" the embedding model including the initialization parameters, include\n",
" an example of how to initialize the model and include any relevant\n",
" links to the underlying models documentation or API.\n",
" # TODO: Populate with relevant params.\n",
" Key init args — completion params:\n",
" model: str\n",
" Name of ParrotLink model to use.\n",
"\n",
" Example:\n",
" See full list of supported init args and their descriptions in the params section.\n",
"\n",
" # TODO: Replace with relevant init params.\n",
" Instantiate:\n",
" .. code-block:: python\n",
"\n",
" from langchain_community.embeddings import CharCountEmbeddings\n",
" from langchain_parrot_link import ParrotLinkEmbeddings\n",
"\n",
" embeddings = ChatCountEmbeddings(only_vowels=True)\n",
" print(embeddings.embed_documents([\"Hello world\", \"Test\"]))\n",
" print(embeddings.embed_query(\"Quick Brown Fox\"))\n",
" \"\"\"\n",
" embed = ParrotLinkEmbeddings(\n",
" model=\"...\",\n",
" # api_key=\"...\",\n",
" # other params...\n",
" )\n",
"\n",
" def __init__(self, *, only_vowels: bool = False) -> None:\n",
" \"\"\"Initialize the embedding model.\n",
" Embed single text:\n",
" .. code-block:: python\n",
"\n",
" Args:\n",
" only_vowels: If True, the embedding will count only the\n",
" vowels (a, e, i, o, u) and produce a 5-dimensional vector.\n",
" If False, counts all lowercase alphabetic characters,\n",
" producing a 26-dimensional vector.\n",
" \"\"\"\n",
" input_text = \"The meaning of life is 42\"\n",
" embed.embed_query(input_text)\n",
"\n",
" self.only_vowels = only_vowels\n",
" .. code-block:: python\n",
"\n",
" # TODO: Example output.\n",
"\n",
" # TODO: Delete if token-level streaming isn't supported.\n",
" Embed multiple text:\n",
" .. code-block:: python\n",
"\n",
" input_texts = [\"Document 1...\", \"Document 2...\"]\n",
" embed.embed_documents(input_texts)\n",
"\n",
" .. code-block:: python\n",
"\n",
" # TODO: Example output.\n",
"\n",
" # TODO: Delete if native async isn't supported.\n",
" Async:\n",
" .. code-block:: python\n",
"\n",
" await embed.aembed_query(input_text)\n",
"\n",
" # multiple:\n",
" # await embed.aembed_documents(input_texts)\n",
"\n",
" .. code-block:: python\n",
"\n",
" # TODO: Example output.\n",
"\n",
" \"\"\"\n",
"\n",
" def __init__(self, model: str):\n",
" self.model = model\n",
"\n",
" def embed_documents(self, texts: List[str]) -> List[List[float]]:\n",
" \"\"\"Embed multiple documents by counting specific character sets.\"\"\"\n",
" return [self._embed_text(text) for text in texts]\n",
" \"\"\"Embed search docs.\"\"\"\n",
" return [[0.5, 0.6, 0.7] for _ in texts]\n",
"\n",
" def embed_query(self, text: str) -> List[float]:\n",
" \"\"\"Embed a single query by counting specific character sets.\"\"\"\n",
" return self._embed_text(text)\n",
"\n",
" def _embed_text(self, text: str) -> List[float]:\n",
" \"\"\"Helper function to create a character count vector from text.\"\"\"\n",
" text = text.lower() # Normalize text to lowercase for case insensitivity.\n",
" count = Counter(text)\n",
" if self.only_vowels:\n",
" # Embed only vowels\n",
" vowels = \"aeiou\"\n",
" return [count.get(vowel, 0) for vowel in vowels]\n",
" else:\n",
" # Embed all letters from 'a' to 'z'\n",
" return [count.get(chr(i), 0) for i in range(ord(\"a\"), ord(\"z\") + 1)]\n",
"\n",
" # The async methods are optional.\n",
" # Delete them if you do not have an actual async imlementation.\n",
" async def aembed_documents(self, texts: List[str]) -> List[List[float]]:\n",
" \"\"\"Asynchronous embed search docs.\"\"\"\n",
" # This implementation is only for illustrative purposes.\n",
" # If you're connecting to an API, you should provide\n",
" # an actual async implementation (e.g., using httpx AsyncClient\n",
" # https://www.python-httpx.org/async/).\n",
" # If you do not have an actual async implementation, please\n",
" # DELETE this method as LangChain already provides a first pass\n",
" # optimization which involves delegating to the sync method.\n",
" # If you do not have a native async implementation, just delete this\n",
" # method. LangChain basically does this\n",
" return [self._embed_text(text) for text in texts]\n",
"\n",
" async def aembed_query(self, text: str) -> List[float]:\n",
" \"\"\"Asynchronous embed query text.\"\"\"\n",
" # See comment above for the aembed_documents regarding\n",
" # native async implementation\n",
" return self._embed_text(text)"
" \"\"\"Embed query text.\"\"\"\n",
" return self.embed_documents([text])[0]\n",
"\n",
" # optional: add custom async implementations here\n",
" # you can also delete these, and the base class will\n",
" # use the default implementation, which calls the sync\n",
" # version in an async executor:\n",
"\n",
" # async def aembed_documents(self, texts: List[str]) -> List[List[float]]:\n",
" # \"\"\"Asynchronous Embed search docs.\"\"\"\n",
" # ...\n",
"\n",
" # async def aembed_query(self, text: str) -> List[float]:\n",
" # \"\"\"Asynchronous Embed query text.\"\"\"\n",
" # ..."
]
},
{
@@ -163,15 +174,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[[1, 1, 0, 0, 0], [0, 3, 0, 0, 0], [0, 1, 0, 1, 0], [0, 0, 0, 1, 0]]\n",
"[0, 4, 0, 0, 0]\n"
"[[0.5, 0.6, 0.7], [0.5, 0.6, 0.7]]\n",
"[0.5, 0.6, 0.7]\n"
]
}
],
"source": [
"embeddings = CharCountEmbeddings(only_vowels=True)\n",
"print(embeddings.embed_documents([\"abce\", \"eee\", \"hello\", \"fox\"]))\n",
"print(embeddings.embed_query(\"eeee\"))"
"embeddings = ParrotLinkEmbeddings(\"test-model\")\n",
"print(embeddings.embed_documents([\"Hello\", \"world\"]))\n",
"print(embeddings.embed_query(\"Hello\"))"
]
},
{
@@ -181,25 +192,9 @@
"source": [
"## Contributing\n",
"\n",
"We welcome contributions of Embedding models to the LangChain code base!\n",
"\n",
"Here's a checklist to help make sure your contribution gets added to LangChain:\n",
"\n",
"Documentation:\n",
"\n",
"* The model contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
"* The class doc-string for the model contains a link to the model API if the model is powered by a service.\n",
"\n",
"Tests:\n",
"\n",
"* [ ] Add an integration tests to test the integration with the API or model.\n",
"\n",
"Optimizations:\n",
"We welcome contributions of Embedding models to the LangChain code base.\n",
"\n",
"If your implementation is an integration with an `API` consider providing async native support (e.g., via httpx AsyncClient).\n",
" \n",
"* [ ] Provided a native async of `aembed_documents`\n",
"* [ ] Provided a native async of `aembed_query`"
"If you aim to contribute an embedding model for a new provider (e.g., with a new set of dependencies or SDK), we encourage you to publish your implementation in a separate `langchain-*` integration package. This will enable you to appropriately manage dependencies and version your package. Please refer to our [contributing guide](/docs/contributing/how_to/integrations/) for a walkthrough of this process."
]
}
],
@@ -219,7 +214,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.10.4"
}
},
"nbformat": 4,