consolidate custom embeddings content

ccurme committed Dec 16, 2024
1 parent d44af41 commit 58c6163
Showing 2 changed files with 96 additions and 130 deletions.
33 changes: 2 additions & 31 deletions docs/docs/contributing/how_to/integrations/package.mdx
@@ -291,37 +291,8 @@ import VectorstoreSource from '../../../../src/theme/integration_template/integr
Embeddings are used to convert `str` objects from `Document.page_content` fields
into a vector representation (a list of floats).

Your embeddings class must inherit from the [Embeddings](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html#langchain_core.embeddings.embeddings.Embeddings)
base class. This interface has 5 methods that can be implemented.

| Method/Property | Description |
|------------------------ |------------------------------------------------------|
| `__init__` | Initialize the embeddings object. (optional) |
| `embed_query`           | Embed a single query text. (required)                  |
| `embed_documents`       | Embed a list of document texts. (required)             |
| `aembed_query`          | Asynchronously embed a single query text. (optional)   |
| `aembed_documents`      | Asynchronously embed a list of document texts. (optional) |

### Constructor

The `__init__` constructor is optional but common. It can be used to set up any necessary attributes
that a user can pass in when initializing the embeddings object. Common attributes include:

- `model` - the id of the model to use for embeddings

### Embedding queries vs documents

The `embed_query` and `embed_documents` methods are required. Both methods operate
on string inputs; the accessing of `Document.page_content` attributes is handled
by the `VectorStore` using the embedding model, for legacy reasons.

`embed_query` takes in a single string and returns a single embedding as a list of floats.
If your model has different modes for embedding queries vs the underlying documents, you can
implement this method to handle that.

`embed_documents` takes in a list of strings and returns a list of embeddings as a list of lists of floats.
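
For reference, a minimal sketch of these two required methods might look like the following. The class name, constructor argument, and placeholder vectors below are purely illustrative and not part of any real integration:

```python
from typing import List

from langchain_core.embeddings import Embeddings


class ExampleEmbeddings(Embeddings):
    """Toy embeddings model used only to illustrate the interface."""

    def __init__(self, model: str = "example-model") -> None:
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        """Embed a list of document texts."""
        # A real implementation would call the underlying model or API here.
        return [[0.0, 0.0, 0.0] for _ in texts]

    def embed_query(self, text: str) -> List[float]:
        """Embed a single query text."""
        # Delegating to embed_documents is a common pattern when the model
        # does not distinguish between queries and documents.
        return self.embed_documents([text])[0]
```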

### Implementation
Refer to the [Custom Embeddings Guide](/docs/how_to/custom_embeddings) for
details on a starter embeddings [implementation](/docs/how_to/custom_embeddings/#implementation).

You can start from the following template or langchain-cli command:

193 changes: 94 additions & 99 deletions docs/docs/how_to/custom_embeddings.ipynb
@@ -1,15 +1,5 @@
{
"cells": [
{
"cell_type": "raw",
"id": "a6c3a6e0-a94f-4d40-9022-2c7ac2380f6d",
"metadata": {},
"source": [
"---\n",
"sidebar_position: 0\n",
"---"
]
},
{
"attachments": {},
"cell_type": "markdown",
@@ -20,7 +10,7 @@
"\n",
"We'll explore how to create a custom embedding model using LangChain's Embeddings interface. Embeddings are critical in natural language processing applications as they convert text into a numerical form that algorithms can understand, thereby enabling a wide range of applications such as similarity search, text classification, and clustering.\n",
"\n",
"Implementing embeddings using the standard `Embeddings` interface will allow your embeddings to be utilized in existing `LangChain` abstractions (e.g., as the embeddings for a particular `Vectorstore` or cached using `CacheBackedEmbeddings`).\n",
"Implementing embeddings using the standard [Embeddings](https://python.langchain.com/api_reference/core/embeddings/langchain_core.embeddings.embeddings.Embeddings.html) interface will allow your embeddings to be utilized in existing `LangChain` abstractions (e.g., as the embeddings for a particular [VectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) or cached using [CacheBackedEmbeddings](/docs/how_to/caching_embeddings/)).\n",
"\n",
"## Interface\n",
"\n",
@@ -38,14 +28,26 @@
"\n",
"These methods ensure that your embedding model can be integrated seamlessly into the LangChain framework, providing both synchronous and asynchronous capabilities for scalability and performance optimization.\n",
"\n",
"\n",
":::{.callout-note}\n",
"`embed_documents` takes in a list of plain text, not a list of LangChain `Document` objects. The name of this method\n",
"may change in future versions of LangChain.\n",
"`Embeddings` do not currently implement the [Runnable](/docs/concepts/runnables/) interface and are also **not** instances of pydantic `BaseModel`.\n",
":::\n",
"\n",
"### Embedding queries vs documents\n",
"\n",
"The `embed_query` and `embed_documents` methods are required. These methods both operate\n",
"on string inputs (the accessing of `Document.page_content` attributes) is handled\n",
"by the vector store using the embedding model for legacy reasons.\n",
"\n",
"`embed_query` takes in a single string and returns a single embedding as a list of floats.\n",
"If your model has different modes for embedding queries vs the underlying documents, you can\n",
"implement this method to handle that. \n",
"\n",
"`embed_documents` takes in a list of strings and returns a list of embeddings as a list of lists of floats.\n",
"\n",
":::{.callout-important}\n",
"`Embeddings` do not currently implement the `Runnable` interface and are also **not** instances of pydantic `BaseModel`.\n",
":::{.callout-note}\n",
"`embed_documents` takes in a list of plain text, not a list of LangChain `Document` objects. The name of this method\n",
"may change in future versions of LangChain.\n",
":::"
]
},
@@ -56,7 +58,7 @@
"source": [
"## Implementation\n",
"\n",
"As an example, we'll implement a simple embeddings model that will count the characters in the text and generate a fixed size vector containing the character counts. The model will be case insensitive, and either count the characters from a-z or only the vowels (a, e, i, o, u). This model is for illustrative purposes only."
"As an example, we'll implement a simple embeddings model that returns a constant vector. The model will be case insensitive, and either count the characters from a-z or only the vowels (a, e, i, o, u). This model is for illustrative purposes only."
]
},
{
@@ -66,83 +68,92 @@
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from typing import List\n",
"\n",
"from langchain_core.embeddings import Embeddings\n",
"\n",
"\n",
"class CharCountEmbeddings(Embeddings):\n",
" \"\"\"Embedding model that counts occurrences of characters in text.\n",
"class ParrotLinkEmbeddings(Embeddings):\n",
" \"\"\"ParrotLink embedding model integration.\n",
"\n",
" When contributing an implementation to LangChain, carefully document\n",
" the embedding model including the initialization parameters, include\n",
" an example of how to initialize the model and include any relevant\n",
" links to the underlying models documentation or API.\n",
" # TODO: Populate with relevant params.\n",
" Key init args — completion params:\n",
" model: str\n",
" Name of ParrotLink model to use.\n",
"\n",
" Example:\n",
" See full list of supported init args and their descriptions in the params section.\n",
"\n",
" # TODO: Replace with relevant init params.\n",
" Instantiate:\n",
" .. code-block:: python\n",
"\n",
" from langchain_community.embeddings import CharCountEmbeddings\n",
" from langchain_parrot_link import ParrotLinkEmbeddings\n",
"\n",
" embeddings = ChatCountEmbeddings(only_vowels=True)\n",
" print(embeddings.embed_documents([\"Hello world\", \"Test\"]))\n",
" print(embeddings.embed_query(\"Quick Brown Fox\"))\n",
" \"\"\"\n",
" embed = ParrotLinkEmbeddings(\n",
" model=\"...\",\n",
" # api_key=\"...\",\n",
" # other params...\n",
" )\n",
"\n",
" def __init__(self, *, only_vowels: bool = False) -> None:\n",
" \"\"\"Initialize the embedding model.\n",
" Embed single text:\n",
" .. code-block:: python\n",
"\n",
" Args:\n",
" only_vowels: If True, the embedding will count only the\n",
" vowels (a, e, i, o, u) and produce a 5-dimensional vector.\n",
" If False, counts all lowercase alphabetic characters,\n",
" producing a 26-dimensional vector.\n",
" \"\"\"\n",
" input_text = \"The meaning of life is 42\"\n",
" embed.embed_query(input_text)\n",
"\n",
" self.only_vowels = only_vowels\n",
" .. code-block:: python\n",
"\n",
" # TODO: Example output.\n",
"\n",
" # TODO: Delete if token-level streaming isn't supported.\n",
" Embed multiple text:\n",
" .. code-block:: python\n",
"\n",
" input_texts = [\"Document 1...\", \"Document 2...\"]\n",
" embed.embed_documents(input_texts)\n",
"\n",
" .. code-block:: python\n",
"\n",
" # TODO: Example output.\n",
"\n",
" # TODO: Delete if native async isn't supported.\n",
" Async:\n",
" .. code-block:: python\n",
"\n",
" await embed.aembed_query(input_text)\n",
"\n",
" # multiple:\n",
" # await embed.aembed_documents(input_texts)\n",
"\n",
" .. code-block:: python\n",
"\n",
" # TODO: Example output.\n",
"\n",
" \"\"\"\n",
"\n",
" def __init__(self, model: str):\n",
" self.model = model\n",
"\n",
" def embed_documents(self, texts: List[str]) -> List[List[float]]:\n",
" \"\"\"Embed multiple documents by counting specific character sets.\"\"\"\n",
" return [self._embed_text(text) for text in texts]\n",
" \"\"\"Embed search docs.\"\"\"\n",
" return [[0.5, 0.6, 0.7] for _ in texts]\n",
"\n",
" def embed_query(self, text: str) -> List[float]:\n",
" \"\"\"Embed a single query by counting specific character sets.\"\"\"\n",
" return self._embed_text(text)\n",
"\n",
" def _embed_text(self, text: str) -> List[float]:\n",
" \"\"\"Helper function to create a character count vector from text.\"\"\"\n",
" text = text.lower() # Normalize text to lowercase for case insensitivity.\n",
" count = Counter(text)\n",
" if self.only_vowels:\n",
" # Embed only vowels\n",
" vowels = \"aeiou\"\n",
" return [count.get(vowel, 0) for vowel in vowels]\n",
" else:\n",
" # Embed all letters from 'a' to 'z'\n",
" return [count.get(chr(i), 0) for i in range(ord(\"a\"), ord(\"z\") + 1)]\n",
"\n",
" # The async methods are optional.\n",
" # Delete them if you do not have an actual async imlementation.\n",
" async def aembed_documents(self, texts: List[str]) -> List[List[float]]:\n",
" \"\"\"Asynchronous embed search docs.\"\"\"\n",
" # This implementation is only for illustrative purposes.\n",
" # If you're connecting to an API, you should provide\n",
" # an actual async implementation (e.g., using httpx AsyncClient\n",
" # https://www.python-httpx.org/async/).\n",
" # If you do not have an actual async implementation, please\n",
" # DELETE this method as LangChain already provides a first pass\n",
" # optimization which involves delegating to the sync method.\n",
" # If you do not have a native async implementation, just delete this\n",
" # method. LangChain basically does this\n",
" return [self._embed_text(text) for text in texts]\n",
"\n",
" async def aembed_query(self, text: str) -> List[float]:\n",
" \"\"\"Asynchronous embed query text.\"\"\"\n",
" # See comment above for the aembed_documents regarding\n",
" # native async implementation\n",
" return self._embed_text(text)"
" \"\"\"Embed query text.\"\"\"\n",
" return self.embed_documents([text])[0]\n",
"\n",
" # optional: add custom async implementations here\n",
" # you can also delete these, and the base class will\n",
" # use the default implementation, which calls the sync\n",
" # version in an async executor:\n",
"\n",
" # async def aembed_documents(self, texts: List[str]) -> List[List[float]]:\n",
" # \"\"\"Asynchronous Embed search docs.\"\"\"\n",
" # ...\n",
"\n",
" # async def aembed_query(self, text: str) -> List[float]:\n",
" # \"\"\"Asynchronous Embed query text.\"\"\"\n",
" # ..."
]
},
{
@@ -163,15 +174,15 @@
"name": "stdout",
"output_type": "stream",
"text": [
"[[1, 1, 0, 0, 0], [0, 3, 0, 0, 0], [0, 1, 0, 1, 0], [0, 0, 0, 1, 0]]\n",
"[0, 4, 0, 0, 0]\n"
"[[0.5, 0.6, 0.7], [0.5, 0.6, 0.7]]\n",
"[0.5, 0.6, 0.7]\n"
]
}
],
"source": [
"embeddings = CharCountEmbeddings(only_vowels=True)\n",
"print(embeddings.embed_documents([\"abce\", \"eee\", \"hello\", \"fox\"]))\n",
"print(embeddings.embed_query(\"eeee\"))"
"embeddings = ParrotLinkEmbeddings(\"test-model\")\n",
"print(embeddings.embed_documents([\"Hello\", \"world\"]))\n",
"print(embeddings.embed_query(\"Hello\"))"
]
},
{
@@ -181,25 +192,9 @@
"source": [
"## Contributing\n",
"\n",
"We welcome contributions of Embedding models to the LangChain code base!\n",
"\n",
"Here's a checklist to help make sure your contribution gets added to LangChain:\n",
"\n",
"Documentation:\n",
"\n",
"* The model contains doc-strings for all initialization arguments, as these will be surfaced in the [API Reference](https://api.python.langchain.com/en/stable/langchain_api_reference.html).\n",
"* The class doc-string for the model contains a link to the model API if the model is powered by a service.\n",
"\n",
"Tests:\n",
"\n",
"* [ ] Add an integration tests to test the integration with the API or model.\n",
"\n",
"Optimizations:\n",
"We welcome contributions of Embedding models to the LangChain code base.\n",
"\n",
"If your implementation is an integration with an `API` consider providing async native support (e.g., via httpx AsyncClient).\n",
" \n",
"* [ ] Provided a native async of `aembed_documents`\n",
"* [ ] Provided a native async of `aembed_query`"
"If you aim to contribute an embedding model for a new provider (e.g., with a new set of dependencies or SDK), we encourage you to publish your implementation in a separate `langchain-*` integration package. This will enable you to appropriately manage dependencies and version your package. Please refer to our [contributing guide](/docs/contributing/how_to/integrations/) for a walkthrough of this process."
]
}
],
@@ -219,7 +214,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.4"
"version": "3.10.4"
}
},
"nbformat": 4,