
INSTRUCTOR models not working with sentence-transformers via langchain #2567

Open · BBC-Esq opened this issue Apr 2, 2024 · 5 comments

BBC-Esq commented Apr 2, 2024

This is a challenging issue that I've been working on. First, here is my entire script:

SCRIPT
import shutil
import yaml
import gc
from langchain_community.docstore.document import Document
from langchain_community.embeddings import HuggingFaceInstructEmbeddings, HuggingFaceEmbeddings, HuggingFaceBgeEmbeddings
from langchain_community.vectorstores import TileDB
from document_processor import load_documents, split_documents
from loader_images import specify_image_loader
import torch
from utilities import validate_symbolic_links, my_cprint
from pathlib import Path
import os
import logging
from PySide6.QtCore import QDir
import time
import pickle

logging.basicConfig(
    level=logging.INFO,
    format='%(name)s - %(pathname)s:%(lineno)s - %(funcName)s'
)
logging.getLogger('chromadb.db.duckdb').setLevel(logging.WARNING)
logging.getLogger('sentence_transformers').setLevel(logging.WARNING)

class CreateVectorDB:
    def __init__(self, database_name):
        self.ROOT_DIRECTORY = Path(__file__).resolve().parent
        self.SOURCE_DIRECTORY = self.ROOT_DIRECTORY / "Docs_for_DB"
        self.PERSIST_DIRECTORY = self.ROOT_DIRECTORY / "Vector_DB" / database_name
        self.SAVE_JSON_DIRECTORY = self.ROOT_DIRECTORY / "Docs_for_DB" / database_name

    def load_config(self, root_directory):
        with open(root_directory / "config.yaml", 'r', encoding='utf-8') as stream:
            return yaml.safe_load(stream)
    
    def initialize_vector_model(self, embedding_model_name, config_data):
        EMBEDDING_MODEL_NAME = config_data.get("EMBEDDING_MODEL_NAME")
        compute_device = config_data['Compute_Device']['database_creation']
        model_kwargs = {"device": compute_device}
        encode_kwargs = {'normalize_embeddings': False, 'batch_size': 8}

        if compute_device.lower() == 'cpu':
            encode_kwargs['batch_size'] = 2
        else:
            batch_size_mapping = {
                'sentence-t5-xxl': 1,
                ('instructor-xl', 'sentence-t5-xl'): 2,
                'instructor-large': 3,
                ('jina-embedding-l', 'bge-large', 'gte-large', 'roberta-large', 'mxbai-embed-large-v1'): 4,
                'jina-embedding-s': 9,
                ('bge-small', 'gte-small'): 10,
                ('MiniLM',): 20,
            }

            for key, value in batch_size_mapping.items():
                if isinstance(key, tuple):
                    if any(model_name_part in EMBEDDING_MODEL_NAME for model_name_part in key):
                        encode_kwargs['batch_size'] = value
                        break
                else:
                    if key in EMBEDDING_MODEL_NAME:
                        encode_kwargs['batch_size'] = value
                        break
                        
            my_cprint(f"Vector model initialized with a batch size of {encode_kwargs['batch_size']}", "blue")

        if "instructor" in embedding_model_name:
            embed_instruction = config_data['embedding-models']['instructor'].get('embed_instruction')
            query_instruction = config_data['embedding-models']['instructor'].get('query_instruction')
            encode_kwargs['show_progress_bar'] = True
            
            model = HuggingFaceInstructEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                embed_instruction=embed_instruction,
                query_instruction=query_instruction,
                encode_kwargs=encode_kwargs
            )
        elif "bge" in embedding_model_name:
            query_instruction = config_data['embedding-models']['bge'].get('query_instruction')
            encode_kwargs['show_progress_bar'] = True
            
            model = HuggingFaceBgeEmbeddings(
                model_name=embedding_model_name,
                model_kwargs=model_kwargs,
                query_instruction=query_instruction,
                encode_kwargs=encode_kwargs
            )
        else:
            model = HuggingFaceEmbeddings(
                model_name=embedding_model_name,
                show_progress=True,
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs
            )

        return model, encode_kwargs

    def create_database(self, texts, embeddings):
        my_cprint("Creating vectors and database...\n\n NOTE:\n\nNOTE: The progress bar only relates to computing vectors, not inserting them into the database.  Rest assured, after it reaches 100% it is still working unless you get an error message.\n", "yellow")

        start_time = time.time()

        if not self.PERSIST_DIRECTORY.exists():
            self.PERSIST_DIRECTORY.mkdir(parents=True, exist_ok=True)

        db = TileDB.from_documents(
            documents=texts,
            embedding=embeddings,
            index_uri=str(self.PERSIST_DIRECTORY),
            allow_dangerous_deserialization=True,
            metric="euclidean",
            index_type="FLAT",
        )

        print("Database created.")

        end_time = time.time()
        elapsed_time = end_time - start_time

        my_cprint("Database saved.", "cyan")
        print(f"Creation of vectors and inserting into the database took {elapsed_time:.2f} seconds.")
        
    def save_documents_to_json(self, json_docs_to_save):
        self.SAVE_JSON_DIRECTORY.mkdir(parents=True, exist_ok=True)

        for document in json_docs_to_save:
            document_hash = document.metadata.get('hash', None)
            if document_hash:
                json_filename = f"{document_hash}.json"
                json_file_path = self.SAVE_JSON_DIRECTORY / json_filename
                
                actual_file_path = document.metadata.get('file_path')
                if os.path.islink(actual_file_path):
                    resolved_path = os.path.realpath(actual_file_path)
                    document.metadata['file_path'] = resolved_path

                document_json = document.json(indent=4)
                
                with open(json_file_path, 'w', encoding='utf-8') as json_file:
                    json_file.write(document_json)
            else:
                print("Warning: Document missing 'hash' in metadata. Skipping JSON creation.")
    
    def load_audio_documents(self, source_dir: Path = None) -> list:
        if source_dir is None:
            source_dir = self.SOURCE_DIRECTORY
        json_paths = [f for f in source_dir.iterdir() if f.suffix.lower() == '.json']
        docs = []

        for json_path in json_paths:
            try:
                with open(json_path, 'r', encoding='utf-8') as json_file:
                    json_str = json_file.read()
                    doc = Document.parse_raw(json_str)
                    docs.append(doc)
            except Exception as e:
                my_cprint(f"Error loading {json_path}: {e}", "red")

        return docs
    
    def clear_docs_for_db_folder(self):
        for item in self.SOURCE_DIRECTORY.iterdir():
            if item.is_file() or item.is_symlink():
                try:
                    item.unlink()
                except Exception as e:
                    print(f"Failed to delete {item}: {e}")
    
    def run(self):
        config_data = self.load_config(self.ROOT_DIRECTORY)
        EMBEDDING_MODEL_NAME = config_data.get("EMBEDDING_MODEL_NAME")
        
        # load non-image/non-audio documents
        documents = load_documents(self.SOURCE_DIRECTORY)
        
        # load image documents
        image_documents = specify_image_loader()
        documents.extend(image_documents)
        
        json_docs_to_save = documents
        
        # load audio documents
        audio_documents = self.load_audio_documents()  # Now calling the method internally
        documents.extend(audio_documents)
        if len(audio_documents) > 0:
            print(f"Loaded {len(audio_documents)} audio transcription(s)...")

        # split each document in the list of documents
        texts = split_documents(documents)

        # initialize vector model
        embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)

        # create database
        self.create_database(texts, embeddings)
        
        self.save_documents_to_json(json_docs_to_save)
        
        del embeddings.client
        del embeddings
        torch.cuda.empty_cache()
        gc.collect()
        my_cprint("Embedding model removed from memory.", "red")
        
        # clear ingest folder
        self.clear_docs_for_db_folder()
        print("Cleared all files and symlinks in Docs_for_DB folder.")

This works fine when using sentence-transformers==2.2.2. However, when I upgrade to sentence-transformers==2.6.1 I get this error:

ERROR
Traceback (most recent call last):
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\gui_tabs_databases.py", line 23, in run
    create_vector_db.run() # calls database_interactions.py
    ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 193, in run
    embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 72, in initialize_vector_model
    model = HuggingFaceInstructEmbeddings(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 153, in __init__
    self.client = INSTRUCTOR(
                  ^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 191, in __init__
    modules = self._load_sbert_model(
              ^^^^^^^^^^^^^^^^^^^^^^^
TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'

I've verified that a BGE model (via HuggingFaceBgeEmbeddings), a GTE model (via HuggingFaceEmbeddings), and all-mpnet-base-v2 (via HuggingFaceEmbeddings) all work fine. I've tried every which way to get the instructor models to work...

Since I really like the "instructor" models in my program, this forces me to either stay at sentence-transformers==2.2.2 or abandon them in order to upgrade and use newer models (e.g. mxbai-embed-large-v1). I wouldn't normally ask, but I've spent dozens of hours trying to solve this, ranging from using SentenceTransformers directly per the API on your website to custom wrappers, etc.

Can anyone help, @tomaarsen in particular if he has time? I don't know if this is an issue with sentence-transformers itself, with its integration with HuggingFaceInstructEmbeddings from Langchain, or just with my code... Thanks in advance!

[EDIT] I am aware that Instructor models are unique in that the prompt is not included in pooling, as stated in your website's instructions/examples, and I DID examine SentenceTransformers itself and saw where you took that into account:

        if model_name_or_path in ("hkunlp/instructor-base", "hkunlp/instructor-large", "hkunlp/instructor-xl"):
            self.set_pooling_include_prompt(include_prompt=False)
        elif (
            model_name_or_path
            and "/" in model_name_or_path
            and "instructor" in model_name_or_path.split("/")[1].lower()
        ):
            if any([module.include_prompt for module in self if isinstance(module, Pooling)]):
                logger.warning(
                    "Instructor models require `include_prompt=False` in the pooling configuration. "
                    "Either update the model configuration or call `model.set_pooling_include_prompt(False)` after loading the model."
                )

(taken from version 2.6.0)
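
If I'm reading that right, the manual fix for an instructor-style model that is not on the hard-coded list would be something like this (my own sketch; the repo id is made up):

from sentence_transformers import SentenceTransformer

# Hypothetical repo id; any instructor-style model not on the hard-coded list above.
model = SentenceTransformer("my-org/my-instructor-model")
model.set_pooling_include_prompt(include_prompt=False)  # exclude the instruction from pooling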

I simply can't figure out why HuggingFaceInstructEmbeddings isn't working while HuggingFaceEmbeddings and HuggingFaceBgeEmbeddings work fine when I pip install any sentence-transformers version above 2.2.2...

This is literally the only issue preventing my program from upgrading the crucial dependency that is sentence-transformers... Thanks again, and love the repo!

tomaarsen (Collaborator) commented

Hello!

The issue originates in https://github.com/xlang-ai/instructor-embedding, which was created explicitly for Sentence Transformers 2.2.2. They haven't kept their code up to date with the recent Sentence Transformer updates, hence the failures. This is why HuggingFaceInstructEmbeddings fails while HuggingFaceEmbeddings and HuggingFaceBgeEmbeddings work.
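
To illustrate what's going wrong, here is a deliberately simplified sketch (not the actual library code): INSTRUCTOR overrides a private SentenceTransformer method with the old 2.2.2-era signature, so it rejects keyword arguments that newer versions pass along, such as token.

# Simplified sketch of the incompatibility; class bodies reduced to the essentials.
class SentenceTransformer:
    def __init__(self, model_name_or_path, token=None):
        # Newer sentence-transformers forwards extra kwargs such as `token`:
        self._load_sbert_model(model_name_or_path, token=token)

    def _load_sbert_model(self, model_name_or_path, token=None):
        pass

class INSTRUCTOR(SentenceTransformer):
    # Override frozen at the 2.2.2-era signature -- no `token` parameter:
    def _load_sbert_model(self, model_name_or_path):
        pass

INSTRUCTOR("hkunlp/instructor-base")
# TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'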

A good solution would be to try this PR: xlang-ai/instructor-embedding#112 with:

pip install git+https://github.com/SilasMarvin/instructor-embedding.git@silas-update-for-newer-sentence-transformers

and the most recent sentence-transformers. That combination should work correctly.
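
A quick sanity check after installing that combination (the instruction string is just an example):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-base")
# Instructor models take [instruction, text] pairs:
embeddings = model.encode([["Represent the document for retrieval:", "Hello world"]])
print(embeddings.shape)  # (1, 768) for instructor-base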

  • Tom Aarsen

BBC-Esq (Author) commented Apr 4, 2024

Thanks, I checked it out. Now I'm getting the error below. My program downloads the instructor models into a specific directory rather than the default cache location (for various reasons), so I specify the path to the model rather than the Hugging Face repo ID when instantiating it. I'm guessing that's why I'm getting this error. Any clue?

Traceback (most recent call last):
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\gui_tabs_databases.py", line 23, in run
    create_vector_db.run() # calls database_interactions.py
    ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 198, in run
    embeddings, encode_kwargs = self.initialize_vector_model(EMBEDDING_MODEL_NAME, config_data)
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\database_interactions.py", line 70, in initialize_vector_model
    model = HuggingFaceInstructEmbeddings(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\langchain_community\embeddings\huggingface.py", line 158, in __init__
    self.client = INSTRUCTOR(
                  ^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\sentence_transformers\SentenceTransformer.py", line 191, in __init__
    modules = self._load_sbert_model(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\InstructorEmbedding\instructor.py", line 455, in _load_sbert_model
    model_path = snapshot_download(**download_kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\huggingface_hub\utils\_validators.py", line 111, in _inner_fn
    validate_repo_id(arg_value)
  File "D:\Scripts\ChromaDB-Plugin-for-LM-Studio\v4_3 - working\Lib\site-packages\huggingface_hub\utils\_validators.py", line 159, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': 'D:/Scripts/ChromaDB-Plugin-for-LM-Studio/v4_3 - working/Embedding_Models/hkunlp--instructor-base'. Use `repo_type` argument if needed.
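
For context, the validator in that traceback only accepts Hub-style repo ids, so a filesystem path fails before any download is even attempted. A minimal illustration (the local path is made up):

from huggingface_hub.utils import HFValidationError, validate_repo_id

validate_repo_id("hkunlp/instructor-base")  # passes: 'namespace/repo_name'
try:
    validate_repo_id("D:/Embedding_Models/hkunlp--instructor-base")  # a local path
except HFValidationError as err:
    print(err)  # Repo id must be in the form 'repo_name' or 'namespace/repo_name': ...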

BBC-Esq (Author) commented Apr 4, 2024

I resolved this error by using the Hugging Face repo id instead:

hkunlp/instructor-base

I'm guessing this is NOT a true fix, however, since I notice that the _load_sbert_model method within sentence-transformers has a parameter named model_name_or_path, implying that it should accept either a repo id or a path. Here's my code snippet:

        if "instructor" in embedding_model_name:
            encode_kwargs['show_progress_bar'] = True
            
            model = HuggingFaceInstructEmbeddings(
                model_name="hkunlp/instructor-base",
                model_kwargs=model_kwargs,
                encode_kwargs=encode_kwargs,
            )

To temporarily sidestep the issue, I simply hard-coded "hkunlp/instructor-base" instead of passing embedding_model_name, just to get to the next troubleshooting step.

IT WORKED! The database was successfully created. MOREOVER, I was able to successfully search it!

SUMMARY:

The branch at https://github.com/SilasMarvin/instructor-embedding/tree/silas-update-for-newer-sentence-transformers fixes the error TypeError: INSTRUCTOR._load_sbert_model() got an unexpected keyword argument 'token'.

Question: Are you willing to modify SentenceTransformer's _load_sbert_model method so that it works with the original InstructorEmbedding library? That would make it unnecessary to rely on SilasMarvin's modification. I only ask because InstructorEmbedding is obviously not being updated, even though it's their responsibility to do so...

It seems to me (as a layperson) that you'd simply need an intermediary function between how the InstructorEmbedding library expects to load a model and how sentence-transformers does it now; see the sketch below. This would also let HuggingFaceInstructEmbeddings from Langchain work as-is. In the interest of full disclosure, I reviewed the HuggingFaceInstructEmbeddings class within Langchain's source code and, just like InstructorEmbedding, it hasn't been updated in eons...
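
Something like the following is what I have in mind (purely a hypothetical sketch on my part, untested; PatchedInstructor is a name I made up):

import inspect
from InstructorEmbedding import INSTRUCTOR

class PatchedInstructor(INSTRUCTOR):
    def _load_sbert_model(self, *args, **kwargs):
        # Drop any keyword arguments the old 2.2.2-era override does not
        # accept, so newer SentenceTransformer versions can still call it.
        accepted = inspect.signature(INSTRUCTOR._load_sbert_model).parameters
        kwargs = {k: v for k, v in kwargs.items() if k in accepted}
        return INSTRUCTOR._load_sbert_model(self, *args, **kwargs)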

Basically, even though it's the responsibility of the InstructorEmbedding and/or Langchain folks to keep their code in line with sentence-transformers, I'm asking whether sentence-transformers would accommodate them and provide a fix in its own source code instead.

The benefit would be that Instructor models would work with newer versions of the sentence-transformers library out of the box, and people like me could still use pip install InstructorEmbedding instead of relying on a specific branch of an unofficial fork of the InstructorEmbedding repo. Doesn't hurt to ask, right?

Thanks again. Please let me know if there's a way I can contribute.

BBC-Esq (Author) commented Apr 4, 2024

Finally, regarding the error about not being able to load a model locally: I solved it by using the cache_folder parameter from Langchain, documented here:

https://api.python.langchain.com/en/latest/embeddings/langchain_community.embeddings.huggingface.HuggingFaceInstructEmbeddings.html#langchain_community.embeddings.huggingface.HuggingFaceInstructEmbeddings

I assume that this connects with the cache_folder parameter within sentence-transformers here:

https://www.sbert.net/docs/package_reference/SentenceTransformer.html

So this narrow issue, at least, seems solved. Just thought others might want to know.
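
For anyone else hitting this, my working setup looks roughly like this (paths and device values are illustrative):

from langchain_community.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-base",  # Hub repo id, not a local path
    cache_folder="D:/Embedding_Models",   # hypothetical local directory for downloads
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": False, "batch_size": 8},
)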

SilasMarvin commented

The fix for this just got merged into Instructor Embedding: xlang-ai/instructor-embedding@5cca65e
