[BUG]: Vector Search using API #2937

Open
TheNeeloy opened this issue Jan 4, 2025 · 0 comments
Labels: possible bug (Bug was reported but is not confirmed or is unable to be replicated.)

How are you running AnythingLLM?

Docker (local)

What happened?

Hi, thanks for releasing such an awesome project; it's really helping with swapping out models and providers quickly during LLM experiments. I have a question about the intended effect of the api/v1/workspace/{slug}/vector-search API endpoint.
This is based on these issues and PRs: #2811, #2812, #2815

TLDR:

When testing the new vector-search API endpoint, I found that I needed to include the document metadata in my query to retrieve the vector with distance 0. However, I thought the vector search was based purely on the page content, excluding metadata. Below I describe my environment setup, testing process, expectations, results, and questions. Thanks for your time!

Workspace and System Setup:

My AnythingLLM instance is hosted locally via Docker. It uses the default, out-of-the-box AnythingLLM embedding provider and LanceDB vector database settings. I set up a workspace using Ollama as the provider, running a llama3.2:1b LLM.

This is the response from /api/v1/workspace/{slug} (my workspace slug is testing_api):

{
  "workspace": [
    {
      "id": 7,
      "name": "testing_api",
      "slug": "testing_api",
      "vectorTag": null,
      "createdAt": "2025-01-04T01:25:26.088Z",
      "openAiTemp": 0.7,
      "openAiHistory": 20,
      "lastUpdatedAt": "2025-01-04T01:25:26.088Z",
      "openAiPrompt": "Given the following conversation, relevant context, and a follow up question, reply with an answer to the current question the user is asking. Return only your response to the question given the above information following the users instructions as needed.",
      "similarityThreshold": 0.25,
      "chatProvider": "ollama",
      "chatModel": "llama3.2:1b",
      "topN": 4,
      "chatMode": "chat",
      "pfpFilename": null,
      "agentProvider": null,
      "agentModel": null,
      "queryRefusalResponse": "There is no relevant information in this workspace to answer your query.",
      "documents": [
        {
          "id": 9,
          "docId": "efd8d182-048e-41d4-aa61-3dcb0c98fff2",
          "filename": "raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
          "docpath": "testing_temp_folder/raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"cab0d2bf-4cf4-4020-a5ff-233e02c5067f\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:17:35 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:17:36.080Z",
          "lastUpdatedAt": "2025-01-04T06:17:36.080Z"
        },
        {
          "id": 10,
          "docId": "cddb338d-d6ba-4636-a229-b5be995cef93",
          "filename": "raw-long_file-efe0f77c-db8f-4b79-9531-c797621f251e.json",
          "docpath": "testing_temp_folder/raw-long_file-efe0f77c-db8f-4b79-9531-c797621f251e.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"efe0f77c-db8f-4b79-9531-c797621f251e\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:17:36 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:17:36.876Z",
          "lastUpdatedAt": "2025-01-04T06:17:36.876Z"
        },
        {
          "id": 11,
          "docId": "ea926340-3c71-46df-8e1e-72e1629b5de0",
          "filename": "raw-pirate-a17387aa-8307-4e92-b07a-8dd92d26b68e.json",
          "docpath": "testing_temp_folder/raw-pirate-a17387aa-8307-4e92-b07a-8dd92d26b68e.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"a17387aa-8307-4e92-b07a-8dd92d26b68e\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:21:05 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:21:06.143Z",
          "lastUpdatedAt": "2025-01-04T06:21:06.143Z"
        },
        {
          "id": 12,
          "docId": "89e8bb51-8ff6-433f-ac56-6d12d5f76158",
          "filename": "raw-long_file-f83a5abe-e69c-40c8-907f-2a2fece3e3b1.json",
          "docpath": "testing_temp_folder/raw-long_file-f83a5abe-e69c-40c8-907f-2a2fece3e3b1.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"f83a5abe-e69c-40c8-907f-2a2fece3e3b1\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:21:06 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:21:06.927Z",
          "lastUpdatedAt": "2025-01-04T06:21:06.927Z"
        },
        {
          "id": 13,
          "docId": "e11d5f8f-1029-4881-a8cf-62ca09adc97f",
          "filename": "raw-pirate-9c2c0cdf-246e-485e-a1fa-15bca9782b54.json",
          "docpath": "testing_temp_folder/raw-pirate-9c2c0cdf-246e-485e-a1fa-15bca9782b54.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"9c2c0cdf-246e-485e-a1fa-15bca9782b54\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:24:00 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:24:00.807Z",
          "lastUpdatedAt": "2025-01-04T06:24:00.807Z"
        },
        {
          "id": 14,
          "docId": "d076c9bf-84da-4894-b8c4-d734c375ec8b",
          "filename": "raw-long_file-703df755-c8b4-4449-9918-afe877621d4f.json",
          "docpath": "testing_temp_folder/raw-long_file-703df755-c8b4-4449-9918-afe877621d4f.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"703df755-c8b4-4449-9918-afe877621d4f\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:24:00 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:24:01.638Z",
          "lastUpdatedAt": "2025-01-04T06:24:01.638Z"
        }
      ],
      "threads": [
        {
          "user_id": null,
          "slug": "19f23a3c-9ecb-4b34-9750-d974829d65f6"
        },
        {
          "user_id": null,
          "slug": "5be87d7a-7abc-436c-8860-77d2e55d6718"
        }
      ]
    }
  ]
}

This is the response from /api/v1/system:

{
  "settings": {
    "RequiresAuth": false,
    "AuthToken": false,
    "JWTSecret": false,
    "StorageDir": "/app/server/storage",
    "MultiUserMode": false,
    "DisableTelemetry": "true",
    "EmbeddingEngine": "native",
    "HasExistingEmbeddings": true,
    "HasCachedEmbeddings": true,
    "VoyageAiApiKey": false,
    "GenericOpenAiEmbeddingApiKey": false,
    "GenericOpenAiEmbeddingMaxConcurrentChunks": 500,
    "GeminiEmbeddingApiKey": false,
    "VectorDB": "lancedb",
    "PineConeKey": false,
    "ChromaApiKey": false,
    "MilvusPassword": false,
    "LLMProvider": "ollama",
    "OpenAiKey": false,
    "OpenAiModelPref": "gpt-4o",
    "AzureOpenAiKey": false,
    "AzureOpenAiTokenLimit": 4096,
    "AnthropicApiKey": false,
    "AnthropicModelPref": "claude-2",
    "GeminiLLMApiKey": true,
    "GeminiLLMModelPref": "gemini-pro",
    "GeminiSafetySetting": "BLOCK_MEDIUM_AND_ABOVE",
    "LocalAiApiKey": false,
    "OllamaLLMBasePath": "http://172.17.0.1:11434",
    "OllamaLLMModelPref": "llama3.2:1b",
    "OllamaLLMTokenLimit": "4096",
    "OllamaLLMKeepAliveSeconds": "300",
    "OllamaLLMPerformanceMode": "base",
    "NovitaLLMApiKey": false,
    "TogetherAiApiKey": false,
    "FireworksAiLLMApiKey": false,
    "PerplexityApiKey": true,
    "OpenRouterApiKey": false,
    "MistralApiKey": false,
    "GroqApiKey": false,
    "HuggingFaceLLMAccessToken": false,
    "TextGenWebUIAPIKey": false,
    "LiteLLMApiKey": false,
    "GenericOpenAiKey": false,
    "AwsBedrockLLMConnectionMethod": "iam",
    "AwsBedrockLLMAccessKeyId": false,
    "AwsBedrockLLMAccessKey": false,
    "AwsBedrockLLMSessionToken": false,
    "CohereApiKey": false,
    "DeepSeekApiKey": false,
    "ApipieLLMApiKey": false,
    "XAIApiKey": false,
    "WhisperProvider": "local",
    "WhisperModelPref": "Xenova/whisper-small",
    "TextToSpeechProvider": "native",
    "TTSOpenAIKey": false,
    "TTSElevenLabsKey": false,
    "TTSPiperTTSVoiceModel": "en_US-hfc_female-medium",
    "TTSOpenAICompatibleKey": false,
    "AgentGoogleSearchEngineId": null,
    "AgentGoogleSearchEngineKey": null,
    "AgentSearchApiKey": null,
    "AgentSearchApiEngine": "google",
    "AgentSerperApiKey": null,
    "AgentBingSearchApiKey": null,
    "AgentSerplyApiKey": null,
    "AgentSearXNGApiUrl": null,
    "AgentTavilyApiKey": null,
    "DisableViewChatHistory": false
  }
}
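
For completeness, here is a minimal sketch of how these two responses can be fetched through the API, using the same Bearer-token headers as in my script further below (treat the GET calls as an assumption about how I retrieved them, not as part of the bug itself):

# Minimal sketch (assumption): fetching the two settings dumps shown above.
import os
from requests import get

base = 'http://localhost:3001/api/v1'
headers = {
    'accept': 'application/json',
    'Authorization': f"Bearer {os.environ.get('ANYTHINGLLM_API_KEY')}",
}

# Workspace settings and documents for the 'testing_api' workspace.
workspace_settings = get(f'{base}/workspace/testing_api', headers=headers).json()

# System-wide settings (embedder, vector DB, LLM provider, etc.).
system_settings = get(f'{base}/system', headers=headers).json()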

Goal:

I've written a simple Python wrapper around the AnythingLLM API, and I want to test its functionality before using it in my experiments. I've implemented functions for creating a folder, uploading a raw text document, moving a file into a folder, adding a file to a workspace, and performing vector search within a workspace given a query. Every function works as expected, except for the vector_search function.

Below is my Python API and testing script (it assumes the AnythingLLM API key is set as an environment variable ANYTHINGLLM_API_KEY):

# Standard
import os
import json
from pprint import pprint

# 3rd Party
from requests import get, post


def create_folder(ipv4, port, api_key, folder_name, verbose=False):
    """
    Create empty folder in server's root storage directory.

    Returns <Failure State>.
    Failure State is True if failed.
    """

    url = f'http://{ipv4}:{port}/api/v1/document/create-folder'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'name': folder_name
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True
    
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return not response_dict['success']


def upload_raw_text(ipv4, port, api_key, content, title, description="", verbose=False):
    """
    Uploads document with raw text to database.
    Title is required, but description is not.
    Other metadata fields (e.g. url, published) will
    automatically be filled in.

    Returns <Failure State, Saved File Path>.
    Failure State is True if failed and should not use Saved File Path.
    Saved File Path can be then used to add document to a workspace.
    """

    url = f'http://{ipv4}:{port}/api/v1/document/raw-text'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'textContent': content,
        'metadata': {
            'title': title,
            'description': description,
            'docAuthor': '',
            'docSource': '',
            'chunkSource': ''
        }
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    
    response_dict = json.loads(response.text)
    file_path = response_dict['documents'][0]['location']
    if verbose:
        pprint(response_dict)
    return not response_dict['success'], file_path


def move_file(ipv4, port, api_key, from_file_path, to_folder, verbose=False):
    """
    Move file from one folder to another.

    Returns <Failure State, New Saved File Path>.
    Failure State is True if failed and should not use New Saved File Path.
    New Saved File Path can be then used to add document to a workspace.
    """

    file_name = from_file_path.split('/')[-1]
    to_file_path = '/'.join([to_folder, file_name])

    url = f'http://{ipv4}:{port}/api/v1/document/move-files'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'files': [{
            'from': from_file_path,
            'to': to_file_path
        }]
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return not response_dict['success'], to_file_path


def add_file_to_workspace(ipv4, port, slug, api_key, file_path, verbose=False):
    """
    Adds file from server to specific workspace by slug.
    Will embed file if not already cached.

    Returns <Failure State>.
    Failure State is True if failed.
    """

    url = f'http://{ipv4}:{port}/api/v1/workspace/{slug}/update-embeddings'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'adds': [file_path]
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True
    
    # The endpoint returns plain 'Internal Server Error' text on failure (not JSON).
    if response.text == 'Internal Server Error':
        return True
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return False


def vector_search(ipv4, port, slug, api_key, query, top_n, score_threshold, verbose=False):
    """
    Searches for closest vectors to query.

    Returns <Failure State, Response>.
    Failure State is True if failed and should not access Response.
    """

    url = f'http://{ipv4}:{port}/api/v1/workspace/{slug}/vector-search'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'query': query,
        'topN': top_n,
        'scoreThreshold': score_threshold
    }

    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return False, response_dict


if __name__ == '__main__':
    """
    Testing API functions.
    """

    # Create temp folder for testing
    fail = create_folder('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), 'testing_temp_folder', verbose=True)

    # Add raw text to server
    fail, file_path = upload_raw_text('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), 'Yo ho ho!', 'pirate', 'what a pirate says', verbose=True)
    
    # Move file into temp folder
    fail, file_path = move_file('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, 'testing_temp_folder', verbose=True)

    # Embed file and add to workspace
    fail = add_file_to_workspace('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, verbose=True)

    # Add really long text with 2K tokens to server
    text = ""
    for _ in range(2000):
        text = text + "a "
    fail, file_path = upload_raw_text('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), text, 'long_file', 'bunch of as', verbose=True)

    # Move long file into temp folder
    fail, file_path = move_file('localhost', '3001', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, 'testing_temp_folder', verbose=True)

    # Embed long file and add to workspace
    fail = add_file_to_workspace('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), file_path, verbose=True)

    # Query workspace vectors
    fail, response = vector_search('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), 'Yo ho ho!', 2, 0.0, verbose=True)

    # Query workspace vectors using the exact stored chunk text (metadata header + content)
    test_query = '<document_metadata>\nsourceDocument: pirate.txt\npublished: 1/4/2025, 6:17:35 AM\n</document_metadata>\n\nYo ho ho!'
    fail, response = vector_search('localhost', '3001', 'testing_api', os.environ.get('ANYTHINGLLM_API_KEY'), test_query, 2, 0.0, verbose=True)

Expectation:

My understanding (and correct me if I am wrong) is that:

  • The api/v1/document/raw-text endpoint will add a JSON document to the system with fields like its id, title, description, pageContent, etc. I see that pageContent is taken directly from the textContent field of the request.
  • Then, when using the api/v1/workspace/{slug}/update-embeddings endpoint to embed the document and add it to a workspace, ONLY the pageContent field will be split into chunks, passed through the embedder, and stored in LanceDB.
  • The metadata itself is not also chunked and passed through the embedder.
  • Finally, when calling the api/v1/workspace/{slug}/vector-search endpoint, the query string will similarly be passed through the embedder, and the endpoint returns the chunks most similar to the given query (see the sketch right after this list).
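
Judging from the text fields in the vector-search responses further below, the chunk that actually gets stored and embedded appears to carry a <document_metadata> header in front of the page content. This is only my reading of the observed output, not documented behavior; the helper below is hypothetical and just reconstructs that observed format:

# Hypothetical reconstruction of what appears to be stored per chunk,
# based only on the 'text' fields returned by the vector-search endpoint below.
def build_chunk_text(source_document: str, published: str, page_content: str) -> str:
    return (
        "<document_metadata>\n"
        f"sourceDocument: {source_document}\n"
        f"published: {published}\n"
        "</document_metadata>\n\n"
        f"{page_content}"
    )

# For the pirate document this reproduces the 'text' field seen in the results:
chunk = build_chunk_text("pirate.txt", "1/4/2025, 6:17:35 AM", "Yo ho ho!")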

Based on these assumptions, I expect that if I query the workspace's vector database with the exact same textContent I used to add a document to the server, the query should return a vector with a distance of 0 and a similarity of 1.
I tested with textContent of fewer than 5 tokens, so the text is not split into multiple chunks and the entire text should be returned with a similarity of 1.
You can see these tests at the bottom of the provided script.
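
To make this expectation concrete, here is a minimal sketch, assuming a deterministic embedder, a cosine distance metric, and a score defined as 1 - distance (none of which I have confirmed; the vectors below are placeholders, not real embeddings):

# Minimal sketch of the expectation (placeholder vectors; cosine metric and
# score = 1 - distance are assumptions, not confirmed behavior).
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = np.array([0.1, 0.2, 0.3])   # placeholder embedding of the query text
stored_vec = np.array([0.1, 0.2, 0.3])  # identical text should embed identically

distance = cosine_distance(query_vec, stored_vec)  # -> 0.0
similarity = 1.0 - distance                        # -> 1.0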

Reality:

The first call to my vector_search function returns vectors with nonzero distances and low scores.
Below is the resulting response (it has two search results because I ran the script multiple times, so the workspace contains multiple files with the same name and contents):

{'results': [{'distance': 0.8602899312973022,
              'id': 'c83063a4-46a5-412c-b1c4-38cc1adce2a1',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:21:05 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0.13971006870269775,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:21:05 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'},
             {'distance': 0.8745168447494507,
              'id': '2f3ae7dd-cbe3-47f8-b885-edada0c850c4',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:17:35 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0.12548315525054932,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:17:35 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'}]}

Debugging:

I was surprised that the scores of the two documents returned in the response were not the same, even though their page content is exactly the same; the only differences were their metadata and text fields.
So I ran the vector search again, this time using the exact string from the text field of one of the documents in the response (i.e. test_query = '<document_metadata>\nsourceDocument: pirate.txt ......).

This time, the query returned a response with a vector of distance 0 and similarity 0:

{'results': [{'distance': 0,
              'id': '2f3ae7dd-cbe3-47f8-b885-edada0c850c4',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:17:35 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:17:35 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'},
             {'distance': 0.008605420589447021,
              'id': 'b8525a99-6563-4869-9248-237b20f1ed84',
              'metadata': {'author': None,
                           'chunkSource': None,
                           'description': 'what a pirate says',
                           'docSource': None,
                           'published': '1/4/2025, 6:24:00 AM',
                           'title': 'pirate.txt',
                           'tokenCount': 4,
                           'url': 'file://pirate.txt',
                           'wordCount': 3},
              'score': 0.991394579410553,
              'text': '<document_metadata>\n'
                      'sourceDocument: pirate.txt\n'
                      'published: 1/4/2025, 6:24:00 AM\n'
                      '</document_metadata>\n'
                      '\n'
                      'Yo ho ho!'}]}
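
As a side observation on the numbers above (and in the earlier response), the reported score lines up with 1 - distance for every nonzero-distance result, while the exact-match result reports both distance 0 and score 0. A quick check using only the values quoted above:

# Quick check on the (distance, score) pairs quoted in the two responses above.
observed = [
    (0.8602899312973022, 0.13971006870269775),
    (0.8745168447494507, 0.12548315525054932),
    (0.008605420589447021, 0.991394579410553),
    (0.0, 0.0),  # exact match: distance 0, yet score 0 (see question 3 below)
]

for distance, score in observed:
    print(f"distance={distance:.6f}  score={score:.6f}  1-distance={1 - distance:.6f}")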

I'm not fluent in JavaScript, but I took a look at the relevant server source files for debugging, and my reading was that the code should return the closest vector based on the text content, excluding the metadata.

Questions:

  1. What is the expected result of embedding raw-text content and then querying it with the vector-search endpoint?
  2. If it is expected that we need to include the metadata to retrieve the closest vector with distance 0, could there be a different endpoint added where the vector search is based purely on the text content from the raw-text endpoint?
  3. Why is the similarity 0 when the distance is 0 in the example above?

Are there known steps to reproduce?

No response

TheNeeloy added the possible bug label on Jan 4, 2025.