Hi, thanks for releasing such an awesome project; it's really helping me swap out models and providers quickly during LLM experiments. I have a question about the intended behavior of the api/v1/workspace/{slug}/vector-search API endpoint.
For context, this builds on these issues and PRs: #2811, #2812, #2815.
TLDR:
When testing the new vector-search API endpoint, I found that I needed to prepend document metadata to my query in order to retrieve a vector with distance 0. However, I had thought the vector search was based purely on the page content, excluding metadata. Below I describe my environment setup, testing process, expectations, results, and questions. Thanks for your time!
Workspace and System Setup:
My AnythingLLM instance is hosted locally via Docker, using the default out-of-the-box AnythingLLM embedding provider and LanceDB vector database settings. I set up a workspace using Ollama as the chat provider, running the llama3.2:1b model.
This is the response from /api/v1/workspace/{slug} (my workspace slug is testing_api):
{
  "workspace": [
    {
      "id": 7,
      "name": "testing_api",
      "slug": "testing_api",
      "vectorTag": null,
      "createdAt": "2025-01-04T01:25:26.088Z",
      "openAiTemp": 0.7,
      "openAiHistory": 20,
      "lastUpdatedAt": "2025-01-04T01:25:26.088Z",
      "openAiPrompt": "Given the following conversation, relevant context, and a follow up question, reply with an answer to the current question the user is asking. Return only your response to the question given the above information following the users instructions as needed.",
      "similarityThreshold": 0.25,
      "chatProvider": "ollama",
      "chatModel": "llama3.2:1b",
      "topN": 4,
      "chatMode": "chat",
      "pfpFilename": null,
      "agentProvider": null,
      "agentModel": null,
      "queryRefusalResponse": "There is no relevant information in this workspace to answer your query.",
      "documents": [
        {
          "id": 9,
          "docId": "efd8d182-048e-41d4-aa61-3dcb0c98fff2",
          "filename": "raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
          "docpath": "testing_temp_folder/raw-pirate-cab0d2bf-4cf4-4020-a5ff-233e02c5067f.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"cab0d2bf-4cf4-4020-a5ff-233e02c5067f\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:17:35 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:17:36.080Z",
          "lastUpdatedAt": "2025-01-04T06:17:36.080Z"
        },
        {
          "id": 10,
          "docId": "cddb338d-d6ba-4636-a229-b5be995cef93",
          "filename": "raw-long_file-efe0f77c-db8f-4b79-9531-c797621f251e.json",
          "docpath": "testing_temp_folder/raw-long_file-efe0f77c-db8f-4b79-9531-c797621f251e.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"efe0f77c-db8f-4b79-9531-c797621f251e\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:17:36 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:17:36.876Z",
          "lastUpdatedAt": "2025-01-04T06:17:36.876Z"
        },
        {
          "id": 11,
          "docId": "ea926340-3c71-46df-8e1e-72e1629b5de0",
          "filename": "raw-pirate-a17387aa-8307-4e92-b07a-8dd92d26b68e.json",
          "docpath": "testing_temp_folder/raw-pirate-a17387aa-8307-4e92-b07a-8dd92d26b68e.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"a17387aa-8307-4e92-b07a-8dd92d26b68e\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:21:05 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:21:06.143Z",
          "lastUpdatedAt": "2025-01-04T06:21:06.143Z"
        },
        {
          "id": 12,
          "docId": "89e8bb51-8ff6-433f-ac56-6d12d5f76158",
          "filename": "raw-long_file-f83a5abe-e69c-40c8-907f-2a2fece3e3b1.json",
          "docpath": "testing_temp_folder/raw-long_file-f83a5abe-e69c-40c8-907f-2a2fece3e3b1.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"f83a5abe-e69c-40c8-907f-2a2fece3e3b1\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:21:06 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:21:06.927Z",
          "lastUpdatedAt": "2025-01-04T06:21:06.927Z"
        },
        {
          "id": 13,
          "docId": "e11d5f8f-1029-4881-a8cf-62ca09adc97f",
          "filename": "raw-pirate-9c2c0cdf-246e-485e-a1fa-15bca9782b54.json",
          "docpath": "testing_temp_folder/raw-pirate-9c2c0cdf-246e-485e-a1fa-15bca9782b54.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"9c2c0cdf-246e-485e-a1fa-15bca9782b54\",\"url\":\"file://pirate.txt\",\"title\":\"pirate.txt\",\"docAuthor\":\"\",\"description\":\"what a pirate says\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:24:00 AM\",\"wordCount\":3,\"token_count_estimate\":4}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:24:00.807Z",
          "lastUpdatedAt": "2025-01-04T06:24:00.807Z"
        },
        {
          "id": 14,
          "docId": "d076c9bf-84da-4894-b8c4-d734c375ec8b",
          "filename": "raw-long_file-703df755-c8b4-4449-9918-afe877621d4f.json",
          "docpath": "testing_temp_folder/raw-long_file-703df755-c8b4-4449-9918-afe877621d4f.json",
          "workspaceId": 7,
          "metadata": "{\"id\":\"703df755-c8b4-4449-9918-afe877621d4f\",\"url\":\"file://long_file.txt\",\"title\":\"long_file.txt\",\"docAuthor\":\"\",\"description\":\"bunch of as\",\"docSource\":\"\",\"chunkSource\":\"\",\"published\":\"1/4/2025, 6:24:00 AM\",\"wordCount\":2001,\"token_count_estimate\":2001}",
          "pinned": false,
          "watched": false,
          "createdAt": "2025-01-04T06:24:01.638Z",
          "lastUpdatedAt": "2025-01-04T06:24:01.638Z"
        }
      ],
      "threads": [
        {
          "user_id": null,
          "slug": "19f23a3c-9ecb-4b34-9750-d974829d65f6"
        },
        {
          "user_id": null,
          "slug": "5be87d7a-7abc-436c-8860-77d2e55d6718"
        }
      ]
    }
  ]
}
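For reference, the response above comes from a plain GET request; here is a minimal sketch consistent with the client functions below (host, port, and slug match my setup):

from os import environ
from requests import get

# Fetch the workspace record by slug.
response = get(
    'http://localhost:3001/api/v1/workspace/testing_api',
    headers={
        'accept': 'application/json',
        'Authorization': f'Bearer {environ.get("ANYTHINGLLM_API_KEY")}'
    }
)
print(response.json())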
Goal:
I've written a simple Python client to interface with the AnythingLLM instance, and I want to test its functionality before using it in my experiments. I've implemented functions for creating a folder, uploading a raw-text document, moving a file into a folder, adding a file to a workspace, and performing a vector search within a workspace given a query. Every function works as expected except for the vector_search function.
Below are my Python client and testing script (they assume the AnythingLLM API key is set in the environment variable ANYTHINGLLM_API_KEY):
# Standard
import os
import json
from pprint import pprint

# 3rd Party
from requests import post

def create_folder(ipv4, port, api_key, folder_name, verbose=False):
    """
    Create empty folder in server's root storage directory.
    Returns <Failure State>.
    Failure State is True if failed.
    """
    url = f'http://{ipv4}:{port}/api/v1/document/create-folder'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'name': folder_name
    }
    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return not response_dict['success']

def upload_raw_text(ipv4, port, api_key, content, title, description="", verbose=False):
    """
    Uploads document with raw text to database.
    Title is required, but description is not.
    Other fields in metadata (i.e. url, published) will
    automatically be filled in.
    Returns <Failure State, Saved File Path>.
    Failure State is True if failed and should not use Saved File Path.
    Saved File Path can then be used to add document to a workspace.
    """
    url = f'http://{ipv4}:{port}/api/v1/document/raw-text'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'textContent': content,
        'metadata': {
            'title': title,
            'description': description,
            'docAuthor': '',
            'docSource': '',
            'chunkSource': ''
        }
    }
    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    response_dict = json.loads(response.text)
    file_path = response_dict['documents'][0]['location']
    if verbose:
        pprint(response_dict)
    return not response_dict['success'], file_path

def move_file(ipv4, port, api_key, from_file_path, to_folder, verbose=False):
    """
    Move file from one folder to another.
    Returns <Failure State, New Saved File Path>.
    Failure State is True if failed and should not use New Saved File Path.
    New Saved File Path can then be used to add document to a workspace.
    """
    file_name = from_file_path.split('/')[-1]
    to_file_path = '/'.join([to_folder, file_name])
    url = f'http://{ipv4}:{port}/api/v1/document/move-files'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'files': [{
            'from': from_file_path,
            'to': to_file_path
        }]
    }
    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return not response_dict['success'], to_file_path

def add_file_to_workspace(ipv4, port, slug, api_key, file_path, verbose=False):
    """
    Adds file from server to specific workspace by slug.
    Will embed file if not already cached.
    Returns <Failure State>.
    Failure State is True if failed.
    """
    url = f'http://{ipv4}:{port}/api/v1/workspace/{slug}/update-embeddings'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'adds': [file_path]
    }
    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True
    # The endpoint returns the plain string 'Internal Server Error' on failure,
    # so check for it before attempting to parse JSON.
    if response.text == 'Internal Server Error':
        return True
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return False

def vector_search(ipv4, port, slug, api_key, query, top_n, score_threshold, verbose=False):
    """
    Searches for closest vectors to query.
    Returns <Failure State, Response>.
    Failure State is True if failed and should not access Response.
    """
    url = f'http://{ipv4}:{port}/api/v1/workspace/{slug}/vector-search'
    headers = {
        'accept': 'application/json',
        'Authorization': f'Bearer {api_key}',
        'Content-Type': 'application/json'
    }
    data = {
        'query': query,
        'topN': top_n,
        'scoreThreshold': score_threshold
    }
    try:
        response = post(url, headers=headers, json=data, stream=False)
    except Exception as exception:
        print(exception)
        return True, None
    response_dict = json.loads(response.text)
    if verbose:
        pprint(response_dict)
    return False, response_dict

if __name__ == '__main__':
    # Testing API functions.
    api_key = os.environ.get('ANYTHINGLLM_API_KEY')
    # Create temp folder for testing
    fail = create_folder('localhost', '3001', api_key, 'testing_temp_folder', verbose=True)
    # Add raw text to server
    fail, file_path = upload_raw_text('localhost', '3001', api_key, 'Yo ho ho!', 'pirate', 'what a pirate says', verbose=True)
    # Move file into temp folder
    fail, file_path = move_file('localhost', '3001', api_key, file_path, 'testing_temp_folder', verbose=True)
    # Embed file and add to workspace
    fail = add_file_to_workspace('localhost', '3001', 'testing_api', api_key, file_path, verbose=True)
    # Add really long text with ~2K tokens to server
    text = "a " * 2000
    fail, file_path = upload_raw_text('localhost', '3001', api_key, text, 'long_file', 'bunch of as', verbose=True)
    # Move long file into temp folder
    fail, file_path = move_file('localhost', '3001', api_key, file_path, 'testing_temp_folder', verbose=True)
    # Embed long file and add to workspace
    fail = add_file_to_workspace('localhost', '3001', 'testing_api', api_key, file_path, verbose=True)
    # Query workspace vectors with the exact text that was embedded
    fail, response = vector_search('localhost', '3001', 'testing_api', api_key, 'Yo ho ho!', 2, 0.0, verbose=True)
    # Query workspace vectors again, with the document metadata header prepended
    test_query = '<document_metadata>\nsourceDocument: pirate.txt\npublished: 1/4/2025, 6:17:35 AM\n</document_metadata>\n\nYo ho ho!'
    fail, response = vector_search('localhost', '3001', 'testing_api', api_key, test_query, 2, 0.0, verbose=True)
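As a quick sanity check on where the mismatch comes from, the text field returned by vector_search can be compared against the textContent that was submitted. This helper is only a sketch for eyeballing the difference; it is not part of the test run above:

def show_prefix(result_text, submitted_text):
    # Print whatever the server prepended to the embedded chunk text.
    if result_text.endswith(submitted_text):
        print(repr(result_text[:-len(submitted_text)]))
    else:
        print('chunk text does not end with the submitted content')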
Expectation:
My understanding (and correct me if I am wrong) is that:
1. The api/v1/document/raw-text endpoint adds a JSON document to the system with fields like id, title, description, pageContent, etc. I can see that pageContent is taken directly from the textContent field of the request.
2. When the api/v1/workspace/{slug}/update-embeddings endpoint is used to embed the document and add it to a workspace, ONLY the pageContent field is split into chunks, passed through the embedder, and stored in LanceDB. My understanding is that the metadata is not also chunked and passed through the embedder.
3. When the api/v1/workspace/{slug}/vector-search endpoint is called, the query string is likewise passed through the embedder, and the endpoint returns the chunks most similar to the given query.
A sketch of this assumed pipeline follows the list.
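To make the mental model concrete, here is a minimal, self-contained sketch of the pipeline I am assuming. Everything in it is hypothetical (the helper names, the chunking rule, and the in-memory store are mine, not AnythingLLM's actual internals), and it takes any embed function as a parameter:

# Hypothetical sketch of my assumed embed/search pipeline; not AnythingLLM code.
def split_into_chunks(text, chunk_size=1000):
    # Stand-in for the server's real text splitter.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def assumed_update_embeddings(page_content, metadata, store, embed):
    # Assumption: only pageContent is chunked and embedded; metadata is kept
    # alongside each chunk but never passed through the embedder.
    for chunk in split_into_chunks(page_content):
        store.append({'vector': embed(chunk), 'text': chunk, 'metadata': metadata})

def assumed_vector_search(query, store, embed, top_n):
    # Assumption: the raw query string is embedded the same way, so a query
    # identical to a stored chunk should come back with distance 0.
    query_vector = embed(query)
    def distance(vector):
        return sum((a - b) ** 2 for a, b in zip(query_vector, vector)) ** 0.5
    return sorted(store, key=lambda row: distance(row['vector']))[:top_n]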
Based on these assumptions, I expect that if I query the workspace's vector database with the exact same textContent I used to add a document to the server, the query should return a vector with a distance of 0 and a similarity of 1.
I tested with textContent of fewer than 5 tokens, so the text is not split into multiple chunks, and the entire text should come back with a similarity of 1.
You can see these tests at the bottom of the provided script; the distance/similarity arithmetic I am assuming is sketched below.
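For reference, this is the relationship between distance and similarity that my expectation rests on. It is a guess: I am assuming cosine distance with similarity computed as 1 - distance, which may not be how LanceDB or AnythingLLM actually scores results:

import math

def cosine_distance(a, b):
    # Cosine distance: 1 - cos(angle) between the two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

vector = [0.1, 0.2, 0.3]  # made-up embedding for an identical query and chunk
print(cosine_distance(vector, vector))        # ~0.0 (identical vectors)
print(1.0 - cosine_distance(vector, vector))  # similarity = 1 - distance = ~1.0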
Reality:
The first call to my vector_search function returns vectors with a nonzero distance and a low score.
Below is the resulting response. (It has two search results because I ran the script multiple times, so the workspace contains multiple files with the same name and contents.)
Debugging:
I was puzzled that the scores of the two documents in the response were not the same, even though their contents are identical. The only differences were their metadata and text fields.
So I ran the vector search again, this time using the exact string from the text field of one of the documents in the response (i.e., test_query = '<document_metadata>\nsourceDocument: pirate.txt ......).
This time, the query returned a response with a vector of distance 0 and similarity 0:
I'm not fluent in JavaScript, but I looked through the following source files while debugging, and my reading is that the code should return the closest vector based on the text alone, excluding the metadata:
anything-llm/collector/processRawText/index.js, line 37 in c6547ec
anything-llm/server/endpoints/api/workspace/index.js, line 511 in c6547ec
anything-llm/server/models/documents.js, line 82 in c6547ec
anything-llm/server/utils/vectorDbProviders/lance/index.js, line 279 in c6547ec
anything-llm/server/endpoints/api/workspace/index.js, line 958 in c6547ec
anything-llm/server/utils/vectorDbProviders/lance/index.js, line 157 in c6547ec
anything-llm/server/utils/vectorDbProviders/lance/index.js, line 29 in c6547ec
Questions:
1. What is the expected result of embedding raw-text content and then querying it with the vector-search endpoint?
2. If it is expected that the metadata must be included in the query to retrieve the closest vector with distance 0, could a separate endpoint be added where the vector search is based purely on the text content submitted to the raw-text endpoint?
3. Why is the similarity 0 when the distance is 0 in the example above?
Are there known steps to reproduce?
No response