
Changed default local model to nomic #1943

Merged
merged 14 commits into main from set-default-local-to-nomic on Aug 1, 2024

Conversation

hagen-danswer
Contributor

No description provided.


@@ -21,10 +21,12 @@ RUN apt-get remove -y --allow-remove-essential perl-base && \
RUN python -c "from transformers import AutoModel, AutoTokenizer, TFDistilBertForSequenceClassification; \
from huggingface_hub import snapshot_download; \
AutoTokenizer.from_pretrained('danswer/intent-model'); \
AutoTokenizer.from_pretrained('intfloat/e5-base-v2'); \
AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1'); \
AutoTokenizer.from_pretrained('nomic-ai/nomic-bert-2048'); \
Contributor Author

To make this work while airgapped, you need to call .from_pretrained and snapshot_download not only for nomic-ai/nomic-embed-text-v1, but also for nomic-ai/nomic-bert-2048.

It was hard to find the exact reasoning for this, but I'm pretty sure it has something to do with nomic-embed-text-v1 being built on top of nomic-bert-2048, which means it needs to run .py scripts that are located only in the nomic-bert-2048 repo here
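A minimal sketch of the pre-download this describes, using the same calls as the Dockerfile hunk above (caching both repos so an airgapped deployment never has to reach the Hugging Face Hub at runtime):

from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Cache the embedding model and the nomic-bert-2048 backbone it depends on.
for repo in ("nomic-ai/nomic-embed-text-v1", "nomic-ai/nomic-bert-2048"):
    AutoTokenizer.from_pretrained(repo)
    snapshot_download(repo)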

ASYM_QUERY_PREFIX = os.environ.get("ASYM_QUERY_PREFIX", "query: ")
ASYM_PASSAGE_PREFIX = os.environ.get("ASYM_PASSAGE_PREFIX", "passage: ")
ASYM_QUERY_PREFIX = os.environ.get("ASYM_QUERY_PREFIX", "search_query: ")
ASYM_PASSAGE_PREFIX = os.environ.get("ASYM_PASSAGE_PREFIX", "search_document: ")
# Purely an optimization, memory limitation consideration
Contributor Author

These are the default query/passage prefixes for nomic-ai/nomic-embed-text-v1.
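A quick sketch of how asymmetric prefixes like these are typically applied before encoding; the exact call sites aren't shown in this diff, so treat the snippet as illustrative:

from sentence_transformers import SentenceTransformer

ASYM_QUERY_PREFIX = "search_query: "
ASYM_PASSAGE_PREFIX = "search_document: "

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Queries and documents get different prefixes before being embedded.
query_emb = model.encode(ASYM_QUERY_PREFIX + "how do I reset my password")
doc_embs = model.encode([ASYM_PASSAGE_PREFIX + doc for doc in ["first chunk", "second chunk"]])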

model = SentenceTransformer(model_name)
model = SentenceTransformer(
model_name_or_path=model_name, trust_remote_code=True
)
Contributor Author

This is related to also needing to install nomic-bert-2048: there is a script that has to be executed to use the model (unsure exactly when) that lives in nomic-bert-2048 rather than in nomic-embed-text-v1 (a couple of .py scripts you can see here).

Not 100% sure though.
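One way to check the dependency, assuming the embed model's config maps its architecture classes to code hosted in the nomic-bert-2048 repo (worth verifying locally):

from transformers import AutoConfig

# The auto_map entries should point at modeling/configuration .py files that
# live in nomic-ai/nomic-bert-2048, which is why trust_remote_code and the
# extra snapshot_download are needed.
cfg = AutoConfig.from_pretrained("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
print(getattr(cfg, "auto_map", None))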

@@ -116,8 +116,9 @@ def get_tokenizer(model_name: str | None, provider_type: str | None) -> BaseToke
if provider_type.lower() == "openai":
# Used across ada and text-embedding-3 models
return _check_tokenizer_cache("openai")
# If we are given a cloud provider_type that isn't OpenAI, we default to trying to use the model_name
# this means we are approximating the token count which may leave some performance on the table

Contributor Author

general note

snapshot_download('danswer/intent-model'); \
snapshot_download('intfloat/e5-base-v2'); \
snapshot_download('mixedbread-ai/mxbai-rerank-xsmall-v1')"
RUN python -c "from transformers import AutoTokenizer; \
Contributor

It's better to combine these into a single layer. If you do a single RUN it creates a single layer that can be cached.
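For example, the pre-download could be moved into one script invoked from a single RUN, so all of the weights land in one cached layer. The model list below is taken from the hunks above and may not match the final Dockerfile exactly:

# download_models.py, called once from a single Dockerfile RUN
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

repos = [
    "danswer/intent-model",
    "intfloat/e5-base-v2",
    "nomic-ai/nomic-embed-text-v1",
    "nomic-ai/nomic-bert-2048",
    "mixedbread-ai/mxbai-rerank-xsmall-v1",
]
for repo in repos:
    AutoTokenizer.from_pretrained(repo)
    snapshot_download(repo)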

model = SentenceTransformer(model_name)
model = SentenceTransformer(
model_name_or_path=model_name,
trust_remote_code=True,
Contributor

I would add a comment here:
"Some model architectures that aren't built into the Transformers or Sentence Transformer need to be downloaded to be loaded locally. This does not mean data is sent to remote servers for inference, however the remote code can be fairly arbitrary so only use trusted models"

yuhongsun96 merged commit 1be1959 into main on Aug 1, 2024 (5 checks passed).
yuhongsun96 deleted the set-default-local-to-nomic branch on August 1, 2024 at 01:54.