SOLR-17525: Text Embedder Query Parser #2809
…k and it's ready for a first pull request
I'll keep polishing it and finalise the documentation, but I think it's ready for review!
Just as a reminder, currently the check fails with:
Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.
solr/solr-ref-guide/modules/query-guide/pages/embedding-text.adoc
Let me elaborate here: Given that, I am not that familiar with the file store, so if it can help in reaching a better and cleaner solution I'll be more than happy to take a look at it (after the 5th of November).
I would love to get a walkthrough of this exciting feature at the next community meetup...
solr/modules/llm/src/test/org/apache/solr/llm/embedding/DummyEmbeddingModel.java
solr/modules/llm/src/java/org/apache/solr/llm/embedding/SolrEmbeddingModel.java
The role of the FileStore is to handle distributed node/cluster synchronization of the underlying bytes; that's it. Is that relevant/useful for the model store you need? If it's not, then forget it, but I suspect it is. I understand that a Solr plugin needs to load the bytes once instead of for each call to query/index/whatever :-). This sounds like a layer above it, like a Caffeine cache with an eviction policy. Maybe it should be configured in solrconfig.xml along with the other caches? It could be generic, but it's probably best to do this simple thing for the needs of this model stuff, which I'm not too familiar with.
I'm willing to help out slinging some code here, with the aim of preventing duplication of similar mechanisms within Solr.
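To make the "load the bytes once" idea concrete, here is a minimal sketch of a bounded cache that sits above whatever store provides the bytes. Everything in it is an assumption for illustration: `LoadedModelCache` is a hypothetical name, the loader function stands in for a FileStore read, and a plain `LinkedHashMap` with least-recently-used eviction is swapped in for Caffeine so the example needs only the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: holds at most maxEntries loaded models, evicting the least recently used. */
final class LoadedModelCache<M> {
  private final Map<String, M> entries;
  private final Function<String, M> loader; // assumed stand-in for reading bytes from a store
  private int loadCount = 0;

  LoadedModelCache(final int maxEntries, Function<String, M> loader) {
    this.loader = loader;
    // accessOrder=true keeps the least recently used entry first in iteration order.
    this.entries =
        new LinkedHashMap<String, M>(16, 0.75f, true) {
          @Override
          protected boolean removeEldestEntry(Map.Entry<String, M> eldest) {
            return size() > maxEntries;
          }
        };
  }

  /** Returns the cached model, invoking the loader only on a miss. */
  synchronized M get(String modelName) {
    M model = entries.get(modelName);
    if (model == null) {
      model = loader.apply(modelName);
      loadCount++;
      entries.put(modelName, model);
    }
    return model;
  }

  /** Number of times the loader has been invoked so far. */
  synchronized int loadCount() {
    return loadCount;
  }
}
```

Repeated calls to `get` with the same name return the same instance without touching the loader again; once more than `maxEntries` distinct names have been requested, the least recently used one is dropped and would be reloaded on its next use.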
solr/modules/llm/src/java/org/apache/solr/llm/embedding/SolrEmbeddingModel.java
…beddingModel.java Co-authored-by: Christine Poerschke <[email protected]>
Thanks @cpoerschke, I incorporated all your suggestions!
final float[] embedding;

public DummyEmbeddingModel(int[] embedding) {
  this.embedding = new float[] {embedding[0], embedding[1], embedding[2], embedding[3]};
}
SeaseLtd@e7d9e63 suggests supporting float values and dimensions other than 4.
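As a rough illustration of that suggestion (not the content of the linked commit), a test double could accept a `float[]` of any length and copy it, rather than reading exactly four `int` values. `FixedVectorModel` and its methods are hypothetical names, and the class deliberately does not implement the LangChain4J interface so that the sketch stays self-contained.

```java
import java.util.Arrays;

/** Hypothetical test double: returns the same fixed vector for every input text. */
final class FixedVectorModel {
  private final float[] embedding;

  FixedVectorModel(float[] embedding) {
    if (embedding == null || embedding.length == 0) {
      throw new IllegalArgumentException("embedding must contain at least one value");
    }
    // Defensive copy, so later changes to the caller's array do not leak into the model.
    this.embedding = Arrays.copyOf(embedding, embedding.length);
  }

  /** Ignores the text and returns a copy of the configured vector. */
  float[] embed(String text) {
    return Arrays.copyOf(embedding, embedding.length);
  }

  /** Dimension of the configured vector. */
  int dimension() {
    return embedding.length;
  }
}
```

With this shape, a test can configure a three-dimensional or 768-dimensional vector with fractional values without touching the constructor.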
I suggest registering this query parser as "textEmbedder" rather than simply "embed", which is a puzzling word by itself without context aside from it being a query parser. Anyway, I contributed improvements to the "filestore" based branch, mostly to SolrEmbeddingModel. More is needed: in Solr we load classes with SolrResourceLoader and not with Class.forName. I realized that things didn't compile even before my changes... tests need work. I'm skeptical that "RestTestBase" is an appropriate base test class here. Also, SolrJettyTestBase is deprecated; you can just use SolrJettyTestRule. The changes I made will require a cache to be defined in solrconfig.xml, but I'd like to improve that later so that it doesn't need to be configured (but would in effect exist).
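The class-loading point can be sketched with the JDK alone. A one-argument `Class.forName(name)` resolves against the caller's defining class loader, so classes that are only visible to a separate plugin loader may not be found; passing the intended loader explicitly and checking the expected type avoids that. The helper below is a hypothetical analogy for the kind of work a resource loader does, assuming nothing about SolrResourceLoader's actual API, which is not shown here.

```java
/** Hypothetical helper: instantiates a class by name through an explicitly chosen class loader. */
final class PluginInstantiator {
  private PluginInstantiator() {}

  static <T> T newInstance(ClassLoader loader, String className, Class<T> expectedType) {
    try {
      // Resolve through the given loader instead of the caller's defining loader.
      Class<? extends T> clazz =
          Class.forName(className, true, loader).asSubclass(expectedType);
      // Requires an accessible no-argument constructor.
      return clazz.getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IllegalStateException("Cannot instantiate " + className, e);
    }
  }
}
```

For example, `PluginInstantiator.newInstance(ClassLoader.getSystemClassLoader(), "java.util.ArrayList", java.util.List.class)` yields a new list, while asking for a type the class does not implement fails with a `ClassCastException` from `asSubclass`.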
…beddingModel.java Co-authored-by: Christine Poerschke <[email protected]>
Thanks @cpoerschke for the feedback, all merged. I've just done the rename, but had to use snake case "text_embedder": looking around at the existing query parsers, I was not able to find any camelCase instances. There was no uniformity, to be honest, but the multi-term query parser names I found were normally either snake_case or all lowercase, and the latter makes the name less readable. Proceeding now with the additional feedback and tests.
I've done another round of thinking, polishing and experimenting; my comments follow on the various open topics:
Tests
I spent a good 2-3 hours trying to refactor the tests from the current RestTestBase to SolrJettyTestRule.
FileStore vs ManagedResource
Current Requirement:
Thanks for your patience, Alessandro, and willingness to consider alternatives! Originally when I heard "model", I thought of some potentially large thing (the FileStore is good for large things); I wasn't thinking of some metadata about a model. I also don't want to create new API surface area for something that at least sounds redundant. But the "ManagedResource" / REST support Solr has makes this maybe not an issue. A small timeline note: please defer merging until after #2706 so we don't interfere with that delicate big transition that is close now. It will certainly impact any PR (like this one) that introduces dependencies.
https://issues.apache.org/jira/browse/SOLR-17525
Description
The scope of this issue is to integrate a new module able to use LLMs (through managed services) to enhance aspects of Apache Solr.
Specifically, this first pull request handles embedding models and automatic text vectorisation in Solr.
Solution
The functionality has been introduced through LangChain4J (https://docs.langchain4j.dev).
There are several aspects I would like feedback on:
To do that I added security exceptions in both 'solr/server/etc/security.policy' and 'gradle/testing/randomization/policies/solr-tests.policy'.
It works, but I have no idea whether it's acceptable or the best way to do it.
Tests