Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-17525: Text Embedder Query Parser #2809

Open
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

alessandrobenedetti
Copy link
Contributor

@alessandrobenedetti alessandrobenedetti commented Oct 29, 2024

https://issues.apache.org/jira/browse/SOLR-17525

Description

Scope of this issue is to integrate a new module able to use LLM (through managed services) to enhance aspect of Apache Solr.
Specifically this first Pull Request relates to handle embedding models and automatic text vectorisation in Solr.

Solution

The functionality has been introduced through LangChain4J (https://docs.langchain4j.dev).
The are several aspects I would like feedback on:

  • I used inversion of control to integrate support for multiple embedding models with minimal impact on the code. It works and it's decently clean, but I would love your feedback on this
  • Embedding models are accessed through client APIS and http requests are made to embed text using external services.
    To do that I added security exceptions in both 'solr/server/etc/security.policy' and 'gradle/testing/randomization/policies/solr-tests.policy'.
    It works but I have no idea if it's acceptable or the best way to do it

Tests

  • model storage tests added
  • query parsing tests added

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended, not available for branches on forks living under an organisation)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

@github-actions github-actions bot added the dependencies Dependency upgrades label Oct 29, 2024
@alessandrobenedetti
Copy link
Contributor Author

I'll keep polishing it and finalise the documentation, but I think it's ready for review!

@alessandrobenedetti
Copy link
Contributor Author

Just as a reminder, currently the check fails with:
"* What went wrong:
Execution failed for task ':solr:modules:llm:analyzeClassesDependencies'.

Dependency analysis found issues.
usedUndeclaredArtifacts

  • dev.langchain4j:langchain4j-core:0.35.0@jar
    unusedDeclaredArtifacts
  • dev.langchain4j:langchain4j-cohere:0.35.0@jar
  • dev.langchain4j:langchain4j-hugging-face:0.35.0@jar
  • dev.langchain4j:langchain4j-mistral-ai:0.35.0@jar
  • dev.langchain4j:langchain4j-open-ai:0.35.0@jar
  • dev.langchain4j:langchain4j:0.35.0@jar
    "
    I'll work on it in the next few days

Copy link
Contributor

@dsmiley dsmiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.

@alessandrobenedetti
Copy link
Contributor Author

Why does this module introduce a competing "model store" to Solr's existing "file store"? The "file store" was developed with "models" in mind, in addition to initially being developed for plugin packages.

Let me elaborate here:
I took inspiration from the Learning To Rank module (ltr) where I was familiar with the model store:
org.apache.solr.ltr.store.ModelStore
similar to the file store but specialised in learning to rank models and their management and singleton instantiation .
The embeddingModel store does pretty much an equivalent job and grants direct access to embedding models so that the query parser or other components (like the update request processor in the works).

Given that, I am not that familiar with the file store, so if it can help in having a better and cleaner solution I'll be more than happy to take a look at it (after the 5th of November)

@epugh
Copy link
Contributor

epugh commented Oct 30, 2024

I would love to get a walk through on this exciting feature at the next community meetup...

@dsmiley
Copy link
Contributor

dsmiley commented Nov 7, 2024

The role of the FileStore is to handle distributed node/cluster synchronization of the underlying bytes; that's it. Is that relevant/useful for the model store you need? If it's not, then forget it but I suspect it is.

I understand that a Solr plugin needs to load the bytes once instead of for each call to query/index/whatever :-). This sounds like a layer above it like a Caffeine Cache with an eviction policy. Maybe it should be configured in solrconfig.xml along with the other caches? It could be generic but probably best to do this simple thing for the needs of this model stuff that I'm not too familiar with.

@dsmiley
Copy link
Contributor

dsmiley commented Nov 7, 2024

I'm willing to help out slinging some code here in aims to prevent duplication of similar mechanisms within Solr.

@alessandrobenedetti
Copy link
Contributor Author

alessandrobenedetti commented Nov 7, 2024

Thanks @cpoerschke, I incorporated all your suggestions!
Thanks @dsmiley for helping, in the meantime I branched here to tentatively use the FileStore: https://github.com/SeaseLtd/solr/tree/feature/SOLR-17525-fileStore
It seems more or less I adjusted the code to work without the dedicated EmbeddingModelStore, need to clean it up a bit and fix the tests, to use the FileStore.
I'm done for today,
If you jump in, your help would be much appreciated!

final float[] embedding;

public DummyEmbeddingModel(int[] embedding) {
this.embedding = new float[] {embedding[0], embedding[1], embedding[2], embedding[3]};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SeaseLtd@e7d9e63 suggests to support float values and dimensions other than 4.

@dsmiley
Copy link
Contributor

dsmiley commented Nov 12, 2024

I suggest registering this query parser as "textEmbedder" but not simply "embed" which is a puzzling word by itself without context aside from it being a query parser.

Any way, I contributed improvements to the "filestore" based branch, mostly to SolrEmbeddingModel. More is needed -- in Solr we load classes with SolrResourceLoader and not with Class.forName. I realized that things didn't compile even before my changes... tests need work. I'm skeptical "RestTestBase" is an appropriate base test class here. Also SolrJettyTestBase is deprecated; you can just use SolrJettyTestRule. The changes I did will require a cache to be defined in solrconfig.xml but I'd like to later improve it so that it's not needed to be configured (but would in-effect exist).

@alessandrobenedetti
Copy link
Contributor Author

Thanks @cpoerschke for the feedback, all merged.

Just done the rename, but had to use snake case "text_embedder" as looking around for query parsers, was not able to find camelCase instances, there was no uniformity to be honest, but I was able to find multi term query parsers and they were normally or snake_case or just all lowercase, which make the name less readable.

Proceeding now with the additional feedback and tests

@alessandrobenedetti
Copy link
Contributor Author

I've done another round of thinking, polishing and experimenting, my comments follow on the various open topics:

Tests

I spent a good 2-3 hours trying to refactor the tests from the current RestTestBase to SolrJettyTestRule.
It proved less intuitive than expected, didn't manage to finish, still a lot of errors, bits of code that were handy in RestTestBase and had to be duplicated, and I couldn't foresee any particular improvement in readability or other benefits.
So is this a route we really need to adhere to?
Current tests seem decently readable, with a good coverage and easy to extend.
Looking around I couldn't find a uniform landscape, different modules use different tests style, so If it's about consistency and we care in uniforming this, maybe it's worth doing in a separate issue and by someone with enough knowledge and focus to do this in a decent timeframe?
On the other hand if there are real blockers/wrongdoing with the current tests I am more than happy to fix them.

FileStore vs ManagedResource

Current Requirement:
A model is a simple tiny Json that contains the parameter to connect to an embedding service.
Discussion:
Spent some time offline with David around this topic and it seems using the FileStore could be an overkill for this requirement.
I then explored configSets in Solr and how we manage them, including managedResources.
In the end, I returned to square 1:
the current EmbeddingModelStore is a ManagedResource that supports persistency.
The code seems simple and readable enough.
So, do we really need to replace it with something else?
And in case, what do you think we should replace it with?
Let me know your thoughts, and thanks again for the help so far!
I believe we are making progress!

@dsmiley
Copy link
Contributor

dsmiley commented Nov 13, 2024

Thanks for your patience Alessandro and willingness to consider alternatives! Originally when I heard "model", I thought of some potentially large thing (the FileStore is good for large things); I wasn't thinking some metadata about a model. I also don't want to create new API surface area for something that at least sounds redundant. But the "ManagedResource" / REST support Solr has makes this maybe not an issue.

A small timeline note: please defer merging until after #2706 so we don't interfere with that delicate big transition that is close now. It will certainly impact any PR (like this one) that introduces dependencies.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:search dependencies Dependency upgrades documentation Improvements or additions to documentation jetty-server tests tool:build
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants