Suggestion to implement range_search #15

Open
pjlambert opened this issue Feb 25, 2024 · 1 comment

@pjlambert

Hi All, again - wonderful package and just terrific work.

One possible extension you might consider one day is using FAISS's range_search function instead of search (see https://github.com/facebookresearch/faiss/wiki/Special-operations-on-indexes#range-search). This would allow for a "many-to-many" match in the more traditional sense, perhaps bringing the behaviour of the LT package in line with earlier fuzzy matching packages.

The main drawback is that it is not GPU-friendly, but it works pretty efficiently on CPUs in my experience.

FWIW, my use case is to match the universe of job postings to DnB establishments. I use range_search along with your firm-name embeddings to build a dataset of all pairwise matches above a fairly low similarity threshold (0.5). This gives me a huge set of potential matches. I then run an expectation-maximisation algorithm that considers both the similarity scores and other structured covariates (not necessarily exact matching criteria), such as industry codes and location distance, to resolve the best match from this candidate set.
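To make the candidate-generation step concrete, here is a minimal sketch of what "all pairs above a similarity threshold" means. It is a brute-force NumPy stand-in and not the package's or FAISS's actual code; the function name `range_search_pairs` and the toy vectors are made up for illustration. With FAISS itself, the comparable call on an inner-product index would be `lims, D, I = index.range_search(queries, threshold)`, where the matches for query `i` sit in `I[lims[i]:lims[i + 1]]`.

```python
import numpy as np


def range_search_pairs(queries, corpus, threshold):
    """Return (query_idx, corpus_idx, similarity) for every pair whose
    cosine similarity exceeds `threshold` (hypothetical helper)."""
    # L2-normalise so that the inner product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T  # dense (n_queries, n_corpus) similarity matrix
    # Unlike top-k search, each query keeps a variable number of matches.
    qi, ci = np.nonzero(sims > threshold)
    return [(int(i), int(j), float(sims[i, j])) for i, j in zip(qi, ci)]


# Toy 2-d "embeddings" standing in for firm-name vectors.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
corpus = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [-1.0, 0.0]])

pairs = range_search_pairs(queries, corpus, threshold=0.5)
for query_idx, corpus_idx, sim in pairs:
    print(query_idx, corpus_idx, round(sim, 3))
```

The dense similarity matrix only works for small inputs; at the scale described here an index-based range search would be needed, and the resulting pairs would then feed the downstream resolution step.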

One day I would be happy to help implementing this, if you feel it's something you would want to pursue.

Thanks again for all the great work, it's hugely appreciated by many!

@econabhishek
Collaborator

Thanks for the wonderful suggestion, Peter.
We too have tried range search on other projects and found it to be great.

Re: CPU-only support, that's not a problem with the current version of the package, which only uses CPU faiss (primarily because of dependency issues). Feel free to create a pull request for this. We are going to update the package soon (along with the paper and the models, since we found a way to increase the amount of data available to us), so if you haven't made a request by then, I can implement it around mid-March.

We are thinking of creating a GPU-only branch for more scaled-up applications. It would not be offered as a pip package, primarily because dependency management is a bit messy between faiss GPU and the other required packages (pip install X doesn't work well).

I am glad that the package is working well for you. Hopefully we'll get close to a version 1.x.x soon.

Abhishek
