Phonetic vectorization? #13

Open
dkbarn opened this issue Nov 4, 2024 · 2 comments

dkbarn commented Nov 4, 2024

This is more of a question than a bug report:

I have a somewhat different use case from the ones covered in this library's documentation. I want to search for similar-sounding syllables rather than match text character by character, so my plan is to apply some sort of phonetic encoding to my corpus (e.g. Soundex or Metaphone). But I'm not sure how to do this in a way that is compatible with neofuzz's Process: scikit-learn doesn't appear to provide an out-of-the-box vectorizer for phonetic encoding of text, and I'm not sure whether the SubWordVectorizer could somehow be leveraged for this.

Any pointers on how to achieve this with neofuzz?

x-tabdeveloping (Owner) commented

I'd say the easiest way is to override the preprocessor attribute of a vectorizer:

from neofuzz import Process
from sklearn.feature_extraction.text import CountVectorizer
from pyphonetics import Metaphone

metaphone = Metaphone()

def phonetic_preprocessor(text: str) -> str:
    return metaphone.phonetics(text)

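# Character n-gram range; (1, 3) is only an example value, tune it for your corpus.
ngram_range = (1, 3)

vectorizer = CountVectorizer(
    ngram_range=ngram_range,
    analyzer="char",
    preprocessor=phonetic_preprocessor,
)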
process = Process(vectorizer, metric="cosine")
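
If you'd rather not depend on a phonetics package, or your entries contain several words, a hand-rolled encoder can fill the same slot. The following is only a minimal sketch of American Soundex in pure Python, not something neofuzz ships; the names `soundex` and `phonetic_preprocessor` are illustrative, and encoding each whitespace-separated token separately is an assumption about how multi-word text should be treated.

```python
# Digit classes used by American Soundex.
_CODES = {}
for _digit, _letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a word, or "" if it has no ASCII letters."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w are ignored and do not separate letters with equal codes.
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # Vowels reset prev to "", so equal codes split by a vowel are both kept.
        prev = code
    return (encoded + "000")[:4]


def phonetic_preprocessor(text: str) -> str:
    """Encode every whitespace-separated token and join the codes with spaces."""
    codes = (soundex(token) for token in text.split())
    return " ".join(code for code in codes if code)


print(phonetic_preprocessor("Robert Rupert"))
```

Since a preprocessor is just a callable from string to string, this function should be usable as the `preprocessor=` argument in the snippet above in place of the Metaphone-based one. Soundex codes are short and coarse, so character n-grams over them carry less information than n-grams over Metaphone output; which one works better likely depends on the corpus.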

x-tabdeveloping (Owner) commented

Now that you mention it, this would probably make a great addition to the docs.
