Phonetic vectorization? #13

dkbarn · 2024-11-04T05:28:09Z

This is more of a question than a bug report:

I have a somewhat different use case from what the documentation covers. I want to search for similar-sounding syllables rather than match text character by character, so my plan is to apply some sort of phonetic encoding to my corpus (e.g. Soundex or Metaphone). But I am not sure how to do this in a way that is compatible with neofuzz's Process: scikit-learn doesn't seem to provide an out-of-the-box vectorizer for phonetic encoding of text, and I'm not sure whether the SubWordVectorizer could somehow be leveraged for this.

Any pointers on how to achieve this with neofuzz?

x-tabdeveloping · 2024-11-04T13:51:55Z

I'd say the easiest way is to override the preprocessor attribute of a vectorizer:

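from neofuzz import Process
from sklearn.feature_extraction.text import CountVectorizer
from pyphonetics import Metaphone

metaphone = Metaphone()

def phonetic_preprocessor(text: str) -> str:
    # Replaces the vectorizer's default preprocessing (lowercasing, accent stripping)
    return metaphone.phonetics(text)

ngram_range = (1, 3)  # example value (an assumption); tune for your data

# Character n-grams are then built over the phonetic encoding instead of the raw text
vectorizer = CountVectorizer(ngram_range=ngram_range, analyzer="char", preprocessor=phonetic_preprocessor)
process = Process(vectorizer, metric="cosine")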
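
If you'd rather not depend on pyphonetics, any callable that maps a string to a string should work as the preprocessor. Below is a rough pure-Python sketch of American Soundex (the other encoding mentioned in the question) applied token by token. The names soundex and phonetic_preprocessor are just illustrative helpers, not part of neofuzz or scikit-learn, and this simplified version drops non-ASCII letters, so treat it as a starting point rather than a reference implementation.

```python
# Minimal American Soundex sketch (illustrative helper, not a library API).
_CODES = {}
for _letters, _digit in (
    ("bfpv", "1"),
    ("cgjkqsxz", "2"),
    ("dt", "3"),
    ("l", "4"),
    ("mn", "5"),
    ("r", "6"),
):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of a single word ("" if no letters)."""
    # Assumption: only ASCII letters are considered; everything else is dropped.
    letters = [ch for ch in word.lower() if ch.isascii() and ch.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = _CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            # h and w do not separate consonants that share a code
            continue
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        # Vowels have no code, so they reset prev and allow a repeated digit
        prev = code
    # Pad with zeros or truncate to the usual length of 4
    return (encoded + "000")[:4]


def phonetic_preprocessor(text: str) -> str:
    """Encode each whitespace-separated token, keeping tokens space-separated."""
    return " ".join(soundex(token) for token in text.split())


print(phonetic_preprocessor("Robert and Rupert"))  # prints "R163 A530 R163"
```

This phonetic_preprocessor can be passed as preprocessor= to the CountVectorizer in the same way as the Metaphone version. Soundex codes are short and fixed-length, so character n-grams over them carry less information than n-grams over Metaphone output; which encoding works better for similar-sounding syllables is something you would have to check on your own data.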

x-tabdeveloping · 2024-11-04T13:55:54Z

Now that you mention it, this would probably make a great addition to the docs.
