Releases: x-tabdeveloping/neofuzz
v0.3.0
New in version 0.3.0
Now you can reorder your search results using Levenshtein distance!
Sometimes n-gram or other vectorized processes don't order the results quite correctly.
In these cases you can retrieve a larger number of candidates from the indexed corpus, then refine their ranking with Levenshtein distance.
This gives you the speed of Neofuzz, with the accuracy of TheFuzz :D
from neofuzz import char_ngram_process
process = char_ngram_process()
process.index(corpus)
# Retrieve 30 candidates, rerank them with Levenshtein distance, keep the top 5
top_5 = process.extract("your query", limit=30, refine_levenshtein=True)[:5]
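For comparison, here is what the same top-5 query looks like without refinement, where results come back in the order given by the vectorized index; a minimal sketch reusing the process indexed above:
# No reranking: the index's own ordering is returned directly
top_5_fast = process.extract("your query", limit=5)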
v0.2.0
1. Added subword tokenization
If you want to use subword features, which can be more informative than character n-grams, you can now do so.
I've introduced a new vectorizer component that can utilise pretrained tokenizers from language models for feature extraction.
Example code:
from neofuzz import Process
from neofuzz.tokenization import SubWordVectorizer
# We can use BERT's WordPiece tokenizer for feature extraction
vectorizer = SubWordVectorizer("bert-base-uncased")
process = Process(vectorizer, metric="cosine")
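The resulting process is used like any other Neofuzz process; a minimal sketch of indexing and querying, assuming the same corpus and query placeholders as the other examples:
# Index the corpus with subword features, then query as usual
process.index(corpus)
results = process.extract("your query", limit=5)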
2. Added code for persisting processes
You might want to persist processes to disk and reuse them in production pipelines.
Neofuzz can now serialize indexed Process objects for you using joblib.
You can save indexed processes like so:
from neofuzz import char_ngram_process
process = char_ngram_process()
process.index(corpus)
# Serialize the indexed process to disk with joblib
process.to_disk("process.joblib")
And then load them in a production environment:
from neofuzz import Process
process = Process.from_disk("process.joblib")
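Once loaded, the process can be queried straight away; a minimal sketch using the extract() call shown above (the query string is a placeholder):
results = process.extract("your query", limit=10)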