
Releases: x-tabdeveloping/neofuzz

v0.3.0

06 Sep 14:48
f338b5d

New in version 0.3.0

Now you can reorder your search results using Levenshtein distance!
Sometimes n-gram or other vectorized processes don't rank the results quite right.
In these cases you can retrieve a larger number of candidates from the indexed corpus, then refine their order with Levenshtein distance.

This gives you the speed of Neofuzz, with the accuracy of TheFuzz :D

from neofuzz import char_ngram_process

process = char_ngram_process()
process.index(corpus)

top_5 = process.extract("your query", limit=30, refine_levenshtein=True)[:5]

v0.2.0

21 May 10:05

1. Added subword tokenization

If you intend to use subword features, which are more informative than character n-grams, you can now do so.
I've introduced a new vectorizer component that can utilise pretrained tokenizers from language models for feature extraction.

Example code:

from neofuzz import Process
from neofuzz.tokenization import SubWordVectorizer

# We can use BERT's WordPiece tokenizer for feature extraction
vectorizer = SubWordVectorizer("bert-base-uncased")
process = Process(vectorizer, metric="cosine")

2. Added code for persisting processes

You might want to persist processes to disk and reuse them in production pipelines.
Neofuzz can now serialize indexed Process objects for you using joblib.

You can save indexed processes like so:

from neofuzz import char_ngram_process

process = char_ngram_process()
process.index(corpus)

process.to_disk("process.joblib")

And then load them in a production environment:

from neofuzz import Process

process = Process.from_disk("process.joblib")