Changing limit parameter influences top k results #10

SeanPedersen · 2024-10-26T20:01:14Z

I noticed a strange behavior: when I change the limit in process.extract(query, limit=k) from k=10 to k=1000, I get different top-k results (better ones for higher k).

Expected behavior: the top 10 matches should be the same for limit=10 and limit=1000.
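The expectation above can be written as a small check. This is a minimal sketch: `top_k_is_stable` and `exact_extract` are hypothetical helpers, not part of neofuzz, and the bigram-overlap scorer is only a stand-in for an exhaustive search, where the property holds by construction.

```python
from typing import Callable, List, Tuple

Match = Tuple[str, float]


def top_k_is_stable(
    extract: Callable[[str, int], List[Match]], query: str, k: int, larger_limit: int
) -> bool:
    """Return True if the first k results are identical for limit=k and limit=larger_limit."""
    small = extract(query, k)
    large = extract(query, larger_limit)
    return small == large[:k]


def exact_extract(choices: List[str]) -> Callable[[str, int], List[Match]]:
    """Stand-in exhaustive searcher (NOT neofuzz): Jaccard overlap of character bigrams."""

    def bigrams(s: str) -> set:
        return {s[i:i + 2] for i in range(len(s) - 1)}

    def extract(query: str, limit: int) -> List[Match]:
        q = bigrams(query)
        scored = [
            (c, len(q & bigrams(c)) / max(len(q | bigrams(c)), 1)) for c in choices
        ]
        # Sort by descending score; break ties by string so the order is deterministic.
        scored.sort(key=lambda m: (-m[1], m[0]))
        return scored[:limit]

    return extract


choices = ["ATGG", "GCATTG", "CCCC", "ATGC", "TTTT"]
stable = top_k_is_stable(exact_extract(choices), "ATGC", k=2, larger_limit=5)
print(stable)
```

To run the same check against an indexed neofuzz process, the callable could be something like `lambda q, n: process.extract(q, limit=n)`.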

x-tabdeveloping · 2024-10-27T13:40:36Z

Eerie... Can you give me a minimal reproducible example of this so I can investigate?

Glombsen · 2024-12-20T16:39:11Z

Hey

I can confirm this behavior. I have to compare the UMI of a sequence with a long list of known UMIs. As soon as I change the limit of the search, the results change.

Here is some example code to reproduce it:

```python
import random

from neofuzz import char_ngram_process

umi_list = []
umi_check_list = []
letters = "ATGC"

# Build 1000 random 20-mers to index and 1000 more to use as queries.
for _ in range(1000):
    umi_list.append("".join(random.choice(letters) for _ in range(20)))
    umi_check_list.append("".join(random.choice(letters) for _ in range(20)))

process = char_ngram_process()
process.index(umi_list)

# Keep sampling queries until the top result differs between the two limits.
difference = False
while not difference:
    umi = random.choice(umi_check_list)
    found_ten = process.extract(umi, limit=10, refine_levenshtein=True)
    found_thousand = process.extract(umi, limit=1000, refine_levenshtein=True)
    # Compare the second field of the top result for each limit.
    if found_ten[0][1] != found_thousand[0][1]:
        print(found_ten)
        print(found_thousand)
        difference = True
```

Hope it helps, and thanks for this great module.
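One possible mechanism for this kind of behavior, shown here as a toy and not as neofuzz's actual implementation: if a search first retrieves `limit` candidates with a cheap score and only then re-ranks those candidates by Levenshtein distance, a larger `limit` gives the re-ranking step a bigger pool and can change the top hit. Whether neofuzz works this way is an assumption; the shared-bigram score and the `two_stage_extract` helper below are made up for illustration.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}


def two_stage_extract(query: str, choices: List[str], limit: int) -> List[Tuple[str, int]]:
    """Hypothetical two-stage search: coarse retrieval, then Levenshtein re-ranking."""
    q = bigrams(query)
    # Stage 1: keep only the `limit` choices sharing the most bigrams with the query.
    coarse = sorted(choices, key=lambda c: (-len(q & bigrams(c)), c))[:limit]
    # Stage 2: re-rank just those candidates by edit distance (smaller is better).
    return sorted(((c, levenshtein(query, c)) for c in coarse), key=lambda m: (m[1], m[0]))


choices = ["GCATTG", "ATGG", "CCCC"]
top_small = two_stage_extract("ATGC", choices, limit=1)[0][0]
top_large = two_stage_extract("ATGC", choices, limit=3)[0][0]
print(top_small, top_large)
```

In this toy, "GCATTG" shares the most bigrams with the query, so it is the only candidate that survives `limit=1`, while "ATGG" is closer by edit distance and wins once the pool is large enough to include it.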
