Scoring formula normalization coefficient #5

Closed Answered by ranok
mazzma12 asked this question in Q&A

I have tried a number of different normalization routines and found one that improves accuracy, though only very slightly. You're right that a longer input naturally compresses better, since more symbols give the compressor more to work with, so ZipPy is biased against human samples that are borderline. To correct for this, I compute the average compression ratio per character in the PRELUDE_FILE, then add that to the compression ratio calculation, adjusted by the length of the sample.
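For readers who want to see the shape of that correction, here is a minimal sketch. It is not ZipPy's actual code: the function names, the use of `zlib`, the definition of the ratio as compressed size over raw size, and the sign of the final delta are all assumptions made for illustration. The exact way the per-character term is combined with the ratio is not spelled out above, so the multiplication by sample length is one plausible reading only.

```python
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size (assumed definition of the ratio)."""
    if not data:
        raise ValueError("cannot compute a ratio for empty input")
    return len(zlib.compress(data, 9)) / len(data)


def per_char_ratio(prelude: str) -> float:
    """Average compression ratio per character of the prelude text."""
    raw = prelude.encode("utf-8")
    return compression_ratio(raw) / len(raw)


def normalized_delta(prelude: str, sample: str) -> float:
    """Hypothetical length-adjusted score.

    Compares the ratio of prelude + sample against the prelude alone,
    after adding a correction that grows with the sample length.
    """
    prelude_ratio = compression_ratio(prelude.encode("utf-8"))
    combined_ratio = compression_ratio((prelude + sample).encode("utf-8"))
    # Assumption: the adjustment is the per-character ratio times the
    # number of characters in the sample.
    correction = per_char_ratio(prelude) * len(sample)
    return (combined_ratio + correction) - prelude_ratio


if __name__ == "__main__":
    prelude_text = "the quick brown fox jumps over the lazy dog. " * 40
    print(normalized_delta(prelude_text, "A short human-written sentence."))
```

With an empty sample the correction is zero and the delta is zero, so the adjustment only takes effect as the sample grows.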

I will commit this new version once I complete my testing. It will sit behind a -n command-line flag, so the algorithm does not change for existing users unless they opt in.
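An opt-in flag like this could be wired up roughly as follows. This is a sketch with `argparse`, not the project's real argument parser: only the `-n` short option comes from the text above, while the `normalize` destination name, the help text, and the `build_parser` helper are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with an opt-in normalization flag (illustrative only)."""
    parser = argparse.ArgumentParser(description="feature-flag sketch")
    # Off by default, so existing users keep the current scoring behavior.
    parser.add_argument(
        "-n",
        dest="normalize",  # hypothetical attribute name
        action="store_true",
        help="enable the length-normalized scoring (assumed wording)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("normalization enabled" if args.normalize else "normalization disabled")
```

Using `store_true` keeps the default at `False`, which matches the stated goal of leaving the algorithm unchanged unless the flag is passed.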

Answer selected by ranok
Category: Q&A
Labels: bug
This discussion was converted from issue #3 on November 07, 2023 15:54.