Hi, I noticed that the normalization differs from the formula in the original paper. I wonder if you had a reason for this choice, and I would be interested to discuss it further, since it seems to me that the formula in the original paper is better suited, but maybe you found an empirical optimization. Thank you, and congratulations again on the project and the presentation at Hack.lu.
To give more context: it seems to me that for each document (prelude and candidate) you are normalizing by its own size and then doing the comparison, whereas in the original paper the denominator is computed w.r.t. the candidate document (B) in both cases.
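To make the difference concrete, here is a rough sketch of the two normalizations as I read the thread. The helper names and the use of LZMA are purely illustrative, and the second function shows a common compression-classification formulation in which both denominators are the candidate's length |B|; neither is a quote of ZipPy's actual code or of the paper.

```python
import lzma

def compressed_size(data: bytes) -> int:
    # Compressed length in bytes (LZMA chosen purely for illustration).
    return len(lzma.compress(data))

def ratios_per_document(prelude: bytes, candidate: bytes) -> tuple[float, float]:
    # Each ratio is normalized by the size of the document it was computed
    # from, which is how the comment above describes ZipPy's current behaviour.
    prelude_ratio = compressed_size(prelude) / len(prelude)
    combined_ratio = compressed_size(prelude + candidate) / len(prelude + candidate)
    return prelude_ratio, combined_ratio

def ratios_per_candidate(prelude: bytes, candidate: bytes) -> tuple[float, float]:
    # Both quantities are normalized by the candidate's length |B|, as in the
    # usual compression-based cross-entropy estimate: compare C(B) / |B|
    # against (C(A+B) - C(A)) / |B|.
    baseline = compressed_size(candidate) / len(candidate)
    conditioned = (compressed_size(prelude + candidate) - compressed_size(prelude)) / len(candidate)
    return baseline, conditioned
```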
I have tried a number of different normalization routines, and have one that seems to improve accuracy (very slightly). You're right that there is going to be some natural compression for a longer input, so ZipPy is biased against human samples that are close to the detection threshold, because more symbols result in better compression. I compute the average compression ratio per character in the PRELUDE_FILE, then add that to the compression ratio calculation, adjusted by the length of the sample.

I will commit this new version once I complete my testing, with a `-n` command line argument feature flag so that the algorithm does not change for existing users.
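For anyone trying to picture the adjustment before the commit lands, here is one possible reading of the description above; the function names are hypothetical and this is not the actual ZipPy implementation, just a sketch under my own interpretation of "add the prelude's per-character ratio, adjusted by sample length".

```python
import lzma

def compression_ratio(data: bytes) -> float:
    # Compressed size divided by raw size (LZMA purely for illustration).
    return len(lzma.compress(data)) / len(data)

def length_adjusted_ratio(prelude: bytes, sample: bytes) -> float:
    # Hypothetical sketch of the described adjustment: take the prelude's
    # average compression ratio per character, scale it by the sample's
    # length, and add that correction to the combined compression ratio so
    # that longer samples are not rewarded merely for compressing better.
    per_char = compression_ratio(prelude) / len(prelude)
    combined = compression_ratio(prelude + sample)
    return combined + per_char * len(sample)
```

Whether the correction should be added or subtracted, and exactly how it is scaled, presumably depends on how the real scoring code is structured, so treat the sign and scaling here as placeholders.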