Hi, I noticed that the normalization differs from the formula in the original paper. I wonder if you had a reason for this choice, and I would be interested to discuss it further, since it seems to me that the formula in the original paper is better suited, but maybe you found an empirical optimization. Thank you, and congratulations again on the project and the presentation at Hack.lu.
To give more context: it seems to me that for each document (prelude and candidate) you are normalizing by its own size and then doing the comparison, whereas in the original paper the denominator is computed w.r.t. the candidate document (B) in both cases.
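To make the difference concrete, here is a rough sketch of the two normalizations as I read the thread. The helper names and the use of LZMA are purely illustrative, and the second function shows a common compression-classification formulation in which both denominators are the candidate's length |B|; neither is a quote of ZipPy's actual code or of the paper.

```python
import lzma

def compressed_size(data: bytes) -> int:
    # Compressed length in bytes (LZMA chosen purely for illustration).
    return len(lzma.compress(data))

def ratios_per_document(prelude: bytes, candidate: bytes) -> tuple[float, float]:
    # Each ratio is normalized by the size of the document it was computed
    # from, which is how the comment above describes ZipPy's current behaviour.
    prelude_ratio = compressed_size(prelude) / len(prelude)
    combined_ratio = compressed_size(prelude + candidate) / len(prelude + candidate)
    return prelude_ratio, combined_ratio

def ratios_per_candidate(prelude: bytes, candidate: bytes) -> tuple[float, float]:
    # Both quantities are normalized by the candidate's length |B|, as in the
    # usual compression-based cross-entropy estimate: compare C(B) / |B|
    # against (C(A+B) - C(A)) / |B|.
    baseline = compressed_size(candidate) / len(candidate)
    conditioned = (compressed_size(prelude + candidate) - compressed_size(prelude)) / len(candidate)
    return baseline, conditioned
```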
I have tried a number of different normalization routines, and have one that seems to improve accuracy (very slightly). You're right that there is going to be some natural compression for a longer input, so ZipPy is biased against human samples that are close to the detection threshold, because more symbols result in better compression. I compute the average compression ratio per character in the PRELUDE_FILE, then add that to the compression ratio calculation, adjusted by the length of the sample.

I will commit this new version once I complete my testing, with a `-n` command line argument feature flag so that the algorithm does not change for existing users.
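For anyone trying to picture the adjustment before the commit lands, here is one possible reading of the description above; the function names are hypothetical and this is not the actual ZipPy implementation, just a sketch under my own interpretation of "add the prelude's per-character ratio, adjusted by sample length".

```python
import lzma

def compression_ratio(data: bytes) -> float:
    # Compressed size divided by raw size (LZMA purely for illustration).
    return len(lzma.compress(data)) / len(data)

def length_adjusted_ratio(prelude: bytes, sample: bytes) -> float:
    # Hypothetical sketch of the described adjustment: take the prelude's
    # average compression ratio per character, scale it by the sample's
    # length, and add that correction to the combined compression ratio so
    # that longer samples are not rewarded merely for compressing better.
    per_char = compression_ratio(prelude) / len(prelude)
    combined = compression_ratio(prelude + sample)
    return combined + per_char * len(sample)
```

Whether the correction should be added or subtracted, and exactly how it is scaled, presumably depends on how the real scoring code is structured, so treat the sign and scaling here as placeholders.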