
Add score-based spam detection besides blacklist #3

Open
ktos opened this issue Oct 2, 2020 · 4 comments
Labels
enhancement New feature or request

Comments

@ktos

ktos commented Oct 2, 2020

  • I'm submitting a ...
    [ ] spammer report
    [ ] bug report
    [X] feature request
    [ ] question about the decisions made in the repository
    [ ] question about how to use this project

  • Summary

It seems that this tool is currently only a simple blacklist, but I think some kind of negative scoring system could be introduced.

  • Other information (e.g. detailed explanation, stack traces, related issues, suggestions how to fix, links for us to have context, eg. StackOverflow, personal fork, etc.)

I believe we can score PRs negatively (and positively) and mark them as spam if a defined threshold is met. For example, some things that could deduct points:

  • Changes only in text files (.md, .html),
  • Changes only in one file (or removal of a single file),
  • Changes only in one line,
  • Changes containing the words "awesome" or "amazing" ;) (aka: blacklisting words in commit messages and the diffs themselves),
  • Empty descriptions,
  • "patch-1" as the name of the remote branch.

Of course, it's not the best solution, as it won't be 100% bulletproof, but what do you think?
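As a rough sketch of the idea above (in TypeScript; the `PullRequestInfo` shape, the point values, and the threshold are all hypothetical illustrations, not part of this project):

```typescript
// Hypothetical PR summary shape — assumed for illustration only.
interface PullRequestInfo {
  changedFiles: string[];
  changedLines: number;
  description: string;
  branchName: string;
  commitMessages: string[];
}

const TEXT_EXTENSIONS = [".md", ".html"];
const SUSPICIOUS_WORDS = ["awesome", "amazing"];

// Each matching criterion deducts one point; the weights are arbitrary.
function spamScore(pr: PullRequestInfo): number {
  let score = 0;
  if (
    pr.changedFiles.length > 0 &&
    pr.changedFiles.every(f => TEXT_EXTENSIONS.some(ext => f.endsWith(ext)))
  ) score -= 1;                                    // only text files touched
  if (pr.changedFiles.length === 1) score -= 1;    // single file changed
  if (pr.changedLines === 1) score -= 1;           // single line changed
  const text = [pr.description, ...pr.commitMessages].join(" ").toLowerCase();
  if (SUSPICIOUS_WORDS.some(w => text.includes(w))) score -= 1; // blacklisted words
  if (pr.description.trim() === "") score -= 1;    // empty description
  if (pr.branchName === "patch-1") score -= 1;     // default web-editor branch name
  return score;
}

// A PR is flagged only when enough criteria accumulate, so a legitimate
// one-line typo fix alone stays above the (hypothetical) threshold.
function isLikelySpam(pr: PullRequestInfo, threshold = -4): boolean {
  return spamScore(pr) <= threshold;
}
```

With a threshold like this, no single criterion is enough on its own; several of them have to coincide before a PR is flagged.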

@StefanJanssen95

That is not really possible. Someone changing four words with actual typos can be making a sincere pull request, and if I made a sincere pull request and got labelled as spam right away, I'm not sure I would spend my time on a project like that.

@ktos
Author

ktos commented Oct 2, 2020

In my mind, most (or all, or a configurable number of) checks must be met for a PR to be marked as spam, so legitimately correcting typos shouldn't trigger anything.

@maximelafarie
Owner

Thank you for your contribution @ktos! As said by @StefanJanssen95, we need to refine the criteria to distinguish sincere PRs from spam PRs as accurately as possible.

As planned in #1, if we detach the blacklist from the build and make it an external JSON file, we can absolutely add some more details and indicators attached to a user.

This implies defining a new, more complete model based on the criteria we would use to compute a trust score.
In addition, it would be nice to let users configure their own minimum allowed threshold in the GitHub Action. Feel free to make some suggestions, propose some code and make some PRs.
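One way the Action could pick up a user-configured threshold (a TypeScript sketch: GitHub Actions does expose `with:` inputs to the action process as `INPUT_<NAME>` environment variables, but the `spam-threshold` input name and the default value of -4 are assumptions, not this Action's actual API):

```typescript
// Read a hypothetical `spam-threshold` input. GitHub Actions passes
// workflow inputs as INPUT_<NAME> environment variables; the input name
// and the -4 default here are illustrative assumptions.
function getThreshold(): number {
  const raw = process.env["INPUT_SPAM-THRESHOLD"] ?? "";
  const parsed = Number(raw);
  // Fall back to the assumed default when the input is missing or non-numeric.
  return raw.trim() !== "" && Number.isFinite(parsed) ? parsed : -4;
}
```

Users who want a stricter or looser check would then only need to change one value in their workflow file.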

@maximelafarie maximelafarie added the enhancement New feature or request label Oct 2, 2020
@aminya

aminya commented Jan 22, 2021

I think this is a cool idea, but hard to implement. Someone could train a neural network on the database to find relations between this kind of information and the spamminess of a PR.

If implemented, the intelligent algorithm should not close PRs automatically, but instead label them as "possibly not following the standards".
