Detect duplicates #34

Open
Daniel-Mietchen opened this issue Sep 7, 2014 · 3 comments

@Daniel-Mietchen
Member

Fig. 1 of
https://en.wikisource.org/w/index.php?title=Wikisource:WikiProject_Open_Access/Programmatic_import_from_PubMed_Central/Modelling_the_Species_Distribution_of_Flat-Headed_Cats_%28Prionailurus_planiceps%29_an_Endangered_South-East_Asian_Small_Felid&oldid=5032599
was imported into
https://commons.wikimedia.org/wiki/File:Modelling-the-Species-Distribution-of-Flat-Headed-Cats-%28Prionailurus-planiceps%29-an-Endangered-South-pone.0009612.g001.jpg
but the image there already existed (in higher resolution) as
https://commons.wikimedia.org/wiki/File:Plionailurus_planiceps.png .
According to Commons policies, our upload should thus be deleted.

In such cases, it would be best if we could
(a) detect such a duplicate before upload
(b) post a message on that file's talk page with the proper metadata.
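As a first step towards (a), byte-identical copies could be looked up by SHA-1 before uploading. The snippet below is only a sketch: it builds a query URL for the MediaWiki `allimages` list with its `aisha1` parameter and does not send the request. The function name `exact_duplicate_query` is made up for illustration. Note that this only catches exact copies, so it would not have flagged the case above, where the existing file has a different resolution.

```python
import hashlib
from urllib.parse import urlencode

# Public API endpoint of Wikimedia Commons.
COMMONS_API = "https://commons.wikimedia.org/w/api.php"


def exact_duplicate_query(file_bytes: bytes) -> str:
    """Return a Commons API URL that lists files with the same SHA-1.

    Hypothetical helper: it only builds the URL, the caller still has
    to fetch it and inspect the JSON response.
    """
    sha1 = hashlib.sha1(file_bytes).hexdigest()
    params = {
        "action": "query",
        "list": "allimages",
        "aisha1": sha1,  # match files whose content hash equals ours
        "format": "json",
    }
    return COMMONS_API + "?" + urlencode(params)
```

If the response lists any file, the upload could be skipped and step (b) triggered instead.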

@notconfusing
Member

It's because the sizes are different. We have been over this problem before
(though I can't find an issue for it). Without implementing computer vision
algorithms, it'll be difficult to detect. The other avenue we tried was to
get PubMed to give us the maximum-resolution images they had, but after
some time they responded that their API will not support this. So we need
some fresh ideas.

Max Klein
http://notconfusing.com/

On Sun, Sep 7, 2014 at 2:50 PM, Daniel Mietchen wrote:

Assigned #34 to @notconfusing.

@jure

jure commented Sep 13, 2014

I think you're right, and this problem won't be easily solved without some image similarity magic. There's a good list of (and discussion about) applicable open source solutions here: http://ejohn.org/blog/image-similarity-search-wanted

For Python specifically, this looks pretty useful: http://www.guguncube.com/1656/python-image-similarity-comparison-using-several-techniques
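To make the idea concrete, here is a minimal sketch of one common similarity technique, an "average hash". It is not taken from either link and is not what the import bot does. It assumes the image has already been decoded into rows of grayscale values (0-255), for example with an imaging library; the names `shrink`, `average_hash` and `hamming` are made up for this example. Because the image is first reduced to a small fixed grid, the same picture at two resolutions should end up with (nearly) the same hash.

```python
from typing import List, Sequence

Gray = Sequence[Sequence[float]]  # rows of grayscale values, 0-255


def shrink(pixels: Gray, size: int = 8) -> List[List[float]]:
    """Downscale to size x size by averaging the pixels in each grid cell."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the hash grid")
    small = []
    for r in range(size):
        rows = range(r * height // size, (r + 1) * height // size)
        small_row = []
        for c in range(size):
            cols = range(c * width // size, (c + 1) * width // size)
            total = sum(pixels[y][x] for y in rows for x in cols)
            small_row.append(total / (len(rows) * len(cols)))
        small.append(small_row)
    return small


def average_hash(pixels: Gray, size: int = 8) -> int:
    """Return a size*size-bit hash: one bit per cell, set if above the mean."""
    flat = [value for row in shrink(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | int(value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Count the bits in which two hashes differ."""
    return bin(a ^ b).count("1")
```

A small Hamming distance between the hash of a new figure and the hash of an existing Commons file would suggest a likely duplicate. The cut-off for "small" is a tuning question, and real images with recompression or cropping will behave less cleanly than this sketch suggests.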

@notconfusing
Member

Thanks @jure, I've never looked into Python image similarity before; that seems to be a good starting point.

@jure mentioned this issue Sep 30, 2014