implements processing files in serial manner && watch for new files && smarter delete to preserve largest file #57

Open
wants to merge 7 commits into master
Conversation

DZamataev

Implements processing files in a serial manner when num_processes==1, which makes DB insertions happen right after each hash instead of at the end of processing a single chunk (a whole library).

Also adds some prints for the total number of files to be processed and for successful inserts to the database.

…ype.

Error log was:

```
File "C:\Python37\lib\site-packages\magic\magic.py", line 196, in errorcheck_null
    raise MagicException(err)
magic.magic.MagicException: b"cannot read `filename.jpg' (Permission denied)"
```
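For context, a minimal sketch of how this unreadable-file error could be guarded, assuming python-magic's `magic.from_file` and that `MagicException` is exposed on the top-level `magic` module (as in the single-file python-magic distribution); this is not necessarily the fix the commit applies:

```python
import magic  # python-magic

def safe_mime_type(path):
    """Return the file's MIME type, or None when libmagic cannot read it,
    e.g. the Windows "Permission denied" error in the traceback above."""
    try:
        return magic.from_file(path, mime=True)
    except magic.MagicException:
        return None
```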
@DZamataev
Author

Thanks for sharing the awesome script, @philipbl!
It's really nice, but I've faced some issues on my Windows machine.
Installation was not flawless, I think because the pip requirements do not include everything. I had to rerun the script several times, getting error messages for absent modules (mongo, libmagic and so on), which I installed with pip one by one.
The next issue was multiprocessing. I am sorting thousands of files, and parallel execution led to stuck DB writes. The hashes were going OK, they were fast, but no real writes to the DB occurred after each hash, so when I interrupted my script the database was empty. I tried to fix it at first by using chunks on the parallel map function, but it didn't work because the files entries were not considered iterable by the parallel map; I don't know why. So I had to stick to serial execution, which I implemented in this PR. Please consider merging it, because it's a big yet seamless improvement. Another benefit of a single-process run with the serial map function is instant keyboard interruption; for some reason I had to wait for minutes after pressing CTRL+C with the parallel map approach.
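For illustration, a minimal self-contained sketch of the serial fallback described above; `hash_file` and `insert_into_db` are stand-ins, not the script's actual helpers:

```python
import hashlib
from multiprocessing import Pool

def hash_file(path):
    # Stand-in for the script's image-hashing step.
    with open(path, "rb") as f:
        return path, hashlib.md5(f.read()).hexdigest()

def insert_into_db(record):
    # Stand-in for the DB write; the real script inserts into MongoDB.
    print("inserted", record)

def process(files, num_processes=1):
    if num_processes == 1:
        # Serial path (this PR): insert right after each hash, so an
        # interrupted run keeps everything hashed so far, and CTRL+C
        # takes effect immediately.
        for f in files:
            insert_into_db(hash_file(f))
    else:
        # Parallel path: pool.map returns only after the whole chunk is
        # hashed, so nothing reaches the DB until the end.
        with Pool(num_processes) as pool:
            for record in pool.map(hash_file, files):
                insert_into_db(record)
```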
Best regards.

@DZamataev DZamataev changed the title implements processing files in serial manner implements processing files in serial manner and watch for new files Feb 1, 2019
…nt given. When a file modification occurs recursively in this path, the modified file will be added as if the `add` command had been chosen. Useful when you are sorting your library and adding new images to it. Removing is not supported yet.

Also bumps the Pillow version because of Python 3.6 compatibility issues I experienced on Windows. Not tested thoroughly after the update.
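A minimal sketch of what such a watch loop could look like using the watchdog package; `add_image` is a hypothetical stand-in for whatever the `add` command does per file:

```python
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class AddOnModify(FileSystemEventHandler):
    def on_modified(self, event):
        if not event.is_directory:
            add_image(event.src_path)  # hypothetical: same work as `add`

def watch(path):
    # Watch the path recursively and add files as they are modified.
    observer = Observer()
    observer.schedule(AddOnModify(), path, recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
```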
@DZamataev
Author

Added a function I needed to watch for incoming files and add them as they are modified. Also updated Pillow.

@DZamataev
Author

DZamataev commented Feb 1, 2019

> It's really nice, but I've faced some issues on my Windows machine.
> Installation was not flawless, I think because the pip requirements do not include everything. I had to rerun the script several times, getting error messages for absent modules (mongo, libmagic and so on), which I installed with pip one by one.

No issues after uninstalling Python 3.7 and installing Python 3.6 with the updated version of the Pillow dependency.
Only one extra dependency installation is necessary on Windows: libmagic. Use `pip install python-magic`.

…ameter to sort files by size and preserve the largest file on delete.
@DZamataev DZamataev changed the title implements processing files in serial manner and watch for new files implements processing files in serial manner && watch for new files && smarter delete to preserve largest file Feb 4, 2019
@DZamataev
Author

DZamataev commented Feb 4, 2019

Now I have also implemented the --filter-largest parameter to sort files by size and preserve the largest file on delete.
It is disabled by default for compatibility. Simply add --filter-largest to the find --delete command and it will delete only the smaller duplicates, leaving the largest file in place.
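For illustration, the selection logic might look like this sketch (hypothetical, not the PR's actual code):

```python
import os

def files_to_delete(duplicate_paths, filter_largest=False):
    """Given paths of files sharing the same hash, return those to delete.

    With filter_largest, paths are sorted by size descending so the
    largest file is the one kept; the rest are deleted.
    """
    if filter_largest:
        duplicate_paths = sorted(duplicate_paths,
                                 key=os.path.getsize, reverse=True)
    # Keep the first entry, delete the rest.
    return duplicate_paths[1:]
```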

@DZamataev
Author

DZamataev commented Feb 5, 2019

There was a critical issue, which is now fixed, but the tests still don't pass because of the added option.

@DZamataev
Author

DZamataev commented Feb 5, 2019

The tests will pass from now on. BTW, I have not covered the added features with tests.
