issue - Filename too long #71

Open
shabab opened this issue Feb 19, 2016 · 3 comments

Comments

@shabab

shabab commented Feb 19, 2016

Issue

Most modern filesystems, including ext3 and ext4, limit a file or folder name to 255 bytes, which is 255 characters only when every character takes a single byte (as in ASCII). If anyone uses layered data encryption (mostly eCryptfs, the Ubuntu default), this limit comes down further.

Moreover, in Indic languages we use Unicode instead of ANSI / ASCII. Each character then takes several bytes once encoded (three bytes in UTF-8 for Bengali), so in some cases we can use only 80-85 Unicode characters in practice.

Some of the books on Wikisource have long names, e.g. 'বঙ্গের_জাতীয়_ইতিহাস_(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu'. So the temp folder name becomes very long once the 'OCR' prefix and the timestamp suffix are added. When mkdir tries to create the directory, it throws an error, 'filename too long'.
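
To make the byte count concrete, here is a minimal sketch that measures a name in UTF-8 bytes rather than characters. The helper names `utf8_len` and `fits` are made up for illustration, and the folder name format (separator and timestamp layout) is an assumption, not what the script actually produces. It is written for Python 3.

```python
def utf8_len(name):
    """Return the number of bytes the name occupies once encoded as UTF-8."""
    return len(name.encode("utf-8"))


def fits(name, limit=255):
    """True if the encoded name fits in `limit` bytes (255 on ext3/ext4)."""
    return utf8_len(name) <= limit


book = "বঙ্গের_জাতীয়_ইতিহাস_(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu"

# Hypothetical temp folder name: the real prefix/timestamp format may differ.
folder = "OCR-" + book + "-2016-02-19-10-30-00"

print("characters:", len(book))
print("bytes (file name):", utf8_len(book))
print("bytes (temp folder):", utf8_len(folder))
print("fits in 255 bytes:", fits(folder))
```

The point is only that the limit is counted in bytes, so a name that looks well under 255 characters can still be close to or over the limit, and an encrypted layer lowers the ceiling further.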

Possible solution:

I was fiddling around with the script and came up with the idea of separating the basename and the filename. My proposed solution is as follows.

do_ocr.py:109
basename = os.path.basename(original_url)
filename = basename[:80]  # limit the filename if it is longer than 80 chars

mediawiki_uploader.py:212
pagename = basename.encode('utf-8') + "/" + indic_page_number

This is a very rough idea, but I think you get my point.
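
One caveat with slicing by characters is that 80 Bengali characters can still be about 240 bytes. A variant of the same idea, sketched here for Python 3, cuts on a byte budget instead and keeps the extension. `truncate_utf8` and `shorten_filename` are hypothetical helpers, not functions from do_ocr.py, and the 80-byte default is only an example value.

```python
import os


def truncate_utf8(text, max_bytes):
    """Cut text so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # A partial multi-byte sequence left at the cut point is dropped by errors="ignore".
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def shorten_filename(basename, max_bytes=80):
    """Shorten the stem of a file name to a byte budget, keeping the extension."""
    stem, ext = os.path.splitext(basename)
    budget = max(max_bytes - len(ext.encode("utf-8")), 0)
    return truncate_utf8(stem, budget) + ext


print(shorten_filename("a_very_long_book_name.djvu", max_bytes=12))
```

This cuts on code point boundaries only, so a conjunct or a vowel sign at the very end of the shortened name may still be separated from its base letter.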

Thanks.

@tshrinivasan
Owner

I am thinking the same: limit the name to 80 characters.
But I am still working out how to proceed when a few books share a lengthy common name and differ only in a suffix such as "part1", "part2", "part3", etc.

Share your thoughts.

@shabab
Author

shabab commented Feb 22, 2016

Yes, the common names are an issue. So I came up with another idea: a config variable for alternative filenames. I added a variable named 'filename_alt' to config.ini and put an alternative name for the file there, without any extension.

Then I added a condition in the script to check the filename length. If it exceeds 80, the script uses 'filename_alt' + filetype as the file name; otherwise it works as usual. So the filetype and pagename still come from the URL, and only the name string comes from the 'filename_alt' variable in config.ini.
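
Roughly, the condition looks like the sketch below. This is a simplified Python 3 illustration, not the exact code from my branch: the section name `settings`, the function name `choose_filename` and the example URLs are placeholders.

```python
import configparser
import os

MAX_NAME_LENGTH = 80  # threshold discussed in this thread


def choose_filename(original_url, config, section="settings"):
    """Return the file name to use for temp folders and downloads."""
    basename = os.path.basename(original_url)
    if len(basename) <= MAX_NAME_LENGTH:
        return basename  # short enough: work as usual
    # Too long: take the name from config.ini and the file type from the URL.
    _, filetype = os.path.splitext(basename)
    return config.get(section, "filename_alt") + filetype


config = configparser.ConfigParser()
# Stand-in for config.ini; the section name is an assumption.
config.read_string("[settings]\nfilename_alt = short_book_name\n")

long_url = "https://example.org/files/" + "x" * 100 + ".djvu"
print(choose_filename(long_url, config))
print(choose_filename("https://example.org/files/abc.djvu", config))
```

Since 'filename_alt' is set per book, books that share a long common name and differ only in "part1", "part2", etc. can each get a distinct short name.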

I have tried it with 3 books so far and all worked as expected. I can send you a pull request if you like.

@tshrinivasan
Owner

Great.

Share the example book URLs and do a pull request.

Thanks.
