Issue

shabab · 2016-02-19T12:23:19Z

Most modern filesystems, including ext3 and ext4, limit a file or folder name to 255 bytes, which is 255 characters only when every character is a single byte (ASCII). If anyone uses data encryption (mostly eCryptfs, the Ubuntu default), which is layered on top of the filesystem, this limit comes down further.

Moreover, in Indic languages we use Unicode instead of ANSI / ASCII. Each character is then stored as several bytes (three bytes per Bengali character in UTF-8), so in some cases we can only use 80-85 Unicode characters in practice.

Some of the books in Wikisource have long names, e.g. 'বঙ্গের_জাতীয়_ইতিহাস_(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu'. So the temp folder name becomes very long with its prefix 'OCR' and timestamp suffix. When mkdir tries to create the directory, it throws an error, 'filename too long'.
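To make the gap between characters and bytes concrete, here is a minimal sketch in Python 3 (the project code quoted in this thread looks like Python 2). The helper names `utf8_len` and `fits_in_name_limit` are made up for illustration and are not part of the project.

```python
# Sketch: compare the character count and the UTF-8 byte count of a file name.
# NAME_MAX_BYTES is the 255-byte per-name limit mentioned above for ext3/ext4.
NAME_MAX_BYTES = 255


def utf8_len(name: str) -> int:
    """Return the number of bytes the name occupies when encoded as UTF-8."""
    return len(name.encode("utf-8"))


def fits_in_name_limit(name: str, limit: int = NAME_MAX_BYTES) -> bool:
    """Return True if the UTF-8 encoded name fits within the byte limit."""
    return utf8_len(name) <= limit


book = "বঙ্গের_জাতীয়_ইতিহাস_(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu"
# The byte count is what the filesystem checks, not the character count.
print(len(book), "characters,", utf8_len(book), "bytes")
```

The same check can be run on the full temp folder name (prefix, book name and timestamp together) before calling mkdir.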

Possible solution:

I was fiddling around with the script and came up with the idea of separating the basename and the filename. My proposed solution is as follows.

do_ocr.py:109
basename = os.path.basename(original_url)
filename = basename[:80]  # limit the filename if it is longer than 80 characters

mediawiki_uploader.py:212
pagename = basename.encode('utf-8') + "/" + indic_page_number

This is a very rough idea, but I think you get my point.
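Since the filesystem limit is counted in bytes, a byte-aware variant of the slicing above may be safer than cutting at 80 characters. The following is only a sketch in Python 3; `truncate_utf8` is a hypothetical helper, not something that exists in do_ocr.py. It can still cut between a consonant and its vowel sign, so the shortened name may look slightly different at the end.

```python
def truncate_utf8(name: str, max_bytes: int) -> str:
    """Shorten name so that its UTF-8 encoding is at most max_bytes long."""
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # Cutting the byte string may leave an incomplete multi-byte sequence at
    # the end; errors="ignore" drops those leftover bytes when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

For example, `truncate_utf8(basename, 200)` would leave room for a prefix and a timestamp within the 255-byte limit; the value 200 is an arbitrary choice for illustration.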

Thanks.


tshrinivasan · 2016-02-22T14:28:44Z

I am also thinking the same: limit the name to 80 characters.
But I am still thinking about how to proceed when a few books share a lengthy common name and differ only in a suffix such as "part1, part2, part3, etc".

Share your thoughts.
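One possible way to keep shortened names distinct when books share a long common prefix is to append a short hash of the full name. This is only a sketch of that idea in Python 3, not something the thread settled on; `short_unique_name` and its parameters are hypothetical.

```python
import hashlib


def short_unique_name(stem: str, max_chars: int = 80, digest_chars: int = 8) -> str:
    """Shorten stem to max_chars, keeping it distinct via a hash suffix."""
    if len(stem) <= max_chars:
        return stem
    # The digest is computed from the full name, so "..._part1" and
    # "..._part2" get different suffixes even if their prefixes are identical.
    digest = hashlib.sha1(stem.encode("utf-8")).hexdigest()[:digest_chars]
    keep = max_chars - digest_chars - 1  # one character for the "_" separator
    return stem[:keep] + "_" + digest
```

The limit here is counted in characters, as in the proposal above; for Indic names the byte length would still need checking separately.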

shabab · 2016-02-22T15:26:06Z

Yes, the common names are an issue. So I came up with another idea: a config variable for alternative filenames. I added a variable named 'filename_alt' to config.ini and put an alternative name for the file there, without any extension.

Then I added a condition in the script to check the filename length. If it exceeds 80, the script uses 'filename_alt' + filetype as the file name; otherwise it works as usual. So the filetype and pagename still come from the URL, and only the name string comes from the 'filename_alt' variable in config.ini.
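The fallback described above could be sketched roughly like this in Python 3. The section name `settings`, the sample config text and the function `choose_filename` are assumptions for illustration; only the `filename_alt` key and the 80-character threshold come from this thread.

```python
import configparser
import os

MAX_NAME_CHARS = 80  # threshold discussed in this thread


def choose_filename(original_url: str, filename_alt: str,
                    max_chars: int = MAX_NAME_CHARS) -> str:
    """Use filename_alt + filetype when the name from the URL is too long."""
    basename = os.path.basename(original_url)
    _stem, filetype = os.path.splitext(basename)  # filetype keeps the dot
    if len(basename) > max_chars and filename_alt:
        return filename_alt + filetype
    return basename


# Hypothetical config.ini content; the real file may use another section name.
SAMPLE_CONFIG = """
[settings]
filename_alt = short_book_name
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)
alt = config.get("settings", "filename_alt", fallback="")
```

With this sketch, a short name from the URL is returned unchanged, and an empty `filename_alt` simply disables the fallback.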

I have tried it with 3 books so far and all worked as expected. I can send you a pull request if you like.

tshrinivasan · 2016-02-22T16:19:16Z

Great.

Share the example book URLs and do a pull request.

Thanks.

issue - Filename too long #71
