I was thinking the same thing: limit the name to 80 characters.
But I am still working out how to proceed when several books share a long common name and differ only in a suffix such as "part1", "part2", "part3", etc. After truncation, their names would be identical.
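One possible way around the shared-prefix problem, sketched here only as an illustration and not something proposed in this thread, is to append a short digest of the full name to the truncated name. The function name `truncate_with_digest` and the 8-character digest length are assumptions made for this example; the name is assumed to be passed without its extension.

```python
import hashlib

def truncate_with_digest(name: str, max_chars: int = 80, digest_chars: int = 8) -> str:
    """Shorten name to max_chars, ending in a digest of the full name.

    Names that differ only near the end (part1, part2, ...) get different
    digests with overwhelming probability, so they no longer collapse
    into the same truncated name.
    """
    if len(name) <= max_chars:
        return name
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:digest_chars]
    # Reserve room for the digest and one separator character.
    return name[: max_chars - digest_chars - 1] + "_" + digest

# Two hypothetical book names sharing a 100-character common part.
common = "x" * 100
part1 = common + "_part1"
part2 = common + "_part2"

print(part1[:80] == part2[:80])  # plain truncation cannot tell them apart
print(truncate_with_digest(part1) == truncate_with_digest(part2))
```

The trade-off is that the resulting names are less readable, which is one reason a hand-picked alternative name may be preferable.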
Yes, the common names are an issue, so I came up with another idea: a config variable for an alternative filename. I added a variable named 'filename_alt' to config.ini and put an alternative name for the file there, without any extension.
Then I added a condition to the script that checks the filename length. If it exceeds 80, the script uses 'filename_alt' + filetype as the file name; otherwise it works as usual. So the filetype and pagename still come from the URL, and only the name string comes from the 'filename_alt' variable in config.ini.
I have tried it with 3 books so far, and all worked as expected. I can send you a pull request if you like.
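The fallback described above could look roughly like the sketch below. It is not the code from the actual patch: the section name `settings`, the function name `choose_filename`, and the sample value of `filename_alt` are assumptions made for illustration. Only the `filename_alt` variable and the threshold of 80 come from the discussion.

```python
import configparser
import os

MAX_NAME_CHARS = 80  # threshold mentioned in the thread

def choose_filename(original_url: str, config: configparser.ConfigParser) -> str:
    """Return the URL's basename, or filename_alt + filetype if it is too long."""
    basename = os.path.basename(original_url)
    _, filetype = os.path.splitext(basename)  # e.g. ".djvu", taken from the URL
    if len(basename) <= MAX_NAME_CHARS:
        return basename
    # "settings" is a hypothetical section name; adjust to the real config.ini.
    alt = config.get("settings", "filename_alt", fallback="").strip()
    if not alt:
        raise ValueError("file name is too long and filename_alt is not set")
    return alt + filetype

# Inline stand-in for config.ini so the sketch is self-contained.
config = configparser.ConfigParser()
config.read_string("[settings]\nfilename_alt = bonger_jatiya_itihas\n")

print(choose_filename("https://example.org/short_name.djvu", config))
```

Raising an error when `filename_alt` is missing is a design choice of this sketch; silently truncating instead would reintroduce the name collisions discussed earlier.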
Issue
Most modern filesystems, including ext3 and ext4, limit a file or folder name to 255 bytes, that is, 255 single-byte (ANSI / ASCII) characters. If anyone uses data encryption (mostly eCryptfs, the Ubuntu default) as a layer on top of the filesystem, this limit comes down.
Moreover, in Indic languages we use Unicode instead of ANSI / ASCII. Once each character is encoded as several bytes (a Bengali character takes 3 bytes in UTF-8), in some cases only about 80-85 Unicode characters fit in practice.
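The arithmetic behind the 80-85 figure can be checked with a few lines of Python. This is a minimal sketch: the helper names `utf8_len` and `max_chars` are made up for the example, and 255 is the per-name byte limit mentioned above.

```python
NAME_MAX = 255  # per-name limit in bytes on ext3/ext4

def utf8_len(name: str) -> int:
    """Number of bytes the name occupies once encoded as UTF-8."""
    return len(name.encode("utf-8"))

def max_chars(bytes_per_char: int, limit: int = NAME_MAX) -> int:
    """How many characters of a fixed byte width fit into the limit."""
    return limit // bytes_per_char

ascii_name = "history"
bengali_name = "ইতিহাস"  # Bengali code points take 3 bytes each in UTF-8

print(len(ascii_name), utf8_len(ascii_name))      # characters vs. bytes
print(len(bengali_name), utf8_len(bengali_name))  # bytes are 3x the characters
print(max_chars(1), max_chars(3))                 # ASCII vs. Bengali capacity
```

Any prefix, suffix, or extension added to the name also counts against the same byte budget, which is why the usable length ends up somewhat below 85.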
Some of the books on Wikisource have long names, e.g. 'বঙ্গের_জাতীয়_ইতিহাস_(কায়স্থ_কাণ্ড,_ষষ্ঠাংশ,_দক্ষিণরাঢ়ীয়_কায়স্থ_কাণ্ড,_প্রথম_খণ্ড).djvu'. So the temp folder name becomes very long with its prefix 'OCR' and timestamp suffix. When mkdir tries to create the directory, it throws an error, 'filename too long'.
Possible solution:
I was fiddling around with the script and came up with the idea of separating the basename and the filename. My proposed solution is as follows.
do_ocr.py:109
basename = os.path.basename(original_url)
filename = basename[:80]  # limit the filename if it is longer than 80 chars
mediawiki_uploader.py:212
pagename = basename.encode('utf-8') + "/" + indic_page_number
This is a very rough idea, but I think you get my point.
Thanks.
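Since the filesystem limit is counted in bytes rather than characters, a byte-aware variant of the truncation idea above might look like the following Python 3 sketch. It is an illustration under assumptions, not the script's actual code: the function names and the 200-byte budget (chosen to leave room for the 'OCR' prefix and timestamp) are hypothetical, and unlike `basename[:80]` it keeps the file extension.

```python
import os

def truncate_utf8(name: str, max_bytes: int) -> str:
    """Cut name to at most max_bytes of UTF-8 without splitting a code point."""
    encoded = name.encode("utf-8")
    if len(encoded) <= max_bytes:
        return name
    # The input was valid UTF-8, so only the tail can be an incomplete
    # sequence after slicing; errors="ignore" drops those stray bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

def short_filename(original_url: str, max_bytes: int = 200) -> str:
    """Return the URL's basename shortened to max_bytes, extension preserved."""
    basename = os.path.basename(original_url)
    stem, ext = os.path.splitext(basename)
    budget = max_bytes - len(ext.encode("utf-8"))
    return truncate_utf8(stem, budget) + ext

# Hypothetical URL with a 100-character Bengali stem (300 bytes in UTF-8).
url = "https://example.org/" + "ক" * 100 + ".djvu"
name = short_filename(url)
print(len(name), len(name.encode("utf-8")))
```

This cuts on code-point boundaries only, so a trailing vowel sign or conjunct may still be separated from its base letter; it also does not address books that share a long common prefix.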