words are missing or out of order #30

Open
trzhong opened this issue May 12, 2020 · 9 comments

Comments

@trzhong

trzhong commented May 12, 2020

I read an EPUB in Chinese using epr on macOS 10.15.4 with Python 3.7, and it displayed this:

窦文涛:今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的都无数次采访过您,通过电话连线。今天终于是见着真人了,我觉得您真是很有风度的一眉立目的那么一款,没想到看上去很温婉。样子的时候,会觉得您是穿着警服有点横

And the content displayed in ibooks is:

窦文涛:今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的是第一次见到您,但是在我和傅见锋[2] 做的节目当中,我们好像都无数次采访过您,通过电话连线。今天终于是见着真人了,我觉得您真是很有风度的一位女士!原来他们做点好采访,我没见到您样子的时候, 会觉得您是穿着警服有点横 眉立目的那么一款,没想到看上去很温婉。会觉得您是穿着警服有点横

This is not limited to this paragraph or this book; many others have the same problem.

@wustho
Owner

wustho commented May 12, 2020

This is crucial. I will try a Chinese EPUB when I'm free. Originally this only supported English, but I will have a look.

@wustho
Owner

wustho commented May 12, 2020

Hey there. I just looked into it, and it seems this is beyond my capability, sorry. I hope someone else makes a PR for this issue. It probably has something to do with the HTMLtoLines(HTMLParser) class, if anyone cares to help fix it.

@trzhong
Author

trzhong commented May 15, 2020

Since "textwrap.wrap()" cannot handle Chinese characters properly, I tried adding the code below to "HTMLtoLines.get_lines":

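            else:
                w = width
                l = len(i)  # number of characters in the line
                cjk_l = len(i.encode(encoding='UTF-8'))  # number of UTF-8 bytes
                # Rough count of 1-byte (ASCII) characters, assuming the rest are 3-byte CJK.
                asc_l = int((l * 3 - cjk_l) / 3)
                if cjk_l > l:  # the line contains multi-byte characters
                    # Shrink the wrap width, since CJK characters take two columns each.
                    w = int(w * l / (l * 2 - asc_l))
                text += textwrap.wrap(i, w) + [""]
        return text, self.imgs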

Although it does display the content correctly, I don't think this is the best solution. I would prefer a better wrapping library.
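
The idea behind the patch above (shrinking the wrap width in proportion to how many double-width characters a line contains) can also be sketched with the standard library's `unicodedata.east_asian_width` instead of UTF-8 byte counts. This is a minimal sketch, not the code used in epr: `display_width` and `scaled_wrap_width` are hypothetical helper names, and it assumes wide and fullwidth characters take two terminal columns and everything else takes one.

```python
import textwrap
import unicodedata


def display_width(text: str) -> int:
    """Count terminal columns, assuming wide/fullwidth characters take 2."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def scaled_wrap_width(text: str, width: int) -> int:
    """Hypothetical helper: shrink a character-based wrap width by the
    ratio of characters to columns in this particular line."""
    columns = display_width(text)
    if columns <= len(text):  # no wide characters, nothing to adjust
        return width
    return max(1, width * len(text) // columns)


line = "窦文涛:今天我终于见到了一位我一直想见到的老师"
w = scaled_wrap_width(line, 30)
wrapped = textwrap.wrap(line, w)
```

Because the width is scaled by the line's average, a line that mixes long ASCII runs with long CJK runs can still overflow; the scaling only approximates a true column-based wrap.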

@wustho
Owner

wustho commented May 15, 2020

Wow, that's impressive troubleshooting. After reading your comment, I did some googling and found this: https://bugs.python.org/issue24665

Indeed, as you said, textwrap.wrap() cannot handle Chinese characters properly. It also seems the issue about CJK support in textwrap was closed as rejected, apparently over some confusion. So I think we won't get support for non-Latin alphabets soon. For now I will add this issue as a limitation in the README while we wait for a better wrapping library, as you suggested.
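
A short demonstration of the mismatch described here, as a sketch: `textwrap.wrap()` measures width in characters, while a terminal measures it in columns. The two-columns-per-CJK-character rule is an assumption about typical terminal rendering, and the sample text is taken from the paragraph quoted at the top of the thread.

```python
import textwrap
import unicodedata

cjk = "今天终于是见着真人了我觉得您真是很有风度"

# textwrap counts characters, so each line holds up to 10 of them.
lines = textwrap.wrap(cjk, width=10)

# Assumed terminal behaviour: wide/fullwidth characters occupy two columns.
columns = [
    sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in ln)
    for ln in lines
]
```

Under that assumption, every "10-wide" line needs 20 columns, twice what was requested, so the text overflows or is cut off on screen.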

@wustho
Owner

wustho commented Jul 12, 2020

@trzhong hey there, you might want to try https://github.com/aeosynth/bk as an alternative.

@aeosynth

I added support for wide characters to bk. There may be other issues; for example, I don't know the line-breaking rules for Asian text.

1Q84 by Murakami rendered to 30 columns:

[image: 1q84]

@trzhong
Author

trzhong commented Sep 27, 2020

I'm still using my patch. Thanks for the information.

@trzhong
Author

trzhong commented Jan 17, 2021

Finally, I found [rich] as a replacement for [textwrap].

from rich import cells
replace every [textwrap.wrap] call with [cells.chop_cells]

That's all.
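
A minimal sketch of that swap, under some assumptions: `rich.cells` exposes `cell_len` and `chop_cells`, and `chop_cells(text, width)` is called with two positional arguments. `wrap_paragraph` is a hypothetical helper, not a function from epr. The `except ImportError` branch holds simplified stand-ins so the sketch runs without rich installed; they are illustrative only and are not rich's implementation.

```python
import unicodedata

try:
    from rich.cells import cell_len, chop_cells
except ImportError:
    # Stand-ins for illustration only, used when rich is not installed.
    def cell_len(text: str) -> int:
        return sum(
            2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
            for ch in text
        )

    def chop_cells(text: str, width: int) -> list:
        lines, current, used = [], "", 0
        for ch in text:
            size = cell_len(ch)
            if current and used + size > width:
                lines.append(current)
                current, used = "", 0
            current += ch
            used += size
        if current:
            lines.append(current)
        return lines


def wrap_paragraph(paragraph: str, width: int) -> list:
    """Hypothetical helper: split a paragraph into pieces that each fit
    within `width` terminal cells."""
    return chop_cells(paragraph, width)


pieces = wrap_paragraph("今天abc终于见到", 6)
```

One design trade-off to keep in mind: as far as I can tell, `chop_cells` cuts at character boundaries rather than at spaces, so English text may be split mid-word where `textwrap.wrap` would break between words.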

@wustho
Owner

wustho commented Jan 18, 2021

Wow, https://github.com/willmcgugan/rich seems so powerful and feature-rich. Thanks for pointing that out, mate. I'll try to implement it in epy.

tang-yikai added a commit to tang-yikai/epr that referenced this issue Sep 25, 2024
Inspired by trzhong from issue
wustho#30 (comment)
3 participants