words are missing or out of order #30

trzhong · 2020-05-12T06:11:05Z

I read an EPUB in Chinese using epr on macOS 10.15.4 with Python 3.7, and one paragraph was displayed like this:

窦文涛：今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的都无数次采访过您，通过电话连线。今天终于是见着真人了，我觉得您真是很有风度的一眉立目的那么一款，没想到看上去很温婉。样子的时候，会觉得您是穿着警服有点横

The same paragraph displayed in iBooks is:

窦文涛：今天 [1] 我终于见到了一位我一直想见到的老师——李玫瑾老师。虽然今天真的是第一次见到您，但是在我和傅见锋[2] 做的节目当中，我们好像都无数次采访过您，通过电话连线。今天终于是见着真人了，我觉得您真是很有风度的一位女士！原来他们做点好采访，我没见到您样子的时候， 会觉得您是穿着警服有点横 眉立目的那么一款，没想到看上去很温婉。~~会觉得您是穿着警服有点横~~

This is not limited to this paragraph or this book; many others have the same problem.

wustho · 2020-05-12T09:27:44Z

This is crucial. I will try a Chinese EPUB when I'm free. Originally this only supported English, but I will have a look.

wustho · 2020-05-12T23:07:49Z

Hey there. I just tried looking into it, and it seems this is beyond my capability, sorry. I hope someone else makes a PR for this issue. It probably has something to do with the HTMLtoLines(HTMLParser) class, if anyone cares to help fix it.

trzhong · 2020-05-15T15:53:27Z

Since "textwrap.wrap()" cannot handle Chinese characters properly, I tried adding the code below to "HTMLtoLines.get_lines":

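            else:
                # Shrink the wrap width for lines containing multi-byte (e.g. CJK) text,
                # since textwrap.wrap() counts characters, not terminal columns.
                w = width
                l = len(i)                               # number of characters
                cjk_l = len(i.encode(encoding='UTF-8'))  # number of UTF-8 bytes
                # Rough (under-)estimate of the 1-byte characters, assuming the rest take 3 bytes.
                asc_l = int((l * 3 - cjk_l) / 3)
                if cjk_l > l:  # the line contains non-ASCII characters
                    # Scale the width by characters / estimated display columns.
                    w = int(w * l / (l * 2 - asc_l))
                text += textwrap.wrap(i, w) + [""]
        return text, self.imgs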

Although it does display the content correctly, I don't think this is the best solution. I would prefer a better wrapping library.
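A standalone variant of the width-scaling idea above, as a minimal sketch. It assumes every non-ASCII character takes 3 UTF-8 bytes and 2 terminal columns (true for common CJK ideographs, not for every script). The name scaled_wrap_width is hypothetical and not part of epr. Unlike the patch, it divides by 2, which gives the exact ASCII count under that assumption.

```python
import textwrap


def scaled_wrap_width(line: str, width: int) -> int:
    """Return a reduced wrap width for lines that contain wide characters.

    Assumption: ASCII is 1 byte and 1 column; everything else is
    3 UTF-8 bytes and 2 columns.
    """
    n_chars = len(line)
    n_bytes = len(line.encode("utf-8"))
    if n_chars == 0 or n_bytes == n_chars:
        return width  # empty or pure ASCII: nothing to adjust
    # n_bytes = 3 * n_wide + n_ascii and n_chars = n_wide + n_ascii,
    # so 3 * n_chars - n_bytes == 2 * n_ascii.
    n_ascii = (3 * n_chars - n_bytes) // 2
    columns = 2 * n_chars - n_ascii  # estimated display width of the line
    return max(1, width * n_chars // columns)


line = "窦文涛：今天我终于见到了李玫瑾老师。"
wrapped = textwrap.wrap(line, scaled_wrap_width(line, 40))
```

Because the width is scaled by the average character width of the whole line, a wrapped line can still overflow when wide characters cluster together, which is one reason a width-aware wrapper is preferable.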

wustho · 2020-05-15T22:15:25Z

Wow, that's impressive troubleshooting... After I read your comment, I did some googling, and found this: https://bugs.python.org/issue24665

Indeed, as you said, textwrap.wrap() cannot handle Chinese characters properly. It also seems that the issue about CJK support in textwrap was closed with a "rejected" resolution, apparently after some confusion. So I think we won't get support for non-Latin alphabets soon. For now I will add this issue as a limitation in the README while we wait for a better wrapping library, as you suggested.
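The limitation can be reproduced with the standard library alone. The sketch below is illustrative: display_width is a hypothetical helper that treats East Asian wide and fullwidth characters as 2 columns, which is an approximation of what a terminal does.

```python
import textwrap
import unicodedata


def display_width(s: str) -> int:
    """Approximate terminal columns: wide/fullwidth characters count as 2."""
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1 for ch in s)


sample = "今天终于是见着真人了" * 3  # 30 characters, no spaces
lines = textwrap.wrap(sample, 10)

for line in lines:
    # textwrap limits the number of characters, so each line stays within
    # 10 characters but needs twice as many terminal columns.
    print(len(line), display_width(line), line)
```

Each wrapped line respects the 10-character limit, yet occupies about 20 columns, so a viewer that reserves only 10 columns per line will overflow or lose text.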

wustho · 2020-07-12T11:25:59Z

@trzhong hey there, you might want to try https://github.com/aeosynth/bk as an alternative.

aeosynth · 2020-07-17T06:24:38Z

I added support for wide characters to bk. There may be other issues; for example, I don't know the line-breaking rules for Asian text.

1Q84 by Murakami rendered to 30 columns: (screenshot)

trzhong · 2020-09-27T15:27:00Z

I'm still using my patch. Thanks for the information.

trzhong · 2021-01-17T15:09:17Z

Finally, I found [rich] as a replacement for [textwrap].

Add "from rich import cells", then replace all [textwrap.wrap] calls with [cells.chop_cells].

That's all.
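To illustrate what chopping by terminal cells means, here is a minimal stdlib-only sketch. It is not rich's implementation; the helper names cell_width and chop_by_cells are hypothetical, and the width rule (East Asian wide or fullwidth = 2 columns, combining marks = 0, everything else = 1) is an approximation.

```python
import unicodedata
from typing import List


def cell_width(ch: str) -> int:
    """Approximate number of terminal columns one character occupies."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def chop_by_cells(text: str, width: int) -> List[str]:
    """Split text into pieces that each fit within `width` columns."""
    lines: List[str] = []
    current: List[str] = []
    used = 0
    for ch in text:
        w = cell_width(ch)
        # Start a new line when the next character would not fit.
        if current and used + w > width:
            lines.append("".join(current))
            current, used = [], 0
        current.append(ch)
        used += w
    if current:
        lines.append("".join(current))
    return lines


pieces = chop_by_cells("今天ab终于", 4)
print(pieces)
```

Note that chopping by cells breaks wherever the column budget runs out, including in the middle of an English word, so it is not a drop-in substitute for word wrapping in Latin-script text.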

wustho · 2021-01-18T00:43:34Z

Wow, https://github.com/willmcgugan/rich seems so powerful and feature-rich. Thanks for pointing that out, mate. I'll try to integrate it into epy.

wustho mentioned this issue Mar 3, 2021

Missing leading chinese characters #41

Open

tang-yikai added a commit to tang-yikai/epr that referenced this issue Sep 25, 2024

Update epr.py

02d73e0

Inspired by trzhong from issue wustho#30 (comment)

tang-yikai mentioned this issue Sep 25, 2024

Fix missing words on long lines for Chinese language. #66

Open
