Could the [newpage] and [chapter:] markers be stripped directly from crawled pixiv novel content? #124

Open
Volta-XTY opened this issue Oct 12, 2024 · 3 comments

Comments

@Volta-XTY

Quite a lot of pixiv novels seem to contain many [newpage] and [chapter:] markers, as shown in the screenshot below:
image
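These extra markers mainly affect the Sakura translator: they cause a line-count mismatch between input and output, which makes it fall back to line-by-line translation: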
image
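The screenshot above is an example where the translator fails to output [newpage] unchanged.
Once it falls back to line-by-line translation, the translator's throughput drops sharply, so filtering out these markers might help.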

@FishHawk
Owner

My action points can't quite keep up with site maintenance right now, so this will have to wait.

@Volta-XTY
Author

Volta-XTY commented Oct 15, 2024

/web/src/domain/translate/TranslateWeb.ts seems to contain code that preprocesses the source text:
image
So, as a stopgap, could we add two more matching rules to the preprocessing, /\[newpage\]/ and /\[chapter:[^\]]*\]/, and replace every match with an empty string?
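
A minimal sketch of what those two rules would do, assuming they are applied to the whole text before it is split into lines. `stripPixivMarkers` is a hypothetical name for illustration, not a function from the project. The `g` flag is added so that every occurrence is replaced rather than only the first.

```typescript
// Hypothetical helper, not part of the project: removes the two pixiv novel
// markers mentioned above from a block of text.
const NEWPAGE = /\[newpage\]/g;
const CHAPTER = /\[chapter:[^\]]*\]/g;

function stripPixivMarkers(text: string): string {
  // Replace every match with an empty string, as proposed.
  return text.replace(NEWPAGE, "").replace(CHAPTER, "");
}

// Sample input with one marker of each kind on its own line.
const sample = "[chapter:序章]\n一行目\n[newpage]\n二行目";
console.log(JSON.stringify(stripPixivMarkers(sample)));
```

Note that a line holding only a marker becomes an empty line rather than disappearing, and the title inside [chapter:...] is discarded along with the marker. Whether either of those is acceptable is a design choice.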

@FishHawk
Owner

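That code handles table-of-contents translation; the crawler lives in the backend. If you really can't wait, you can open a PR. The crawler part can be tested without setting up a database, using kotest is enough.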


2 participants