KenLM training and parameters #514

Open
Jamie2898 opened this issue Sep 6, 2024 · 3 comments
Labels
question Further information is requested

Comments

@Jamie2898

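Hi, I trained a KenLM model on my own domain data (a 506M txt file), but in my tests its address-correction results are worse than those of your 2.9G model. How did you set the parameters?

My training commands are as follows: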

build/bin/lmplz \
  -o 3 \
  --verbose_header \
  --text /kenlm/train_dataset/sj_jt_506m.txt \
  --arpa /kenlm/trained_models/sj_jt_506m_240906.arps

build/bin/build_binary /kenlm/trained_models/sj_jt_506m_240906.arps /kenlm/trained_models/sj_jt_506m_240906.klm

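The resulting 3-gram model is only 120M after conversion to klm.


My data is certainly not enough, and I will add more later. But are your command parameters the same as mine? Looking forward to your reply. ;)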

@Jamie2898 Jamie2898 added the question Further information is requested label Sep 6, 2024
@shibing624
Owner

I used a 5-gram model with a massive training set. The training set is not open source.

@Jamie2898
Author

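OK, thank you.

1. Is your training data tokenized by character or by word?
2. Did you use any other parameters? I see that KenLM has others, such as --prune.
3. Could you tell me roughly how many GB the "massive" training set is?

Looking forward to your reply. ;)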

@shibing624
Owner

1. By character.
2. Yes, I added --prune.
3. On the order of a hundred GB.
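For readers who want to try the setup described above, here is a minimal Python sketch under stated assumptions. It covers two things: splitting each line into space-separated characters (KenLM treats whitespace-separated tokens as words, so this gives character-level n-grams), and building a 5-gram `lmplz` command line with `--prune`. The helper names `to_char_tokens` and `build_lmplz_cmd`, the file paths, and the prune thresholds `0 0 1` are all hypothetical. The thread does not say which thresholds were actually used, so tune them on your own data.

```python
import shlex


def to_char_tokens(line: str) -> str:
    """Return the line as space-separated characters (character-level tokens)."""
    # Existing whitespace is dropped so that every remaining character is one token.
    return " ".join(ch for ch in line if not ch.isspace())


def build_lmplz_cmd(text_path, arpa_path, order=5, prune=(0, 0, 1)):
    """Build an lmplz argument list.

    The prune thresholds are an illustrative assumption, not the values used
    in the thread. lmplz requires the first threshold (unigrams) to be 0 and
    the sequence to be non-decreasing; the last value applies to higher orders.
    """
    if prune and prune[0] != 0:
        raise ValueError("lmplz does not support unigram pruning; first value must be 0")
    if len(prune) > order:
        raise ValueError("more prune thresholds than n-gram orders")
    cmd = [
        "build/bin/lmplz",
        "-o", str(order),
        "--text", text_path,
        "--arpa", arpa_path,
    ]
    if prune:
        cmd += ["--prune", *[str(p) for p in prune]]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths; the corpus is assumed to be already character-split.
    print(to_char_tokens("北京市 朝阳区"))
    print(shlex.join(build_lmplz_cmd("corpus.char.txt", "model.arpa")))
```

The command is only printed here, not executed. To run it, pass the list to `subprocess.run(cmd, check=True)` on a machine where KenLM has been built, then convert the ARPA file with `build_binary` as in the original post.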
