Hi, I trained a kenlm model on my own domain data (a 506 MB txt corpus), but in my tests its address-correction performance is worse than your 2.9 GB model. Could you share how you set the training parameters?
My training commands are as follows:
build/bin/lmplz -o 3 --verbose_header --text /kenlm/train_dataset/sj_jt_506m.txt --arpa /kenlm/trained_models/sj_jt_506m_240906.arps
build/bin/build_binary /kenlm/trained_models/sj_jt_506m_240906.arps /kenlm/trained_models/sj_jt_506m_240906.klm
The resulting 3-gram model is only 120 MB after converting it to klm.
My data is certainly insufficient, and I will add more later. But are your command parameters the same as mine? Looking forward to your answer. ;)
I used 5-gram with a massive training corpus; the training data is not open-sourced.
Got it, thanks.
A few more questions:
1. Is your training data tokenized by character or by word?
2. Did you use any other parameters? I see kenlm has more options, e.g. --prune.
3. Roughly how many GB is the "massive" corpus mentioned above?
Looking forward to your answer. ;)
1. By character; 2. Yes, --prune is used; 3. Hundreds of GB.
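Putting the maintainer's answers together (5-gram order, character-level tokenization, --prune enabled), the training commands might look like the sketch below. The file paths and the pruning thresholds are assumptions for illustration, not the maintainer's actual values:

```shell
# Hedged sketch, not the maintainer's exact command.
# Assumes the corpus has already been pre-tokenized into
# space-separated characters (char-level input), e.g. corpus_char.txt.

# Train a 5-gram model. --prune takes one count threshold per order;
# "0 1 1 1 1" keeps all unigrams and drops singleton n-grams of
# orders 2..5, which greatly shrinks the ARPA file.
build/bin/lmplz -o 5 \
  --verbose_header \
  --prune 0 1 1 1 1 \
  --text corpus_char.txt \
  --arpa model_5gram.arpa

# Convert the ARPA file to a binary model for fast loading.
build/bin/build_binary model_5gram.arpa model_5gram.klm
```

With character-level input, the sentence "北京市海淀区" should appear in the corpus as "北 京 市 海 淀 区" before being fed to lmplz; the actual prune thresholds are worth tuning against held-out correction accuracy rather than copied blindly.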