QwenTokenizer vs Qwen2Tokenizer #295

Open
sexan opened this issue Jul 23, 2024 · 3 comments
Comments

@sexan

sexan commented Jul 23, 2024

Hello, and thanks for providing the Pai-Megatron framework. I have a few questions about the Qwen tokenizers and would appreciate your answers, thanks!
1) Have the vocabulary and the tokenization scheme been the same across the Qwen series (Qwen, Qwen1.5, Qwen2)?
2) If they are the same, why are there two tokenizers, QwenTokenizer and Qwen2Tokenizer?
3) If I want to use a Qwen1.5 model, which tokenizer should I choose?
(screenshot attached)

@divisionblur

Qwen2Tokenizer looks like it is adapted to megatron-core: it inherits from MegatronTokenizer.
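To make that difference concrete, here is a minimal sketch of what such an adapter could look like, wrapping a Hugging Face tokenizer. The import path and the hook names (tokenize, detokenize, vocab, inv_vocab, vocab_size, eod) are assumptions based on recent megatron-core releases, not the actual Pai-Megatron-Patch implementation:

```python
# Hedged sketch: a Hugging Face Qwen2 tokenizer wrapped for Megatron-style training.
# The MegatronTokenizer import path and the required hooks are assumptions drawn
# from recent megatron-core releases, not from the Pai-Megatron-Patch source.
from megatron.core.datasets.megatron_tokenizer import MegatronTokenizer
from transformers import AutoTokenizer


class Qwen2TokenizerSketch(MegatronTokenizer):
    """Routes megatron-core tokenizer hooks to an underlying HF tokenizer."""

    def __init__(self, tokenizer_path: str):
        super().__init__(tokenizer_path)
        self._tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        # Qwen2-style checkpoints ship an <|endoftext|> token; used here as end-of-document.
        self._eod_id = self._tokenizer.convert_tokens_to_ids("<|endoftext|>")

    def tokenize(self, text: str):
        return self._tokenizer.encode(text)

    def detokenize(self, token_ids):
        return self._tokenizer.decode(token_ids)

    @property
    def vocab(self):
        return self._tokenizer.get_vocab()

    @property
    def inv_vocab(self):
        return {idx: tok for tok, idx in self._tokenizer.get_vocab().items()}

    @property
    def vocab_size(self):
        return len(self._tokenizer)

    @property
    def eod(self):
        return self._eod_id
```

The point of the pattern is only that every hook the Megatron dataloader calls is forwarded to the Hugging Face tokenizer underneath; the plain QwenTokenizer predates this interface.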

@sexan
Author

sexan commented Jul 26, 2024


Is that inheritance required? QwenTokenizer does not inherit from it, so why does Qwen2Tokenizer start doing so?

@KKCDD

KKCDD commented Aug 12, 2024

In the examples/qwen1_5 training scripts, everything uses Qwen2Tokenizer except run_pretrain_megatron_qwen.sh, which uses llama_tokenizer.
Judging from the Hugging Face code, the Llama tokenizer and the Qwen2 tokenizer behave the same, but the Qwen2Tokenizer implementation in Pai is adapted by inheriting from megatron-core's tokenizer.
Qwen (v1) uses a tiktoken-based tokenizer and a different vocabulary; by now everything should be using Qwen2Tokenizer.
(screenshot of the training scripts attached)
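On the Qwen1.5 question specifically, one way to see which tokenizer class the released checkpoints expect is to load one with transformers and inspect it. The repo id below is only an example, and the printed class name may vary slightly across transformers versions:

```python
# Sketch: check which tokenizer class a Qwen1.5 checkpoint resolves to on the
# Hugging Face side. "Qwen/Qwen1.5-7B" is used here only as an example repo id.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen1.5-7B")
print(type(tok).__name__)          # expected to report a Qwen2 tokenizer class
print(len(tok))                    # vocabulary size as seen by the tokenizer
print(tok.encode("你好, Qwen!"))    # token ids for a short sample string
```

If the reported class is a Qwen2 tokenizer, that matches the advice above: use Qwen2Tokenizer for Qwen1.5 models.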
