
Question about data processing: how are the position_ids generated after tokenization? #418

Open
XiaozhuLove opened this issue Sep 2, 2024 · 4 comments
Labels: question (Further information is requested)

@XiaozhuLove

In the data-processing step, how are the position_ids generated after tokenization?

```python
def tokenize_wo_pad_function(examples):
    tokenized_examples = tokenizer(examples["patent"])
    print("tokenized_examples00", type(tokenized_examples), len(tokenized_examples), tokenized_examples.keys())
    # <class 'transformers.tokenization_utils_base.BatchEncoding'> 3 dict_keys(['input_ids', 'attention_mask', 'position_ids'])
    return tokenized_examples
```

Is there extra processing somewhere in the code? In another project I ran, tokenization only produces the two fields 'input_ids' and 'attention_mask', and I can't see what this code does differently. Looking forward to the author's explanation, many thanks!

XiaozhuLove added the question label on Sep 2, 2024
@shibing624 (Owner)

Calling the tokenizer directly returns three fields by default: 'input_ids', 'attention_mask', and 'position_ids'; the result of tokenizer.encode(xx) has two.
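For context, which keys appear generally depends on the tokenizer class itself rather than on extra code in this repo; tokenizers shipped with custom code (ChatGLM-style models are one example) can emit position_ids by default. A minimal sketch to check the difference, with the checkpoint name as a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the model you are actually training with
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# Calling the tokenizer returns a BatchEncoding; which keys it contains
# depends on the tokenizer class (here it may include position_ids)
enc = tokenizer("some patent text")
print(enc.keys())

# encode() returns only the raw list of token ids, with no extra fields
ids = tokenizer.encode("some patent text")
print(type(ids), len(ids))
```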

@XiaozhuLove (Author)

Thanks for the reply. I compared the other project's code: it keeps only the two fields 'input_ids' and 'attention_mask', so it does do extra processing, which makes the cached pretraining files much smaller (e.g., by dropping the extra key before caching, as in the sketch below).
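A minimal sketch of that kind of stripping, assuming the tokenizer emits position_ids; the field names follow the snippet above:

```python
def tokenize_wo_pad_function(examples):
    tokenized_examples = tokenizer(examples["patent"])
    # Drop position_ids (if present) so only input_ids and attention_mask
    # are written to the cached dataset, shrinking the files on disk
    tokenized_examples.pop("position_ids", None)
    return tokenized_examples
```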
A separate question, about training time: with the same data, the two codebases report very different run times. One shows:

(screenshot: training time reported by the other project)

In MedicalGPT the reported time is much longer:

(screenshot: training time reported by MedicalGPT)

I can't tell why; comparing the code I don't see any difference. Do you have any suggestions? Many thanks!

@shibing624 (Owner) commented Sep 4, 2024

Which project are you comparing against? It may use training acceleration, such as flash attn.

What mainly matters is the quality of the trained model; compare that first.
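For reference, in recent transformers versions FlashAttention-2 can be requested at model load time; a minimal sketch, with the checkpoint as a placeholder (requires `pip install flash-attn` and a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; FlashAttention-2 needs fp16/bf16 weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```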

@XiaozhuLove (Author)

OK, thanks. The other project was an older version of llama-factory. I'll dig into it further; I'm surprised the training-time gap is that large.
