
Question about data processing: how are the position_ids generated after tokenization? #418

Open
XiaozhuLove opened this issue Sep 2, 2024 · 4 comments
Labels: question (Further information is requested)

@XiaozhuLove

In the data-processing step, how are the position_ids generated after tokenization?

```python
def tokenize_wo_pad_function(examples):
    tokenized_examples = tokenizer(examples["patent"])
    print("tokenized_examples00", type(tokenized_examples), len(tokenized_examples), tokenized_examples.keys())
    # <class 'transformers.tokenization_utils_base.BatchEncoding'> 3 dict_keys(['input_ids', 'attention_mask', 'position_ids'])
    return tokenized_examples
```

Is there extra processing somewhere in the code? In another project I ran, tokenization only produces the two fields 'input_ids' and 'attention_mask', and I can't see what this code does differently. Looking forward to the author's explanation, many thanks!

XiaozhuLove added the question label on Sep 2, 2024
@shibing624 (Owner)

Calling the tokenizer directly returns three fields by default: 'input_ids', 'attention_mask', and 'position_ids'; the result of tokenizer.encode(xx) has two.
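For context, which keys appear generally depends on the tokenizer class itself rather than on extra code in this repo; tokenizers shipped with custom code (ChatGLM-style models are one example) can emit position_ids by default. A minimal sketch to check the difference, with the checkpoint name as a placeholder:

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the model you are actually training with
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# Calling the tokenizer returns a BatchEncoding; which keys it contains
# depends on the tokenizer class (here it may include position_ids)
enc = tokenizer("some patent text")
print(enc.keys())

# encode() returns only the raw list of token ids, with no extra fields
ids = tokenizer.encode("some patent text")
print(type(ids), len(ids))
```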

@XiaozhuLove (Author)

Thanks for the reply. I compared the other project's code: it keeps only the two fields 'input_ids' and 'attention_mask', so it does do extra processing, which makes the cached pretraining files much smaller (e.g., by dropping the extra key before caching, as in the sketch below).
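A minimal sketch of that kind of stripping, assuming the tokenizer emits position_ids; the field names follow the snippet above:

```python
def tokenize_wo_pad_function(examples):
    tokenized_examples = tokenizer(examples["patent"])
    # Drop position_ids (if present) so only input_ids and attention_mask
    # are written to the cached dataset, shrinking the files on disk
    tokenized_examples.pop("position_ids", None)
    return tokenized_examples
```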
A separate question, about training time: with the same data, the two codebases report very different run times. One shows:

(screenshot: training time reported by the other project)

In MedicalGPT the reported time is much longer:

(screenshot: training time reported by MedicalGPT)

I can't tell why; comparing the code I don't see any difference. Do you have any suggestions? Many thanks!

@shibing624 (Owner) commented Sep 4, 2024

Which project are you comparing against? It may use training acceleration, such as flash attn.

What mainly matters is the quality of the trained model; compare that first.
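For reference, in recent transformers versions FlashAttention-2 can be requested at model load time; a minimal sketch, with the checkpoint as a placeholder (requires `pip install flash-attn` and a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; FlashAttention-2 needs fp16/bf16 weights
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```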

@XiaozhuLove (Author)

OK, thanks. The other project was an older version of llama-factory. I'll dig into it further; I'm surprised the training-time gap is that large.
