A question about the data-processing part: how are the position_ids generated after tokenization?

def tokenize_wo_pad_function(examples):
    tokenized_examples = tokenizer(examples["patent"])
    print("tokenized_examples00", type(tokenized_examples), len(tokenized_examples), tokenized_examples.keys())
    # <class 'transformers.tokenization_utils_base.BatchEncoding'> 3 dict_keys(['input_ids', 'attention_mask', 'position_ids'])
    return tokenized_examples  # tokenizer(examples["patent"])

Is there extra processing being done somewhere in the code? In another project I am running, the tokenized output only contains 'input_ids' and 'attention_mask', and I cannot see where this code does anything different. Looking forward to the author's clarification, many thanks!
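For anyone hitting the same question, a minimal way to check is to compare tokenizer.model_input_names with the keys that actually come back. This is only a sketch; "your-base-model" is a placeholder, not a path taken from this repository.

from transformers import AutoTokenizer

# Placeholder checkpoint; replace with the model you are actually training on.
tok = AutoTokenizer.from_pretrained("your-base-model", trust_remote_code=True)

enc = tok("some patent text")
# Which keys come back is decided by the tokenizer class shipped with the
# checkpoint, not by the training script that calls it. Custom tokenizers can
# add extra fields such as 'position_ids'; stock tokenizers usually return
# only 'input_ids' and 'attention_mask' (plus 'token_type_ids' for BERT-style
# models).
print(tok.__class__.__name__, tok.model_input_names, list(enc.keys()))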
By default, calling the tokenizer method gives the three fields 'input_ids', 'attention_mask', and 'position_ids'; the result of tokenizer.encode(xx) has two.
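A rough illustration of that difference (a sketch with a placeholder checkpoint, not MedicalGPT's actual code): calling the tokenizer object returns a dict-like BatchEncoding, while tokenizer.encode returns a plain list of token ids. If a tokenizer does not emit position_ids, they are just a 0..n-1 index, which is also what most decoder-only models in transformers build internally when position_ids are not passed in.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-base-model", trust_remote_code=True)  # placeholder path

enc = tokenizer("some patent text")          # dict-like BatchEncoding
ids = tokenizer.encode("some patent text")   # plain Python list of token ids

print(list(enc.keys()))    # at least ['input_ids', 'attention_mask'] for standard tokenizers
print(type(ids), ids[:5])

# If 'position_ids' is absent, the usual default is a simple left-to-right
# index over the sequence.
if "position_ids" not in enc:
    enc["position_ids"] = list(range(len(enc["input_ids"])))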
Thanks for the reply. I compared the other project's code: it only has the two fields 'input_ids' and 'attention_mask', so extra processing is indeed being done there, and that way the preprocessed training files can be much smaller. A separate question about training time: with the same data, the displayed run time differs between the two codebases, and it is much longer when running on MedicalGPT. I can't figure out why; comparing the code, I don't see any difference. Do you have any suggestions? Many thanks!
Which other project are you comparing with? It may be using training acceleration, such as flash attn.
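For reference, a minimal sketch of the kind of acceleration meant here, assuming a recent transformers release and an installed flash-attn package; "your-base-model" is again a placeholder:

import torch
from transformers import AutoModelForCausalLM

# Load the model with FlashAttention-2 enabled; this raises an error if the
# flash-attn package is not installed or the GPU does not support it.
model = AutoModelForCausalLM.from_pretrained(
    "your-base-model",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)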
What is mainly being compared is the performance of the trained model.
OK, thanks. The comparison was with an older version of llama-factory; I'll look into it further. I'm surprised the training time gap is that large.