Running tokenizer on dataset gradually slows down #5443
Labels
pending
This problem is yet to be addressed
Comments
AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue on Sep 17, 2024:
2. update mistral format function call 3. fix knapsack, may cause hiyouga#5443 4. avoid supervised examples being wrongly truncated hiyouga#5426
AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue on Sep 18, 2024:
2. fix knapsack, may cause hiyouga#5443 3. avoid supervised examples being wrongly truncated
Based on my actual testing, #5458 should have resolved this issue.
Does the speed still drop to single digits without packing? In theory it shouldn't.
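For context, here is a minimal sketch of how a naive greedy knapsack-style packing pass (the step the commits above touch) can slow down as preprocessing proceeds. This is an illustrative assumption about the failure mode, not the actual LLaMA-Factory implementation:

```python
# Illustrative sketch only -- not the actual LLaMA-Factory code.
# A naive greedy "knapsack" packing pass rescans the whole remaining list
# for every new bin, so total cost grows roughly quadratically with the
# number of samples, which would make throughput fall from hundreds of
# samples/s to single digits as packing proceeds.
from typing import List, Tuple


def naive_pack(lengths: List[int], capacity: int) -> List[List[int]]:
    """Greedily pack sample lengths into bins of at most `capacity` tokens."""
    remaining: List[Tuple[int, int]] = list(enumerate(lengths))
    bins: List[List[int]] = []
    while remaining:
        space = capacity
        bin_indices: List[int] = []
        kept: List[Tuple[int, int]] = []
        for idx, length in remaining:  # O(n) scan per bin -> O(n^2) overall
            if length <= space:
                bin_indices.append(idx)
                space -= length
            else:
                kept.append((idx, length))
        if not bin_indices:  # oversize sample: force it into its own bin
            idx, _ = kept.pop(0)
            bin_indices.append(idx)
        remaining = kept
        bins.append(bin_indices)
    return bins
```

A fix along the lines of #5458 would avoid the repeated full scans, keeping the per-sample cost of packing roughly constant.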
Reminder
System Info
llamafactory version: 0.9.1.dev0
Reproduction
dataset
dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 4096
max_samples: 5000000
overwrite_cache: true
preprocessing_num_workers: 16
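For reference, a minimal sketch of how settings like these are typically applied with the Hugging Face datasets library; the data file, model name, and text column below are placeholders and this is not the exact LLaMA-Factory preprocessing code:

```python
# Hypothetical sketch of the tokenization pass implied by the config above.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")  # placeholder model
raw = load_dataset("json", data_files="xxx.json", split="train")        # placeholder dataset


def preprocess(batch):
    # Tokenize and truncate to cutoff_len (4096 in the config above).
    return tokenizer(batch["text"], truncation=True, max_length=4096)


tokenized = raw.map(
    preprocess,
    batched=True,
    num_proc=16,                 # preprocessing_num_workers
    load_from_cache_file=False,  # overwrite_cache: true
    desc="Running tokenizer on dataset",
)
```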
Expected behavior
During training, the "Running tokenizer on dataset" speed gradually drops from several hundred samples/s to single digits. Could someone advise where the problem might be?
Others
None