
Running tokenizer on dataset gradually slows down #5443

Open
1 task done
xuyue1112 opened this issue Sep 15, 2024 · 3 comments
Labels
pending This problem is yet to be addressed

Comments

@xuyue1112

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.15.120.bsk.2-amd64-x86_64-with-glibc2.31
  • Python version: 3.11.2
  • PyTorch version: 2.4.0 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-40GB

Reproduction

dataset

dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 4096
max_samples: 5000000
overwrite_cache: true
preprocessing_num_workers: 16

Expected behavior

During training, the "Running tokenizer on dataset" step gradually slows from several hundred samples/s down to single digits. Could someone advise where the problem might be?
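
For context, the "Running tokenizer on dataset" progress bar comes from the Hugging Face datasets map() call that performs preprocessing, and preprocessing_num_workers corresponds to its num_proc argument. Below is a minimal sketch of that kind of call, assuming a JSON dataset with a "text" column; it is illustrative only, not LLaMA-Factory's actual preprocessing code.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
raw = load_dataset("json", data_files="train.json", split="train")

def preprocess(batch):
    # Tokenize and truncate to cutoff_len (4096 in the config above).
    return tokenizer(batch["text"], truncation=True, max_length=4096)

# preprocessing_num_workers: 16 maps to num_proc=16 here; the samples/s
# figure mentioned above is the throughput reported by this map() call.
tokenized = raw.map(preprocess, batched=True, num_proc=16,
                    remove_columns=raw.column_names)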

Others

github-actions bot added the pending (This problem is yet to be addressed) label on Sep 15, 2024
AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue Sep 17, 2024
2. update mistral format function call
3. fix knapsack, may cause hiyouga#5443
4. avoid supervised examples wrongly truncation hiyouga#5426
AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue Sep 18, 2024
2. fix knapsack, may cause hiyouga#5443
3. avoid supervised examples wrongly truncation
@AlongWY
Contributor

AlongWY commented Sep 18, 2024

Based on my actual testing, #5458 should fix this problem.

@Wiselnn570

Wiselnn570 commented Oct 26, 2024

@AlongWY I ran into the same problem, but your fix seems to target the packing case. What should be changed when packing is not used?


@AlongWY
Contributor

AlongWY commented Oct 28, 2024

Does the speed also drop to single digits without packing? In principle it shouldn't.
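
For readers following along: the "knapsack" mentioned in the commits above is the greedy bin-packing step used when packing is enabled, which groups examples so that each packed sequence stays within cutoff_len. A simplified first-fit-decreasing sketch of that idea follows, with illustrative names only; it is not the code touched by #5458.

def pack_sequences(lengths, capacity=4096):
    # Greedy first-fit-decreasing packing: group example indices so that the
    # total token length of each group stays within `capacity`.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []  # each entry is [remaining_capacity, list_of_indices]
    for i in order:
        if lengths[i] > capacity:
            continue  # an over-long example cannot be packed at all
        for b in bins:
            if b[0] >= lengths[i]:
                b[0] -= lengths[i]
                b[1].append(i)
                break
        else:
            bins.append([capacity - lengths[i], [i]])
    return [indices for _, indices in bins]

Note that this naive version re-scans every existing bin for each new example, so its cost grows with the number of bins; that kind of scaling is one plausible way a packing step can slow down as preprocessing progresses, though #5458 should be consulted for the actual change.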
