
Running tokenizer on dataset gradually slows down #5443

Open
1 task done
xuyue1112 opened this issue Sep 15, 2024 · 3 comments
Labels
pending This problem is yet to be addressed

Comments

@xuyue1112

Reminder

  • I have read the README and searched the existing issues.

System Info

  • llamafactory version: 0.9.1.dev0
  • Platform: Linux-5.15.120.bsk.2-amd64-x86_64-with-glibc2.31
  • Python version: 3.11.2
  • PyTorch version: 2.4.0 (GPU)
  • Transformers version: 4.45.0.dev0
  • Datasets version: 2.21.0
  • Accelerate version: 0.34.2
  • PEFT version: 0.12.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA A800-SXM4-40GB

Reproduction

dataset

dataset: xxx
eval_dataset: xxx
template: qwen2_vl
cutoff_len: 4096
max_samples: 5000000
overwrite_cache: true
preprocessing_num_workers: 16

Expected behavior

During training, the "Running tokenizer on dataset" step gradually slows from several hundred samples/s down to single digits. Could someone advise where the problem might be?
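
For context, the "Running tokenizer on dataset" progress bar comes from the Hugging Face datasets map() call that performs preprocessing, and preprocessing_num_workers corresponds to its num_proc argument. Below is a minimal sketch of that kind of call, assuming a JSON dataset with a "text" column; it is illustrative only, not LLaMA-Factory's actual preprocessing code.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
raw = load_dataset("json", data_files="train.json", split="train")

def preprocess(batch):
    # Tokenize and truncate to cutoff_len (4096 in the config above).
    return tokenizer(batch["text"], truncation=True, max_length=4096)

# preprocessing_num_workers: 16 maps to num_proc=16 here; the samples/s
# figure mentioned above is the throughput reported by this map() call.
tokenized = raw.map(preprocess, batched=True, num_proc=16,
                    remove_columns=raw.column_names)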

Others

github-actions bot added the pending (This problem is yet to be addressed) label on Sep 15, 2024
AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue Sep 17, 2024
2. update mistral format function call
3. fix knapsack, may cause hiyouga#5443
4. avoid supervised examples wrongly truncation hiyouga#5426
AlongWY added a commit to AlongWY/LLaMA-Factory that referenced this issue Sep 18, 2024
2. fix knapsack, may cause hiyouga#5443
3. avoid supervised examples wrongly truncation
@AlongWY
Contributor

AlongWY commented Sep 18, 2024

Based on my actual testing, #5458 should fix this problem.

@Wiselnn570

Wiselnn570 commented Oct 26, 2024

@AlongWY I ran into the same problem, but your fix seems to target the packing case. What should be changed when packing is not used?


@AlongWY
Contributor

AlongWY commented Oct 28, 2024

Does the speed also drop to single digits without packing? In principle it shouldn't.
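
For readers following along: the "knapsack" mentioned in the commits above is the greedy bin-packing step used when packing is enabled, which groups examples so that each packed sequence stays within cutoff_len. A simplified first-fit-decreasing sketch of that idea follows, with illustrative names only; it is not the code touched by #5458.

def pack_sequences(lengths, capacity=4096):
    # Greedy first-fit-decreasing packing: group example indices so that the
    # total token length of each group stays within `capacity`.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins = []  # each entry is [remaining_capacity, list_of_indices]
    for i in order:
        if lengths[i] > capacity:
            continue  # an over-long example cannot be packed at all
        for b in bins:
            if b[0] >= lengths[i]:
                b[0] -= lengths[i]
                b[1].append(i)
                break
        else:
            bins.append([capacity - lengths[i], [i]])
    return [indices for _, indices in bins]

Note that this naive version re-scans every existing bin for each new example, so its cost grows with the number of bins; that kind of scaling is one plausible way a packing step can slow down as preprocessing progresses, though #5458 should be consulted for the actual change.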
