Clarification Needed on Utilization of Tokenization in the Fine-Tuning Module || InternLM-XComposer2d5 #431

Open
khyati2396 opened this issue Aug 26, 2024 · 3 comments

@khyati2396
Hello Fellow Developers,

I am working on implementing the evaluation code in the current fine-tuning module and noticed something regarding the tokenizer.

While the tokenizer is passed into the make_supervised_data_module function, it doesn't seem to be utilized in the DataCollatorForSupervisedDataset.

Since DataCollatorForSupervisedDataset is the custom data collator, and the tokenizer isn't used there, what actually performs the tokenization? This raises the concern of whether the fine-tuning script is functioning as intended.

Could you please clarify this?

> Also, when are you planning to release the evaluation code?

Thanks in Advance.

@yuhangzang
Collaborator

  1. The tokenizer is defined in modeling_internlm_xcomposer2.py, so tokenization happens inside the model's own code rather than in the data collator (see the sketch below).

  2. You can use VLMEvalKit for evaluation.
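
For reference, here is roughly how the tokenizer ends up attached to the model (a minimal sketch following the published usage example for internlm/internlm-xcomposer2d5-7b; the exact wiring may differ between releases):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code pulls in modeling_internlm_xcomposer2.py, which is where the
# text is tokenized (and image embeddings spliced in) during forward/chat.
ckpt = "internlm/internlm-xcomposer2d5-7b"
model = AutoModel.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# The tokenizer is attached to the model object; the model's own code calls it
# internally, which is why DataCollatorForSupervisedDataset never tokenizes --
# it only batches the raw samples.
model.tokenizer = tokenizer
```

So the collator only batches raw text/image samples, and tokenization happens inside the model during the forward pass, consistent with the answer above.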

@khyati2396
Author

Thanks for the response, @yuhangzang.
This makes sense.

I have a few more questions:
What are the GPU requirements for full fine-tuning?
Which parameters do I need to change for distributed (multi-GPU) fine-tuning?

I am unable to get multi-GPU training to work.

Case 1:

I tried LoRA fine-tuning on the sample dataset on a single A100. LoRA fine-tuning works on a single 80 GB A100 machine.
The parameters I changed are below:

```bash
GPUS_PER_NODE=1 ## previous value was 8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
```

This works properly.

Case 2:

I have a machine with 8 × L4 GPUs (23 GB × 8 = 184 GB of GPU memory).
I keep getting the error below when I try GPUS_PER_NODE = 1/2/3/4/5/6/7.
The value of NNODES is still 1.

```
[2024-08-27 11:37:21,292] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 200000000
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 200000000
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/karan/tasks_by_petpooja/internLM_xcomposer2_5/finetune/finetune.py", line 336, in <module>
[rank0]:     train()
[rank0]:   File "/home/karan/tasks_by_petpooja/internLM_xcomposer2_5/finetune/finetune.py", line 326, in train
[rank0]:     trainer.train()
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/transformers/trainer.py", line 1682, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank0]:     result = self._prepare_deepspeed(*args)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank0]:     engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 306, in __init__
[rank0]:     self._configure_optimizer(optimizer, model_parameters)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1250, in _configure_optimizer
[rank0]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1508, in _configure_zero_optimizer
[rank0]:     optimizer = DeepSpeedZeroOptimizer(
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 393, in __init__
[rank0]:     weights_partition = self.parallel_partitioned_bit16_groups[i][partition_id].to(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.13 GiB. GPU
```

What changes do I need to make for this to work?
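
For context, a rough back-of-the-envelope estimate of the per-GPU memory that ZeRO stage 2 needs here (a sketch assuming full fine-tuning of a ~7B-parameter model with Adam in bf16; exact parameter counts and overheads will differ):

```python
# ZeRO stage 2 partitions optimizer states and gradients across GPUs,
# but every GPU still holds a full replica of the bf16 weights.
# Assumed numbers: ~7e9 trainable parameters, Adam, bf16 training.

def zero2_per_gpu_gb(params: float = 7e9, num_gpus: int = 8) -> float:
    gb = 1e9
    weights = params * 2 / gb            # bf16 weights, replicated on every GPU
    grads = params * 2 / num_gpus / gb   # bf16 gradients, partitioned
    optim = params * 12 / num_gpus / gb  # fp32 master weights + Adam m/v, partitioned
    return weights + grads + optim       # excludes activations, vision inputs, buffers

print(f"~{zero2_per_gpu_gb():.1f} GB per GPU before activations")
# ~26 GB per GPU even with all 8 GPUs -- already above the ~23 GB of an L4,
# which is why the same run fits on 80 GB A100s.
```

The rough point is that ZeRO-2 keeps a full bf16 copy of the weights on every card, so the bottleneck is per-GPU memory rather than the 184 GB total, and adding more 23 GB GPUs does not remove it. (With LoRA the gradient and optimizer terms shrink to the adapter parameters, but the replicated bf16 base weights alone are already around 14 GB per GPU.)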

@yuhangzang
Collaborator

Our code is tested on 8 A100 GPUs (80 GB). You may set a smaller value of hd_num to save GPU memory.
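
For intuition on why a smaller hd_num saves memory, here is a toy sketch only: the crop count, tokens per crop, hidden size, and layer count below are illustrative assumptions, not the repository's exact values.

```python
# Toy model: the high-resolution transform tiles each image into at most
# `hd_num` crops (plus a global view, assumed here), and every crop contributes
# a fixed budget of vision tokens whose activations are kept for backprop.
TOKENS_PER_CROP = 400            # assumed vision tokens per crop
BYTES_PER_TOKEN = 2 * 4096 * 32  # assumed: bf16 activations, hidden size 4096, ~32 layers

def approx_vision_activation_gb(hd_num: int) -> float:
    crops = hd_num + 1           # capped by hd_num, plus one global view (assumed)
    return crops * TOKENS_PER_CROP * BYTES_PER_TOKEN / 1e9

for hd_num in (25, 18, 9):
    print(f"hd_num={hd_num}: ~{approx_vision_activation_gb(hd_num):.1f} GB of activations per image")
```

The only takeaway is that per-sample activation memory grows roughly linearly with hd_num, so lowering it (alongside a smaller batch size) is the main lever when GPU memory is tight.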
