Clarification Needed on Utilization of Tokenization in the Fine-Tuning Module || InternLM-XComposer2d5 #431

Open
khyati2396 opened this issue Aug 26, 2024 · 3 comments

@khyati2396
Hello Fellow Developers,

I am working on implementing the evaluation code in the current fine-tuning module and noticed something regarding the tokenizer.

While the tokenizer is passed into the make_supervised_data_module function, it doesn't seem to be utilized in the DataCollatorForSupervisedDataset.

Since DataCollatorForSupervisedDataset is the custom data collator, and the tokenizer isn't used there, what actually performs the tokenization? This raises the concern of whether the fine-tuning script is functioning as intended.

Could you please clarify this?

> Also, when are you planning to release the evaluation code?

Thanks in Advance.

@yuhangzang
Collaborator

  1. The tokenizer is defined in modeling_internlm_xcomposer2.py, so tokenization happens inside the model's own code rather than in the data collator (see the sketch below).

  2. You can use VLMEvalKit for evaluation.
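
For reference, here is roughly how the tokenizer ends up attached to the model (a minimal sketch following the published usage example for internlm/internlm-xcomposer2d5-7b; the exact wiring may differ between releases):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code pulls in modeling_internlm_xcomposer2.py, which is where the
# text is tokenized (and image embeddings spliced in) during forward/chat.
ckpt = "internlm/internlm-xcomposer2d5-7b"
model = AutoModel.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, trust_remote_code=True
).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

# The tokenizer is attached to the model object; the model's own code calls it
# internally, which is why DataCollatorForSupervisedDataset never tokenizes --
# it only batches the raw samples.
model.tokenizer = tokenizer
```

So the collator only batches raw text/image samples, and tokenization happens inside the model during the forward pass, consistent with the answer above.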

@khyati2396
Author

Thanks for the response, @yuhangzang.
This makes sense.

I have a few more questions:
What are the GPU requirements for full fine-tuning?
Which parameters do I need to change for distributed (multi-GPU) fine-tuning?

I am unable to get multi-GPU training to work.

Case 1:

I tried LoRA fine-tuning on the sample dataset on a single A100. LoRA fine-tuning works on a single 80 GB A100 machine.
The parameters I changed are below:

```bash
GPUS_PER_NODE=1 ## previous value was 8
NNODES=1
NODE_RANK=0
MASTER_ADDR=localhost
MASTER_PORT=6001
```

This works properly.

Case 2:

I have a machine with 8 × L4 GPUs (23 GB × 8 = 184 GB of GPU memory).
I keep getting the error below when I try GPUS_PER_NODE = 1/2/3/4/5/6/7.
The value of NNODES is still 1.

```
[2024-08-27 11:37:21,292] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 2 optimizer
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 200000000
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 200000000
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: False
[2024-08-27 11:37:21,292] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/karan/tasks_by_petpooja/internLM_xcomposer2_5/finetune/finetune.py", line 336, in <module>
[rank0]:     train()
[rank0]:   File "/home/karan/tasks_by_petpooja/internLM_xcomposer2_5/finetune/finetune.py", line 326, in train
[rank0]:     trainer.train()
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/transformers/trainer.py", line 1553, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/transformers/trainer.py", line 1682, in _inner_training_loop
[rank0]:     model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/accelerate/accelerator.py", line 1303, in prepare
[rank0]:     result = self._prepare_deepspeed(*args)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/accelerate/accelerator.py", line 1779, in _prepare_deepspeed
[rank0]:     engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank0]:     engine = DeepSpeedEngine(args=args,
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 306, in __init__
[rank0]:     self._configure_optimizer(optimizer, model_parameters)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1250, in _configure_optimizer
[rank0]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1508, in _configure_zero_optimizer
[rank0]:     optimizer = DeepSpeedZeroOptimizer(
[rank0]:   File "/home/karan/miniconda3/envs/tasks_internlm/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 393, in __init__
[rank0]:     weights_partition = self.parallel_partitioned_bit16_groups[i][partition_id].to(
[rank0]: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.13 GiB. GPU
```

What changes do I need to make for this to work?
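
For context, a rough back-of-the-envelope estimate of the per-GPU memory that ZeRO stage 2 needs here (a sketch assuming full fine-tuning of a ~7B-parameter model with Adam in bf16; exact parameter counts and overheads will differ):

```python
# ZeRO stage 2 partitions optimizer states and gradients across GPUs,
# but every GPU still holds a full replica of the bf16 weights.
# Assumed numbers: ~7e9 trainable parameters, Adam, bf16 training.

def zero2_per_gpu_gb(params: float = 7e9, num_gpus: int = 8) -> float:
    gb = 1e9
    weights = params * 2 / gb            # bf16 weights, replicated on every GPU
    grads = params * 2 / num_gpus / gb   # bf16 gradients, partitioned
    optim = params * 12 / num_gpus / gb  # fp32 master weights + Adam m/v, partitioned
    return weights + grads + optim       # excludes activations, vision inputs, buffers

print(f"~{zero2_per_gpu_gb():.1f} GB per GPU before activations")
# ~26 GB per GPU even with all 8 GPUs -- already above the ~23 GB of an L4,
# which is why the same run fits on 80 GB A100s.
```

The rough point is that ZeRO-2 keeps a full bf16 copy of the weights on every card, so the bottleneck is per-GPU memory rather than the 184 GB total, and adding more 23 GB GPUs does not remove it. (With LoRA the gradient and optimizer terms shrink to the adapter parameters, but the replicated bf16 base weights alone are already around 14 GB per GPU.)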

@yuhangzang
Collaborator

Our code is tested on 8 A100 GPUs (80 GB). You may set a smaller value of hd_num to save GPU memory.
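
For intuition on why a smaller hd_num saves memory, here is a toy sketch only: the crop count, tokens per crop, hidden size, and layer count below are illustrative assumptions, not the repository's exact values.

```python
# Toy model: the high-resolution transform tiles each image into at most
# `hd_num` crops (plus a global view, assumed here), and every crop contributes
# a fixed budget of vision tokens whose activations are kept for backprop.
TOKENS_PER_CROP = 400            # assumed vision tokens per crop
BYTES_PER_TOKEN = 2 * 4096 * 32  # assumed: bf16 activations, hidden size 4096, ~32 layers

def approx_vision_activation_gb(hd_num: int) -> float:
    crops = hd_num + 1           # capped by hd_num, plus one global view (assumed)
    return crops * TOKENS_PER_CROP * BYTES_PER_TOKEN / 1e9

for hd_num in (25, 18, 9):
    print(f"hd_num={hd_num}: ~{approx_vision_activation_gb(hd_num):.1f} GB of activations per image")
```

The only takeaway is that per-sample activation memory grows roughly linearly with hd_num, so lowering it (alongside a smaller batch size) is the main lever when GPU memory is tight.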
