Not able to understand how my data should be formatted #1949
-
I am fine-tuning the Llama-3-70B model using a dataset in the alpaca format: https://huggingface.co/datasets/KamalConvai/nsfw_0 When I preprocess the dataset using the command that axolotl provides, it shows:
You can see the line `Saving the dataset (1/1 shards): : 0 examples [00:00, ? examples/s]` — the examples are not being read. You can check my dataset in the Hugging Face repo; it follows the alpaca format exactly. I even ran the same code with https://huggingface.co/datasets/tatsu-lab/alpaca and it works fine. What am I doing wrong?
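As a sanity check, records in the standard alpaca format (as in tatsu-lab/alpaca) are objects with `instruction`, `input`, and `output` fields. A quick sketch to verify a dataset against that schema (the sample record below is made up for illustration, not taken from the dataset above):

```python
# The standard alpaca schema: each record has these three keys.
ALPACA_KEYS = {"instruction", "input", "output"}

def check_alpaca_records(records):
    """Return indices of records that do not match the alpaca schema."""
    bad = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or not ALPACA_KEYS.issubset(rec):
            bad.append(i)
    return bad

# Hypothetical example record, not from the actual dataset:
sample = [
    {"instruction": "Summarize the text.",
     "input": "Some text.",
     "output": "A summary."},
]
print(check_alpaca_records(sample))  # an empty list means every record matches
```

If this reports no bad records but preprocessing still yields 0 examples, the problem is likely elsewhere in the config rather than in the data layout.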
-
Hello @KamalUtla, could you provide more info, such as what your axolotl config looks like? Could you perhaps have used a low sequence length, causing the examples to be dropped?
Thanks for the config @KamalUtla. I took a quick run with it and found that since your sequences were longer than 1024 (your `sequence_len`), they got dropped at the `Dropping Long Sequences` stage. My recommendation would be to increase `sequence_len` to a higher value.
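For reference, `sequence_len` is set in the axolotl YAML config. A minimal sketch of the relevant fragment (the value 4096 is illustrative; the right value depends on your data and GPU memory, and the rest of the config is omitted):

```yaml
# Fragment of an axolotl config (illustrative, not a complete config)
base_model: meta-llama/Meta-Llama-3-70B
sequence_len: 4096   # raised from 1024 so longer examples are not dropped
```

After raising it, rerun preprocessing and check that the saved example count is no longer 0.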