`SFTTrainer` Raises NotImplementedError with `IterableDataset` #2138

research-boy · 2024-09-27T14:55:36Z

System Info

Google Colab

Description

When attempting to fine-tune a model using the SFTTrainer with an IterableDataset, an error occurs because the SFTTrainer expects a dataset that supports random access (__getitem__). This is problematic when working with large datasets that cannot be loaded into memory at once and require streaming.

Error Message

NotImplementedError: Subclasses of Dataset should implement __getitem__.
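
To illustrate why indexing fails, here is a minimal stdlib-only sketch. The class names `MapStyleBase` and `StreamingRows` and the helper `fake_csv_rows` are hypothetical stand-ins, not the actual classes from torch or datasets. The assumption is that the message comes from a base class whose default `__getitem__` raises, and that a streaming dataset inherits that default because it only defines `__iter__`.

```python
class MapStyleBase:
    """Hypothetical base class: indexing must be overridden by subclasses."""

    def __getitem__(self, index):
        raise NotImplementedError(
            "Subclasses of Dataset should implement __getitem__."
        )


class StreamingRows(MapStyleBase):
    """Hypothetical streaming dataset: rows can be iterated but not indexed."""

    def __init__(self, row_source):
        # row_source is a callable that returns a fresh iterator of rows
        self._row_source = row_source

    def __iter__(self):
        return iter(self._row_source())


def fake_csv_rows():
    # Stand-in for rows streamed lazily from CSV files
    for i in range(3):
        yield {"text": f"row {i}"}


stream = StreamingRows(fake_csv_rows)

# Iteration works because __iter__ is defined
rows = list(stream)

# Random access falls through to the base class and raises
try:
    stream[0]
    index_error = None
except NotImplementedError as exc:
    index_error = str(exc)
```

Any code path that does `dataset[i]` or relies on a length-based sampler would hit the same failure, which is consistent with the traceback reported here.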

Context: This issue is especially relevant for fine-tuning on very large datasets, where memory constraints make it impractical to load the dataset fully into memory.

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder
My own task or dataset (give details below)

Reproduction

Load a large dataset using the datasets library with streaming enabled, like this:

from datasets import load_dataset

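# Load the dataset in streaming mode. split='train' returns a single
# IterableDataset; without a split, load_dataset returns a dict of splits.
dataset = load_dataset('csv', data_files='path_to_large_files/*.csv', split='train', streaming=True)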

Attempt to fine-tune a model using SFTTrainer with the streaming dataset:

from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import is_bfloat16_supported

# Define the model and tokenizer
model = ... # Load your model here
tokenizer = ... # Load your tokenizer here

# Placeholders used below
format_example = ... # Your function that turns a row into training text
max_seq_length = ... # Your maximum sequence length

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,  # the streaming IterableDataset from above
    formatting_func = format_example,
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,

        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        warmup_steps = 5,
        max_steps = 320,

        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        eval_strategy = "no",
    ),
)
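
With the settings above, the run is bounded by max_steps, so on a single device it consumes at most max_steps × per_device_train_batch_size × gradient_accumulation_steps = 320 × 2 × 4 = 2560 examples. One possible workaround, not proposed in this thread, is to materialize only that prefix of the stream into an indexable container. The sketch below uses the standard library only; `examples_needed`, `materialize_prefix`, and `fake_stream` are hypothetical names. With the datasets library, `IterableDataset.take(n)` and `Dataset.from_generator` could play similar roles, though whether that fits a given setup is an assumption to verify.

```python
from itertools import islice


def examples_needed(max_steps, per_device_batch_size, grad_accum_steps, num_devices=1):
    """Upper bound on examples consumed by a run limited by max_steps."""
    return max_steps * per_device_batch_size * grad_accum_steps * num_devices


def materialize_prefix(stream, n):
    """Pull the first n items from an iterator into a list (supports indexing)."""
    return list(islice(stream, n))


def fake_stream():
    # Stand-in for an unbounded streaming dataset
    i = 0
    while True:
        yield {"text": f"example {i}"}
        i += 1


needed = examples_needed(max_steps=320, per_device_batch_size=2, grad_accum_steps=4)
subset = materialize_prefix(fake_stream(), needed)
```

This only helps when the step budget is small enough for the prefix to fit in memory; it does not address full-epoch training over a dataset that must stay streamed.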

Expected behavior

The trainer should accept the streaming dataset. Instead, the NotImplementedError is raised when the trainer tries to access the dataset.


dame-cell · 2024-09-27T16:04:11Z

The trl library can handle IterableDataset; this was actually fixed, check out this PR.
If you get any errors related to unsloth, try fine-tuning the model without using unsloth.

research-boy · 2024-09-27T16:38:06Z

@dame-cell Yes, I did check the PR. The last few comments there suggest running pip install git+https://github.com/huggingface/trl.git, which also didn't work for me.

research-boy · 2024-09-28T10:03:33Z

Here is what is happening: the function _prepare_non_packed_dataloader doesn't handle IterableDataset properly, while _prepare_packed_dataloader does. So the trainer can be constructed when you set packing=True. But running trainer_stats = trainer.train() then gives another error: AttributeError: 'ConstantLengthDataset' object has no attribute 'column_names'
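
The AttributeError suggests that some code path reads `column_names` unconditionally, while the packed wrapper does not expose it. As a rough sketch of a defensive lookup (this is not the actual trl or unsloth code; `PackedWrapper`, `TabularDataset`, `safe_column_names`, and `unused_columns` are hypothetical names used for illustration):

```python
class PackedWrapper:
    """Hypothetical stand-in for a packing wrapper without column_names."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        return iter(self.rows)


class TabularDataset:
    """Hypothetical stand-in for a dataset that exposes column_names."""

    column_names = ["text", "label"]


def safe_column_names(dataset):
    """Return the dataset's column names, or None if it does not expose them."""
    return getattr(dataset, "column_names", None)


def unused_columns(dataset, model_inputs):
    """List columns the model does not consume; empty if names are unknown."""
    names = safe_column_names(dataset)
    if names is None:
        return []
    return [name for name in names if name not in model_inputs]
```

Under these assumptions, `unused_columns(PackedWrapper([]), {"text"})` returns an empty list instead of raising, while a dataset with real column metadata is handled as before.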

dame-cell · 2024-09-28T11:26:43Z

Hmm, did you try running the code without unsloth, i.e. using just the trl library?

research-boy added the 🐛 bug Something isn't working label Sep 27, 2024
