[Data] Implement dataset mixer for combining datasets in training #2112

lewtun · 2024-09-24T14:41:36Z

Feature request

In the alignment-handbook, we implemented a "dataset mixer" that allows one to easily combine datasets in varying proportions, provided they all share the same schema.
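To make the idea concrete, here is a minimal sketch of what such a mixer might do. It assumes each proportion is the fraction of that dataset to keep, and it uses plain Python lists in place of real datasets; `mix_datasets` and the toy data are hypothetical names for illustration, not the handbook's actual code.

```python
import random


def mix_datasets(datasets, fractions, seed=42, shuffle=True):
    """Keep the leading `fraction` of each dataset, concatenate, then shuffle.

    `datasets` maps a name to a list of rows (dicts sharing one schema).
    `fractions` maps the same names to a value in [0, 1].
    """
    mixed = []
    for name, frac in fractions.items():
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"fraction for {name!r} must be in [0, 1], got {frac}")
        rows = datasets[name]
        keep = int(frac * len(rows))  # number of rows taken from this dataset
        mixed.extend(rows[:keep])
    if shuffle:
        # A seeded RNG keeps the mix reproducible across runs.
        random.Random(seed).shuffle(mixed)
    return mixed


# Toy stand-ins for two datasets with the same schema.
toy = {
    "dataset_1": [{"text": f"a{i}"} for i in range(10)],
    "dataset_2": [{"text": f"b{i}"} for i in range(10)],
}
mixed = mix_datasets(toy, {"dataset_1": 0.4, "dataset_2": 0.3})
print(len(mixed))  # 4 rows from dataset_1 plus 3 rows from dataset_2
```

Whether a proportion should mean "fraction of each dataset" or "share of the final mix" is a design choice a TRL port would need to settle.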

It could be interesting to port this mixer to TRL so that users can easily combine datasets during training. The only caveat I see is that, to support CLI training (e.g. trl sft ...), we'd need a compatible data structure, because dict objects don't play nicely with CLIs.

Motivation

Advanced post-training typically combines several datasets in different proportions. Supporting this in TRL would allow us to gradually deprecate the handbook in favour of using the lib directly.

Your contribution

Open to discussion :)


August-murr · 2024-10-11T11:42:15Z

@lewtun
Shouldn't this be a part of Datasets instead?

Datasets already has interleave_datasets, which mixes datasets in a similar way. It can't be integrated as is, but it may be useful.

As for integrating it into the CLI, I think the best way would be a config file, such as a JSON file, something like this:

trl sft --dataset_mixer --mix_config mix_config.json

mix_config.json:

{
  "dataset_mixer": {
    "dataset_1": 0.4,
    "dataset_2": 0.3,
    "dataset_3": 0.2
  },
  "splits": ["train", "train", "test"],
  "configs": ["main", "math", "logic"],
  "columns_to_keep": ["text", "input", "text"],
  "shuffle": true
}

or a simpler JSON:

[
  ["dataset_1", 0.4, "main", "train", "text"],
  ["dataset_2", 0.3, "math", "train", "input"],
  ["dataset_3", 0.2, "logic", "test", "text"]
]

if the parsing doesn't get too complicated.
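As a rough check on the parsing effort, the simpler list form can be read with the standard library alone. The sketch below assumes the field order is name, proportion, config, split, column, inferred by comparing the two JSON shapes above; `MixEntry` and `parse_mix_config` are hypothetical names, not part of TRL.

```python
import json
from dataclasses import dataclass


@dataclass
class MixEntry:
    name: str        # dataset identifier
    fraction: float  # proportion of the dataset to use
    config: str      # dataset config name
    split: str       # split to load
    column: str      # column to keep


def parse_mix_config(text):
    """Parse the list-of-lists JSON form into MixEntry objects."""
    entries = []
    for row in json.loads(text):
        if len(row) != 5:
            raise ValueError(f"expected 5 fields per entry, got {len(row)}: {row}")
        name, fraction, config, split, column = row
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"fraction for {name!r} must be in [0, 1]")
        entries.append(MixEntry(name, float(fraction), config, split, column))
    return entries


raw = """
[
  ["dataset_1", 0.4, "main", "train", "text"],
  ["dataset_2", 0.3, "math", "train", "input"],
  ["dataset_3", 0.2, "logic", "test", "text"]
]
"""
entries = parse_mix_config(raw)
for entry in entries:
    print(entry)
```

Note that Python's `json` module rejects trailing commas, so the last entry must not end with one.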

How does that sound?

qgallouedec added the ✨ enhancement New feature or request label Oct 7, 2024

August-murr linked a pull request Oct 16, 2024 that will close this issue

Data mixer Integration #2240

Draft

5 tasks
