[Data] Implement dataset mixer for combining datasets in training #2112

Open
lewtun opened this issue Sep 24, 2024 · 1 comment · May be fixed by #2240
Labels
✨ enhancement New feature or request

Comments

@lewtun
Member

lewtun commented Sep 24, 2024

Feature request

In the alignment-handbook, we implemented a "dataset mixer" that allows one to easily combine datasets in varying proportions, provided they all share the same schema.
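For illustration, here is a minimal sketch of the mixing idea, assuming each proportion means "the fraction of that dataset to keep". Plain Python lists of dicts stand in for real datasets, and the name mix_datasets is hypothetical, not an API of TRL or the alignment-handbook. With Hugging Face Datasets, the slicing, concatenation, and shuffling steps would likely map to Dataset.select, concatenate_datasets, and Dataset.shuffle.

```python
import random


def mix_datasets(datasets, fractions, shuffle=True, seed=42):
    """Keep the first `fraction` of each dataset, then concatenate.

    Hypothetical helper: `datasets` maps a name to a list of row dicts,
    `fractions` maps the same names to the fraction of rows to keep.
    """
    mixed = []
    schema = None
    for name, rows in datasets.items():
        frac = fractions[name]
        if frac < 0:
            raise ValueError(f"Fraction for {name!r} must be non-negative, got {frac}")
        for row in rows:
            # All datasets must share the same schema (same column names).
            if schema is None:
                schema = set(row)
            elif set(row) != schema:
                raise ValueError(f"{name!r} has columns {sorted(row)}, expected {sorted(schema)}")
        # Take the leading slice; fractions are per dataset and need not sum to 1.
        mixed.extend(rows[: int(frac * len(rows))])
    if shuffle:
        # A seeded generator keeps the mix reproducible across runs.
        random.Random(seed).shuffle(mixed)
    return mixed


datasets = {
    "dataset_1": [{"text": f"a{i}"} for i in range(10)],
    "dataset_2": [{"text": f"b{i}"} for i in range(10)],
}
mixed = mix_datasets(datasets, {"dataset_1": 0.5, "dataset_2": 0.2})
print(len(mixed))  # 5 rows from dataset_1 plus 2 rows from dataset_2
```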

It could be interesting to port this mixer to TRL, so that users can easily combine datasets during training. The only caveat I see is that to support CLI training (e.g. trl sft ...), we'd need a compatible data structure, because dict objects don't play nicely with CLIs.
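One possible way around the dict limitation is to accept flat NAME=FRACTION tokens and rebuild the dict after parsing. The sketch below uses plain argparse; the --dataset_mixer flag and the parse_mixer_entry helper are assumptions for illustration, not existing trl options.

```python
import argparse


def parse_mixer_entry(entry):
    """Parse one 'NAME=FRACTION' token into a (name, fraction) pair."""
    # rpartition splits on the last '=', so names containing '=' still work.
    name, sep, frac = entry.rpartition("=")
    if not sep or not name:
        raise argparse.ArgumentTypeError(f"expected NAME=FRACTION, got {entry!r}")
    try:
        value = float(frac)
    except ValueError:
        raise argparse.ArgumentTypeError(f"invalid fraction in {entry!r}")
    return name, value


parser = argparse.ArgumentParser()
# Hypothetical flag: one or more NAME=FRACTION tokens instead of a dict.
parser.add_argument("--dataset_mixer", nargs="+", type=parse_mixer_entry, default=[])

args = parser.parse_args(["--dataset_mixer", "dataset_1=0.4", "dataset_2=0.3"])
dataset_mixer = dict(args.dataset_mixer)
print(dataset_mixer)
```

The flat form keeps the command line readable for a handful of datasets; per-dataset splits or configs would probably need a config file instead.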

Motivation

Advanced post-training typically combines different datasets / proportions. Supporting this in TRL would allow us to gradually deprecate the handbook in favour of using the lib directly.

Your contribution

Open to discussion :)

@qgallouedec qgallouedec added the ✨ enhancement New feature or request label Oct 7, 2024
@August-murr
Contributor

@lewtun
Shouldn't this be a part of Datasets instead?

Datasets already has interleave_datasets, which mixes datasets in a similar way. It can't be integrated as is, but it may be useful.

As for integrating it into the CLI, I think the best way would be a config file, such as a JSON file, something like this:

trl sft --dataset_mixer --mix_config mix_config.json

mix_config.json:

{
  "dataset_mixer": {
    "dataset_1": 0.4,
    "dataset_2": 0.3,
    "dataset_3": 0.2
  },
  "splits": ["train", "train", "test"],
  "configs": ["main", "math", "logic"],
  "columns_to_keep": ["text", "input", "text"],
  "shuffle": true
}

or a simpler JSON:

[
  ["dataset_1", 0.4, "main", "train", "text"],
  ["dataset_2", 0.3, "math", "train", "input"],
  ["dataset_3", 0.2, "logic", "test", "text"]
]

if the parsing doesn't get too complicated.
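To gauge how complicated the parsing would be, here is a rough sketch for the simpler list form, assuming each entry is ordered as [name, fraction, config, split, column]. DatasetSpec and parse_mix_config are hypothetical names, and only the standard library is used.

```python
import json
from dataclasses import dataclass


@dataclass
class DatasetSpec:
    """One entry of the mix config (hypothetical structure)."""
    name: str
    fraction: float
    config: str
    split: str
    column: str


def parse_mix_config(text):
    """Turn the list-of-lists JSON into a list of DatasetSpec objects."""
    specs = []
    for entry in json.loads(text):
        if len(entry) != 5:
            raise ValueError(f"expected 5 fields per entry, got {len(entry)}: {entry!r}")
        name, fraction, config, split, column = entry
        specs.append(DatasetSpec(name, float(fraction), config, split, column))
    return specs


raw = """
[
  ["dataset_1", 0.4, "main", "train", "text"],
  ["dataset_2", 0.3, "math", "train", "input"],
  ["dataset_3", 0.2, "logic", "test", "text"]
]
"""
specs = parse_mix_config(raw)
for spec in specs:
    print(spec)
```

Positional entries are compact but easy to get out of order; a list of objects with named keys would be more verbose and harder to misread.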

How does that sound?

@August-murr August-murr linked a pull request Oct 16, 2024 that will close this issue
5 tasks