In this PR I include the Multilingual Nanosets. As discussed w/ @negar-foroutan, it includes the following features:
- `MultilingualNanosetDatasetsArgs`. We create this new type of config to:
  a. Be able to include custom tokens (language tokens?) to prepend to the samples of each dataset.
  b. Be able to differentiate between `Nanoset` & `MultilingualNanoset` in `run_train.py`.
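A minimal sketch of what such a config class might look like. This is an illustration only: apart from `dataset_tokens`, the field names here are assumptions, not the actual nanotron definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MultilingualNanosetDatasetsArgs:
    # Paths to the tokenized datasets, one per language (hypothetical field name).
    dataset_folder: List[str] = field(default_factory=list)
    # Token id to prepend to every sample of the corresponding dataset,
    # e.g. a language token such as <|en|> or <|fr|>.
    dataset_tokens: List[int] = field(default_factory=list)

    def __post_init__(self):
        # Each dataset needs exactly one language token to prepend.
        assert len(self.dataset_folder) == len(self.dataset_tokens)

args = MultilingualNanosetDatasetsArgs(
    dataset_folder=["data/en", "data/fr"],
    dataset_tokens=[32000, 32001],
)
```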
- `MultilingualNanoset`. We create either a train split or a valid split with the `is_valid` flag. The valid split is created by extracting samples from the dataset; the rest are used for training. The number of extracted samples is given by `config.tokens.limit_val_batches * trainer.global_batch_size`. In short, we can only control how many validation batches we want in each data parallel group; we do this in order to process the same number of validation samples in each and every DP group.
- We prepend the language tokens at the sample level, so a batch of data can contain samples from multiple languages. This behavior is controlled via `data_stages.data.dataset.dataset_tokens` and `tokens.limit_val_batches`.
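The split logic above can be sketched as follows. This is a toy illustration of the described rule (first `limit_val_batches * global_batch_size` samples go to validation, the rest to training); the real indexing inside `MultilingualNanoset` may differ.

```python
from typing import List

def split_indices(
    num_samples: int,
    limit_val_batches: int,
    global_batch_size: int,
    is_valid: bool,
) -> List[int]:
    """Return the sample indices belonging to the requested split.

    The first `limit_val_batches * global_batch_size` samples are reserved
    for validation; the remaining samples are used for training.
    """
    num_val_samples = limit_val_batches * global_batch_size
    if is_valid:
        return list(range(num_val_samples))
    return list(range(num_val_samples, num_samples))

valid = split_indices(1000, limit_val_batches=2, global_batch_size=64, is_valid=True)
train = split_indices(1000, limit_val_batches=2, global_batch_size=64, is_valid=False)
```

Because the validation size is a whole number of global batches, every DP group processes the same number of validation samples.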
- The valid dataloader is created in the same way as the training one, but we only support `MultilingualNanoset`, NOT the HF Datasets & `Nanoset`.
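Since the language token is prepended per sample rather than per batch, a single batch can mix languages. A toy sketch of that collation (token ids and helper names are made up for illustration):

```python
from typing import List

def get_sample(tokens: List[int], language_token: int) -> List[int]:
    # Prepend the dataset's language token to the raw token ids of one sample.
    return [language_token] + list(tokens)

# Two samples from different datasets/languages ending up in the same batch.
batch = [
    get_sample([10, 11, 12], language_token=32000),  # e.g. an English sample
    get_sample([20, 21, 22], language_token=32001),  # e.g. a French sample
]
```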