Multilingual Nanoset #2

Merged (9 commits) on Jul 18, 2024

Conversation

TJ-Solergibert

In this PR I add the Multilingual Nanosets. As discussed with @negar-foroutan, it includes the following features:

  1. New MultilingualNanosetDatasetsArgs. We create this new config type to:
    a. Be able to include custom tokens (language tokens?) to prepend to the samples of each dataset.
    b. Be able to differentiate between Nanoset & MultilingualNanoset in run_train.py.
  2. MultilingualNanoset. We create either a train split or a valid split with the is_valid flag. The valid split is created by extracting samples from the dataset; the rest are used for training. The number of extracted samples is given by config.tokens.limit_val_batches * trainer.global_batch_size. In short, we can only control how many validation batches we want in each data parallel group. We do this in order to process the same number of validation samples in each and every DP group (see the sketch below).
    We prepend language tokens(?) at the sample level, so a batch of data can contain samples from multiple languages.
  3. We create the valid dataloader the same way* nanotron creates the training dataloaders (in a lazy fashion, over multiple stages), but the validation loop is still missing.
  4. I included a config file that consumes data pretokenized with datatrove & the gpt2 tokenizer, stored in RCP. Check data_stages.data.dataset.dataset_tokens and tokens.limit_val_batches.

*The valid dataloader is created in the same way as the training one, but we only support the MultilingualNanoset, NOT the HF Datasets & Nanoset.
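
As a rough sketch of the split logic described in point 2 (all names here are illustrative; the actual MultilingualNanoset implementation in this PR may differ), the validation budget and index split look roughly like this:

def split_indices(num_samples, limit_val_batches, global_batch_size, is_valid):
    # Hypothetical sketch of point 2 above: the validation budget is fixed by the config,
    # the tail of the dataset goes to validation and the rest stays for training.
    # e.g. limit_val_batches=2, global_batch_size=128 -> 256 validation samples per DP group.
    num_val_samples = limit_val_batches * global_batch_size
    train_end = num_samples - num_val_samples
    return list(range(train_end, num_samples)) if is_valid else list(range(train_end))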


TJ-Solergibert commented Jul 17, 2024

After this morning's discussion:

  • The user will manually split the data into training and validation folders. The total number of samples will be computed once the MultilingualNanoset is created.
  • We print the total number of samples, but the DataLoader will drop the last samples (drop_last=True) in order to keep the DP groups balanced.
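
For reference, drop_last=True on a PyTorch DataLoader simply discards the final incomplete batch, which is what keeps every DP group iterating over the same number of full batches:

from torch.utils.data import DataLoader

# With 10 samples and batch_size=4, the trailing partial batch of 2 is dropped.
loader = DataLoader(list(range(10)), batch_size=4, drop_last=True)
print([batch.tolist() for batch in loader])  # [[0, 1, 2, 3], [4, 5, 6, 7]]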

You can try the provided config file in RCP with 0.1 GPUs. To download some samples from the c4 dataset in multiple languages and preprocess them, use the following code snippets:

import os

from functools import partial
from datasets import Dataset, load_dataset

def gen_from_iterable_dataset(iterable_ds):
    # Re-yield an IterableDataset so Dataset.from_generator can materialize it.
    yield from iterable_ds

PATH_TO_RAW_DATASETS = "/mloscratch/homes/solergib/nf/nanotron-multilingual/raw_datasets"
splits = ["train", "validation"]
langs = ["en", "es", "fr"]

for split in splits:
    for lang in langs:
        # Stream 10k samples per language/split from c4 and dump them to JSON for preprocessing.
        it_ds = load_dataset("allenai/c4", lang, streaming=True, split=split).take(10000)
        ds = Dataset.from_generator(partial(gen_from_iterable_dataset, it_ds), features=it_ds.features)
        ds.to_json(os.path.join(PATH_TO_RAW_DATASETS, lang, f"{split}.json"))

# Tokenize each raw JSON file into its own output folder.
for lang in es en fr; do
  for split in train validation; do
    python3 tools/preprocess_data.py --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B --output-folder datasets/c4-$lang/$split --n-tasks 8 jsonl --dataset raw_datasets/$lang/$split.json
  done
done
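
Once the loop finishes you should end up with one tokenized folder per language/split pair; a quick (hypothetical) sanity check over the paths used above:

import os

# Check that preprocess_data.py produced a non-empty output folder for every language/split.
for lang in ["es", "en", "fr"]:
    for split in ["train", "validation"]:
        folder = os.path.join("datasets", f"c4-{lang}", split)
        status = "ok" if os.path.isdir(folder) and os.listdir(folder) else "missing"
        print(f"{folder}: {status}")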

@negar-foroutan negar-foroutan merged commit da50231 into swiss-ai:main Jul 18, 2024
1 of 3 checks passed