Multilingual Nanoset #2

Merged (9 commits) on Jul 18, 2024

Conversation

TJ-Solergibert

In this PR I add the Multilingual Nanosets. As discussed with @negar-foroutan, it includes the following features:

  1. New MultilingualNanosetDatasetsArgs. We create this new config type to:
    a. Be able to include custom tokens (language tokens?) to prepend to the samples of each dataset.
    b. Be able to differentiate between Nanoset & MultilingualNanoset in run_train.py.
  2. MultilingualNanoset. We create either a train split or a valid split with the is_valid flag. The valid split is created by extracting samples from the dataset; the rest are used for training. The number of extracted samples is given by config.tokens.limit_val_batches * trainer.global_batch_size. In short, we can only control how many validation batches we want in each data parallel group. We do this in order to process the same number of validation samples in each and every DP group (see the sketch below).
    We prepend language tokens(?) at the sample level, so a batch of data can contain samples from multiple languages.
  3. We create the valid dataloader the same way* nanotron creates the training dataloaders (in a lazy fashion, over multiple stages), but the validation loop is still missing.
  4. I included a config file that consumes data pretokenized with datatrove & the gpt2 tokenizer, stored in RCP. Check data_stages.data.dataset.dataset_tokens and tokens.limit_val_batches.

*The valid dataloader is created in the same way as the training one, but we only support the MultilingualNanoset, NOT the HF Datasets & Nanoset.
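
As a rough sketch of the split logic described in point 2 (all names here are illustrative; the actual MultilingualNanoset implementation in this PR may differ), the validation budget and index split look roughly like this:

def split_indices(num_samples, limit_val_batches, global_batch_size, is_valid):
    # Hypothetical sketch of point 2 above: the validation budget is fixed by the config,
    # the tail of the dataset goes to validation and the rest stays for training.
    # e.g. limit_val_batches=2, global_batch_size=128 -> 256 validation samples per DP group.
    num_val_samples = limit_val_batches * global_batch_size
    train_end = num_samples - num_val_samples
    return list(range(train_end, num_samples)) if is_valid else list(range(train_end))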


TJ-Solergibert commented Jul 17, 2024

After this morning's discussion:

  • The user will manually split the data into training and validation folders. The total number of samples will be computed once the MultilingualNanoset is created.
  • We print the total number of samples, but the DataLoader will drop the last samples (drop_last=True) in order to keep the DP groups balanced.
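
For reference, drop_last=True on a PyTorch DataLoader simply discards the final incomplete batch, which is what keeps every DP group iterating over the same number of full batches:

from torch.utils.data import DataLoader

# With 10 samples and batch_size=4, the trailing partial batch of 2 is dropped.
loader = DataLoader(list(range(10)), batch_size=4, drop_last=True)
print([batch.tolist() for batch in loader])  # [[0, 1, 2, 3], [4, 5, 6, 7]]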

You can try the provided config file in RCP with 0.1 GPUs. To download some samples from the c4 dataset in multiple languages and preprocess them, use the following code snippets:

import os

from functools import partial
from datasets import Dataset, load_dataset

def gen_from_iterable_dataset(iterable_ds):
    # Re-yield an IterableDataset so Dataset.from_generator can materialize it.
    yield from iterable_ds

PATH_TO_RAW_DATASETS = "/mloscratch/homes/solergib/nf/nanotron-multilingual/raw_datasets"
splits = ["train", "validation"]
langs = ["en", "es", "fr"]

for split in splits:
    for lang in langs:
        # Stream 10k samples per language/split from c4 and dump them to JSON for preprocessing.
        it_ds = load_dataset("allenai/c4", lang, streaming=True, split=split).take(10000)
        ds = Dataset.from_generator(partial(gen_from_iterable_dataset, it_ds), features=it_ds.features)
        ds.to_json(os.path.join(PATH_TO_RAW_DATASETS, lang, f"{split}.json"))

# Tokenize each raw JSON file into its own output folder.
for lang in es en fr; do
  for split in train validation; do
    python3 tools/preprocess_data.py --tokenizer-name-or-path meta-llama/Meta-Llama-3-8B --output-folder datasets/c4-$lang/$split --n-tasks 8 jsonl --dataset raw_datasets/$lang/$split.json
  done
done
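
Once the loop finishes you should end up with one tokenized folder per language/split pair; a quick (hypothetical) sanity check over the paths used above:

import os

# Check that preprocess_data.py produced a non-empty output folder for every language/split.
for lang in ["es", "en", "fr"]:
    for split in ["train", "validation"]:
        folder = os.path.join("datasets", f"c4-{lang}", split)
        status = "ok" if os.path.isdir(folder) and os.listdir(folder) else "missing"
        print(f"{folder}: {status}")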

@negar-foroutan negar-foroutan merged commit da50231 into swiss-ai:main Jul 18, 2024
1 of 3 checks passed