Commit

merge main into moe; also, adapt gpt3(-moe) for val logging
haeggee committed Aug 16, 2024
2 parents fad3497 + 8b68126 commit 100ebf4
Showing 21 changed files with 662 additions and 173 deletions.
121 changes: 61 additions & 60 deletions examples/config_multilingual_nanoset.yaml
@@ -1,62 +1,63 @@
checkpoints:
checkpoint_interval: 1000
checkpoint_interval: 1000000
checkpoints_path: checkpoints/
checkpoints_path_is_shared_file_system: false
resume_checkpoint_path: null
save_initial_state: false
data_stages:
- data:
dataset:
training_folder: datasets/c4-es/train
validation_folder: datasets/c4-es/validation
lang_to_ids:
es: 128002
num_loading_workers: 1
seed: 42
name: General purpose training (Single dataset)
start_training_step: 1
- data:
dataset:
training_folder:
- datasets/c4-es/train
- datasets/c4-en/train
- datasets/c4-fr/train
validation_folder:
- datasets/c4-es/validation
- datasets/c4-en/validation
- datasets/c4-fr/validation
lang_to_ids:
es: 128002
en: 128003
fr: 128004
num_loading_workers: 1
seed: 42
name: Second purpose training (> 1 dataset)
start_training_step: 15
- data:
dataset:
training_folder:
datasets/c4-es/train: 0.6
datasets/c4-en/train: 0.3
datasets/c4-fr/train: 0.1
validation_folder:
- datasets/c4-es/validation
- datasets/c4-en/validation
- datasets/c4-fr/validation
lang_to_ids:
es: 128002
en: 128003
fr: 128004

num_loading_workers: 1
seed: 42
name: Third purpose training (Blended dataset)
start_training_step: 25
- data:
dataset:
training_folder:
- datasets/c4-es/train
- datasets/c4-en/train
- datasets/c4-fr/train
validation_folder:
- datasets/c4-es/validation
- datasets/c4-en/validation
- datasets/c4-fr/validation
languages:
- es
- en
- fr
num_loading_workers: 1
seed: 42
name: General purpose training (Blended dataset)
start_training_step: 1
- data:
dataset:
training_folder:
- datasets/c4-es/train
validation_folder:
- datasets/c4-es/validation
languages:
- es
num_loading_workers: 1
seed: 42
name: Second purpose training (Single dataset)
start_training_step: 1000
- data:
dataset:
training_folder:
- datasets/c4-es/train
- datasets/c4-en/train
- datasets/c4-fr/train
validation_folder:
- datasets/c4-es/validation
- datasets/c4-en/validation
- datasets/c4-fr/validation
languages:
- es
- en
- fr
num_loading_workers: 1
seed: 42
name: Third purpose training (>1 dataset)
start_training_step: 2000
general:
benchmark_csv_path: null
consumed_train_samples: null
ignore_sanity_checks: true
project: Nanoset
project: MultilingualV2
run: llama
seed: 42
step: null
@@ -75,12 +76,12 @@ model:
bos_token_id: 1
eos_token_id: 2
hidden_act: silu
hidden_size: 512
hidden_size: 4096
initializer_range: 0.02
intermediate_size: 512
intermediate_size: 14336
is_llama_config: true
max_position_embeddings: 1024
num_hidden_layers: 2
max_position_embeddings: 4096
num_hidden_layers: 32
num_attention_heads: 32
num_key_value_heads: 8
pad_token_id: null
@@ -89,7 +90,7 @@ model:
rope_theta: 500000.0
rms_norm_eps: 1.0e-06
rope_scaling: null
tie_word_embeddings: true
tie_word_embeddings: false
use_cache: true
vocab_size: 128256
optimizer:
@@ -112,11 +113,11 @@ optimizer:
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 1
dp: 2
expert_parallel_size: 1
pp: 1
pp_engine: 1f1b
tp: 1
tp: 4
tp_linear_async_communication: false
tp_mode: REDUCE_SCATTER
profiler: null
@@ -128,7 +129,7 @@ tokens:
batch_accumulation_per_replica: 1
limit_test_batches: 0
limit_val_batches: 10
micro_batch_size: 4
sequence_length: 1024
train_steps: 200
val_check_interval: -1
micro_batch_size: 3
sequence_length: 4096
train_steps: 500
val_check_interval: 100
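
For orientation, the updated parallelism and batch settings above work out as follows (a quick sketch assuming nanotron's usual definition of the global batch size as dp * micro_batch_size * batch_accumulation_per_replica):

```python
# Back-of-the-envelope token budget for the new example config values above.
# Assumes global_batch_size = dp * micro_batch_size * batch_accumulation_per_replica.
dp = 2
micro_batch_size = 3
batch_accumulation_per_replica = 1
sequence_length = 4096
train_steps = 500

global_batch_size = dp * micro_batch_size * batch_accumulation_per_replica  # 6 samples/step
tokens_per_step = global_batch_size * sequence_length                       # 24,576 tokens/step
total_tokens = tokens_per_step * train_steps                                 # ~12.3M tokens
print(global_batch_size, tokens_per_step, total_tokens)
```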
4 changes: 4 additions & 0 deletions examples/doremi/README.md
@@ -87,3 +87,7 @@ For evaluation, we do uniform sampling on the test set to evaluate a 2.5B model
- 2.5B llama trained using the optimized weights: https://huggingface.co/nanotron/doremi-llama-2.5b-optimized-weights

and the dataset: https://huggingface.co/datasets/nanotron/the-pile-for-doremi

#### Thoughts

DoReMi is useful if you don't initially have an idea of what a good distribution for your training data would be, or if you want a quick way to find a better baseline than the uniform distribution before tuning the data distribution by hand. In my previous experiments, DoReMi matched the pretraining performance of the distribution used for the mamba training but couldn't outperform it. I suspect it doesn't work well when the differences are subtle, i.e. when the gap between your known best distribution and a better distribution isn't significant.
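
As a rough illustration of the reweighting idea behind DoReMi (an exponentiated-gradient update of domain weights based on per-domain excess loss, as described in the DoReMi paper; this is a sketch of the idea, not the implementation in this folder):

```python
import numpy as np


def update_domain_weights(weights, proxy_loss, ref_loss, step_size=1.0, smoothing=1e-3):
    """One DoReMi-style step: upweight domains where the proxy model's loss exceeds
    the reference model's loss (excess loss), then mix with uniform for stability."""
    excess = np.maximum(proxy_loss - ref_loss, 0.0)       # per-domain excess loss
    new_w = weights * np.exp(step_size * excess)          # exponentiated-gradient update
    new_w = new_w / new_w.sum()                           # renormalize to a distribution
    uniform = np.ones_like(new_w) / len(new_w)
    return (1 - smoothing) * new_w + smoothing * uniform


# Example with three domains, starting from uniform weights
w = np.full(3, 1 / 3)
w = update_domain_weights(w, proxy_loss=np.array([2.9, 3.4, 3.1]), ref_loss=np.array([3.0, 3.1, 3.0]))
```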
12 changes: 12 additions & 0 deletions examples/mamba/README.md
@@ -18,6 +18,18 @@ pip install -r requirements.txt

> https://wandb.ai/bouteille/test/reports/Mamba-loss--Vmlldzo2OTgwNDM5
## Bug related to nanotron
Encountered the following issue when running train_mamba.sh:
```
causal_conv1d_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZNK3c1017SymbolicShapeMeta18init_is_contiguousEv
```
Solved it by reinstalling the packages:
```
pip uninstall mamba-ssm
pip install causal_conv1d==1.1.1
pip install mamba-ssm --no-cache-dir
```
See https://github.com/state-spaces/mamba/issues/169 for details.


## Credits
Credits to the following repositories from which the code was adapted:
- https://github.com/state-spaces/mamba
5 changes: 5 additions & 0 deletions examples/mup/README.md
@@ -32,3 +32,8 @@ We trained a 350m model with spectral µTransfer and standard parametrization us
Please check the directory [[./examples/mup/configs]](/examples/mup/configs) for the configurations we used to reproduce the experiments.

![LLaMA](./assets/llama.png)


#### Thoughts

For spectral µP, we only ran experiments on an MLP [link] and a 300m LLaMA [link] (the experiment configs are linked in the mup README). However, when we tested it on 1B/8B models (iirc), the loss blew up for some reason. So we'd recommend trying μTransfer rather than spectral μTransfer.
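
For reference, the μTransfer recipe recommended above boils down to tuning hyperparameters at a small base width and rescaling the learning rate of matrix-like parameters when scaling up; here is a simplified sketch of that idea (an illustration only, not the exact parametrization used in these configs):

```python
import torch
from torch import nn


def mup_param_groups(model: nn.Module, base_lr: float, base_width: int, width: int):
    """Simplified μTransfer-style LR scaling for Adam: hidden weight matrices get
    their learning rate scaled by base_width / width, while vector-like parameters
    (biases, norms) keep the base learning rate tuned at the small width."""
    matrix_like = [p for p in model.parameters() if p.ndim >= 2]
    vector_like = [p for p in model.parameters() if p.ndim < 2]
    return [
        {"params": matrix_like, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]


model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
optimizer = torch.optim.AdamW(mup_param_groups(model, base_lr=3e-3, base_width=256, width=1024))
```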
19 changes: 15 additions & 4 deletions run_train.py
@@ -194,7 +194,6 @@ def get_dataloader_from_data_stage(
sequence_length=trainer.sequence_length,
token_size=token_size,
train_split_num_samples=trainer.config.tokens.train_steps * trainer.global_batch_size,
dataset_tokens=data.dataset.dataset_tokens,
random_seed=data.seed,
)

@@ -209,6 +208,7 @@ def get_dataloader_from_data_stage(
consumed_train_samples=consumed_train_samples,
dataloader_num_workers=data.num_loading_workers,
dataloader_drop_last=True,
is_multilingual=True,
)

return train_dataloader
@@ -241,7 +241,6 @@ def get_valid_dataloader_from_data_stage(
dataset_folders=data.dataset.validation_folder,
sequence_length=trainer.sequence_length,
token_size=token_size,
dataset_tokens=data.dataset.dataset_tokens,
is_valid=True,
random_seed=data.seed,
)
@@ -256,6 +255,8 @@ def get_valid_dataloader_from_data_stage(
micro_batch_size=trainer.micro_batch_size,
dataloader_num_workers=data.num_loading_workers,
dataloader_drop_last=True,
shuffle=True,
is_multilingual=True,
)

return valid_dataloader
@@ -315,7 +316,7 @@ def get_valid_dataloader(trainer: DistributedTrainer) -> Dict[str, DataLoader]:
stage = cast(DatasetStageArgs, stage)

log_rank(
f"[Validation Plan] Stage {stage.name} has {len(stage.data.dataset.validation_folder)} folders with samples in the validation set",
f"[Validation Plan] Stage {stage.name} has {len(stage.data.dataset.validation_folder)} folders with samples for the validation set",
logger=logger,
level=logging.INFO,
rank=0,
@@ -324,8 +325,18 @@ def get_valid_dataloader(trainer: DistributedTrainer) -> Dict[str, DataLoader]:
dataloader = (
get_valid_dataloader_from_data_stage(trainer, stage.data)
if stage_idx == 0
else lambda stage=stage: get_dataloader_from_data_stage(trainer, stage.data)
else lambda stage=stage: get_valid_dataloader_from_data_stage(trainer, stage.data)
)
# TODO(tj.solergibert) Because we recreate the valid dataloader at every validation stage, we print the
# validation MultilingualNanoset info (number of samples, etc.) multiple times [UPDATE: ]. To solve that, we could get rid
# of these lambda funcs and directly create all the dataloaders.
#
# These lambda funcs (used in training too) create the DataLoaders lazily in order to 1. start training faster instead
# of creating multiple DataLoaders up front and 2. consume less memory, as the lambda func is lighter than the DataLoader
# object with its Dataset, collator, etc.
# BUT 1. the Nanoset creation process is very fast and 2. Nanosets don't consume any memory at all until we start sampling
# from them. Also, the DataLoader is later transformed into an Iterator object, so it's impossible to retrieve
# the DataLoader object again to delete it (more comments in trainer.py)
dataloaders[stage.name] = dataloader
return dataloaders
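
For context, the eager alternative mentioned in the TODO above could look roughly like this (a sketch reusing the imports and helpers already present in run_train.py; stage filtering and sanity checks are omitted):

```python
def get_valid_dataloaders_eagerly(trainer: DistributedTrainer) -> Dict[str, DataLoader]:
    """Sketch: build every validation dataloader up front instead of wrapping later
    stages in lambdas. Nanoset creation is fast and touches no memory until sampling
    starts, so this also avoids re-creating (and re-logging) the validation Nanoset
    at every validation stage."""
    dataloaders = {}
    for stage in trainer.config.data_stages:
        stage = cast(DatasetStageArgs, stage)
        dataloaders[stage.name] = get_valid_dataloader_from_data_stage(trainer, stage.data)
    return dataloaders
```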

38 changes: 29 additions & 9 deletions src/nanotron/config/config.py
@@ -11,7 +11,12 @@
from yaml.loader import SafeLoader

from nanotron.config.lighteval_config import LightEvalConfig
from nanotron.config.models_config import ExistingCheckpointInit, NanotronConfigs, RandomInit, SpectralMupInit
from nanotron.config.models_config import (
ExistingCheckpointInit,
NanotronConfigs,
RandomInit,
SpectralMupInit,
)
from nanotron.config.parallelism_config import ParallelismArgs
from nanotron.config.utils_config import (
RecomputeGranularity,
@@ -111,7 +116,7 @@ def __post_init__(self):
class MultilingualNanosetDatasetsArgs:
training_folder: Union[str, dict, List[str]]
validation_folder: Union[str, List[str]]
lang_to_ids: dict # Mapping from the previously defined folders to tokens. Respect the order
languages: List[str] # NOTE(tj.solergibert) Required for 1. Aggregating the result 2. Reporting to WANDB

def __post_init__(self):
if isinstance(self.training_folder, str): # Case 1: 1 Dataset folder
@@ -125,20 +130,25 @@ def __post_init__(self):
self.training_folder = list(tmp_training_folder.keys())
self.dataset_weights = list(tmp_training_folder.values())

self.dataset_tokens = list(self.lang_to_ids.values())
assert len(self.training_folder) == len(
self.languages
), f"The sizes of training_folder and languages mismatch ({len(self.training_folder)} vs {len(self.languages)})"

assert len(self.training_folder) == len(
self.validation_folder
), f"The sizes of training_folder and validation_folder mismatch ({len(self.training_folder)} vs {len(self.validation_folder)})"
assert len(self.training_folder) == len(
self.dataset_tokens
), f"The sizes of training_folder and lang_to_ids mismatch ({len(self.training_folder)} vs {len(self.dataset_tokens)})"


@dataclass
class DataArgs:
"""Arguments related to the data and data files processing"""

dataset: Union[PretrainDatasetsArgs, NanosetDatasetsArgs, MultilingualNanosetDatasetsArgs]
dataset: Union[
PretrainDatasetsArgs,
NanosetDatasetsArgs,
MultilingualNanosetDatasetsArgs,
MultilingualNanosetDatasetsArgs,
]
seed: Optional[int]
num_loading_workers: Optional[int] = 1
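
As a usage sketch of the updated dataclass (a hypothetical snippet assuming only the fields shown in this diff), the new `languages` list simply has to line up, in order, with the training folders:

```python
from nanotron.config.config import MultilingualNanosetDatasetsArgs

# Hypothetical example: `languages` replaces the old `lang_to_ids` mapping and needs
# one entry per training folder, in the same order.
dataset_args = MultilingualNanosetDatasetsArgs(
    training_folder=["datasets/c4-es/train", "datasets/c4-en/train"],
    validation_folder=["datasets/c4-es/validation", "datasets/c4-en/validation"],
    languages=["es", "en"],
)
# __post_init__ asserts len(training_folder) == len(languages) == len(validation_folder)
```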

@@ -405,6 +415,13 @@ def __post_init__(self):
for i in range(len(self.data_stages) - 1)
), "The stages are not sorted by start_training_step in increasing order"

# NOTE(tj.solergibert) As we are reporting the training & validation metrics together, we
# must comply with val_check_interval % iteration_step_info_interval = 0
if not self.tokens.val_check_interval % self.logging.iteration_step_info_interval == 0:
raise ValueError(
f"It is necessary to run the validation stage during a logging step. Validation interval: {self.tokens.val_check_interval}, Logging interval: {self.logging.iteration_step_info_interval}"
)

# # if lighteval, we need tokenizer to be defined
# if self.checkpoints.lighteval is not None:
# assert self.tokenizer.tokenizer_name_or_path is not None
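
The constraint added above simply requires every validation step to coincide with a logging step, so that training and validation metrics can be reported together; for instance (interval value taken from the example config, logging interval assumed):

```python
val_check_interval = 100           # tokens.val_check_interval in the example YAML
iteration_step_info_interval = 50  # assumed logging interval; any divisor of 100 passes
assert val_check_interval % iteration_step_info_interval == 0
# e.g. iteration_step_info_interval = 30 would trigger the ValueError in __post_init__
```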
@@ -427,7 +444,10 @@ def as_dict(self) -> dict:


def get_config_from_dict(
config_dict: dict, config_class: Type = Config, skip_unused_config_keys: bool = False, skip_null_keys: bool = False
config_dict: dict,
config_class: Type = Config,
skip_unused_config_keys: bool = False,
skip_null_keys: bool = False,
):
"""Get a config object from a dictionary
@@ -445,7 +465,7 @@ def get_config_from_dict(
if skip_null_keys:
logger.warning("Skip_null_keys set")
config_dict = {
k: {kk: vv for kk, vv in v.items() if vv is not None} if isinstance(v, dict) else v
k: ({kk: vv for kk, vv in v.items() if vv is not None} if isinstance(v, dict) else v)
for k, v in config_dict.items()
if v is not None
}
10 changes: 8 additions & 2 deletions src/nanotron/config/models_config.py
@@ -170,7 +170,10 @@ def as_starcoder2(self) -> Starcoder2Config:
if "_is_using_mup" in config:
del config["_is_using_mup"]
return Starcoder2Config(
grouped_query=True, num_kv_heads=self.num_attention_heads, use_rotary_embeddings=False, **config
grouped_query=True,
num_kv_heads=self.num_attention_heads,
use_rotary_embeddings=False,
**config,
)

@property
@@ -244,7 +247,10 @@ def as_starcoder2(self) -> Starcoder2Config:
if "_is_using_mup" in config:
del config["_is_using_mup"]
return Starcoder2Config(
grouped_query=True, num_kv_heads=self.num_attention_heads, use_rotary_embeddings=False, **config
grouped_query=True,
num_kv_heads=self.num_attention_heads,
use_rotary_embeddings=False,
**config,
)

@property
2 changes: 2 additions & 0 deletions src/nanotron/config/parallelism_config.py
@@ -23,6 +23,7 @@ class ParallelismArgs:
pp_engine: Pipeline engine to use between "1f1b" and "afab"
tp_mode: TP mode to use between "all_reduce" and "reduce_scatter": all_reduce is normal, reduce_scatter activate sequence parallelism
tp_linear_async_communication: Whether to use async communication in TP linear layers
recompute_layer: Whether to recompute each Transformer layer to save memory.
"""

dp: int
@@ -31,6 +32,7 @@ class ParallelismArgs:
pp_engine: Optional[PipelineEngine] = None
tp_mode: Optional[TensorParallelLinearMode] = None
tp_linear_async_communication: Optional[bool] = None
recompute_layer: bool = False

expert_parallel_size: int = 1
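
The new `recompute_layer` flag trades compute for activation memory; conceptually it corresponds to wrapping each layer's forward pass in activation checkpointing, along the lines of this sketch (an illustration using torch.utils.checkpoint, not nanotron's exact implementation):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class RecomputableLayer(nn.Module):
    """Wraps a layer so its activations are recomputed in backward instead of stored."""

    def __init__(self, layer: nn.Module, recompute_layer: bool = False):
        super().__init__()
        self.layer = layer
        self.recompute_layer = recompute_layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if self.recompute_layer and self.training:
            # Trade extra forward compute for lower activation memory.
            return checkpoint(self.layer, hidden_states, use_reentrant=False)
        return self.layer(hidden_states)
```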
