DeepSpeed Usage
You can also train with Microsoft DeepSpeed's Sparse Attention, with any combination of dense and sparse attention that you'd like. However, you will have to endure the installation process.
If everything installed correctly, you now have access to a few new features:
dalle = DALLE(
    dim = 512,
    depth = 64,
    heads = 8,
    attn_types = ('full', 'sparse')  # interleave sparse and dense attention for 64 layers
)
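To make the interleaving concrete, here is a minimal sketch of how a short attn_types tuple could be spread over all layers. It assumes the tuple is simply repeated in order until there is one entry per layer; the helper name expand_attn_types is hypothetical and is not part of dalle_pytorch.

```python
from itertools import cycle, islice

def expand_attn_types(attn_types, depth):
    """Repeat attn_types in order until there is one entry per layer."""
    return list(islice(cycle(attn_types), depth))

# Assumption: layers alternate in the order given by the tuple.
layers = expand_attn_types(('full', 'sparse'), depth=64)
print(layers[:4])
print(layers.count('full'), layers.count('sparse'))
```

Under that assumption, a 64-layer model with ('full', 'sparse') ends up with 32 dense and 32 sparse attention layers.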
You should now run all training sessions with deepspeed instead of python if you wish to make use of its distributed features.
deepspeed train_dalle.py <...> --distributed_backend deepspeed
deepspeed train_dalle.py <...> --distributed_backend deepspeed --fp16
deepspeed --num_gpus 1 train_dalle.py <...> --distributed_backend deepspeed
Change the deepspeed_config dictionary in train_dalle.py or train_vae.py to adjust DeepSpeed based on your setup.
If you are interested in ZeRO-enabled training, see below:
To train in 16-bit floating point (fp16), simply pass --fp16 to train_dalle.py (not available for train_vae.py).
deepspeed train_dalle.py --image_text_folder=/path/to/your/dataset --distributed_backend deepspeed --fp16
ZeRO stages 1-3 have been confirmed to work (for us) when using V100, A100, and RTX3090 GPUs.
ZeRO currently only works with half-precision training, so you have to pass the --fp16 flag when activating it:
deepspeed_config = {
    "zero_optimization": {
        "stage": 1,
    },
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}
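Because ZeRO depends on --fp16 here, it can help to fail early when the two settings disagree. The following is a minimal sketch, not code from train_dalle.py: the helper make_zero_config and the example batch size and clipping values are hypothetical.

```python
def make_zero_config(stage, fp16, batch_size, grad_clip_norm):
    """Build a deepspeed_config-style dict, rejecting ZeRO without fp16."""
    if stage not in (0, 1, 2, 3):
        raise ValueError(f"unknown ZeRO stage: {stage}")
    if stage > 0 and not fp16:
        # Mirrors the rule stated above: ZeRO only with half precision.
        raise ValueError("ZeRO stages 1-3 require --fp16")
    return {
        "zero_optimization": {"stage": stage},
        "train_batch_size": batch_size,
        "gradient_clipping": grad_clip_norm,
        "fp16": {"enabled": fp16},
    }

# Hypothetical values, for illustration only.
config = make_zero_config(stage=1, fp16=True, batch_size=4, grad_clip_norm=0.5)
print(config["zero_optimization"], config["fp16"])
```

Calling make_zero_config(stage=2, fp16=False, ...) raises a ValueError instead of starting a run that would fail later.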
Stage 2 will try to use gradient_accumulate in order to fill up the VRAM of each GPU more effectively.
You may also optionally enable cpu_offload at this point in order to use the CPU-based Adam which deepspeed provides.
deepspeed_config = {
    "zero_optimization": {
        "stage": 2,
        "cpu_offload": True,
    },
    [...]
}
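Newer DeepSpeed releases document cpu_offload as deprecated in favor of an offload_optimizer section, so the flag above may produce a warning depending on your installed version. As a rough sketch under that assumption, a legacy config could be translated like this; the helper modernize_cpu_offload is hypothetical, so check the DeepSpeed configuration docs for your version before relying on it.

```python
import copy

def modernize_cpu_offload(config):
    """Return a copy where zero_optimization.cpu_offload=True is replaced
    by zero_optimization.offload_optimizer={"device": "cpu"}."""
    new_config = copy.deepcopy(config)
    zero = new_config.get("zero_optimization", {})
    # pop() removes the legacy key whether it was True or False.
    if zero.pop("cpu_offload", False):
        zero.setdefault("offload_optimizer", {"device": "cpu"})
    return new_config

legacy = {"zero_optimization": {"stage": 2, "cpu_offload": True}}
modern = modernize_cpu_offload(legacy)
print(modern)
```

The input dictionary is left untouched, so both forms can be compared side by side.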
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
    },
    'fp16': {
        'enabled': args.fp16,
        'loss_scale': 0,
        'initial_scale_power': 15,
    },
    [...]
}
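In DeepSpeed's fp16 settings, a loss_scale of 0 selects dynamic loss scaling, and initial_scale_power gives the starting scale as a power of two. The small sketch below works out what the values above imply; the helper initial_loss_scale is hypothetical and only does the arithmetic.

```python
def initial_loss_scale(fp16_config):
    """Return the starting loss scale implied by an fp16 config section."""
    static_scale = fp16_config.get("loss_scale", 0)
    if static_scale != 0:
        # A non-zero loss_scale means a fixed (static) scale.
        return float(static_scale)
    # loss_scale == 0: dynamic scaling starting at 2 ** initial_scale_power.
    return float(2 ** fp16_config["initial_scale_power"])

fp16_section = {"enabled": True, "loss_scale": 0, "initial_scale_power": 15}
print(initial_loss_scale(fp16_section))  # 32768.0
```

So the stage 3 example starts dynamic loss scaling at 2^15 = 32768.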
Fair warning: This stuff is experimental. If you have issues let us know in the Issues section.
deepspeed_config = {
    "zero_optimization": {
        "stage": 3,
        # Offload the model parameters. If you have an NVMe drive, you should use the nvme option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line.
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/home/samsepiol/.cache/DeepSpeed/deepspeed_param",
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        # Offload the optimizer of choice. If you have an NVMe drive, you should use the nvme option.
        # Otherwise, use 'cpu' and remove the `nvme_path` line.
        "offload_optimizer": {
            "device": "nvme",  # options are 'none', 'cpu', 'nvme'
            "nvme_path": "/home/samsepiol/.cache/DeepSpeed/deepspeed_optim",
            "buffer_count": 4,
            "pin_memory": False,
            "fast_init": False
        },
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "cpu_checkpointing": True,
        "contiguous_memory_optimization": True,
        "number_checkpoints": None,
        "synchronize_checkpoint_boundary": True,
        "profile": False
    },
    # Override pytorch's Adam optim with `FusedAdam` (just called Adam here).
    "optimizer": {
        "type": "Adam",  # You can also use AdamW here
        "params": {
            "lr": LEARNING_RATE,
        },
    },
    'train_batch_size': BATCH_SIZE,
    'gradient_clipping': GRAD_CLIP_NORM,
    'fp16': {
        'enabled': args.fp16,
    },
}
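The comments in the config above say to switch the device to 'cpu' and drop nvme_path when no NVMe drive is available. One way to avoid editing both sections by hand is a small builder like the sketch below; offload_section is a hypothetical helper, and the paths shown are placeholders.

```python
def offload_section(device, nvme_path=None, **extra):
    """Build an offload_param / offload_optimizer style section.

    device is 'none', 'cpu' or 'nvme'; nvme_path is only kept for 'nvme'.
    """
    if device not in ("none", "cpu", "nvme"):
        raise ValueError(f"unsupported offload device: {device}")
    section = {"device": device, **extra}
    if device == "nvme":
        if not nvme_path:
            raise ValueError("nvme offload needs an nvme_path")
        section["nvme_path"] = nvme_path
    return section

# Placeholder paths, for illustration only.
cpu_optim = offload_section("cpu", nvme_path="/ignored/for/cpu", pin_memory=False)
nvme_optim = offload_section("nvme", nvme_path="/path/to/nvme/deepspeed_optim", buffer_count=4)
print(cpu_optim)
print(nvme_optim)
```

With 'cpu' the nvme_path argument is silently dropped, which matches the instruction to remove that line from the config.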