Remove HMP from optimum-habana (#349)
jwieczorekhabana authored Nov 24, 2023
1 parent 2f16e5d commit 2129f91
Showing 34 changed files with 290 additions and 480 deletions.
55 changes: 3 additions & 52 deletions docs/source/package_reference/gaudi_config.mdx
@@ -16,69 +16,20 @@ limitations under the License.

# Gaudi Configuration

In order to make the most of Gaudi, it is advised to rely on advanced features such as Habana Mixed Precision or optimized operators.
You can specify which features to use in a Gaudi configuration, which will take the form of a JSON file following this template:

```JSON
{
"use_habana_mixed_precision": true/false,
"hmp_is_verbose": true/false,
"use_fused_adam": true/false,
"use_fused_clip_norm": true/false,
"hmp_bf16_ops": [
"torch operator to compute in bf16",
"..."
],
"hmp_fp32_ops": [
"torch operator to compute in fp32",
"..."
]
}
```

Here is a description of each configuration parameter:
- `use_habana_mixed_precision` lets you decide whether Habana Mixed Precision (HMP) should be used. HMP allows mixing *fp32* and *bf16* operations. You can find more information [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html).
- `hmp_is_verbose` lets you decide whether to log precision decisions for each operation for debugging purposes. It is disabled by default. You can find an example of such a log [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html#hmp-logs).
- `use_fused_adam` lets you decide whether to use the [custom fused implementation of the ADAM optimizer provided by Habana](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Custom_Ops_PyTorch.html#custom-optimizers).
- `use_fused_clip_norm` lets you decide whether to use the [custom fused implementation of gradient norm clipping provided by Habana](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Custom_Ops_PyTorch.html#other-custom-ops).
- `hmp_bf16_ops` lets you specify the Torch operations that should be computed in *bf16*. You can find more information about casting rules [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html#basic-design-rules).
- `hmp_fp32_ops` lets you specify the Torch operations that should be computed in *fp32*. You can find more information about casting rules [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html#basic-design-rules).

<Tip warning={true}>

`hmp_is_verbose`, `hmp_bf16_ops` and `hmp_fp32_ops` will not be used if `use_habana_mixed_precision` is false.
- `use_torch_autocast` enables PyTorch autocast; it is used to define good pre-defined configurations; users should favor the `--bf16` training argument
- `autocast_bf16_ops` is the list of operations that should run in *bf16* precision under the autocast context; overriding the operator list through the `LOWER_LIST` environment variable is the preferred approach
- `autocast_fp32_ops` is the list of operations that should run in *fp32* precision under the autocast context; overriding the operator list through the `FP32_LIST` environment variable is the preferred approach (see the sketch after this tip)

</Tip>
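
As a point of reference, a configuration relying on Torch autocast could look like the following minimal sketch (the operator lists are illustrative placeholders and should be adapted to your model):

```JSON
{
    "use_torch_autocast": true,
    "use_fused_adam": true,
    "use_fused_clip_norm": true,
    "autocast_bf16_ops": [
        "add",
        "addmm",
        "matmul",
        "mm",
        "softmax"
    ],
    "autocast_fp32_ops": [
        "embedding",
        "nll_loss",
        "log_softmax"
    ]
}
```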

You can find examples of Gaudi configurations in the [Habana model repository on the Hugging Face Hub](https://huggingface.co/habana). For instance, [for BERT Large we have](https://huggingface.co/Habana/bert-large-uncased-whole-word-masking/blob/main/gaudi_config.json):

```JSON
{
"use_habana_mixed_precision": true,
"hmp_is_verbose": false,
"use_fused_adam": true,
"use_fused_clip_norm": true,
"hmp_bf16_ops": [
"add",
"addmm",
"bmm",
"div",
"dropout",
"gelu",
"iadd",
"linear",
"layer_norm",
"matmul",
"mm",
"rsub",
"softmax",
"truediv"
],
"hmp_fp32_ops": [
"embedding",
"nll_loss",
"log_softmax"
]
}
```
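
Below is a rough sketch of how such a configuration is consumed from Python (assuming an HPU-enabled environment with Optimum Habana installed; the keyword arguments mirror the command-line arguments used throughout the examples, but double-check them against your installed version):

```python
from optimum.habana import GaudiConfig, GaudiTrainingArguments

# Download the Gaudi configuration shown above from the Hugging Face Hub
gaudi_config = GaudiConfig.from_pretrained("Habana/bert-large-uncased-whole-word-masking")
print(gaudi_config.use_fused_adam)  # True for the configuration shown above

# Training arguments targeting HPUs; `bf16=True` enables mixed precision through Torch autocast
training_args = GaudiTrainingArguments(
    output_dir="/tmp/gaudi_output",  # placeholder path
    use_habana=True,
    use_lazy_mode=True,
    bf16=True,
)
```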

42 changes: 7 additions & 35 deletions docs/source/usage_guides/accelerate_training.mdx
@@ -57,44 +57,16 @@ To not take them into account in the computation of the throughput at the end of
## Mixed-Precision Training

Mixed-precision training enables computing some operations in lighter data types to accelerate training.
Habana Mixed Precision (HMP) mixes *fp32* and *bf16* operations.
Optimum Habana enables mixed-precision training in the same way as 🤗 Transformers:
- the `--bf16` argument enables PyTorch autocast
- the `--half_precision_backend [hpu_amp, cpu_amp]` argument specifies the device on which mixed-precision operations should be performed

<Tip warning={true}>

Please refer to the [list of supported PyTorch operators](https://docs.habana.ai/en/latest/PyTorch/Pytorch_Operators/Pytorch_Operators.html) beforehand to make sure the ones you are interested in are compatible with *bf16*.

</Tip>

To apply HMP, you must set `"use_habana_mixed_precision"` to `true` in the Gaudi configuration file.
Then, you can specify which operators to compute in *bf16* with `"hmp_bf16_ops"` and which operators to compute in *fp32* with `"hmp_fp32_ops"`.
If these operators are not specified, they default to the ones listed in the [Gaudi configuration file of BERT](https://huggingface.co/Habana/bert-large-uncased-whole-word-masking/blob/main/gaudi_config.json), which is a good starting point for applying HMP:
```
"hmp_bf16_ops": [
"add",
"addmm",
"bmm",
"div",
"dropout",
"gelu",
"iadd",
"linear",
"layer_norm",
"matmul",
"mm",
"rsub",
"softmax",
"truediv"
],
"hmp_fp32_ops": [
"embedding",
"nll_loss",
"log_softmax"
]
```

<Tip>
<Tip warning={true}>

Torch Autocast can also be used as a backend for mixed-precision training. You need to add the `--bf16` argument to enable it.
Please refer to the [advanced autocast usage on Gaudi](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/Autocast.html) for more information on:
- the default autocast operations
- overriding the default autocast operations (a sketch follows this tip)

</Tip>
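
As an illustration, the operator lists used by autocast can be overridden through the `LOWER_LIST` and `FP32_LIST` environment variables when launching a run with `--bf16`. The sketch below is hypothetical: `ops_bf16.txt` and `ops_fp32.txt` are placeholder files assumed to list one operator per line, and the remaining arguments are illustrative, following the style of the GPT-2 causal language modeling example:

```bash
LOWER_LIST=ops_bf16.txt FP32_LIST=ops_fp32.txt python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --do_train \
    --output_dir /tmp/test-clm \
    --gaudi_config_name Habana/gpt2 \
    --use_habana \
    --use_lazy_mode \
    --bf16
```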

9 changes: 6 additions & 3 deletions examples/audio-classification/README.md
@@ -47,7 +47,8 @@ python run_audio_classification.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/wav2vec2 \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

On a single HPU, this script should run in ~13 minutes and yield an accuracy of **97.96%**.
@@ -83,7 +84,8 @@ python ../gaudi_spawn.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/wav2vec2 \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

On 8 HPUs, this script should run in ~12 minutes and yield an accuracy of **80.49%**.
@@ -157,7 +159,8 @@ python run_audio_classification.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/wav2vec2
--gaudi_config_name Habana/wav2vec2 \
--bf16
```


2 changes: 1 addition & 1 deletion examples/audio-classification/run_audio_classification.py
@@ -237,7 +237,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
9 changes: 6 additions & 3 deletions examples/contrastive-image-text/README.md
@@ -110,7 +110,8 @@ python run_clip.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/clip \
--throughput_warmup_steps 3 \
--dataloader_num_workers 16
--dataloader_num_workers 16 \
--bf16
```


@@ -141,7 +142,8 @@ python ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
--throughput_warmup_steps 3 \
--dataloader_num_workers 16 \
--mediapipe_dataloader \
--use_hpu_graphs_for_training
--use_hpu_graphs_for_training \
--bf16
```

> `--mediapipe_dataloader` only works on Gaudi2.
@@ -247,5 +249,6 @@ python run_clip.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/clip
--gaudi_config_name Habana/clip \
--bf16
```
2 changes: 1 addition & 1 deletion examples/contrastive-image-text/run_bridgetower.py
@@ -303,7 +303,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
2 changes: 1 addition & 1 deletion examples/contrastive-image-text/run_clip.py
@@ -301,7 +301,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
12 changes: 8 additions & 4 deletions examples/image-classification/README.md
@@ -47,7 +47,8 @@ python run_image_classification.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 3 \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
```

For Swin, you need to change/add the following arguments:
@@ -95,7 +96,8 @@ python run_image_classification.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 3 \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
```

Internally, the script will use the [`ImageFolder`](https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder) feature which will automatically turn the folders into 🤗 Dataset objects.
@@ -196,7 +198,8 @@ python ../gaudi_spawn.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 3 \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
```

For Swin, you need to change/add the following arguments:
@@ -279,4 +282,5 @@ python run_image_classification.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
2 changes: 1 addition & 1 deletion examples/image-classification/run_image_classification.py
@@ -240,7 +240,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
15 changes: 10 additions & 5 deletions examples/language-modeling/README.md
@@ -178,7 +178,8 @@ python run_mlm.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

To run on your own training and validation files, use the following command:
@@ -197,7 +198,8 @@ python run_mlm.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
@@ -223,7 +225,8 @@ python ../gaudi_spawn.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -247,7 +250,8 @@ python run_clm.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -338,7 +342,8 @@ python run_clm.py \
--gaudi_config_name Habana/gpt2 \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference
--use_hpu_graphs_for_inference \
--bf16
```


2 changes: 1 addition & 1 deletion examples/language-modeling/run_clm.py
@@ -311,7 +311,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
2 changes: 1 addition & 1 deletion examples/language-modeling/run_mlm.py
@@ -302,7 +302,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
12 changes: 8 additions & 4 deletions examples/question-answering/README.md
@@ -53,7 +53,8 @@ python run_qa.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -79,7 +80,8 @@ python ../gaudi_spawn.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -148,7 +150,8 @@ python run_qa.py \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference
--use_hpu_graphs_for_inference \
--bf16
```


@@ -198,7 +201,8 @@ python run_seq2seq_qa.py \
--ignore_pad_token_for_loss False \
--pad_to_max_length \
--save_strategy epoch \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

For multi-card and DeepSpeed runs, you can use `python ../gaudi_spawn.py --world_size 8 --use_mpi` and `python ../gaudi_spawn.py --world_size 8 --use_deepspeed` as shown in the previous sections.
2 changes: 1 addition & 1 deletion examples/question-answering/run_qa.py
@@ -292,7 +292,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
2 changes: 1 addition & 1 deletion examples/question-answering/run_seq2seq_qa.py
@@ -338,7 +338,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "