Remove HMP from optimum-habana #349

Merged
55 changes: 3 additions & 52 deletions docs/source/package_reference/gaudi_config.mdx
@@ -16,69 +16,20 @@ limitations under the License.

# Gaudi Configuration

In order to make the most of Gaudi, it is advised to rely on advanced features such as Habana Mixed Precision or optimized operators.
You can specify which features to use in a Gaudi configuration, which will take the form of a JSON file following this template:

```JSON
{
"use_habana_mixed_precision": true/false,
"hmp_is_verbose": true/false,
"use_fused_adam": true/false,
"use_fused_clip_norm": true/false,
"hmp_bf16_ops": [
"torch operator to compute in bf16",
"..."
],
"hmp_fp32_ops": [
"torch operator to compute in fp32",
"..."
]
}
```

Here is a description of each configuration parameter:
- `use_habana_mixed_precision` determines whether Habana Mixed Precision (HMP) should be used. HMP allows mixing *fp32* and *bf16* operations. You can find more information [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html).
- `hmp_is_verbose` determines whether to log precision decisions for each operation for debugging purposes. It is disabled by default. You can find an example of such a log [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html#hmp-logs).
- `use_fused_adam` determines whether to use the [custom fused implementation of the ADAM optimizer provided by Habana](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Custom_Ops_PyTorch.html#custom-optimizers).
- `use_fused_clip_norm` determines whether to use the [custom fused implementation of gradient norm clipping provided by Habana](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Custom_Ops_PyTorch.html#other-custom-ops).
- `hmp_bf16_ops` specifies the Torch operations that should be computed in *bf16*. You can find more information about casting rules [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html#basic-design-rules).
- `hmp_fp32_ops` specifies the Torch operations that should be computed in *fp32*. You can find more information about casting rules [here](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/PT_Mixed_Precision.html#basic-design-rules).
Comment on lines -44 to -45
Collaborator:
I think we should mention:

  • `use_torch_autocast`, but saying that `--bf16` should be favored, as `use_torch_autocast` is used to define a good pre-defined config
  • `autocast_bf16_ops` and `autocast_fp32_ops`, as "Add support for autocast custom ops in GaudiTrainer" (#308) enables users to specify custom op lists, but saying that the defaults should work for most models

Collaborator:

As discussed by email, regarding autocast_bf16_ops and autocast_fp32_ops, I'm fine with saying that the env variable way should be favored. But they should still be documented.


<Tip warning={true}>

`hmp_is_verbose`, `hmp_bf16_ops` and `hmp_fp32_ops` will not be used if `use_habana_mixed_precision` is false.
- `use_torch_autocast` enables PyTorch autocast; it is used to define a good pre-defined config; users should favor the `--bf16` training argument
- `autocast_bf16_ops`: list of operations that should be run with bf16 precision under the autocast context; using the `LOWER_LIST` environment variable is the preferred way to override the autocast operator list
- `autocast_fp32_ops`: list of operations that should be run with fp32 precision under the autocast context; using the `FP32_LIST` environment variable is the preferred way to override the autocast operator list

</Tip>

You can find examples of Gaudi configurations in the [Habana model repository on the Hugging Face Hub](https://huggingface.co/habana). For instance, [for BERT Large we have](https://huggingface.co/Habana/bert-large-uncased-whole-word-masking/blob/main/gaudi_config.json):

```JSON
{
"use_habana_mixed_precision": true,
"hmp_is_verbose": false,
"use_fused_adam": true,
"use_fused_clip_norm": true,
"hmp_bf16_ops": [
"add",
"addmm",
"bmm",
"div",
"dropout",
"gelu",
"iadd",
"linear",
"layer_norm",
"matmul",
"mm",
"rsub",
"softmax",
"truediv"
],
"hmp_fp32_ops": [
"embedding",
"nll_loss",
"log_softmax"
]
}
```
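
For comparison, here is a minimal sketch of what a post-HMP configuration could look like when built programmatically. This is an illustration only: it assumes `GaudiConfig` accepts these keyword arguments (mirroring the documented JSON keys) and that `save_pretrained` writes a `gaudi_config.json`; the op lists are examples, not the library's defaults:

```python
# Sketch only: builds a Gaudi configuration without any HMP fields.
from optimum.habana import GaudiConfig

gaudi_config = GaudiConfig(
    use_torch_autocast=True,   # prefer the --bf16 training argument in practice
    use_fused_adam=True,
    use_fused_clip_norm=True,
    # Custom autocast op lists; the built-in defaults should work for most models.
    autocast_bf16_ops=["add", "addmm", "bmm", "linear", "matmul", "softmax"],
    autocast_fp32_ops=["embedding", "nll_loss", "log_softmax"],
)
gaudi_config.save_pretrained("./my_gaudi_config")  # assumed to write gaudi_config.json
```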

42 changes: 7 additions & 35 deletions docs/source/usage_guides/accelerate_training.mdx
@@ -57,44 +57,16 @@ To not take them into account in the computation of the throughput at the end of
## Mixed-Precision Training

Mixed-precision training makes it possible to compute some operations in lighter data types to accelerate training.
Habana Mixed Precision (HMP) proposes to mix *fp32* and *bf16* operations.
Optimum Habana enables mixed-precision training in a similar fashion to 🤗 Transformers (see the sketch after this list):
- the `--bf16` argument enables PyTorch autocast
- the `--half_precision_backend [hpu_amp, cpu_amp]` argument specifies the device on which mixed-precision operations should be performed
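
As a quick illustration, these flags map onto `GaudiTrainingArguments`; a minimal sketch, assuming `GaudiTrainingArguments` mirrors `transformers.TrainingArguments` and accepts the Habana-specific arguments shown in the example commands below:

```python
# Sketch only: programmatic equivalent of passing --bf16 on the command line.
from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="./out",
    use_habana=True,
    use_lazy_mode=True,
    bf16=True,                         # enable PyTorch autocast on HPU
    half_precision_backend="hpu_amp",  # or "cpu_amp"
)
```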

<Tip warning={true}>

Please refer to the [list of supported PyTorch operators](https://docs.habana.ai/en/latest/PyTorch/Pytorch_Operators/Pytorch_Operators.html) beforehand to make sure the ones you are interested in are compatible with *bf16*.

</Tip>
Comment on lines -62 to -66
Collaborator:

I would keep this

Contributor Author:

But those operators are incompatible with autocast. HMP and autocast operate on different software levels. Please see: https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/Autocast.html#override-options

Collaborator:

No problem, we don't have to keep the same operators. Maybe it will just be easier to refer to GPT2's Gaudi config.


To apply HMP, you must set `"use_habana_mixed_precision"` to `true` in the Gaudi configuration file.
Then, you can specify which operators to compute in *bf16* with `"hmp_bf16_ops"` and which operators to compute in *fp32* with `"hmp_fp32_ops"`.
If these operators are not specified, their default values are set to be the ones written in the [Gaudi configuration file of BERT](https://huggingface.co/Habana/bert-large-uncased-whole-word-masking/blob/main/gaudi_config.json), which is a good starting point for applying HMP:
```
"hmp_bf16_ops": [
"add",
"addmm",
"bmm",
"div",
"dropout",
"gelu",
"iadd",
"linear",
"layer_norm",
"matmul",
"mm",
"rsub",
"softmax",
"truediv"
],
"hmp_fp32_ops": [
"embedding",
"nll_loss",
"log_softmax"
]
```
Comment on lines -69 to -93
Collaborator:

I would still keep a part of this to show how to specify custom op lists. We can add a link to the GPT2 Gaudi config when it is updated.

Contributor Author:

But shouldn't users provide custom lists in a similar way to other training demos outside of HuggingFace? We can keep those in GaudiConfig to make sure they are optimized for a specific model.

Collaborator:

IMO users should be able to do both because those already used to Optimum Habana probably have Gaudi configs with custom op lists, so switching to Autocast will be easy and they won't be confused.


<Tip>
<Tip warning={true}>
regisss marked this conversation as resolved.

Torch Autocast can also be used as a backend for mixed-precision training. You need to add the `--bf16` argument to enable it.
Please refer to the [advanced autocast usage on Gaudi](https://docs.habana.ai/en/latest/PyTorch/PyTorch_Mixed_Precision/Autocast.html) for more information regarding:
- the default autocast operations
- overriding the default autocast operations

</Tip>
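
For instance, a forward pass can be wrapped in an autocast context, and the default op lists can be overridden through environment variables. A minimal sketch, assuming the `hpu` device type is registered by the Habana PyTorch bridge and that `LOWER_LIST`/`FP32_LIST` point to text files listing one operator per line (the file paths here are hypothetical):

```python
import os

# Sketch only: override the default autocast op lists before any HPU work.
# Assumption: LOWER_LIST / FP32_LIST are read by Habana's autocast backend.
os.environ["LOWER_LIST"] = "/path/to/bf16_ops.txt"  # ops forced to bf16
os.environ["FP32_LIST"] = "/path/to/fp32_ops.txt"   # ops kept in fp32

import torch
import habana_frameworks.torch.core as htcore  # assumed to register the "hpu" device

device = torch.device("hpu")
model = torch.nn.Linear(16, 4).to(device)
batch = torch.randn(8, 16, device=device)

# Ops inside this context run in bf16 or fp32 according to the autocast lists.
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    outputs = model(batch)
```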

9 changes: 6 additions & 3 deletions examples/audio-classification/README.md
@@ -47,7 +47,8 @@ python run_audio_classification.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/wav2vec2 \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

On a single HPU, this script should run in ~13 minutes and yield an accuracy of **97.96%**.
@@ -83,7 +84,8 @@ python ../gaudi_spawn.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/wav2vec2 \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

On 8 HPUs, this script should run in ~12 minutes and yield an accuracy of **80.49%**.
@@ -157,7 +159,8 @@ python run_audio_classification.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/wav2vec2
--gaudi_config_name Habana/wav2vec2 \
--bf16
```


@@ -237,7 +237,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
9 changes: 6 additions & 3 deletions examples/contrastive-image-text/README.md
@@ -110,7 +110,8 @@ python run_clip.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/clip \
--throughput_warmup_steps 3 \
--dataloader_num_workers 16
--dataloader_num_workers 16 \
--bf16
```


@@ -141,7 +142,8 @@ python ../gaudi_spawn.py --world_size 8 --use_mpi run_clip.py \
--throughput_warmup_steps 3 \
--dataloader_num_workers 16 \
--mediapipe_dataloader \
--use_hpu_graphs_for_training
--use_hpu_graphs_for_training \
--bf16
```

> `--mediapipe_dataloader` only works on Gaudi2.
@@ -246,5 +248,6 @@ python run_clip.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/clip
--gaudi_config_name Habana/clip \
--bf16
```
2 changes: 1 addition & 1 deletion examples/contrastive-image-text/run_bridgetower.py
@@ -299,7 +299,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
2 changes: 1 addition & 1 deletion examples/contrastive-image-text/run_clip.py
@@ -301,7 +301,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
12 changes: 8 additions & 4 deletions examples/image-classification/README.md
@@ -47,7 +47,8 @@ python run_image_classification.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 3 \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
```

For Swin, you need to change/add the following arguments:
@@ -95,7 +96,8 @@ python run_image_classification.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 3 \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
```

Internally, the script will use the [`ImageFolder`](https://huggingface.co/docs/datasets/v2.0.0/en/image_process#imagefolder) feature which will automatically turn the folders into 🤗 Dataset objects.
@@ -196,7 +198,8 @@ python ../gaudi_spawn.py \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--throughput_warmup_steps 3 \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
```

For Swin, you need to change/add the following arguments:
@@ -279,4 +282,5 @@ python run_image_classification.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/vit \
--dataloader_num_workers 1
--dataloader_num_workers 1 \
--bf16
@@ -240,7 +240,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
15 changes: 10 additions & 5 deletions examples/language-modeling/README.md
@@ -178,7 +178,8 @@ python run_mlm.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

To run on your own training and validation files, use the following command:
Expand All @@ -197,7 +198,8 @@ python run_mlm.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

If your dataset is organized with one sample per line, you can use the `--line_by_line` flag (otherwise the script
@@ -223,7 +225,8 @@ python ../gaudi_spawn.py \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--gaudi_config_name Habana/roberta-base \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -247,7 +250,8 @@ python run_clm.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -315,7 +319,8 @@ python run_clm.py \
--gaudi_config_name Habana/gpt2 \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference
--use_hpu_graphs_for_inference \
--bf16
```


2 changes: 1 addition & 1 deletion examples/language-modeling/run_clm.py
@@ -311,7 +311,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
2 changes: 1 addition & 1 deletion examples/language-modeling/run_mlm.py
@@ -302,7 +302,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
12 changes: 8 additions & 4 deletions examples/question-answering/README.md
@@ -53,7 +53,8 @@ python run_qa.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -79,7 +80,8 @@ python ../gaudi_spawn.py \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```


@@ -148,7 +150,8 @@ python run_qa.py \
--output_dir /tmp/squad/ \
--use_habana \
--use_lazy_mode \
--use_hpu_graphs_for_inference
--use_hpu_graphs_for_inference \
--bf16
```


@@ -198,7 +201,8 @@ python run_seq2seq_qa.py \
--ignore_pad_token_for_loss False \
--pad_to_max_length \
--save_strategy epoch \
--throughput_warmup_steps 3
--throughput_warmup_steps 3 \
--bf16
```

For multi-card and DeepSpeed runs, you can use `python ../gaudi_spawn.py --world_size 8 --use_mpi` and `python ../gaudi_spawn.py --world_size 8 --use_deepspeed` as shown in the previous sections.
2 changes: 1 addition & 1 deletion examples/question-answering/run_qa.py
@@ -292,7 +292,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "
2 changes: 1 addition & 1 deletion examples/question-answering/run_seq2seq_qa.py
@@ -338,7 +338,7 @@ def main():
)

# Log on each process the small summary:
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast or gaudi_config.use_habana_mixed_precision
mixed_precision = training_args.bf16 or gaudi_config.use_torch_autocast
logger.warning(
f"Process rank: {training_args.local_rank}, device: {training_args.device}, "
+ f"distributed training: {training_args.parallel_mode.value == 'distributed'}, "