From a7f28162936e2036b6223fe0739ffcae75cea9e9 Mon Sep 17 00:00:00 2001
From: Nikolay
Date: Thu, 11 May 2023 18:59:14 +0200
Subject: [PATCH 1/3] cmd to run JPQD in DDP mode

---
 examples/openvino/audio-classification/README.md | 6 +++---
 examples/openvino/image-classification/README.md | 4 ++--
 examples/openvino/question-answering/README.md   | 4 ++--
 examples/openvino/text-classification/README.md  | 3 ++-
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/examples/openvino/audio-classification/README.md b/examples/openvino/audio-classification/README.md
index 39896a6ac..562dfa096 100644
--- a/examples/openvino/audio-classification/README.md
+++ b/examples/openvino/audio-classification/README.md
@@ -18,7 +18,7 @@ limitations under the License.
 
 This folder contains [`run_audio_classification.py`](https://github.com/huggingface/optimum/blob/main/examples/openvino/audio-classification/run_audio_classification.py), a script to fine-tune a 🤗 Transformers model on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset while applying Quantization-Aware Training (QAT). QAT can be easily applied by replacing the Transformers [`Trainer`](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#trainer) with the Optimum [`OVTrainer`]. Any model from our [hub](https://huggingface.co/models) can be fine-tuned and quantized, as long as the model is supported by the [`AutoModelForAudioClassification`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForAudioClassification) API.
 
-### Fintuning Wav2Vec2 on Keyword Spotting with QAT
+### Fine-tuning Wav2Vec2 on Keyword Spotting with QAT
 
 The following command shows how to fine-tune [Wav2Vec2-base](https://huggingface.co/facebook/wav2vec2-base) on the 🗣️ [Keyword Spotting subset](https://huggingface.co/datasets/superb#ks) of the SUPERB dataset with Quantization-Aware Training (QAT). The `OVTrainer` uses a default quantization configuration which should work in many cases, but we can also customize the algorithm details. Here, we quantize the Wav2Vec2-base model with a custom configuration file specified by `--nncf_compression_config`. For more details on the quantization configuration, see NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/docs/compression_algorithms/Quantization.md).
@@ -60,7 +60,7 @@ On a single V100 GPU, this script should run in ~45 minutes and yield a quantize
 `OVTrainer` also provides advanced optimization workflow via NNCF to structurally prune, quantize and distill. Following is an example of joint pruning, quantization and distillation on Wav2Vec2-base model for keyword spotting task. To enable JPQD optimization, use an alternative configuration specified with `--nncf_compression_config`. For more details on how to configure the pruning algorithm, see NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md).
 
 ```bash
-python run_audio_classification.py \
+torchrun --nproc-per-node=1 run_audio_classification.py \
     --model_name_or_path facebook/wav2vec2-base \
     --teacher_model_name_or_path anton-l/wav2vec2-base-ft-keyword-spotting \
     --nncf_compression_config configs/wav2vec2-base-jpqd.json \
@@ -92,4 +92,4 @@ python run_audio_classification.py \
     --seed 0
 ```
 
-This script should take about 3 hours on a single V100 GPU and produce a quantized Wav2Vec2-base model with ~80% structured sparsity in its linear layers. The model accuracy should converge to about 97.5%.
+This script should take about 3 hours on a single V100 GPU and produce a quantized Wav2Vec2-base model with ~80% structured sparsity in its linear layers. The model accuracy should converge to about 97.5%. For launching script on multiple GPU specify `--nproc-per-node=`.
diff --git a/examples/openvino/image-classification/README.md b/examples/openvino/image-classification/README.md
index 6b42fdbe4..37e080e45 100644
--- a/examples/openvino/image-classification/README.md
+++ b/examples/openvino/image-classification/README.md
@@ -48,7 +48,7 @@ On a single V100 GPU, this example takes about 1 minute and yields a quantized m
 `OVTrainer` also provides advanced optimization workflow via NNCF to structurally prune, quantize and distill. Following is an example of joint pruning, quantization and distillation on Swin-base model for food101 dataset. To enable JPQD optimization, use an alternative configuration specified with `--nncf_compression_config`. For more details on how to configure the pruning algorithm, see NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md).
 
 ```bash
-python run_image_classification.py \
+torchrun --nproc-per-node=1 run_image_classification.py \
     --model_name_or_path microsoft/swin-base-patch4-window7-224 \
     --teacher_model_name_or_path skylord/swin-finetuned-food101 \
     --distillation_weight 0.9 \
@@ -75,4 +75,4 @@ python run_image_classification.py \
     --nncf_compression_config configs/swin-base-jpqd.json
 ```
 
-This example results in a quantized swin-base model with ~40% sparsity in its linear layers of the transformer blocks, giving 90.7% accuracy on food101 and taking about 12.5 hours on a single V100 GPU.
+This example results in a quantized swin-base model with ~40% sparsity in its linear layers of the transformer blocks, giving 90.7% accuracy on food101 and taking about 12.5 hours on a single V100 GPU. For launching script on multiple GPU specify `--nproc-per-node=`.
diff --git a/examples/openvino/question-answering/README.md b/examples/openvino/question-answering/README.md
index 24ac373c6..35a498caa 100644
--- a/examples/openvino/question-answering/README.md
+++ b/examples/openvino/question-answering/README.md
@@ -47,7 +47,7 @@ python run_qa.py \
 ```
 
 ### Joint Pruning, Quantization and Distillation (JPQD) for BERT on SQuAD1.0
-`OVTrainer` also provides an advanced optimization workflow through the NNCF when Transformer model can be structurally pruned along with 8-bit quantization and distillation. Below is an example which demonstrates how to jointly prune, quantize BERT-base for SQuAD 1.0 using NNCF config `--nncf_compression_config` and distill from BERT-large teacher. This example closely resembles the movement sparsification work of [Lagunas et al., 2021, Block Pruning For Faster Transformers](https://arxiv.org/pdf/2109.04838.pdf). This example takes about 12 hours with a single V100 GPU and ~40% of the weights of the Transformer blocks were pruned.
+`OVTrainer` also provides an advanced optimization workflow through the NNCF when Transformer model can be structurally pruned along with 8-bit quantization and distillation. Below is an example which demonstrates how to jointly prune, quantize BERT-base for SQuAD 1.0 using NNCF config `--nncf_compression_config` and distill from BERT-large teacher. This example closely resembles the movement sparsification work of [Lagunas et al., 2021, Block Pruning For Faster Transformers](https://arxiv.org/pdf/2109.04838.pdf). This example takes about 12 hours with a single V100 GPU and ~40% of the weights of the Transformer blocks were pruned. For launching script on multiple GPU specify `--nproc-per-node=`.
 
 More on how to configure movement sparsity, see NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md).
 
@@ -57,7 +57,7 @@ To run the JPQD example, please install optimum-intel from source. This command
 ```
 
 ```bash
-python run_qa.py \
+torchrun --nproc-per-node=1 run_qa.py \
     --model_name_or_path bert-base-uncased \
     --dataset_name squad \
     --teacher_model_name_or_path bert-large-uncased-whole-word-masking-finetuned-squad \
diff --git a/examples/openvino/text-classification/README.md b/examples/openvino/text-classification/README.md
index d10d8f743..47faf87e7 100644
--- a/examples/openvino/text-classification/README.md
+++ b/examples/openvino/text-classification/README.md
@@ -58,7 +58,7 @@ To run the JPQD example, please install optimum-intel from source. This command
 
 ```bash
 TASK_NAME=sst2
-python run_glue.py \
+torchrun --nproc-per-node=1 run_glue.py \
     --model_name_or_path bert-base-uncased \
     --task_name $TASK_NAME \
     --teacher_model_name_or_path yoshitomo-matsubara/bert-large-uncased-sst2 \
@@ -83,3 +83,4 @@ python run_glue.py \
 ```
 
 On a single V100 GPU, this script should run in ~1.8 hours, and yield accuracy of **92.2%** with ~40% of the weights of the Transformer blocks pruned.
+For launching script on multiple GPU specify `--nproc-per-node=`.

From 79d92a35e30a3fb4114d412b1f2b0cc728f07400 Mon Sep 17 00:00:00 2001
From: Nikolay
Date: Fri, 19 May 2023 14:45:01 +0200
Subject: [PATCH 2/3] suggestion from Alexander and note about hyperparameters tuning

---
 examples/openvino/audio-classification/README.md | 2 +-
 examples/openvino/image-classification/README.md | 2 +-
 examples/openvino/question-answering/README.md   | 5 +++--
 examples/openvino/text-classification/README.md  | 2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/examples/openvino/audio-classification/README.md b/examples/openvino/audio-classification/README.md
index 562dfa096..1a6f1ddce 100644
--- a/examples/openvino/audio-classification/README.md
+++ b/examples/openvino/audio-classification/README.md
@@ -92,4 +92,4 @@ torchrun --nproc-per-node=1 run_audio_classification.py \
     --seed 0
 ```
 
-This script should take about 3 hours on a single V100 GPU and produce a quantized Wav2Vec2-base model with ~80% structured sparsity in its linear layers. The model accuracy should converge to about 97.5%. For launching script on multiple GPU specify `--nproc-per-node=`.
+This script should take about 3 hours on a single V100 GPU and produce a quantized Wav2Vec2-base model with ~80% structured sparsity in its linear layers. The model accuracy should converge to about 97.5%. For launching the script on multiple GPUs, specify `--nproc-per-node=`. Note that a different batch size and other hyperparameters might be required to achieve the same results as on a single GPU.
diff --git a/examples/openvino/image-classification/README.md b/examples/openvino/image-classification/README.md
index 37e080e45..25d7cbc54 100644
--- a/examples/openvino/image-classification/README.md
+++ b/examples/openvino/image-classification/README.md
@@ -75,4 +75,4 @@ torchrun --nproc-per-node=1 run_image_classification.py \
     --nncf_compression_config configs/swin-base-jpqd.json
 ```
 
-This example results in a quantized swin-base model with ~40% sparsity in its linear layers of the transformer blocks, giving 90.7% accuracy on food101 and taking about 12.5 hours on a single V100 GPU. For launching script on multiple GPU specify `--nproc-per-node=`.
+This example results in a quantized swin-base model with ~40% sparsity in its linear layers of the transformer blocks, giving 90.7% accuracy on food101 and taking about 12.5 hours on a single V100 GPU. For launching the script on multiple GPUs, specify `--nproc-per-node=`. Note that a different batch size and other hyperparameters might be required to achieve the same results as on a single GPU.
diff --git a/examples/openvino/question-answering/README.md b/examples/openvino/question-answering/README.md
index 35a498caa..ae54086be 100644
--- a/examples/openvino/question-answering/README.md
+++ b/examples/openvino/question-answering/README.md
@@ -47,13 +47,14 @@ python run_qa.py \
 ```
 
 ### Joint Pruning, Quantization and Distillation (JPQD) for BERT on SQuAD1.0
-`OVTrainer` also provides an advanced optimization workflow through the NNCF when Transformer model can be structurally pruned along with 8-bit quantization and distillation. Below is an example which demonstrates how to jointly prune, quantize BERT-base for SQuAD 1.0 using NNCF config `--nncf_compression_config` and distill from BERT-large teacher. This example closely resembles the movement sparsification work of [Lagunas et al., 2021, Block Pruning For Faster Transformers](https://arxiv.org/pdf/2109.04838.pdf). This example takes about 12 hours with a single V100 GPU and ~40% of the weights of the Transformer blocks were pruned. For launching script on multiple GPU specify `--nproc-per-node=`.
+`OVTrainer` also provides an advanced optimization workflow through the NNCF when Transformer model can be structurally pruned along with 8-bit quantization and distillation. Below is an example which demonstrates how to jointly prune, quantize BERT-base for SQuAD 1.0 using NNCF config `--nncf_compression_config` and distill from BERT-large teacher. This example closely resembles the movement sparsification work of [Lagunas et al., 2021, Block Pruning For Faster Transformers](https://arxiv.org/pdf/2109.04838.pdf). This example takes about 12 hours with a single V100 GPU and ~40% of the weights of the Transformer blocks were pruned. For launching the script on multiple GPUs, specify `--nproc-per-node=`. Note that a different batch size and other hyperparameters might be required to achieve the same results as on a single GPU.
 
 More on how to configure movement sparsity, see NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md).
 
 To run the JPQD example, please install optimum-intel from source. This command will install or upgrade optimum-intel and all necessary dependencies:
 
-```python -m pip install --upgrade "git+https://github.com/huggingface/optimum-intel.git#egg=optimum-intel[openvino, nncf]"
+```python
+python -m pip install --upgrade "git+https://github.com/huggingface/optimum-intel.git#egg=optimum-intel[openvino, nncf]"
 ```
 
 ```bash
diff --git a/examples/openvino/text-classification/README.md b/examples/openvino/text-classification/README.md
index 47faf87e7..0128220c8 100644
--- a/examples/openvino/text-classification/README.md
+++ b/examples/openvino/text-classification/README.md
@@ -83,4 +83,4 @@ torchrun --nproc-per-node=1 run_glue.py \
 ```
 
 On a single V100 GPU, this script should run in ~1.8 hours, and yield accuracy of **92.2%** with ~40% of the weights of the Transformer blocks pruned.
-For launching script on multiple GPU specify `--nproc-per-node=`.
+For launching the script on multiple GPUs, specify `--nproc-per-node=`. Note that a different batch size and other hyperparameters might be required to achieve the same results as on a single GPU.

From cfca9d4fbfc3fc6750a416e472d1abf23da0b194 Mon Sep 17 00:00:00 2001
From: Lyalyushkin Nikolay
Date: Fri, 19 May 2023 15:08:45 +0200
Subject: [PATCH 3/3] removed not needed instructions

---
 examples/openvino/question-answering/README.md | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/examples/openvino/question-answering/README.md b/examples/openvino/question-answering/README.md
index ae54086be..c57d332e6 100644
--- a/examples/openvino/question-answering/README.md
+++ b/examples/openvino/question-answering/README.md
@@ -51,12 +51,6 @@ python run_qa.py \
 
 More on how to configure movement sparsity, see NNCF documentation [here](https://github.com/openvinotoolkit/nncf/blob/develop/nncf/experimental/torch/sparsity/movement/MovementSparsity.md).
 
-To run the JPQD example, please install optimum-intel from source. This command will install or upgrade optimum-intel and all necessary dependencies:
-
-```python
-python -m pip install --upgrade "git+https://github.com/huggingface/optimum-intel.git#egg=optimum-intel[openvino, nncf]"
-```
-
 ```bash
 torchrun --nproc-per-node=1 run_qa.py \
     --model_name_or_path bert-base-uncased \
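For illustration only (this is not part of the patches above), a filled-in multi-GPU launch of the audio-classification example might look like the sketch below. The GPU count of 4 is an assumed value for the reader's machine, and the remaining training arguments (dataset, batch size, output directory, and so on) are the ones from the README command the patches modify; as the added note says, hyperparameters may need retuning to match single-GPU results.

```bash
# Sketch of a DDP launch on an assumed 4-GPU machine; torchrun starts one
# worker process per GPU and the script then runs JPQD training under DDP.
# The remaining training arguments are omitted here and follow the README
# command shown in the patch; per-device batch size may need retuning
# compared to a single-GPU run.
torchrun --nproc-per-node=4 run_audio_classification.py \
    --model_name_or_path facebook/wav2vec2-base \
    --teacher_model_name_or_path anton-l/wav2vec2-base-ft-keyword-spotting \
    --nncf_compression_config configs/wav2vec2-base-jpqd.json \
    --seed 0
```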