Sortformer Diarizer 4spk v1 model PR Part 4: Sortformer Documents and Notebook Tutorials #11707

Open · wants to merge 6 commits into main
8 changes: 8 additions & 0 deletions docs/source/asr/speaker_diarization/api.rst
@@ -12,9 +12,17 @@ Model Classes
:show-inheritance:
:members: add_speaker_model_config, _init_segmentation_info, _init_speaker_model, setup_training_data, setup_validation_data, setup_test_data, get_ms_emb_seq, get_cluster_avg_embs_model, get_ms_mel_feat, forward, forward_infer, training_step, validation_step, compute_accuracies

.. autoclass:: nemo.collections.asr.models.SortformerEncLabelModel
:show-inheritance:
:members: list_available_models, setup_training_data, setup_validation_data, setup_test_data, process_signal, forward, forward_infer, frontend_encoder, diarize, training_step, validation_step, multi_validation_epoch_end, _get_aux_train_evaluations, _get_aux_validation_evaluations, _init_loss_weights, _init_eval_metrics, _reset_train_metrics, _reset_valid_metrics, _setup_diarize_dataloader, _diarize_forward, _diarize_output_processing, test_batch, _get_aux_test_batch_evaluations, on_validation_epoch_end
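
A minimal usage sketch of the ``diarize`` API exposed by ``SortformerEncLabelModel`` is shown below. The pretrained checkpoint name, the audio path, and the exact keyword arguments are illustrative assumptions and may differ between NeMo versions.

.. code-block:: python

    # Illustrative sketch: load a Sortformer diarization checkpoint and run inference.
    # The checkpoint name and audio path are placeholders, not guaranteed identifiers.
    from nemo.collections.asr.models import SortformerEncLabelModel

    # from_pretrained() fetches a named checkpoint; restore_from() loads a local .nemo file.
    diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
    diar_model.eval()

    # diarize() accepts a list of audio file paths and returns speaker-labeled
    # segments for each input file.
    predicted_segments = diar_model.diarize(audio=["/path/to/session01.wav"], batch_size=1)
    print(predicted_segments)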

Mixins
------
.. autoclass:: nemo.collections.asr.parts.mixins.mixins.DiarizationMixin
:show-inheritance:
:members:

.. autoclass:: nemo.collections.asr.parts.mixins.mixins.diarization.SpkDiarizationMixin
:show-inheritance:
:members: diarize, diarize_generator, _diarize_on_begin, _diarize_input_processing, _diarize_input_manifest_processing, _setup_diarize_dataloader, _diarize_forward, _diarize_output_processing, _diarize_on_end, _input_audio_to_rttm_processing, get_value_from_diarization_config

275 changes: 257 additions & 18 deletions docs/source/asr/speaker_diarization/configs.rst

Large diffs are not rendered by default.

158 changes: 53 additions & 105 deletions docs/source/asr/speaker_diarization/datasets.rst
@@ -1,35 +1,17 @@
Datasets
========

This page covers formatting a dataset for diarization training and inference. To train or fine-tune the speaker diarization system, you can either train/fine-tune the speaker embedding extractor model separately, or train/fine-tune the speaker embedding extractor and neural diarizer together.

* To train or fine-tune a speaker embedding extractor model separately, refer to :doc:`Speech Classification Datasets <../speech_classification/datasets>` and :doc:`Speaker Recognition Datasets <../speaker_recognition/datasets>` for preparing datasets to train and validate VAD and speaker embedding models, respectively.


* To train or fine-tune the speaker embedding extractor and neural diarizer together, follow the dataset preparation process on this page.

Data Preparation for Training
-----------------------------

.. image:: images/msdd_train_and_infer.png
:align: center
:width: 800px
:alt: MSDD training and inference

As shown in the figure above, a full-fledged speaker diarization pipeline runs through a speaker embedding extractor, a clustering algorithm, and a neural diarizer. Note that only the speaker embedding extractor and the neural diarizer are trainable models, and they can be trained or fine-tuned together on diarization datasets. We recommend using a speaker embedding extractor model trained on a large amount of single-speaker data and then using it to train a neural diarizer model.

Speaker diarization training is also managed by Hydra configurations based on ``.yaml`` files, just as in other NeMo neural models. See :doc:`NeMo Speaker Diarization Configuration Files <./configs>` for setting up the input Hydra configuration file for speaker diarization. Input data should be provided in line-delimited JSON format as below:

* Create a manifest file for speaker diarization
Data Preparation for Speaker Diarization Training (For End-to-End Diarization)
------------------------------------------------------------------------------

Speaker diarization training and inference both require the same type of manifest files. This manifest file can be created by using the script in ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``. The following example shows how to run ``pathfiles_to_diarize_manifest.py`` by providing path list files.

.. code-block:: shell-session

    python NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py \
        --paths2audio_files='/path/to/audio_file_path_list.txt' \
        --paths2rttm_files='/path/to/rttm_file_list.txt' \
        --manifest_filepath='/path/to/manifest_filepath/train_manifest.json'

With the optional ``--add_duration`` flag, the duration of each audio file is computed and written into the manifest instead of ``null``:

.. code-block:: bash

    python NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py \
        --add_duration \
        --paths2audio_files='/path/to/audio_file_path_list.txt' \
        --paths2rttm_files='/path/to/rttm_file_list.txt' \
        --manifest_filepath='/path/to/manifest_filepath/train_manifest.json'


All three path arguments are required. Note that filenames must stay consistent across every field (key); only the filename extension changes. For example, if an audio file is named ``abcd01.wav``, the RTTM file should be named ``abcd01.rttm`` and the transcription file should be named ``abcd01.txt``.
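
The two path-list files are plain text files with one absolute path per line. Below is a minimal sketch of one way to generate them from a directory of paired ``.wav``/``.rttm`` files; the directory layout is an assumption.

.. code-block:: python

    # Sketch: build audio_file_path_list.txt and rttm_file_list.txt from a directory
    # containing matching abcd01.wav / abcd01.rttm pairs (layout is an assumption).
    from pathlib import Path

    data_dir = Path("/path/to/diarization_data")

    with open("audio_file_path_list.txt", "w") as f_wav, open("rttm_file_list.txt", "w") as f_rttm:
        for wav in sorted(data_dir.glob("*.wav")):
            rttm = wav.with_suffix(".rttm")  # same basename, only the extension differs
            if rttm.exists():
                f_wav.write(f"{wav}\n")
                f_rttm.write(f"{rttm}\n")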
@@ -46,7 +28,7 @@ To train a diarization model, one needs to provide Rich Transcription Time Marked

.. code-block:: bash

    SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>
    SPEAKER TS3012d.Mix-Headset 1 32.679 0.671 <NA> <NA> MTD046ID <NA> <NA>
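
Each ``SPEAKER`` line in an RTTM file carries the session name, channel, start time, duration, and speaker label in fixed whitespace-separated fields. A minimal parsing sketch (a simplified reader, not NeMo's internal parser):

.. code-block:: python

    # Sketch: read speaker segments (start, end, speaker) from an RTTM file.
    def read_rttm(rttm_path):
        segments = []
        with open(rttm_path) as f:
            for line in f:
                fields = line.split()
                if not fields or fields[0] != "SPEAKER":
                    continue
                start, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
                segments.append((start, start + duration, speaker))
        return segments

    # For the second RTTM line above, this returns [(32.679, 33.35, 'MTD046ID')].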


Make a list of RTTM files for the audio files you have in ``audio_file_path_list.txt``.
@@ -65,31 +47,46 @@ As an output file, ``train_manifest.json`` will have the following line for each

.. code-block:: bash

    {"audio_filepath": "/path/to/abcd01.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "/path/to/rttm/abcd01.rttm"}
    {"audio_filepath": "/path/to/abcd01.wav", "offset": 0, "duration": 90, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "/path/to/rttm/abcd01.rttm"}


For end-to-end speaker diarization training, the manifest file described in this section fulfills the requirements for the input manifest file.
For cascaded speaker diarization training (TS-VAD style), the manifest file should be further processed to generate session-wise manifest files.
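
If ``num_speakers`` is not known in advance, one way to fill it is to count the distinct speaker labels in the paired RTTM file. The sketch below is illustrative; ``pathfiles_to_diarize_manifest.py`` already produces these entries for you.

.. code-block:: python

    # Sketch: build one line of the training manifest, deriving num_speakers
    # from the distinct speaker labels found in the RTTM file.
    import json

    def count_rttm_speakers(rttm_path):
        with open(rttm_path) as f:
            return len({line.split()[7] for line in f if line.startswith("SPEAKER")})

    entry = {
        "audio_filepath": "/path/to/abcd01.wav",
        "offset": 0,
        "duration": None,  # null in JSON: use the full file duration
        "label": "infer",
        "text": "-",
        "num_speakers": count_rttm_speakers("/path/to/rttm/abcd01.rttm"),
        "rttm_filepath": "/path/to/rttm/abcd01.rttm",
    }
    print(json.dumps(entry))  # one line of the line-delimited JSON manifest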

* Manifest files for MSDD training

After generating a session-wise manifest file, we need to break down each session-wise manifest file into a split manifest file containing the start time and duration of the split samples, owing to memory capacity. More importantly, since MSDD uses only a pairwise (two-speaker) model and data samples, we need to split RTTM files if there are more than two speakers.
Manifest JSON files for MSDD (TS-VAD style model) Training
----------------------------------------------------------

This section covers formatting a dataset for cascaded diarization training (e.g., TS-VAD, MSDD). To train or fine-tune the speaker diarization system, you can either train/fine-tune the speaker embedding extractor model separately, or train/fine-tune the speaker embedding extractor and neural diarizer together.

* To train or fine-tune a speaker embedding extractor model separately, refer to :doc:`Speech Classification Datasets <../speech_classification/datasets>` and :doc:`Speaker Recognition Datasets <../speaker_recognition/datasets>` for preparing datasets to train and validate VAD and speaker embedding models, respectively.


.. image:: images/msdd_train_and_infer.png
:align: center
:width: 800px
:alt: MSDD training and inference

As shown in the figure above, a full-fledged speaker diarization pipeline runs through a speaker embedding extractor, a clustering algorithm, and a neural diarizer. Note that only the speaker embedding extractor and the neural diarizer are trainable models, and they can be trained or fine-tuned together on diarization datasets. We recommend using a speaker embedding extractor model trained on a large amount of single-speaker data and then using it to train a neural diarizer model.

For training MSDD, we need one more step: truncating the source manifest into even shorter chunks. After generating a session-wise manifest file, we need to break down each session-wise manifest file into a split manifest file containing the start time and duration of the split samples, owing to memory capacity. More importantly, since MSDD uses only a pairwise (two-speaker) model and data samples, we need to split RTTM files if there are more than two speakers.

Note that you should specify the window length and shift length of the base scale of your MSDD model when you generate the manifest file for training samples. More importantly, ``step_count`` determines how many steps (i.e., base-scale segments) are in a split data sample. If ``step_count`` is too large, you might not be able to fit a single sample in a batch.
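
As a rough rule of thumb, assuming base-scale segments of length ``window`` tiled every ``shift`` seconds, one split sample spans approximately ``window + (step_count - 1) * shift`` seconds of audio, so a smaller ``step_count`` keeps each sample (and its memory footprint) smaller:

.. code-block:: python

    # Sketch: approximate audio span covered by one split sample, assuming
    # base-scale segments of length `window` tiled every `shift` seconds.
    window, shift, step_count = 0.5, 0.25, 50
    approx_span_sec = window + (step_count - 1) * shift
    print(approx_span_sec)  # 12.75 seconds with the default settings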

.. code-block:: bash

    python NeMo/scripts/speaker_tasks/create_msdd_train_dataset.py \
        --input_manifest_path='path/to/train_manifest.json' \
        --output_manifest_path='path/to/train_manifest.50step.json' \
        --pairwise_rttm_output_folder='path/to/rttm_output_folder' \
        --window=0.5 \
        --shift=0.25 \
        --step_count=50

All arguments are required to generate a new manifest file. Specify a session-wise diarization manifest file with ``--input_manifest_path`` and an output file name with ``--output_manifest_path``. In the folder specified by ``--pairwise_rttm_output_folder``, the script creates multiple two-speaker RTTM files from the given RTTM file, and it creates a manifest file that contains only two speakers within the specified RTTM range.


For example, if ``abcd01.wav`` has three speakers (``1911,1988,192``), three RTTM files will be created: ``abcd01.1911_1988.rttm``, ``abcd01.1911_192.rttm`` and ``abcd01.1988_192.rttm``. Subsequently, segments will be generated only from the newly created two-speaker RTTM files.
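
The pairwise splitting idea can be sketched as follows; the actual logic lives in ``create_msdd_train_dataset.py``, and this fragment is only an illustration.

.. code-block:: python

    # Sketch: split a multi-speaker RTTM into per-pair RTTM files, keeping only
    # the SPEAKER lines that belong to each two-speaker combination.
    from itertools import combinations
    from pathlib import Path

    def split_rttm_pairwise(rttm_path, output_dir):
        with open(rttm_path) as f:
            lines = [line for line in f if line.startswith("SPEAKER")]
        speakers = sorted({line.split()[7] for line in lines})
        session = Path(rttm_path).stem
        for spk_a, spk_b in combinations(speakers, 2):
            pair_lines = [line for line in lines if line.split()[7] in (spk_a, spk_b)]
            out_path = Path(output_dir) / f"{session}.{spk_a}_{spk_b}.rttm"
            out_path.write_text("".join(pair_lines))

    # For abcd01.rttm with speakers 1911, 1988, and 192, this writes one RTTM per pair.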


Specify the ``window`` and ``shift`` of the base scale in your MSDD model. In this example, we use the default settings of ``window=0.5``, ``shift=0.25``, and ``step_count=50``. Here are example lines from the output file ``/path/to/train_manifest.50step.json``.

- Example manifest file ``train_manifest.50step.json``.
@@ -106,21 +103,21 @@ Prepare the msdd training dataset for both train and validation. After the train

.. code-block:: bash

    python ./multiscale_diar_decoder.py --config-path='../conf/neural_diarizer' --config-name='msdd_5scl_15_05_50Povl_256x3x32x2.yaml' \
        trainer.devices=1 \
        trainer.max_epochs=20 \
        model.base.diarizer.speaker_embeddings.model_path="titanet_large" \
        model.train_ds.manifest_filepath="<train_manifest_path>" \
        model.validation_ds.manifest_filepath="<dev_manifest_path>" \
        model.train_ds.emb_dir="<train_temp_dir>" \
        model.validation_ds.emb_dir="<dev_temp_dir>" \
        exp_manager.name='sample_train' \
        exp_manager.exp_dir='./msdd_exp'

In the above example training session, we use the ``titanet_large`` model as the pretrained speaker embedding model.

Data Preparation for Inference
------------------------------
Data Preparation for Diarization Inference: For Both End-to-End and Cascaded Systems
------------------------------------------------------------------------------------------

As in dataset preparation for diarization training, diarization inference is based on Hydra configurations specified by ``.yaml`` files. See :doc:`NeMo Speaker Diarization Configuration Files <./configs>` for setting up the input Hydra configuration file for speaker diarization inference. Input data should be provided in line-delimited JSON format as below:

@@ -132,11 +129,11 @@ In each line of the input manifest file, ``audio_filepath`` item is mandatory wh

.. code-block:: bash

    python pathfiles_to_diarize_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
        --paths2txt_files /path/to/transcript_file_path_list.txt \
        --paths2rttm_files /path/to/rttm_file_path_list.txt \
        --paths2uem_files /path/to/uem_file_path_list.txt \
        --paths2ctm_files /path/to/ctm_file_path_list.txt \
        --manifest_filepath /path/to/manifest_output/input_manifest.json

The ``--paths2audio_files`` and ``--manifest_filepath`` arguments are required. Note that filenames must stay consistent across every field (key); only the filename extension changes. For example, if an audio file is named ``abcd.wav``, the RTTM file should be named ``abcd.rttm`` and the transcription file should be named ``abcd.txt``.
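
Below is a sketch of writing one inference manifest line directly, assuming the usual NeMo diarization manifest keys; only ``audio_filepath`` is mandatory, and unavailable optional fields can be written as ``null``.

.. code-block:: python

    # Sketch: append one line to an inference manifest. Only audio_filepath is
    # mandatory; optional fields that are unknown are written as null.
    import json

    entry = {
        "audio_filepath": "/path/to/abcd.wav",
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "num_speakers": None,                   # unknown number of speakers
        "rttm_filepath": "/path/to/abcd.rttm",  # optional: reference annotation for evaluation
        "uem_filepath": None,                   # optional: regions to score
        "ctm_filepath": None,                   # optional: word alignments
    }
    with open("input_manifest.json", "a") as f:
        f.write(json.dumps(entry) + "\n")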
@@ -213,52 +210,3 @@ The following are descriptions about each field in an input manifest JSON file.

TS3012d.Mix-Headset 1 12.879 0.32 okay NA lex MTD046ID
TS3012d.Mix-Headset 1 13.203 0.24 yeah NA lex MTD046ID


Evaluation on Benchmark Datasets
--------------------------------

Reviewer comment (Collaborator): Should we keep a list of benchmark datasets here and maybe a table showcasing the performance of our models on these sets compared to current SOTA?

The following instructions help reproduce the expected diarization performance on two benchmark English dialogue datasets. The reported results are evaluated with a 0.25-second collar and without scoring overlapped speech. The evaluation uses oracle VAD derived from the RTTM files; therefore, the diarization error rate (DER) equals the confusion error rate, since oracle VAD introduces no missed detection or false alarm.
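
For a concrete sense of these scoring choices, the sketch below uses ``pyannote.metrics`` (an assumption; NeMo ships its own scoring utilities). In ``pyannote.metrics``, the ``collar`` argument is the total collar width around each boundary, so ``collar=0.5`` corresponds to 0.25 seconds on each side.

.. code-block:: python

    # Sketch (assumes pyannote.metrics is installed): DER with a 0.25 s collar on
    # each side and overlapped speech excluded, mirroring the setting described above.
    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    reference, hypothesis = Annotation(), Annotation()
    reference[Segment(0.0, 10.0)] = "spk_A"
    reference[Segment(10.0, 20.0)] = "spk_B"
    hypothesis[Segment(0.0, 11.0)] = "spk_1"
    hypothesis[Segment(11.0, 20.0)] = "spk_2"

    metric = DiarizationErrorRate(collar=0.5, skip_overlap=True)
    print(f"DER = {metric(reference, hypothesis):.3f}")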

AMI Meeting Corpus
~~~~~~~~~~~~~~~~~~

The following are the suggested parameters for reproducing the diarization performance on the `AMI <https://groups.inf.ed.ac.uk/ami/corpus/>`_ test set. This setting is based on the meeting-domain configuration in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml``.

.. code-block:: bash

    diarizer.manifest_filepath="/path/to/AMItest_input_manifest.json"
    diarizer.oracle_num_speakers=null # Evaluate with an unknown number of speakers
    diarizer.oracle_vad=True # Use oracle VAD extracted from RTTM files
    diarizer.collar=0.25
    diarizer.ignore_overlap=True
    diarizer.speaker_embeddings.model_path="titanet_large"

We provide a helper script to download the dataset and format it into a NeMo manifest.

.. code-block:: bash

    python scripts/data_processing/speaker_tasks/get_ami_data.py --manifest_filepath AMItest_input_manifest.json
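
The same overrides can also be applied programmatically. The sketch below assumes the clustering-diarizer workflow and that the config file path points at your local NeMo checkout; adjust both as needed.

.. code-block:: python

    # Sketch: apply the overrides above in Python and run clustering-based diarization.
    # The YAML path and output directory are assumptions; adjust them to your setup.
    from omegaconf import OmegaConf
    from nemo.collections.asr.models import ClusteringDiarizer

    cfg = OmegaConf.load("examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml")
    cfg.diarizer.manifest_filepath = "/path/to/AMItest_input_manifest.json"
    cfg.diarizer.out_dir = "/path/to/ami_diar_output"
    cfg.diarizer.oracle_vad = True            # use oracle VAD extracted from RTTM files
    cfg.diarizer.collar = 0.25
    cfg.diarizer.ignore_overlap = True
    cfg.diarizer.speaker_embeddings.model_path = "titanet_large"

    ClusteringDiarizer(cfg=cfg).diarize()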


CallHome American English Speech (CHAES), LDC97S42
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use the CH109 set, a subset of the CHAES dataset that has only two speakers per session.
The following are the suggested parameters for reproducing the diarization performance on the CH109 set; this setting is based on the telephonic-domain configuration in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/inference/diar_infer_telephonic.yaml``.

.. code-block:: bash

    diarizer.manifest_filepath="/path/to/ch109_input_manifest.json"
    diarizer.oracle_vad=True # Use oracle VAD extracted from RTTM files
    diarizer.collar=0.25
    diarizer.ignore_overlap=True
    diarizer.speaker_embeddings.model_path="titanet_large"


To evaluate the performance on the CH109 set, the following steps can help.
- Download the CHAES corpus from the LDC website `LDC97S42 <https://catalog.ldc.upenn.edu/LDC97S42>`_ (CHAES is not publicly available).
- Download the CH109 filename list (whitelist) from `CH109 whitelist <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/ch109_whitelist.txt>`_.
- Download RTTM files for the CH109 set from `CH109 RTTM files <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/split_rttms.tar.gz>`_.
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py`` (a whitelist-filtering sketch follows below).
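
A minimal filtering sketch for the whitelist step; file locations and naming are assumptions, so adjust them to your local layout.

.. code-block:: python

    # Sketch: keep only the CH109 sessions named in the whitelist when building
    # the audio path list for pathfiles_to_diarize_manifest.py.
    from pathlib import Path

    whitelist = set(Path("ch109_whitelist.txt").read_text().split())
    audio_dir = Path("/path/to/chaes_audio")  # directory with CHAES .wav files (assumption)

    with open("ch109_audio_file_path_list.txt", "w") as f:
        for wav in sorted(audio_dir.glob("*.wav")):
            if wav.stem in whitelist:  # assumes filenames match whitelist entries
                f.write(f"{wav}\n")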

Binary file modified docs/source/asr/speaker_diarization/images/asr_sd_diagram.png