Sortformer Diarizer 4spk v1 model PR Part 4: Sortformer Documents and Notebook Tutorials #11707

Open · wants to merge 6 commits into main
8 changes: 8 additions & 0 deletions docs/source/asr/speaker_diarization/api.rst
@@ -12,9 +12,17 @@ Model Classes
:show-inheritance:
:members: add_speaker_model_config, _init_segmentation_info, _init_speaker_model, setup_training_data, setup_validation_data, setup_test_data, get_ms_emb_seq, get_cluster_avg_embs_model, get_ms_mel_feat, forward, forward_infer, training_step, validation_step, compute_accuracies

.. autoclass:: nemo.collections.asr.models.SortformerEncLabelModel
:show-inheritance:
:members: list_available_models, setup_training_data, setup_validation_data, setup_test_data, process_signal, forward, forward_infer, frontend_encoder, diarize, training_step, validation_step, multi_validation_epoch_end, _get_aux_train_evaluations, _get_aux_validation_evaluations, _init_loss_weights, _init_eval_metrics, _reset_train_metrics, _reset_valid_metrics, _setup_diarize_dataloader, _diarize_forward, _diarize_output_processing, test_batch, _get_aux_test_batch_evaluations, on_validation_epoch_end
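
A minimal usage sketch of the ``diarize`` API exposed by ``SortformerEncLabelModel`` is shown below. The pretrained checkpoint name, the audio path, and the exact keyword arguments are illustrative assumptions and may differ between NeMo versions.

.. code-block:: python

    # Illustrative sketch: load a Sortformer diarization checkpoint and run inference.
    # The checkpoint name and audio path are placeholders, not guaranteed identifiers.
    from nemo.collections.asr.models import SortformerEncLabelModel

    # from_pretrained() fetches a named checkpoint; restore_from() loads a local .nemo file.
    diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_sortformer_4spk-v1")
    diar_model.eval()

    # diarize() accepts a list of audio file paths and returns speaker-labeled
    # segments for each input file.
    predicted_segments = diar_model.diarize(audio=["/path/to/session01.wav"], batch_size=1)
    print(predicted_segments)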

Mixins
------
.. autoclass:: nemo.collections.asr.parts.mixins.mixins.DiarizationMixin
:show-inheritance:
:members:

.. autoclass:: nemo.collections.asr.parts.mixins.mixins.diarization.SpkDiarizationMixin
:show-inheritance:
:members: diarize, diarize_generator, _diarize_on_begin, _diarize_input_processing, _diarize_input_manifest_processing, _setup_diarize_dataloader, _diarize_forward, _diarize_output_processing, _diarize_on_end, _input_audio_to_rttm_processing, get_value_from_diarization_config

275 changes: 257 additions & 18 deletions docs/source/asr/speaker_diarization/configs.rst

Large diffs are not rendered by default.

158 changes: 53 additions & 105 deletions docs/source/asr/speaker_diarization/datasets.rst
@@ -1,35 +1,17 @@
Datasets
========

This page covers formatting a dataset for diarization training and inference. To train or fine-tune the speaker diarization system, you can either train/fine-tune the speaker embedding extractor model separately, or train/fine-tune the speaker embedding extractor and neural diarizer together.

* To train or fine-tune a speaker embedding extractor model separately, refer to :doc:`Speech Classification Datasets <../speech_classification/datasets>` and :doc:`Speaker Recognition Datasets <../speaker_recognition/datasets>` for preparing datasets to train and validate VAD and speaker embedding models, respectively.


* To train or fine-tune the speaker embedding extractor and neural diarizer together, follow the dataset preparation process on this page.

Data Preparation for Training
-----------------------------

.. image:: images/msdd_train_and_infer.png
:align: center
:width: 800px
:alt: MSDD training and inference

As shown in the figure above, a full-fledged speaker diarization pipeline runs through a speaker embedding extractor, a clustering algorithm, and a neural diarizer. Note that only the speaker embedding extractor and the neural diarizer are trainable models, and they can be trained or fine-tuned together on diarization datasets. We recommend using a speaker embedding extractor model trained on a large amount of single-speaker data and then using it to train a neural diarizer model.

Speaker diarization training is also managed by Hydra configurations based on ``.yaml`` files, just as in other NeMo neural models. See :doc:`NeMo Speaker Diarization Configuration Files <./configs>` for setting up the input Hydra configuration file for speaker diarization. Input data should be provided in line-delimited JSON format as below:

* Create a manifest file for speaker diarization
Data Preparation for Speaker Diarization Training (For End-to-End Diarization)
------------------------------------------------------------------------------

Speaker diarization training and inference both require the same type of manifest files. This manifest file can be created by using the script in ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py``. The following example shows how to run ``pathfiles_to_diarize_manifest.py`` by providing path list files.

.. code-block:: shell-session

    python NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py \
        --paths2audio_files='/path/to/audio_file_path_list.txt' \
        --paths2rttm_files='/path/to/rttm_file_list.txt' \
        --manifest_filepath='/path/to/manifest_filepath/train_manifest.json'

With the optional ``--add_duration`` flag, the duration of each audio file is computed and written into the manifest instead of ``null``:

.. code-block:: bash

    python NeMo/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py \
        --add_duration \
        --paths2audio_files='/path/to/audio_file_path_list.txt' \
        --paths2rttm_files='/path/to/rttm_file_list.txt' \
        --manifest_filepath='/path/to/manifest_filepath/train_manifest.json'


All three path arguments are required. Note that filenames must stay consistent across every field (key); only the filename extension changes. For example, if an audio file is named ``abcd01.wav``, the RTTM file should be named ``abcd01.rttm`` and the transcription file should be named ``abcd01.txt``.
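
The two path-list files are plain text files with one absolute path per line. Below is a minimal sketch of one way to generate them from a directory of paired ``.wav``/``.rttm`` files; the directory layout is an assumption.

.. code-block:: python

    # Sketch: build audio_file_path_list.txt and rttm_file_list.txt from a directory
    # containing matching abcd01.wav / abcd01.rttm pairs (layout is an assumption).
    from pathlib import Path

    data_dir = Path("/path/to/diarization_data")

    with open("audio_file_path_list.txt", "w") as f_wav, open("rttm_file_list.txt", "w") as f_rttm:
        for wav in sorted(data_dir.glob("*.wav")):
            rttm = wav.with_suffix(".rttm")  # same basename, only the extension differs
            if rttm.exists():
                f_wav.write(f"{wav}\n")
                f_rttm.write(f"{rttm}\n")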
@@ -46,7 +28,7 @@ To train a diarization model, one needs to provide Rich Transcription Time Marked

.. code-block:: bash

    SPEAKER TS3012d.Mix-Headset 1 331.573 0.671 <NA> <NA> MTD046ID <NA> <NA>
    SPEAKER TS3012d.Mix-Headset 1 32.679 0.671 <NA> <NA> MTD046ID <NA> <NA>
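
Each ``SPEAKER`` line in an RTTM file carries the session name, channel, start time, duration, and speaker label in fixed whitespace-separated fields. A minimal parsing sketch (a simplified reader, not NeMo's internal parser):

.. code-block:: python

    # Sketch: read speaker segments (start, end, speaker) from an RTTM file.
    def read_rttm(rttm_path):
        segments = []
        with open(rttm_path) as f:
            for line in f:
                fields = line.split()
                if not fields or fields[0] != "SPEAKER":
                    continue
                start, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
                segments.append((start, start + duration, speaker))
        return segments

    # For the second RTTM line above, this returns [(32.679, 33.35, 'MTD046ID')].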


Make a list of RTTM files for the audio files you have in ``audio_file_path_list.txt``.
@@ -65,31 +47,46 @@ As an output file, ``train_manifest.json`` will have the following line for each

.. code-block:: bash

    {"audio_filepath": "/path/to/abcd01.wav", "offset": 0, "duration": null, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "/path/to/rttm/abcd01.rttm"}
    {"audio_filepath": "/path/to/abcd01.wav", "offset": 0, "duration": 90, "label": "infer", "text": "-", "num_speakers": 2, "rttm_filepath": "/path/to/rttm/abcd01.rttm"}


For end-to-end speaker diarization training, the manifest file described in this section fulfills the requirements for the input manifest file.
For cascaded speaker diarization training (TS-VAD style), the manifest file should be further processed to generate session-wise manifest files.
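
If ``num_speakers`` is not known in advance, one way to fill it is to count the distinct speaker labels in the paired RTTM file. The sketch below is illustrative; ``pathfiles_to_diarize_manifest.py`` already produces these entries for you.

.. code-block:: python

    # Sketch: build one line of the training manifest, deriving num_speakers
    # from the distinct speaker labels found in the RTTM file.
    import json

    def count_rttm_speakers(rttm_path):
        with open(rttm_path) as f:
            return len({line.split()[7] for line in f if line.startswith("SPEAKER")})

    entry = {
        "audio_filepath": "/path/to/abcd01.wav",
        "offset": 0,
        "duration": None,  # null in JSON: use the full file duration
        "label": "infer",
        "text": "-",
        "num_speakers": count_rttm_speakers("/path/to/rttm/abcd01.rttm"),
        "rttm_filepath": "/path/to/rttm/abcd01.rttm",
    }
    print(json.dumps(entry))  # one line of the line-delimited JSON manifest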

* Manifest files for MSDD training

After generating a session-wise manifest file, we need to break down each session-wise manifest file into a split manifest file containing the start time and duration of the split samples, owing to memory capacity. More importantly, since MSDD uses only a pairwise (two-speaker) model and data samples, we need to split RTTM files if there are more than two speakers.
Manifest JSON files for MSDD (TS-VAD style model) Training
----------------------------------------------------------

This section covers formatting a dataset for cascaded diarization training (e.g., TS-VAD, MSDD). To train or fine-tune the speaker diarization system, you can either train/fine-tune the speaker embedding extractor model separately, or train/fine-tune the speaker embedding extractor and neural diarizer together.

* To train or fine-tune a speaker embedding extractor model separately, refer to :doc:`Speech Classification Datasets <../speech_classification/datasets>` and :doc:`Speaker Recognition Datasets <../speaker_recognition/datasets>` for preparing datasets to train and validate VAD and speaker embedding models, respectively.


.. image:: images/msdd_train_and_infer.png
:align: center
:width: 800px
:alt: MSDD training and inference

As shown in the figure above, a full-fledged speaker diarization pipeline runs through a speaker embedding extractor, a clustering algorithm, and a neural diarizer. Note that only the speaker embedding extractor and the neural diarizer are trainable models, and they can be trained or fine-tuned together on diarization datasets. We recommend using a speaker embedding extractor model trained on a large amount of single-speaker data and then using it to train a neural diarizer model.

For training MSDD, we need one more step: truncating the source manifest into even shorter chunks. After generating a session-wise manifest file, we need to break down each session-wise manifest file into a split manifest file containing the start time and duration of the split samples, owing to memory capacity. More importantly, since MSDD uses only a pairwise (two-speaker) model and data samples, we need to split RTTM files if there are more than two speakers.

Note that you should specify the window length and shift length of the base scale of your MSDD model when you generate the manifest file for training samples. More importantly, ``step_count`` determines how many steps (i.e., base-scale segments) are in a split data sample. If ``step_count`` is too large, you might not be able to fit a single sample in a batch.
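
As a rough rule of thumb, assuming base-scale segments of length ``window`` tiled every ``shift`` seconds, one split sample spans approximately ``window + (step_count - 1) * shift`` seconds of audio, so a smaller ``step_count`` keeps each sample (and its memory footprint) smaller:

.. code-block:: python

    # Sketch: approximate audio span covered by one split sample, assuming
    # base-scale segments of length `window` tiled every `shift` seconds.
    window, shift, step_count = 0.5, 0.25, 50
    approx_span_sec = window + (step_count - 1) * shift
    print(approx_span_sec)  # 12.75 seconds with the default settings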

.. code-block:: bash

    python NeMo/scripts/speaker_tasks/create_msdd_train_dataset.py \
        --input_manifest_path='path/to/train_manifest.json' \
        --output_manifest_path='path/to/train_manifest.50step.json' \
        --pairwise_rttm_output_folder='path/to/rttm_output_folder' \
        --window=0.5 \
        --shift=0.25 \
        --step_count=50

All arguments are required to generate a new manifest file. Specify a session-wise diarization manifest file with ``--input_manifest_path`` and an output file name with ``--output_manifest_path``. In the folder specified by ``--pairwise_rttm_output_folder``, the script creates multiple two-speaker RTTM files from the given RTTM file, and it creates a manifest file that contains only two speakers within the specified RTTM range.


For example, if ``abcd01.wav`` has three speakers (``1911,1988,192``), three RTTM files will be created: ``abcd01.1911_1988.rttm``, ``abcd01.1911_192.rttm`` and ``abcd01.1988_192.rttm``. Subsequently, segments will be generated only from the newly created two-speaker RTTM files.
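
The pairwise splitting idea can be sketched as follows; the actual logic lives in ``create_msdd_train_dataset.py``, and this fragment is only an illustration.

.. code-block:: python

    # Sketch: split a multi-speaker RTTM into per-pair RTTM files, keeping only
    # the SPEAKER lines that belong to each two-speaker combination.
    from itertools import combinations
    from pathlib import Path

    def split_rttm_pairwise(rttm_path, output_dir):
        with open(rttm_path) as f:
            lines = [line for line in f if line.startswith("SPEAKER")]
        speakers = sorted({line.split()[7] for line in lines})
        session = Path(rttm_path).stem
        for spk_a, spk_b in combinations(speakers, 2):
            pair_lines = [line for line in lines if line.split()[7] in (spk_a, spk_b)]
            out_path = Path(output_dir) / f"{session}.{spk_a}_{spk_b}.rttm"
            out_path.write_text("".join(pair_lines))

    # For abcd01.rttm with speakers 1911, 1988, and 192, this writes one RTTM per pair.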


Specify the ``window`` and ``shift`` of the base scale in your MSDD model. In this example, we use the default settings of ``window=0.5``, ``shift=0.25``, and ``step_count=50``. Here are example lines from the output file ``/path/to/train_manifest.50step.json``.

- Example manifest file ``train_manifest.50step.json``.
@@ -106,21 +103,21 @@ Prepare the msdd training dataset for both train and validation. After the train

.. code-block:: bash

    python ./multiscale_diar_decoder.py --config-path='../conf/neural_diarizer' --config-name='msdd_5scl_15_05_50Povl_256x3x32x2.yaml' \
        trainer.devices=1 \
        trainer.max_epochs=20 \
        model.base.diarizer.speaker_embeddings.model_path="titanet_large" \
        model.train_ds.manifest_filepath="<train_manifest_path>" \
        model.validation_ds.manifest_filepath="<dev_manifest_path>" \
        model.train_ds.emb_dir="<train_temp_dir>" \
        model.validation_ds.emb_dir="<dev_temp_dir>" \
        exp_manager.name='sample_train' \
        exp_manager.exp_dir='./msdd_exp'

In the above example training session, we use the ``titanet_large`` model as the pretrained speaker embedding model.

Data Preparation for Inference
------------------------------
Data Preparation for Diarization Inference: For Both End-to-End and Cascaded Systems
------------------------------------------------------------------------------------------

As in dataset preparation for diarization training, diarization inference is based on Hydra configurations specified by ``.yaml`` files. See :doc:`NeMo Speaker Diarization Configuration Files <./configs>` for setting up the input Hydra configuration file for speaker diarization inference. Input data should be provided in line-delimited JSON format as below:

@@ -132,11 +129,11 @@ In each line of the input manifest file, ``audio_filepath`` item is mandatory wh

.. code-block:: bash

    python pathfiles_to_diarize_manifest.py --paths2audio_files /path/to/audio_file_path_list.txt \
        --paths2txt_files /path/to/transcript_file_path_list.txt \
        --paths2rttm_files /path/to/rttm_file_path_list.txt \
        --paths2uem_files /path/to/uem_file_path_list.txt \
        --paths2ctm_files /path/to/ctm_file_path_list.txt \
        --manifest_filepath /path/to/manifest_output/input_manifest.json

The ``--paths2audio_files`` and ``--manifest_filepath`` arguments are required. Note that filenames must stay consistent across every field (key); only the filename extension changes. For example, if an audio file is named ``abcd.wav``, the RTTM file should be named ``abcd.rttm`` and the transcription file should be named ``abcd.txt``.
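
Below is a sketch of writing one inference manifest line directly, assuming the usual NeMo diarization manifest keys; only ``audio_filepath`` is mandatory, and unavailable optional fields can be written as ``null``.

.. code-block:: python

    # Sketch: append one line to an inference manifest. Only audio_filepath is
    # mandatory; optional fields that are unknown are written as null.
    import json

    entry = {
        "audio_filepath": "/path/to/abcd.wav",
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "num_speakers": None,                   # unknown number of speakers
        "rttm_filepath": "/path/to/abcd.rttm",  # optional: reference annotation for evaluation
        "uem_filepath": None,                   # optional: regions to score
        "ctm_filepath": None,                   # optional: word alignments
    }
    with open("input_manifest.json", "a") as f:
        f.write(json.dumps(entry) + "\n")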
@@ -213,52 +210,3 @@ The following are descriptions about each field in an input manifest JSON file.

TS3012d.Mix-Headset 1 12.879 0.32 okay NA lex MTD046ID
TS3012d.Mix-Headset 1 13.203 0.24 yeah NA lex MTD046ID


Evaluation on Benchmark Datasets
--------------------------------

Reviewer comment (Collaborator): Should we keep a list of benchmark datasets here and maybe a table showcasing the performance of our models on these sets compared to current SOTA?

The following instructions help reproduce the expected diarization performance on two benchmark English dialogue datasets. The reported results are evaluated with a 0.25-second collar and without scoring overlapped speech. The evaluation uses oracle VAD derived from the RTTM files; therefore, the diarization error rate (DER) equals the confusion error rate, since oracle VAD introduces no missed detection or false alarm.
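
For a concrete sense of these scoring choices, the sketch below uses ``pyannote.metrics`` (an assumption; NeMo ships its own scoring utilities). In ``pyannote.metrics``, the ``collar`` argument is the total collar width around each boundary, so ``collar=0.5`` corresponds to 0.25 seconds on each side.

.. code-block:: python

    # Sketch (assumes pyannote.metrics is installed): DER with a 0.25 s collar on
    # each side and overlapped speech excluded, mirroring the setting described above.
    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    reference, hypothesis = Annotation(), Annotation()
    reference[Segment(0.0, 10.0)] = "spk_A"
    reference[Segment(10.0, 20.0)] = "spk_B"
    hypothesis[Segment(0.0, 11.0)] = "spk_1"
    hypothesis[Segment(11.0, 20.0)] = "spk_2"

    metric = DiarizationErrorRate(collar=0.5, skip_overlap=True)
    print(f"DER = {metric(reference, hypothesis):.3f}")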

AMI Meeting Corpus
~~~~~~~~~~~~~~~~~~

The following are the suggested parameters for reproducing the diarization performance on the `AMI <https://groups.inf.ed.ac.uk/ami/corpus/>`_ test set. This setting is based on the meeting-domain configuration in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml``.

.. code-block:: bash

    diarizer.manifest_filepath="/path/to/AMItest_input_manifest.json"
    diarizer.oracle_num_speakers=null # Evaluate with an unknown number of speakers
    diarizer.oracle_vad=True # Use oracle VAD extracted from RTTM files
    diarizer.collar=0.25
    diarizer.ignore_overlap=True
    diarizer.speaker_embeddings.model_path="titanet_large"

We provide a helper script to download the dataset and format it into a NeMo manifest.

.. code-block:: bash

    python scripts/data_processing/speaker_tasks/get_ami_data.py --manifest_filepath AMItest_input_manifest.json
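
The same overrides can also be applied programmatically. The sketch below assumes the clustering-diarizer workflow and that the config file path points at your local NeMo checkout; adjust both as needed.

.. code-block:: python

    # Sketch: apply the overrides above in Python and run clustering-based diarization.
    # The YAML path and output directory are assumptions; adjust them to your setup.
    from omegaconf import OmegaConf
    from nemo.collections.asr.models import ClusteringDiarizer

    cfg = OmegaConf.load("examples/speaker_tasks/diarization/conf/inference/diar_infer_meeting.yaml")
    cfg.diarizer.manifest_filepath = "/path/to/AMItest_input_manifest.json"
    cfg.diarizer.out_dir = "/path/to/ami_diar_output"
    cfg.diarizer.oracle_vad = True            # use oracle VAD extracted from RTTM files
    cfg.diarizer.collar = 0.25
    cfg.diarizer.ignore_overlap = True
    cfg.diarizer.speaker_embeddings.model_path = "titanet_large"

    ClusteringDiarizer(cfg=cfg).diarize()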


CallHome American English Speech (CHAES), LDC97S42
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We use the CH109 set, a subset of the CHAES dataset that has only two speakers per session.
The following are the suggested parameters for reproducing the diarization performance on the CH109 set; this setting is based on the telephonic-domain configuration in ``<NeMo_git_root>/examples/speaker_tasks/diarization/conf/inference/diar_infer_telephonic.yaml``.

.. code-block:: bash

    diarizer.manifest_filepath="/path/to/ch109_input_manifest.json"
    diarizer.oracle_vad=True # Use oracle VAD extracted from RTTM files
    diarizer.collar=0.25
    diarizer.ignore_overlap=True
    diarizer.speaker_embeddings.model_path="titanet_large"


To evaluate the performance on the CH109 set, the following steps can help.
- Download the CHAES corpus from the LDC website `LDC97S42 <https://catalog.ldc.upenn.edu/LDC97S42>`_ (CHAES is not publicly available).
- Download the CH109 filename list (whitelist) from `CH109 whitelist <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/ch109_whitelist.txt>`_.
- Download RTTM files for the CH109 set from `CH109 RTTM files <https://raw.githubusercontent.com/tango4j/diarization_annotation/main/CH109/split_rttms.tar.gz>`_.
- Generate an input manifest file using ``<NeMo_git_root>/scripts/speaker_tasks/pathfiles_to_diarize_manifest.py`` (a whitelist-filtering sketch follows below).
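
A minimal filtering sketch for the whitelist step; file locations and naming are assumptions, so adjust them to your local layout.

.. code-block:: python

    # Sketch: keep only the CH109 sessions named in the whitelist when building
    # the audio path list for pathfiles_to_diarize_manifest.py.
    from pathlib import Path

    whitelist = set(Path("ch109_whitelist.txt").read_text().split())
    audio_dir = Path("/path/to/chaes_audio")  # directory with CHAES .wav files (assumption)

    with open("ch109_audio_file_path_list.txt", "w") as f:
        for wav in sorted(audio_dir.glob("*.wav")):
            if wav.stem in whitelist:  # assumes filenames match whitelist entries
                f.write(f"{wav}\n")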

Binary file modified docs/source/asr/speaker_diarization/images/asr_sd_diagram.png