General steps to add a speaker diarization dataset with <files, annotations> to the Hub:
- Prepare a folder containing the audio and annotation files, organised like this:
dataset_folder
├── audio
│   ├── file_1.mp3
│   ├── file_2.mp3
│   └── file_3.mp3
└── annotations
    ├── file_1.rttm
    ├── file_2.rttm
    └── file_3.rttm
- Build dictionaries with the following structure:
annotations_files = {
    "subset1": [list of annotation files in subset1],
    "subset2": [list of annotation files in subset2],
}
audio_files = {
    "subset1": [list of audio files in subset1],
    "subset2": [list of audio files in subset2],
}
Here, each subset will correspond to a Hugging Face dataset subset.
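For the folder layout above, these dictionaries can be built in a few lines of Python. A minimal sketch, assuming a single "train" subset (the subset name and glob patterns are illustrative):
import glob
import os

# Sort both lists so audio and annotation files stay aligned by filename.
audio_files = {
    "train": sorted(glob.glob(os.path.join("dataset_folder", "audio", "*.mp3"))),
}
annotations_files = {
    "train": sorted(glob.glob(os.path.join("dataset_folder", "annotations", "*.rttm"))),
}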
- Use the SpeakerDiarizationDataset class from diarizers to build your Hugging Face dataset:
from diarizers import SpeakerDiarizationDataset
dataset = SpeakerDiarizationDataset(audio_files, annotations_files).construct_dataset()
Note: This class currently supports RTTM-format annotation files and may need to be adapted for other formats.
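For reference, each line of an RTTM file describes one labelled speech segment, with space-separated fields: segment type, file ID, channel, onset in seconds, duration in seconds, and a speaker label (unused fields hold <NA>). For example:
SPEAKER file_1 1 4.21 2.50 <NA> <NA> spk00 <NA> <NA>
SPEAKER file_1 1 6.71 1.08 <NA> <NA> spk01 <NA> <NA>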
We explain below the scripts we used to add the various datasets present in the diarizers-community organisation.
AMI: download the data using the AMI-diarization-setup scripts:
git clone https://github.com/pyannote/AMI-diarization-setup.git
cd AMI-diarization-setup/pyannote/
sh download_ami.sh
sh download_ami_sdm.sh
CallHome: download the data for each language (example here for Japanese):
wget https://ca.talkbank.org/data/CallHome/jpn.zip
wget -r -np -nH --cut-dirs=2 -R index.html* https://media.talkbank.org/ca/CallHome/jpn/
unzip jpn.zip
VoxConverse: download the RTTM files:
git clone [email protected]:joonson/voxconverse.git
Download the audio files:
wget https://www.robots.ox.ac.uk/~vgg/data/voxconverse/data/voxconverse_dev_wav.zip
unzip voxconverse_dev_wav.zip
wget https://www.robots.ox.ac.uk/~vgg/data/voxconverse/data/voxconverse_test_wav.zip
unzip voxconverse_test_wav.zip
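As a concrete instance of the general recipe above, the dev and test dictionaries can then be assembled with glob. A sketch, assuming the RTTM files sit in voxconverse/dev and voxconverse/test and the wav archives extract to the folder names used below (adapt the paths to your actual extraction layout):
import glob

annotations_files = {
    "dev": sorted(glob.glob("voxconverse/dev/*.rttm")),
    "test": sorted(glob.glob("voxconverse/test/*.rttm")),
}
audio_files = {
    "dev": sorted(glob.glob("voxconverse_dev_wav/**/*.wav", recursive=True)),
    "test": sorted(glob.glob("voxconverse_test_wav/**/*.wav", recursive=True)),
}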
Simsamu: clone the dataset from its Hugging Face repository:
git lfs install
git clone [email protected]:datasets/medkit/simsamu
We converted and pushed each of these datasets using spd_datasets.py and a command of the following form (example for CallHome):
python3 spd_datasets.py \
    --dataset=callhome \
    --path_to_dataset=/path_to_callhome \
    --push_to_hub=False \
    --hub_repository=diarizers-community/callhome
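Once pushed (with --push_to_hub=True), the resulting dataset can be loaded back like any Hugging Face dataset. A short illustration; the "jpn" subset name is an assumption, so check the repository for its available configurations:
from datasets import load_dataset

# Load the pushed dataset back from the Hub
# (pass a subset name if the repository defines several).
dataset = load_dataset("diarizers-community/callhome", "jpn")
print(dataset)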
To use the synthetic dataset pipeline, first install diarizers:
git clone https://github.com/huggingface/diarizers.git
cd diarizers
pip install -e .
To augment your synthetic data with noise, you need background noise and room impulse response datasets. Here are suggested datasets and how to download them:
- Background Noise dataset: WHAM!. To download:
wget https://my-bucket-a8b4b49c25c811ee9a7e8bba05fa24c7.s3.amazonaws.com/wham_noise.zip
unzip wham_noise.zip
- Room Impulse Response dataset: MIT-ir-survey. To download:
wget https://mcdermottlab.mit.edu/Reverb/IRMAudio/Audio.zip
unzip Audio.zip
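Both downloads are plain folders of audio files. If you want to inspect them as Hugging Face datasets, the generic audiofolder loader works; how the synthetic pipeline itself expects to receive them is not shown here, so treat this as an illustration only (paths follow the unzipped archives above):
from datasets import load_dataset

# Load the unzipped noise and room-impulse-response folders as audio datasets.
wham_noise = load_dataset("audiofolder", data_dir="wham_noise")
mit_rir = load_dataset("audiofolder", data_dir="Audio")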
To generate synthetic datasets, you will need to specify a few parameters via the SyntheticDatasetConfig class.
For example, you can generate roughly 20 hours of Japanese synthetic speaker diarization data using the following code snippet:
from diarizers import SyntheticDatasetConfig, SyntheticDataset
synthetic_config = SyntheticDatasetConfig(
    # Source dataset: per-speaker audio samples to stitch into meetings
    dataset_name="mozilla-foundation/common_voice_17_0",
    subset="ja",          # Common Voice language configuration
    split="validated",    # data split to draw samples from
    speaker_column_name="client_id",
    audio_column_name="audio",
    min_samples_per_speaker=10,
    nb_speakers_from_dataset=-1,
    sample_rate=16000,
    # Meeting generation parameters
    nb_speakers_per_meeting=3,
    num_meetings=1600,
    segments_per_meeting=16,
    normalize=True,
    augment=False,
    overlap_proba=0.3,
    overlap_length=3,
    random_gain=False,
    add_silence=True,
    silence_duration=3,
    silence_proba=0.7,
    denoise=False,
    num_proc=2,
)
dataset = SyntheticDataset(synthetic_config).generate()
dataset.push_to_hub('diarizers-community/synthetic-speaker-diarization-dataset')
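Before or after pushing, it is worth printing the result to confirm the number of generated meetings and the expected columns. A quick check, assuming generate() returns a standard datasets object with the usual diarizers features:
# Expect columns: audio, timestamps_start, timestamps_end, speakers.
print(dataset)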
Find more information on how to use the 🤗 Diarizers synthetic speaker diarization pipeline in this notebook: .