# Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition

Code for the ICASSP 2024 paper "Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition" [arXiv].
- pytorch
- torchaudio (For data processing and loss function)
- transformers (For pretrained wav2vec2)
- timm (For optimizer and scheduler)
- yacs (For config)
- jiwer (For ASR evaluation)
- sentencepiece (For byte-pair encoding)
For utterance-level speech emotion recognition, IEMOCAP is used for both SER and ASR training and evaluation.
For fine-grained speech emotion recognition (or speech emotion diarization) on ZED, we treat the task as a weakly supervised problem: IEMOCAP with utterance-level emotion labels is used for training, and ZED for evaluation.
- Obtain IEMOCAP from USC for speech emotion recognition and automatic speech recognition.
- By default, we use character encoding for ASR transcriptions; run the following command to get `iemocap4char.pkl`. In our experiments, character encoding performs better.

```shell
cd dataset/IEMOCAP
bash preprocess.sh --dataset_dir $path to directory of IEMOCAP$ --bpemode char
cd ../..
```
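As a rough illustration of what character encoding produces, the sketch below builds a character vocabulary and maps a transcription to token ids. This is a hypothetical simplification, not the actual `preprocess.sh` logic; the real symbol table and blank handling come from the preprocessing script.

```python
def build_char_vocab(transcripts):
    """Map each character (plus a blank symbol for the transducer) to an id.
    Illustrative sketch only; the real vocabulary comes from preprocess.sh."""
    symbols = ["<blank>"] + sorted({ch for text in transcripts for ch in text})
    return {sym: i for i, sym in enumerate(symbols)}

# Encode a transcription into token ids (the `tlabel` field, conceptually).
vocab = build_char_vocab(["HELLO", "WORLD"])
ids = [vocab[ch] for ch in "HELLO"]
```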
- Alternatively, use byte-pair encoding with 5000 units for ASR transcriptions and run the following command to get `iemocap4bpe.pkl`. Note that the final vocab_size equals the number of lines in `iemocap_bpe5000_units.txt`. If you want a different vocab_size, specify the `--nbpe` argument in the preprocessing command below, and then change the `VOCAB_SIZE`, `SYMBOL_PATH`, and `BPEMODEL_PATH` yaml arguments in `configs/bpe` for model training.

```shell
cd dataset/IEMOCAP
bash preprocess.sh --dataset_dir $path to directory of IEMOCAP$ --nbpe 5000 --bpemode bpe
cd ../..
```
### Description of IEMOCAP Pickle Files

`iemocap4char.pkl` or `iemocap4bpe.pkl` is a dictionary with session names as keys and lists of dictionaries as values. Following is a brief description of each dictionary element with key/value pairs:
- `audio`: audio data of the utterance, sampled at 16 kHz.
- `audio_length`: original audio length; the expected output length of pretrained wav2vec2 is calculated from it on the fly.
- `tlabel`: token ids of the corresponding ASR transcription.
- `label`: emotion label of the utterance.
- `text`: ASR reference transcription as a string.
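Since `audio_length` stores the raw sample count, the expected wav2vec2 output length can be derived from the conv feature extractor's kernel/stride settings. The sketch below assumes the base wav2vec2 configuration (verify against your pretrained checkpoint); `inspect` is a hypothetical helper for browsing the pickle.

```python
import pickle

# Kernel/stride settings of the wav2vec2 base conv feature extractor
# (an assumption here; check your checkpoint's config).
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def w2v2_out_len(n_samples):
    """Expected number of wav2vec2 output frames for a raw waveform length."""
    for kernel, stride in CONV_LAYERS:
        n_samples = (n_samples - kernel) // stride + 1
    return n_samples

def inspect(path="dataset/IEMOCAP/iemocap4char.pkl"):
    """Print one utterance per session from the pickle described above."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    for session, utterances in data.items():
        u = utterances[0]
        print(session, u["label"], w2v2_out_len(u["audio_length"]), u["text"])
```

For example, one second of 16 kHz audio maps to 49 output frames, i.e., roughly a 20 ms stride.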
- Obtain ZED for speech emotion diarization.
- Run the following command to get `zed.pkl`.

```shell
cd dataset/ZED
bash preprocess.sh --dataset_dir $path to directory of ZED$
cd ../..
```
### Description of ZED Pickle Files

`zed.pkl` stores a list of dictionaries. Following is a brief description of each dictionary element with key/value pairs:
- `id`: the utterance id in the ZED dataset.
- `audio`: audio data of the utterance, sampled at 16 kHz.
- `audio_length`: original audio length; the expected output length of pretrained wav2vec2 is calculated from it on the fly.
- `duration`: duration of the utterance in seconds.
- `frame_label`: fine-grained frame-level labels, at the resolution of the wav2vec stride/window.
- `interval`: a dictionary containing the start/end time step of the emotion.
- `emotion`: utterance-level label, not used.
- `text`: ASR reference transcription as a string.
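For reference, a minimal sketch of turning a `frame_label` sequence into emotion time intervals. It assumes a ~20 ms wav2vec2 stride and that 0 marks non-emotional frames; the actual evaluation follows the Speech Emotion Diarization toolkit.

```python
def frames_to_intervals(frame_label, stride=0.02):
    """Convert frame-level emotion labels into (start, end) intervals in
    seconds. stride=0.02 assumes the ~20 ms hop of wav2vec2 features."""
    intervals, start = [], None
    for i, lab in enumerate(frame_label + [0]):  # sentinel closes a trailing run
        if lab and start is None:
            start = i                            # emotional run begins
        elif not lab and start is not None:
            intervals.append((round(start * stride, 2), round(i * stride, 2)))
            start = None                         # emotional run ends
    return intervals
```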
In our paper, we train the model with a frozen wav2vec2 and equal coefficients for all losses. See the yaml files with the 'asrser' suffix for default training arguments, or use the command line for easy config modification. For efficiency, we skip the loss calculation during evaluation in `transducer.py`.
```shell
# Emotion Neural Transducer for SER and ASR
python main.py --model ent --headw 0.5 --rnntw 0.5 --jointw 0.5 --bpe_mode char
# Factorized Emotion Neural Transducer for SER and ASR
python main.py --model fent --headw 0.5 --rnntw 0.5 --jointw 0.5 --bpe_mode char
# Emotion Neural Transducer for SED
python main.py --model ent --headw 0.5 --rnntw 0.5 --jointw 0.5 --bpe_mode char --sed
# Factorized Emotion Neural Transducer for SED
python main.py --model fent --headw 0.5 --rnntw 0.5 --jointw 0.5 --bpe_mode char --sed
```
- `--model`: the model chosen for training. Note that in our paper we take the pretrained model as a feature extractor; its output is a trainable weighted sum over multiple transformer layers.
  - `ent`: Emotion Neural Transducer.
  - `fent`: Factorized Emotion Neural Transducer. We use separate trainable weighted sums for the blank/vocabulary joint networks.
- `--bpe_mode`: encoding type for the ASR reference transcription.
  - `char`: character encoding (the default).
  - `bpe`: byte-pair encoding.
- `--headw`: loss weight for the cross-entropy loss over pooled features.
- `--rnntw`: loss weight for the RNN-T loss.
- `--jointw`: loss weight for the lattice loss applied to the emotion joint network.
- `--lmw`: loss weight for the language modeling loss applied to the vocabulary prediction network; only applicable to FENT and not used in our paper. Refer to Factorized Neural Transducer for more details.
- `--sed`: speech emotion diarization mode. Specifically, we use IEMOCAP as the training dataset and ZED as the testing dataset. We only provide the yaml based on character encoding.
- `--finetune`: whether to finetune the pretrained model or take it as a feature extractor.
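The weight flags simply scale their objectives before summation; a minimal sketch with hypothetical per-batch loss values and the 0.5 coefficients used in the commands above:

```python
def total_loss(losses, weights):
    """Weighted sum of the training objectives (illustrative sketch)."""
    return sum(weights[name] * losses[name] for name in losses)

# Hypothetical loss values for one batch, combined with equal weights of 0.5.
loss = total_loss(
    {"head": 2.0, "rnnt": 4.0, "joint": 1.0},   # cross-entropy, RNN-T, lattice
    {"head": 0.5, "rnnt": 0.5, "joint": 0.5},   # --headw, --rnntw, --jointw
)
```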
Besides the command-line options above, you can also modify the corresponding yaml arguments in `configs`, especially when the dataset/pickle path or vocabulary size changes.
We provide some variants of lattice max pooling in `loss.py`, corresponding to the ablation study in our paper.
- `One`: lattice max pooling loss applied to a single node; only the most emotional/non-emotional node is selected for backpropagation.
- `FullT`: lattice max pooling loss applied over all nodes within the timestamp where the most emotional/non-emotional node is located, i.e., the entire row.
- `FullU`: lattice max pooling loss applied over all nodes within the token where the most emotional/non-emotional node is located, i.e., the entire column.
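A sketch of how the three variants differ in which lattice nodes receive gradient, on a plain T×U score grid. `pooling_mask` is a hypothetical helper for illustration, not the `loss.py` API.

```python
def pooling_mask(lattice, variant="One"):
    """Select lattice nodes for the max pooling loss.
    lattice: T x U grid of emotion scores (list of lists)."""
    T, U = len(lattice), len(lattice[0])
    # Locate the most emotional node (global max over the lattice).
    t_star, u_star = max(((t, u) for t in range(T) for u in range(U)),
                         key=lambda p: lattice[p[0]][p[1]])
    mask = [[0] * U for _ in range(T)]
    if variant == "One":            # single node only
        mask[t_star][u_star] = 1
    elif variant == "FullT":        # entire row: all tokens at that timestamp
        for u in range(U):
            mask[t_star][u] = 1
    elif variant == "FullU":        # entire column: all timestamps for that token
        for t in range(T):
            mask[t][u_star] = 1
    return mask
```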
Lattice max pooling loss also adapts to supervised training data. For the mixing method in our paper, we concatenate audio segments to produce longer utterances and train the model on the resulting emotion intervals in a supervised manner. For code readability, we provide the core code for data preprocessing in `preprocess_mix.sh` and the supervised lattice loss in `loss.py`.
### Description of IEMOCAP_MIX Pickle Files

`iemocap_mix.pkl` stores a list of dictionaries. Following is a brief description of each dictionary element with key/value pairs:
- `audio`: audio data of the concatenated utterance, sampled at 16 kHz.
- `audio_length`: unlike the IEMOCAP pickle files above, here we store the expected output length of pretrained wav2vec2 directly.
- `text`: concatenated ASR reference transcription as a string.
- `tlabel`: token ids of the corresponding concatenated ASR transcription.
- `frame_tlabel`: emotion label of each token; its length equals the number of characters in the concatenated transcription.
- `frame_tlabel_length`: interval lengths of the concatenated ASR transcription; its length equals the number of concatenated audio segments.
- `label`: emotion interval labels of the utterance; its length equals the number of concatenated audio segments.
- `frame_label`: emotion label of each timestamp; its length equals the number of wav2vec2 output frames for the concatenated audio.
- `frame_label_length`: interval lengths of the concatenated audio; its length equals the number of concatenated audio segments.
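A rough sketch of how such a mixed entry could be assembled. The `(audio, emotion, text)` segment format and the `len(wav) // 320` output-length approximation are our illustrative assumptions; see `preprocess_mix.sh` for the actual preprocessing (the sketch also omits `tlabel`/`frame_tlabel`, which require the token vocabulary).

```python
def mix_utterances(segments):
    """Concatenate (audio, emotion, text) segments into one mixed entry."""
    entry = {"audio": [], "text": "", "label": [],
             "frame_label": [], "frame_label_length": []}
    for wav, emotion, text in segments:
        n_frames = len(wav) // 320              # rough wav2vec2 output frame count
        entry["audio"].extend(wav)
        entry["text"] += text
        entry["label"].append(emotion)          # per-segment interval label
        entry["frame_label"].extend([emotion] * n_frames)
        entry["frame_label_length"].append(n_frames)
    # Unlike the plain IEMOCAP pickles, store the output length directly.
    entry["audio_length"] = len(entry["frame_label"])
    return entry
```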
- wenet for ASR transcription preprocessing and the RNN-T architecture.
- Speech Emotion Diarization for ZED preprocessing and evaluation.
```bibtex
@inproceedings{shen2024emotion,
  title={Emotion Neural Transducer for Fine-Grained Speech Emotion Recognition},
  author={Shen, Siyuan and Gao, Yu and Liu, Feng and Wang, Hanyang and Zhou, Aimin},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10111--10115},
  year={2024},
  organization={IEEE}
}
```