This code demonstrates the multimodal LSTM described in the following paper:
Jimmy SJ. Ren, Yongtao Hu, Yu-Wing Tai, Chuan Wang, Li Xu, Wenxiu Sun, Qiong Yan,
"Look, Listen and Learn - A Multimodal LSTM for Speaker Identification", The 30th AAAI Conference on Artificial Intelligence (AAAI-16).
Please visit here for a refactored version of the multimodal LSTM and more applications. The training procedure and the pre-processed training data used in this paper are also released there.
The raw dataset can be downloaded from Baidu Pan or Google Drive.
### Dataset summary

This is a multimodal dataset containing both face images and the corresponding speaking audio clips, extracted from the first two seasons of the TV series "The Big Bang Theory".
### Face images
We extracted the faces of all the characters (leading or not) from 12 episodes of the TV series: the first 6 episodes of Season 1 and the first 6 episodes of Season 2. All faces are organized per character per episode. For example, for Season-1-Episode-1 you will find 6 folders in folder face-images/s01e01: 5 for the 5 leading characters (Howard, Leonard, Penny, Raj and Sheldon) and 1, named other, for all non-leading characters. In total, there are more than 407K face images (in JPG format).
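For example, the per-character folders for an episode can be enumerated with a few lines of MATLAB. This is a minimal sketch assuming the folder layout described above, with the face-images root sitting in the current directory:

```matlab
% Count face images per character for Season-1-Episode-1.
root  = fullfile('face-images', 's01e01');
chars = dir(root);
chars = chars([chars.isdir] & ~ismember({chars.name}, {'.', '..'}));
for k = 1:numel(chars)
    imgs = dir(fullfile(root, chars(k).name, '*.jpg'));
    fprintf('%-10s %6d images\n', chars(k).name, numel(imgs));
end
```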
### Speaking audio
We extracted all the speaking audio segments of the 5 leading characters across the two seasons. The speaking audio clips are merged into one per character per episode. For example, for Season-2-Episode-1 you will find 5 WAV files in folder speaking-audio/s02e01, one for each of the 5 leading characters (Howard, Leonard, Penny, Raj and Sheldon). The mapping between character names and the labels used in the WAV filenames is given in the labels.txt file. In total, there are more than 3 hours of speaking audio (in WAV format).
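As a sanity check, the total speaking time per character in an episode can be tallied with MATLAB's built-in audioinfo. A minimal sketch, assuming the layout above (the exact filenames depend on labels.txt):

```matlab
% Sum up the duration of each per-character WAV clip in s02e01.
clipDir = fullfile('speaking-audio', 's02e01');
clips   = dir(fullfile(clipDir, '*.wav'));
for k = 1:numel(clips)
    info = audioinfo(fullfile(clipDir, clips(k).name));
    fprintf('%-20s %7.1f s\n', clips(k).name, info.Duration);
end
```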
### Terms of use

The dataset is provided for research purposes only. Any commercial use is prohibited. Please cite our paper if you use the dataset in your research work.
This code uses pre-processed data in .mat format. To run the code, please go here or here to download the data.
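If you want to inspect the pre-processed data before running the tests, the stored variables can be listed without loading them. The filename below is a placeholder, not an actual file from the release:

```matlab
% 'some_data_file.mat' is a placeholder; substitute a file from the download.
whos('-file', 'some_data_file.mat')    % list stored variables without loading

data = load('some_data_file.mat');     % load into a struct to avoid clobbering the workspace
disp(fieldnames(data))
```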
The code was tested on Ubuntu 14.04 and should also run on Windows. An NVIDIA GPU with more than 4GB of graphics memory is required.
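You can verify that MATLAB sees a suitable GPU before starting (requires the Parallel Computing Toolbox):

```matlab
% Check the active GPU and its total memory (needs > 4 GB here).
g = gpuDevice;
fprintf('GPU: %s, total memory: %.1f GB\n', g.Name, g.TotalMemory / 2^30);
```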
### Multimodal LSTM with full weight sharing

Step 1: Go here or here, download the whole LSTM_sn_full_mm_weight_share folder, and overwrite the same folder in the code.

Step 2: Launch Matlab, enter the LSTM_sn_full_mm_weight_share folder, open speaker_naming/face_audio_5c/, and run test_FA_all_v52.m.

Wait several minutes and you will see the calculated false alarm rate and accuracy. You will find that this version achieves the best false alarm rate and accuracy among all versions of the multi/single modal LSTM.
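For reference, here is a minimal sketch of how accuracy and false alarm rate are typically computed from predicted and ground-truth labels. The label conventions below are assumptions for illustration, not necessarily those used inside test_FA_all_v52.m:

```matlab
% Toy example: label 0 stands for "other" (non-leading characters),
% labels 1-5 for the five leading characters. These conventions are
% assumptions, not necessarily those of test_FA_all_v52.m.
truth = [1 2 0 3 0 4 5 0];
pred  = [1 2 0 3 1 4 5 0];

accuracy   = mean(pred == truth);            % fraction of correct labels
falseAlarm = mean(pred(truth == 0) ~= 0);    % "other" samples wrongly given a name
fprintf('accuracy: %.1f%%, false alarm: %.1f%%\n', ...
        100 * accuracy, 100 * falseAlarm);
```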
### Multimodal LSTM with half weight sharing

Step 1: Go here or here, download the whole LSTM_sn_half_mm_weight_share folder, and overwrite the same folder in the code.

Step 2: Launch Matlab, enter the LSTM_sn_half_mm_weight_share folder, open speaker_naming/face_audio_5c/, and run test_FA_all_v5.m.
### Multimodal LSTM without weight sharing

Step 1: Go here or here, download the whole LSTM_sn_no_mm_weight_share folder, and overwrite the same folder in the code.

Step 2: Launch Matlab, enter the LSTM_sn_no_mm_weight_share folder, open speaker_naming/face_audio_5c/, and run test_FA_all_v61.m.
### Single modal LSTM (audio only / face only)

Step 1: Go here or here, download the whole LSTM_sn_audio_only folder as well as the LSTM_sn_face_only folder, and overwrite the same folders in the code.

Step 2: Launch Matlab, enter the LSTM_sn_audio_only or LSTM_sn_face_only folder, open speaker_naming/audio_only/ or speaker_naming/face_only/, and run test_audio_all.m or test_face_all.m.
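If you want to evaluate all the released configurations in one MATLAB session, the individual test scripts can be invoked in sequence. This is a sketch assuming the folder layout described in the steps above; adjust `base` to wherever the repository lives:

```matlab
% Run every released test script in sequence.
base = pwd;
jobs = { ...
    'LSTM_sn_full_mm_weight_share/speaker_naming/face_audio_5c', 'test_FA_all_v52'; ...
    'LSTM_sn_half_mm_weight_share/speaker_naming/face_audio_5c', 'test_FA_all_v5'; ...
    'LSTM_sn_no_mm_weight_share/speaker_naming/face_audio_5c',   'test_FA_all_v61'; ...
    'LSTM_sn_audio_only/speaker_naming/audio_only',              'test_audio_all'; ...
    'LSTM_sn_face_only/speaker_naming/face_only',                'test_face_all'};
for k = 1:size(jobs, 1)
    cd(fullfile(base, jobs{k, 1}));
    run(jobs{k, 2});
    cd(base);
end
```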