Speaker Recognition using Deep Neural Networks

Introduction

This repository contains the code and results for my Master's thesis, which focuses on developing a robust speaker recognition system using deep neural networks. The goal of this project was to create a model capable of accurately identifying individuals based on their voice.

Methodology

Data Preprocessing:
- Conversion of raw audio signals into spectrograms using short-time Fourier transform (STFT).
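
The STFT step above can be illustrated with a minimal NumPy sketch. This is not the code from the notebooks: the function name `stft_magnitude`, the frame length of 512 samples, the hop of 128 samples, the Hann window, and the synthetic 1 kHz test tone are all assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, frame_len=512, hop=128):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    frame_len and hop are illustrative defaults, not values from the thesis.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice overlapping frames and taper each one with the window
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1))

# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 512  # bin index -> frequency in Hz
```

Each row of `spec` is one time frame and each column one frequency bin, so the energy of the test tone should concentrate in the bin nearest 1 kHz.
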
Feature Extraction:
- Experimentation with various feature extraction techniques, including:
  - Mel-Frequency Cepstral Coefficients (MFCC)
  - Spectral contrast
  - Mel spectrograms
- Evaluation of different feature representations based on their ability to capture speaker-specific information.
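
As a rough illustration of how a mel spectrogram is derived from a power spectrogram, the sketch below builds a triangular mel filterbank in NumPy. The helper names, the HTK-style mel formula, and the choice of 40 mel bands at 16 kHz are assumptions for this example and may differ from what the notebooks use.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist)
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=np.float64) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=np.float64) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters with shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2), n_mels + 2))
    bank = np.zeros((n_mels, len(fft_freqs)))
    for m in range(n_mels):
        lo, centre, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (centre - lo)
        falling = (hi - fft_freqs) / (hi - centre)
        # The triangle is the lower of the two slopes, floored at zero
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel(power_spec, bank, eps=1e-10):
    """Map (n_frames, n_bins) power spectra to (n_frames, n_mels) log-mel features."""
    return np.log(power_spec @ bank.T + eps)

bank = mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000)
features = log_mel(np.ones((3, 257)), bank)  # dummy flat spectrum, 3 frames
```

Applying a discrete cosine transform along the mel axis of such log-mel features is the usual route to MFCCs, which is why the two representations are closely related.
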
Model Architecture:
- Design of a custom convolutional neural network (BetterCNN) tailored for speaker recognition.
- Comparison with the widely used ResNet50 architecture.
Experiments and Results:
- Evaluation of the proposed model on multiple datasets (50_speakers, LibriSpeech, TIMIT).
- Detailed analysis of the performance metrics (accuracy, F1-score).
- Comparison of BetterCNN with ResNet50 and other baseline models.
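
For reference, the two metrics can be computed as in the sketch below. The README does not state which F1 averaging was used, so macro averaging (the unweighted mean of per-class F1) is an assumption here, and the speaker labels are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of utterances assigned to the correct speaker."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-speaker F1 scores (macro averaging, assumed)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels for three speakers
y_true = ["spk_a", "spk_a", "spk_b", "spk_b", "spk_c", "spk_c"]
y_pred = ["spk_a", "spk_b", "spk_b", "spk_b", "spk_c", "spk_a"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Accuracy and macro F1 diverge when errors are unevenly spread across speakers, which is one reason to report both.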

Results

Mel spectrograms were the most effective feature representation for speaker recognition in this study. The proposed BetterCNN model outperformed ResNet50 on every dataset evaluated.

Key findings:
- BetterCNN achieved an F1-score of 96.11% and accuracy of 96.24% on the 50_speakers dataset.
- On the LibriSpeech dataset, BetterCNN reached an accuracy above 99.75%.

Conclusion

This research highlights the effectiveness of deep learning techniques for speaker recognition. The proposed BetterCNN model offers a promising approach for developing accurate and efficient speaker identification systems. Future work could explore:

- Larger datasets: Training on larger and more diverse datasets.
- Advanced architectures: Exploring more complex neural network architectures (e.g., transformers).
- Multimodal approaches: Combining audio with other biometric modalities (e.g., facial images).

Repository: tojoos/SpeakerRecognition
