The DeepSpeech-Keras project makes speech-to-text analysis easy:
```python
from deepspeech import load

deepspeech = load('pl')
files = ['path/to/audio.wav']
sentences = deepspeech(files)
```
Using DeepSpeech-Keras you can:
- perform speech-to-text analysis using pre-trained models
- tune pre-trained models to your needs
- create new models on your own
All of this is built on the Keras API and Python 3.6. The main principle behind the project is that the program and its structure should be easy to use and understand.
DeepSpeech-Keras key features:
- Multi-GPU support: we do data parallelism via `multi_gpu_model`, which yields a quasi-linear speedup on up to 8 GPUs (see the sketch after this list).
- CuDNN support: the model uses the CuDNNLSTM implementation by NVIDIA Developers; CPU devices are also supported.
- DataGenerator: feature extraction (on CPU) can run in parallel with model training (on GPU). Moreover, it can use precomputed features saved in an hdf5 file.
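For illustration, here is a minimal sketch of how `multi_gpu_model` data parallelism is typically wired up in Keras 2.x; the model below is a placeholder, not this project's actual CTC architecture:

```python
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

# Placeholder model - the real project builds a CTC model instead
model = Sequential([Dense(29, input_shape=(26,))])

# Replicate the model on 8 GPUs; each input batch is split into 8 sub-batches,
# processed in parallel, and the outputs are merged back on the CPU
parallel_model = multi_gpu_model(model, gpus=8)
parallel_model.compile(optimizer='adam', loss='mse')
```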
You can use pip (not prepared yet):

```bash
pip install deepspeech-keras
```

Otherwise clone the code and create a new environment via conda:

```bash
git clone https://github.com/rolczynski/DeepSpeech-Keras.git
conda env create -f=environment.yml   # or use: environment-gpu.yml
conda activate DeepSpeech-Keras
```
Speech recognition is a tough task. You don't need to know all the details to use one of the pre-trained models. However, it's worth understanding the crucial conceptual components:
- Input: mono 16-bit, 16 kHz WAVE files (up to 5 seconds)
- FeaturesExtractor: converts audio files to MFCC features (a sketch follows this list)
- Model: the CTC model defined in Keras (references: [1], [2])
- Decoder: a beam search algorithm with language model support; decodes a sequence of probabilities using the Alphabet
- DataGenerator: streams data to the model via a generator
- Callbacks: a set of functions monitoring the training
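As a rough illustration of what the FeaturesExtractor does conceptually, here is a minimal MFCC sketch using the python_speech_features package; the package choice and the parameter values are assumptions for illustration, and the project's own extractor may differ:

```python
import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Read a mono 16-bit 16 kHz WAVE file
sample_rate, signal = wav.read('path/to/audio.wav')

# 25 ms windows with a 10 ms step are common defaults (illustrative values)
features = mfcc(signal, samplerate=sample_rate,
                winlen=0.025, winstep=0.01, numcep=13)
print(features.shape)  # (number of frames, number of cepstral coefficients)
```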
A loaded pre-trained model already has all the components. Prediction can be invoked implicitly via the `__call__` method or more explicitly:
```python
from deepspeech import load

deepspeech = load('pl')  # Also can be: load(dir='model_dir')
files = ['to/test/sample.wav']
X = deepspeech.get_features(files)
y_hat = deepspeech.predict(X)
sentences = deepspeech.decode(y_hat)
```
The heart of deepspeech is the Keras model (`deepspeech.model`). You can use it directly and make use of all the available Keras methods like `predict_on_batch`, `get_weights` etc.
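For example, a minimal sketch of calling standard Keras methods on the underlying model (assuming a loaded Polish model and features prepared as above):

```python
from deepspeech import load

deepspeech = load('pl')
X = deepspeech.get_features(['to/test/sample.wav'])

# Standard Keras Model methods are available on deepspeech.model
deepspeech.model.summary()                    # print the architecture
y_hat = deepspeech.model.predict_on_batch(X)  # raw output probabilities
weights = deepspeech.model.get_weights()      # list of numpy arrays
```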
However, rather than writing your own training routine from scratch, you can use the `deepspeech.fit` method. The algorithm attributes (generators, optimizer, CTC loss, callbacks etc.) are already set and ready to use. All you have to do is create a new configuration.yaml file or modify an existing one:
```python
from deepspeech import DeepSpeech

deepspeech = DeepSpeech.construct('/tuned-model/configuration.yaml',
                                  '/tuned-model/alphabet.txt')
deepspeech.load('pl')  # Raises an error if a different architecture is defined

train_generator = deepspeech.create_generator('train.csv', batch_size=32)
dev_generator = deepspeech.create_generator('dev.csv', batch_size=32)

deepspeech.fit(train_generator, dev_generator, epochs=5)
deepspeech.save('/tuned-model/weights.hdf5')
```
If you have several GPUs and more than 500 hours of labeled audio samples, you could start to create new models. You can use the run.py script as a guideline. In the configuration file you specify parameters for all five components:
```yaml
features_extractor:  # audio.py: FeatureExtractor parameters (e.g. win_len or win_step)
model:               # model.py: Model parameters (e.g. units)
optimizer:           # deepspeech.optimizer: Optimizer parameters (e.g. name or lr)
callbacks:           # callback.py: List of callback parameters (e.g. tensorboard)
decoder:             # ctc_decoder.py: Decoder parameters (e.g. naive)
```
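For illustration only, a filled-in configuration could look like the sketch below; all the values (and the exact nesting) are assumptions, not the project's defaults:

```yaml
features_extractor:
  win_len: 0.025        # illustrative value
  win_step: 0.01        # illustrative value
model:
  units: 800            # illustrative value
optimizer:
  name: adam            # illustrative value
  lr: 0.001             # illustrative value
callbacks:
  - name: tensorboard   # illustrative entry
decoder:
  naive: true           # illustrative value
```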
For more details check the default configuration for Polish. Of course, you don't have to use the configuration file. You can define all the required dependencies (decoder, model etc.) and pass them to the DeepSpeech init. Moreover, each component can be replaced, so you can pass in your own objects.
In the scripts folder you can find e.g.:
- confusion_matrix.py: investigate what mistakes were made and plot the results
- consumer.py: manage different experiments; create a queue file and write a command line to execute on each line
- features.py: extract features and save them in an hdf5 file, ready to use by the DataGenerator
- evaluate.py: do a model evaluation; calculate WER, CER and other statistics (a sketch of the WER metric follows this list)
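As a rough illustration of the metric evaluate.py reports, here is a minimal word error rate (WER) sketch based on edit distance; this is the standard textbook formula, not necessarily the project's exact implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance between the word sequences,
    normalized by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer('ala ma kota', 'ala ma psa'))  # 0.333... (1 substitution / 3 words)
```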
The calculations are in progress.
Amazing work has already been done. Check out the implementations of Mozilla DeepSpeech (TensorFlow), deepspeech.pytorch (PyTorch) or even KerasDeepSpeech (Keras). This project is complementary to the projects listed above and tries to be more friendly to newcomers.
Have a question? Like the tool? Don't like it? Open an issue and let's talk about it! Pull requests are appreciated! Interesting areas for improvement:
- support for different languages (or e.g. importing weights from a pre-trained DeepSpeech model)
- rewrite the `keras` model into `tensorflow.keras`
- implement Eager Execution to make debugging more interactive
- Seq2Seq model training with a language model (+attention)