This repository has been archived by the owner on Feb 12, 2022. It is now read-only.

Add missing models (#7)
* update requirement to conda

* update env

* add argument to change model

* update models

* remove model details as we now have multiple models
faroit authored Jun 20, 2020
1 parent b18fd5f commit 8e30524
Showing 15 changed files with 291 additions and 452 deletions.
6 changes: 3 additions & 3 deletions .gitignore
@@ -1,7 +1,6 @@
#### joe made this: http://goel.io/joe

env3/
env/
env*/

#####=== Python ===#####

@@ -70,7 +69,8 @@ target/
.LSOverride

# Icon must end with two \r
Icon
Icon


# Thumbnails
._*
40 changes: 0 additions & 40 deletions Pipfile

This file was deleted.

366 changes: 0 additions & 366 deletions Pipfile.lock

This file was deleted.

57 changes: 40 additions & 17 deletions README.md
@@ -2,24 +2,29 @@

<img width="400" align="right" alt="screen shot 2017-11-21 at 12 35 28" src="https://user-images.githubusercontent.com/72940/33071669-be6c35b2-cebc-11e7-8822-9b998ad1ea09.png">

Estimating the number of concurrent speakers from single channel mixtures is a very challenging task that is a mandatory first step to address any realistic “cocktail-party” scenario. It has various audio-based applications such as blind source separation, speaker diarisation, and audio surveillance. Building upon powerful machine learning methodology and the possibility to generate large amounts of learning data, Deep Neural Network (DNN) architectures are well suited to directly estimate speaker counts.
_CountNet_ is a deep learning model to estimate the number of concurrent speakers from single channel mixtures. This is a very challenging task and a mandatory first step to address any realistic “cocktail-party” scenario. It has various audio-based applications such as blind source separation, speaker diarisation, and audio surveillance.

## Publication
This repo provides pre-trained models.

#### Accepted for ICASSP 2018
## Publications

* __Title__: Classification vs. Regression in Supervised Learning for Single Channel
### 2019: IEEE/ACM Transactions on Audio, Speech, and Language Processing

* __Title__: CountNet: Estimating the Number of Concurrent Speakers Using Supervised Learning
Speaker Count Estimation
* __Authors__: Fabian-Robert Stöter, Soumitro Chakrabarty, Bernd Edler, Emanuël
* __Authors__: [Fabian-Robert Stöter](https://faroit.com), Soumitro Chakrabarty, Bernd Edler, Emanuël
A. P. Habets
* __Preprint__: [arXiv 1712.04555](http://arxiv.org/abs/1712.04555)

## Model
* __Preprint__: [HAL](https://hal-lirmm.ccsd.cnrs.fr/lirmm-02010805)
* __Proceedings__: [IEEE](https://ieeexplore.ieee.org/document/8506601) (paywall)

<img width="360" align="right" alt="screen shot 2017-11-21 at 12 35 28" src="https://user-images.githubusercontent.com/72940/33072095-60d1929c-cebe-11e7-91de-1dff3fc50bde.png">

In this work a recurrent neural network was trained to generate speaker count estimates for 0 to 10 speakers. The model uses three Bi-LSTM layers inspired by a model for singing voice separation by [Leglaive15](https://hal.archives-ouvertes.fr/hal-01110035).
### 2018: ICASSP

* __Title__: Classification vs. Regression in Supervised Learning for Single Channel
Speaker Count Estimation
* __Authors__: [Fabian-Robert Stöter](https://faroit.com), Soumitro Chakrabarty, Bernd Edler, Emanuël
A. P. Habets
* __Preprint__: [arXiv 1712.04555](http://arxiv.org/abs/1712.04555)
* __Proceedings__: [IEEE](https://ieeexplore.ieee.org/document/8462159) (paywall)

## Demos

@@ -34,22 +39,22 @@ This repository provides the [keras](https://keras.io/) model to be used from Py
[Docker](https://www.docker.com/) makes it easy to reproduce the results and install all requirements. If you have docker installed, run the following steps to predict a count from the provided test sample.

* Build the docker image: `docker build -t countnet .`
* Predict from example: `docker run -i countnet python predict_audio.py examples/5_speakers.wav`
* Predict from example: `docker run -i countnet python predict.py --model CRNN examples/5_speakers.wav`

### Manual Installation

Make sure you have Python 3.6, `libsndfile` and `libhdf5` installed on your system (e.g. through Anaconda). To install the requirements run
To install the requirements using Anaconda Python, run

`pip install -r requirements.txt`
`conda env create -f env.yml`

You can now run the command line script and process wav files
You can now run the command line script and process wav files using the pre-trained model `CRNN` (best performance).

`python predict_audio.py examples/5_speakers.wav`
`python predict.py examples/5_speakers.wav --model CRNN`

## Reproduce Paper Results using the LibriCount Dataset
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1216072.svg)](https://doi.org/10.5281/zenodo.1216072)

The full test dataset is available for download on Zenodo.
The full test dataset is available for download on [Zenodo](https://doi.org/10.5281/zenodo.1216072).

### LibriCount10 0dB Dataset

@@ -85,6 +90,24 @@ In the following example a speaker count of 3 speakers is the ground truth.
]
```
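
The ground truth can be read directly from the annotation file: `eval.py` in this commit uses the number of top-level entries in the loaded JSON as the speaker count. The following is a minimal sketch of that step, assuming one list entry per speaker. The helper name and the field names in the sample are hypothetical placeholders, not copied from the dataset.

```python
import json


def speaker_count_from_metadata(json_text):
    """Return the ground-truth speaker count for one annotation file.

    Assumes the file holds a JSON list with one entry per speaker,
    mirroring how eval.py takes len() of the loaded data.
    """
    entries = json.loads(json_text)
    if not isinstance(entries, list):
        raise ValueError("expected a JSON list with one entry per speaker")
    return len(entries)


# Hypothetical annotation with three speakers; field names are placeholders.
sample = json.dumps([
    {"speaker_id": 1, "sex": "F"},
    {"speaker_id": 2, "sex": "M"},
    {"speaker_id": 3, "sex": "F"},
])

print(speaker_count_from_metadata(sample))
```

An empty list would correspond to a recording without any speaker, which matches the `0` class of the models.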

### Running evaluation

```python eval.py ~/path/to/LibriCount10-0dB --model CRNN``` outputs the _mean absolute error_ per class and averaged.
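
As a rough illustration of the two reported numbers, the sketch below computes the mean absolute error per true speaker count and over all items. It is a simplified, pure-Python stand-in for what `eval.py` does with NumPy, and the toy values are made up for the example rather than taken from the dataset.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences of counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mae_per_class(y_true, y_pred):
    """Group items by their true count and compute the MAE in each group."""
    groups = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        groups[t].append(abs(t - p))
    return {k: sum(errs) / len(errs) for k, errs in sorted(groups.items())}


# Made-up toy values, not results from the dataset.
y_true = [0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 2, 2, 2, 4]

print("MAE per count:", mae_per_class(y_true, y_pred))
print("Mean MAE:", mae(y_true, y_pred))
```

In this sketch, counts that never occur in the ground truth are simply left out of the per-class result.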

### Pretrained models

| Name | Number of Parameters | MAE on test set |
|----------|----------------------|-----------------|
| `RNN` | 0.31M | 0.38 |
| `F-CRNN` | 0.06M | 0.36 |
| `CRNN` | 0.35M | __0.27__ |


## FAQ

#### Is it possible to convert the model to run on a modern version of keras with tensorflow backend?

Yes, it's possible, but I was unable to get identical results when converting the model. I tried this [guide](https://github.com/keras-team/keras/wiki/Converting-convolution-kernels-from-Theano-to-TensorFlow-and-vice-versa), but the converted model still did not reach the performance obtained with keras 1.2.2 and theano.

## License

49 changes: 49 additions & 0 deletions env.yml
@@ -0,0 +1,49 @@
name: countnet
channels:
- defaults
dependencies:
- ca-certificates=2020.1.1
- certifi=2020.4.5.2
- intel-openmp=2019.4
- libcxx=10.0.0
- libedit=3.1.20191231
- libffi=3.3
- mkl=2019.4
- mkl-service=2.3.0
- ncurses=6.2
- openssl=1.1.1g
- pip=20.1.1
- python=3.6.10
- readline=8.0
- setuptools=47.3.0
- six=1.15.0
- sqlite=3.32.2
- tk=8.6.10
- wheel=0.34.2
- xz=5.2.5
- zlib=1.2.11
- pip:
- audioread==2.1.8
- backports-weakref==1.0rc1
- bleach==1.5.0
- cffi==1.14.0
- decorator==4.4.2
- h5py==2.10.0
- html5lib==0.9999999
- joblib==0.15.1
- keras==1.2.2
- librosa==0.7.2
- llvmlite==0.32.1
- markdown==2.2.0
- numba==0.43.0
- numpy==1.18.5
- protobuf==3.12.2
- pycparser==2.20
- pyyaml==5.3.1
- resampy==0.2.2
- scikit-learn==0.22
- scipy==1.4.1
- soundfile==0.10.3.post1
- theano==0.9.0
- werkzeug==1.0.1
- tqdm
101 changes: 101 additions & 0 deletions eval.py
@@ -0,0 +1,101 @@
import numpy as np
import soundfile as sf
import argparse
import os
import keras
import sklearn.preprocessing
import glob
import predict
import json
from keras import backend as K

import tqdm

eps = np.finfo(float).eps


def mae(y, p):
    return np.mean([abs(a - b) for a, b in zip(p, y)])


def mae_by_count(y, p):
    diffs = []
    for c in range(0, int(np.max(y)) + 1):
        ind = np.where(y == c)
        diff = mae(y[ind], np.round(p[ind]))
        diffs.append(diff)

    return diffs


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Load keras model and predict speaker count'
    )
    parser.add_argument(
        'root',
        help='root dir to evaluation data set'
    )

    parser.add_argument(
        '--model', default='CRNN',
        help='model name'
    )

    args = parser.parse_args()

    # load model
    model = keras.models.load_model(
        os.path.join('models', args.model + '.h5'),
        custom_objects={
            'class_mae': predict.class_mae,
            'exp': K.exp
        }
    )


    # print model configuration
    model.summary()

    # load standardisation parameters
    scaler = sklearn.preprocessing.StandardScaler()
    with np.load(os.path.join("models", 'scaler.npz')) as data:
        scaler.mean_ = data['arr_0']
        scaler.scale_ = data['arr_1']

    input_files = glob.glob(os.path.join(
        args.root, 'test', '*.wav'
    ))

    y_trues = []
    y_preds = []

    for input_file in tqdm.tqdm(input_files):

        metadata_file = os.path.splitext(
            os.path.basename(input_file)
        )[0] + ".json"
        metadata_path = os.path.join(args.root, 'test', metadata_file)

        with open(metadata_path) as data_file:
            data = json.load(data_file)
            # add ground truth (one JSON entry per speaker)
            y_trues.append(len(data))

        # load audio
        audio, rate = sf.read(input_file, always_2d=True)

        # downmix to mono
        audio = np.mean(audio, axis=1)

        count = predict.count(audio, model, scaler)
        # add prediction
        y_preds.append(count)

    y_preds = np.array(y_preds)
    y_trues = np.array(y_trues)


    mae_k = mae_by_count(y_trues, y_preds)
    print("MAE per Count: ", {k: v for k, v in enumerate(mae_k)})
    print("Mean MAE", mae(y_trues, y_preds))
Binary file added models/CNN.h5
Binary file not shown.
Binary file added models/CRNN.h5
Binary file not shown.
Binary file added models/F-CRNN.h5
Binary file not shown.
Binary file added models/RNN.h5
Binary file not shown.
Binary file removed models/RNN_keras2.h5
Binary file not shown.
1 change: 0 additions & 1 deletion models/RNN_keras2.json

This file was deleted.

88 changes: 88 additions & 0 deletions predict.py
@@ -0,0 +1,88 @@
import numpy as np
import soundfile as sf
import argparse
import os
import keras
import sklearn.preprocessing
import librosa
from keras import backend as K


eps = np.finfo(float).eps


def class_mae(y_true, y_pred):
    return K.mean(
        K.abs(
            K.argmax(y_pred, axis=-1) - K.argmax(y_true, axis=-1)
        ),
        axis=-1
    )


def count(audio, model, scaler):
    # compute STFT
    X = np.abs(librosa.stft(audio, n_fft=400, hop_length=160)).T

    # apply global (featurewise) standardization to mean 0, var 1
    X = scaler.transform(X)

    # cut to input shape length (500 frames x 201 STFT bins)
    X = X[:500, :]

    # normalize by the mean L2 norm over all frames
    Theta = np.linalg.norm(X, axis=1) + eps
    X /= np.mean(Theta)

    # add sample dimension
    X = X[np.newaxis, ...]

    if len(model.input_shape) == 4:
        X = X[:, np.newaxis, ...]

    ys = model.predict(X, verbose=0)
    return np.argmax(ys, axis=1)[0]


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='Load keras model and predict speaker count'
    )

    parser.add_argument(
        'audio',
        help='audio file (samplerate 16 kHz) of 5 seconds duration'
    )

    parser.add_argument(
        '--model', default='CRNN',
        help='model name'
    )

    args = parser.parse_args()

    # load model
    model = keras.models.load_model(
        os.path.join('models', args.model + '.h5'),
        custom_objects={
            'class_mae': class_mae,
            'exp': K.exp
        }
    )

    # print model configuration
    model.summary()

    # load standardisation parameters
    scaler = sklearn.preprocessing.StandardScaler()
    with np.load(os.path.join("models", 'scaler.npz')) as data:
        scaler.mean_ = data['arr_0']
        scaler.scale_ = data['arr_1']

    # load audio
    audio, rate = sf.read(args.audio, always_2d=True)

    # downmix to mono
    audio = np.mean(audio, axis=1)
    estimate = count(audio, model, scaler)
    print("Speaker Count Estimate: ", estimate)
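
The input shape mentioned in `count` (500 frames x 201 STFT bins) follows from the STFT settings. The sketch below works through that arithmetic, assuming 16 kHz audio of 5 seconds as stated in the argument help text, and assuming librosa's default centered framing, which the `librosa.stft` call above does not override. The helper name `stft_shape` is hypothetical and only serves the illustration.

```python
def stft_shape(num_samples, n_fft=400, hop_length=160):
    """Return (frames, bins) of a centered STFT magnitude spectrogram.

    Assumes centered framing: the signal is padded by n_fft // 2 on both
    sides, which yields 1 + num_samples // hop_length frames.
    """
    bins = 1 + n_fft // 2
    frames = 1 + num_samples // hop_length
    return frames, bins


sample_rate = 16000  # Hz, from the argument help text
duration = 5         # seconds, from the argument help text

frames, bins = stft_shape(sample_rate * duration)
print(frames, bins)            # frames and bins before truncation
print(min(frames, 500), bins)  # shape after the cut to 500 frames
```

At 16 kHz, `n_fft=400` and `hop_length=160` correspond to a 25 ms window and a 10 ms hop. The script as shown only truncates, so a clip shorter than about 5 seconds would reach the model with fewer than 500 frames.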

4 comments on commit 8e30524

@jonashaag
Contributor


Accidentally deleted requirements.txt?

@faroit
Owner Author

@faroit faroit commented on 8e30524 Sep 3, 2020


I went with conda only, but if you think it's helpful, I can revert that.

More importantly, if you have any idea how to go beyond keras+theano without losing performance, let me know ;-)

@jonashaag
Contributor


It breaks the Docker build, so I think it's useful :-) Will send a PR when I got it fixed. Some requirements need update

@faroit
Owner Author

@faroit faroit commented on 8e30524 Sep 3, 2020


> It breaks the Docker build, so I think it's useful :-) Will send a PR when I got it fixed. Some requirements need update

right, I didn't check that
