Official implementation of SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling (to appear in INTERSPEECH 2022).
This repository contains an older version of the code and is kept for compatibility.
The latest version is available here.
- Audio samples
- Audio effect transfer with Gradio + HuggingFace Spaces 🤗
- Clone this repository: `git clone https://github.com/Takaaki-Saeki/ssl_speech_restoration.git`
- Change into the repository: `cd ssl_speech_restoration`
- Install the Python packages and download some pretrained models: `./setup.sh`
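`./setup.sh` installs the dependencies; if you prefer to keep them isolated, one possible (unofficial) way is to create a virtual environment first, as sketched below.

```bash
# Optional sketch: run setup inside a virtual environment
# (not part of the official instructions; Python version requirements are whatever setup.sh expects).
python3 -m venv .venv
source .venv/bin/activate
./setup.sh
```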
- If you use the default Japanese corpora:
  - Download JSUT Basic5000 and the JVS Corpus.
  - Downsample them to 22.05 kHz and place them under `data/` as `jsut_22k` and `jvs_22k` (see the downsampling sketch after this list).
    - JSUT is a single-speaker dataset and requires the structure `jsut_22k/*.wav`. Note that this is the ground-truth clean speech corresponding to the simulated data and is not used for training; you may want to use `jsut_22k` only to compare the restored speech with the ground-truth speech.
    - JVS parallel100 contains 100-speaker data and requires the structure `jvs_22k/${spkr_name}/*.wav`. This clean speech dataset is used for the backward learning of the dual-learning method.
  - Place the simulated low-quality data under `./data` as `jsut_22k-low`.
- Or you can use arbitrary datasets by modifying the config files.
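One possible way to do the 22.05 kHz downsampling is sketched below, assuming `sox` is installed; the input path for the JSUT wav files is only an example, so adjust it to your local layout.

```bash
# Sketch: downsample JSUT Basic5000 wavs to 22.05 kHz into data/jsut_22k (paths are examples).
mkdir -p data/jsut_22k
for f in /path/to/jsut_basic5000/wav/*.wav; do
    sox "$f" -r 22050 "data/jsut_22k/$(basename "$f")"
done
```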
You can choose the `MelSpec` or `SourFilter` model with the `--config_path` option.
As shown in the paper, the `MelSpec` model gives higher-quality results.
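For example, the `${feature}` variable used in the commands below could be set as follows; the exact config directory names (e.g., whether the source-filter variant is called `sourcefilter`) should be checked under `configs/train/`.

```bash
# Pick the model/feature type used in the config paths below (directory names are assumptions).
feature=melspec        # or, e.g., feature=sourcefilter
```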
First, you need to split the data into train/val/test sets and dump them with the following command.
python preprocess.py --config_path configs/train/${feature}/ssl_jsut.yaml
To perform self-supervised learning with dual learning, run the following command.
python train.py \
--config_path configs/train/${feature}/ssl_jsut.yaml \
--stage ssl-dual \
--run_name ssl_melspec_dual
For other options, refer to `train.py`.
Note that you might need to tune some parameters for your own datasets.
In our experience, `learning_rate` and `beta` are crucial parameters.
For example, if the training is unstable, consider making `beta` smaller (e.g., `beta: 0.001`).
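One hypothetical way to try a smaller `beta` without touching the default config is to copy it and edit the value; where exactly `beta` appears in the YAML should be checked in the config itself.

```bash
# Sketch: clone the default config and lower beta for a more stable run (key location is an assumption).
cp configs/train/melspec/ssl_jsut.yaml configs/train/melspec/ssl_jsut_beta0.001.yaml
sed -i 's/beta: .*/beta: 0.001/' configs/train/melspec/ssl_jsut_beta0.001.yaml
```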
To perform speech restoration of the test data, run the following command.
python eval.py \
--config_path configs/test/${feature}/ssl_jsut.yaml \
--ckpt_path ${path to checkpoint} \
--stage ssl-dual \
--run_name ssl_melspec_dual
For other options, see `eval.py`.
You can run a simple audio effect transfer demo using a model pretrained with real data.
Run the following command.
python aet_demo.py
Or you can customize the dataset or model by editing `audio_effect_transfer.yaml` and running the following command.
python aet.py \
--config_path configs/test/melspec/audio_effect_transfer.yaml \
--stage ssl-dual \
--run_name aet_melspec_dual
For other options, see `aet.py`.
See here.
You can generate simulated low-quality data as in the paper with the following command.
python simulated_data.py \
--in_dir ${input_directory (e.g., path to jsut_22k)} \
--output_dir ${output_directory (e.g., path to jsut_22k-low)} \
--corpus_type ${single-speaker corpus or multi-speaker corpus} \
--deg_type lowpass
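For instance, a run on JSUT with lowpass degradation might look like the following; the exact value expected by `--corpus_type` is defined in `simulated_data.py`, so `single` here is only an assumption.

```bash
# Example invocation (the --corpus_type value is an assumption; check simulated_data.py).
python simulated_data.py \
    --in_dir data/jsut_22k \
    --output_dir data/jsut_22k-low \
    --corpus_type single \
    --deg_type lowpass
```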
Then download the pretrained model corresponding to the `deg_type` and run the following command.
python eval.py \
--config_path configs/train/${feature}/ssl_jsut.yaml \
--ckpt_path ${path to checkpoint} \
--stage ssl-dual \
--run_name ssl_melspec_dual
@article{saeki22selfremaster,
title={{SelfRemaster}: {S}elf-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling},
author={T. Saeki and S. Takamichi and T. Nakamura and N. Tanji and H. Saruwatari},
journal={arXiv preprint arXiv:2203.12937},
year={2022}
}