Commit 66c987d (parent: 4aeaec9): add genau and update dataset preperation
219 changed files with 448,277 additions and 11 deletions.
@@ -1,2 +1,8 @@
__pycache__
**/__pycache__
**/data
**/pretrained_models
**/*.ckpt
**/*.pt
.DS_Store
**/.DS_Store
@@ -0,0 +1,28 @@
__pycache__
taming
log
**/log
logs
**/logs
esc50.zip
ESC-50-master
*.wav
ckpt
lightning_logs
mlx_submit_*
job_queue.sh
*.txt
*.cleaned
audiocaps_train.json
dataset
checkpoints
*.tar
condor*
wandb
audioldm_train/modules/fit
core*
.vscode
compute_clap.py
cal_clap_score.py
run_logs
samples
@@ -0,0 +1,115 @@
[![arXiv](ARXIV ICON)](ARXIV LINK)

# GenAU inference, training and evaluation
- [Inference](#inference)
  * [Text to audio script](#text-to-audio)
  * [Gradio demo](#gradio-demo)
  * [Inference on a list of prompts](#inference-on-a-list-of-prompts)
- [Training](#training)
  * [GenAU](#genau)
  * [Finetuning GenAU](#finetuning-genau)
  * [1D-VAE (optional)](#1d-vae-optional)
- [Evaluation](#evaluation)
- [Cite this work](#cite-this-work)
- [Acknowledgements](#acknowledgements)

# Environment initialization
For initializing your environment, please refer to the [general README](../README.md).

# Inference

## Text to Audio
To quickly generate audio from an input text prompt, run
```shell
python scripts/text_to_audio.py --prompt "Horses growl and clop hooves." --model "genau-full-l"
```
- This automatically downloads and uses the model `genau-full-l` with default settings. You may change these parameters or provide your custom model config file and checkpoint path.
- Available models include `genau-full-l` (1.25B parameters) and `genau-full-s` (493M parameters).
- These models are trained to generate ambient sounds and are incapable of generating speech or music.
- Outputs are saved by default at `samples/model_output`, using the provided prompt as the file name.

## Gradio Demo
Run a local interactive demo with Gradio:
```shell
python app_text2audio.py
```

## Inference on a list of prompts
Optionally, you may prepare a `.txt` file with your target prompts and run

```shell
python scripts/inference_file.py --list_inference <path-to-prompts-file> --model <model_name>

# Example
python scripts/inference_file.py --list_inference samples/prompts_list.txt --model "genau-full-l"
```

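A minimal sketch of such a prompts file, assuming a plain-text format with one prompt per line (the format is an assumption, and `my_prompts.txt` is a hypothetical file):

```shell
# Assumption: one prompt per line in a plain-text file.
cat > my_prompts.txt <<'EOF'
A dog barks while birds chirp in the background.
Heavy rain falls on a tin roof.
Waves crash against a rocky shore.
EOF

python scripts/inference_file.py --list_inference my_prompts.txt --model "genau-full-s"
```
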
## Training

### Dataset
Please refer to the [dataset preparation README](../dataset_preperation/README.md) for instructions on downloading our dataset or preparing your own.

### GenAU
- Prepare a YAML config file for your experiments. A sample config file is provided at `settings/simple_runs/genau.yaml`.
- Specify your project name and provide your Wandb key in the config file. A Wandb key can be obtained from [https://wandb.ai/authorize](https://wandb.ai/authorize) (see the note below for an environment-variable alternative).
- Optionally, provide your S3 bucket and folder to save intermediate checkpoints.
- By default, checkpoints will be saved under `run_logs/genau/train` at the same level as the config file.

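If you prefer not to store the key in the config file, Weights & Biases also reads the standard `WANDB_API_KEY` environment variable; a minimal sketch:

```shell
# Alternative to putting the key in the YAML config:
# wandb picks up credentials from the WANDB_API_KEY environment variable.
export WANDB_API_KEY=<your-wandb-key>
```
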
```shell
# Training GenAU from scratch
python train/genau.py -c settings/simple_runs/genau.yaml
```

For multi-GPU (single-node) training, run
```shell
python -m torch.distributed.run --nproc_per_node=8 train/genau.py -c settings/simple_runs/genau.yaml
```
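For a genuinely multi-node job, `torch.distributed.run` additionally needs rendezvous arguments. A sketch for two nodes (run once per node with the matching `--node_rank`; the master address is a placeholder):

```shell
# Node 0 (set MASTER_ADDR to the address of the rank-0 node)
python -m torch.distributed.run --nnodes=2 --node_rank=0 \
    --master_addr=<MASTER_ADDR> --master_port=29500 \
    --nproc_per_node=8 train/genau.py -c settings/simple_runs/genau.yaml

# Node 1
python -m torch.distributed.run --nnodes=2 --node_rank=1 \
    --master_addr=<MASTER_ADDR> --master_port=29500 \
    --nproc_per_node=8 train/genau.py -c settings/simple_runs/genau.yaml
```
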
### Finetuning GenAU

- Prepare your custom dataset and obtain the dataset keys following the [dataset preparation README](../dataset_preperation/README.md).
- Make a copy of the default `genau-full-l` config file, which you can find under `pretrained_models/genau/genau-full-l.yaml`, and adjust it.
- Add ids for your dataset keys under the `dataset2id` attribute in the config file (see the sketch after the command below).

```shell
# Finetuning GenAU
python train/genau.py --reload_from_ckpt 'genau-full-l' \
                      --config <path-to-config-file> \
                      --dataset_keys "<dataset_key_1>" "<dataset_key_2>" ...
```

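As a rough sketch of the config edit from the list above, the `dataset2id` entry might look like the fragment below. The mapping schema and id values are assumptions, not confirmed by this README, and `my_dataset` is a hypothetical key:

```shell
# Hypothetical sketch: copy the default config, then add your dataset key
# under the existing dataset2id attribute (schema assumed, not confirmed):
#
#   dataset2id:
#     my_dataset: 0
#
cp pretrained_models/genau/genau-full-l.yaml my_finetune.yaml
# ...edit my_finetune.yaml to register the key, then fine-tune with --config my_finetune.yaml
```
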
### 1D VAE (Optional)
By default, we offer a pre-trained 1D-VAE for GenAU training. If you prefer, you can train your own VAE by following the instructions below.
- Prepare your own dataset following the instructions in the [dataset preparation README](../dataset_preperation/README.md).
- Prepare your YAML config file in a similar way to the GenAU config file.
- A sample config file is provided at `settings/simple_runs/1d_vae.yaml`.

```shell
python train/1d_vae.py -c settings/simple_runs/1d_vae.yaml
```

## Evaluation
- We follow [audioldm](https://github.com/haoheliu/AudioLDM-training-finetuning) to perform our evaluations.
- By default, models are evaluated periodically during training, as specified in the config file. For each evaluation, a folder with the generated audio is saved under `run_logs/train` at the same level as the specified config file.
- The code identifies the test dataset in an existing folder according to the number of samples. If you would like to test on a new test dataset, register it in `scripts/generate_and_eval`.

```shell
# Evaluate an existing folder of generated audio
python scripts/evaluate.py --log_path <path-to-the-experiment-folder>

# Generate test audios from a pre-trained checkpoint and run evaluation
python scripts/generate_and_eval.py -c <path-to-config> -ckpt <path-to-pretrained-ckpt>
```
The evaluation results will be saved in a JSON file at the same level as the generated audio folder.

# Cite this work
If you found this work useful, please consider citing it:

```TODO
```

# Acknowledgements
Our audio generation and evaluation codebase relies on [audioldm](https://github.com/haoheliu/AudioLDM-training-finetuning). We sincerely appreciate the authors for openly sharing their code.
@@ -0,0 +1,19 @@
ckpt/
*.pth
*.wav
*.npy
*.egg-info
__pycache__
vctk_test
.DS_*
script/*
datasets/*
test_fad/*
*.ckpt
*.json
audio
build
dist
*.pkl
pickle_check.py
test.py
@@ -0,0 +1,20 @@
Copyright (c) 2012-2022 Scott Chacon and others

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,104 @@ | ||
# Audio Generation Evaluation | ||
|
||
This toolbox aims to unify audio generation model evaluation for easier future comparison. | ||
|
||
## Quick Start | ||
|
||
First, prepare the environment | ||
```shell | ||
pip install git+https://github.com/haoheliu/audioldm_eval | ||
``` | ||
|
||
Second, generate test dataset by | ||
```shell | ||
python3 gen_test_file.py | ||
``` | ||
|
||
Finally, perform a test run. A result for reference is attached [here](https://github.com/haoheliu/audioldm_eval/blob/main/example/paired_ref.json). | ||
```shell | ||
python3 test.py # Evaluate and save the json file to disk (example/paired.json) | ||
``` | ||
|
||
## Evaluation metrics
This toolbox provides the following metrics:

- Recommended:
  - FAD: Fréchet audio distance (see the formula after this list)
  - ISc: Inception score
- Others, for reference:
  - FD: Fréchet distance, computed with PANNs, a state-of-the-art audio classification model
  - KID: Kernel inception distance
  - KL: KL divergence (softmax over logits)
  - KL_Sigmoid: KL divergence (sigmoid over logits)
  - PSNR: Peak signal-to-noise ratio
  - SSIM: Structural similarity index measure
  - LSD: Log-spectral distance

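For reference, FAD, like the image-domain FID, is the Fréchet distance between two Gaussians fitted to embedding statistics (VGGish embeddings in the original FAD formulation) of the generated set $g$ and the reference set $r$:

$$
\mathrm{FAD} = \lVert \mu_g - \mu_r \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_g + \Sigma_r - 2\,(\Sigma_g \Sigma_r)^{1/2}\right)
$$

where $\mu$ and $\Sigma$ are the mean and covariance of the embeddings.
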
The evaluation function accepts the paths of two folders as its main parameters.
1. If the two folders contain **the same number of files with the same names**, the evaluation runs in **paired mode**.
2. If the two folders contain **different numbers of files, or files with different names**, the evaluation runs in **unpaired mode**.

**The following metrics are only calculated in paired mode**: KL, KL_Sigmoid, PSNR, SSIM, LSD. In unpaired mode, these metrics return -1.

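As an illustration (hypothetical folders and file names), the mode is determined purely by file counts and names:

```shell
# Paired mode: same number of identically named files in both folders.
ls generation/   # -> 001.wav  002.wav  003.wav
ls reference/    # -> 001.wav  002.wav  003.wav

# Unpaired mode: counts or names differ; the paired-only metrics return -1.
ls generation/   # -> 001.wav  002.wav
ls reference/    # -> a.wav  b.wav  c.wav
```
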
## Evaluation on AudioCaps and AudioSet

The AudioCaps test set consists of audio files with multiple text annotations. To evaluate the performance of AudioLDM, we randomly selected one annotation per audio file; the selection can be found in the [accompanying json file](https://github.com/haoheliu/audioldm_eval/tree/c9e936ea538c4db7e971d9528a2d2eb4edac975d/example/AudioCaps).

Given the size of the AudioSet evaluation set (approximately 20,000 audio files), evaluating on the entire set may be impractical for audio generative models. As a result, we randomly selected 2,000 audio files for evaluation, with the corresponding annotations available in a [json file](https://github.com/haoheliu/audioldm_eval/tree/c9e936ea538c4db7e971d9528a2d2eb4edac975d/example/AudioSet).

For more information on our evaluation process, please refer to [our paper](https://arxiv.org/abs/2301.12503).

## Example

```python
import torch
from audioldm_eval import EvaluationHelper

# GPU acceleration is preferred
device = torch.device("cuda:0")

generation_result_path = "example/paired"
target_audio_path = "example/reference"

# Initialize a helper instance (16 kHz sample rate)
evaluator = EvaluationHelper(16000, device)

# Perform evaluation; results are printed and saved as JSON
metrics = evaluator.main(
    generation_result_path,
    target_audio_path,
    limit_num=None,  # set limit_num=X (int) to evaluate only X pairs of data
)
```

## Note

- Update on 24 June 2023:
  - **Issues with model evaluation:** I found that the PANNs-based Fréchet distance (FD) and KL score are sometimes not as robust as FAD. For example, when the generations are all silent audio, FD and KL still indicate that the model performs very well, while FAD and the Inception Score (IS) still reflect the model's truly poor performance. Sometimes the audio resampling method can also significantly affect FD (±30) and KL (±0.4).
  - To address this issue, in another branch of this repo ([passt_replace_panns](https://github.com/haoheliu/audioldm_eval/tree/passt_replace_panns)), I replaced the PANNs model with PaSST, which I found to be more robust to the resampling method and other trivial mismatches.

- **Update on code:** The calculation of FAD is slow. Now, after each calculation on a folder, the code saves the FAD features into an .npy file for later reuse.

## TODO

- [ ] Add pretrained AudioLDM model.
- [ ] Add CLAP score.

## Cite this repo

If you found this tool useful, please consider citing:
```bibtex
@article{liu2023audioldm,
  title={AudioLDM: Text-to-Audio Generation with Latent Diffusion Models},
  author={Liu, Haohe and Chen, Zehua and Yuan, Yi and Mei, Xinhao and Liu, Xubo and Mandic, Danilo and Wang, Wenwu and Plumbley, Mark D},
  journal={arXiv preprint arXiv:2301.12503},
  year={2023}
}
```

## Reference

> https://github.com/toshas/torch-fidelity
> https://github.com/v-iashin/SpecVQGAN
@@ -0,0 +1,7 @@
from .metrics.fid import calculate_fid
from .metrics.isc import calculate_isc
from .metrics.kid import calculate_kid
from .metrics.kl import calculate_kl
from .eval import EvaluationHelper

print("2023-06-22")