In addition to training losses, AudioCraft provides a set of objective metrics for audio synthesis and audio generation. As these metrics may require extra dependencies and can be costly to compute, they are often disabled by default. This section provides guidance for setting up and using these metrics in the AudioCraft training pipelines.
We provide an implementation of the Scale-Invariant Signal-to-Noise Ratio in PyTorch. No specific requirement is needed for this metric. Please activate the metric at the evaluation stage with the appropriate flag:
dora run <...> evaluate.metrics.sisnr=true
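As an illustration of what the metric measures, here is a minimal plain-Python sketch of SI-SNR for two equal-length mono signals. It is not AudioCraft's PyTorch implementation; the function name `si_snr` and the `eps` stabilizer are assumptions made for this example.

```python
import math


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length 1-D signals."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so that a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]
    # Project the estimate onto the reference to isolate the target component.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]
    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate by a constant gain leaves the score essentially unchanged, which is the property the name refers to.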
We provide a Python wrapper around the ViSQOL official implementation to conveniently run ViSQOL within the training pipelines.
One must specify the path to the ViSQOL installation through the configuration in order to enable ViSQOL computations in AudioCraft:
# the first parameter activates the visqol computation while the second specifies
# the path to visqol's library to be used by our python wrapper
dora run <...> evaluate.metrics.visqol=true metrics.visqol.bin=<path_to_visqol>
See an example grid: Compression with ViSQOL
To learn more about ViSQOL and how to build the ViSQOL binary using Bazel, please refer to the instructions available in the open source repository.
Similarly to ViSQOL, we use a Python wrapper around the Frechet Audio Distance official implementation in TensorFlow.
Note that we had to make several changes to the actual code in order to make it work. Please refer to the FrechetAudioDistanceMetric class documentation for more details. We do not plan to provide further support in obtaining a working setup for the Frechet Audio Distance at this stage.
# the first parameter activates the FAD metric computation while the second specifies
# the path to the FAD library to be used by our python wrapper
dora run <...> evaluate.metrics.fad=true metrics.fad.bin=<path_to_google_research_repository>
See an example grid: Evaluation with FAD
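For reference, the Frechet distance between two multivariate Gaussians, which FAD applies to embedding statistics of a reference set and an evaluated set, is commonly written as follows. The symbols $\mu_r, \Sigma_r$ (mean and covariance of the reference embeddings) and $\mu_g, \Sigma_g$ (mean and covariance of the evaluated embeddings) are notation chosen for this note, not names taken from the AudioCraft code.

$$
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values indicate that the two embedding distributions are closer.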
We provide a PyTorch implementation of the Kullback-Leibler Divergence (KLD) computed over the label probabilities produced by a state-of-the-art audio classifier. Our implementation uses the PaSST classifier.
In order to use the KLD metric over PaSST, you must install the PaSST library as an extra dependency:
pip install 'git+https://github.com/kkoutini/passt_hear21@0.0.19#egg=hear21passt'
Then, similarly, you can use the metric by activating the corresponding flag:
# one could extend the kld metric with additional audio classifier models that can then be picked through the configuration
dora run <...> evaluate.metrics.kld=true metrics.kld.model=passt
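The idea behind the metric can be sketched in plain Python: the classifier's logits for a reference clip and for a generated clip are turned into label probabilities, and the divergence between the two distributions is measured. This is a simplified illustration, not the AudioCraft code; the helper names, the `eps` smoothing, and the choice of treating the reference distribution as `p` are assumptions of this example.

```python
import math


def softmax(logits):
    """Convert classifier logits into label probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) between two discrete distributions over the same labels."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same labels")
    # eps keeps the logarithm finite when a probability is (near) zero.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Hypothetical logits over three labels for a reference and a generated clip.
reference_probs = softmax([2.0, 1.0, 0.1])
generated_probs = softmax([0.1, 1.0, 2.0])
kld = kl_divergence(reference_probs, generated_probs)
```

A value of zero means the classifier assigns identical label probabilities to both clips; larger values mean the generated audio is classified differently from the reference.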
We provide a text-consistency metric, similar to the MuLan Cycle Consistency from MusicLM or the CLAP score used in Make-An-Audio. More specifically, we provide a PyTorch implementation of a text consistency metric relying on a pre-trained Contrastive Language-Audio Pretraining (CLAP) model.
Please install the CLAP library as an extra dependency prior to using the metric:
pip install laion_clap
Then, similarly, you can use the metric by activating the corresponding flag:
# one could extend the text consistency metric with additional models that can then be picked through the configuration
dora run <...> evaluate.metrics.text_consistency=true metrics.text_consistency.model=clap
Note that the text consistency metric based on CLAP will require the CLAP checkpoint to be provided in the configuration.
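Conceptually, a CLAP-style score compares the embedding of each text description with the embedding of the corresponding audio. The sketch below assumes the embeddings have already been computed and are given as plain lists of floats; the function names are hypothetical and the code does not call the CLAP library.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, eps)


def text_consistency(text_embeddings, audio_embeddings):
    """Mean cosine similarity over paired (text, audio) embeddings."""
    if len(text_embeddings) != len(audio_embeddings):
        raise ValueError("expected one audio embedding per text embedding")
    if not text_embeddings:
        raise ValueError("at least one pair is required")
    scores = [
        cosine_similarity(t, a)
        for t, a in zip(text_embeddings, audio_embeddings)
    ]
    return sum(scores) / len(scores)
```

A score close to 1 indicates that the audio embedding points in the same direction as its text embedding, while a score near 0 indicates no alignment.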
Finally, as introduced in MusicGen, we provide a Chroma Cosine Similarity metric in PyTorch. No specific requirement is needed for this metric. Please activate the metric at the evaluation stage with the appropriate flag:
dora run <...> evaluate.metrics.chroma_cosine=true
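As a rough illustration, the sketch below compares two chromagrams frame by frame: each frame is quantized to its strongest pitch class, and the cosine similarity of matching frames is averaged over time. The quantization step, the truncation to the shorter sequence, and all names are assumptions of this example rather than a description of the AudioCraft implementation.

```python
import math


def one_hot_argmax(frame):
    """Quantize a chroma frame by keeping only its strongest pitch class."""
    best = max(range(len(frame)), key=lambda i: frame[i])
    return [1.0 if i == best else 0.0 for i in range(len(frame))]


def chroma_cosine_similarity(ref_chroma, gen_chroma, eps=1e-8):
    """Average frame-wise cosine similarity of two quantized chromagrams."""
    # Compare only the time steps present in both sequences.
    n_frames = min(len(ref_chroma), len(gen_chroma))
    if n_frames == 0:
        raise ValueError("both chromagrams must contain at least one frame")
    total = 0.0
    for ref_frame, gen_frame in zip(ref_chroma[:n_frames], gen_chroma[:n_frames]):
        r = one_hot_argmax(ref_frame)
        g = one_hot_argmax(gen_frame)
        dot = sum(a * b for a, b in zip(r, g))
        norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in g))
        total += dot / max(norm, eps)
    return total / n_frames
```

With one-hot frames, the score reduces to the fraction of time steps where the reference and the generated audio share the same dominant pitch class. The test data for a real chromagram would have 12 bins per frame.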
For all the above audio generation metrics, we offer the option to compute the metric on the reconstructed audio fed in EnCodec instead of the generated sample, using the flag <metric>.use_gt=true.
You will find example configurations for the different metrics introduced above in:
- The musicgen's default solver for all audio generation metrics
- The compression's default solver for all audio synthesis metrics
Similarly, we provide different examples in our grids: