articubench

A benchmark to evaluate articulatory speech synthesis systems. This benchmark uses VocalTractLab (VTL) [1] as its articulatory speech synthesis simulator.

Warning

This package is not complete yet and will hopefully be released in March 2023 alongside ESSV 2023. If you are interested in the current status, get in touch with @derNarr or @paulovic96.

Installation

pip install articubench

Overview

[Figure: box-and-arrow overview of the data flow and tasks of the articubench benchmark.]

The benchmark defines three tasks: semantic only, semantic-acoustic, and acoustic only.
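As a rough sketch of how a control model might be plugged into these tasks, the snippet below shows the shape such an evaluation could take. Note that `run_benchmark`, its arguments, and the control-model signature are illustrative assumptions, not the confirmed articubench API; consult the package itself for the actual entry points.

```python
# Hypothetical usage sketch: the entry point `run_benchmark`, its
# arguments, and the control-model signature shown here are
# assumptions for illustration, not the confirmed articubench API.
import numpy as np

def my_control_model(target_semvec=None, target_audio=None, sampling_rate=None):
    """Map a target (semantic vector and/or audio, depending on the task)
    to VocalTractLab control parameter trajectories."""
    n_frames = 100                   # placeholder trajectory length
    cps = np.zeros((n_frames, 30))   # placeholder VTL control parameters
    return cps

# import articubench
# results = articubench.run_benchmark(
#     my_control_model,
#     task='acoustic',  # or 'semantic', 'semantic-acoustic' (names assumed)
#     size='tiny',      # or 'small', 'normal' (names assumed)
# )
```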

Control Model comparison

Comparison of the PAULE, inverse, segment-based, and baseline control models along different model properties. The memory demand includes the Python binaries (200 MB). The segment model needs an embedder and the MFA in its pipeline; the corresponding figures are given in parentheses.

| Property                        | PAULE | Inverse | Seg-Model*      | Baseline-Model |
|---------------------------------|-------|---------|-----------------|----------------|
| Trainable parameters [million]  | 15.6  | 2.6     | 0 (6.6 + MFA)   | 0              |
| Execution time [seconds]        | 200   | 0.2     | 0.5 (0.3 + MFA) | < 0.0001       |
| Memory demand [MB]              | 5600  | 5000    | 2 (5100 + MFA)  | 200            |
| Energy used for training [kWh]  | 393   | 1.9     | 0.0 (7.6 + MFA) | 0.0            |

Benchmark Results

Results will be published when they are available.

| Tiny                | PAULE | Inverse | Seg-Model* | Baseline-Model |
|---------------------|-------|---------|------------|----------------|
| Total Score         |       |         |            |                |
| Articulatory Scores |       |         |            |                |
| Semantic Scores     |       |         |            |                |
| Acoustic Scores     |       |         |            |                |
| Tongue Height       |       |         |            |                |
| EMA Sensors         |       |         |            |                |
| Max Velocity        |       |         |            |                |
| Max Jerk            |       |         |            |                |
| Classification      |       |         |            |                |
| Semantic RMSE       |       |         |            |                |
| Loudness Envelope   |       |         |            |                |
| Spectrogram RMSE    |       |         |            |                |

| Small               | PAULE | Inverse | Seg-Model* | Baseline-Model |
|---------------------|-------|---------|------------|----------------|
| Total Score         |       |         |            |                |
| Articulatory Scores |       |         |            |                |
| Semantic Scores     |       |         |            |                |
| Acoustic Scores     |       |         |            |                |
| Tongue Height       |       |         |            |                |
| EMA Sensors         |       |         |            |                |
| Max Velocity        |       |         |            |                |
| Max Jerk            |       |         |            |                |
| Classification      |       |         |            |                |
| Semantic RMSE       |       |         |            |                |
| Loudness Envelope   |       |         |            |                |
| Spectrogram RMSE    |       |         |            |                |

| Normal              | PAULE | Inverse | Seg-Model* | Baseline-Model |
|---------------------|-------|---------|------------|----------------|
| Total Score         |       |         |            |                |
| Articulatory Scores |       |         |            |                |
| Semantic Scores     |       |         |            |                |
| Acoustic Scores     |       |         |            |                |
| Tongue Height       |       |         |            |                |
| EMA Sensors         |       |         |            |                |
| Max Velocity        |       |         |            |                |
| Max Jerk            |       |         |            |                |
| Classification      |       |         |            |                |
| Semantic RMSE       |       |         |            |                |
| Loudness Envelope   |       |         |            |                |
| Spectrogram RMSE    |       |         |            |                |

Literature

First ideas about the articubench benchmark were presented at the ESSV2022:

https://www.essv.de/paper.php?id=1140
@InProceedings{ESSV2022_1140,
  title     = {Articubench - An articulatory speech synthesis benchmark},
  author    = {Konstantin Sering and Paul Schmidt-Barbo},
  year      = {2022},
  pages     = {43--50},
  keywords  = {Articulatory Synthesis},
  booktitle = {Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2022},
  editor    = {Oliver Niebuhr and Malin Svensson Lundmark and Heather Weston},
  publisher = {TUDpress, Dresden},
  isbn      = {978-3-95908-548-9}
}

Corpora

Data used here comes from the following speech corpora:

  • KEC (EMA data, acoustics)
  • baba-babi-babu speech rate (ultrasound; acoustics)
  • Mozilla Common Voice
  • GECO (only phonetic transcription; duration and phone)

Prerequisites

For running the benchmark:

  • Python >= 3.8
  • Praat
  • VTL API 2.5.1quantling (included in this repository)

Additionally, for creating the benchmark:

  • mfa (Montreal Forced Aligner)

License

  • VTL is licensed under GPLv3.0+

Acknowledgements

This research was supported by an ERC Advanced Grant (no. 742545) and the University of Tübingen.

Links