The cues used by human observers to predict the emotional state of an interlocutor are ambiguous, and this is particularly true for vocal emotion (Atias et al., 2019). Efforts to predict the negative-to-positive degree of “pleasantness” of emotional speech (termed “valence”) have been especially fraught (Busso & Rahman, 2012). According to prior work using time-series models (Ong et al., 2021), the perception of speaker valence by human listeners in an auditory-only modality is predominantly based on signal semantics, with acoustic features such as prosodic contour and voice quality demonstrating weaker explanatory power. Here, I investigate several linear regression models to compare the explanatory power of semantic and acoustic information, both alone and in combination. Furthermore, I explore the extent to which a fine-tuned, self-supervised transformer model is able to simulate human behavior in valence ratings of natural, spoken narratives.
This project analyzes data provided in the first release of the Stanford Emotional Narratives Dataset (SEND). For results of analyses to date, please see the results folder of this repository, which includes results both as a README (for easy online viewing) and as a .docx file.
Repository License: CC BY-SA 4.0
acousticsExtractor.py
inputs: audio files in .wav format
outputs: for each input .wav file, a .csv file with 88 extracted acoustic features (eGeMAPS) for every five-second window in the file {data/egemaps}
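For illustration, a minimal sketch of this kind of windowed extraction, assuming the opensmile Python package; the function and file names below are hypothetical, and the actual extraction code in acousticsExtractor.py may differ:

```python
import opensmile
import pandas as pd
import soundfile as sf

# eGeMAPSv02 functionals yield 88 summary features per analyzed segment
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_windows(wav_path, window_s=5.0):
    """Extract eGeMAPS functionals for each non-overlapping 5-second window."""
    info = sf.info(wav_path)
    duration = info.frames / info.samplerate
    rows, start = [], 0.0
    while start < duration:
        end = min(start + window_s, duration)
        # start/end offsets are given in seconds
        feats = smile.process_file(wav_path, start=start, end=end)
        rows.append(feats.assign(window_start_s=start))
        start += window_s
    return pd.concat(rows).reset_index(drop=True)

# e.g. extract_windows("example.wav").to_csv("data/egemaps/example.csv", index=False)
```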
inputs:
- time-aligned transcription files from the SEND dataset
- time-aligned, aggregated (Evaluator Weighted Estimator) human valence ratings from the SEND dataset
- acoustic features extracted (via acousticsExtractor.py) from .wav files exported from SEND videos
- lexical valence and arousal norms collected by Warriner et al. (2013)
outputs: a dataframe in which each row represents a five-second window, with columns identifying the window and providing its human rating, lexical/semantic features, and acoustic features {"data.csv"}
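As an illustration of the lexical/semantic step, a sketch that averages Warriner et al. (2013) valence norms over the words transcribed in each window; all file and column names here are hypothetical stand-ins, not necessarily those used by the build script:

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
norms = pd.read_csv("warriner_norms.csv")  # e.g. columns: Word, V.Mean.Sum
lex_valence = dict(zip(norms["Word"].str.lower(), norms["V.Mean.Sum"]))

def window_lexical_valence(transcript):
    """Mean Warriner valence norm over a window's transcribed words;
    words with no norm entry are skipped."""
    words = str(transcript).lower().split()
    scores = [lex_valence[w] for w in words if w in lex_valence]
    return sum(scores) / len(scores) if scores else float("nan")

windows = pd.read_csv("windows.csv")  # hypothetical: one row per 5-second window
windows["lexical_valence"] = windows["transcript"].map(window_lexical_valence)
```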
inputs:
- audio files in .wav format
- data.csv
outputs: a dataframe in which each row represents a five-second window, with columns identifying the window and providing its human rating and model-predicted rating {"llm_prediction_data.csv"}
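A minimal sketch of one way to obtain per-window valence predictions from a fine-tuned self-supervised transformer, here using a wav2vec 2.0 audio backbone with a single-output head; the checkpoint, head, and preprocessing are assumptions, not necessarily the model used in this project:

```python
import soundfile as sf
import torch
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

# num_labels=1 gives a single-output head usable for valence regression;
# the checkpoint below is an assumed backbone and must be fine-tuned first
checkpoint = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(checkpoint)
model = AutoModelForAudioClassification.from_pretrained(checkpoint, num_labels=1)
model.eval()

def predict_window_valence(wav_path, start_s, sr=16000):
    """Predict a valence score for one 5-second window of a .wav file."""
    audio, file_sr = sf.read(wav_path)
    assert file_sr == sr, "resample to 16 kHz before inference"
    window = audio[int(start_s * sr): int((start_s + 5.0) * sr)]
    inputs = extractor(window, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1)
    return logits.item()
```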
inputs: data.csv
outputs: for each of four models, a plot comparing the human gold-standard valence rating for each five-second window against the model-predicted rating, annotated with the coefficient of determination, Pearson correlation coefficient, and concordance correlation coefficient; the script also contains inline code for exploring the beta weights of statistically significant predictors in each model {figs}
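For reference, the three reported agreement metrics can be computed as follows; scikit-learn and SciPy cover the first two, and the concordance correlation coefficient follows directly from its definition:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient:
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    cov = np.cov(y_true, y_pred, bias=True)[0, 1]
    return 2 * cov / (
        np.var(y_true) + np.var(y_pred) + (np.mean(y_true) - np.mean(y_pred)) ** 2
    )

human = np.array([0.10, 0.40, 0.35, 0.80])  # toy gold-standard ratings
model = np.array([0.00, 0.50, 0.30, 0.90])  # toy model predictions
print(r2_score(human, model))        # coefficient of determination
print(pearsonr(human, model)[0])     # Pearson correlation coefficient
print(concordance_cc(human, model))  # concordance correlation coefficient
```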
inputs: llm_prediction_data.csv
outputs: a plot comparing the human gold-standard valence rating for each five-second window against the model-predicted rating, annotated with the coefficient of determination, Pearson correlation coefficient, and concordance correlation coefficient {figs}
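One plausible rendering of such a comparison plot, as a scatter of model against human ratings; the repository's actual figures may instead overlay the two time series, and the column names below are hypothetical:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("llm_prediction_data.csv")  # assumed columns: human_valence, model_valence
lo = df[["human_valence", "model_valence"]].min().min()
hi = df[["human_valence", "model_valence"]].max().max()

fig, ax = plt.subplots()
ax.scatter(df["human_valence"], df["model_valence"], s=8, alpha=0.5)
ax.plot([lo, hi], [lo, hi], linestyle="--")  # identity line: perfect agreement
ax.set_xlabel("Human (EWE) valence rating")
ax.set_ylabel("Model-predicted valence rating")
fig.savefig("figs/model_vs_human_valence.png", dpi=200)
```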