
Speech Emotion Recognition (SER) #6

Open
mirix opened this issue Aug 10, 2023 · 9 comments

@mirix
Owner

mirix commented Aug 10, 2023

Apparently, and according to its own creators, the audEERING model was not the wisest of choices.

To address such shortcomings, I have forked CMU-MOSEI:

https://github.com/mirix/messaih

The fork is tailored towards Speech Emotion Recognition (SER).

The idea now is to train a model on messAIh and see how it behaves in a real-life scenario.

@YuryKonvpalto

Hi Mirix,

Please tell me, how did you convert the model's VAD values ([0...1]) to Euclidean space ([-1...1])? The model's values always seem to be around 0.3-something each.

@mirix
Owner Author

mirix commented Aug 24, 2023

I tried different normalisation strategies from sklearn, but, in the end, I settled on this:

# rescale from the [0, 1] range to the [-1, 1] range
vad_sample = (vad_sample - .5) * 2

If you think about it, it is like converting, say, Fahrenheit to Celsius: you have to shift the zero point and rescale the size of the degree:

https://www.calculatorsoup.com/calculators/conversions/fahrenheit-to-celsius.php
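
For illustration, here is a minimal sketch of that rescaling applied to a hypothetical [valence, arousal, dominance] triple (the input values are made up, not model output):

import numpy as np

# hypothetical [valence, arousal, dominance] scores on the [0, 1] scale
vad_sample = np.array([0.9, 0.5, 0.2])

# shift the zero point to 0.5, then stretch the range to [-1, 1]
vad_scaled = (vad_sample - .5) * 2
print(vad_scaled)  # [ 0.8  0.  -0.6]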

@YuryKonvpalto

OK. But their model always predicts values for each V-A-D starting with 0.3... (I have tested a lot of audio files but never got values above or below the 0.3 range). So only the digits after 0.3... differ.

With your conversion (vad_sample = (vad_sample - .5) * 2), it would always give a negative value. E.g., V = 0.334, A = 0.330, D = 0.336: if we subtract 0.5 from each value, everything goes negative.

In your code you stipulate 'Centroids of the Ekman emotions in a VAD diagram'. For JOY it is {'v': 0.76, 'a': 0.48, 'd': 0.35}. But for the reasons mentioned above, we cannot even get a small positive value if we convert like that (vad_sample = (vad_sample - .5) * 2).

Am I missing something?
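
Here is a minimal sketch of the nearest-centroid mapping I understand your script to use, assuming the centroids live on the [-1, 1] scale (the JOY centroid is from your code; the other centroids here are hypothetical placeholders):

import numpy as np

# centroids of Ekman emotions in VAD space
# (JOY from the script; the others are hypothetical placeholders)
centroids = {
    'joy':     np.array([0.76, 0.48, 0.35]),
    'sadness': np.array([-0.63, 0.27, -0.33]),  # hypothetical
    'fear':    np.array([-0.43, 0.67, -0.17]),  # hypothetical
}

def nearest_emotion(vad):
    # vad: a [valence, arousal, dominance] point on the [-1, 1] scale
    return min(centroids, key=lambda e: np.linalg.norm(vad - centroids[e]))

print(nearest_emotion(np.array([0.8, 0.0, -0.6])))  # joy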

@mirix
Owner Author

mirix commented Aug 24, 2023

It works relatively well for me. If you are getting constant values, my guess is that there is an issue with your embeddings... which may actually come from your audio.

I am having the same problem when trying to fine-tune a wav2vec2 model: I obtain constant values, and I suspect the issue comes from reading the NumPy arrays with the feature extractor, but I am still investigating.
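
For reference, here is a minimal sketch of the input pipeline I mean, assuming the standard transformers and librosa APIs (the checkpoint name and file path are just examples):

import librosa
from transformers import Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base')

# wav2vec2 expects mono float audio at 16 kHz; a sampling-rate mismatch
# between the file and the extractor is a classic source of degenerate,
# near-constant outputs
speech, sr = librosa.load('sample.wav', sr=16000, mono=True)
inputs = extractor(speech, sampling_rate=sr, return_tensors='pt')
print(inputs.input_values.shape)  # e.g. torch.Size([1, num_samples])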

@YuryKonvpalto

It's strange... Maybe you downloaded a model version from Hugging Face that works a little differently from the version deployed on HF now?..
Using the Hugging Face version of the model (via HF itself), it always gives values starting with 0.3.
The values are not constant, though: the digits after 0.3 (i.e. after the first decimal) differ and vary with the input audio.

I guess the zero point for each V-A-D is 0.3333333...
So if a value goes below 0.3333, I assume it becomes negative;
if it goes above 0.3333, I assume it becomes positive.

May I kindly ask you to check the version of the model deployed on HF when you have a moment: do you get VAD values higher or lower than 0.3...?
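
A minimal sketch of the re-centring I am suggesting, treating 1/3 as the neutral point (this is only my assumption about the model's output scale, not anything from its documentation):

def recentre(x, zero=1/3):
    # map [0, zero) onto [-1, 0) and [zero, 1] onto [0, 1]
    if x < zero:
        return (x - zero) / zero
    return (x - zero) / (1 - zero)

print(recentre(0.334))  # ~0.001, just above neutral
print(recentre(0.300))  # -0.1, below neutral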

@mirix
Owner Author

mirix commented Aug 24, 2023

The model is over one year old (so it is unlikely to have changed on HF); my guess is that the issue is feature extraction.

@mirix
Owner Author

mirix commented Aug 25, 2023

Anyway, I am not using that model anymore. I am trying to train my own. I just posted the script to hear people's opinions on the VAD-to-Ekman conversion.

It actually works relatively well (compared to other models, of course). Sentence by sentence there are many errors, but if you consider the conversation as a whole, clusters of certain emotions are typically good indicators for flagging the conversation.

The main issue is that we are particularly interested in detecting fear, and it seems that this is precisely one of the model's weak points. The problem is the training dataset.

@YuryKonvpalto

Very interesting. I found even better research on VAD-to-Ekman conversion in this paper: https://www.researchgate.net/publication/284724383_Affect_Representation_and_Recognition_in_3D_Continuous_Valence-Arousal-Dominance_Space

It takes 15 emotions and tabulates their mean values and standard deviations for each of V, A and D, and provides the Euclidean distances between all 15 basic emotions. Very interesting.
From what I have read, fear perception remains the most challenging task. I think that comes from psychology: one rarely masks joy or sadness, but almost everyone tries to mask their fear.

What I am trying to achieve is a web app that records a conversation and sends chunks of it (3-5 s) to the model on the fly for emotion evaluation. Based on the evaluated atmosphere, it suggests background music (pieces whose VAD corresponds to the conversation's VAD).
In theory (if we consider VAD as a vector), you can add the VAD vector of a music piece to it, either counteracting the conversation's VAD vector (e.g. the joyful (positive) vector of a music piece added to the sad (negative) vector of a conversation makes it at least neutral) or enhancing it (making a positive conversation VAD vector even more positive).
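
A minimal sketch of that vector idea, assuming a hypothetical catalogue of music pieces tagged with VAD vectors on the [-1, 1] scale:

import numpy as np

# hypothetical catalogue: track name -> [valence, arousal, dominance] tag
catalogue = {
    'calm_piano':  np.array([0.3, -0.4, 0.0]),
    'upbeat_jazz': np.array([0.7, 0.5, 0.2]),
    'dark_drone':  np.array([-0.6, 0.1, -0.2]),
}

def pick_track(conversation_vad, target=np.zeros(3)):
    # choose the piece whose VAD, added to the conversation's VAD,
    # lands closest to the target mood (neutral by default)
    return min(catalogue,
               key=lambda t: np.linalg.norm(conversation_vad + catalogue[t] - target))

sad_talk = np.array([-0.5, -0.2, -0.1])
print(pick_track(sad_talk))  # upbeat_jazz: its positive vector offsets the sad one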

@mirix
Owner Author

mirix commented Aug 25, 2023

It sounds amazing.
