Speech Emotion Recognition (SER) #6
Hi Mirix, please tell me: how did you convert the model's VAD values ([0..1]) to Euclidean space ([-1..1])? The model's values always seem to be around 0.3... for each dimension.
I tried different normalisation strategies from sklearn, but in the end I settled on this:
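A minimal sketch of that rescaling (the original snippet is not reproduced here; the variable name and sample values are illustrative, but the formula is the one quoted in the reply below):

```python
import numpy as np

# Raw model outputs: valence, arousal, dominance, each in [0, 1]
vad_sample = np.array([0.62, 0.55, 0.48])

# Shift the zero point from 0.5 to 0 and double the scale,
# mapping [0, 1] linearly onto [-1, 1]
vad_sample = (vad_sample - .5) * 2
```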
If you think about it, it is like converting, say, Fahrenheit to Celsius: you have to shift the zero and rescale the size of the degree: https://www.calculatorsoup.com/calculators/conversions/fahrenheit-to-celsius.php
OK. But their model always predicts values for each of V, A and D starting with 0.3... (I have tested a lot of audio files, but never got values above or below 0.3...), so the digits that differ only come after the 0.3. With your conversion (vad_sample = (vad_sample - .5) * 2), that always gives a negative value, e.g. (0.33 - 0.5) * 2 = -0.34. Yet in your code you stipulate 'Centroids of the Ekman emotions in a VAD diagram', and for JOY it is {'v': 0.76, 'a': 0.48, 'd': 0.35}. Am I missing something?
It works relatively well for me. If you obtain constant values then, if I had to guess, there is an issue with your embeddings, which may actually come from your audio. I am having the same problem when trying to fine-tune a wav2vec2 model: I obtain constant values, and I am guessing that the issue comes from reading the numpy arrays with the feature extractor, but I am still investigating.
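For what it's worth, a minimal sanity check of the feature-extraction step (a sketch only: the checkpoint name and file path are placeholders, and it assumes 16 kHz mono audio):

```python
import librosa
from transformers import Wav2Vec2FeatureExtractor

# Load the audio as a float32 numpy array at the rate the model expects
audio, sr = librosa.load('sample.wav', sr=16000, mono=True)

# Placeholder checkpoint; use whichever wav2vec2 model you are probing
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('facebook/wav2vec2-base-960h')

# Passing sampling_rate makes the extractor complain on a rate mismatch
inputs = feature_extractor(audio, sampling_rate=sr, return_tensors='pt')

# If this shape (or the values) is degenerate across different files,
# the constant outputs are explained before the model is even called
print(inputs.input_values.shape)
```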
It's strange... Maybe you downloaded a model version from HuggingFace that works a little differently from the version currently deployed on HF?.. I guess that for each of V, A and D the zero point is 0.3333333... May I kindly ask you, when you have a chance, to check the version of the model deployed on HF: do you get values for any of V, A or D higher or lower than 0.3...?
The model is over 1 year old, so my guess is that the issue is feature extraction. |
Anyway, I am not using that model anymore; I am trying to train my own. I just posted the script to hear people's opinions on the VAD-to-Ekman conversion. It actually works relatively well (compared to other models, of course). Sentence by sentence there are many errors, but if you consider the conversation as a whole, clusters of certain emotions are typically good indicators for flagging it. The main issue is that we are particularly interested in detecting fear, and that seems to be precisely one of the model's weak points. The problem is the training dataset.
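For anyone reading along, the VAD-to-Ekman step boils down to a nearest-centroid lookup in VAD space. A minimal sketch (the JOY centroid is the one quoted above; the FEAR centroid is an illustrative placeholder, not the script's actual value):

```python
import numpy as np

# Centroids of the Ekman emotions in a VAD diagram.
# JOY is the value quoted earlier in this thread;
# FEAR is an illustrative placeholder.
CENTROIDS = {
    'joy':  np.array([0.76, 0.48, 0.35]),
    'fear': np.array([-0.64, 0.60, -0.43]),
}

def vad_to_ekman(vad):
    """Return the Ekman emotion whose centroid is closest in Euclidean distance."""
    return min(CENTROIDS, key=lambda e: np.linalg.norm(CENTROIDS[e] - vad))

print(vad_to_ekman(np.array([0.5, 0.3, 0.2])))  # -> 'joy'
```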
Very interesting. I found even better research on VAD-to-Ekman conversion in this paper: https://www.researchgate.net/publication/284724383_Affect_Representation_and_Recognition_in_3D_Continuous_Valence-Arousal-Dominance_Space It takes 15 emotions, tabulates their mean values and standard deviations for each of V, A and D, and provides Euclidean distances between all 15 basic emotions. Very interesting. What I am trying to achieve is a web app that records a conversation and sends chunks of it (3-5 s) on the fly to the model for emotion evaluation. According to the evaluated atmosphere, it suggests background music (pieces whose VAD corresponds to the conversation's VAD).
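The music-suggestion step could then be the same distance trick in reverse; a rough sketch, assuming each piece in a hypothetical library is annotated with a VAD triple (all names and values here are made up):

```python
import numpy as np

# Hypothetical library of music pieces annotated with VAD triples in [-1, 1]
MUSIC_LIBRARY = {
    'calm_piano.mp3':  np.array([0.4, -0.5, 0.1]),
    'upbeat_jazz.mp3': np.array([0.7, 0.6, 0.3]),
}

def suggest_track(conversation_vad):
    """Pick the piece whose VAD lies closest (Euclidean) to the conversation's VAD."""
    return min(MUSIC_LIBRARY, key=lambda t: np.linalg.norm(MUSIC_LIBRARY[t] - conversation_vad))

print(suggest_track(np.array([0.5, 0.4, 0.2])))  # -> 'upbeat_jazz.mp3'
```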
It sounds amazing. |
Apparently, and according to its own creators, the audEERING model was not the wisest of choices.
To address such shortcomings, I have forked CMU-MOSEI:
https://github.com/mirix/messaih
The fork is tailored towards Speech Emotion Recognition (SER).
The idea now would be to train a model on messAIh and see how it behaves in a real-life scenario.