-
I don’t recall seeing any analysis of this. I remember checking it in an ASR experiment, and the difference was very small, about 1% relative WER. Saurabh tried it on speech enhancement, and the difference was more pronounced when evaluated by EER on a speaker ID task, maybe 5% relative. A more detailed analysis would be welcome.
-
When mixing audio at the feature level, each feature type defines a `mix()` function. For the `TorchaudioFbank` feature, this is defined here.

Is there any analysis of how accurate this mixing is compared with mixing the raw audio and then extracting features? The STFT is linear and the log-exp addition takes care of the log part, but wouldn't the Mel-scale filters make some difference? Does anyone know of any papers about this?
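One way to probe this empirically is a small NumPy experiment. The sketch below is not the Lhotse or torchaudio implementation: the triangular filterbank, the frame settings, the white-noise test signals, the `EPS` floor, and the names `mel_filterbank`, `log_mel`, and `mix_log_features` are all assumptions made for illustration. It assumes the filterbank is applied to the power spectrum. Under that assumption the Mel filters are themselves linear, and the gap comes from the cross-term in |A + B|² = |A|² + |B|² + 2·Re(A·conj(B)), which the log-exp addition drops.

```python
import numpy as np

EPS = 1e-10  # assumed floor to keep the log finite


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale (HTK-style formula)."""
    def hz_to_mel(hz):
        return 2595.0 * np.log10(1.0 + hz / 700.0)

    def mel_to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), num_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2, n_fft // 2 + 1)
    lower = hz_points[:-2, None]
    center = hz_points[1:-1, None]
    upper = hz_points[2:, None]
    rising = (fft_freqs[None, :] - lower) / (center - lower)
    falling = (upper - fft_freqs[None, :]) / (upper - center)
    # Shape: (num_filters, n_fft // 2 + 1)
    return np.maximum(0.0, np.minimum(rising, falling))


def log_mel(audio, fbank, n_fft=512, hop=160):
    """Log-mel energies from the power spectrum of Hann-windowed frames."""
    num_frames = 1 + (len(audio) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = audio[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(np.maximum(EPS, power @ fbank.T))


def mix_log_features(feats_a, feats_b, energy_gain_b):
    """Feature-domain mix: add energies in the linear domain, then take the log."""
    return np.log(np.maximum(EPS, np.exp(feats_a) + energy_gain_b * np.exp(feats_b)))


rng = np.random.default_rng(0)
sample_rate = 16000
signal_a = rng.standard_normal(sample_rate)  # 1 s of white noise as a stand-in
signal_b = rng.standard_normal(sample_rate)
energy_gain = 0.25  # scales the energy of b; amplitude is scaled by sqrt(0.25) = 0.5

fbank = mel_filterbank(40, 512, sample_rate)

# Reference: mix the raw audio, then extract features.
feats_audio_mix = log_mel(signal_a + np.sqrt(energy_gain) * signal_b, fbank)

# Approximation: extract features separately, then mix in the feature domain.
feats_feature_mix = mix_log_features(
    log_mel(signal_a, fbank), log_mel(signal_b, fbank), energy_gain
)

mean_abs_diff = float(np.mean(np.abs(feats_audio_mix - feats_feature_mix)))
print(f"mean |log-mel difference|: {mean_abs_diff:.4f}")
```

White noise is only a stand-in here; repeating the comparison on real speech and noise recordings, and with the actual extractor settings, would be needed before drawing conclusions about WER or EER.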