Currently, the upper limit on the duration of audio chunks taken as input by Persephone is 10 seconds. This is an issue for the real-world deployment of Persephone, because many documents in archives such as the Pangloss Collection are divided into longer chunks.
For example, the document "Romanmangan, the fairy from the other world" has a duration of 1,890 seconds and is divided into 212 sentences. Seventy of these sentences, amounting to more than half of the total duration of this substantial story, exceed the 10-second limit and are therefore not used in training.
A reviewer of a paper submitted to SLTU suggested performing Voice Activity Detection (VAD) to distinguish silence from speech, and then cutting the long waveform into smaller pieces at the silences. This way, all the data could still be used for training.
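Roughly, the cutting could look like the following minimal sketch. It uses librosa's energy-based splitting as a simple stand-in for a true VAD; the 10-second limit is Persephone's, but the 30 dB threshold and file naming are just assumptions for illustration.

```python
# Sketch: detect non-silent regions and cut a long waveform at silences.
# librosa.effects.split is an energy-based stand-in for a real VAD.
import librosa
import soundfile as sf

MAX_CHUNK_SECONDS = 10.0

def split_at_silences(wav_path, top_db=30):
    """Write each non-silent span that fits under the limit to its own file."""
    y, sr = librosa.load(wav_path, sr=None)
    # Intervals of non-silence: anything quieter than `top_db` below
    # the signal peak is treated as silence.
    intervals = librosa.effects.split(y, top_db=top_db)
    for i, (start, end) in enumerate(intervals):
        if (end - start) / sr <= MAX_CHUNK_SECONDS:
            sf.write(f"chunk_{i:03d}.wav", y[start:end], sr)
        # Spans still longer than the limit would need a second pass
        # (a stricter top_db, or the forced-alignment route below).
```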
alexis-michaud changed the title from "Detecting silence within audio chunks of more than 10 seconds" to "Making use of audio chunks of more than 10 seconds" on Mar 20, 2020.
Yeah, detecting voices and breaking on silence is definitely a good angle to take. However, for training data it doesn't fully solve the problem, because we still need to know which parts of the transcription correspond to each chunk. One useful approach would be to start with forced alignment, then chunk the audio based on silence, and then feed the result into training.
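Something like the sketch below, say. It assumes word-level timestamps `(word, start_s, end_s)` from a forced aligner (e.g. the Montreal Forced Aligner, which is not part of Persephone); the greedy grouping and the 10-second limit are just illustrative.

```python
# Sketch: group force-aligned words into sub-chunks no longer than
# MAX_CHUNK_SECONDS, cutting at the pause between consecutive words,
# so each sub-chunk keeps exactly the transcription that belongs to it.
MAX_CHUNK_SECONDS = 10.0

def chunk_alignment(words, max_len=MAX_CHUNK_SECONDS):
    """words: list of (word, start_s, end_s) tuples from a forced aligner."""
    chunks, current = [], []
    for word, start, end in words:
        if current and end - current[0][1] > max_len:
            # Close the current chunk at the pause before this word.
            chunks.append(current)
            current = []
        current.append((word, start, end))
    if current:
        chunks.append(current)
    # Each chunk yields an (audio start, audio end, transcription)
    # triple, i.e. one training pair per sub-chunk.
    return [
        (c[0][1], c[-1][2], " ".join(w for w, _, _ in c))
        for c in chunks
    ]
```

(A single word longer than the limit would still form an overlong chunk; those would need to be dropped or handled separately.)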
👍
Yes, this is a more ambitious and promising approach than what I had in mind. It's the way to go.
I had in mind cases where, once silences are removed, the chunk gets down to under 10 seconds and can be used in the training set without splitting the transcription. In those cases VAD alone is enough to fit the chunk into the training set. But implementing the more ambitious solution is better, as it is more general (it addresses all cases). Removing silences is also not 'clean': it tampers with the original signal and removes useful cues (pauses are part of the structure of speech, and removing them can create acoustic 'monsters').
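For reference, the simpler check I had in mind would be something like this (a sketch only; librosa's energy-based split again stands in for a proper VAD, and the threshold is an assumption):

```python
# Sketch: does a chunk's speech content fit under the 10-second limit
# once silences are discarded? If so, VAD alone would suffice for it.
import librosa

def fits_after_silence_removal(wav_path, max_len=10.0, top_db=30):
    """Return True if the non-silent portions sum to <= max_len seconds."""
    y, sr = librosa.load(wav_path, sr=None)
    intervals = librosa.effects.split(y, top_db=top_db)
    voiced = sum(end - start for start, end in intervals) / sr
    return voiced <= max_len
```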