
Making use of audio chunks of more than 10 seconds #230

Open
alexis-michaud opened this issue Mar 20, 2020 · 2 comments
Comments

@alexis-michaud

Currently, the upper limit on the duration of audio chunks taken as input by Persephone is 10 seconds. This is an issue for the real-world deployment of Persephone, because many documents in archives such as the Pangloss Collection are divided into longer chunks.

For example, the document "Romanmangan, the fairy from the other world" has a duration of 1,890 seconds and is divided into 212 sentences. Seventy sentences, amounting to more than half of the total duration of this substantial story, are above the 10-second limit and are thus not used in training.

A reviewer of a paper at SLTU suggested performing Voice Activity Detection (VAD) to distinguish speech from silence, and then cutting the long waveform into smaller pieces at the silences. This way, we could still use all the data for training.
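A minimal sketch of that silence-based splitting, assuming librosa is available (the `top_db` threshold and the file name are illustrative assumptions, not part of Persephone):

```python
import librosa

MAX_CHUNK_S = 10.0  # Persephone's current upper limit on chunk duration

# Load the long recording (file name is a placeholder).
y, sr = librosa.load("long_sentence.wav", sr=None)

# Find non-silent intervals; top_db=30 is an assumed threshold to tune per corpus.
intervals = librosa.effects.split(y, top_db=30)

# Keep each non-silent stretch as its own candidate chunk, flagging any
# that still exceed the 10-second limit after splitting at silences.
chunks = []
for start, end in intervals:
    dur = (end - start) / sr
    if dur <= MAX_CHUNK_S:
        chunks.append(y[start:end])
    else:
        print(f"Interval of {dur:.1f}s still exceeds the limit; needs further splitting")
```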

@alexis-michaud alexis-michaud changed the title Detecting silence within audio chunks of more than 10 seconds Making use of audio chunks of more than 10 seconds Mar 20, 2020
@oadams
Collaborator

oadams commented Mar 20, 2020

Yeah, detecting voices and breaking on silence is definitely a good angle to take. However, for training data it doesn't fully solve the problem, because we still need to know which parts of the transcription correspond to each chunk. One useful pipeline would be to run forced alignment first, then chunk based on silence, then feed the chunks into training.
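A rough sketch of the chunking step in that pipeline, assuming we already have word-level alignments `(token, start, end)` in seconds from a forced aligner such as the Montreal Forced Aligner; the 10-second cap and the minimum-pause threshold are assumptions:

```python
from typing import List, Tuple

MAX_CHUNK_S = 10.0   # Persephone's chunk limit
MIN_PAUSE_S = 0.3    # assumed: a gap this long counts as a usable split point

def chunk_alignment(words: List[Tuple[str, float, float]]):
    """Group aligned words into chunks of at most MAX_CHUNK_S, cutting only at pauses.

    Each element of `words` is (token, start_time, end_time) in seconds,
    as produced by a forced aligner. Returns a list of (tokens, start, end).
    """
    chunks, current = [], []
    for word in words:
        if not current:
            current = [word]
            continue
        pause = word[1] - current[-1][2]
        would_exceed = word[2] - current[0][1] > MAX_CHUNK_S
        # Cut when we hit a real pause and the chunk would otherwise grow too long.
        # (If no sufficient pause occurs, the chunk may still end up over the
        # limit and would need to be flagged for manual handling.)
        if pause >= MIN_PAUSE_S and would_exceed:
            chunks.append(([w[0] for w in current], current[0][1], current[-1][2]))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(([w[0] for w in current], current[0][1], current[-1][2]))
    return chunks
```

Each chunk keeps its own slice of the transcription, so the audio/text correspondence is preserved for training.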

@alexis-michaud
Author

👍
Yes, this is a more ambitious and promising approach than what I had in mind. It's the way to go.

I had in mind cases where, once silences are trimmed, the chunk gets down to under 10 seconds and can be used in the training set without splitting the transcription: there, VAD alone is enough to fit the chunk into the training set. But implementing the more ambitious solution is better, as it is more general (it addresses all cases). Trimming silences is also not 'clean': it tampers with the original signal and removes useful cues (pauses are part of the prosodic structure, and cutting them out can create acoustic 'monsters').
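For that simpler case, a sketch of the check (same assumed librosa-based VAD as above): sum the non-silent intervals and take the VAD-only route only when the voiced material fits under the limit.

```python
import librosa

MAX_CHUNK_S = 10.0

y, sr = librosa.load("long_sentence.wav", sr=None)  # placeholder file name
intervals = librosa.effects.split(y, top_db=30)     # assumed threshold

voiced_s = sum(end - start for start, end in intervals) / sr
if voiced_s <= MAX_CHUNK_S:
    # The speech content fits: trimming silences would suffice here,
    # at the cost of tampering with the original signal.
    print(f"VAD-only route possible: {voiced_s:.1f}s of speech")
else:
    print("Even without silences the chunk is too long; alignment-based splitting needed")
```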
