
Making use of audio chunks of more than 10 seconds #230

Open
alexis-michaud opened this issue Mar 20, 2020 · 2 comments
Comments

@alexis-michaud

Currently, the upper limit on the duration of audio chunks taken as input by Persephone is 10 seconds. This is an issue for the real-world deployment of Persephone, because many documents in archives such as the Pangloss Collection are divided into longer chunks.

For example, the document "Romanmangan, the fairy from the other world" has a duration of 1,890 seconds and is divided into 212 sentences. Seventy sentences, amounting to more than half of the total duration of this substantial story, are above the 10-second limit and are thus not used in training.

A reviewer of a paper at SLTU suggested performing Voice Activity Detection (VAD) to distinguish speech from silence, and then cutting the long waveform into smaller pieces at the silences. This way, we could still use all the data for training.
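A minimal sketch of that silence-based splitting, assuming librosa is available (the `top_db` threshold and the file name are illustrative assumptions, not part of Persephone):

```python
import librosa

MAX_CHUNK_S = 10.0  # Persephone's current upper limit on chunk duration

# Load the long recording (file name is a placeholder).
y, sr = librosa.load("long_sentence.wav", sr=None)

# Find non-silent intervals; top_db=30 is an assumed threshold to tune per corpus.
intervals = librosa.effects.split(y, top_db=30)

# Keep each non-silent stretch as its own candidate chunk, flagging any
# that still exceed the 10-second limit after splitting at silences.
chunks = []
for start, end in intervals:
    dur = (end - start) / sr
    if dur <= MAX_CHUNK_S:
        chunks.append(y[start:end])
    else:
        print(f"Interval of {dur:.1f}s still exceeds the limit; needs further splitting")
```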

@alexis-michaud alexis-michaud changed the title Detecting silence within audio chunks of more than 10 seconds Making use of audio chunks of more than 10 seconds Mar 20, 2020
@oadams
Collaborator

oadams commented Mar 20, 2020

Yeah, detecting voices and breaking on silence is definitely a good angle to take. However, for training data it doesn't fully solve the problem, because we still need to know which parts of the transcription correspond to each chunk. One useful pipeline would be to run forced alignment first, then chunk based on silence, then feed the chunks into training.
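A rough sketch of the chunking step in that pipeline, assuming we already have word-level alignments `(token, start, end)` in seconds from a forced aligner such as the Montreal Forced Aligner; the 10-second cap and the minimum-pause threshold are assumptions:

```python
from typing import List, Tuple

MAX_CHUNK_S = 10.0   # Persephone's chunk limit
MIN_PAUSE_S = 0.3    # assumed: a gap this long counts as a usable split point

def chunk_alignment(words: List[Tuple[str, float, float]]):
    """Group aligned words into chunks of at most MAX_CHUNK_S, cutting only at pauses.

    Each element of `words` is (token, start_time, end_time) in seconds,
    as produced by a forced aligner. Returns a list of (tokens, start, end).
    """
    chunks, current = [], []
    for word in words:
        if not current:
            current = [word]
            continue
        pause = word[1] - current[-1][2]
        would_exceed = word[2] - current[0][1] > MAX_CHUNK_S
        # Cut when we hit a real pause and the chunk would otherwise grow too long.
        # (If no sufficient pause occurs, the chunk may still end up over the
        # limit and would need to be flagged for manual handling.)
        if pause >= MIN_PAUSE_S and would_exceed:
            chunks.append(([w[0] for w in current], current[0][1], current[-1][2]))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(([w[0] for w in current], current[0][1], current[-1][2]))
    return chunks
```

Each chunk keeps its own slice of the transcription, so the audio/text correspondence is preserved for training.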

@alexis-michaud
Author

👍
Yes, this is a more ambitious and promising approach than what I had in mind. It's the way to go.

I had in mind cases where, once silences are trimmed, the chunk gets down to under 10 seconds and can be used in the training set without splitting the transcription: there, VAD alone is enough to fit the chunk into the training set. But implementing the more ambitious solution is better, as it is more general (it addresses all cases). Trimming silences is also not 'clean': it tampers with the original signal and removes useful cues (pauses are part of the prosodic structure, and cutting them out can create acoustic 'monsters').
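For that simpler case, a sketch of the check (same assumed librosa-based VAD as above): sum the non-silent intervals and take the VAD-only route only when the voiced material fits under the limit.

```python
import librosa

MAX_CHUNK_S = 10.0

y, sr = librosa.load("long_sentence.wav", sr=None)  # placeholder file name
intervals = librosa.effects.split(y, top_db=30)     # assumed threshold

voiced_s = sum(end - start for start, end in intervals) / sr
if voiced_s <= MAX_CHUNK_S:
    # The speech content fits: trimming silences would suffice here,
    # at the cost of tampering with the original signal.
    print(f"VAD-only route possible: {voiced_s:.1f}s of speech")
else:
    print("Even without silences the chunk is too long; alignment-based splitting needed")
```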
