Moving window rather than 'hard' chunking into 10-second chunks? #195
I think this is a really good point you've raised here. Right now I have a bunch of other tasks to do on the UI, but I'll look into this more when I'm done with those. I suspect there's some way of dealing with this by splitting the audio at sections that have a sufficient duration of silence, but there's a bit of work to be done there to get it working cleanly in all the edge cases.
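For what it's worth, a minimal sketch of what silence-based splitting could look like (this is not Persephone's actual preprocessing; `split_on_silence` and its thresholds are made-up names for illustration, assuming float audio samples and numpy):

```python
import numpy as np

def split_on_silence(samples, sr, frame_ms=25, hop_ms=10,
                     silence_db=-40.0, min_silence_s=0.3):
    """Return sample indices at which to cut, wherever RMS energy stays
    below `silence_db` (relative to the loudest frame) for at least
    `min_silence_s` seconds."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    # RMS energy per analysis frame, in dB relative to the loudest frame
    rms = np.array([np.sqrt(np.mean(samples[i:i + frame] ** 2) + 1e-12)
                    for i in range(0, len(samples) - frame, hop)])
    db = 20 * np.log10(rms / (rms.max() + 1e-12))
    silent = db < silence_db
    min_frames = int(min_silence_s * 1000 / hop_ms)
    cuts, run_start = [], None
    for i, is_silent in enumerate(silent):
        if is_silent and run_start is None:
            run_start = i
        elif not is_silent and run_start is not None:
            if i - run_start >= min_frames:
                cuts.append(((run_start + i) // 2) * hop)  # cut mid-silence
            run_start = None
    return cuts
```

The edge cases mentioned above (e.g. long stretches with no silence at all) would still need to fall back on something like fixed-length chunking.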
Yes, 10 seconds is a long time for speech. It should be possible to find and exploit landmarks such as breathing (breath groups). Maybe add a 'signal processing' label to this issue, as well as to #4 #7 #39 #111? That way, when a signal processing expert joins the team, it will be possible to list the relevant issues in one fell swoop.
This is a good idea and should be implemented. It would be straightforward to stitch the overlapping windows together using some edit-distance matching to create one contiguous transcription. Another idea (similar to the breathing idea mentioned above) would be to break on pauses and silences. This doesn't completely resolve the issue, though, since there might still conceivably be >10 s segments with no clearly discernible silence.
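As a rough illustration of the stitching step (just a sketch; `stitch` is a hypothetical helper, not part of Persephone), Python's standard-library `difflib` can find the shared stretch between the tail of one window's transcription and the head of the next:

```python
from difflib import SequenceMatcher

def stitch(prev_phones, next_phones):
    """Join two phoneme lists from overlapping windows by locating their
    longest common stretch and splicing at its end."""
    m = SequenceMatcher(a=prev_phones, b=next_phones, autojunk=False)
    match = m.find_longest_match(0, len(prev_phones), 0, len(next_phones))
    # Keep `prev_phones` through the shared region, then the new tail of `next_phones`
    return prev_phones[:match.a + match.size] + next_phones[match.b + match.size:]

# e.g. stitch(list("abcde"), list("cdefg")) -> ['a', 'b', 'c', 'd', 'e', 'f', 'g']
```

Note that this silently keeps the earlier window's version wherever the two hypotheses disagree inside the overlap; resolving such disagreements properly is the harder part, discussed below.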
Is it really straightforward to stitch when there is overlap between successive audio windows? Edit-distance matching will allow for stitching, but any mismatches in the overlapping portion of the two chunks will need to be resolved: for instance, if the transcription of chunk 0 ends with one phoneme while the next chunk has a different one at the same position (as in the /f/ vs. /s/ example given in the issue). How likely is it that the same sounds will be transcribed differently by the software when they are part of different chunks of audio? One could dream of a 'heat-map' or '3D' display for manual verification, which would have the 1st candidate foregrounded but with the 2nd-best candidate visible 'between the lines', as it were. This might also make sense when …
To detect sentences (call them "sentence-ish units", like the …): speech processing people would know how hard it is to identify breath groups in the signal, either as such (identifying spectral cues to in-breath, in the ideal case of a signal from a head-worn microphone in good conditions), or as silent pauses, or through other cues such as f0 declination inside the breath group (when the signal-to-noise ratio is not good enough to allow acoustic detection of in-breath).
So there are two parts to this problem. The first is, given two segments A and B, ensuring that the part unique to B immediately follows A (or that the part unique to A immediately precedes B). With any reasonable overlap between the strings and any reasonably low phoneme error rate, this can be done with high confidence using fuzzy string matching. The more A and B overlap time-wise, the less likely a mistake is (exponentially so).

The second problem is how to resolve differences. The straightforward approach here is to take the hypothesis with more confidence. The most correct way to do this would involve summing over all the paths in the CTC trellis that correspond to suffixes of A and those corresponding to prefixes of B, and taking the most likely output. The easiest way (which makes the most sense in our context, given that we're doing greedy 1-best decoding) would be to just take all the CTC output probabilities in our 1-best path that correspond to that phoneme in that part of the sequence and sum over them. Then we compare and take the more likely one. This probably doesn't make any sense to the reader, but it's partly a note to myself for the future.
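To make the "easiest way" a little more concrete, here's a small sketch (hypothetical names, assuming `log_probs` is a `(time, vocab)` array of log posteriors with the CTC blank at index 0, and assuming the overlap has already been aligned symbol-by-symbol by the fuzzy matching above):

```python
import numpy as np

def greedy_symbol_confidences(log_probs, blank=0):
    """Greedy CTC decode, accumulating per output symbol the probability
    mass of the frames that produced it (a cheap stand-in for summing
    whole trellis paths)."""
    best = log_probs.argmax(axis=1)        # frame-wise best label
    probs = np.exp(log_probs.max(axis=1))  # probability of that label
    symbols, confs, prev = [], [], blank
    for lab, p in zip(best, probs):
        if lab != blank:
            if lab != prev:
                symbols.append(int(lab))   # a new symbol starts here
                confs.append(float(p))
            else:
                confs[-1] += float(p)      # repeated frames of the same symbol
        prev = lab
    return symbols, confs

def resolve_overlap(sym_a, conf_a, sym_b, conf_b):
    """For an already-aligned overlap, keep at each position the symbol
    whose chunk assigned it more probability mass."""
    return [a if ca >= cb else b
            for (a, ca), (b, cb) in zip(zip(sym_a, conf_a), zip(sym_b, conf_b))]
```

This only handles substitutions; insertions or deletions inside the overlap would need an edit-distance alignment of the two hypotheses first.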
It's an interesting empirical question which would actually yield insight into how much the LSTM is relying on long-range information in order to make its decisions. I agree that it would happen most often for tone, but I'm not sure whether it will occur that often.
Presenting alternative hypotheses via a beautifully displayed lattice or similar is something that's been at the back of my mind for a long time now. This would be great to incorporate into a web front-end, especially in the context of an iterative training pipeline where a linguist's corrections are fed back into model training.
Good to know this figure!
Would be keen to talk about this UI when we get to the point of a more fully featured front-end implementation.
The source recordings are split into 10-second chunks, right? This makes it harder to identify phonemes at the edges of these 10-second chunks: not only is phonemic context lacking (at the 'left' for the first phoneme, at the 'right' for the last), but phones straddling the 10-second-chunk boundary get cut rather brutally 🔪. This creates maimed stubs that are harder to identify.
What about using a moving window to smooth the edges? This could improve recognition of sounds found at the boundaries: at 0 s, 10 s, 20 s, etc. Instead of hard boundaries every 10 seconds, Persephone would deal with overlapping chunks, adding a 5–15 s window between the 0–10 s and 10–20 s chunks, and so on.
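Something like the following (purely illustrative; `overlapping_windows` is not an existing Persephone function) would generate the staggered chunks:

```python
def overlapping_windows(total_s, window_s=10.0, hop_s=5.0):
    """Yield (start, end) times in seconds for overlapping chunks."""
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start += hop_s

# e.g. list(overlapping_windows(25))
# -> [(0.0, 10.0), (5.0, 15.0), (10.0, 20.0), (15.0, 25.0)]
```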
In case of a mismatch between successive windows, for instance if the transcription for 0–10 s ends on a /f/ whereas the 5–15 s transcription has /s/ at the same position inside the string, then the 'mid-file' transcription (found in the chunk where the target phone has not lost its integrity and sits snugly in the middle of a pristine context) would be favoured over the 'maimed-stub' transcription (found in the chunk where the target phone is at, or close to, an edge), and the /s/ would be retained, not the /f/.
Of course it is likely to get more complex than this, since probabilities for successive phonemes are not independent of one another. But intuitively it seems clear that there is room for improvement by addressing the issue of the transition from one audio chunk to the next. Options for a 'sensitive' choice of boundaries could include detection of long pauses, in-breath...
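As a toy version of that decision rule (a hypothetical helper, ignoring the dependence between neighbouring phonemes mentioned above), one could simply keep the candidate that sits furthest from the edges of its own chunk:

```python
def prefer_mid_chunk(cand_a, pos_a, len_a, cand_b, pos_b, len_b):
    """Keep whichever candidate phone is further from an edge of its own
    chunk, i.e. the one transcribed with the more intact context.
    Positions and chunk lengths are in seconds within each chunk."""
    margin_a = min(pos_a, len_a - pos_a)
    margin_b = min(pos_b, len_b - pos_b)
    return cand_a if margin_a >= margin_b else cand_b

# A phone at 9.8 s into the 0-10 s chunk (margin 0.2 s) vs. the same phone
# at 4.8 s into the 5-15 s chunk (margin 4.8 s): the mid-chunk reading wins.
# prefer_mid_chunk('f', 9.8, 10.0, 's', 4.8, 10.0) -> 's'
```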