Understanding the VAD dataset #655

kkm000 · 2022-04-02T05:03:35Z


I am still trying to wrap my head around Lhotse and, honestly, have trouble grokking even the basic concepts. I learn quickest by doing. My question is, essentially, what the VAD dataset really is.

The practical problem that I need to solve is this. I have a very large set (~1 TB) of short (5-40 s) separate utterances, each in its own G.711 u-law 8 kHz mono file. I need to label temporal intervals in these files more or less automatically, to obtain a labeled subset that is very small compared to the total size (say, 1% is currently more than I probably need), with 5 categories:

1. Silence. This is easy. Everything below a -54 dB power cutoff is silence.
2. Clear voice.
3. Voice over background noise of various kinds. I do not have a good characterisation of what it is; ideally, everything but the foreground voice.
4. Noise, including but not limited to similar background noise; anything which has enough energy but is not foreground human speech, like line noise, or any real sound that got into the recording that is not voice, from a dropped pen to a jet plane, etc.
5. Music, a typical on-hold kind. This is usually not mixed with other real-world noises.
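
The silence category is the only one that follows from a fixed rule, so it can be sketched directly. The snippet below is a minimal illustration, not part of Lhotse: it assumes the audio has already been decoded to floats in [-1, 1], that the -54 dB cutoff is measured relative to full scale, and that 10 ms frames (80 samples at 8 kHz) are used. The function names are hypothetical.

```python
import math

SAMPLE_RATE = 8000  # Hz, matching the 8 kHz files described in the post


def frame_power_db(samples, frame_len):
    """Mean power of each full frame, in dB relative to full scale (assumed 1.0)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # The floor avoids log10(0) on digital silence.
        powers.append(10.0 * math.log10(max(mean_sq, 1e-12)))
    return powers


def silence_mask(samples, frame_len=80, cutoff_db=-54.0):
    """True for every frame whose power falls below the cutoff."""
    return [p < cutoff_db for p in frame_power_db(samples, frame_len)]


# Synthetic check: 100 ms of a loud tone followed by 100 ms of a very quiet one.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(800)]
quiet = [0.0005 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(800)]
mask = silence_mask(tone + quiet)
```

The other four categories need a trained classifier; a power threshold alone cannot separate voice, noise, and music.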

I need a certain number of samples of each. The samples should be at least 520 ms long. They do not have to have any relation to the source file: any temporal cut of t_{min} = 520 ms or more is a separate labeled training example. Since I can discard most classified spans that have low confidence, I'm after the high-confidence ones.
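
The selection rule in the paragraph above can be sketched as follows. This is only an illustration under assumptions not stated in the post: some classifier has already produced a (label, confidence) pair per 10 ms frame, and 0.9 is an arbitrary confidence threshold. The function name and parameters are hypothetical. Times are kept as integer milliseconds to avoid floating-point rounding at the 520 ms boundary.

```python
from itertools import groupby


def confident_spans(frames, frame_shift_ms=10, min_conf=0.9, min_duration_ms=520):
    """Collapse per-frame (label, confidence) pairs into (label, start_ms, end_ms) spans.

    Frames below min_conf are treated as unlabeled and break a run;
    runs shorter than min_duration_ms are discarded.
    """
    keyed = [label if conf >= min_conf else None for label, conf in frames]
    spans, index = [], 0
    for label, run in groupby(keyed):
        length = sum(1 for _ in run)
        start_ms = index * frame_shift_ms
        end_ms = (index + length) * frame_shift_ms
        index += length
        if label is not None and end_ms - start_ms >= min_duration_ms:
            spans.append((label, start_ms, end_ms))
    return spans


# 600 ms of confident music, 100 ms of uncertain noise, 400 ms of confident voice.
frames = [("music", 0.95)] * 60 + [("noise", 0.5)] * 10 + [("voice", 0.99)] * 40
spans = confident_spans(frames)
```

Here only the music run survives: the noise frames fall below the confidence threshold, and the voice run is shorter than 520 ms.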

Is this something that Lhotse can do? Does it perform fairly advanced DSP on the signal to produce such an unsupervised classification, or am I misunderstanding its purpose entirely, and this is not what the tool is intended for or capable of?

pzelasko · 2022-04-02T13:08:19Z

Maintainer

Hi Kirill,

I don't think Lhotse is entirely what you are looking for. Coming from the Kaldi world, think of it as data directories + "nnet egs" in the world of Python + PyTorch. Its main purpose is to make your life easier when managing, manipulating, and loading speech data.

We don't provide any pre-trained models/advanced DSP algorithms for the sort of task you described (and I don't think we are aiming to either). However, if you wanted to write or use an existing Python tool for that, Lhotse might be helpful for creating the data description manifests, loading audio, truncating it, transforms, augmentation, feature extraction, stratified sampling, etc. To some extent we also support importing from/exporting to Kaldi data dirs.

Regarding VadDataset, it's a PyTorch dataset class that converts a mini-batch of data manifests (CutSet) to a mini-batch of tensors (ready for nnet training or inference). Different dataset classes will return different kinds of tensors that are task specific (for VAD it would be an audio or feature tensor + sample/frame level speech activity labels).
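
To make "frame-level speech activity labels" concrete, here is a plain-Python sketch of how such a target can be derived from labeled speech segments. It is not Lhotse's implementation and does not use its API; the function name, the 10 ms frame shift, and the (start_ms, end_ms) segment format are assumptions made for illustration.

```python
def speech_activity_labels(segments, total_ms, frame_shift_ms=10):
    """Return one 0/1 label per frame: 1 where any speech segment overlaps the frame.

    segments: iterable of (start_ms, end_ms) speech intervals within the recording.
    """
    num_frames = total_ms // frame_shift_ms
    labels = [0] * num_frames
    for start_ms, end_ms in segments:
        first = start_ms // frame_shift_ms
        # Ceiling division, so a frame that is only partially covered still counts.
        last = min(num_frames, -(-end_ms // frame_shift_ms))
        for i in range(first, last):
            labels[i] = 1
    return labels


# A 100 ms recording with speech at 20-50 ms and 80-100 ms.
labels = speech_activity_labels([(20, 50), (80, 100)], total_ms=100)
```

A VAD network would be trained on pairs of a feature (or audio) tensor and a label vector of this kind, batched across several cuts.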

1 reply

pzelasko Apr 2, 2022
Maintainer

Also, if you want to learn more about Lhotse through specific examples, check out the tutorial notebooks; they demonstrate its capabilities on small datasets such as mini LibriSpeech.

https://github.com/lhotse-speech/lhotse/tree/master/examples
