Hi Kirill, I don't think Lhotse is entirely what you are looking for. Coming from the Kaldi world, think of it as data directories + "nnet egs" in the world of Python + PyTorch. Its main purpose is to make your life easier when managing, manipulating, and loading speech data. We don't provide any pre-trained models or advanced DSP algorithms for the sort of task you described (and I don't think we are aiming to either). However, if you wanted to write or use an existing Python tool for that, Lhotse might be helpful for creating the data description manifests, loading audio, truncating it, applying transforms and augmentation, extracting features, stratified sampling, etc. To some extent we also support importing from/exporting to Kaldi data dirs.

Regarding VadDataset, it's a PyTorch dataset class that converts a mini-batch of data manifests (CutSet) into a mini-batch of tensors (ready for nnet training or inference). Different dataset classes return different kinds of task-specific tensors (for VAD it would be an audio or feature tensor + sample-level or frame-level speech activity labels).
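To make "frame level speech activity labels" a bit more concrete, here is a minimal sketch in plain Python. It is not Lhotse's actual implementation: the function name `frame_labels`, the 10 ms frame shift, and the example segment are all assumptions made for illustration. It only shows the idea of turning speech intervals (in seconds) into one 0/1 label per frame.

```python
from typing import List, Tuple


def frame_labels(
    speech_segments: List[Tuple[float, float]],
    duration: float,
    frame_shift: float = 0.01,  # assumed 10 ms frame shift
) -> List[int]:
    """Return one label per frame: 1 inside a speech segment, 0 elsewhere."""
    num_frames = int(round(duration / frame_shift))
    labels = [0] * num_frames
    for start, end in speech_segments:
        # Convert segment boundaries from seconds to (rounded) frame indices,
        # clipped to the valid frame range.
        first = max(0, int(round(start / frame_shift)))
        last = min(num_frames, int(round(end / frame_shift)))
        for i in range(first, last):
            labels[i] = 1
    return labels


# One hypothetical speech segment from 20 ms to 50 ms in an 80 ms recording.
print(frame_labels([(0.02, 0.05)], duration=0.08))
# [0, 0, 1, 1, 1, 0, 0, 0]
```

A VAD-style dataset would pair a label sequence like this with the audio or feature tensor of the same cut.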
I am still trying to wrap my head around Lhotse and, honestly, have trouble grokking even the basic concepts. I learn the quickest by doing. My question is, essentially, what the VAD dataset really is.
The practical problem that I need to solve is this. I have a very large set (~1 TB) of short (5-40 s) separate utterances, each in its own G.711 u-law 8 kHz mono file. I need to label temporal intervals in these files, more or less automatically, to obtain a labeled subset that is very small compared to the total size (say, 1% is currently more than I probably need), with 5 categories:
I need a certain number of samples of these. The samples should be at least 520 ms in length (t_{min}). They do not have to have any relation to the source file: any temporal cut of t_{min} or more is a separate labeled training example. Since I can discard most of the classified spans that are of low confidence, I'm after the high-confidence ones.
Is this something that Lhotse can do? Does it perform quite advanced DSP of the signal to produce such an unsupervised classification, or am I not grokking its purpose entirely, and this is not what the tool is intended for/capable of?
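The span-selection step described above (keep only high-confidence stretches of at least t_{min}) could be sketched roughly as follows. This is plain Python, not a Lhotse API, and it assumes some external classifier has already produced a label and a confidence for every frame. The names `Span` and `high_confidence_spans`, the 10 ms frame shift, and the 0.9 confidence threshold are hypothetical choices for illustration; only the 520 ms minimum comes from the problem statement.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Span:
    label: str
    start: float  # seconds
    end: float    # seconds


def high_confidence_spans(
    labels: Sequence[str],
    confidences: Sequence[float],
    frame_shift: float = 0.01,     # assumed 10 ms frames
    min_confidence: float = 0.9,   # assumed threshold
    min_duration: float = 0.52,    # t_min = 520 ms
) -> List[Span]:
    """Merge consecutive confident frames with the same label into spans
    and keep only spans lasting at least `min_duration` seconds."""
    # Work in whole frames to avoid floating-point comparisons of durations.
    min_frames = math.ceil(min_duration / frame_shift - 1e-9)
    spans: List[Span] = []
    run_label: Optional[str] = None
    run_start = 0
    n = len(labels)
    # Iterate one step past the end so the final run is flushed too.
    for i in range(n + 1):
        confident = i < n and confidences[i] >= min_confidence
        current = labels[i] if confident else None
        if current != run_label:
            if run_label is not None and i - run_start >= min_frames:
                spans.append(
                    Span(run_label, run_start * frame_shift, i * frame_shift)
                )
            run_label = current
            run_start = i
    return spans


# Hypothetical per-frame classifier output: 60 confident frames of "a",
# then 30 confident frames of "b" (too short to keep).
demo = high_confidence_spans(["a"] * 60 + ["b"] * 30, [0.95] * 90)
print(demo)
```

A low-confidence frame breaks a run, so a long stretch with a single uncertain frame in the middle yields two shorter candidates, each of which must still reach t_{min} on its own.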