
Feature/unsup multichan waveform dataset #532

Open · wants to merge 5 commits into master
Conversation

kkarrancsu

Added a Dataset which supports multichannel audio samples. Updated the collator to drop pad tracks.

@pzelasko (Collaborator) left a comment

Thanks! I think it's a good start; it will take a bit more effort to handle multi-channel properly here, but we'll get there.

# TODO: how to ensure that each track is synced across batches? i.e. dim=1 is the track index
# and should correspond to the same mic across batches

cuts = maybe_pad(cuts)
pzelasko (Collaborator):

This should not be needed as you're manually zero-padding later.

# and should correspond to the same mic across batches

cuts = maybe_pad(cuts)
cuts = remove_pad_tracks(cuts)
pzelasko (Collaborator):

I think there is a pitfall here: what if a MixedCut looks like:

|-------cut1-------||---padding---||----cut2----|

or any variation of the situation where the padding sits in between two cuts. I don't think Lhotse would handle these situations well with your current code. Maybe you should try removing only the padding at the end (and at the beginning, but for that one you have to be careful about modifying the offsets of the remaining tracks). Rather than manually removing PaddingCuts, I suggest using .truncate() with carefully computed offset and duration arguments; that method handles a lot of pitfalls and edge cases.
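A minimal sketch of that .truncate()-based approach, assuming the goal is only to drop padding at the edges of a MixedCut (the helper name is illustrative, not part of Lhotse, and truncating may also change the cut type in some edge cases):

from lhotse.cut import MixedCut, PaddingCut

def trim_edge_padding(mixed: MixedCut):
    # Find the time span covered by real (non-padding) tracks.
    real = [t for t in mixed.tracks if not isinstance(t.cut, PaddingCut)]
    start = min(t.offset for t in real)
    end = max(t.offset + t.cut.duration for t in real)
    # .truncate() recomputes the track offsets and handles the edge cases
    # that manual PaddingCut removal would miss.
    return mixed.truncate(offset=start, duration=end - start)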

for idx, cut in enumerate(cuts):
    ntrack = len(cut.tracks)
    nsamp = cut.num_samples
    audio[idx, 0:ntrack, 0:nsamp] = torch.from_numpy(cut.load_audio(mixed=False))
pzelasko (Collaborator):

Note that if you did cut.mix(musan_cut) here, it would also add an extra track; as is, the code would not work with additive noise data augmentation.

cuts = remove_pad_tracks(cuts)

# NOTE: what to do when the # of tracks is not the same across cuts, right now
# this is zero-padding but that seems bad ...
pzelasko (Collaborator):

I think you won't escape zero-padding of examples with fewer channels if you need to collate the data. However, I suggest you modify this function to return a 3-tuple (audio, audio_lens, channel_indexes), where audio is the collated data with shape (B, C, T), audio_lens holds the length of each multi-channel example with shape (B,), and channel_indexes is a list of lists indicating which indexes along the C dim are meaningful channels for each example (it could also be a channel_lens tensor of shape (B,), assuming the first c channels are always guaranteed to be meaningful).
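A rough sketch of a collate function with that return signature (assumes every cut is a MixedCut with the same sampling rate; the function name is illustrative, not the existing Lhotse collator):

from typing import List, Tuple

import torch
from lhotse import CutSet

def collate_multi_channel_audio_with_lens(
    cuts: CutSet,
) -> Tuple[torch.Tensor, torch.Tensor, List[List[int]]]:
    max_tracks = max(len(cut.tracks) for cut in cuts)
    max_samples = max(cut.num_samples for cut in cuts)
    audio = torch.zeros(len(cuts), max_tracks, max_samples)  # (B, C, T), zero-padded
    audio_lens = torch.zeros(len(cuts), dtype=torch.int64)   # (B,)
    channel_indexes: List[List[int]] = []                    # meaningful C indexes per example
    for idx, cut in enumerate(cuts):
        ntrack = len(cut.tracks)
        nsamp = cut.num_samples
        audio[idx, :ntrack, :nsamp] = torch.from_numpy(cut.load_audio(mixed=False))
        audio_lens[idx] = nsamp
        channel_indexes.append(list(range(ntrack)))
    return audio, audio_lens, channel_indexes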

pzelasko (Collaborator):

But in the end, your models will have to deal with the non-meaningful channels somehow anyway. As long as you're working on data with the same number of channels, there's no need to overthink this.

assert all(isinstance(cut, MixedCut) for cut in cuts)

# TODO: how to ensure that each track is synced across batches? i.e. dim=1 is the track index
# and should correspond to the same mic across batches
pzelasko (Collaborator):

You can ensure the tracks are sorted by some property; I imagine this is something very corpus-specific and should be done by the user, not by the library.
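For example, a user-side sketch that fixes the track order by sorting on the underlying cut id (the right sort key is corpus-specific; this assumes MixedCut remains a dataclass so dataclasses.replace works, and sorting the tracks does not change the mix, since each track carries its own offset):

from dataclasses import replace

from lhotse import CutSet
from lhotse.cut import MixedCut

def with_sorted_tracks(cuts: CutSet) -> CutSet:
    # Sort tracks by the id of their underlying cut so that dim=1 of the
    # collated batch always maps to the same microphone.
    return CutSet.from_cuts(
        replace(cut, tracks=sorted(cut.tracks, key=lambda t: t.cut.id))
        if isinstance(cut, MixedCut)
        else cut
        for cut in cuts
    )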

@@ -74,6 +74,41 @@ def _validate(self, cuts: CutSet) -> None:
        assert all(cut.has_recording for cut in cuts)


class UnsupervisedMultiChanWaveformDataset(UnsupervisedDataset):
pzelasko (Collaborator):

Suggested change:
- class UnsupervisedMultiChanWaveformDataset(UnsupervisedDataset):
+ class MultiChannelWaveformDataset(UnsupervisedDataset):

somehow reads better to me

"audio_lens": audio_lens,
}
else:
return {"cuts": cuts, "audio": [c.load_audio(mixed=False) for c in cuts]}
pzelasko (Collaborator):

This line would again have the extra padding channels problem. This suggests that maybe the solution should not be (entirely) in the collate function, but inside load_audio, e.g. controlled by an extra argument?
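Until such an argument exists, a hedged user-side workaround could look like this (hypothetical helper; it assumes the row order of load_audio(mixed=False) matches cut.tracks):

import numpy as np
from lhotse.cut import MixedCut, PaddingCut

def load_audio_without_padding_tracks(cut: MixedCut) -> np.ndarray:
    audio = cut.load_audio(mixed=False)  # (num_tracks, num_samples)
    keep = [i for i, t in enumerate(cut.tracks) if not isinstance(t.cut, PaddingCut)]
    return audio[keep]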

@@ -96,7 +96,7 @@ def __getitem__(self, cuts: CutSet) -> Dict[str, Union[torch.Tensor, List[str]]]
        Return a new batch, with the batch size automatically determined using the constraints
        of max_frames and max_cuts.
        """
        validate_for_asr(cuts)
        #validate_for_asr(cuts)
pzelasko (Collaborator):

This should be uncommented

@pzelasko (Collaborator) commented Jan 12, 2022

For anybody interested in this, here's some context from our earlier discussion with @kkarrancsu:

I expect you to run into issues related to padding and MUSAN data augmentation with it. Basically, padding and augmentation create extra tracks in a MixedCut, and neither MixedCut nor collate_multi_channel_audio knows which tracks are the data and which tracks are the padding / noise. So, for 4-channel audio, you might end up with 6-channel output from collate_multi_channel_audio; you might want to modify it somehow to avoid that (interpret padding as padding all channels, and noise augmentation as… adding the noise to each channel? or just one of them? I don't know)
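To illustrate the track-count growth (a sketch; it assumes cut is a 4-track MixedCut from the corpus and musan_cut is a mono MUSAN noise cut loaded elsewhere):

padded = cut.pad(duration=cut.duration + 2.0)  # appends a PaddingCut track -> 5 tracks
noisy = padded.mix(musan_cut, snr=10)          # appends the noise as another track -> 6 tracks
audio = noisy.load_audio(mixed=False)          # shape (6, num_samples), not (4, num_samples)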

@pzelasko (Collaborator)

@kkarrancsu I have a different idea -- we could add a new attribute to MixTrack that's called separate_channel: bool = True -- it would indicate if a given track is a "data" channel or an "augmentation" channel. We would add a parameter with the same name to mix (default=True) and pad (default=False) operations on all cuts.
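A rough sketch of that proposal (not merged into Lhotse; the existing MixTrack fields are shown with abbreviated types, and separate_channel is the new attribute being proposed):

from dataclasses import dataclass
from typing import Optional

from lhotse.cut import Cut

@dataclass
class MixTrack:
    cut: Cut
    offset: float = 0.0
    snr: Optional[float] = None
    separate_channel: bool = True  # proposed: False for padding/augmentation tracks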

We would need to extend FeatureMixer and AudioMixer to handle these cases properly. The idea is to again add separate_channel parameter to add_to_mix method, and modify the property unmixed_audio:

lhotse/lhotse/audio.py

Lines 984 to 990 in b41e4f8

    @property
    def unmixed_audio(self) -> np.ndarray:
        """
        Return a numpy ndarray with the shape (num_tracks, num_samples), where each track is
        zero padded and scaled adequately to the offsets and SNR used in ``add_to_mix`` call.
        """
        return np.vstack(self.tracks)

so that instead of simply vstacking all the tracks, it vstacks only the "separate" channels, downmixes the remaining channels to mono, and adds that mono mix to each of the "separate" channels. The analogous operation is needed for FeatureMixer.
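A rough sketch of that modified unmixing logic, written as a standalone function (the separate_channel flags are the hypothetical new attribute; in AudioMixer this logic would live inside the unmixed_audio property):

from typing import List

import numpy as np

def unmix_with_separate_channels(
    tracks: List[np.ndarray], separate_channel: List[bool]
) -> np.ndarray:
    # ``tracks`` are the per-track audio arrays already zero-padded and scaled
    # by add_to_mix; ``separate_channel`` marks which tracks are real data channels.
    data = np.vstack([t for t, sep in zip(tracks, separate_channel) if sep])
    aug = [t for t, sep in zip(tracks, separate_channel) if not sep]
    if aug:
        mono = np.vstack(aug).sum(axis=0)  # downmix padding/augmentation tracks to mono
        data = data + mono[np.newaxis, :]  # add the mono mix to every data channel
    return data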

Then, collate_multi_channel_audio could remain almost unmodified w.r.t. what's there in the codebase, and it would not need any special logic to figure out what's a real channel and what is just padding/augmentation.

Of course we'd need to add more unit tests to make sure this doesn't break anything and works as expected.

@danpovey (Collaborator)

It seems to me that the more "correct" way to do this would be, when adding noise to multi-channel audio, to add multiple channels of noise. I assume this would require some nontrivial simulation, possibly with multiple sources.
I would have thought that for any application that was going to process multiple-channel audio in a non-trivial way, just adding a single type of noise to all channels would not really be sufficient.

@pzelasko (Collaborator) commented Jan 13, 2022

Good point. I am not sure implementing that on top of MixedCut makes sense, though, as we would need to hold all the information relating to the "nontrivial simulation" in the manifests. It might make more sense to write a dedicated module/transform that works directly on audio data (or use an existing tool).

One such tool is e.g. https://github.com/asteroid-team/torch-audiomentations, but I just noticed that they are doing exactly the same simplified mono downmix I was thinking about:

https://github.com/asteroid-team/torch-audiomentations/blob/261015b3fdd99b475507aab01456093e13719519/torch_audiomentations/augmentations/background_noise.py#L13-L21

Another option is using https://github.com/LCAV/pyroomacoustics as a transform inside your PyTorch Dataset class I think.
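For instance, a rough sketch of a Dataset-level transform built on pyroomacoustics (room geometry, source positions, and mic positions are arbitrary example values):

import numpy as np
import pyroomacoustics as pra

def simulate_multichannel_noise(speech: np.ndarray, noise: np.ndarray, fs: int = 16000) -> np.ndarray:
    # Place the speech and a noise source in a simulated room and record them
    # with a small mic array, so the noise differs per channel instead of being
    # the same mono signal copied onto every channel.
    room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs, max_order=10)
    room.add_source([2.0, 2.5, 1.5], signal=speech)
    room.add_source([4.5, 1.0, 1.5], signal=noise)
    mic_positions = np.c_[[3.0, 2.0, 1.2], [3.1, 2.0, 1.2], [3.2, 2.0, 1.2], [3.3, 2.0, 1.2]]
    room.add_microphone_array(pra.MicrophoneArray(mic_positions, fs))
    room.simulate()
    return room.mic_array.signals  # shape: (num_mics, num_samples)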

In any case, we would still need to be able to handle the padding. I think the solution I suggested with separate_channel is complementary to using more advanced simulations later in the data pipeline (e.g., the manifests only contain the padding information, MixedCut loads the audio and doesn't add the extra channel for padding, and noise+simulation is added via a transform inside the Dataset).

@danpovey (Collaborator)

OK, sure, it was just a thought.
