collate_multi_channel_audio #552

Closed
m-wiesner opened this issue Jan 26, 2022 · 3 comments

@m-wiesner
Contributor

I think there is a problem in collate_multi_channel_audio

def collate_multi_channel_audio(cuts: CutSet) -> torch.Tensor:
    """
    Load audio samples for all the cuts and return them as a batch in a torch tensor.
    The cuts have to be of type ``MixedCut`` and their tracks will be interpreted as individual channels.
    The output shape is ``(batch, channel, time)``.
    The cuts will be padded with silence if necessary.
    """
    assert all(cut.has_recording for cut in cuts)
    assert all(isinstance(cut, MixedCut) for cut in cuts)
    cuts = maybe_pad(cuts)
    first_cut = next(iter(cuts))
    audio = torch.empty(len(cuts), len(first_cut.tracks), first_cut.num_samples)
    for idx, cut in enumerate(cuts):
        audio[idx] = torch.from_numpy(cut.load_audio())
    return audio

The output tensor is initialized here:

audio = torch.empty(len(cuts), len(first_cut.tracks), first_cut.num_samples)

and then, inside the subsequent for loop, cut.load_audio() uses the flag mix=True by default, so it returns a tensor of size (1 x cut.num_samples) instead of a tensor of size (1 x len(first_cut.tracks) x cut.num_samples). This means the multichannel track is mixed down by default, and the values in audio[:, 1:, :] are never set and can be arbitrary.

The mix flag should probably be passed as an option to collate_multi_channel_audio. Otherwise, the function should be updated to return a tensor of size (len(cuts), first_cut.num_samples), with the mix happening automatically, and the docstring should be updated to reflect this.
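
The first option above can be sketched roughly as follows. This is only an illustration and not Lhotse code: FakeMixedCut, its load_audio(mix=...) method, and collate_sketch are hypothetical stand-ins, the flag is named as this issue refers to it, and plain nested lists take the place of tensors so the sketch runs without torch or Lhotse.

```python
from typing import List, Sequence


class FakeMixedCut:
    """Hypothetical stand-in for a MixedCut; each track is a list of samples."""

    def __init__(self, tracks: Sequence[Sequence[float]]) -> None:
        self.tracks = [list(track) for track in tracks]

    def load_audio(self, mix: bool = True) -> List[List[float]]:
        if mix:
            # Mix-down: sum the tracks sample by sample, shape (1, time).
            return [[sum(samples) for samples in zip(*self.tracks)]]
        # One row per track, shape (channel, time).
        return [list(track) for track in self.tracks]


def collate_sketch(cuts: Sequence[FakeMixedCut], mix: bool = False):
    """Return (batch, channel, time) when mix is False, else (batch, time)."""
    batch = [cut.load_audio(mix=mix) for cut in cuts]
    if mix:
        # Drop the singleton channel axis left by the mix-down.
        return [audio[0] for audio in batch]
    return batch


cuts = [
    FakeMixedCut([[1.0, 2.0], [10.0, 20.0]]),
    FakeMixedCut([[3.0, 4.0], [30.0, 40.0]]),
]
multi = collate_sketch(cuts, mix=False)  # every track kept as its own channel
mono = collate_sketch(cuts, mix=True)    # tracks summed into a single channel
```

With the flag passed explicitly, the shape of each loaded item always matches the slot the collate function expects, so no channel can be left without data.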

@pzelasko
Collaborator

Good point. This function is actually not very well supported in Lhotse right now -- please refer to the discussion in #532. If you're open to doing some work to extend the multi-channel support in Lhotse, I'd love to help with that.

@m-wiesner
Contributor Author

m-wiesner commented Jan 27, 2022

There is another small related problem I have noticed ...

The function mix_cuts() in cut.py is supposed to return a cut of type MixedCut. The docstring says """Return a MixedCut that consists of the input Cuts mixed with each other as-is."""

In some cases, there are CutSets intended to represent multichannel audio in which a small number of recordings, for whatever reason, have only a single channel. In these cases the function passed to functools.reduce is never applied, because there is only a first (and only) element. Currently that function is the mix() method, which, among other things, casts cuts to MixedCuts.

As a result, a single-channel MonoCut recording will not be cast to a MixedCut. This problem also affects mix_same_recording_channels(), which will return a CutSet containing some MonoCuts as well as MixedCuts, when I think the intention was for it to return only MixedCuts. A similar problem is present in the MixedCut truncate method, which has a special case for when there is a single channel and returns a MonoCut, effectively casting the MixedCut to a MonoCut. I don't think that is the desired behavior, but perhaps it was intended. I assume a similar problem may also affect other methods, but these are the only two I have found so far.
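
The reduce behavior described above can be reproduced with the standard library alone. In this small sketch, the mix function and the string "cuts" are placeholders rather than Lhotse objects; it only shows that functools.reduce returns a lone element untouched.

```python
from functools import reduce

calls = []


def mix(left, right):
    """Placeholder for the real mix(); records each call it receives."""
    calls.append((left, right))
    return f"Mixed({left}+{right})"


# A single-element sequence: reduce returns the element and never calls mix().
single = reduce(mix, ["mono"])
calls_after_single = list(calls)

# Two elements: mix() is called exactly once.
pair = reduce(mix, ["a", "b"])
```

Here single is still the original "mono" value, which is why a lone MonoCut passes through mix_cuts() without being converted.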

I have fixed this by adding a static method to MixedCut

@staticmethod
def from_mono(cut: MonoCut) -> MixedCut:
    # Wrap a single MonoCut in a one-track MixedCut.
    return MixedCut(id=cut.id, tracks=[MixTrack(cut=cut)])

and then changing mix_cuts from

return reduce(mix, cuts)
to

return MixedCut.from_mono(next(iter(cuts))) if len(cuts) == 1 else reduce(mix, cuts)
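
To make the proposed change easier to follow, here is a self-contained sketch of it. The MonoCut, MixTrack, and MixedCut classes and the mix() function below are simplified stand-ins written for this illustration, not the Lhotse implementations; only from_mono and the single-element branch mirror the change described above.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Sequence, Union


@dataclass
class MonoCut:
    """Simplified stand-in for a single-channel cut."""
    id: str


@dataclass
class MixTrack:
    """Simplified stand-in for one track of a mixed cut."""
    cut: MonoCut


@dataclass
class MixedCut:
    """Simplified stand-in for a multi-track cut."""
    id: str
    tracks: List[MixTrack]

    @staticmethod
    def from_mono(cut: MonoCut) -> "MixedCut":
        # Wrap a single MonoCut in a one-track MixedCut.
        return MixedCut(id=cut.id, tracks=[MixTrack(cut=cut)])


def mix(left: Union[MonoCut, MixedCut], right: MonoCut) -> MixedCut:
    """Stand-in mix(): append `right` as a new track and return a MixedCut."""
    base = left if isinstance(left, MixedCut) else MixedCut.from_mono(left)
    return MixedCut(
        id=f"{base.id}+{right.id}",
        tracks=base.tracks + [MixTrack(cut=right)],
    )


def mix_cuts(cuts: Sequence[MonoCut]) -> MixedCut:
    # reduce() would return a lone MonoCut unchanged, so wrap it explicitly.
    if len(cuts) == 1:
        return MixedCut.from_mono(next(iter(cuts)))
    return reduce(mix, cuts)


single = mix_cuts([MonoCut("a")])
double = mix_cuts([MonoCut("a"), MonoCut("b")])
```

In this sketch both single and double are MixedCut instances, so callers no longer need to special-case the one-channel recordings.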

I also added this to the MixedCut truncate method

if len(new_tracks) == 1:
    # The truncation resulted in just a single cut - wrap it so the result stays a MixedCut.
    return MixedCut.from_mono(new_tracks[0].cut)

I can submit a pull request if this seems fine, but I think this is related more generally to the issue of how to support MultiChannel audio.

@desh2608
Collaborator

I think we can close this since MultiCut is now supported as its own class. Feel free to re-open if you think this is still an issue.
