After trying out different approaches, I found that the following general pattern worked best for me with the pyannote.audio pipeline:

```python
import os

import pandas as pd
import torch
import torchaudio
from pyannote.audio import Pipeline
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm


class DiarizationDataset(Dataset):
    def __init__(self, filepaths):
        self.filepaths = filepaths

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # speech_array, sample_rate = librosa.load(audio_filepath, sr=16000)
        speech_array, sample_rate = torchaudio.load(self.filepaths[idx])
        # Downmix multi-channel audio to mono
        if speech_array.dim() >= 2:
            speech_array = torch.mean(speech_array, dim=0)
        # pyannote.audio pipelines expect a dict with keys "waveform" and "sample_rate"
        inputs = {
            "sample_rate": sample_rate,
            "waveform": speech_array,
            "filename": self.filepaths[idx],
        }
        return inputs


def custom_collate_fn(data):
    """
    To load audio files batch-wise we make a custom collator that allows
    for variable-length waveforms.
    """
    waveform = [sample["waveform"] for sample in data]
    sample_rate = torch.stack([torch.tensor(sample["sample_rate"]) for sample in data])
    filename = [sample["filename"] for sample in data]
    batch = {
        "sample_rate": sample_rate,
        "waveform": waveform,  # samples may have different lengths, so keep a list
        "filename": filename,
    }
    return batch


directory = "some/dir"
filepaths = [os.path.join(directory, f) for f in os.listdir(directory)]
dataset = DiarizationDataset(filepaths)
diarization_loader = DataLoader(
    dataset,
    batch_size=1,  # Not much difference between 1 and higher
    shuffle=False,
    num_workers=24,
    collate_fn=custom_collate_fn,
    prefetch_factor=4,
)

pipe = Pipeline.from_pretrained("pyannote/[email protected]", use_auth_token=True)

speakers = []
for index, batch in tqdm(enumerate(diarization_loader), total=len(diarization_loader)):
    for i in range(len(batch["waveform"])):
        diarization = pipe(
            {
                "sample_rate": int(batch["sample_rate"][i]),
                # restore the channel dimension: (time,) -> (1, time)
                "waveform": batch["waveform"][i].to("cuda").unsqueeze(0),
            }
        )
        df_speaker = pd.DataFrame(
            [
                {"start": segment.start, "end": segment.end, "label": label}
                for segment, _, label in diarization.itertracks(yield_label=True)
            ]
        )
        df_speaker["dokid"] = batch["filename"][i]
        speakers.append(df_speaker)

df_speakers = pd.concat(speakers).reset_index(drop=True)
```
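Note that the collator deliberately returns `waveform` as a plain Python list rather than a padded tensor: the pipeline only accepts one waveform at a time, so the dataloader buys you prefetching and parallel decoding in the `num_workers` background processes rather than batched inference, which is also why `batch_size=1` performs about as well as larger values.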
---
Hi! Excellent library!
I was wondering whether there is a way to write a custom dataloader that works with existing pyannote.audio pipelines. Inference with pipelines is slowed down by reading data and running inference sequentially, as opposed to having a dataloader prefetch examples.
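To make the bottleneck concrete, here is a sketch of the sequential baseline I mean (`files` is a hypothetical list of audio paths):

```python
# Sequential baseline: each file is read, decoded, and diarized in turn,
# so the GPU idles while the next file is loaded and decoded.
for path in files:
    diarization = pipe(path)  # pipelines accept a file path directly
```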
In your "applying a pipeline" example I saw that the user can supply waveforms in-memory to the pipe as a dict with the fields `waveform` and `sample_rate`. So I tried writing a dataloader with a custom collator that padded all waveforms to the same length and returned dimensions `(batch_size, waveform_length)`, and did the same with the `sample_rate`s. However, the pipe seems to expect single examples, as I get a "Boolean value of Tensor with more than one value is ambiguous" error.
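For reference, the single-file in-memory usage from the docs looks like this (a minimal sketch; `audio.wav` is a placeholder path):

```python
import torchaudio
from pyannote.audio import Pipeline

# One file at a time: "waveform" is a (channel, time) tensor and
# "sample_rate" an int. Passing batched tensors triggers the error above.
waveform, sample_rate = torchaudio.load("audio.wav")
pipe = Pipeline.from_pretrained("pyannote/[email protected]", use_auth_token=True)
diarization = pipe({"waveform": waveform, "sample_rate": sample_rate})
```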
It would be very helpful if you, in the documentation, could provide at minimum one example of inference that goes beyond single files: e.g. true batched inference, or `batch_size=1` with a regular custom-written PyTorch dataloader that reads files from disk, without magically abstracting things away behind Lightning or pyannote-specific (black magic) functions.

Disclaimer: I haven't yet tried applying a pretrained model the regular way, without pipes. I wanted quick and dirty results, which meant trying to avoid reimplementing the functionality of existing pipes in order to perform efficient inference that replicates the pipe output. I imagine efficient inference is quite an important use case for many users of your library, so an example would go a long way, I think!