After trying out different approaches, I found that the following general pattern worked best for me with the pyannote.audio pipeline:

```python
import os

import pandas as pd
import torch
import torchaudio
from pyannote.audio import Pipeline
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm


class DiarizationDataset(Dataset):
    def __init__(self, filepaths):
        self.filepaths = filepaths

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # speech_array, sample_rate = librosa.load(audio_filepath, sr=16000)
        speech_array, sample_rate = torchaudio.load(self.filepaths[idx])
        # Downmix multi-channel audio to mono
        if speech_array.dim() >= 2:
            speech_array = torch.mean(speech_array, dim=0)
        # pyannote.audio pipelines expect a dict with keys "waveform" and "sample_rate"
        inputs = {
            "sample_rate": sample_rate,
            "waveform": speech_array,
            "filename": self.filepaths[idx],
        }
        return inputs


def custom_collate_fn(data):
    """
    To load audio files batch-wise we make a custom collator that allows
    for variable-length waveforms.
    """
    waveform = [sample["waveform"] for sample in data]
    sample_rate = torch.stack([torch.tensor(sample["sample_rate"]) for sample in data])
    filename = [sample["filename"] for sample in data]
    batch = {
        "sample_rate": sample_rate,
        "waveform": waveform,  # samples may have different lengths, so keep a list
        "filename": filename,
    }
    return batch


directory = "some/dir"
filepaths = [os.path.join(directory, f) for f in os.listdir(directory)]
dataset = DiarizationDataset(filepaths)
diarization_loader = DataLoader(
    dataset,
    batch_size=1,  # Not much difference between 1 and higher
    shuffle=False,
    num_workers=24,
    collate_fn=custom_collate_fn,
    prefetch_factor=4,
)

pipe = Pipeline.from_pretrained("pyannote/[email protected]", use_auth_token=True)

speakers = []
for index, batch in tqdm(enumerate(diarization_loader), total=len(diarization_loader)):
    for i in range(len(batch["waveform"])):
        diarization = pipe(
            {
                "sample_rate": int(batch["sample_rate"][i]),
                # restore the channel dimension: (time,) -> (1, time)
                "waveform": batch["waveform"][i].to("cuda").unsqueeze(0),
            }
        )
        df_speaker = pd.DataFrame(
            [
                {"start": segment.start, "end": segment.end, "label": label}
                for segment, _, label in diarization.itertracks(yield_label=True)
            ]
        )
        df_speaker["dokid"] = batch["filename"][i]
        speakers.append(df_speaker)

df_speakers = pd.concat(speakers).reset_index(drop=True)
```
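Note that the collator deliberately returns `waveform` as a plain Python list rather than a padded tensor: the pipeline only accepts one waveform at a time, so the dataloader buys you prefetching and parallel decoding in the `num_workers` background processes rather than batched inference, which is also why `batch_size=1` performs about as well as larger values.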
---
Hi! Excellent library!
I was wondering whether there is a way to write a custom dataloader that works with existing pyannote.audio pipelines. Inference with pipelines is slowed down by reading data and running inference sequentially, as opposed to having a dataloader prefetch examples.
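To make the bottleneck concrete, here is a sketch of the sequential baseline I mean (`files` is a hypothetical list of audio paths):

```python
# Sequential baseline: each file is read, decoded, and diarized in turn,
# so the GPU idles while the next file is loaded and decoded.
for path in files:
    diarization = pipe(path)  # pipelines accept a file path directly
```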
In your "applying a pipeline" example I saw that the user can supply waveforms in-memory to the pipe as a dict with the fields `waveform` and `sample_rate`. So I tried writing a dataloader with a custom collator that padded all waveforms to the same length and returned dimensions `(batch_size, waveform_length)`, and did the same with the `sample_rate`s. However, the pipe seems to expect single examples, as I get a "Boolean value of Tensor with more than one value is ambiguous" error.
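For reference, the single-file in-memory usage from the docs looks like this (a minimal sketch; `audio.wav` is a placeholder path):

```python
import torchaudio
from pyannote.audio import Pipeline

# One file at a time: "waveform" is a (channel, time) tensor and
# "sample_rate" an int. Passing batched tensors triggers the error above.
waveform, sample_rate = torchaudio.load("audio.wav")
pipe = Pipeline.from_pretrained("pyannote/[email protected]", use_auth_token=True)
diarization = pipe({"waveform": waveform, "sample_rate": sample_rate})
```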
It would be very helpful if you, in the documentation, could provide at minimum one example of inference that goes beyond single files: e.g. true batched inference, or `batch_size=1` with a regular custom-written PyTorch dataloader that reads files from disk, without magically abstracting things away behind Lightning or pyannote-specific (black magic) functions.

Disclaimer: I haven't yet tried applying a pretrained model the regular way, without pipes. I wanted quick and dirty results, which meant trying to avoid reimplementing the functionality of existing pipes in order to perform efficient inference that replicates the pipe output. I imagine efficient inference is quite an important use case for many users of your library, so an example would go a long way, I think!