Demo code #4
Comments
We implement the demo using matplotlib in a naive way, and we don't plan to share this part of the code. The demo generation is based on this other project: https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py
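For readers who want a starting point, below is a minimal, hypothetical sketch of what such a naive matplotlib visualization could look like. This is not the authors' demo code; the inputs `frame_paths`, `camera_preds`, `sound_preds`, and `out_dir` are assumed to come from an inference loop like the one shared later in this thread.

```python
# Hypothetical sketch (not the authors' demo code): pair each video frame with
# the angle predictions up to that frame, save one figure per frame, and
# stitch the saved images into a video afterwards.
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from PIL import Image

def render_demo_frames(frame_paths, camera_preds, sound_preds, out_dir):
    # frame_paths: list of image paths; camera_preds / sound_preds: per-frame angles.
    os.makedirs(out_dir, exist_ok=True)
    t = np.arange(len(camera_preds))
    for i, frame_path in enumerate(frame_paths):
        fig, (ax_img, ax_plot) = plt.subplots(1, 2, figsize=(10, 4))
        ax_img.imshow(Image.open(frame_path))
        ax_img.axis('off')
        # Plot the prediction curves up to the current frame.
        ax_plot.plot(t[:i + 1], camera_preds[:i + 1], label='camera rotation')
        ax_plot.plot(t[:i + 1], sound_preds[:i + 1], label='sound direction')
        ax_plot.set_xlim(0, len(t))
        ax_plot.set_xlabel('frame')
        ax_plot.set_ylabel('angle (degrees)')
        ax_plot.legend(loc='upper right')
        fig.savefig(f'{out_dir}/{i:05d}.png', bbox_inches='tight')
        plt.close(fig)
```

The saved frames could then be stitched into a video, for example with `ffmpeg -framerate 30 -i %05d.png out.mp4`.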
Please see the reply in #5 (comment). The model is correct.
If possible, could you please briefly explain the logic of the demo code for this project? I checked out https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py, and its structure seems very similar to evaluate_angle.py in this project. Could you please share a bit more about the demo code for this project? Thank you so much!
@1390806607 @deBrian07

```python
import csv
import glob
import h5py
import io
import json
import librosa
import numpy as np
import os
import pickle
from PIL import Image
from PIL import ImageFilter
import random
import scipy
import soundfile as sf
import time
from tqdm import tqdm
import cv2
import torch
import torch.nn as nn
import torchaudio
import torchvision.transforms as transforms

import sys
sys.path.append('..')
from data import AudioSFMbaseDataset


class SingleVideoDataset(AudioSFMbaseDataset):
    def __init__(self, args, pr, list_sample, split='train'):
        self.pr = pr
        self.args = args
        self.split = split
        self.seed = pr.seed
        self.image_transform = transforms.Compose(self.generate_image_transform(args, pr))
        self.repeat = args.repeat if split == 'train' else 1

        video_path = list_sample
        audio_path = os.path.join(video_path, 'audio', 'audio.wav')
        frame_path = os.path.join(video_path, 'frames')
        meta_path = os.path.join(video_path, 'meta.json')
        with open(meta_path, "r") as f:
            self.meta_dict = json.load(f)
        self.frame_rate = self.meta_dict['frame_rate']

        frame_list = glob.glob(f'{frame_path}/*.jpg')
        frame_list.sort()
        self.frame_list = frame_list

        audio, self.audio_rate = self.read_audio(audio_path)
        audio = np.transpose(audio, (1, 0))
        audio = self.normalize_audio(audio, desired_rms=0.1)
        self.audio = torch.from_numpy(audio.copy()).float()

        num_sample = len(self.frame_list)
        # Calculate the keyframes: every `keyframe_interval`-th frame is a keyframe.
        if args.keyframe_interval is None:
            args.keyframe_interval = num_sample
        self.keyframe_inds = np.arange(0, num_sample, step=args.keyframe_interval)

    def __getitem__(self, index):
        audio_length = self.audio.shape[1]
        frame_path = self.frame_list[index]
        # Center the audio clip on the timestamp of the current frame.
        start_time = index / self.meta_dict['frame_rate'] - self.pr.clip_length / 2
        audio_rate = self.audio_rate
        clip_length = int(self.pr.clip_length * self.audio_rate)
        audio_start_time = int(start_time * self.audio_rate)
        audio_end_time = audio_start_time + clip_length
        # Clamp the clip to the valid audio range.
        if audio_start_time < 0:
            audio_start_time = 0
            audio_end_time = audio_start_time + clip_length
        if audio_end_time > audio_length:
            audio_end_time = audio_length
            audio_start_time = audio_end_time - clip_length

        img_2 = self.read_image(frame_path)
        audio = self.audio[:, audio_start_time: audio_end_time]

        # Determine the reference image.
        keyframe_ind = int(index // self.args.keyframe_interval)
        # If the current index is a keyframe, use the previous keyframe as reference.
        if index % self.args.keyframe_interval == 0:
            if keyframe_ind != 0:
                keyframe_ind -= 1
        img1_ind = self.keyframe_inds[keyframe_ind]
        img_1 = self.read_image(self.frame_list[img1_ind])

        batch = {
            'img_1': img_1,
            'img_2': img_2,
            'img_path': frame_path,
            'keyframe_ind': keyframe_ind,
            'audio': audio,
        }
        return batch

    def getitem_test(self, index):
        return self.__getitem__(index)

    def __len__(self):
        return len(self.frame_list)
```
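As a usage sketch (not part of the shared code), the dataset above can presumably be wrapped in a standard PyTorch `DataLoader` for per-frame inference. The `args`/`pr` config objects, their attribute names, and the video folder path below are assumptions based on the snippet above and this repo's conventions.

```python
# Sketch only: wrap SingleVideoDataset in a DataLoader for per-frame inference.
# The video folder is assumed to contain frames/, audio/audio.wav, and meta.json,
# matching the paths used in SingleVideoDataset.__init__ above.
from torch.utils.data import DataLoader

video_path = 'path/to/your/video_folder'  # hypothetical path
dataset = SingleVideoDataset(args, pr, video_path, split='test')
loader = DataLoader(
    dataset,
    batch_size=args.batch_size,    # attribute names assumed from the repo's config
    shuffle=False,                 # keep frames in temporal order
    num_workers=args.num_workers,
    drop_last=False,
)
```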
For the inference code:

```python
# For smoothing the predictions; used inside the visualization code.
def smooth_prediction(signal, window_length):
    # Pad the end of the signal by repeating its last value, then take a
    # sliding window and report the per-window mean and standard deviation.
    signal_padding = torch.tensor([signal[-1]] * (window_length - 1))
    signal = torch.tensor(signal)
    signal = torch.cat([signal, signal_padding], dim=0)
    signal = signal.unfold(-1, window_length, 1)
    signal = signal.cpu().numpy()
    signal_mean = signal.mean(-1)
    signal_std = signal.std(-1)
    return signal_mean, signal_std


def predict(args, pr, net_vision, net_audio, batch, device):
    inputs = {}
    inputs['img_1'] = batch['img_1'].to(device)
    inputs['img_2'] = batch['img_2'].to(device)
    # Predict the camera rotation between the reference and current frames.
    _, camera_angle_pred = net_vision(inputs['img_2'], inputs['img_1'], return_angle=True)
    camera_angle_pred = rot2theta(args, camera_angle_pred) * pr.rotation_correctness

    inputs['audio'] = batch['audio'].to(device)
    # Predict the sound direction-of-arrival angle from the binaural clip.
    _, sound_angle_pred = net_audio(inputs['audio'], return_angle=True)
    sound_angle_pred = logit2angle(args, sound_angle_pred)
    return {
        'camera_pred': camera_angle_pred,
        'sound_pred': sound_angle_pred,
    }


def inference(args, pr, net_vision, net_audio, data_set, data_loader, device='cuda', video_idx=None):
    net_vision.eval()
    net_audio.eval()
    img_path_list = []
    camera_preds = []
    sound_preds = []
    keyframe_inds = []
    with torch.no_grad():
        for step, batch in tqdm(enumerate(data_loader), total=len(data_loader), desc="Inference"):
            img_paths = batch['img_path']
            keyframe_ind = batch['keyframe_ind']
            out = predict(args, pr, net_vision, net_audio, batch, device)
            camera_preds.append(out['camera_pred'])
            sound_preds.append(out['sound_pred'])
            keyframe_inds.append(keyframe_ind)
            for i in range(args.batch_size):
                img_path_list.append(img_paths[i])

    img_path_list = np.array(img_path_list)
    camera_preds = torch.cat(camera_preds, dim=-1).data.cpu().numpy()
    sound_preds = torch.cat(sound_preds, dim=-1).data.cpu().numpy()
    keyframe_inds = torch.cat(keyframe_inds, dim=-1).data.cpu().numpy()

    # Per-frame camera predictions are relative to their reference keyframe, so
    # accumulate the keyframe-to-keyframe rotations and add the corresponding
    # offset to obtain absolute camera angles.
    keyframe_camera_preds = camera_preds[data_set.keyframe_inds]
    keyframe_camera_preds = np.cumsum(keyframe_camera_preds)
    camera_preds += keyframe_camera_preds[keyframe_inds]

    if args.vis_predict_only:
        visualization_prediction(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
    else:
        visualization_video(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
```

I think this code should be more than enough. I won't provide any further code related to the demo, to avoid figures with a duplicated style appearing elsewhere.
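To make the keyframe accumulation at the end of `inference()` concrete, here is a small self-contained sketch with toy numbers. The values are made up purely for illustration and this is not additional demo code.

```python
import numpy as np

# Toy example: 6 frames, keyframe_interval = 3, so keyframes are frames 0 and 3.
# Each per-frame camera prediction is the rotation relative to its reference keyframe.
camera_preds = np.array([0.0, 2.0, 4.0, 6.0, 1.0, 2.0])   # degrees, made-up values
keyframe_inds_per_frame = np.array([0, 0, 0, 0, 1, 1])    # reference keyframe index per frame
dataset_keyframe_inds = np.array([0, 3])                  # frame indices of the keyframes

# Cumulative rotation of each keyframe relative to the first frame.
keyframe_offsets = np.cumsum(camera_preds[dataset_keyframe_inds])  # [0.0, 6.0]

# Absolute camera angle per frame = relative prediction + offset of its keyframe.
absolute_camera = camera_preds + keyframe_offsets[keyframe_inds_per_frame]
print(absolute_camera)  # [0. 2. 4. 6. 7. 8.]
```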
Thank you so much for the information, it helped a lot. Could you let me know what video you used for the visualization? Is there a specific dataset you chose the video from? Thank you!
Those are self-collected videos recorded with an iPhone and binaural mics.
Got it. Do you have any suggestions for binaural mics? Different mics might serve different purposes.
Since the model is trained on a simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that fits the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers come back.
Please look at the attached video. I recorded it with my iPhone with the stereo option on. However, neither the video nor the audio prediction makes sense to me. Could you please take a look at it and give any suggestions? Thank you so much! video-None.mp4
There are several issues:
One thing I want to make clear: I recommend trying to debug your issues on your own first before asking me. I will only answer questions about this repo.
Got it, I'll try to debug it myself first, thank you so much!
Hello, could you possibly share the binaural videos as soon as you have them? Thank you so much!
@deBrian07 Hi, we have uploaded the demo videos to this GitHub repo. Please see the README for details. Note: the demo videos are for research purposes only.
Hello, could you send me the implementation of the real-world demo shown at https://ificl.github.io/SLfM/ so I can have a look?