Demo code #4

Closed · 1390806607 opened this issue Nov 21, 2023 · 14 comments

Comments
@1390806607

Hello, could you send me the demo code for the real-world results shown on https://ificl.github.io/SLfM/ so I can take a look?

@IFICL
Owner

IFICL commented Nov 21, 2023

We implemented the demo in a naive way using matplotlib, and we don't plan to share this part of the code. However, the demo generation is based on our other project: https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py

@IFICL closed this as completed on Nov 22, 2023
@1390806607
Author

[image attached]
Hello, is this picture correct?

@IFICL
Owner

IFICL commented Nov 25, 2023

Please see the reply from #5 (comment). The model is correct.

@deBrian07

If possible, could you please briefly explain the logic of the demo code for this project? I checked out https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py, and its structure seems very similar to evaluate_angle.py in this project. Thank you so much!

@IFICL
Owner

IFICL commented Nov 26, 2023

@1390806607 @deBrian07
The demo code is very simple. You set up the audio model (with a 0.51s audio length) and the vision model. In the dataloader, you extract the current frame and the corresponding 0.51s of audio. The vision model requires two images, so we set up keyframes and accumulate the rotation across them. Here is the demo-video dataloader code:

import csv
import glob
import h5py
import io
import json
import librosa
import numpy as np
import os
import pickle
from PIL import Image
from PIL import ImageFilter
import random
import scipy
import soundfile as sf
import time
from tqdm import tqdm
import cv2

import torch
import torch.nn as nn
import torchaudio
import torchvision.transforms as transforms

import sys
sys.path.append('..')
from data import AudioSFMbaseDataset

import pdb


class SingleVideoDataset(AudioSFMbaseDataset):
    def __init__(self, args, pr, list_sample, split='train'):
        self.pr = pr
        self.args = args
        self.split = split
        self.seed = pr.seed
        self.image_transform = transforms.Compose(self.generate_image_transform(args, pr))

        self.repeat = args.repeat if split == 'train' else 1

        video_path = list_sample
        audio_path = os.path.join(video_path, 'audio', 'audio.wav')
        frame_path = os.path.join(video_path, 'frames')
        meta_path = os.path.join(video_path, 'meta.json')
        with open(meta_path, "r") as f:
            self.meta_dict = json.load(f)
        
        # audio_sample_rate = meta_dict['audio_sample_rate']
        self.frame_rate = self.meta_dict['frame_rate']
        frame_list = glob.glob(f'{frame_path}/*.jpg')
        frame_list.sort()
        
        # import pdb; pdb.set_trace()
        self.frame_list = frame_list
        audio, self.audio_rate = self.read_audio(audio_path)
        audio = np.transpose(audio, (1, 0))
        audio = self.normalize_audio(audio, desired_rms=0.1)
        self.audio = torch.from_numpy(audio.copy()).float()
        num_sample = len(self.frame_list)

        # calculate the keyframes:
        if args.keyframe_interval is None:
            args.keyframe_interval = num_sample
        self.keyframe_inds = np.arange(0, num_sample, step=args.keyframe_interval)

        # print('Video Dataloader: # of frames {}: {}'.format(self.split, num_sample))

    def __getitem__(self, index):
        # import pdb; pdb.set_trace()
        audio_length = self.audio.shape[1]
        frame_path = self.frame_list[index]
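        # take a clip_length-second audio window centered on this frame's timestamp,
        # then clamp it so the window stays inside the full audio track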
        start_time = index / self.meta_dict['frame_rate'] - self.pr.clip_length / 2
        audio_rate = self.audio_rate
        clip_length = int(self.pr.clip_length * self.audio_rate)
        audio_start_time = int(start_time * self.audio_rate)
        audio_end_time = audio_start_time + clip_length

        if audio_start_time < 0:
            audio_start_time = 0
            audio_end_time = audio_start_time + clip_length

        if audio_end_time > audio_length:
            audio_end_time = audio_length
            audio_start_time = audio_end_time - clip_length
        
        img_2 = self.read_image(frame_path)

        audio = self.audio[:, audio_start_time: audio_end_time]
        
        # determine reference image 
        keyframe_ind = int(index // self.args.keyframe_interval)
        
        # the current index is a keyframe, so set the reference image to the previous keyframe
        if index % self.args.keyframe_interval == 0:
            if keyframe_ind != 0:
                keyframe_ind -= 1

        img1_ind = self.keyframe_inds[keyframe_ind]
        img_1 = self.read_image(self.frame_list[img1_ind])

        batch = {
            'img_1': img_1,
            'img_2': img_2,
            'img_path': frame_path,
            'keyframe_ind': keyframe_ind,
            'audio': audio,
        }
        return batch

    def getitem_test(self, index):
        return self.__getitem__(index)

    def __len__(self): 
        return len(self.frame_list)

And here is the inference code:

# For smoothing the predictions; used inside the visualization code
def smooth_prediction(signal, window_length):
    signal_padding = torch.tensor([signal[-1]] * (window_length - 1))
    signal = torch.tensor(signal)
    signal = torch.cat([signal, signal_padding], dim=0)
    signal = signal.unfold(-1, window_length, 1)
    signal = signal.cpu().numpy()
    signal_mean = signal.mean(-1)
    signal_std = signal.std(-1)
    return signal_mean, signal_std
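
# Example usage (hypothetical values): smooth the per-frame predictions with a
# 5-frame window before plotting, e.g.
#   camera_mean, camera_std = smooth_prediction(camera_preds, window_length=5)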


def predict(args, pr, net_vision, net_audio, batch, device):
    # import pdb; pdb.set_trace()
    inputs = {}
    inputs['img_1'] = batch['img_1'].to(device)
    inputs['img_2'] = batch['img_2'].to(device)
    _, camera_angle_pred = net_vision(inputs['img_2'], inputs['img_1'], return_angle=True)
    camera_angle_pred = rot2theta(args, camera_angle_pred) * pr.rotation_correctness

    inputs['audio'] = batch['audio'].to(device)
    _, sound_angle_pred = net_audio(inputs['audio'], return_angle=True)
    sound_angle_pred = logit2angle(args, sound_angle_pred)

    return {
        'camera_pred': camera_angle_pred,
        'sound_pred': sound_angle_pred,
    }


def inference(args, pr, net_vision, net_audio, data_set, data_loader, device='cuda', video_idx=None):
    # import pdb; pdb.set_trace()
    net_vision.eval()
    net_audio.eval()

    img_path_list = []
    camera_preds = []
    sound_preds = []
    keyframe_inds = []
    
    with torch.no_grad():
        for step, batch in tqdm(enumerate(data_loader), total=len(data_loader), desc="Inference"):
            # import pdb; pdb.set_trace()
            img_paths = batch['img_path']
            keyframe_ind = batch['keyframe_ind']
            out = predict(args, pr, net_vision, net_audio, batch, device)

            camera_preds.append(out['camera_pred'])
            sound_preds.append(out['sound_pred'])
            keyframe_inds.append(keyframe_ind)

            # iterate over the actual batch size (the last batch may be smaller)
            for i in range(len(img_paths)):
                img_path_list.append(img_paths[i])
    
    # import pdb; pdb.set_trace()

    img_path_list = np.array(img_path_list)
    camera_preds = torch.cat(camera_preds, dim=-1).data.cpu().numpy()
    sound_preds = torch.cat(sound_preds, dim=-1).data.cpu().numpy()
    keyframe_inds = torch.cat(keyframe_inds, dim=-1).data.cpu().numpy()

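    # camera_preds are currently relative to each frame's reference keyframe: accumulate the
    # keyframe-to-keyframe rotations (cumulative sum) and add the matching offset so each
    # prediction becomes a rotation relative to the first frame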
    keyframe_camera_preds = camera_preds[data_set.keyframe_inds]
    keyframe_camera_preds = np.cumsum(keyframe_camera_preds)
    camera_preds += keyframe_camera_preds[keyframe_inds]
    if args.vis_predict_only:
        visualization_prediction(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
    else:
        visualization_video(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
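
As a rough sketch of how these pieces might be wired together (not from the repo; the args/pr fields, values, and paths below are assumptions inferred from the code above, and the base dataset class and models may require additional fields):

import argparse
from torch.utils.data import DataLoader

# hypothetical configuration objects: only the fields read by the code above are set here
args = argparse.Namespace(batch_size=8, keyframe_interval=30, repeat=1,
                          vis_predict_only=True)
pr = argparse.Namespace(clip_length=0.51, seed=1234, rotation_correctness=1.0)

data_set = SingleVideoDataset(args, pr, list_sample='path/to/demo-video', split='test')
data_loader = DataLoader(data_set, batch_size=args.batch_size,
                         shuffle=False, num_workers=4)

# net_vision, net_audio = ...  # build the models and load checkpoints, e.g. as in evaluate_angle.py
# inference(args, pr, net_vision, net_audio, data_set, data_loader, device='cuda')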

I think this code should be more than enough. I won't provide any further demo-related code, to avoid figures with a duplicated style appearing elsewhere.

@IFICL changed the title from "code" to "Demo code" on Nov 26, 2023
@IFICL pinned this issue on Nov 26, 2023
@deBrian07

Thank you so much for the information, it helped a lot.

Could you let me know what video you used for the visualization? Is there a specific dataset you chose it from? Thank you!

@IFICL
Owner

IFICL commented Dec 4, 2023

Thank you so much for the information, it helped a lot.

Could you let me know what video you used for the visualization? Is there a specific dataset you chose it from? Thank you!

Those are self-collected videos, recorded with an iPhone and binaural mics.

@deBrian07

Got it. Do you have any suggestions for binaural mics, since different mics serve different purposes?

@IFICL
Owner

IFICL commented Dec 4, 2023

Since the model is trained on a simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that fits the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers are back.

@deBrian07

Please take a look at the attached video. I recorded it with my iPhone with the stereo audio option on. However, neither the video nor the audio prediction makes sense to me. Could you please take a look and give me any suggestions? Thank you so much!

video-None.mp4

@IFICL
Owner

IFICL commented Dec 4, 2023

There are several issues:

  1. First of all, the model is trained on landscape images, so it will not work directly on portrait images. You also need to set keyframe_interval to make it work.
  2. Second, a stereo mic is not a binaural mic that matches the simulated HRTF. Also note that when you record video in portrait mode, the phone's mics end up at the top and bottom, not on the left and right.

One thing I want to make clear: I recommend trying to debug issues on your own before asking me. I will only answer questions about this repo.

@deBrian07

Got it, I'll try to debug it myself first, thank you so much!

@deBrian07

Since the model is trained on a simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that fits the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers are back.

Hello, could you possibly share the binaural videos as soon as you have them? Thank you so much!

@IFICL
Owner

IFICL commented Dec 18, 2023

@deBrian07 Hi, we have uploaded the demo videos to this GitHub repo. Please see the README for details. Note: the demo videos are for research purposes only.
