Demo code #4

Closed · 1390806607 opened this issue Nov 21, 2023 · 14 comments

Comments
@1390806607

Hello, could you send me the demo code for the real-world results shown on https://ificl.github.io/SLfM/ so I can take a look?

@IFICL
Owner

IFICL commented Nov 21, 2023

We implemented the demo in a naive way using matplotlib, and we don't plan to share this part of the code. However, the demo generation is based on our other project: https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py

@IFICL closed this as completed on Nov 22, 2023
@1390806607
Author

[image attached]
Hello, is this picture correct?

@IFICL
Owner

IFICL commented Nov 25, 2023

Please see the reply from #5 (comment). The model is correct.

@deBrian07

If possible, could you please briefly explain the logic of the demo code for this project? I checked out https://github.com/IFICL/stereocrw/blob/master/vis_scripts/vis_video_itd.py, and its structure seems very similar to evaluate_angle.py in this project. Thank you so much!

@IFICL
Owner

IFICL commented Nov 26, 2023

@1390806607 @deBrian07
The demo code is very simple. You set up the audio model (with a 0.51s audio length) and the vision model. In the dataloader, you extract the current frame and the corresponding 0.51s of audio. The vision model requires two images, so we set up keyframes and accumulate the rotation across them. Here is the demo-video dataloader code:

import csv
import glob
import h5py
import io
import json
import librosa
import numpy as np
import os
import pickle
from PIL import Image
from PIL import ImageFilter
import random
import scipy
import soundfile as sf
import time
from tqdm import tqdm
import cv2

import torch
import torch.nn as nn
import torchaudio
import torchvision.transforms as transforms

import sys
sys.path.append('..')
from data import AudioSFMbaseDataset

import pdb


class SingleVideoDataset(AudioSFMbaseDataset):
    def __init__(self, args, pr, list_sample, split='train'):
        self.pr = pr
        self.args = args
        self.split = split
        self.seed = pr.seed
        self.image_transform = transforms.Compose(self.generate_image_transform(args, pr))

        self.repeat = args.repeat if split == 'train' else 1

        video_path = list_sample
        audio_path = os.path.join(video_path, 'audio', 'audio.wav')
        frame_path = os.path.join(video_path, 'frames')
        meta_path = os.path.join(video_path, 'meta.json')
        with open(meta_path, "r") as f:
            self.meta_dict = json.load(f)
        
        # audio_sample_rate = meta_dict['audio_sample_rate']
        self.frame_rate = self.meta_dict['frame_rate']
        frame_list = glob.glob(f'{frame_path}/*.jpg')
        frame_list.sort()
        
        # import pdb; pdb.set_trace()
        self.frame_list = frame_list
        audio, self.audio_rate = self.read_audio(audio_path)
        audio = np.transpose(audio, (1, 0))
        audio = self.normalize_audio(audio, desired_rms=0.1)
        self.audio = torch.from_numpy(audio.copy()).float()
        num_sample = len(self.frame_list)

        # calculate the keyframes:
        if args.keyframe_interval is None:
            args.keyframe_interval = num_sample
        self.keyframe_inds = np.arange(0, num_sample, step=args.keyframe_interval)

        # print('Video Dataloader: # of frames {}: {}'.format(self.split, num_sample))

    def __getitem__(self, index):
        # import pdb; pdb.set_trace()
        audio_length = self.audio.shape[1]
        frame_path = self.frame_list[index]
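        # take a clip_length-second audio window centered on this frame's timestamp,
        # then clamp it so the window stays inside the full audio track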
        start_time = index / self.meta_dict['frame_rate'] - self.pr.clip_length / 2
        audio_rate = self.audio_rate
        clip_length = int(self.pr.clip_length * self.audio_rate)
        audio_start_time = int(start_time * self.audio_rate)
        audio_end_time = audio_start_time + clip_length

        if audio_start_time < 0:
            audio_start_time = 0
            audio_end_time = audio_start_time + clip_length

        if audio_end_time > audio_length:
            audio_end_time = audio_length
            audio_start_time = audio_end_time - clip_length
        
        img_2 = self.read_image(frame_path)

        audio = self.audio[:, audio_start_time: audio_end_time]
        
        # determine reference image 
        keyframe_ind = int(index // self.args.keyframe_interval)
        
        # the current index is a keyframe, so set the reference image to the previous keyframe
        if index % self.args.keyframe_interval == 0:
            if keyframe_ind != 0:
                keyframe_ind -= 1

        img1_ind = self.keyframe_inds[keyframe_ind]
        img_1 = self.read_image(self.frame_list[img1_ind])

        batch = {
            'img_1': img_1,
            'img_2': img_2,
            'img_path': frame_path,
            'keyframe_ind': keyframe_ind,
            'audio': audio,
        }
        return batch

    def getitem_test(self, index):
        return self.__getitem__(index)

    def __len__(self): 
        return len(self.frame_list)

And here is the inference code:

# For smoothing the predictions; used inside the visualization code
def smooth_prediction(signal, window_length):
    signal_padding = torch.tensor([signal[-1]] * (window_length - 1))
    signal = torch.tensor(signal)
    signal = torch.cat([signal, signal_padding], dim=0)
    signal = signal.unfold(-1, window_length, 1)
    signal = signal.cpu().numpy()
    signal_mean = signal.mean(-1)
    signal_std = signal.std(-1)
    return signal_mean, signal_std
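
# Example usage (hypothetical values): smooth the per-frame predictions with a
# 5-frame window before plotting, e.g.
#   camera_mean, camera_std = smooth_prediction(camera_preds, window_length=5)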


def predict(args, pr, net_vision, net_audio, batch, device):
    # import pdb; pdb.set_trace()
    inputs = {}
    inputs['img_1'] = batch['img_1'].to(device)
    inputs['img_2'] = batch['img_2'].to(device)
    _, camera_angle_pred = net_vision(inputs['img_2'], inputs['img_1'], return_angle=True)
    camera_angle_pred = rot2theta(args, camera_angle_pred) * pr.rotation_correctness

    inputs['audio'] = batch['audio'].to(device)
    _, sound_angle_pred = net_audio(inputs['audio'], return_angle=True)
    sound_angle_pred = logit2angle(args, sound_angle_pred)

    return {
        'camera_pred': camera_angle_pred,
        'sound_pred': sound_angle_pred,
    }


def inference(args, pr, net_vision, net_audio, data_set, data_loader, device='cuda', video_idx=None):
    # import pdb; pdb.set_trace()
    net_vision.eval()
    net_audio.eval()

    img_path_list = []
    camera_preds = []
    sound_preds = []
    keyframe_inds = []
    
    with torch.no_grad():
        for step, batch in tqdm(enumerate(data_loader), total=len(data_loader), desc="Inference"):
            # import pdb; pdb.set_trace()
            img_paths = batch['img_path']
            keyframe_ind = batch['keyframe_ind']
            out = predict(args, pr, net_vision, net_audio, batch, device)

            camera_preds.append(out['camera_pred'])
            sound_preds.append(out['sound_pred'])
            keyframe_inds.append(keyframe_ind)

            # iterate over the actual batch size (the last batch may be smaller)
            for i in range(len(img_paths)):
                img_path_list.append(img_paths[i])
    
    # import pdb; pdb.set_trace()

    img_path_list = np.array(img_path_list)
    camera_preds = torch.cat(camera_preds, dim=-1).data.cpu().numpy()
    sound_preds = torch.cat(sound_preds, dim=-1).data.cpu().numpy()
    keyframe_inds = torch.cat(keyframe_inds, dim=-1).data.cpu().numpy()

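    # camera_preds are currently relative to each frame's reference keyframe: accumulate the
    # keyframe-to-keyframe rotations (cumulative sum) and add the matching offset so each
    # prediction becomes a rotation relative to the first frame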
    keyframe_camera_preds = camera_preds[data_set.keyframe_inds]
    keyframe_camera_preds = np.cumsum(keyframe_camera_preds)
    camera_preds += keyframe_camera_preds[keyframe_inds]
    if args.vis_predict_only:
        visualization_prediction(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
    else:
        visualization_video(args, pr, data_set, data_loader, img_path_list, camera_preds, sound_preds, video_idx)
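
As a rough sketch of how these pieces might be wired together (not from the repo; the args/pr fields, values, and paths below are assumptions inferred from the code above, and the base dataset class and models may require additional fields):

import argparse
from torch.utils.data import DataLoader

# hypothetical configuration objects: only the fields read by the code above are set here
args = argparse.Namespace(batch_size=8, keyframe_interval=30, repeat=1,
                          vis_predict_only=True)
pr = argparse.Namespace(clip_length=0.51, seed=1234, rotation_correctness=1.0)

data_set = SingleVideoDataset(args, pr, list_sample='path/to/demo-video', split='test')
data_loader = DataLoader(data_set, batch_size=args.batch_size,
                         shuffle=False, num_workers=4)

# net_vision, net_audio = ...  # build the models and load checkpoints, e.g. as in evaluate_angle.py
# inference(args, pr, net_vision, net_audio, data_set, data_loader, device='cuda')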

I think this code should be more than enough. I won't provide any further demo-related code, to avoid figures with a duplicated style appearing elsewhere.

@IFICL changed the title from "code" to "Demo code" on Nov 26, 2023
@IFICL pinned this issue on Nov 26, 2023
@deBrian07

Thank you so much for the information, it helped a lot.

Could you let me know what video you used for the visualization? Is there a specific dataset you chose it from? Thank you!

@IFICL
Owner

IFICL commented Dec 4, 2023

Thank you so much for the information, it helped a lot.

Could you let me know what video you used for the visualization? Is there a specific dataset you chose it from? Thank you!

Those are self-collected videos, recorded with an iPhone and binaural mics.

@deBrian07

Got it. Do you have any suggestions for binaural mics, since different mics serve different purposes?

@IFICL
Owner

IFICL commented Dec 4, 2023

Since the model is trained on a simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that fits the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers are back.

@deBrian07

Please take a look at the attached video. I recorded it with my iPhone with the stereo audio option on. However, neither the video nor the audio prediction makes sense to me. Could you please take a look and give me any suggestions? Thank you so much!

video-None.mp4

@IFICL
Owner

IFICL commented Dec 4, 2023

There are several issues:

  1. First of all, the model is trained on landscape images, so it will not work directly on portrait images. You also need to set keyframe_interval to make it work.
  2. Second, a stereo mic is not a binaural mic that matches the simulated HRTF. Also note that when you record video in portrait mode, the phone's mics end up at the top and bottom, not on the left and right.

One thing I want to make clear: I recommend trying to debug issues on your own before asking me. I will only answer questions about this repo.

@deBrian07

Got it, I'll try to debug it myself first, thank you so much!

@deBrian07

Since the model is trained on a simulated binaural mic, a real binaural mic will have a domain gap against it. I suggest a binaural mic that fits the human HRTF rather than a stereo mic. I will see if I can upload one or two videos when our servers are back.

Hello, could you possibly share the binaural videos as soon as you have them? Thank you so much!

@IFICL
Owner

IFICL commented Dec 18, 2023

@deBrian07 Hi, we have uploaded the demo videos to this GitHub repo. Please see the README for details. Note: the demo videos are for research purposes only.
