Question about merge_input function - Does it really include different resolutions? #231

Open
pldlgb opened this issue Jan 8, 2025 · 4 comments

Comments

@pldlgb
pldlgb commented Jan 8, 2025

Hi, I have a question regarding the merge_input function in your code. Specifically, the docstring mentions:

def merge_input(self, sample, encoder_hidden_length, encoder_attention_mask):
    """
        Merge the input video with different resolutions into one sequence
        Sample: From low resolution to high resolution
    """

However, when looking at the implementation, it seems to me that this function might not actually handle different resolutions, but rather incorporates historical frame information. Could you please clarify if this function indeed processes inputs of varying resolutions, or if it only deals with historical conditions from past frames?

Thank you for your time and for providing this project!

@feifeiobama
Collaborator

Great observation! It only deals with history conditions of different resolutions. The "input" here refers to the transformer input, not the user input.

@pldlgb
Author

pldlgb commented Jan 8, 2025

My understanding is that the historical conditions here should all have the same resolution. For example, if there are two frames of historical conditions, they should both be 16x24, and there won't be cases where 16x20 is mixed with 32x40.

@feifeiobama
Collaborator

Our model compresses earlier history conditions to lower resolutions to save memory and compute. For further details, please refer to the temporal pyramid part of our paper.
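
As a rough illustration of the idea in this reply (not the repository's actual merge_input), the sketch below pools older history frames to lower resolutions and then concatenates everything into one token sequence, ordered from low to high resolution. The helper names `avg_pool_2x`, `merge_history`, and `make_frame`, the 2x pooling per step of age, and the `max_levels` cap are all assumptions made for the example; the real compression schedule is described in the paper. The 16x24 frame size is borrowed from the example earlier in this thread.

```python
# Hypothetical sketch only: names and the pooling schedule are assumptions,
# not the project's implementation.

def avg_pool_2x(frame):
    """Halve the height and width of a 2D grid (list of rows) by 2x2 averaging."""
    h, w = len(frame), len(frame[0])
    return [
        [
            (frame[i][j] + frame[i][j + 1]
             + frame[i + 1][j] + frame[i + 1][j + 1]) / 4.0
            for j in range(0, w - 1, 2)
        ]
        for i in range(0, h - 1, 2)
    ]


def merge_history(frames, max_levels=2):
    """Flatten frames (oldest first) into a single token sequence.

    The newest frame keeps full resolution; a frame that is k steps older
    is pooled min(k, max_levels) times (assumed schedule).
    Returns (tokens, segment_lengths).
    """
    tokens, lengths = [], []
    newest = len(frames) - 1
    for idx, frame in enumerate(frames):
        levels = min(newest - idx, max_levels)
        for _ in range(levels):
            frame = avg_pool_2x(frame)
        flat = [value for row in frame for value in row]
        tokens.extend(flat)
        lengths.append(len(flat))
    return tokens, lengths


def make_frame(h, w, value):
    """Build a constant h x w grid, standing in for a latent frame."""
    return [[float(value)] * w for _ in range(h)]


# Three history frames at 16x24, oldest first.
frames = [make_frame(16, 24, t) for t in range(3)]
tokens, lengths = merge_history(frames)
print(lengths)  # [24, 96, 384]
```

Under these assumptions, one merged sequence holds segments of three different resolutions (4x6, 8x12, and 16x24), and the per-segment lengths are what an attention mask would need in order to tell the segments apart.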

@pldlgb
Author

pldlgb commented Jan 8, 2025

I understand what you mean, but in each call to merge_input there will only be one resolution of latent as input. It can be either a low-resolution latent or a high-resolution latent, but within a single merge_input call there will only be a single unified resolution, correct?
