Hi, I have a question regarding the merge_input function in your code. Specifically, the docstring mentions:
def merge_input(self, sample, encoder_hidden_length, encoder_attention_mask):
""" Merge the input video with different resolutions into one sequence Sample: From low resolution to high resolution """
However, when looking at the implementation, it seems to me that this function might not actually handle different resolutions, but rather incorporates historical frame information. Could you please clarify if this function indeed processes inputs of varying resolutions, or if it only deals with historical conditions from past frames?
Thank you for your time and for providing this project!
My understanding is that the historical conditions here should all have the same resolution. For example, if there are two frames of historical conditions, they should both be 16x24, and there won't be cases where 16x20 is mixed with 32x40.
Our model compresses earlier history conditions to a lower resolution to save memory and compute. For further details, please refer to the temporal pyramid part of our paper.
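To make the idea concrete, here is a rough sketch of how history frames at different resolutions could end up in one token sequence. It is not the repository's merge_input. The function names are hypothetical, and the halving schedule (each step further back in time divides height and width by 2) is an assumption for illustration only; the actual schedule is described in the paper.

```python
def pyramid_resolutions(height, width, num_history, factor=2):
    """Return (h, w) for each history frame, oldest first, then the current frame.

    Assumption: every step further back in time divides both sides by
    `factor` (never below 1). The real model may use a different schedule.
    """
    sizes = []
    for age in range(num_history, -1, -1):  # age 0 is the current frame
        scale = factor ** age
        sizes.append((max(height // scale, 1), max(width // scale, 1)))
    return sizes


def merge_token_offsets(sizes):
    """Flatten each (h, w) grid into h * w tokens and return the start
    offset of every segment plus the total merged sequence length."""
    offsets, total = [], 0
    for h, w in sizes:
        offsets.append(total)
        total += h * w
    return offsets, total


# Current frame at 32x48 with two history frames.
sizes = pyramid_resolutions(32, 48, num_history=2)
offsets, total = merge_token_offsets(sizes)
print(sizes)    # [(8, 12), (16, 24), (32, 48)]
print(offsets)  # [0, 96, 480]
print(total)    # 2016
```

Under this assumption, a single merged sequence contains segments from several resolutions, with the oldest history contributing the fewest tokens.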
I understand what you mean, but in each merge_input function, there will only be one resolution of latent as input. It can either be a low-resolution latent or a high-resolution latent, but within the merge_input function, there will only be a single unified resolution, correct?
Pyramid-Flow/pyramid_dit/flux_modules/modeling_pyramid_flux.py
Line 242 in a012faa