I was wondering how feasible you think full next-frame prediction on an unseen video would be, given a suitable masking strategy. I assume the question applies to both VideoMAE and VideoMAEv2. The masking strategy could be as simple as masking the entire last frame while leaving the preceding frames unmasked, and then obtaining logits for the reconstructed masked frame. Do you think this is feasible?
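To make the proposed masking concrete, here is a minimal sketch of such a mask. It is not taken from the VideoMAE code base: `last_tubelet_mask` is a hypothetical helper, and the defaults (16 frames, tubelet size 2, 224x224 input, 16x16 patches) as well as the time-major token ordering are assumptions about a typical configuration. Note that with a tubelet size of 2, each token spans two frames, so the smallest unit that can be hidden this way is the last pair of frames rather than a single frame.

```python
def last_tubelet_mask(num_frames=16, tubelet_size=2, image_size=224, patch_size=16):
    """Return a flat boolean mask (True = masked) that hides only the last temporal slice.

    Assumes tokens are ordered time-major: all patches of the first
    temporal slice, then all patches of the second, and so on.
    """
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    temporal_slices = num_frames // tubelet_size          # 8 with the defaults
    tokens_per_slice = (image_size // patch_size) ** 2    # 14 * 14 = 196
    visible = (temporal_slices - 1) * tokens_per_slice    # everything before the last slice

    # Visible tokens first, then the fully masked final slice.
    return [False] * visible + [True] * tokens_per_slice


mask = last_tubelet_mask()
print(len(mask), sum(mask))  # total tokens, masked tokens
```

With these assumed defaults the mask ratio is only 1/8, far below the high ratios these models are usually pre-trained with, so a pre-trained checkpoint may need fine-tuning before it handles this masking pattern well.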
I've run similar experiments and achieved results comparable to MAE. Based on my limited experimental results: predicting features is easier to train than predicting pixels; the potential of this training method may be higher than that of MAE; and the resource overhead may be greater.
There has been some similar (predictive or autoregressive) work recently, such as V-JEPA and AIM. You could look into those to learn more.
Great work and thanks for the code!