Turning VideoMAEv2 into a next-frame prediction model #40

IoSonoMarco · 2023-11-01T08:54:32Z

Great work and thanks for the code!

I was wondering how feasible you think it is to do full next-frame prediction on an unseen video, given a suitable masking strategy. I guess this applies to both VideoMAE and VideoMAEv2. The masking strategy could be as simple as masking the whole last frame while leaving the preceding frames unmasked, and then obtaining the model's output for the reconstructed masked frame. Do you think this is feasible?
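To make the idea concrete, here is a minimal sketch of such a mask in NumPy. It is not code from the repository: `last_frame_mask` and its parameters are hypothetical names, and it assumes 16 input frames, a tubelet size of 2, a 14x14 patch grid, and tokens flattened in (time, height, width) order. Note that under these assumptions each token spans two frames, so the smallest unit that can be masked is the last *pair* of frames rather than a single frame.

```python
import numpy as np


def last_frame_mask(num_frames=16, tubelet_size=2, grid_h=14, grid_w=14,
                    num_masked_slots=1):
    """Boolean token mask (True = masked) hiding the last temporal slot(s).

    Hypothetical helper: assumes tokens are ordered (time, height, width)
    and that each temporal slot covers `tubelet_size` consecutive frames.
    """
    if num_frames % tubelet_size != 0:
        raise ValueError("num_frames must be divisible by tubelet_size")
    t = num_frames // tubelet_size  # number of temporal token slots
    if not 1 <= num_masked_slots < t:
        raise ValueError("must mask at least one slot and keep at least one")
    mask = np.zeros((t, grid_h, grid_w), dtype=bool)
    mask[t - num_masked_slots:] = True  # hide every patch in the final slot(s)
    return mask.reshape(-1)  # flatten to one entry per token


mask = last_frame_mask()
mask_ratio = mask.mean()  # fraction of tokens hidden from the encoder
```

With these defaults the mask hides one of eight temporal slots, which is a much lower masking ratio than the random tube masking used in pre-training, so some fine-tuning with this mask would probably be needed before the reconstruction is useful.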

congee524 · 2024-03-14T08:14:12Z

I've run similar experiments and got results comparable to MAE. Based on my limited experiments: predicting features is easier to train than predicting pixels; this training method may have higher potential than MAE; and the resource overhead may be greater.
There has been some similar (predictive or autoregressive) work recently, such as V-JEPA and AIM, which you could look into.
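As a rough illustration of the pixel-versus-feature distinction (not the setup used in the experiments above, which is not described), the sketch below applies the same masked regression loss to two different targets. Everything here is an assumption for illustration: `masked_mse` is a hypothetical helper, the arrays are random stand-ins for model outputs, the pixel target is assumed to be a flattened 2x16x16x3 tubelet, and the feature target is assumed to come from some separate target encoder with 768-dimensional outputs.

```python
import numpy as np


def masked_mse(pred, target, mask):
    """Mean squared error over masked tokens only.

    pred, target: arrays of shape (num_tokens, dim); mask: (num_tokens,) bool.
    """
    if not mask.any():
        raise ValueError("mask selects no tokens")
    diff = pred[mask] - target[mask]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
num_tokens = 1568             # assumed: 8 temporal slots x 14 x 14 patches
pixel_dim = 2 * 16 * 16 * 3   # assumed: one flattened RGB tubelet per token
feat_dim = 768                # assumed: width of a hypothetical target encoder

mask = np.zeros(num_tokens, dtype=bool)
mask[-196:] = True            # last temporal slot hidden

# Random stand-ins for targets and predictions.
pixel_target = rng.standard_normal((num_tokens, pixel_dim))
feature_target = rng.standard_normal((num_tokens, feat_dim))
pixel_pred = rng.standard_normal((num_tokens, pixel_dim))
feature_pred = rng.standard_normal((num_tokens, feat_dim))

pixel_loss = masked_mse(pixel_pred, pixel_target, mask)        # MAE-style target
feature_loss = masked_mse(feature_pred, feature_target, mask)  # feature-prediction target
```

The extra cost mentioned above would plausibly come from computing the feature targets, since that requires a forward pass through a second encoder on every batch.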
