ViViT

(Figure: overview of the ViViT architecture)

Embedding Video Clips

  • Uniform frame sampling: sample n_t frames from the video, embed each 2D frame independently as in ViT, and concatenate the resulting tokens.
  • Tubelet embedding: extract non-overlapping spatio-temporal tubes from the video and linearly project each one to a token (see the sketch below).
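
A minimal PyTorch sketch of tubelet embedding, assuming a tubelet size of 2×16×16 and an embedding dimension of 768 (illustrative values, not a fixed choice from the paper); the `TubeletEmbedding` class name is made up for this sketch.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Sketch: non-overlapping spatio-temporal tubes, each linearly projected to a token."""
    def __init__(self, in_channels=3, embed_dim=768, tubelet_size=(2, 16, 16)):
        super().__init__()
        # A Conv3d with kernel_size == stride extracts non-overlapping t x h x w tubes
        # and projects each one to an embed_dim-dimensional token in a single step.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tubelet_size, stride=tubelet_size)

    def forward(self, video):
        # video: (batch, channels, T, H, W)
        x = self.proj(video)                 # (batch, embed_dim, T/t, H/h, W/w)
        return x.flatten(2).transpose(1, 2)  # (batch, n_tokens, embed_dim)

tokens = TubeletEmbedding()(torch.randn(1, 3, 32, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 768]) -> a 16 x 14 x 14 token grid
```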

Transformer Models for Video

Model 1: Spatio-temporal attention

  • This model simply forwards all spatio-temporal tokens extracted from the video, $z_0$, through the transformer encoder
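
A minimal PyTorch sketch of this unfactorised design, assuming the tokens come from the tubelet embedding above; the token-grid size, the small layer count, and the use of a CLS token for classification are illustrative rather than the official configuration.

```python
import torch
import torch.nn as nn

embed_dim, num_tokens = 768, 8 * 14 * 14                     # n_t x n_h x n_w token grid (illustrative)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True),
    num_layers=2,                                            # kept small for the sketch
)
cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
pos_embed = nn.Parameter(torch.zeros(1, num_tokens + 1, embed_dim))

z0 = torch.randn(2, num_tokens, embed_dim)                   # spatio-temporal tokens from the embedding step
z0 = torch.cat([cls_token.expand(2, -1, -1), z0], dim=1) + pos_embed
out = encoder(z0)                                            # every token attends to every other token
video_repr = out[:, 0]                                       # CLS representation -> classification head
```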

Model 2: Factorised encoder

  • This model consists of two transformer encoders in series: a spatial encoder that models interactions between tokens extracted from the same temporal index, followed by a temporal encoder that models interactions across temporal indices (see the sketch below).
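
A minimal PyTorch sketch of the factorised encoder, assuming the tokens arrive as (batch, n_t, n_h·n_w, dim); mean pooling stands in for the paper's per-frame CLS tokens, and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

embed_dim, n_t, n_spatial = 768, 8, 14 * 14                  # temporal indices and tokens per index (illustrative)

def make_encoder(num_layers):
    layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

spatial_encoder, temporal_encoder = make_encoder(2), make_encoder(2)

tokens = torch.randn(2, n_t, n_spatial, embed_dim)           # (batch, n_t, n_h*n_w, dim)
b = tokens.shape[0]

x = tokens.reshape(b * n_t, n_spatial, embed_dim)            # fold time into the batch dimension
frame_repr = spatial_encoder(x).mean(dim=1)                  # one pooled representation per temporal index
frame_repr = frame_repr.reshape(b, n_t, embed_dim)
video_repr = temporal_encoder(frame_repr).mean(dim=1)        # (batch, dim) -> classification head
```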

Model 3: Factorised self-attention

  • Factorised self-attention (Model 3): within each transformer block, the multi-headed self-attention operation is factorised into two operations that first compute self-attention spatially, and then temporally (see the sketch below).
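
A minimal PyTorch sketch of one such block, assuming the tokens are ordered time-major (all spatial tokens of the first temporal index, then the second, and so on); the LayerNorm placement is simplified and the MLP sub-layer is omitted, so this illustrates only the attention factorisation.

```python
import torch
import torch.nn as nn

class FactorisedSelfAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x, n_t, n_s):
        # x: (batch, n_t * n_s, dim) spatio-temporal tokens
        b, _, d = x.shape
        # 1) Spatial attention: attend among the n_s tokens that share a temporal index.
        xs = self.norm1(x).reshape(b * n_t, n_s, d)
        xs = self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        x = x + xs.reshape(b, n_t * n_s, d)
        # 2) Temporal attention: attend among the n_t tokens that share a spatial location.
        xt = self.norm2(x).reshape(b, n_t, n_s, d).transpose(1, 2).reshape(b * n_s, n_t, d)
        xt = self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = x + xt.reshape(b, n_s, n_t, d).transpose(1, 2).reshape(b, n_t * n_s, d)
        return x

out = FactorisedSelfAttentionBlock()(torch.randn(2, 8 * 196, 768), n_t=8, n_s=196)
```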

Model 4: Factorised dot-product attention

  • Half of the attention heads compute dot-product attention over the spatial dimensions and the other half over the temporal dimension; the outputs of the two sets of heads are then concatenated (see the sketch below).

Spatial attention: across the H and W dimensions
Temporal attention: across the T dimension
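
A minimal PyTorch sketch of factorised dot-product attention (it relies on `torch.nn.functional.scaled_dot_product_attention`, available from PyTorch 2.0), again assuming time-major token ordering; the helper names and the even split of heads into a spatial and a temporal group are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_attend(q, k, v, groups, length):
    """Restrict dot-product attention to groups of `length` consecutive tokens."""
    b, h, n, dh = q.shape                                    # n == groups * length
    q, k, v = (t.reshape(b, h, groups, length, dh) for t in (q, k, v))
    return F.scaled_dot_product_attention(q, k, v).reshape(b, h, n, dh)

def swap_time_space(t, outer, inner):
    """Reorder tokens between time-major and space-major layouts."""
    b, h, n, dh = t.shape
    return t.reshape(b, h, outer, inner, dh).transpose(2, 3).reshape(b, h, n, dh)

class FactorisedDotProductAttention(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        assert heads % 2 == 0
        self.h, self.dh = heads, dim // heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, n_t, n_s):
        # x: (batch, n_t * n_s, dim), tokens ordered time-major
        b, n, d = x.shape
        q, k, v = self.qkv(x).reshape(b, n, 3, self.h, self.dh).permute(2, 0, 3, 1, 4)
        half = self.h // 2
        # First half of the heads: attention over the n_s tokens of each temporal index.
        spatial = group_attend(q[:, :half], k[:, :half], v[:, :half], n_t, n_s)
        # Second half: attention over the n_t tokens of each spatial index.
        qt, kt, vt = (swap_time_space(t[:, half:], n_t, n_s) for t in (q, k, v))
        temporal = swap_time_space(group_attend(qt, kt, vt, n_s, n_t), n_s, n_t)
        # Concatenate the two sets of heads and project back, as in standard multi-head attention.
        out = torch.cat([spatial, temporal], dim=1).transpose(1, 2).reshape(b, n, d)
        return self.proj(out)

y = FactorisedDotProductAttention()(torch.randn(2, 8 * 196, 768), n_t=8, n_s=196)
```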


Ablation

Model Variants

  • (Table: comparison of the model variants on Kinetics 400 and Epic Kitchens)
  • The unfactorised model (Model 1) performs best on Kinetics 400. However, it can also overfit on smaller datasets such as Epic Kitchens, where the Factorised Encoder (Model 2) performs best.