Hi everyone, my apologies if this has been asked/answered somewhere else; I couldn't find it.
I followed the GluonCV tutorial on action recognition, "fine-tuning with your custom dataset". I am using the slowfast_4x16_resnet50_custom model, and following advice from this post I was able to make it work. As shown there, I am using the VideoClsCustom class to load the dataset:
train_dataset = VideoClsCustom(root=YOUR_ROOT_PATH, setting=YOUR_SETTING_FILE, train=True, new_length=64, slowfast=True, slow_temporal_stride=16, fast_temporal_stride=2, transform=transform_train)
In that post there is a clarification that said: "Basically this means, we randomly select 64 consecutive frames. For fast branch, we use a temporal stride of 2 to sample the 64 frames into 32 frames. For slow branch, we use a temporal stride of 16 to sample the 64 frames into 4 frames. Then we concatenate them together and feed it to the network as input."
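To make sure I am reading that description correctly, here is a rough sketch of the index arithmetic as I understand it (just an illustration on my side, not the actual GluonCV internals; the order of the concatenation is my own guess):

import numpy as np

window = np.arange(64)            # 64 consecutive frame indices
fast_indices = window[::2]        # temporal stride 2  -> 32 frames for the fast branch
slow_indices = window[::16]       # temporal stride 16 -> 4 frames for the slow branch
clip_indices = np.concatenate([fast_indices, slow_indices])  # 36 frames fed to the network
print(len(fast_indices), len(slow_indices), len(clip_indices))  # 32 4 36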
The thing is that most of my videos are short clips of about 1 second, with 30 frames (30 FPS). What happens when the dataset has videos with fewer than 64 frames (and sometimes even fewer than 32)?
Is the data loader duplicating some frames (upsampling), or is it filling them with noise, or something else?
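To illustrate what I mean by duplicating frames, something along these lines (purely hypothetical on my part; I don't know whether VideoClsCustom actually does this):

import numpy as np

num_frames = 30                           # frames available in a 1-second, 30 FPS clip
needed = 64                               # new_length expected by the loader
indices = np.arange(needed) % num_frames  # wraps around, so early frames get reused
print(indices[:10], indices[-5:])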
Thank you in advance,