Convolutional Neural Network for Video Understanding

SnT Project 2021, Brain Cognitive Society

This repository contains the code written for this project so far.

Relevant Links

This is the repository that implements late temporal modeling on top of 3D CNN architectures, focusing mainly on BERT for this purpose.

Environment

python 3.6.4
tensorflow 1.x
keras>=2
Essentially a standard Colab environment.

Dataset

The biggest hurdle we faced while implementing the paper, or any 2D/3D CNN for that matter, was the video dataset. It was extremely difficult to parse the dataset and split each video into frames to send into our model, whose last TGAP layer then averages the per-frame scores to get the final classification.

In the end I tried several methods to accomplish this task. One of them was to convert each video into frames, which are normalized and then converted into a NumPy array that can be given as input to our model.
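The frame-to-array step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook code: it assumes the video has already been decoded into a list of same-sized uint8 RGB frames, and the function name `frames_to_array` and the default of 16 frames are hypothetical choices.

```python
import numpy as np

def frames_to_array(frames, num_frames=16):
    """Sample num_frames evenly from a decoded video and scale pixels to [0, 1].

    frames: sequence of (H, W, 3) uint8 arrays, all the same size (assumption).
    Returns a float32 array of shape (num_frames, H, W, 3).
    """
    if len(frames) == 0:
        raise ValueError("video has no frames")
    # Evenly spaced indices; videos shorter than num_frames repeat some frames.
    idx = np.linspace(0, len(frames) - 1, num=num_frames).round().astype(int)
    clip = np.stack([frames[i] for i in idx]).astype(np.float32)
    # Normalize 8-bit pixel values to the [0, 1] range.
    return clip / 255.0
```

Sampling a fixed number of frames keeps every clip the same shape, so clips can be stacked into batches regardless of the original video length.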

For the first implementation I worked with the UCF101 dataset and trained the model on just 10% of it (due to lack of GPUs, RAM and time).
For the second implementation I used the HMDB51 dataset: I extracted it and passed it in batches, frame by frame, through a CNN+LSTM model. We got 67% accuracy by training on just 1/5 of one of the three splits of HMDB51.
For the third implementation I used the Gluon and MXNet libraries, which have a prewritten script to download and extract the HMDB51 dataset, and then used it with a 3D ResNeXt model.

Model Architecture and Details

I implemented several models: a pretrained VGG with 4 FC layers on top of it plus one classification layer, a CNN with LSTM, and finally 3D ResNeXt101.

1. VGG16 + 4 FC layers

-Inspired from

This model was just a starter for my actual work. I learned how PyTorch works and how to build a working model. The first model uses the VGG16 architecture to extract spatial features from the frames of the video, followed by several ReLU layers and a softmax layer. It is basically a simple 2D CNN that is trained on individual frames of the video and then used to classify them, taking the average over all the frames of the video. -ref2
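The per-frame averaging can be illustrated with a small NumPy sketch. It assumes the 2D CNN has already produced one row of class logits per frame; `classify_video` is a hypothetical helper name, not a function from the notebooks.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_video(frame_logits):
    """Average per-frame class probabilities into one video-level prediction.

    frame_logits: (num_frames, num_classes) array of per-frame CNN outputs.
    Returns (predicted class index, averaged probabilities).
    """
    probs = softmax(frame_logits)        # one distribution per frame
    video_probs = probs.mean(axis=0)     # average over all frames
    return int(np.argmax(video_probs)), video_probs
```

Because every frame is scored independently, this approach ignores the order of the frames, which is the main limitation the later models try to address.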

Result: With this method we got 27% accuracy on the UCF101 dataset (because I was only able to use 10% of the training data to train my model), while the official implementation gives 44% accuracy. We implemented this because it is the first approach that comes to mind for action recognition.

2. CNN + LSTM

-Inspired from here

This was the second model that I implemented in my learning process to classify video.
It is the main working model, which gave 67% accuracy on the HMDB51 dataset.

This model combines a CNN with an LSTM: the convolutional layers explicitly exploit spatial features, while the LSTM efficiently utilises temporal features.

In CNN-LSTM we have two different modules which are combined together. The CNN is a regular CNN which acts as a 'spatial feature extractor'. The sequence of per-frame CNN outputs is fed into the LSTM, which learns the 'temporal features'. We implemented this because it is the foundation for our further work on the action recognition task: in our third implementation we use BERT together with a 3D CNN, where BERT takes over the temporal modeling that the LSTM does here and is expected to do it better.
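To make the data flow concrete, here is a minimal NumPy sketch of the LSTM half, under the assumption that the CNN has already turned each frame into a feature vector. The weight layout (gates stacked in the order input, forget, candidate, output) and the names `lstm_forward` and `classify_clip` are illustrative choices, not the notebook implementation, which relies on a deep learning framework.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(features, W, U, b):
    """Run one LSTM layer over per-frame features.

    features: (T, D) array, one CNN feature vector per frame.
    W: (D, 4H) input weights, U: (H, 4H) recurrent weights, b: (4H,) bias.
    Returns the final hidden state of shape (H,).
    """
    H = U.shape[0]
    h = np.zeros(H)  # hidden state
    c = np.zeros(H)  # cell state
    for x_t in features:  # frames are consumed in temporal order
        z = x_t @ W + h @ U + b
        i = sigmoid(z[0:H])           # input gate
        f = sigmoid(z[H:2 * H])       # forget gate
        g = np.tanh(z[2 * H:3 * H])   # candidate cell values
        o = sigmoid(z[3 * H:4 * H])   # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
    return h

def classify_clip(features, W, U, b, W_out, b_out):
    """Classify a clip from the last hidden state; returns (label, logits)."""
    h = lstm_forward(features, W, U, b)
    logits = h @ W_out + b_out
    return int(np.argmax(logits)), logits
```

Unlike frame averaging, the hidden state carries information from earlier frames forward, so the prediction depends on the order of the frames.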

3. Late Temporal Modeling in 3D CNN Architecture with BERT for Action Recognition - here

For this paper we had to implement a 3D CNN architecture with its TGAP layer replaced by BERT in order to classify the video.
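The difference between the two pooling strategies can be sketched in NumPy. This is only an illustration under stated assumptions: `tgap` averages a 3D CNN feature map over time and space, while `attention_pool` uses a single self-attention head with a class token as a heavily simplified stand-in for BERT. The function names, the shapes, and the single-head design are assumptions made here, not the paper's or this repository's implementation.

```python
import numpy as np

def tgap(feature_map):
    """Global average pooling over time and space: (C, T, H, W) -> (C,)."""
    return feature_map.mean(axis=(1, 2, 3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feature_map, cls_token, pos_emb, Wq, Wk, Wv):
    """Attention-based temporal pooling, read out at a class token.

    feature_map: (C, T, H, W); cls_token: (C,); pos_emb: (T + 1, C);
    Wq, Wk, Wv: (C, C) projection matrices. Returns a (C,) clip descriptor.
    """
    # Pool only over space so that the temporal axis is kept: (T, C).
    tokens = feature_map.mean(axis=(2, 3)).T
    # Prepend the class token and add position information: (T + 1, C).
    tokens = np.vstack([cls_token, tokens]) + pos_emb
    q = tokens[0] @ Wq   # query from the class token only
    k = tokens @ Wk
    v = tokens @ Wv
    # Weights over time steps, computed from the content of the features.
    weights = softmax(k @ q / np.sqrt(q.shape[0]))
    return weights @ v
```

Average pooling gives every time step the same weight, whereas the attention version computes the weights from the features themselves, which is the motivation for replacing TGAP.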

This implementation was extremely difficult for me because of the 3D CNN. I tried to implement the model and looked at some GitHub repos for reference. However, this was only an optional part of my project.

It was very difficult to pass the video data through the 3D CNN architecture, and we are currently stuck at this point.

Usage

All the .ipynb files for the tasks we did are in their respective folders.

To run any notebook, just open it in Colab and install the necessary packages, which are already listed in it.

It takes about 2 hrs to train the first model on 10% of split 1 of the UCF101 dataset (the time is short because we imported VGG16 pretrained on the ImageNet dataset).
The second model takes 24 hrs to run on the HMDB51 dataset (2 hrs to train on 20% of the HMDB51 dataset).
The third model is still in progress: we have implemented the ResNeXt101 architecture so far and are going to train it on the UCF101 dataset. This part is complex because we are using 3D CNNs.

Individual Contribution:

Tejesh

Weeks 1-2: Did the prework and completed various exercises of the ML crash course while studying probability on the side.
Weeks 3-4: Completed the Kaggle and image classification crash courses, tried my hand at TensorFlow and PyTorch, and started working on the paper review.
Weeks 5-6: Completed the paper review, presentation and documentation of the paper, and was the speaker for my team. During these two weeks I spent 3-4 hrs on the work every day. I first implemented a simple VGG16 model and trained it on the UCF101 dataset, and also used the TensorFlow Hub I3D model to classify videos into actions (an easy task).
Weeks 7-8: In the second-to-last week of the project I focused on CNN + LSTM and implemented its architecture, taking inspiration from a GitHub repo that used a CNN with an LSTM to get good results on the AR task; this took one complete week. I then started the last task of implementing 3D ResNeXt101 with BERT, but I am currently stuck at passing video through its complex architecture. I will try my best to build a working model with it.

Folders and files

CNN + LSTM
ResneXt101
VGG16+FC+Softmax
week_1-3
README.md
