"Video-ChatGPT" is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.


Video Theory of Mind 🎥 💬


University of Virginia, The Pennsylvania State University, Northeastern University


Installation 🔧

1. Common Installation Steps for All Systems

We recommend setting up a conda environment for the project:

conda create -n vtom python=3.10 intelpython3_full mkl pip numpy mkl-dpcpp mkl-include intel-openmp intel-fortran-rt dpcpp-cpp-rt ninja astunparse psutil pyyaml requests setuptools typing-extensions sympy filelock networkx fsspec packaging -c intel # At the time of writing, Python 3.12 has just been released but its ecosystem is not yet complete, so we stick with 3.10.
conda activate vtom

cd
git clone [email protected]:zhanwenchen/vtom.git || git clone https://github.com/zhanwenchen/vtom.git
cd vtom

2. Ubuntu 22.04.3 with nvidia-driver-535 (sub-version 535.113.01). These steps could possibly work for other Linux distributions.

# NOTE on building decord with GPU support:
# Bug: error: ‘AVBSFContext’ was not declared in this scope; did you mean ‘AVIOContext’? (an ffmpeg 5.x API change)
# Cause: upstream decord lacks ffmpeg5 support.
# Solution: apply the patch at https://github.com/dmlc/decord/issues/186#issuecomment-1171882325
# Prerequisite: build ffmpeg with NVIDIA Video Codec SDK 12.1: https://docs.nvidia.com/video-technologies/video-codec-sdk/12.1/ffmpeg-with-nvidia-gpu/index.html

# 1. Install nvcodec and ffmpeg5 for PyTorch and decord
# 1a. Install nvcodec headers into your Conda environment
cd
git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
cd nv-codec-headers
vim Makefile   # change the first line to PREFIX = ${CONDA_PREFIX}
make install

# 1b. Install ffmpeg5 with NVIDIA Video Codec SDK support
cd
git clone https://git.ffmpeg.org/ffmpeg.git
cd ffmpeg
sudo apt install yasm libx264-dev libgnutls28-dev
export MY_SM=86 # target GPU compute capability (8.6 = Ampere, e.g., RTX 30xx); set this to your GPU's SM version
./configure \
  --extra-cflags='-I/usr/local/cuda/include' \
  --extra-ldflags='-L/usr/local/cuda/lib64' \
  --nvccflags="-gencode arch=compute_${MY_SM},code=sm_${MY_SM} -O2" \
  --disable-doc \
  --enable-decoder=aac \
  --enable-decoder=h264 \
  --enable-decoder=h264_cuvid \
  --enable-decoder=rawvideo \
  --enable-indev=lavfi \
  --enable-encoder=libx264 \
  --enable-encoder=h264_nvenc \
  --enable-demuxer=mov \
  --enable-muxer=mp4 \
  --enable-filter=scale \
  --enable-filter=testsrc2 \
  --enable-protocol=file \
  --enable-protocol=https \
  --enable-gnutls \
  --enable-shared \
  --enable-gpl \
  --enable-nonfree \
  --enable-cuda-nvcc \
  --enable-libx264 \
  --enable-nvenc \
  --enable-cuvid \
  --disable-postproc \
  --disable-static \
  --enable-nvdec \
  --enable-libmp3lame
make clean
make -j
sudo make install
sudo sh -c "echo '/usr/local/lib' >> /etc/ld.so.conf"
sudo ldconfig

# 1c. Confirm your ffmpeg has nvcodec enabled
# Examples in https://pytorch.org/audio/stable/build.ffmpeg.html#checking-the-intallation
ffprobe -hide_banner -decoders | grep h264
ffmpeg -hide_banner -encoders | grep 264
src="https://download.pytorch.org/torchaudio/tutorial-assets/stream-api/NASAs_Most_Scientifically_Complex_Space_Observatory_Requires_Precision-MP4_small.mp4"
ffmpeg -hide_banner -y -vsync 0 \
     -hwaccel cuvid \
     -hwaccel_output_format cuda \
     -c:v h264_cuvid \
     -resize 360x240 \
     -i "${src}" \
     -c:a copy \
     -c:v h264_nvenc \
     -b:v 5M test.mp4
rm test.mp4
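
# Optional scripted check (a sketch; assumes the freshly built ffmpeg is now on your PATH):
# verify that the NVDEC decoder and NVENC encoder were actually compiled in.
python - <<'PY'
import subprocess

def lists_codec(flag, name):
    # Scan the table printed by `ffmpeg -decoders` / `ffmpeg -encoders`.
    out = subprocess.run(["ffmpeg", "-hide_banner", flag], capture_output=True, text=True).stdout
    return name in out

assert lists_codec("-decoders", "h264_cuvid"), "NVDEC decoder (h264_cuvid) missing"
assert lists_codec("-encoders", "h264_nvenc"), "NVENC encoder (h264_nvenc) missing"
print("ffmpeg was built with NVIDIA codec support")
PY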


# 2. Build Pytorch with CUDA 12.1 from source to use custom ffmpeg5 with nvcodec support
# 2.1 Install system cuSparseLt and NCCL
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo apt install ./cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install libcusparselt0 libcusparselt-dev libnccl2 libnccl-dev

# 2.2 Install conda deps
conda install -c pytorch magma-cuda121 # TODO: include magma-cuda122 building in the future
conda install -c intel -c defaults cmake ninja astunparse expecttest hypothesis psutil pyyaml requests setuptools typing-extensions sympy filelock networkx jinja2 fsspec packaging
conda install -c defaults -c conda-forge jinja2 types-dataclasses optree # NOTE: jinja2 needs to be >= 3.1.2, so at the time of writing, cannot be from -c intel.


# 2.3 Install PyTorch from source
cd && git clone --recursive --single-branch --branch v2.1.0 https://github.com/pytorch/pytorch.git
cd pytorch
# 1. sync submodules
git submodule sync
git submodule update --init --recursive
conda activate vtom
export _GLIBCXX_USE_CXX11_ABI=1
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
export TORCH_CUDA_ARCH_LIST="8.6"
export USE_FFMPEG=1
export USE_TBB=1
export USE_SYSTEM_TBB=1
export USE_SYSTEM_NCCL=1
# TODO: ONNX_USE_PROTOBUF_SHARED_LIBS
# TODO: XNNPACK enabled shared
python setup.py clean && python setup.py develop > install_pytorch.log 2>&1
echo "DONE building pytorch"
pip install tqdm gradio matplotlib sentencepiece protobuf transformers tokenizers huggingface_hub accelerate
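
# Optional post-build sanity check (a sketch; exact version strings will differ):
# confirm the custom build picked up CUDA 12.1, SM 8.6, and the CXX11 ABI.
python - <<'PY'
import torch

print(torch.__version__)                    # a source build reports something like 2.1.0a0+git<hash>
print(torch.version.cuda)                   # expect 12.1
print(torch.cuda.is_available())            # expect True
print(torch.cuda.get_device_capability(0))  # expect (8, 6) given TORCH_CUDA_ARCH_LIST="8.6"
print(torch.compiled_with_cxx11_abi())      # expect True since _GLIBCXX_USE_CXX11_ABI=1
PY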


# 3. Install decord
cd
git clone --recursive https://github.com/zhanwenchen/decord
cd decord
mkdir build && cd build
cmake .. -DUSE_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_BUILD_TYPE=Release
# To rebuild from scratch: cd .. && rm -rf build && mkdir build && cd build
make -j
# Install decord Python bindings
conda activate vtom
cd ../python
python setup.py install --user
# Test decord installation
cd examples
# Run all the Jupyter Notebooks under the vtom environment
# First install ALSA (`sudo apt install libasound2-dev`) and then `pip install simpleaudio opencv-python-headless`
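
# Optional GPU-decode smoke test (a sketch; sample.mp4 is a placeholder for any local H.264 clip).
python - <<'PY'
from decord import VideoReader, gpu

vr = VideoReader("sample.mp4", ctx=gpu(0))  # decode on GPU 0 via NVDEC
print(len(vr), vr[0].shape)                 # frame count and (H, W, 3) of the first frame
PY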

Additionally, install FlashAttention for training:

pip install ninja einops

cd
git clone --single-branch --branch v2.3.2 [email protected]:Dao-AILab/flash-attention.git
cd flash-attention
MAX_JOBS=12 python setup.py install # `pip install .` does not work on this repo. sm_86 also needs to be specified because it is not included by default. 16 parallel jobs need about 96GB of RAM.
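
# Optional import check (a sketch) to confirm the FlashAttention build:
python - <<'PY'
import flash_attn
from flash_attn import flash_attn_func  # FlashAttention-2 kernel entry point
print(flash_attn.__version__)            # expect 2.3.2
PY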

Running Demo Offline 💿

To run the demo offline, please refer to the instructions in offline_demo.md.

PYTHONPATH="./:$PYTHONPATH" python video_chatgpt/demo/video_demo.py --model-name ./LLaVA-Lightning-7B-v1-1 --projection_path ./video_chatgpt-7B.bin

Training 🚋

For training instructions, check out train_video_chatgpt.md.

python scripts/convert_instruction_json_to_training_format.py \
        --input_json_file ./data/VideoInstruct_Dataset.json \
        --output_json_file video_chatgpt_training.json
# Total annotations retained: 100010
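
# Optional: confirm the conversion ran to completion (a sketch; assumes the output
# is a single JSON list, as in the LLaVA-style training files).
python - <<'PY'
import json

with open("video_chatgpt_training.json") as f:
    samples = json.load(f)
print(len(samples))  # should match the reported "Total annotations retained: 100010"
PY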


python scripts/save_spatio_temporal_clip_features.py \
        --video_dir_path ./data/videos_train \
        --clip_feat_path ./data/clip_features_train
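
# Optional: spot-check one extracted feature file (a sketch; this assumes the script
# writes one pickle file per video into --clip_feat_path; adjust the glob if the layout differs).
python - <<'PY'
import glob
import pickle

path = sorted(glob.glob("./data/clip_features_train/*.pkl"))[0]
with open(path, "rb") as f:
    feats = pickle.load(f)
print(path, getattr(feats, "shape", type(feats)))  # expect roughly (temporal + spatial tokens, 1024)
PY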

python scripts/convert_instruction_json_to_training_format_siq2.py \
      --input_json_file ./data/siq2/qa/qa_train.json \
      --output_json_file ./data/siq2/qa/qa_train_instruction.json


# Train the base model on ActivityNet
unset LD_PRELOAD
export IOMP5_PATH=${CONDA_PREFIX}/lib/libiomp5.so # IF Intel
export LD_PRELOAD=${IOMP5_PATH}${LD_PRELOAD:+:${LD_PRELOAD}} # IF Intel
export KMP_AFFINITY=granularity=fine,compact,1,0 # IF Intel
export KMP_BLOCKTIME=1 # IF Intel
PYTHONPATH="./:$PYTHONPATH" torchrun --nproc_per_node=2 --master_port 29001 video_chatgpt/train/train_mem.py \
          --model_name_or_path ./LLaVA-7B-Lightening-v1-1-delta \
          --version v1 \
          --data_path video_chatgpt_training.json \
          --video_folder data/clip_features_train \
          --tune_mm_mlp_adapter True \
          --mm_use_vid_start_end \
          --bf16 True \
          --output_dir ./Video-ChatGPT_7B-1.1_Checkpoints \
          --num_train_epochs 3 \
          --per_device_train_batch_size 1 \
          --per_device_eval_batch_size 1 \
          --gradient_accumulation_steps 1 \
          --evaluation_strategy "no" \
          --save_strategy "steps" \
          --save_steps 3000 \
          --save_total_limit 3 \
          --learning_rate 2e-5 \
          --weight_decay 0. \
          --warmup_ratio 0.03 \
          --lr_scheduler_type "cosine" \
          --logging_steps 100 \
          --tf32 True \
          --model_max_length 2048 \
          --gradient_checkpointing True \
          --lazy_preprocess True

Acknowledgements 🙏

License 📜

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.


Notes

Use the -2 (second-to-last) layer features of CLIP, not the last layer. Shape: 100 (temporal) x 256 (spatial) x 1024 (feature dim). Pool over the 100 frames and pool over the 256 patches. A linear layer [1024, 512] projects the features to tokens. Both text and video tokens are input to the LLM, which takes tokens anyway.
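
A minimal PyTorch sketch of that pooling and projection, using the shapes from the note above (illustrative only, not the repo's actual implementation):

import torch
import torch.nn as nn

T, S, D = 100, 256, 1024                  # temporal frames, spatial patches, CLIP feature dim
clip_feats = torch.randn(T, S, D)         # penultimate-layer CLIP features for one video

temporal_tokens = clip_feats.mean(dim=1)  # average over the 256 spatial patches -> (100, 1024)
spatial_tokens = clip_feats.mean(dim=0)   # average over the 100 frames          -> (256, 1024)
video_tokens = torch.cat([temporal_tokens, spatial_tokens], dim=0)  # (356, 1024)

proj = nn.Linear(D, 512)                  # the [1024, 512] projection to token space
print(proj(video_tokens).shape)           # torch.Size([356, 512])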

Finetuning - 3-4 with their dataset with 8xA100. Init model - LLaVA pretrained 7B. ToM datasets - Social-IQ 1.0/2.0/TinySocial. Need to extract CLIP features with pico files.

Social-IQ 1.0: 1,250 videos, 7,500 questions, 30,000 correct answers, and 22,500 incorrect answers.

Social-IQ 2.0:

TinySocial: 50 videos.

Training code can be complicated. Need to modify the dataloader; LazySupervisedDataTraining needs to be modified. Can ask Yizhou. OpenAI CLIP tokenizer, etc.

JSON file - training.

New task: video frame retrieval based on questions. Novelty not as strong. Literature: next-frame prediction or multiple frames? Video frame prediction is next-frame/segment; our claim is locating frames based on questions. Salient frames for theory-of-mind change/impact. How to correlate theory-of-mind change and impact? Relational retrieval instead of object retrieval (object-level retrieval).

"Get me the frame of the man walking into the store?" vs

"Get the the frame where the man realize his wife lied."

Maybe also add a temporal GNN to learn temporal changes in mental states.

TODOs for next week:
- [ ] Apply for MS Azure access and ask if they can use it.
- [ ] Run eval on Social-IQ (need to modify the dataloader).
- [ ] Can prioritize reproducing finetuning first.
- [ ] Do a survey on video+LLM: Video-LLM, VideoChat, Video-ChatGPT.
- [ ] Define the novel task (convert VQA to video frame retrieval).
- [ ] Run finetuning on stuff.
