This project aims to develop an AI-driven system for recommending and generating product videos for e-commerce, leveraging advances in generative AI and large language models. Focused on the Indonesian market, it addresses the challenge of content curation in digital marketing by automating video recommendations and enhancements.
In this paper, we propose VIDIA (Vision-Language Digital Agent), a multiagent system in which each agent specializes in a different aspect of video analysis and generation, including product recognition, scene understanding, and content enhancement. The system leverages a vision-language pre-training approach, learning a joint representation space that enables seamless cross-modal reasoning and generation.
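As a rough illustration of this multiagent design, the sketch below wires three hypothetical specialist agents (product recognition, scene understanding, content enhancement) into a sequential pipeline over a shared context. The class names, fields, and placeholder outputs are illustrative assumptions, not the actual VIDIA implementation.

```python
from dataclasses import dataclass, field

@dataclass
class VideoContext:
    """Shared state passed between agents (illustrative only)."""
    video_path: str
    products: list = field(default_factory=list)
    scenes: list = field(default_factory=list)
    enhancements: list = field(default_factory=list)

class ProductRecognitionAgent:
    def run(self, ctx: VideoContext) -> VideoContext:
        # Placeholder: a real agent would call a vision-language model here.
        ctx.products.append({"name": "example product", "confidence": 0.9})
        return ctx

class SceneUnderstandingAgent:
    def run(self, ctx: VideoContext) -> VideoContext:
        ctx.scenes.append({"start": 0.0, "end": 3.5, "description": "product close-up"})
        return ctx

class ContentEnhancementAgent:
    def run(self, ctx: VideoContext) -> VideoContext:
        ctx.enhancements.append("add caption overlay for detected product")
        return ctx

def run_pipeline(video_path: str) -> VideoContext:
    """Run each specialist agent in sequence over a shared context."""
    ctx = VideoContext(video_path=video_path)
    for agent in (ProductRecognitionAgent(), SceneUnderstandingAgent(), ContentEnhancementAgent()):
        ctx = agent.run(ctx)
    return ctx

if __name__ == "__main__":
    print(run_pipeline("Example/sample_product_video.mp4"))
```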
These instructions will get your copy of the project up and running on your local machine or Google Colab for development and testing purposes.
- Python 3.8 or higher
- Access to high computational resources (GPU recommended for video processing)
- A YouTube API key for collecting video data
- An OpenAI API key for leveraging models like GPT-4
- A Stability AI API key for Stable Diffusion (latent diffusion) models used for image and video generation
- Clone the Repository

  git clone https://github.com/sumankwan/Vision-Language-Attention-is-All-You-Need-public.git
  cd Vision-Language-Attention-is-All-You-Need-public
- Set Up a Virtual Environment (optional)

  python -m venv venv
  # On Windows
  venv\Scripts\activate
  # On Unix or macOS
  source venv/bin/activate
- Install Required Libraries

  pip install -r requirements.txt

  The requirements.txt file should include the necessary libraries such as torch, transformers, ffmpeg-python, pytube, whisper, and others relevant to the project's technology stack.
- Load API Keys

  - For data collection, add your YouTube API key to the api_keys.txt file in the project directory.
  - For AI system inference, add your OpenAI and Stability AI API keys to the same file (a key-loading sketch follows this setup list).
- Set Up Google Colab

  - Open the project in Google Colab.
  - Connect the Colab notebook to your Google Drive.
- Configure File Structure

  This project is organized into several directories, each serving a specific function within the Multimodal AI Pipeline. Ensure the following file structure is set up in your Google Drive (a Colab script that creates this layout follows this setup list):

  Project/
  ├── Agents/
  │   ├── final/                 # Stores final versions of processed videos
  │   ├── adjusted/              # Contains adjusted videos after post-processing
  │   ├── image/                 # Used for storing images used or generated in the pipeline
  │   └── video/                 # Contains raw and intermediate video files used in processing
  ├── Audio/                     # Audio files used or generated by the pipeline
  ├── DataFrame/                 # DataFrames and related data files used for processing and analytics
  ├── Example/                   # Example scripts and templates for using the pipeline
  ├── Frames/                    # Individual frames extracted from videos for processing
  ├── MVP/                       # Minimal Viable Product demonstrations and related files
  ├── ObjectDetection/           # Scripts and files related to object detection models
  ├── Output/                    # Final output files from the pipeline, including videos and logs
  ├── Shorts/                    # Short video clips for testing or demonstration purposes
  ├── Text/                      # Text data used or generated, including scripts and metadata
  ├── Video/                     # Directory for storing larger video files
  └── Multimodal_AI_pipeline/    # Core scripts and modules of the Multimodal AI pipeline
- Data and AI Inference (optional)

  - Execute the multimodal_AI_pipeline.ipynb notebook to start data collection and run the multimodal processing used for later downstream tasks.
  - The data_post_processing.xlsx file is used to organize the output of the AI system's data pipeline (a snippet for inspecting it is sketched below).
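The repository does not document the exact format of api_keys.txt, so the loader below is only a sketch that assumes one KEY=VALUE pair per line (for example OPENAI_API_KEY=..., YOUTUBE_API_KEY=..., STABILITY_API_KEY=...); adjust the parsing to match the actual file.

```python
import os

def load_api_keys(path: str = "api_keys.txt") -> dict:
    """Parse KEY=VALUE pairs from api_keys.txt (assumed format) into a dict."""
    keys = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, value = line.split("=", 1)
            keys[name.strip()] = value.strip()
    return keys

if __name__ == "__main__":
    keys = load_api_keys()
    # Export so downstream clients (OpenAI SDK, etc.) can read them from the environment.
    for name, value in keys.items():
        os.environ.setdefault(name, value)
    print(f"Loaded {len(keys)} API keys")
```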
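When working in Colab, the directory layout above can be created under your mounted Drive with a short script like the one below; /content/drive/MyDrive/Project is an assumed location, so point BASE wherever you keep the project.

```python
import os

# Mount Google Drive (Colab only); skip this step when running locally.
from google.colab import drive
drive.mount("/content/drive")

BASE = "/content/drive/MyDrive/Project"  # assumed location; adjust to your Drive layout

SUBDIRS = [
    "Agents/final", "Agents/adjusted", "Agents/image", "Agents/video",
    "Audio", "DataFrame", "Example", "Frames", "MVP", "ObjectDetection",
    "Output", "Shorts", "Text", "Video", "Multimodal_AI_pipeline",
]

# Create every directory in the layout, leaving existing ones untouched.
for sub in SUBDIRS:
    os.makedirs(os.path.join(BASE, sub), exist_ok=True)

print(f"Created {len(SUBDIRS)} directories under {BASE}")
```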
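To inspect the pipeline output collected in data_post_processing.xlsx, a short pandas snippet is usually enough; the sheet and column layout are not documented here, so treat them as assumptions (reading .xlsx files also requires the openpyxl package).

```python
import pandas as pd

# Load the post-processed pipeline output (first sheet assumed).
df = pd.read_excel("data_post_processing.xlsx")
print(df.shape)
print(df.head())
```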
To execute the main application script:
python main.py
main.py should be the entry point of your application, orchestrating the workflow of video data gathering, processing, and recommendation generation based on user inputs.
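The repository does not show main.py itself, so the skeleton below is only a guess at how such an entry point might be organized (argument parsing, then gather, process, and recommend stages); the function names and defaults are placeholders.

```python
import argparse

def gather_videos(query: str) -> list:
    """Placeholder: collect candidate videos (e.g. via the YouTube Data API)."""
    return []

def process_videos(videos: list) -> list:
    """Placeholder: run the multimodal analysis pipeline over the videos."""
    return videos

def recommend(videos: list, user_input: str) -> list:
    """Placeholder: rank processed videos against the user's request."""
    return videos

def main() -> None:
    parser = argparse.ArgumentParser(description="VIDIA entry point (illustrative skeleton)")
    parser.add_argument("--query", default="Indonesian e-commerce product video")
    parser.add_argument("--user-input", default="recommend a video for my product")
    args = parser.parse_args()

    videos = gather_videos(args.query)
    processed = process_videos(videos)
    results = recommend(processed, args.user_input)
    print(f"{len(results)} recommendations generated")

if __name__ == "__main__":
    main()
```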
You can interact with the system using the provided Telegram bot.
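The bot's actual commands are not documented here; as a hedged sketch, a minimal python-telegram-bot (v20+) setup that forwards user requests to the pipeline might look like the following, with the token and handler logic as assumptions.

```python
from telegram import Update
from telegram.ext import ApplicationBuilder, CommandHandler, ContextTypes, MessageHandler, filters

TOKEN = "YOUR_TELEGRAM_BOT_TOKEN"  # placeholder; keep real tokens out of version control

async def start(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    await update.message.reply_text("Send me a product description and I will suggest a video.")

async def handle_request(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # Placeholder: the real system would call the recommendation pipeline here.
    await update.message.reply_text(f"Searching product videos for: {update.message.text}")

def main() -> None:
    app = ApplicationBuilder().token(TOKEN).build()
    app.add_handler(CommandHandler("start", start))
    app.add_handler(MessageHandler(filters.TEXT & ~filters.COMMAND, handle_request))
    app.run_polling()

if __name__ == "__main__":
    main()
```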
- Data Collection: The YouTube API is used to gather video content relevant to the Indonesian e-commerce market, focusing on a diverse array of product videos across categories (a collection sketch follows this list). The multimodal_AI_pipeline.ipynb notebook is our data pipeline, which leverages a multiagent multimodal AI system for video understanding.
- Cost: Handling video data is computationally demanding, particularly when analyzing it with large foundation models.
- Future Direction: Plans include incorporating multiple smaller models such as Llama-3 to improve multi-agent collaboration.
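As one possible shape for the collection step above (not the exact notebook code), the sketch below searches for product videos with the YouTube Data API via google-api-python-client and downloads them with pytube; the query, result count, and output directory are assumptions.

```python
from googleapiclient.discovery import build
from pytube import YouTube

YOUTUBE_API_KEY = "YOUR_YOUTUBE_API_KEY"  # placeholder

def search_videos(query: str, max_results: int = 5) -> list:
    """Return video IDs matching the query via the YouTube Data API v3."""
    youtube = build("youtube", "v3", developerKey=YOUTUBE_API_KEY)
    response = youtube.search().list(
        q=query, part="id,snippet", type="video", maxResults=max_results
    ).execute()
    return [item["id"]["videoId"] for item in response.get("items", [])]

def download_video(video_id: str, out_dir: str = "Video") -> str:
    """Download the highest-resolution progressive MP4 stream with pytube."""
    yt = YouTube(f"https://www.youtube.com/watch?v={video_id}")
    stream = (
        yt.streams.filter(progressive=True, file_extension="mp4")
        .order_by("resolution")
        .desc()
        .first()
    )
    return stream.download(output_path=out_dir)

if __name__ == "__main__":
    for vid in search_videos("produk skincare review", max_results=3):
        print("Downloaded:", download_video(vid))
```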
By Suhardiman Agung, with guidance from Professor Daniel Lin.