
Multimodal Article Summarization

This project performs multimodal article summarization using pretrained models. It was done with Prof. Vasudev Varma and Balaji Vasan Srinivasan (Adobe). Unlike previous methods, ours takes both the article text and its images as input and produces a text summary.

Detailed Report

Method

In this codebase we leverage OSCAR as our pretrained encoder and GPT2 as our pretrained decoder, and we use nucleus sampling to generate text. OSCAR constructs a shared image-text embedding by minimizing the distance between the Faster R-CNN features of each object and the corresponding word embedding. However, you can replace OSCAR with any other vision-language transformer such as LXMERT, UNITER, etc. Similarly, you can replace GPT2LMHead with any other LM head to generate logits. The components are extremely modular.
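As a minimal sketch of the decoding step only, the snippet below shows nucleus (top-p) sampling with a GPT2 LM head through the Hugging Face generate API (available in transformers 3.5.0). The text-only prefix here is purely illustrative and stands in for the OSCAR image-text features that condition the decoder in the actual pipeline.

```python
# Sketch: nucleus (top-p) sampling with a GPT-2 LM head.
# In the real pipeline, OSCAR's fused image-text representation
# conditions the decoder; here a plain text prompt stands in for it.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical article prefix, for illustration only.
inputs = tokenizer("The article discusses", return_tensors="pt")

# Nucleus sampling: at each step, sample only from the smallest set of
# tokens whose cumulative probability exceeds top_p.
outputs = model.generate(
    inputs["input_ids"],
    do_sample=True,
    top_p=0.9,        # nucleus threshold
    max_length=60,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```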

Installation

This codebase uses the vilio library as its backbone, which in turn uses huggingface transformers 3.5.0. To install, run pip3 install -r requirements.txt. Further instructions are in GETTING_STARTED.md. To run the code, simply run bash exp.sh.
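The setup and run steps from the paragraph above, collected for convenience:

```bash
# Install dependencies (vilio backbone, huggingface transformers 3.5.0)
pip3 install -r requirements.txt

# Run the pipeline
bash exp.sh
```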
