Kosmos2 model (#1483)
* Kosmos2 model

* Table of contents

* Table of contents

* Changed workflow

* New openvino version with fixes

* Flake8 fixes

* Flake8 fixes

* Flake8 fixes

* Flake8 fixes

* Gradio example

* Fix misspelling

* Ignore treon docker

* Fix misspelling

* Display bboxes

* Display bboxes

* Display bboxes and description

* Change the number

* Spellchecking

* Change the number

* Improve interactive example

* Fix gradio launch

* Fix README image

* Fix README name
aleksandr-mokrov authored Jan 29, 2024
1 parent 36fd474 commit 8e47bfb
Showing 8 changed files with 1,134 additions and 0 deletions.
1 change: 1 addition & 0 deletions .ci/ignore_treon_docker.txt
@@ -45,6 +45,7 @@
272-paint-by-example
273-stable-zephyr-3b-chatbot
276-stable-diffusion-torchdynamo-backend
281-kosmos2-multimodal-large-language-model
301-tensorflow-training-openvino
305-tensorflow-quantization-aware-training
404-style-transfer-webcam
1 change: 1 addition & 0 deletions .ci/ignore_treon_linux.txt
@@ -48,4 +48,5 @@
272-paint-by-example
273-stable-zephyr-3b-chatbot
276-stable-diffusion-torchdynamo-backend
281-kosmos2-multimodal-large-language-model
404-style-transfer-webcam
1 change: 1 addition & 0 deletions .ci/ignore_treon_mac.txt
@@ -45,5 +45,6 @@
272-paint-by-example
273-stable-zephyr-3b-chatbot
276-stable-diffusion-torchdynamo-backend
281-kosmos2-multimodal-large-language-model
279-mobilevlm-language-assistant
404-style-transfer-webcam
1 change: 1 addition & 0 deletions .ci/ignore_treon_win.txt
@@ -47,3 +47,4 @@
272-paint-by-example
273-stable-zephyr-3b-chatbot
276-stable-diffusion-torchdynamo-backend
281-kosmos2-multimodal-large-language-model
7 changes: 7 additions & 0 deletions .ci/spellcheck/.pyspelling.wordlist.txt
@@ -37,6 +37,7 @@ backend
backends
Baevski
BasicUNet
bboxes
BEiT
Belrose
Benchmarking
@@ -98,6 +99,7 @@ ConvNeXt
ConvNeXts
Convolutional
convolutional
coreference
CoSENT
CPUs
cpu
@@ -298,6 +300,9 @@ KiTS
Koltun
Kondate
Kosaraju
kosmos
Kosmos
KOSMOS
KServe
Kubernetes
Kupyn
@@ -363,6 +368,8 @@ mistralai
MLS
mms
MMS
MLLM
MLLMs
MMVLM
MLP
MobileLLaMA
3 changes: 3 additions & 0 deletions README.md
@@ -53,6 +53,7 @@ Check out the latest notebooks that show how to optimize and deploy popular mode
|[Stable Diffusion with IP-Adapter](notebooks/278-stable-diffusion-ip-adapter)<br> | Image conditioning in Stable Diffusion pipeline using IP-Adapter | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/182657d9-2aa3-40b3-9fc4-a90b803419fe width=300> |
| [MobileVLM](notebooks/279-mobilevlm-language-assistant)<br> | Mobile language assistant with MobileVLM and OpenVINO | |
| [DepthAnything](notebooks/280-depth-anything)<br>[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/openvinotoolkit/openvino_notebooks/HEAD?filepath=notebooks%2F280-depth-anythingh%2F280-depth-anything.ipynb)<br>[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/280-depth-anything/280-depth-anything.ipynb) | Monocular Depth estimation with DepthAnything and OpenVINO | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a9a16658-512f-470c-a33c-0e1f9d0ae72c width=300> |
| [Kosmos-2: Grounding Multimodal Large Language Models](notebooks/281-kosmos2-multimodal-large-language-model)<br> | Kosmos-2: Grounding Multimodal Large Language Model and OpenVINO™ | <img src=https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg width=225> |

## Table of Contents

@@ -233,6 +234,8 @@ Demos that demonstrate inference on a particular model.
| [278-stable-diffusion-ip-adapter](notebooks/278-stable-diffusion-ip-adapter)<br> | Image conditioning in Stable Diffusion pipeline using IP-Adapter | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/182657d9-2aa3-40b3-9fc4-a90b803419fe width=300> |
| [279-mobilevlm-language-assistant](notebooks/279-mobilevlm-language-assistant)<br> | Mobile language assistant with MobileVLM and OpenVINO | |
| [280-depth-anything](notebooks/280-depth-anything)<br>[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/openvinotoolkit/openvino_notebooks/HEAD?filepath=notebooks%2F280-depth-anythingh%2F280-depth-anything.ipynb)<br>[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/openvinotoolkit/openvino_notebooks/blob/main/notebooks/280-depth-anything/280-depth-anything.ipynb) | Monocular Depth Estimation with DepthAnything and OpenVINO | <img src=https://github.com/openvinotoolkit/openvino_notebooks/assets/29454499/a9a16658-512f-470c-a33c-0e1f9d0ae72c width=225> |
| [281-kosmos2-multimodal-large-language-model](notebooks/281-kosmos2-multimodal-large-language-model)<br> | Kosmos-2: Multimodal Large Language Model and OpenVINO™ | <img src=https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg width=225> |


<div id='-model-training'></div>


Large diffs are not rendered by default.

34 changes: 34 additions & 0 deletions notebooks/281-kosmos2-multimodal-large-language-model/README.md
@@ -0,0 +1,34 @@
# Kosmos-2: Multimodal Large Language Model and OpenVINO

[KOSMOS-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) is a multimodal large language model (MLLM) that has new capabilities of multimodal grounding and
referring. KOSMOS-2 can understand multimodal input, follow instructions,
perceive object descriptions (e.g., bounding boxes), and ground language to the visual world.

Multimodal Large Language Models (MLLMs) have successfully played a role as a general-purpose interface across a wide
range of tasks, such as language, vision, and vision-language tasks. MLLMs can perceive general modalities, including
texts, images, and audio, and generate responses using free-form texts under zero-shot and few-shot settings.

[In this work](https://arxiv.org/abs/2306.14824), the authors unlock the grounding capability for multimodal large
language models. Grounding capability can provide a more convenient and efficient human-AI interaction for
vision-language tasks. It enables the user to point to the object or region in the image directly rather than
input detailed text descriptions to refer to it, and the model can understand that image region with its spatial
locations. Grounding capability also enables the model to respond
with visual answers (i.e., bounding boxes), which can support more vision-language tasks such as referring expression
comprehension. Visual answers are more accurate and resolve the coreference ambiguity compared with text-only
responses. In addition, grounding capability can link noun phrases and referring expressions in the generated free-form
text response to the image regions, providing more accurate, informative, and comprehensive answers.


![image](https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg)
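
To make the idea of "visual answers" concrete, here is a minimal sketch of how grounded text could be turned into bounding boxes. It is not taken from the notebook. It assumes that each box is written as a pair of location tokens of the form `<patch_index_NNNN>` (top-left, then bottom-right), that indices are row-major on a 32x32 grid, and that each `<object>` holds exactly one box. The helper names `patch_index_to_box` and `extract_boxes` and the sample string are illustrative only; the real processor may handle edge cases differently.

```python
import re

GRID = 32  # assumption: the image is discretized into 32 x 32 bins

def patch_index_to_box(ul_idx, lr_idx, grid=GRID):
    """Map two row-major bin indices to a normalized (x1, y1, x2, y2) box."""
    cell = 1.0 / grid
    ul_x, ul_y = ul_idx % grid, ul_idx // grid
    lr_x, lr_y = lr_idx % grid, lr_idx // grid
    if ul_x == lr_x or ul_y == lr_y:
        # Corners share a row or column: span the outer edges of the cells
        # so the box does not collapse to zero width or height.
        return (ul_x * cell, ul_y * cell, (lr_x + 1) * cell, (lr_y + 1) * cell)
    # Otherwise assume the corners sit at the centers of their cells.
    half = cell / 2
    return (ul_x * cell + half, ul_y * cell + half,
            lr_x * cell + half, lr_y * cell + half)

# Assumed markup: <phrase>text</phrase><object><patch_index_A><patch_index_B></object>
PATTERN = re.compile(
    r"<phrase>(.*?)</phrase><object><patch_index_(\d+)><patch_index_(\d+)></object>"
)

def extract_boxes(text, grid=GRID):
    """Return a list of (phrase, normalized box) pairs found in grounded text."""
    return [
        (m.group(1).strip(), patch_index_to_box(int(m.group(2)), int(m.group(3)), grid))
        for m in PATTERN.finditer(text)
    ]

# Illustrative string in the assumed format, not output captured from the notebook.
sample = (
    "<grounding> An image of<phrase> a snowman</phrase>"
    "<object><patch_index_0044><patch_index_0863></object> warming himself by"
    "<phrase> a fire</phrase><object><patch_index_0005><patch_index_0911></object>."
)

boxes = extract_boxes(sample)
for phrase, box in boxes:
    print(phrase, box)
```

The boxes are normalized to the range 0 to 1, so multiplying by the image width and height gives pixel coordinates for drawing.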

## Notebook contents
- Prerequisites
- Infer the original model
- Convert the model to OpenVINO IR
- Inference
- Interactive inference

## Installation instructions
This is a self-contained example that relies solely on its own code.<br>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to [Installation Guide](../../README.md).
