Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Kosmos2 model
* Table of contents
* Table of contents
* Changed workflow
* New openvino version with fixes
* Flake8 fixes
* Flake8 fixes
* Flake8 fixes
* Flake8 fixes
* Gradio example
* Fix misspelling
* Ignore treon docker
* Fix misspelling
* Display bboxes
* Display bboxes
* Display bboxes and description
* Change the number
* Spellchecking
* Change the number
* Improve interactive example
* Fix gradio launch
* Fix README image
* Fix README name
1 parent 36fd474, commit 8e47bfb
Showing 8 changed files with 1,134 additions and 0 deletions.
1,086 changes: 1,086 additions & 0 deletions
...kosmos2-multimodal-large-language-model/281-kosmos2-multimodal-large-language-model.ipynb
Large diffs are not rendered by default.
34 changes: 34 additions & 0 deletions
notebooks/281-kosmos2-multimodal-large-language-model/README.md
@@ -0,0 +1,34 @@
# Kosmos-2: Multimodal Large Language Model and OpenVINO

[KOSMOS-2](https://github.com/microsoft/unilm/tree/master/kosmos-2) is a multimodal large language model (MLLM) that
adds multimodal grounding and referring capabilities. KOSMOS-2 can understand multimodal input, follow instructions,
perceive object descriptions (e.g., bounding boxes), and ground language to the visual world.

Multimodal Large Language Models (MLLMs) serve as a general-purpose interface across a wide range of language, vision,
and vision-language tasks. MLLMs can perceive general modalities, including text, images, and audio, and generate
free-form text responses in zero-shot and few-shot settings.

[In this work](https://arxiv.org/abs/2306.14824), the authors unlock the grounding capability for multimodal large
language models. Grounding makes human-AI interaction on vision-language tasks more convenient and efficient. It lets
the user point directly to an object or region in the image instead of typing a detailed text description to refer to
it, and the model can understand that image region together with its spatial location. Grounding also enables the model
to respond with visual answers (i.e., bounding boxes), which supports additional vision-language tasks such as
referring expression comprehension. Visual answers are more accurate than text-only responses and resolve coreference
ambiguity. In addition, grounding can link noun phrases and referring expressions in the generated free-form text
response to image regions, providing more accurate, informative, and comprehensive answers.

![image](https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/annotated_snowman.jpg)
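
To make the link between noun phrases and image regions more concrete, here is a minimal sketch of how grounded text could be turned into bounding boxes. It is not the notebook's code. It assumes the location-token scheme described in the KOSMOS-2 paper (a 32 x 32 grid of image patches, with each box encoded as a top-left and a bottom-right patch index) and the `<phrase>…</phrase><object><patch_index_…>` markup seen in raw model output on the Hugging Face model card. The helper names (`patch_center`, `parse_grounded_text`, `to_pixels`) and the sample string are illustrative, and using the center of each patch as the box corner is one possible convention; the official post-processing may treat edge cases differently.

```python
import re

GRID = 32  # assumption: location tokens index a 32 x 32 grid of image patches

# One grounded phrase followed by its top-left and bottom-right location tokens.
PATTERN = re.compile(
    r"<phrase>(?P<phrase>.*?)</phrase>"
    r"<object><patch_index_(?P<tl>\d+)><patch_index_(?P<br>\d+)></object>"
)


def patch_center(index, grid=GRID):
    """Return the normalized (x, y) center of the patch with the given index."""
    row, col = divmod(index, grid)
    return (col + 0.5) / grid, (row + 0.5) / grid


def parse_grounded_text(text, grid=GRID):
    """Extract (phrase, (x1, y1, x2, y2)) pairs with coordinates in [0, 1]."""
    results = []
    for match in PATTERN.finditer(text):
        x1, y1 = patch_center(int(match["tl"]), grid)
        x2, y2 = patch_center(int(match["br"]), grid)
        results.append((match["phrase"].strip(), (x1, y1, x2, y2)))
    return results


def to_pixels(box, width, height):
    """Scale a normalized box to integer pixel coordinates for drawing."""
    x1, y1, x2, y2 = box
    return round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height)


# Illustrative string in the style of raw grounded output (hypothetical input).
sample = (
    "An image of<phrase> a snowman</phrase>"
    "<object><patch_index_0044><patch_index_0863></object>"
    " warming himself by<phrase> a fire</phrase>"
    "<object><patch_index_0005><patch_index_0911></object>."
)

for phrase, box in parse_grounded_text(sample):
    print(phrase, box, to_pixels(box, width=640, height=480))
```

Keeping the boxes normalized until the last step means the same parsed result can be drawn on the image at any display resolution.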

## Notebook contents
- Prerequisites
- Infer the original model
- Convert the model to OpenVINO IR
- Inference
- Interactive inference

## Installation instructions
This is a self-contained example that relies solely on its own code.<br>
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, refer to the [Installation Guide](../../README.md).