Replies: 4 comments
-
My thinking is similar to Figure 2.5 of the arXiv Multimodal Foundation Models paper. Depending on the use case, it may also make sense to apply different strategies post-training when performing image analysis. Using text metadata is an obvious one, but in some cases combining Hugging Face Diffusers conditioning techniques such as OpenPose, Canny edge detection, and depth estimation individually may improve accuracy as well. For those aware of LangChain's LangGraph, this is a potential use case for a hierarchical team of agents.
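For illustration, the hierarchical-team idea can be sketched in plain Python, no LangGraph required: a supervisor fans an image-analysis task out to specialist functions and merges their findings before any conclusion is drawn. Every name here (`analyze_pose`, `analyze_edges`, `analyze_depth`, `supervisor`) is invented for the sketch; each specialist is a trivial stand-in for a real OpenPose, Canny, or depth model.

```python
def analyze_pose(image_desc: str) -> str:
    # Stand-in for an OpenPose-based specialist.
    return "bipedal stance" if "standing" in image_desc else "quadrupedal stance"

def analyze_edges(image_desc: str) -> str:
    # Stand-in for a Canny-edge specialist.
    return "sharp outline" if "clear" in image_desc else "soft outline"

def analyze_depth(image_desc: str) -> str:
    # Stand-in for a depth-estimation specialist.
    return "subject in foreground"

def supervisor(image_desc: str) -> dict:
    # The supervisor delegates to each specialist and merges their
    # findings, so no single cue dominates the final judgment.
    specialists = {
        "pose": analyze_pose,
        "edges": analyze_edges,
        "depth": analyze_depth,
    }
    return {name: fn(image_desc) for name, fn in specialists.items()}

report = supervisor("a clear photo of a creature standing upright")
```

In a real LangGraph hierarchical team, each specialist would be its own agent node and the supervisor a routing node, but the fan-out/merge shape is the same.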
-
Adding Conditional Control to Text-to-Image Diffusion Models (arXiv:2302.05543) discusses techniques to improve Stable Diffusion image generation based on analysis of various image characteristics (edge detection, pose, depth, etc.) to create a latent image baseline. Depending on the baseline and the desired target image, the appropriate strategy will vary.

My suggestion is that a layered approach to image analysis using similar techniques will improve the consistency and accuracy of predictions by segmenting the analysis among a team of specialists. This layered image analysis capability and strategy may already exist in LLaVA, and may only require building out further test scenarios and providing Colab-style training on how to leverage this great model effectively.

My interpretation of what was occurring when vision models incorrectly classified a werewolf as a wolf is that they were taking one or two cues (such as the fur and the wolf head) to draw their conclusion, rather than considering all aspects (such as pose, skeletal structure, and hands) before coming to a conclusion about the subject of the image.
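As a toy illustration of deriving one such conditioning signal, here is a crude NumPy stand-in for edge detection: mark pixels where the intensity gradient exceeds a threshold. A real ControlNet pipeline would use a proper Canny implementation (e.g. via OpenCV), so treat this only as a sketch of the idea of turning an image into a control baseline; the function name and threshold are assumptions.

```python
import numpy as np

def edge_map(gray: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Crude stand-in for Canny: flag pixels whose horizontal or
    # vertical intensity gradient exceeds the threshold.
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1, :]))
    return ((gx + gy) > threshold).astype(np.uint8)

# Tiny synthetic image: left half dark, right half bright,
# so the only edge runs down column 4.
img = np.zeros((4, 8))
img[:, 4:] = 1.0
edges = edge_map(img)
```

The resulting binary map is the kind of "latent image baseline" the paper conditions generation on; pose and depth maps would be produced by their own specialists and layered the same way.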
-
Comparison of testing when using a single image vs. multiple ControlNet images:
Mistral was consistently more reliable than Vicuna at 7B parameters (q4 quantization). ControlNet images were created using the following site: https://huggingface.co/spaces/TencentARC/T2I-Adapter-SDXL
-
Jupyter Notebook with sample metadata for LLaVA image analysis: includes metadata extracted with three separate Python tools (Pillow, ExifRead, and hachoir).
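A minimal sketch of the kind of metadata extraction the notebook performs, using only Pillow. Tag IDs 271 and 272 are the standard EXIF Make and Model tags; the tag values and the in-memory round trip are invented for the example (the notebook reads real image files, and adds ExifRead and hachoir for formats Pillow does not cover).

```python
import io
from PIL import Image

# Build a small in-memory JPEG carrying a couple of EXIF tags.
exif = Image.Exif()
exif[271] = "ExampleMake"   # EXIF tag 271: camera Make
exif[272] = "ExampleModel"  # EXIF tag 272: camera Model

buf = io.BytesIO()
Image.new("RGB", (8, 8), "white").save(buf, format="JPEG", exif=exif)

# Read the metadata back, as the notebook does for real images.
buf.seek(0)
tags = Image.open(buf).getexif()
make, model = tags.get(271), tags.get(272)
```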
-
What stuck out in reviewing the examples in the playground was the importance of multimodal information in improving the accuracy of image analysis.
Along the same lines: is LLaVA able to analyze file metadata to describe the image? Two examples are DICOM and EXIF.
The attached file contains metadata for a Stable Diffusion image. It is not the most important use case, but if we can devise a feedback loop that measures the similarity of the image created to the description given, I think we can shorten training/testing time.
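Such a feedback loop might be sketched as follows. A real implementation would compare the generated image to the prompt with an embedding-based metric such as a CLIP score; this stand-in instead compares the prompt text to a model-produced caption using stdlib `difflib`, and the function name, threshold, and example strings are all assumptions.

```python
from difflib import SequenceMatcher

def description_similarity(prompt: str, caption: str) -> float:
    # Crude text-only proxy for image/description agreement; a real
    # pipeline would use an embedding-based score such as CLIP.
    return SequenceMatcher(None, prompt.lower(), caption.lower()).ratio()

score = description_similarity(
    "a gray wolf standing in a snowy forest",
    "a grey wolf stands in a snowy forest",
)
# Training/testing could stop early once the score clears a chosen bar.
good_enough = score > 0.8
```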
There are a number of potential benefits to storing CONCISE metadata along with the image.
wolf.txt
A use case I'm more interested in is life-sciences manufacturing, specifically in-line monitoring. Happy to switch the conversation to a different use case if it's more valuable.