Replies: 4 comments
-
My thinking is similar to Figure 2.5 of the arXiv Multimodal Foundation Models paper. Depending on the use case, it may also make sense to apply different strategies post-training when performing image analysis. Using text metadata is an obvious one, but in some cases combining Hugging Face Diffusers conditioning techniques such as OpenPose, Canny edge detection, and depth estimation individually may improve accuracy as well. For those aware of LangChain's LangGraph, this is a potential use case for a hierarchical team of agents.
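For illustration, the hierarchical-team idea can be sketched in plain Python, no LangGraph required: a supervisor fans an image-analysis task out to specialist functions and merges their findings before any conclusion is drawn. Every name here (`analyze_pose`, `analyze_edges`, `analyze_depth`, `supervisor`) is invented for the sketch; each specialist is a trivial stand-in for a real OpenPose, Canny, or depth model.

```python
def analyze_pose(image_desc: str) -> str:
    # Stand-in for an OpenPose-based specialist.
    return "bipedal stance" if "standing" in image_desc else "quadrupedal stance"

def analyze_edges(image_desc: str) -> str:
    # Stand-in for a Canny-edge specialist.
    return "sharp outline" if "clear" in image_desc else "soft outline"

def analyze_depth(image_desc: str) -> str:
    # Stand-in for a depth-estimation specialist.
    return "subject in foreground"

def supervisor(image_desc: str) -> dict:
    # The supervisor delegates to each specialist and merges their
    # findings, so no single cue dominates the final judgment.
    specialists = {
        "pose": analyze_pose,
        "edges": analyze_edges,
        "depth": analyze_depth,
    }
    return {name: fn(image_desc) for name, fn in specialists.items()}

report = supervisor("a clear photo of a creature standing upright")
```

In a real LangGraph hierarchical team, each specialist would be its own agent node and the supervisor a routing node, but the fan-out/merge shape is the same.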
-
Adding Conditional Control to Text-to-Image Diffusion Models (arXiv:2302.05543) discusses techniques to improve Stable Diffusion image generation based on analysis of various image characteristics (edge detection, pose, depth, etc.) to create a latent image baseline. Depending on the baseline and the desired target image, the appropriate strategy will vary.

My suggestion is that a layered approach to image analysis using similar techniques will improve the consistency and accuracy of predictions by segmenting the analysis among a team of specialists. This layered image analysis capability and strategy may already exist in LLaVA, and may only require building out further test scenarios and providing Colab-style training on how to leverage this great model effectively.

My interpretation of what was occurring when vision models incorrectly classified a werewolf as a wolf is that they were taking one or two cues (such as the fur and the wolf head) to draw their conclusion, rather than considering all aspects (such as pose, skeletal structure, and hands) before coming to a conclusion about the subject of the image.
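As a toy illustration of deriving one such conditioning signal, here is a crude NumPy stand-in for edge detection: mark pixels where the intensity gradient exceeds a threshold. A real ControlNet pipeline would use a proper Canny implementation (e.g. via OpenCV), so treat this only as a sketch of the idea of turning an image into a control baseline; the function name and threshold are assumptions.

```python
import numpy as np

def edge_map(gray: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Crude stand-in for Canny: flag pixels whose horizontal or
    # vertical intensity gradient exceeds the threshold.
    gx = np.abs(np.diff(gray, axis=1, prepend=gray[:, :1]))
    gy = np.abs(np.diff(gray, axis=0, prepend=gray[:1, :]))
    return ((gx + gy) > threshold).astype(np.uint8)

# Tiny synthetic image: left half dark, right half bright,
# so the only edge runs down column 4.
img = np.zeros((4, 8))
img[:, 4:] = 1.0
edges = edge_map(img)
```

The resulting binary map is the kind of "latent image baseline" the paper conditions generation on; pose and depth maps would be produced by their own specialists and layered the same way.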
-
Comparison of testing when using a single image vs. multiple ControlNet images:
Mistral was consistently more reliable than Vicuna at 7B parameters (q4 quantization). ControlNet images were created using the following site: https://huggingface.co/spaces/TencentARC/T2I-Adapter-SDXL
-
Jupyter Notebook with sample metadata for LLaVA image analysis: includes metadata extracted with three separate Python tools (Pillow, ExifRead, and hachoir).
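A minimal sketch of the kind of metadata extraction the notebook performs, using only Pillow. Tag IDs 271 and 272 are the standard EXIF Make and Model tags; the tag values and the in-memory round trip are invented for the example (the notebook reads real image files, and adds ExifRead and hachoir for formats Pillow does not cover).

```python
import io
from PIL import Image

# Build a small in-memory JPEG carrying a couple of EXIF tags.
exif = Image.Exif()
exif[271] = "ExampleMake"   # EXIF tag 271: camera Make
exif[272] = "ExampleModel"  # EXIF tag 272: camera Model

buf = io.BytesIO()
Image.new("RGB", (8, 8), "white").save(buf, format="JPEG", exif=exif)

# Read the metadata back, as the notebook does for real images.
buf.seek(0)
tags = Image.open(buf).getexif()
make, model = tags.get(271), tags.get(272)
```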
-
What stuck out in reviewing the examples in the playground was the importance of multimodal information in improving the accuracy of image analysis.
Along the same lines: is LLaVA able to analyze file metadata to describe the image? Two examples are DICOM and EXIF.
The attached file contains metadata for a Stable Diffusion image. It is not the most important use case, but if we can devise a feedback loop that measures the similarity of the image created to the description given, I think we can shorten training/testing time.
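Such a feedback loop might be sketched as follows. A real implementation would compare the generated image to the prompt with an embedding-based metric such as a CLIP score; this stand-in instead compares the prompt text to a model-produced caption using stdlib `difflib`, and the function name, threshold, and example strings are all assumptions.

```python
from difflib import SequenceMatcher

def description_similarity(prompt: str, caption: str) -> float:
    # Crude text-only proxy for image/description agreement; a real
    # pipeline would use an embedding-based score such as CLIP.
    return SequenceMatcher(None, prompt.lower(), caption.lower()).ratio()

score = description_similarity(
    "a gray wolf standing in a snowy forest",
    "a grey wolf stands in a snowy forest",
)
# Training/testing could stop early once the score clears a chosen bar.
good_enough = score > 0.8
```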
There are a number of potential benefits to storing CONCISE metadata along with the image.
wolf.txt
A use case I'm more interested in is life-sciences manufacturing, specifically in-line monitoring. Happy to switch the conversation to a different use case if it's more valuable.