{'image_info': [{'face_detections': None,
'image_name': 'b9040a0dbb22.jpg',
'matched_sim': 0.27694183588027954,
'matched_text_index': 2,
'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'},
{'face_detections': None,
'image_name': 'db1c21bc8474.jpg',
'matched_sim': 0.3234919607639313,
'matched_text_index': 1,
'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}],
'similarity_matrix': [[0.24363446235656738,
0.31758785247802734,
0.27694183588027954],
[0.2233106791973114,
0.3234919607639313,
0.26118797063827515]],
'text_list': ['When you lock the door using the lock tab on the driver’s '
'door, all of the other doors and tailgate lock at the same '
'time.',
'Press the master door lock switch in as shown to lock or '
'unlock all doors and the tailgate.',
'When you lock/unlock the driver’s door and tailgate using the '
'master lock switch, all the other doors lock/ unlock at the '
'same time.'],
'url': 'http://www.hfitinfo.com/hofi-48.html'}
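A minimal sketch of how a record like the one above could be turned back into an interleaved image/text sequence. The field names (`text_list`, `image_info`, `matched_text_index`, `image_name`) come from the sample; the helper name `interleave` and the choice to emit each image right before its matched sentence are assumptions for illustration only.

```python
def interleave(doc):
    """Return ("image", name) / ("text", sentence) items in reading order."""
    # Group image names by the index of the sentence they were matched to.
    images_at = {}
    for info in doc["image_info"]:
        images_at.setdefault(info["matched_text_index"], []).append(info["image_name"])

    sequence = []
    for idx, sentence in enumerate(doc["text_list"]):
        # Assumption: an image is placed immediately before its matched sentence.
        for name in images_at.get(idx, []):
            sequence.append(("image", name))
        sequence.append(("text", sentence))
    return sequence


# Trimmed version of the sample record (sentences shortened to placeholders).
sample = {
    "image_info": [
        {"image_name": "b9040a0dbb22.jpg", "matched_text_index": 2},
        {"image_name": "db1c21bc8474.jpg", "matched_text_index": 1},
    ],
    "text_list": ["sentence 0", "sentence 1", "sentence 2"],
}
sequence = interleave(sample)
```

With the sample record, the second image lands before sentence 1 and the first image before sentence 2.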
Abstract
Vision-language models such as Flamingo also take interleaved image/text sequences as input.
The paper releases Multimodal C4 (mmc4).
It uses a linear assignment algorithm to place images into longer bodies of text using CLIP features.
After filtering NSFW images, ads, etc., the mmc4 corpus consists of 101.2M documents with 571M images (43B English tokens).
Introduction
Prior experiments [2] suggest that performant multimodal in-context learning is dependent upon pretraining on similarly interleaved sequences of images and text (rather than single image/caption pairs). However, such a large-scale corpus has not been made publicly available.
mmc4 is constructed from public web pages contained in the cleaned English c4 corpus.
we place images into sequences of sentences by treating each document as an instance of a bipartite linear assignment problem, with images being assigned to sentences (under the constraint that each sentence is assigned at most one image). We show that applying CLIP ViT-L/14 [24] to estimate bipartite weights in a zero-shot fashion results in state-of-the-art performance on intra-document alignment benchmarks, and then apply this process to 100M+ documents to construct mmc4.
It appears to be built by inserting images into the existing c4 documents.
Related Dataset Work
LAION-2B, CC-12M, YFCC100M, ...
Data Curation Process
Initial data collection
Multimodal C4 is an expansion of the text-only c4 dataset
Gathering images
retrieve the original webpages for each document in the c4-en dataset from the Common Crawl version 2019-18, which is the default version for c4. Next, we extract the URLs for downloadable images from the raw WAT files
attempt to download from these URLs, and resize images to a maximum dimension of 800px
eliminate any c4 documents that do not contain valid, downloadable images at the time of collection (mid-to-late 2022)
The starting point after this step is 115M documents and 1.37B images.
At this point the number of images is still very large.
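A small sketch of the "maximum dimension of 800px" resize step, computing only the target size. The function name `resized_dimensions` and the decision not to upscale smaller images are assumptions; the notes do not say which image library or resampling method was used.

```python
def resized_dimensions(width, height, max_dim=800):
    """Scale (width, height) so the longer side is at most max_dim, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_dim:
        # Assumption: images already within the limit are left as they are.
        return width, height
    scale = max_dim / longest
    # Guard against a side rounding down to 0 for extremely thin images.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = resized_dimensions(1600, 1200)
small = resized_dimensions(640, 480)
```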
De-duplication+small resolution
discard images with a width or height smaller than 150px
Small images are dropped.
this accounts for many small icons, e.g., navigation buttons.
discard images with an aspect ratio of greater than 2 or less than 0.5
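The two resolution rules above can be sketched as a single predicate. The name `keep_image` is hypothetical, and aspect ratio is assumed to mean width divided by height (the bounds 0.5 and 2 are symmetric, so the orientation of the ratio does not change the result).

```python
def keep_image(width, height, min_side=150, min_ratio=0.5, max_ratio=2.0):
    """Return True if the image passes the size and aspect-ratio filters."""
    # Rule 1: discard images whose width or height is below 150px (icons, buttons).
    if width < min_side or height < min_side:
        return False
    # Rule 2: discard aspect ratios greater than 2 or less than 0.5.
    ratio = width / height
    return min_ratio <= ratio <= max_ratio


candidates = [(800, 600), (120, 600), (900, 300), (300, 150)]
kept = [size for size in candidates if keep_image(*size)]
```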
Aligning images and sentences
After collecting a set of images for each document, we now describe our intra-document alignment process to interleave the collected images with the sentences.
The alignment is done with a model rather than by using the DOM structure.
we did not rely on Document Object Model placements in the raw HTML to establish the alignment between images and sentences in each document. Instead, to associate each image with a sentence, we consider each document as an instance of a bipartite assignment problem [19, 16], and use CLIP ViT-L/14 to compute pairwise similarities between all sentences/images on a single page. Then, we discard images without at least a 0.15 CLIP cosine similarity to at least one sentence in the document.
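A sketch of the 0.15 similarity cutoff, assuming the similarities are already computed and laid out like `similarity_matrix` in the sample record (one row per image, one column per sentence). The function name is hypothetical and the second row of the toy matrix is made up for illustration.

```python
def filter_images_by_similarity(similarity, threshold=0.15):
    """Return indices of images whose best sentence similarity reaches the threshold.

    similarity[i][j] is the CLIP cosine similarity between image i and sentence j.
    """
    return [i for i, row in enumerate(similarity) if row and max(row) >= threshold]


toy_similarity = [
    [0.24, 0.32, 0.28],  # image 0: best match 0.32, kept
    [0.05, 0.10, 0.14],  # image 1: never reaches 0.15, discarded
]
kept_indices = filter_images_by_similarity(toy_similarity)
```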
Finally, we use [18] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image
Images and sentences are assigned through one-to-one matching.
Example #2 shows the result of this mapping.
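To make the assignment step concrete, here is a brute-force sketch that tries every way of giving each image a distinct sentence and keeps the one with the highest total similarity. This is a stand-in, not the solver cited as [18]: it is exact but only practical for tiny documents, and a real pipeline would use a proper linear assignment solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True`. The function name `assign_images` is hypothetical; the matrix is the `similarity_matrix` from the sample record.

```python
from itertools import permutations


def assign_images(similarity):
    """Assign each image (row) to a distinct sentence (column), maximizing total similarity.

    Assumes there are at least as many sentences as images.
    Returns a list where entry i is the sentence index chosen for image i.
    """
    n_images = len(similarity)
    n_sentences = len(similarity[0]) if similarity else 0
    best_score, best = float("-inf"), ()
    # Each permutation picks n_images distinct sentence indices, one per image.
    for cols in permutations(range(n_sentences), n_images):
        score = sum(similarity[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best = score, cols
    return list(best)


similarity_matrix = [
    [0.24363446235656738, 0.31758785247802734, 0.27694183588027954],
    [0.2233106791973114, 0.3234919607639313, 0.26118797063827515],
]
matched = assign_images(similarity_matrix)  # [2, 1]
```

The result agrees with the `matched_text_index` values in the sample (2 for the first image, 1 for the second). Note that both images score highest against sentence 1, so the one-image-per-sentence constraint is what pushes the first image to sentence 2.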