Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text #36

{'image_info': [{'face_detections': None,
                 'image_name': 'b9040a0dbb22.jpg',
                 'matched_sim': 0.27694183588027954,
                 'matched_text_index': 2,
                 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'},
                {'face_detections': None,
                 'image_name': 'db1c21bc8474.jpg',
                 'matched_sim': 0.3234919607639313,
                 'matched_text_index': 1,
                 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}],
 'similarity_matrix': [[0.24363446235656738,
                        0.31758785247802734,
                        0.27694183588027954],
                       [0.2233106791973114,
                        0.3234919607639313,
                        0.26118797063827515]],
 'text_list': ['When you lock the door using the lock tab on the driver’s '
               'door, all of the other doors and tailgate lock at the same '
               'time.',
               'Press the master door lock switch in as shown to lock or '
               'unlock all doors and the tailgate.',
               'When you lock/unlock the driver’s door and tailgate using the '
               'master lock switch, all the other doors lock/ unlock at the '
               'same time.'],
 'url': 'http://www.hfitinfo.com/hofi-48.html'}
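
The record above is a sample mmc4 document: text_list holds the document's sentences, similarity_matrix holds CLIP image-to-sentence cosine similarities (rows = images, columns = sentences), and each entry in image_info records the sentence the image was assigned to (matched_text_index) plus the corresponding similarity (matched_sim). The snippet below is a quick sanity check, not from the paper, that these fields follow from a maximum-weight bipartite assignment on the similarity matrix; note the first image's single best sentence is index 1, but sentence 1 is taken by the second image, so the assignment gives the first image sentence 2 instead.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Similarity matrix copied from the record above (rows = images, columns = sentences).
sim = np.array([[0.24363446, 0.31758785, 0.27694184],
                [0.22331068, 0.32349196, 0.26118797]])

# linear_sum_assignment minimizes cost, so negate to maximize total similarity.
rows, cols = linear_sum_assignment(-sim)

print(cols)             # [2 1] -> matches matched_text_index of 2 and 1
print(sim[rows, cols])  # [0.27694184 0.32349196] -> matches matched_sim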

Abstract

  • VL models such as Flamingo also take interleaved image-text sequences as input
  • release Multimodal C4 (mmc4)
  • use a linear assignment algorithm to place images into longer bodies of text using CLIP features
  • After filtering NSFW images, ads, etc., the mmc4 corpus consists of 101.2M documents with 571M images (43B English tokens)


Introduction

  • Prior experiments [2] suggest that performant multimodal in-context learning is dependent upon pretraining on similarly interleaved sequences of images and text (rather than single image/caption pairs). However, such a large-scale corpus has not been made publicly available.
  • mmc4 is constructed from public web pages contained in the cleaned English c4 corpus.
  • we place images into sequences of sentences by treating each document as an instance of a bipartite linear assignment problem, with images being assigned to sentences (under the constraint that each sentence is assigned at most one image). We show that applying CLIP ViT-L/14 [24] to estimate bipartite weights in a zero-shot fashion results in state-of-the-art performance on intra-document alignment benchmarks, and then apply this process to 100M+ documents to construct mmc4.
    • Seems to be constructed by inserting images into the existing c4 documents

Related Dataset Work

  • LAION-2B, CC-12M, YFCC100M, ...

Data Curation Process

  • Initial data collection

    • Multimodal C4 is an expansion of the text-only c4 dataset
  • Gathering images

    • retrieve the original webpages for each document in the c4-en dataset from the Common Crawl version 2019-18, which is the default version for c4. Next, we extract the URLs for downloadable images from the raw WAT files
    • attempt to download from these URLs, and resize images to a maximum dimension of 800px (a minimal download/resize sketch follows below)
    • eliminate any c4 documents that do not contain valid, downloadable images at the time of collection (mid-to-late 2022)
    • The starting point after this step is 115M documents and 1.37B images.
      • There are still a very large number of images at this stage
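
A minimal sketch of the download-and-resize step described above. The 800px maximum dimension comes from the paper; the use of requests and Pillow, and the function name fetch_and_resize, are assumptions for illustration.

import io
from typing import Optional

import requests
from PIL import Image

def fetch_and_resize(url: str, max_dim: int = 800) -> Optional[Image.Image]:
    """Download an image and shrink it so its longest side is at most max_dim pixels."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")
    except Exception:
        return None  # documents without valid, downloadable images are dropped
    img.thumbnail((max_dim, max_dim))  # in-place, preserves aspect ratio, only shrinks
    return img
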
  • De-duplication + small resolution

    • discard images with a width or height smaller than 150px
      • Small images are discarded
        • this accounts for many small icons, e.g., navigation buttons.
    • discard images with an aspect ratio of greater than 2 or less than 0.5
      • this accounts for many banner-like ads.
        • Images with skewed aspect ratios are discarded (a minimal filter sketch follows below)
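
A minimal sketch of the resolution and aspect-ratio filter above. The thresholds (150px minimum side, aspect ratio between 0.5 and 2) are from the paper; the function itself is illustrative.

from PIL import Image

def keep_image(img: Image.Image) -> bool:
    """Heuristic filter for icons and banner-like ads."""
    w, h = img.size
    if w < 150 or h < 150:           # drops small icons, e.g., navigation buttons
        return False
    aspect = w / h
    if aspect > 2 or aspect < 0.5:   # drops banner-like ads
        return False
    return True
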
  • Discarding NSFW images

    • employ strict NSFW image filtering, using DataComp's [14] dataset2metadata NSFW binary image classifier.
  • Aligning images and sentences

    • After collecting a set of images for each document, we now describe our intra-document alignment process to interleave the collected images with the sentences.
    • The alignment is computed with a model rather than from the DOM structure
      • we did not rely on Document Object Model placements in the raw HTML to establish the alignment between images and sentences in each document. Instead, to associate each image with a sentence, we consider each document as an instance of a bipartite assignment problem [19, 16], and use CLIP ViT-L/14 to compute pairwise similarities between all sentences/images on a single page. Then, we discard images without at least a 0.15 CLIP cosine similarity to at least one sentence in the document.
      • Finally, we use [18] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image.
      • Images are assigned to sentences as a 1:1 matching (a sketch of this alignment step follows below)
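
A minimal sketch of this alignment step, assuming the Hugging Face transformers CLIP ViT-L/14 checkpoint and SciPy's linear_sum_assignment for the bipartite matching. The 0.15 cosine-similarity threshold and the at-most-one-image-per-sentence constraint are from the paper; function names, batching, and the exact CLIP wrapper are assumptions.

import torch
from scipy.optimize import linear_sum_assignment
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def align_document(images, sentences, threshold=0.15):
    """Return {image_index: sentence_index} for one document."""
    with torch.no_grad():
        inputs = processor(text=sentences, images=images,
                           return_tensors="pt", padding=True, truncation=True)
        out = model(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img_emb @ txt_emb.T).numpy()  # images x sentences cosine similarities

    # Discard images whose best sentence similarity is below the threshold.
    keep = sim.max(axis=1) >= threshold
    kept = keep.nonzero()[0]
    if len(kept) == 0:
        return {}

    # Maximize total similarity; each sentence receives at most one image.
    rows, cols = linear_sum_assignment(-sim[kept])
    return {int(kept[r]): int(c) for r, c in zip(rows, cols)}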


  • Example #2 shows the resulting image-to-sentence mapping