Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text #36

{'image_info': [{'face_detections': None,
                 'image_name': 'b9040a0dbb22.jpg',
                 'matched_sim': 0.27694183588027954,
                 'matched_text_index': 2,
                 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.90.jpg'},
                {'face_detections': None,
                 'image_name': 'db1c21bc8474.jpg',
                 'matched_sim': 0.3234919607639313,
                 'matched_text_index': 1,
                 'raw_url': 'http://www.hfitinfo.com/honda_fit_pics/3/2/index.91.jpg'}],
 'similarity_matrix': [[0.24363446235656738,
                        0.31758785247802734,
                        0.27694183588027954],
                       [0.2233106791973114,
                        0.3234919607639313,
                        0.26118797063827515]],
 'text_list': ['When you lock the door using the lock tab on the driver’s '
               'door, all of the other doors and tailgate lock at the same '
               'time.',
               'Press the master door lock switch in as shown to lock or '
               'unlock all doors and the tailgate.',
               'When you lock/unlock the driver’s door and tailgate using the '
               'master lock switch, all the other doors lock/ unlock at the '
               'same time.'],
 'url': 'http://www.hfitinfo.com/hofi-48.html'}
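
The record above is a sample mmc4 document: text_list holds the document's sentences, similarity_matrix holds CLIP image-to-sentence cosine similarities (rows = images, columns = sentences), and each entry in image_info records the sentence the image was assigned to (matched_text_index) plus the corresponding similarity (matched_sim). The snippet below is a quick sanity check, not from the paper, that these fields follow from a maximum-weight bipartite assignment on the similarity matrix; note the first image's single best sentence is index 1, but sentence 1 is taken by the second image, so the assignment gives the first image sentence 2 instead.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Similarity matrix copied from the record above (rows = images, columns = sentences).
sim = np.array([[0.24363446, 0.31758785, 0.27694184],
                [0.22331068, 0.32349196, 0.26118797]])

# linear_sum_assignment minimizes cost, so negate to maximize total similarity.
rows, cols = linear_sum_assignment(-sim)

print(cols)             # [2 1] -> matches matched_text_index of 2 and 1
print(sim[rows, cols])  # [0.27694184 0.32349196] -> matches matched_sim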

Abstract

  • VL models such as Flamingo also take interleaved image-text sequences as input
  • release Multimodal C4 (mmc4)
  • use a linear assignment algorithm to place images into longer bodies of text using CLIP features
  • After filtering NSFW images, ads, etc., the mmc4 corpus consists of 101.2M documents with 571M images (43B English tokens)


Introduction

  • Prior experiments [2] suggest that performant multimodal in-context learning is dependent upon pretraining on similarly interleaved sequences of images and text (rather than single image/caption pairs). However, such a large-scale corpus has not been made publicly available.
  • mmc4 is constructed from public web pages contained in the cleaned English c4 corpus.
  • we place images into sequences of sentences by treating each document as an instance of a bipartite linear assignment problem, with images being assigned to sentences (under the constraint that each sentence is assigned at most one image). We show that applying CLIP ViT-L/14 [24] to estimate bipartite weights in a zero-shot fashion results in state-of-the-art performance on intra-document alignment benchmarks, and then apply this process to 100M+ documents to construct mmc4.
    • Seems to be constructed by inserting images into the existing c4 documents

Related Dataset Work

  • LAION-2B, CC-12M, YFCC100M, ...

Data Curation Process

  • Initial data collection

    • Multimodal C4 is an expansion of the text-only c4 dataset
  • Gathering images

    • retrieve the original webpages for each document in the c4-en dataset from the Common Crawl version 2019-18, which is the default version for c4. Next, we extract the URLs for downloadable images from the raw WAT files
    • attempt to download from these URLs, and resize images to a maximum dimension of 800px (a minimal download/resize sketch follows below)
    • eliminate any c4 documents that do not contain valid, downloadable images at the time of collection (mid-to-late 2022)
    • The starting point after this step is 115M documents and 1.37B images.
      • There are still a very large number of images at this stage
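
A minimal sketch of the download-and-resize step described above. The 800px maximum dimension comes from the paper; the use of requests and Pillow, and the function name fetch_and_resize, are assumptions for illustration.

import io
from typing import Optional

import requests
from PIL import Image

def fetch_and_resize(url: str, max_dim: int = 800) -> Optional[Image.Image]:
    """Download an image and shrink it so its longest side is at most max_dim pixels."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        img = Image.open(io.BytesIO(resp.content)).convert("RGB")
    except Exception:
        return None  # documents without valid, downloadable images are dropped
    img.thumbnail((max_dim, max_dim))  # in-place, preserves aspect ratio, only shrinks
    return img
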
  • De-duplication + small resolution

    • discard images with a width or height smaller than 150px
      • Small images are discarded
        • this accounts for many small icons, e.g., navigation buttons.
    • discard images with an aspect ratio of greater than 2 or less than 0.5
      • this accounts for many banner-like ads.
        • Images with skewed aspect ratios are discarded (a minimal filter sketch follows below)
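
A minimal sketch of the resolution and aspect-ratio filter above. The thresholds (150px minimum side, aspect ratio between 0.5 and 2) are from the paper; the function itself is illustrative.

from PIL import Image

def keep_image(img: Image.Image) -> bool:
    """Heuristic filter for icons and banner-like ads."""
    w, h = img.size
    if w < 150 or h < 150:           # drops small icons, e.g., navigation buttons
        return False
    aspect = w / h
    if aspect > 2 or aspect < 0.5:   # drops banner-like ads
        return False
    return True
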
  • Discarding NSFW images

    • employ strict NSFW image filtering, using DataComp's [14] dataset2metadata NSFW binary image classifier.
  • Aligning images and sentences

    • After collecting a set of images for each document, we now describe our intra-document alignment process to interleave the collected images with the sentences.
    • The alignment is computed with a model rather than from the DOM structure
      • we did not rely on Document Object Model placements in the raw HTML to establish the alignment between images and sentences in each document. Instead, to associate each image with a sentence, we consider each document as an instance of a bipartite assignment problem [19, 16], and use CLIP ViT-L/14 to compute pairwise similarities between all sentences/images on a single page. Then, we discard images without at least a 0.15 CLIP cosine similarity to at least one sentence in the document.
      • Finally, we use [18] to compute a bipartite assignment of images to sentences, under the constraint that each sentence can only be assigned a single image.
      • Images are assigned to sentences as a 1:1 matching (a sketch of this alignment step follows below)
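
A minimal sketch of this alignment step, assuming the Hugging Face transformers CLIP ViT-L/14 checkpoint and SciPy's linear_sum_assignment for the bipartite matching. The 0.15 cosine-similarity threshold and the at-most-one-image-per-sentence constraint are from the paper; function names, batching, and the exact CLIP wrapper are assumptions.

import torch
from scipy.optimize import linear_sum_assignment
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def align_document(images, sentences, threshold=0.15):
    """Return {image_index: sentence_index} for one document."""
    with torch.no_grad():
        inputs = processor(text=sentences, images=images,
                           return_tensors="pt", padding=True, truncation=True)
        out = model(**inputs)
        img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    sim = (img_emb @ txt_emb.T).numpy()  # images x sentences cosine similarities

    # Discard images whose best sentence similarity is below the threshold.
    keep = sim.max(axis=1) >= threshold
    kept = keep.nonzero()[0]
    if len(kept) == 0:
        return {}

    # Maximize total similarity; each sentence receives at most one image.
    rows, cols = linear_sum_assignment(-sim[kept])
    return {int(kept[r]): int(c) for r, c in zip(rows, cols)}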


  • Example #2 shows the resulting image-to-sentence mapping