How to create MSCOCO dataset #19

Open
Frankey419 opened this issue Sep 8, 2019 · 1 comment

Comments

@Frankey419

Hi,
I downloaded the MSCOCO captions_train2014.json and captions_val2014.json files. As described in the paper, there are 82,783 train samples and 40,504 val samples, and every sample contains 5 captions. If I omit one caption and combine the other four into two paraphrase pairs, I get about 2*(82,783 + 40,504)=246,574 pairs. How can I get the 320k paraphrase pairs?

@jackyuanjie1990

The author replied to me, explaining how to create the dataset as follows:
Each image has multiple captions. Say a, b and c are paraphrases of each other; to turn them into pairs, you can do the following pairing:
a -> b
b -> a
a -> c
c -> a
b -> c
c -> b.

This yields many more data points than the total number of image-caption pairs. However, make sure that all the captions belonging to a single image stay together, either all in train or all in val.
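
For anyone who wants to try this, here is a minimal sketch of the pairing described above. It assumes the standard COCO captions layout, where the JSON file has an "annotations" list whose entries carry "image_id" and "caption". The function names (group_captions, paraphrase_pairs, load_pairs) are my own and not from the author's code, and this is not necessarily how the paper's 320k figure was produced.

```python
import json
from collections import defaultdict
from itertools import permutations


def group_captions(annotations):
    """Group caption strings by the image they describe."""
    by_image = defaultdict(list)
    for ann in annotations:
        by_image[ann["image_id"]].append(ann["caption"].strip())
    return by_image


def paraphrase_pairs(annotations):
    """Return every ordered (source, target) pair of captions sharing an image."""
    pairs = []
    for captions in group_captions(annotations).values():
        # permutations(..., 2) gives a->b and b->a, a->c and c->a, and so on.
        pairs.extend(permutations(captions, 2))
    return pairs


def load_pairs(path):
    """Read one COCO captions file and build its pairs (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return paraphrase_pairs(json.load(f)["annotations"])


# Tiny in-memory example: one image with captions a, b, c and one with d, e.
toy = [
    {"image_id": 1, "caption": "a"},
    {"image_id": 1, "caption": "b"},
    {"image_id": 1, "caption": "c"},
    {"image_id": 2, "caption": "d"},
    {"image_id": 2, "caption": "e"},
]
toy_pairs = paraphrase_pairs(toy)
```

Calling load_pairs separately on captions_train2014.json and captions_val2014.json should keep all captions of an image inside one split, since pairs are only ever formed within a single file. Note that an image with n captions produces n*(n-1) ordered pairs, so using all five captions gives far more pairs than the two-per-image scheme in the question; you may need to subsample to match a target size.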
