Addition of popular benchmark datasets #722

ir2718 · 2024-10-17T17:01:12Z

Hi,

I find that it's nice to have a few benchmark datasets integrated into libraries for easier research. My feature request boils down to implementing a few image retrieval datasets, namely CUB, Cars196, Stanford Online Products, and iNaturalist. Most image retrieval papers use these datasets to benchmark new methods and models. @KevinMusgrave, if you agree with this request, I can create a PR.

Additionally, some kind of integration with Hugging Face datasets might be nice for text retrieval and text similarity, but I'm not sure how useful it would be, since sentence-transformers is probably the most commonly used library for such tasks. It would also introduce an external dependency, so I'd like to hear your opinion on this.

KevinMusgrave · 2024-10-22T12:43:02Z

Thanks for the suggestions!

> I find that it's nice to have a few benchmark datasets integrated into libraries for easier research. My feature request boils down to implementing a few image retrieval datasets, namely CUB, Cars196, Stanford Online Products, and iNaturalist. Most image retrieval papers use these datasets to benchmark new methods and models. @KevinMusgrave, if you agree with this request, I can create a PR.

Would the dataset classes download the datasets? Are those datasets readily available for download these days?

> Additionally, some kind of integration with Hugging Face datasets might be nice for text retrieval and text similarity, but I'm not sure how useful it would be, since sentence-transformers is probably the most commonly used library for such tasks. It would also introduce an external dependency, so I'd like to hear your opinion on this.

Could you give an example of how this might work?

ir2718 · 2024-10-22T14:06:23Z

> Would the dataset classes download the datasets? Are those datasets readily available for download these days?

Ideally, yes. I would like the classes to mimic PyTorch, since that interface is familiar. This would mean you can specify the root, split, and download arguments (and possibly others I may have missed). I've already implemented Cars196 and CUB on my fork, so you can have a look at what I had in mind: https://github.com/ir2718/pytorch-metric-learning/tree/dataset. If you think this is a step in the right direction, do say so.
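
To make the root/split/download idea concrete, here is a minimal sketch of what such an interface could look like. It is not the code on the fork: `BaseBenchmarkDataset`, `ToyDataset`, and the index-file layout are hypothetical names chosen for illustration, and the toy "download" only writes small local files instead of fetching a real archive. The class is kept in plain Python (it only implements `__len__` and `__getitem__`); a real version would presumably subclass `torch.utils.data.Dataset`.

```python
import os
import tempfile


class BaseBenchmarkDataset:
    """Hypothetical base class (illustrative; not the code on the fork)."""

    SPLITS = ("train", "test")

    def __init__(self, root, split="train", transform=None, download=False):
        if split not in self.SPLITS:
            raise ValueError(f"split must be one of {self.SPLITS}, got {split!r}")
        self.root, self.split, self.transform = root, split, transform
        # Only fetch when asked to, and only if the data is not already there.
        if download and not self._is_downloaded():
            self._download()
        if not self._is_downloaded():
            raise RuntimeError("Dataset not found. Pass download=True to fetch it.")
        self.samples = self._load_split()  # list of (item, label) pairs

    def _index_path(self):
        return os.path.join(self.root, f"{self.split}.txt")

    def _is_downloaded(self):
        return os.path.exists(self._index_path())

    def _download(self):
        raise NotImplementedError

    def _load_split(self):
        raise NotImplementedError

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item, label = self.samples[idx]
        if self.transform is not None:
            item = self.transform(item)
        return item, label


class ToyDataset(BaseBenchmarkDataset):
    """Stand-in for a real dataset: 'downloads' by writing tiny index files."""

    def _download(self):
        os.makedirs(self.root, exist_ok=True)
        splits = {"train": ["a.jpg 0", "b.jpg 1"], "test": ["c.jpg 2"]}
        for split, rows in splits.items():
            with open(os.path.join(self.root, f"{split}.txt"), "w") as f:
                f.write("\n".join(rows))

    def _load_split(self):
        with open(self._index_path()) as f:
            pairs = (line.split() for line in f if line.strip())
            return [(path, int(label)) for path, label in pairs]
```

With this shape, a concrete dataset only has to supply `_download` and `_load_split`, while argument validation and the download check live in one place.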

> Could you give an example of how this might work?

I haven't given it much thought, but as an example: maybe a function that builds a PyTorch dataset from a given Hugging Face dataset name, input column, and output column.
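
A rough sketch of such a function follows, under stated assumptions. `ColumnPairDataset`, `dataset_from_hub`, and the `load_fn` parameter are hypothetical names, not an existing API of this library. The only external call is `datasets.load_dataset(name, split=...)`, imported lazily so that the dependency is needed only when the function is actually used. The wrapper is written in plain Python; a real version would presumably subclass `torch.utils.data.Dataset`.

```python
class ColumnPairDataset:
    """Map-style dataset that returns (input, label) pairs from row dicts."""

    def __init__(self, rows, input_column, label_column):
        self.rows = rows
        self.input_column = input_column
        self.label_column = label_column

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]  # each row is expected to behave like a dict
        return row[self.input_column], row[self.label_column]


def dataset_from_hub(name, input_column, label_column, split="train", load_fn=None):
    """Build a ColumnPairDataset from a Hugging Face dataset name.

    load_fn is a hypothetical hook for swapping in a different loader
    (for example in tests); by default datasets.load_dataset is used.
    """
    if load_fn is None:
        # Lazy import: the external dependency is optional until this point.
        from datasets import load_dataset
        load_fn = load_dataset
    rows = load_fn(name, split=split)
    return ColumnPairDataset(rows, input_column, label_column)
```

Keeping the import inside the function is one way to address the external-dependency concern: users who never call it would not need the `datasets` package installed.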
