Addition of popular benchmark datasets #722

Open
ir2718 opened this issue Oct 17, 2024 · 2 comments

Comments

@ir2718

ir2718 commented Oct 17, 2024

Hi,

I find that it's nice to have a few benchmark datasets integrated into libraries for easier research. My feature request boils down to the implementation of a few image retrieval datasets, namely: CUB, Cars196, Stanford Online Products, and iNaturalist. In most image retrieval papers, these datasets are used for benchmarking new methods and models. @KevinMusgrave, if you agree with this request, I can create a PR.

Additionally, some kind of integration with Hugging Face datasets might be nice for text retrieval/text similarity, but I'm not sure if this is of any use since sentence-transformers is probably the most often used library for such things. It also introduces an external dependency, so I'd like to hear your opinion on this.

@KevinMusgrave
Owner

Thanks for the suggestions!

> I find that it's nice to have a few benchmark datasets integrated into libraries for easier research. My feature request boils down to the implementation of a few image retrieval datasets, namely: CUB, Cars196, Stanford Online Products, and iNaturalist. In most image retrieval papers, these datasets are used for benchmarking new methods and models. @KevinMusgrave, if you agree with this request, I can create a PR.

Would the dataset classes download the datasets? Are those datasets readily available for download these days?

> Additionally, some kind of integration with Hugging Face datasets might be nice for text retrieval/text similarity, but I'm not sure if this is of any use since sentence-transformers is probably the most often used library for such things. It also introduces an external dependency, so I'd like to hear your opinion on this.

Could you give an example of how this might work?

@ir2718
Author

ir2718 commented Oct 22, 2024

> Would the dataset classes download the datasets? Are those datasets readily available for download these days?

Ideally, yes. I would like it to mimic PyTorch for the sake of familiarity, which means you can specify the root, split, and download arguments (and possibly others, in case I missed something). I've already implemented Cars196 and CUB on my fork, so you can have a look at what I had in mind: https://github.com/ir2718/pytorch-metric-learning/tree/dataset. If you think this is a step in the right direction, do say so.
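
To make the root/split/download idea concrete, here is a minimal sketch of what such an interface could look like. It is not the code from the fork and not part of pytorch-metric-learning: the names BenchmarkDataset, ToyDataset, is_downloaded, and load_samples are hypothetical, the "download" only writes a small local index file, and the class-disjoint train/test split is an assumption for illustration. The classes are plain Python with `__len__` and `__getitem__`; a real implementation would presumably subclass `torch.utils.data.Dataset`.

```python
import os


class BenchmarkDataset:
    """Hypothetical base class; not part of pytorch-metric-learning."""

    def __init__(self, root, split="train", transform=None, download=False):
        if split not in ("train", "test"):
            raise ValueError(f"unknown split: {split!r}")
        self.root = root
        self.split = split
        self.transform = transform
        # Only fetch when asked to, and only if the data is not already there.
        if download and not self.is_downloaded():
            self.download()
        if not self.is_downloaded():
            raise RuntimeError("Dataset not found; pass download=True to fetch it.")
        self.samples = self.load_samples()  # list of (item, label) pairs

    def is_downloaded(self):
        return os.path.isdir(self.root)

    def download(self):
        raise NotImplementedError

    def load_samples(self):
        raise NotImplementedError

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        item, label = self.samples[idx]
        if self.transform is not None:
            item = self.transform(item)
        return item, label


class ToyDataset(BenchmarkDataset):
    """Stand-in subclass: 'downloads' by writing a tiny index file locally."""

    def download(self):
        os.makedirs(self.root, exist_ok=True)
        with open(os.path.join(self.root, "index.txt"), "w") as f:
            for label in range(4):
                f.write(f"img_{label}.jpg {label}\n")

    def load_samples(self):
        with open(os.path.join(self.root, "index.txt")) as f:
            rows = [line.split() for line in f]
        # Class-disjoint split (assumed): labels 0-1 for train, 2-3 for test.
        keep = (lambda y: y < 2) if self.split == "train" else (lambda y: y >= 2)
        return [(path, int(y)) for path, y in rows if keep(int(y))]
```

With this shape, `ToyDataset(root, split="train", download=True)` fetches the data on first use, and later calls with the same root reuse what is already on disk. A real dataset class would replace the toy `download` and `load_samples` with the actual archive fetch and image loading.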

> Could you give an example of how this might work?

I haven't given it much thought, but for the sake of example, maybe a function that generates a PyTorch dataset from a given Hugging Face dataset name, input column, and output column.
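
A rough sketch of such a function, under a few assumptions: the names ColumnPairDataset and dataset_from_hub and the loader parameter are hypothetical, and the wrapper is a plain class with `__len__` and `__getitem__` rather than a `torch.utils.data.Dataset` subclass. The only Hugging Face call used is `datasets.load_dataset(name, split=...)`, imported lazily so the library stays an optional dependency.

```python
class ColumnPairDataset:
    """Map-style view over two columns of an indexable collection of dict rows."""

    def __init__(self, rows, input_column, output_column):
        self.rows = rows
        self.input_column = input_column
        self.output_column = output_column

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        return row[self.input_column], row[self.output_column]


def dataset_from_hub(name, input_column, output_column, split="train", loader=None):
    """Build a (input, output) dataset from a Hugging Face dataset name.

    `loader` is a hypothetical hook that makes the function testable without
    network access; by default it falls back to `datasets.load_dataset`.
    """
    if loader is None:
        # Imported here so `datasets` is only required when actually used.
        from datasets import load_dataset
        loader = load_dataset
    rows = loader(name, split=split)
    return ColumnPairDataset(rows, input_column, output_column)
```

Passing a custom `loader` that returns a list of dicts is enough to try this out offline. Whether the extra dependency is worth it is still the open question from the first comment.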
