[Error] Duplicated dataset with missing download link #140

Open
AMR-KELEG opened this issue Sep 14, 2023 · 3 comments
Labels
error Report an error in a dataset entry

Comments

@AMR-KELEG

Describe the dataset error

Hi,

I was browsing the datasets on the great Masader site and found two entries that are exact duplicates of each other. Unfortunately, the download link on the provided site is also unavailable. I am mainly interested in discussing ideas for automatically detecting duplicated entries on Masader. Thanks for taking the time to read my suggestion and review this issue!

Additional context

@AMR-KELEG AMR-KELEG added the error Report an error in a dataset entry label Sep 14, 2023
@zaidalyafeai
Contributor

Thank you @AMR-KELEG for the report. I removed the duplicate, and the site should be updated soon. In the past, we ran a deduplication pass using embeddings, which fixed many of the duplicates. Let me know if you have other ideas. All the metadata is accessible on HuggingFace: https://huggingface.co/datasets/arbml/masader
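For readers following the thread, here is a minimal sketch of how near-duplicate entries could be flagged automatically. It is not the embedding-based method mentioned above: it swaps in plain character-level string similarity from the Python standard library so that it runs without any model. The `"name"` key, the `find_duplicates` helper, the 0.9 threshold, and the sample entries are all assumptions made up for illustration, not the actual Masader schema or data.

```python
from difflib import SequenceMatcher
from itertools import combinations
import re


def normalize(name):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^\w]+", " ", name.lower()).strip()


def find_duplicates(entries, threshold=0.9):
    """Return (name_a, name_b, score) for every pair of similar-looking names.

    `entries` is a list of dicts with a hypothetical "name" key.
    The pairwise comparison is quadratic, which is fine for a few
    hundred entries but would need blocking for much larger catalogues.
    """
    pairs = []
    for a, b in combinations(entries, 2):
        score = SequenceMatcher(
            None, normalize(a["name"]), normalize(b["name"])
        ).ratio()
        if score >= threshold:
            pairs.append((a["name"], b["name"], round(score, 2)))
    return pairs


# Made-up sample entries, not real Masader records.
entries = [
    {"name": "Arabic Tweets Corpus"},
    {"name": "arabic-tweets corpus"},
    {"name": "Quran Speech Dataset"},
]
duplicates = find_duplicates(entries)
```

Flagged pairs would still need a manual look before anything is removed, since two distinct datasets can share very similar names.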

@AMR-KELEG
Author

Thanks @zaidalyafeai
The method you currently use sounds reasonable, and I do not think I have a better idea.
On another note, do you think we could have a way to report datasets that are no longer accessible?

@zaidalyafeai
Contributor

The status of datasets changes a lot, so it is difficult to keep track. We have a report feature that can be used for this.
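As a complement to manual reports, a periodic link check could surface entries whose download link has gone dead. The sketch below is only an illustration using the Python standard library. The `is_reachable` helper and its `opener` parameter are hypothetical names introduced here, and nothing in the thread says Masader runs such a check.

```python
import urllib.request


def is_reachable(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if a HEAD request to `url` gets a non-error response.

    `opener` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    try:
        # Building the request inside the try also catches malformed URLs.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.status < 400
    except (OSError, ValueError):
        # URLError, HTTPError, and timeouts are all OSError subclasses;
        # ValueError covers strings that are not valid URLs.
        return False
```

Some servers reject `HEAD` requests or block automated clients, so a `False` result is better treated as a prompt for manual review than as proof that a dataset is gone.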


2 participants