I was checking datasets on the great Masader site and found that two datasets are exact duplicates. Unfortunately, the download link on the provided site is also unavailable. I am mainly interested in discussing ideas for automatically detecting duplicate entries on Masader. Thanks for taking the time to read my suggestion and review this issue!
Thank you @AMR-KELEG for the report. I removed the duplicate, and the site should be updated soon. In the past, we ran an embedding-based deduplication pass, which fixed a lot of the duplicates. Let me know if you have other ideas. All the metadata is accessible on HuggingFace: https://huggingface.co/datasets/arbml/masader.
Thanks @zaidalyafeai
The current method you use sounds reasonable, and I do not think I have a better idea.
On another note, do you think we could have a way to report datasets that are no longer accessible?
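For reference, here is a minimal sketch of one way to flag candidate duplicates for manual review. It is not the embedding-based method mentioned above; it swaps in plain string similarity from the Python standard library so it runs without extra dependencies. The `Name` and `Link` field names, the `find_duplicates` helper, the sample entries, and the 0.9 threshold are all assumptions for illustration, not taken from the actual Masader metadata.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def find_duplicates(entries, threshold=0.9):
    """Return (i, j, score) for pairs with near-identical names or identical links.

    `entries` is a list of dicts with hypothetical "Name" and "Link" keys.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(entries), 2):
        # An identical, non-empty link is treated as a strong duplicate signal.
        same_link = bool(a.get("Link")) and a.get("Link") == b.get("Link")
        score = SequenceMatcher(
            None, normalize(a["Name"]), normalize(b["Name"])
        ).ratio()
        if same_link or score >= threshold:
            pairs.append((i, j, round(score, 2)))
    return pairs


# Made-up entries, only to show the call shape.
entries = [
    {"Name": "Arabic Tweets Corpus", "Link": "https://example.org/a"},
    {"Name": "arabic-tweets corpus", "Link": "https://example.org/b"},
    {"Name": "Quran Speech Dataset", "Link": "https://example.org/c"},
]
print(find_duplicates(entries))
```

The pairwise comparison is quadratic in the number of entries, which should be fine for a catalogue of a few hundred datasets. The flagged pairs are only candidates, so a human would still need to confirm them before anything is removed.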