Exact deduplication #216

Open
Practicinginhell opened this issue Jun 12, 2024 · 3 comments

Comments

@Practicinginhell

Practicinginhell commented Jun 12, 2024

First of all, thank you for providing such an excellent repository. I would like to inquire if the repository supports exact deduplication. Thank you in advance.

@guipenedo
Collaborator

Do you mean exact "document" deduplication? As in, remove documents that have their entire content exactly repeated?

@Practicinginhell
Author

Indeed, that is precisely the point I was intending to convey.

@guipenedo
Collaborator

We currently don't support it out of the box. MinHash will also find those documents, but that might be overkill if you only want exact matching. We'll add it to our to-do list, but feel free to open a PR if you'd like to work on it.
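
In the meantime, exact document deduplication can be approximated outside the library with a content hash. The snippet below is only a minimal sketch under stated assumptions, not this repository's API: the `exact_dedup` function, the dict-shaped documents, and the `text` key are all hypothetical names chosen for illustration. It hashes each document's full text with SHA-256 and keeps the first occurrence of every distinct hash.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(docs: Iterable[dict], text_key: str = "text") -> Iterator[dict]:
    """Yield only the first document seen for each distinct full text.

    Hypothetical helper: documents are assumed to be dicts holding
    their content under `text_key`.
    """
    seen = set()  # digests of every text emitted so far
    for doc in docs:
        digest = hashlib.sha256(doc[text_key].encode("utf-8")).digest()
        if digest in seen:
            continue  # same bytes as an earlier document: drop it
        seen.add(digest)
        yield doc


docs = [
    {"id": 1, "text": "hello world"},
    {"id": 2, "text": "hello world"},   # exact duplicate of id 1
    {"id": 3, "text": "hello world!"},  # differs by one character, so kept
]
unique = list(exact_dedup(docs))
```

Storing fixed-size digests rather than the texts keeps memory proportional to the number of unique documents, but the set still has to fit in one process. Any normalization (whitespace, casing) would need to happen before hashing, since this sketch compares raw text only.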
