Streaming data deduplication #265
Comments
Yes, this is possible, if you index the records as you process them. The most efficient approach is to take a batch of records (say 10,000), index them all, commit the index, then search for duplicates. The API has methods for this.
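In concrete terms, that batch loop might look roughly like the sketch below. This is a minimal illustration only: the class and method names (`ConfigLoader`, `Processor`, `AbstractMatchListener`, `deduplicate()`) reflect Duke's API as I recall it and may differ between versions, and `incomingRecords()` is a hypothetical stand-in for wherever your records actually come from.

```java
import java.util.ArrayList;
import java.util.Collection;

import no.priv.garshol.duke.ConfigLoader;
import no.priv.garshol.duke.Configuration;
import no.priv.garshol.duke.Processor;
import no.priv.garshol.duke.Record;
import no.priv.garshol.duke.matchers.AbstractMatchListener;

public class BatchDedup {
  private static final int BATCH_SIZE = 10000;

  public static void main(String[] args) throws Exception {
    // Load the usual Duke XML configuration (properties, comparators, threshold).
    Configuration config = ConfigLoader.load("dedup-config.xml");
    Processor processor = new Processor(config);

    // Called for every pair whose score passes the configured threshold.
    processor.addMatchListener(new AbstractMatchListener() {
      @Override
      public void matches(Record r1, Record r2, double confidence) {
        System.out.println("Possible duplicate (" + confidence + "): " + r1 + " <-> " + r2);
      }
    });

    Collection<Record> batch = new ArrayList<>();
    for (Record record : incomingRecords()) {       // hypothetical record source
      batch.add(record);
      if (batch.size() >= BATCH_SIZE) {
        // Indexes the batch, commits, and matches it against everything
        // indexed so far; matches are reported through the listener.
        processor.deduplicate(batch);
        batch.clear();
      }
    }
    if (!batch.isEmpty())
      processor.deduplicate(batch);                 // flush the last partial batch

    processor.close();
  }

  // Stand-in for whatever actually supplies the records.
  private static Iterable<Record> incomingRecords() {
    return new ArrayList<Record>();
  }
}
```

The point of the pattern is that a single long-lived processor is enough: each `deduplicate()` call both indexes the new batch and searches it against everything indexed before it.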
I have a similar scenario where I have to dedupe records arriving in streams against Couchbase data, as quickly as possible.
I had the same issue with a data flow DB -> NiFi -> Logstash -> Elastic. Good luck!

@larsga I have a question concerning the batch size. If you have no idea how many records you are going to receive, what value do you assign? How much does it matter if the batch size is too high? Thanks
Hi,
Is it possible to check for duplicates within an unbounded streaming data set, not checking against another static data source but against the data that has streamed so far?
The flow is as follows:
Source Database -> CDC -> Kafka -> Stream Processing (invoke Duke for duplicate check) -> Target Database
I would like to build the index as data streams in from the CDC, keep adding new data to the index, and search the index at the same time for each incoming message. What is the way to do this? Or do we always need at least two static data sets to find duplicates?
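Concretely, what I have in mind is something like the rough, hypothetical sketch below. The Duke class names (`Processor`, `Record`, `RecordImpl`) and the `deduplicate()` call are assumptions about the API, and the Kafka consumer plumbing is omitted.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import no.priv.garshol.duke.Processor;
import no.priv.garshol.duke.Record;
import no.priv.garshol.duke.RecordImpl;

public class StreamingDedup {
  private final Processor processor;          // long-lived; owns the growing index
  private final List<Record> buffer = new ArrayList<>();
  private final int batchSize;

  public StreamingDedup(Processor processor, int batchSize) {
    this.processor = processor;
    this.batchSize = batchSize;
  }

  // Call once per message consumed from the CDC/Kafka topic.
  public void onMessage(Map<String, String> fields) {
    // Duke records map each property to a collection of string values;
    // the RecordImpl constructor used here is an assumption about the API.
    Map<String, Collection<String>> values = new HashMap<>();
    for (Map.Entry<String, String> e : fields.entrySet())
      values.put(e.getKey(), Collections.singletonList(e.getValue()));
    buffer.add(new RecordImpl(values));

    if (buffer.size() >= batchSize)
      flush();
  }

  // Indexes the buffered records, commits, and matches them against the index
  // built from every earlier batch; duplicates are reported to whatever
  // match listeners were registered on the processor.
  public void flush() {
    if (buffer.isEmpty())
      return;
    processor.deduplicate(new ArrayList<>(buffer));
    buffer.clear();
  }
}
```

In practice the buffer would also be flushed on a timer, so records do not sit unmatched indefinitely when traffic is low.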
Thank you.