-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add disableBestEffortDeduplication config option #277
base: master
Are you sure you want to change the base?
Conversation
Codecov Report
@@ Coverage Diff @@
## master #277 +/- ##
============================================
+ Coverage 70.87% 70.98% +0.11%
- Complexity 301 302 +1
============================================
Files 32 32
Lines 1538 1544 +6
Branches 164 165 +1
============================================
+ Hits 1090 1096 +6
Misses 390 390
Partials 58 58
|
kcbq-connector/src/main/java/com/wepay/kafka/connect/bigquery/config/BigQuerySinkConfig.java
Outdated
Show resolved
Hide resolved
6700049
to
37bcaa2
Compare
@adelcast This looks straightforward, useful, and well-implemented. Do you think it might make sense to expand on the existing unit tests for the |
I may need some pointers...I am looking at BigQuerySinkTaskTest.java. Looking at other tests, they mock at the bigQuery.insertAll level, which is several levels of abstraction beyond the area where the my change is. Any suggestion on how to create a test here? |
hi @C0urante, any comments? would love to get this PR to a place where it can be merged.... |
@adelcast apologies for the delay! Been working on my own feature for this connector and it's eaten my life a little bit :) I think there are two ways we might go about testing this:
Personally I think the second option would be cleaner, easier to test, and require less boilerplate, but I'm not sure how the maintainers of the repo feel about that, so just keep in mind that you might get asked to change it if you decide to go in that direction. Hope this helps! |
The connector is currently adding an InsertId per row, which is used by BigQuery to dedupe rows that have the same insertId (in a 1 minute window). Using insertIds throttles the ingestion rate to a maximum of 100k rows per second & 100 MB/s. Insertions without a insertId disable best effort de-duplication [1], which increases the ingestion quota to a maximum of 1 GB/s. For high throughput applications, its desirable to disable dedupe, handling duplication on the query side. [1] https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication Signed-off-by: Alejandro del Castillo <[email protected]>
37bcaa2
to
6b1d4d4
Compare
@C0urante thanks for the tips! I went ahead and implemented your option 2.... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little confused--this looks like we are mocking the connector dependencies in the test now instead of calling getRowId
in isolation. And it also looks like we're not actually verifying what the return value of getRowId
is... am I missing something? 😅
So this was my thinking: getRecordRow will call getRowId if bestEffortDeduplication = True (to generate an insertId). So, an easy way to make sure the switch is working is to set bestEffortDeduplication = False, then add a record and verify that getRowId was not called. That will imply that no insertid was created and the code path is correct. Not exactly your option 2, but seemed easy enough.... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, missed the times(0)
in there.
I was thinking the getRowId
method might be called unconditionally and just return null
if best-effort deduplication is disabled. This wouldn't be super clean since the RowToInsert::of static constructor method doesn't permit null row IDs, so we'd have to do some branching based on whether the returned string is non-null or not. The advantage would be that we'd be able to more easily unit test this since all that'd be necessary would be to create and configure the task class, and then verify that getRowId
returns null
when (and only when) it should.
Still, this looks good enough as-is. ✅
@C0urante ah, I see what you mean....that would be another way to go. Since you don't seem to have an objection, I'll wait for someone else to chime in, if another approach is preferred. Thanks again for your guidance on this. |
The connector is currently adding an InsertId per row, which is used by
BigQuery to dedupe rows that have the same insertId (in a 1 minute
window). Using insertIds throttles the ingestion rate to a maximum of
100k rows per second & 100 MB/s.
Insertions without a insertId disable best effort de-duplication [1],
which increases the ingestion quota to a maximum of 1 GB/s. For high
throughput applications, its desirable to disable dedupe, handling
duplication on the query side.
[1] https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication
Signed-off-by: Alejandro del Castillo [email protected]