Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DynamoDB: Use ON CONFLICT DO NOTHING clause on INSERT CDC operations #77

Merged
merged 2 commits into from
Oct 28, 2024

Conversation

amotl
Copy link
Member

@amotl amotl commented Oct 24, 2024

Reason

... to mitigate errors when events are relayed redundantly from retries after partially failed batches on the Lambda processor.

References

/cc @kneth, @surister, @hlcianfagna

@cla-bot cla-bot bot added the cla-signed label Oct 24, 2024
@amotl amotl marked this pull request as ready for review October 24, 2024 07:58
@hlcianfagna hlcianfagna requested a review from lservini October 24, 2024 08:04
@amotl
Copy link
Member Author

amotl commented Oct 24, 2024

Discussion Summary

Q&A

  1. If you think this change will apply the correct solution to the corresponding problem, or if it would cause other flaws, if that option should be made configurable, or if it is sane to use as a default particularly on the CDC INSERT operations.
  2. While already at it, do you think that clause should also be applied to other spots in the relevant subsystems of CTK? a) MongoDB CDC? b) Optionally also on full-load operations, both DynamoDB and MongoDB?

Do you think it is applicable then to use the upsert on all three occasions, which are
a) INSERT statements of full load operations,
b) INSERT statements of cdc operations,
c) UPDATE statements of cdc operations.

Do nothing on DuplicateKeyExceptions

@wierdvanderhaar already responded:

I think the upsert could work for all three. Only the Update needs to be transformed into an INSERT INTO .. ON CONFLICT DO UPDATE SET.

@hammerhead added:

Regarding c), there is still one semantical difference between an UPDATE and an INSERT ON CONFLICT: If there is no row matching the UPDATE, it will have no effect. If that statement gets re-written as an INSERT ON CONFLICT, the record gets inserted. Not sure if that is a problem or not, but at least it is a difference in behaviour. Personally, I would recommend to stick to using UPDATE.

Overwrite Option

@amotl asked:

I can imagine a user toggling a flag to also make it apply on a) could enable a convenience case where you can just slap a full load operation on a table that is already populated, with thinking much about needing to drop it first. That would be a cost vs. convenience tradeoff.

@hammerhead responded:

Some kind of “override if exists” flag for a) and b) could be interesting, we had a similar question the other day.

amotl added 2 commits October 24, 2024 11:04
... to mitigate errors when events are relayed redundantly from retries
after partially failed batches on the Lambda processor.
@amotl amotl force-pushed the dynamodb-cdc-insert-conflict-noop branch from 928a99a to 671b47d Compare October 24, 2024 09:07
Copy link
Member

@hammerhead hammerhead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense I believe, thanks!

@kneth
Copy link

kneth commented Oct 24, 2024

@amotl Is it possible to log the conflicts? I could imagine that the user would like to know how often it happens and maybe see if there is a pattern.

@amotl
Copy link
Member Author

amotl commented Oct 24, 2024

I don't know if ON CONFLICT DO ... has any means of logging, and how to retrieve it. An alternative would probably be to omit it, and catch the exception in Python code, in order to be able to converge it into a log event.

Because CDC operations are not of the same high volume / throughput like full-load operations, it might be acceptable performance-wise.

Let's also hear @matriv or @seut about this detail.

@matriv
Copy link

matriv commented Oct 24, 2024

I don't know if ON CONFLICT DO ... has any means of logging, and how to retrieve it. An alternative would probably be to omit it, and catch the exception in Python code, in order to be able to converge it into a log event.

Because CDC operations are not of the same high volume / throughput like full-load operations, it might be acceptable performance-wise.

Let's also hear @matriv or @seut about this detail.

I couldn't find DEBUG/TRACE logging in the existing code that could be switched on to facilitate this.

@amotl
Copy link
Member Author

amotl commented Oct 25, 2024

I couldn't find DEBUG/TRACE logging in CrateDB [...]

Thanks for checking, Marios.

Is it possible to log the conflicts?

@kneth: If you don't have any objections, and considering my other evaluations, let's merge and release this now, and follow up with a more elaborate external implementation later, which will be doing it how other vendors are also doing it.

@amotl
Copy link
Member Author

amotl commented Oct 28, 2024

Hi again. We will integrate and release this update, and will provide a more elaborate implementation later, which is closer to what AWS DMS and others are offering. We will use that ticket to track progress:

@amotl amotl merged commit 8c3693f into main Oct 28, 2024
23 checks passed
@amotl amotl deleted the dynamodb-cdc-insert-conflict-noop branch October 28, 2024 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants