Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support REPLACE INTO for INSERT statements #12516

Merged
merged 5 commits into from
Sep 28, 2024

Conversation

fmeringdal
Copy link
Contributor

@fmeringdal fmeringdal commented Sep 17, 2024

Which issue does this PR close?

Closes #12515 .

Are these changes tested?

Yes, in datafusion/core/tests/user_defined/insert_operation.rs.

Are there any user-facing changes?

Yes, overwrite boolean flags has been replaced by InsertOp enum in the following APIs and configs: TableProvider::insert_into, FileSinkConfig, DataFrameWriteOptions.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions core Core DataFusion crate catalog Related to the catalog crate labels Sep 17, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contirbution @fmeringdal -- this code looks good to me. I am worried about the lack of tests, however

Basically if there isn't test coverage we could potentially break this feature without realizing it in some future refactor.

maybe we could add something in

https://github.com/apache/datafusion/tree/main/datafusion/core/tests/user_defined

That shows that a mock up user defined TableProvider will have insert_into called with the correct value of replace_into when the relevant SQL is planned 🤔

It is unfortunate we don't seem to already have such a test

@@ -273,6 +273,7 @@ pub trait TableProvider: Sync + Send {
_state: &dyn Session,
_input: Arc<dyn ExecutionPlan>,
_overwrite: bool,
_replace_into: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please document this new parameter and explain what it means (bonus points for documenting what _overwrite means too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That shows that a mock up user defined TableProvider will have insert_into called with the correct value of replace_into when the relevant SQL is planned

Test added in datafusion/core/tests/user_defined/insert_operation.rs.

Can you please document this new parameter and explain what it means (bonus points for documenting what _overwrite means too)

Documentation added to the variants of the InsertOp enum.

@fmeringdal
Copy link
Contributor Author

Thanks for the quick review @alamb 🙏 I will get some tests added.

I wanted to run an idea by you before I implement the change to restructure the WriteOp enum from:

pub enum WriteOp {
    InsertOverwrite,
    InsertReplace,
    InsertInto,
    Delete,
    Update,
    Ctas,
}

To:

pub enum WriteOp {
    Insert(InsertOp),
    Delete,
    Update,
    Ctas,
}

pub enum InsertOp {
    Append,    // Represents a regular INSERT INTO operation
    Overwrite, // Represents an INSERT OVERWRITE operation
    Replace,   // Represents an INSERT OR REPLACE operation
}

The purpose of such a change would be so that TableProvider::insert_into could take an op: InsertOp argument instead of two boolean arguments for overwrite and replace_into . I think it makes it slightly more convenient for implementors of the insert_into method to match on an enum rather than checking two booleans.
What do you think?

@alamb
Copy link
Contributor

alamb commented Sep 19, 2024

The purpose of such a change would be so that TableProvider::insert_into could take an op: InsertOp argument instead of two boolean arguments for overwrite and replace_into . I think it makes it slightly more convenient for implementors of the insert_into method to match on an enum rather than checking two booleans.
What do you think?

I agree it would be a good change 👌

Thank you!

@alamb alamb marked this pull request as draft September 20, 2024 14:49
@alamb
Copy link
Contributor

alamb commented Sep 20, 2024

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@fmeringdal fmeringdal force-pushed the feature/support-replace-into-for-insert branch from bb5f9f7 to 2291342 Compare September 21, 2024 20:50
@github-actions github-actions bot added the proto Related to proto crate label Sep 21, 2024
@fmeringdal fmeringdal force-pushed the feature/support-replace-into-for-insert branch 7 times, most recently from b000964 to 6c95327 Compare September 22, 2024 10:22
@fmeringdal fmeringdal marked this pull request as ready for review September 22, 2024 10:41
@fmeringdal
Copy link
Contributor Author

fmeringdal commented Sep 22, 2024

Ready for another round of review! Changes applied since the first iteration:

  • InsertOp enum has been introduced and replaced the overwrite bool. This introduced a few breaking changes as described in the PR description. Any other places I need to document these breaking changes?
  • Test added to datafusion/core/tests/user_defined/insert_operation.rs.
  • Handle both REPLACE INTO and INSERT OR REPLACE INTO

This commit introduces an `InsertOp` enum to replace the boolean
`overwrite` flag to provide a more clear and flexible control over how
data is inserted. This change updates the following APIs and configs to
reflect the change: `TableProvider::insert_into`, `FileSinkConfig` and
`DataFrameWriteOptions`.
@fmeringdal fmeringdal force-pushed the feature/support-replace-into-for-insert branch from 6c95327 to 10297cc Compare September 22, 2024 11:28
@alamb alamb added the api change Changes the API exposed to users of the crate label Sep 27, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @fmeringdal for this PR and for making it easy to review the PR (nicely documented, nicely written code, and nicely tested) 🏆

I had this PR checked out anyways so I took the liberty of merging up from main to resolve the conflicts, as well as added a license notice to the file and fixed a clippy issue.

use datafusion_physical_plan::{DisplayAs, ExecutionMode, ExecutionPlan, PlanProperties};

#[tokio::test]
async fn insert_operation_is_passed_correctly_to_table_provider() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb
Copy link
Contributor

alamb commented Sep 28, 2024

Merged up to get CI fix / hopefully pass

@alamb
Copy link
Contributor

alamb commented Sep 28, 2024

🚀

@alamb alamb merged commit 3892499 into apache:main Sep 28, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate catalog Related to the catalog crate core Core DataFusion crate logical-expr Logical plan and expressions proto Related to proto crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support REPLACE INTO for INSERT statements
2 participants