
feat(sink): implement snowflake sink #15429

Merged
merged 40 commits into from
Apr 10, 2024

Conversation

xzhseh
Contributor

@xzhseh xzhseh commented Mar 4, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Detailed spec will be added when finalizing this PR.

The general sink logic for snowflake-sink is a little different from other sinks: Snowflake only supports sinking (e.g., via the insertFiles REST API) from the three external stage storages (i.e., AWS, Azure, and GCP), so we must first upload the corresponding data files to a user-provided S3 bucket, and then trigger the Snowflake pipe to copy from that specific external staged storage.

To keep everything simple at present, we only support Amazon S3 as the external staged storage.

For the detailed Snowpipe workflow, please refer to: https://docs.snowflake.com/user-guide/data-load-snowpipe-rest-overview.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

Detailed spec will be updated in Notion later.

@xzhseh xzhseh self-assigned this Mar 4, 2024
Contributor

@github-actions github-actions bot left a comment

license-eye has checked 4863 files in total.

Valid: 2101 | Invalid: 1 | Ignored: 2761 | Fixed: 0

Invalid file list:
  • src/connector/src/sink/snowflake_connector.rs

src/connector/src/sink/snowflake_connector.rs (review thread resolved)
@neverchanje neverchanje self-requested a review March 8, 2024 06:33
@xzhseh xzhseh marked this pull request as ready for review March 12, 2024 20:40
@xzhseh xzhseh requested a review from a team as a code owner March 12, 2024 20:40
@xzhseh xzhseh requested a review from fuyufjh March 12, 2024 20:40
@xxhZs
Contributor

xxhZs commented Mar 28, 2024

Also need to add an example to integration_tests

@xzhseh
Contributor Author

xzhseh commented Apr 4, 2024

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

@xzhseh xzhseh requested a review from xxhZs April 8, 2024 15:39
@fuyufjh fuyufjh self-requested a review April 9, 2024 03:59
Member

@fuyufjh fuyufjh left a comment


LGTM.

I think a log queue, i.e., buffering data across multiple epochs before commit, is a must-have for Snowflake. Let's do it later.

src/connector/src/sink/snowflake.rs (outdated, review thread resolved)
Contributor

@tabVersion tabVersion left a comment


lgtm as the first version. please fix the test also.

@xxhZs
Contributor

xxhZs commented Apr 9, 2024

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

Ok ok, that can be done without integrating it into our ci, just as an example


/// Construct the *unique* file suffix for the sink
fn file_suffix(&self) -> String {
    format!("{}_{}", self.epoch, self.sink_file_suffix)
}
Contributor

The sink_file_suffix is local to each parallelism of the snowflake writer, so the same sink_file_suffix might occur across multiple parallelisms. Do we need to ensure uniqueness across multiple parallelisms?

Contributor Author

@xzhseh xzhseh commented Apr 9, 2024

as long as the epoch remains different, everything should be fine - actually I'm thinking of only including the epoch as the unique identifier for the s3 intermediate file(s); the sink_file_suffix does not really matter, cc @xxhZs @fuyufjh.

refer: #15429 (comment)

plus, the context for this uniqueness requirement is the snowflake pipe - it will implicitly refuse a sink request (i.e., from insertFiles) for an intermediate s3 file with the same name, even if the returned status code is 200.

Contributor

But different parallelisms share the same epochs. Let's say we are now at epoch 233, and we have two sink parallelisms. The first file name in this epoch is <s3_path>/233-0 for both executors. In this case, will the later written one overwrite the first?

Contributor Author

@xzhseh xzhseh commented Apr 10, 2024

Overwriting will not happen, but it's even worse 🫠 - the latter one uploads its data to the same file on external S3, and we trigger the insertFiles request to the snowflake pipe; but since the file name remains the same, snowflake implicitly treats the request as redundant, and the new version will not be loaded through the pipe - which means the data in the latter file simply gets lost...🙃

Thus, we need another identifier for the parallel writer(s) scenario - any suggestion?

Contributor

uuid is fine. To help debug we can use <epoch>-<uuid> as the file name.

return Ok(());
}
// first sink to the external stage provided by user (i.e., s3)
self.s3_client
Contributor

Can we use streaming upload instead of buffering the data by ourselves? We can use the streaming upload of opendal or the streaming upload implemented by ourselves with the aws sdk.

Contributor Author

yep I think streaming upload is possible - a possible implementation would probably be something like this: https://gist.github.com/ivormetcalf/f2b8e6abfece4328c86ad1ee34363caf

Contributor

Actually there is no need to reimplement it. In our object store crate, we have implemented streaming upload for both the aws s3 sdk and opendal. For simplicity we can use opendal. You may see the implementation in the following code.

async fn streaming_upload(&self, path: &str) -> ObjectResult<BoxedStreamingUploader> {


@@ -32,6 +32,8 @@ pub mod nats;
pub mod pulsar;
pub mod redis;
pub mod remote;
pub mod snowflake;
Contributor

Maybe have a separate folder snowflake to hold the snowflake-related files.

Contributor Author

currently we only have snowflake.rs for the core sink logic, plus snowflake_connector.rs for the helper client implementations (i.e., the rest api client and s3 client) - let's keep it simple at present, and move things around when it gets bigger in the future.

const S3_INTERMEDIATE_FILE_NAME: &str = "RW_SNOWFLAKE_S3_SINK_FILE";

/// The helper function to generate the s3 file name
fn generate_s3_file_name(s3_path: Option<String>, suffix: String) -> String {
Contributor

Seems that we implemented the functionality of writing to snowflake from scratch. If there is no other implementation in other repos, in the future we may move the logic here into a separate repo specifically for snowflake, so that users in the snowflake community can reuse our implementation.

}

async fn barrier(&mut self, _is_checkpoint: bool) -> Result<Self::CommitMetadata> {
Ok(())
Contributor

Yes. We can follow the similar implementation from iceberg #15634

@fuyufjh
Member

fuyufjh commented Apr 9, 2024

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

Ok ok, that can be done without integrating it into our ci, just as an example

Agree. It's hard to make it work in CI because Snowflake is a SaaS service. An example is good enough in this case.

@xzhseh
Contributor Author

xzhseh commented Apr 10, 2024

plus, after some investigation, rusoto_s3 seems a perfect match for the snowflake sink's use case (and even parts of our file sink) - if it looks good, I can refactor the current hand-made S3Client in subsequent pr(s).

cc @fuyufjh @xxhZs.

@xzhseh
Contributor Author

xzhseh commented Apr 10, 2024

Also need to add an example to integration_tests

problem is - we always need an external s3 bucket during any potential use-case (including integration tests), plus a valid snowflake account for the final sink part.

Ok ok, that can be done without integrating it into our ci, just as an example

I'll add a detailed spec / example afterwards - any suggestion on where to put it? (e.g., integration_tests/snowflake-sink/...?)
