Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1000284: Add schema support for structure types #1323

Merged
merged 7 commits into from
Apr 30, 2024

Conversation

sfc-gh-jrose
Copy link
Contributor

@sfc-gh-jrose sfc-gh-jrose commented Mar 18, 2024

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes #SNOW-1000284

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

    This PR adds relevant changes needed to support Structured Types. The goal of this commit is to enable reading structured data from iceberg tables. Writing new tables and support for non-iceberg tables is a future goal of the project.

The change does the following:

  1. metadata that contains fields for relevant types (MAP, OBJECT, ARRAY) signals that structured types are enabled and being returned.
  2. When receiving that metadata the relevant types are marked as structured types using a new new parameter. Users that would like to make new dataframes with structured types will similarly need to set the structured_types parameter on the schema objects. In the future we will enable this behavior by default.
  3. When marked as structured the relevant types will generate slightly different queries for sprocs and udfs which allows them to specify data is structured for both input and output.

I've added test cases that test the following:

  1. Returned data for iceberg and non-iceberg data can be handled with the new metadata. We don't expect non-iceberg structured data yet, but it couldn't hurt to be prepared for it.
  2. Sprocs can take structured data as input and provide it as output. The same goes for udfs.
  3. Subfields in structured objects/maps and indexes in structured arrays can be accessed with regular dataframe semantics.
  4. We're able to return structured data from requests for pandas dataframes from iceberg tables.

Merge Notes:
This change cannot merge until the next python-connector version is released. I will need to add it as a dependency to support the new pandas changes.

This change also will not pass merge gates until I figure out how to add permissions to the test runner in order to access the external volume that we're using for iceberg tests. Maybe @sfc-gh-yixie can help me figure that out.

@sfc-gh-jrose sfc-gh-jrose force-pushed the jrose_snow_1000284_structured_type_support branch 2 times, most recently from ff69f3f to d6f83cd Compare March 26, 2024 20:03
@sfc-gh-jrose sfc-gh-jrose force-pushed the jrose_snow_1000284_structured_type_support branch from d1033e3 to 7b56a11 Compare April 25, 2024 22:08
@sfc-gh-jrose sfc-gh-jrose force-pushed the jrose_snow_1000284_structured_type_support branch from 7b56a11 to e2e444e Compare April 25, 2024 22:08
@sfc-gh-jrose sfc-gh-jrose marked this pull request as ready for review April 26, 2024 00:15
@sfc-gh-jrose sfc-gh-jrose requested a review from a team as a code owner April 26, 2024 00:15
src/snowflake/snowpark/_internal/type_utils.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/_internal/udf_utils.py Outdated Show resolved Hide resolved
@@ -229,6 +294,170 @@ def test_dtypes(session):
]


@pytest.mark.parametrize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also skip these tests in stored proc? Does stored proc connector have the corresponding change for struct type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the time I merge this change it should have the corresponding change.

@@ -76,6 +76,12 @@
IS_NOT_ON_GITHUB = os.getenv("GITHUB_ACTIONS") != "true"
# this env variable is set in regression test
IS_IN_STORED_PROC_LOCALFS = IS_IN_STORED_PROC and os.getenv("IS_LOCAL_FS")
STRUCTURED_TYPE_ENVIRONMENTS = {"dev", "aws"}
IS_STRUCTURED_TYPES_SUPPORTED = (
os.getenv("cloud_provider", "dev") in STRUCTURED_TYPE_ENVIRONMENTS
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cloud_provider is set on github, right? We also have some tests running on jenkins, so they will just be skipped on jenkins?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how that gets set. I know that in tox.ini it's passed in regardless of environment, but that doesn't mean it is set.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we will skip these tests on jenkins right now? e.g., https://ci-dev-142.int.snowflakecomputing.com/job/SnowparkPythonClientRegressRunner/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Once the parameter for enabling structured types is on by default we will need to make sure these tests work in all environments, but for now sfctest0 is the only environment that has them enabled. It is also the only environment that we have iceberg test infrastructure set up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, thanks. Then maybe create a jira and add a TODO here to make sure these conditions will be finally removed?

),
id="structured-types-enabled",
),
False: pytest.param(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we finally will remove this example once all deployments supporting struct types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at some point in the future we will make a BCR that turns structured types on by default. That will probably be the best time to remove this.

@sfc-gh-jrose sfc-gh-jrose requested a review from a team as a code owner April 29, 2024 23:14
[
StructField(
"MAP",
MapType(StringType(16777216), LongType(), structured=True),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have to explicit use the length 16777216? This number may change soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a BCR we have to use the length for now. I expect that whenever I make the lob change I'll need to update this test and others that depend on this constant.

@sfc-gh-jrose sfc-gh-jrose merged commit 9ccac9e into main Apr 30, 2024
26 checks passed
@sfc-gh-jrose sfc-gh-jrose deleted the jrose_snow_1000284_structured_type_support branch April 30, 2024 21:52
@github-actions github-actions bot locked and limited conversation to collaborators Apr 30, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants