Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow UDF/Sproc Decorators to execute in a Local Sandbox #1341

Closed

Conversation

sfc-gh-bgoel
Copy link
Contributor

@sfc-gh-bgoel sfc-gh-bgoel commented Mar 28, 2024

Please answer these questions before submitting your pull requests. Thanks!

  1. What GitHub issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    There is no open Github issue for this PR as it is initiated internally within Snowflake.

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
  3. Please describe how your code solves the related issue.

Background

When a user wants to create a UDF/UDTF/UDAF (UDF for short) or an Sproc, they are able to use the corresponding decorator and provide certain parameters that help in the creation of the said object in their Snowflake account. As it is, the current code takes in those parameters, performs some validation and transformation on them, pulls in some relevant session information, and then creates a SQL query that is sent to be run in the Snowflake account.

Ask

As Snowpark is a critical component of developing a Snowflake Native App, we would like to leverage this existing code so that a user does not have to duplicate their effort in writing Snowpark-only code and Snowpark-related code for their Snowflake Native App. We want to capture any user code in python files that use the above decorators, and be able to use that information outside the Snowpark environment without talking to Snowflake.

Design

  1. In the previously-discussed consensus, Snowflake Native App, or really any user who wants to inspect the information before creating the object in Snowflake, will install a callback into the Snowpark execution, and it is through this callback that the information will be sent back to the user for inspection before creating the object.
  2. Further, Snowflake Native App (or a user that just wants to run code in a local sandbox instead of connecting to Snowflake) logic needs to execute in a sandbox because the UDFs/SPROCs should not be immediately created in Snowflake. Snowflake Native App wants to be able to use the resolved properties to create our own SQL to be run as part of our application code, and this workflow should not happen in the account that the user is developing for. Hence the need to have a sandbox instead of the actual account. We do not want any unintended side-effects happening in the user's account and therefore want to wall-off any interaction with Snowflake.
  3. Additionally, because we want to give control to the caller environment (an external user/client) to decide whether they should allow creation of the object in Snowflake. If they choose to not proceed with the creation, then the code should be able to receive that decision and short circuit before it can interact with Snowflake.

Copy link

github-actions bot commented Mar 28, 2024

CLA Assistant Lite bot All contributors have signed the CLA ✍️ ✅

src/snowflake/snowpark/context.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/context.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/_internal/udf_utils.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/_internal/udf_utils.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/context.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/_internal/udf_utils.py Show resolved Hide resolved
src/snowflake/snowpark/functions.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/udf.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/udf.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/context.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/functions.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/functions.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/session.py Outdated Show resolved Hide resolved
src/snowflake/snowpark/session.py Outdated Show resolved Hide resolved
@sfc-gh-bgoel
Copy link
Contributor Author

sfc-gh-bgoel commented Apr 3, 2024

I have read the CLA Document and I hereby sign the CLA

@sfc-gh-bgoel
Copy link
Contributor Author

recheck

@sfc-gh-bgoel sfc-gh-bgoel reopened this Apr 3, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Apr 3, 2024
@sfc-gh-bgoel sfc-gh-bgoel changed the title [Draft][Do Not Merge] Allow UDF Decorator in a Local Sandbox Allow UDF/Sproc Decorators to execute in a Local Sandbox Apr 3, 2024
@@ -7047,6 +7047,9 @@ def udf(
external_access_integrations: Optional[List[str]] = None,
secrets: Optional[Dict[str, str]] = None,
immutable: bool = False,
native_app_params: Optional[
Copy link
Contributor

@sfc-gh-bdufour sfc-gh-bdufour Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion point: nesting the native app-specific params this way. The thought here was that native app params don't take effect in Snowpark, and while we'd update the docs to explain it, developers don't often read the docs and are likely to be surprised. Isolating the parameters with no effect in Snowpark seemed possibly worthwhile.

elif (
not self._local_testing
): # i.e. we are in a sandbox environment = not a true testing environment
return {"data": [("Statement executed successfully.",)], "sfqid": None}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During resolve_imports_and_packages, there is a call made to session.get_session_stage. This method tries to create a stage in the current session if one does not exist. This further calls connection.run_query(). If the connection is of type MockServerConnection, then any query other than use ... fails. This is because they have the mock object names, so use ... can technically be mimic-ed. But all others throw an error, and I want to prevent the error in sandbox.

I debated short circuiting before the call to session.get_session_stage at all, but figured to keep the change as pushed further down as possible. Open to changing this though.

src/snowflake/snowpark/session.py Outdated Show resolved Hide resolved
@sfc-gh-stan sfc-gh-stan self-requested a review April 4, 2024 21:52
@sfc-gh-bgoel sfc-gh-bgoel force-pushed the bgoel-na-hook-3-no-refactor branch from 8764dd4 to 2ad9570 Compare April 5, 2024 13:21
Copy link
Contributor

@sfc-gh-jrose sfc-gh-jrose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think the change makes sense. Separating out the native apps params seems like the right thing to do.

My main concern for this change is that it adds additional complexity to already lengthy and complex functions like resolve_imports_and_packages. That function should probably be broken up a bit into some helper functions. I usually follow the rule of thumb that if the whole function can't fit on screen at once it's probably too long.

Adding complexity to support a different execution mode to the Session class will make code maintenance harder. Instead I'd rather see a subclass of Session with appropriate parts changed. The session builder can then choose which type of session is more applicable.

Finally, global variables for control flow could probably be better handled with an environment variable for _is_execution_environment_sandboxed and a session builder parameter for _should_continue_registration. I think we should avoid mutable global variables when possible.

Those are my thoughts, but I'd be interested in seeing what other people say before you start refactoring.

@sfc-gh-bdufour
Copy link
Contributor

Thanks for your feedback! Answers inline

Overall I think the change makes sense. Separating out the native apps params seems like the right thing to do.

My main concern for this change is that it adds additional complexity to already lengthy and complex functions like resolve_imports_and_packages. That function should probably be broken up a bit into some helper functions. I usually follow the rule of thumb that if the whole function can't fit on screen at once it's probably too long.

Fair point, however we were asked very specifically to keep the number of lines changed to a minimum and avoid refactoring for this change at least. Yijun thought that introducing refactorings would increase the risk associated with the PR, and that it was best to avoid this.

Adding complexity to support a different execution mode to the Session class will make code maintenance harder. Instead I'd rather see a subclass of Session with appropriate parts changed. The session builder can then choose which type of session is more applicable.

Keep in mind what we're doing is truly global and not scoped to a session, since we're loading arbitrary user code in a sandbox. We don't know how the user code will create sessions (e.g. through a builder or not).

Finally, global variables for control flow could probably be better handled with an environment variable for _is_execution_environment_sandboxed and a session builder parameter for _should_continue_registration. I think we should avoid mutable global variables when possible.

That was deliberate, see above regarding session state. It's actually one of those rare cases where mutable global state is desirable, IHMO. We need to guarantee that no matter how the user's code is written, we're get the information we need and prevent accidental SF queries.

@sfc-gh-bgoel sfc-gh-bgoel marked this pull request as ready for review April 5, 2024 19:09
@sfc-gh-bgoel sfc-gh-bgoel requested a review from a team as a code owner April 5, 2024 19:09
@@ -869,7 +947,7 @@ def resolve_imports_and_packages(
"or a tuple of the file path (str) and the import path (str)."
)
udf_level_imports[resolved_import_tuple[0]] = resolved_import_tuple[1:]
all_urls = session._resolve_imports(
all_urls = session._resolve_imports( # _resolve_imports will return an empty list in case of sandbox
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For offline discussion, is it possible to skip using session if the native apps sandbox doesn't have a session?

@sfc-gh-bgoel
Copy link
Contributor Author

Closing in favor of #1381

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants