SNOW-1446046: Support glob() for additional_source_files in streamlit deployment. #1108

Open
dreeves-battery opened this issue May 24, 2024 · 2 comments
Labels
enhancement New feature or request streamlit

Comments

@dreeves-battery

Description

It would be nice if, instead of looping over additional source files like this:

if additional_source_files:
    for file in additional_source_files:
        ...

they were looped over like this instead:

from glob import glob

# ...

if additional_source_files:
    for file_list in additional_source_files:
        for file in glob(file_list):
            ...

So that users don't need to specify each individual file as a project grows.

This should be fully backwards compatible.
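To make the backwards-compatibility point concrete, here is a minimal sketch of the expansion step on its own. The helper name `expand_source_files` and the `root` parameter are hypothetical and not taken from snowflake-cli; the sketch only assumes that each entry is resolved relative to the project directory.

```python
from glob import glob
from pathlib import Path
from typing import Iterable, List


def expand_source_files(patterns: Iterable[str], root: Path) -> List[Path]:
    """Expand each entry relative to `root` (hypothetical helper, not snowflake-cli code)."""
    seen = set()
    expanded: List[Path] = []
    for pattern in patterns:
        # A plain path without wildcards matches itself if it exists,
        # so explicit file lists keep working unchanged.
        for match in sorted(glob(str(root / pattern))):
            path = Path(match)
            # Skip directories and anything already matched by an earlier entry.
            if path.is_file() and path not in seen:
                seen.add(path)
                expanded.append(path.relative_to(root))
    return expanded
```

One behavioural difference to keep in mind: `glob()` returns an empty list for a path that does not exist, so a misspelled explicit entry would be skipped silently unless the caller checks for empty matches.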

Context

No response

@github-actions github-actions bot changed the title Support glob() for additional_source_files in streamlit deployment. SNOW-1446046: Support glob() for additional_source_files in streamlit deployment. May 24, 2024
@sfc-gh-mraba sfc-gh-mraba added the enhancement New feature or request label May 27, 2024
@sfc-gh-vtimofeenko

I did a bit of digging, and it looks like cli.plugins.streamlit.manager hands it off to cli.plugins.stage.manager to perform the actual PUT through snowflake.connector.cursor. A wildcard passes through the snowflake-cli code, and the error is returned from the cursor if it tries to upload a directory.

The experiment was conducted on this file structure:


├── common
│   ├── hello.py
│   └── mymodule
│       ├── __init__.py
│       └── submodule
│           └── __init__.py
├── environment.yml
├── pages
│   └── my_page.py
├── snowflake.yml
└── streamlit_app.py

Code in mymodule is not terribly important. Trying out different permutations of wildcards in snowflake.yml:

definition_version: "1.1"
streamlit:
  name: streamlit_app_snowcli
  stage: my_streamlit_stage_snowcli
  query_warehouse: ADHOC
  main_file: streamlit_app.py
  env_file: environment.yml
  pages_dir: pages/
  additional_source_files:
    - common/*.py # OK, produces common/hello.py
    # - common/**  # NOK, Produces "not a file but a directory" for mymodule
    - common/mymodule/*.py # OK, uploads __init__.py
    # - common/mymodule/**/*.py # NOK, produces "my_streamlit_stage_snowcli/streamlit_app_snowcli/common/mymodule/**/__init__.py"
    # - common/mymodule/*/*.py # NOK, produces "my_streamlit_stage_snowcli/streamlit_app_snowcli/common/mymodule/*/__init__.py"
    # - common/mymodule/**/*.py # NOK, produces "my_streamlit_stage_snowcli/streamlit_app_snowcli/common/mymodule/**/__init__.py"
    # - common/mymodule/***/*.py # NOK, produces "my_streamlit_stage_snowcli/streamlit_app_snowcli/common/mymodule/***/__init__.py"
    # - common/mymodule/**/* # NOK, produces "my_streamlit_stage_snowcli/streamlit_app_snowcli/common/mymodule/***/__init__.py"
    - common/mymodule/submodule/* # OK, produces what is expected, this would need to be repeated for every "leaf" subdirectory

It looks like nested wildcards (*/*-like patterns) result in files like "my_streamlit_stage_snowcli/streamlit_app_snowcli/common/mymodule/*/__init__.py" (the asterisk is literal), which breaks the Streamlit application.

It's possible to implement something like what the original comment proposes by parsing the glob on the snowflake-cli side (take the glob, turn it into an explicit list of files, and pass that list to snowflake.connector). A potential downside of that approach is that it would deparallelize the uploads, effectively making them single-threaded. That may not be a big deal if the file count is not large enough for it to matter.
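A rough sketch of what resolving the glob on the client side could look like, so that no literal asterisk ever reaches the stage path. The function `plan_uploads` and its parameters are hypothetical; the sketch assumes the stage layout mirrors the local relative paths, as in the output observed above, and it does not call snowflake.connector at all.

```python
from glob import glob
from pathlib import Path, PurePosixPath
from typing import Dict, List


def plan_uploads(pattern: str, root: Path, stage_root: str) -> Dict[str, List[Path]]:
    """Group files matched by `pattern` by destination directory (hypothetical helper)."""
    plan: Dict[str, List[Path]] = {}
    # recursive=True lets "**" match any number of nested directories.
    for match in sorted(glob(str(root / pattern), recursive=True)):
        local = Path(match)
        if not local.is_file():
            continue  # directories are skipped; only concrete files are planned
        relative = local.relative_to(root)
        # Build the destination from the resolved parent directory, never from the pattern.
        destination = str(PurePosixPath(stage_root, *relative.parent.parts))
        plan.setdefault(destination, []).append(local)
    return plan
```

Grouping by destination directory keeps the number of upload calls proportional to the number of directories rather than the number of files, which may matter if each call has a fixed cost.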

@dreeves-battery
Author

> It's possible to implement something like what the original comment proposes by parsing the glob on the snowflake-cli side (take the glob, turn it into an explicit list of files, and pass that list to snowflake.connector). A potential downside of that approach is that it would deparallelize the uploads, effectively making them single-threaded. That may not be a big deal if the file count is not large enough for it to matter.

@sfc-gh-vtimofeenko My 2 cents:

First: If you have enough files for parallelization to matter for performance of the deploy step, then you have enough files for globs to matter for the maintainability of your deployment spec.

Second: my strong intuition is that there is surely a way to parallelize this, if the existing code is parallelized. A glob() call that returns a.py, b.py and c.py is not fundamentally different from passing those directly in a list. There is just one extra step between the two, and it can be performed on the local machine.
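As an illustration of that intuition, once the glob has been expanded locally the resulting list can be fanned out to a thread pool like any other list. This is only a sketch: `upload_all` is a hypothetical name, and `upload_one` stands in for whatever actually performs the PUT, which is not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable, List


def upload_all(
    files: Iterable[Path],
    upload_one: Callable[[Path], str],  # hypothetical stand-in for the real upload call
    max_workers: int = 4,
) -> List[str]:
    """Run `upload_one` for every file on a thread pool and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order and re-raises a worker's exception.
        return list(pool.map(upload_one, files))
```

Whether threads are the right tool depends on how the existing upload path is parallelized; the point is only that the expansion step happens before the fan-out and does not constrain it.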
