Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Support creation and reading of StructuredDataset with local or remote uri #2914

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

JiangJiaWei1103
Copy link
Contributor

@JiangJiaWei1103 JiangJiaWei1103 commented Nov 8, 2024

Tracking issue

Closes #5954.

Why are the changes needed?

When StructuredDataset is instantiated with an uri, the operation of directly reading the dataframe fails. There exist two cases to discuss:

  1. Read the dataframe from StructuredDataset with a local file path as uri:
@task
def read_sd_from_local_uri() -> pd.DataFrame:
    sd = StructuredDataset(uri="./df.parquet", file_format="parquet")
    df = sd.open(pd.DataFrame).all()

    return df
  1. Read the dataframe from StructuredDataset with a remote file path (e.g., s3 object storage) as uri:
@task
def read_sd_from_remote_uri() -> pd.DataFrame:
    sd = StructuredDataset(uri="s3://my-s3-bucket/s3_flyte_dir/df.parquet", file_format="parquet")
    df = sd.open(pd.DataFrame).all()

    return df

Both cases should successfully read and return pd.DataFrame.

What changes were proposed in this pull request?

As commented here, users have no options to set _literal_sd in StructuredDataset. To prevent _literal_sd from being NoneType, we use StructuredDatasetTransformerEngine to construct a StructuredDataset literal and assign it to _literal_sd of StructuredDataset type.

How was this patch tested?

This patch is tested through a newly added unit test.

Setup process

git clone https://github.com/flyteorg/flytekit.git
gh pr checkout 2914
make setup && pip install -e .

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

@JiangJiaWei1103 JiangJiaWei1103 marked this pull request as ready for review November 10, 2024 06:43
@JiangJiaWei1103 JiangJiaWei1103 changed the title [WIP] Support creation and reading of StructuredDataset with remote uri Support creation and reading of StructuredDataset with local or remote uri Nov 10, 2024
@JiangJiaWei1103 JiangJiaWei1103 changed the title Support creation and reading of StructuredDataset with local or remote uri [BUG] Support creation and reading of StructuredDataset with local or remote uri Nov 10, 2024
@Future-Outlier Future-Outlier self-assigned this Nov 11, 2024
@kumare3
Copy link
Contributor

kumare3 commented Nov 11, 2024

this is awesome

Copy link
Member

@Future-Outlier Future-Outlier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing, @JiangJiaWei1103
I am testing this.

Comment on lines +187 to +200
def _set_literal(self, ctx: FlyteContext, expected: LiteralType) -> None:
"""
Explicitly set the StructuredDataset Literal to handle the following cases:

1. Read a dataframe from a StructuredDataset with an uri, for example:

@task
def return_sd() -> StructuredDataset:
sd = StructuredDataset(uri="s3://my-s3-bucket/s3_flyte_dir/df.parquet", file_format="parquet")
df = sd.open(pd.DataFrame).all()
return df

For details, please refer to this issue: https://github.com/flyteorg/flyte/issues/5954.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice comment.

Comment on lines +201 to +202
to_literal = loop_manager.synced(flyte_dataset_transformer.async_to_literal)
self._literal_sd = to_literal(ctx, self, StructuredDataset, expected).scalar.structured_dataset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if is here the best way to write it.
cc @wild-endeavor @thomasjpfan

Comment on lines +668 to +684
@task
def upload_pqt_to_s3(local_path: str, remote_path: str) -> None:
"""Upload local temp parquet file to s3 object storage"""
with tempfile.TemporaryDirectory() as tmp_dir:
fs = FileAccessProvider(
local_sandbox_dir=tmp_dir,
raw_output_prefix="s3://my-s3-bucket"
)
fs.upload(local_path, remote_path)

@task
def read_sd_from_uri(uri: str) -> pd.DataFrame:
sd = StructuredDataset(uri=uri, file_format="parquet")
df = sd.open(pd.DataFrame).all()

return df

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, @JiangJiaWei1103
you should put these tasks in a workflow,
which is closer to the reality that users use it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

[BUG] Structured Dataset create with remote uri and read directly will fail
3 participants