[copilot][flytedirectory] multipart blob download #5715

Merged

Contributor

@wayner0628 wayner0628 commented Sep 1, 2024

Tracking issue

#3632

Why are the changes needed?

Supporting multipart blob downloads allows us to completely copy the specified directory into the input path.

What changes were proposed in this pull request?

  • Use the new storage List API to collect the items under a container before downloading
  • Implement the List API for the in-memory store
  • Parallel download
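
As a rough illustration of the list-then-download flow described in the bullets above, here is a minimal Python sketch. It is not the Go code in this PR: the `STORE` dictionary, the `s3://bucket/...` keys, and the helper names `list_items`, `download_one`, and `download_directory` are all hypothetical stand-ins for the real storage client.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory blob store (key -> bytes); stands in for a real storage client.
STORE = {
    "s3://bucket/dir/a.txt": b"alpha",
    "s3://bucket/dir/sub/b.txt": b"beta",
    "s3://bucket/other/c.txt": b"gamma",
}


def list_items(prefix: str) -> list[str]:
    """Return every key under the prefix, including nested ones."""
    normalized = prefix.rstrip("/") + "/"
    return sorted(k for k in STORE if k.startswith(normalized))


def download_one(key: str, prefix: str, dest: str) -> str:
    """Write one blob under dest, preserving its path relative to the prefix."""
    rel = key[len(prefix.rstrip("/")) + 1:]
    target = os.path.join(dest, *rel.split("/"))
    # exist_ok=True makes this safe when several workers create the same directory.
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as f:
        f.write(STORE[key])
    return target


def download_directory(prefix: str, dest: str) -> list[str]:
    """List first, then fetch every listed item in parallel."""
    keys = list_items(prefix)
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda k: download_one(k, prefix, dest), keys))
```

Calling `download_directory("s3://bucket/dir", tempfile.mkdtemp())` would recreate `a.txt` and `sub/b.txt` locally and leave `other/c.txt` alone, since it is outside the prefix.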

How was this patch tested?

Unit testing and E2E testing

Setup process

  1. Build copilot image
docker build -f Dockerfile.flytecopilot -t my-flytecopilot-app:latest .
docker tag my-flytecopilot-app:latest localhost:30000/my-flytecopilot-app:latest
docker push localhost:30000/my-flytecopilot-app:latest
  2. Use the sandbox to test
flytectl demo start --force
  3. Change copilot image in the sandbox
kubectl edit cm flyte-sandbox-config -n flyte
kubectl rollout restart deployment flyte-sandbox -n flyte
  4. Python example:
import logging
from flytekit import ContainerTask, kwtypes, workflow, task
from flytekit.types.file import FlyteFile
from flytekit.types.directory import FlyteDirectory


logger = logging.getLogger(__file__)

flyte_file_io = ContainerTask(
    name="flyte_file_io",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inputs=FlyteFile),
    outputs=kwtypes(out=FlyteFile),
    image="futureoutlier/rawcontainer:0320",
    command=[
        "python",
        "write_flytefile.py",
        "{{.inputs.inputs}}",
        # "/var/inputs/inputs",
        "/var/outputs/out",
    ],
)

flyte_dir_io = ContainerTask(
    name="flyte_dir_io",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inputs=FlyteDirectory),
    outputs=kwtypes(out=FlyteDirectory),
    image="futureoutlier/rawcontainer:0320",
    command=[
        "python",
        "write_flytedir.py",
        # "{{.inputs.inputs}}",
        "/var/inputs/inputs",
        "/var/outputs/out",
    ],
)

@task
def flyte_file_task() -> FlyteFile:
    with open("./a.txt", "w") as file:
        file.write("This is a.txt file.")
    return FlyteFile(path="./a.txt")

@workflow
def flyte_file_io_wf() -> FlyteFile:
    ff = flyte_file_task()
    return flyte_file_io(inputs=ff)

# Supported by this PR
@task
def flyte_dir_write_task() -> FlyteDirectory:
    from pathlib import Path
    import flytekit
    import os

    working_dir = flytekit.current_context().working_directory
    local_dir = Path(os.path.join(working_dir, "csv_files"))
    local_dir.mkdir(exist_ok=True)
    write_file = local_dir / "a.txt"
    with open(write_file, "w") as file:
        file.write("This is for flyte dir.")

    return FlyteDirectory(path=str(local_dir))

# Not supported by sidecar currently
@task
def flyte_dir_read_task(path: FlyteDirectory) -> bool:
    from pathlib import Path
    path = Path(path)

    if not path.exists() or not path.is_dir():
        print(f"Error: {path} does not exist or is not a directory.")
        return False

    for file in path.rglob("*"):
        if file.is_file():
            print(file)
    return True

@workflow
def flyte_dir_io_wf() -> bool:
    fd = flyte_dir_write_task()
    return flyte_dir_read_task(flyte_dir_io(inputs=fd))

if __name__ == "__main__":
    print(flyte_dir_io_wf())
    print(flyte_file_io_wf())
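
The `write_flytedir.py` script that runs inside the raw container is not shown in this description. As a rough, hypothetical sketch only (not the actual script shipped in the `futureoutlier/rawcontainer:0320` image), a script taking an input directory and an output directory could copy every file across like this. The function name `copy_directory` is an assumption made for illustration.

```python
from pathlib import Path


def copy_directory(input_dir: Path, output_dir: Path) -> int:
    """Copy every file under input_dir into output_dir, keeping relative paths.

    Returns the number of files copied.
    """
    copied = 0
    for src in sorted(input_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = output_dir / src.relative_to(input_dir)
        # Create nested directories on demand; harmless if they already exist.
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(src.read_bytes())
        copied += 1
    return copied
```

With the command used in `flyte_dir_io`, such a script would be called with `/var/inputs/inputs` as the input directory and `/var/outputs/out` as the output directory.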

Screenshots

Unit testing
Screenshot 2024-10-27 at 5 52 37 PM

E2E testing
Screenshot 2024-10-27 at 5 50 46 PM

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

flyteorg/flytekit#2258

Docs link

NA


codecov bot commented Sep 1, 2024

Codecov Report

Attention: Patch coverage is 59.11950% with 65 lines in your changes missing coverage. Please review.

Project coverage is 36.90%. Comparing base (fef67b8) to head (dbbd8c3).
Report is 1 commits behind head on master.

Files with missing lines            Patch %   Lines
flytecopilot/data/download.go       64.38%    40 Missing and 12 partials ⚠️
flytestdlib/storage/mem_store.go    0.00%     11 Missing ⚠️
flytestdlib/storage/storage.go      0.00%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5715      +/-   ##
==========================================
+ Coverage   36.85%   36.90%   +0.04%     
==========================================
  Files        1310     1310              
  Lines      131246   131372     +126     
==========================================
+ Hits        48377    48477     +100     
- Misses      78670    78682      +12     
- Partials     4199     4213      +14     
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.05% <ø> (ø)
unittests-flytecopilot 22.23% <64.38%> (+10.50%) ⬆️
unittests-flytectl 62.39% <ø> (ø)
unittests-flyteidl 6.92% <ø> (ø)
unittests-flyteplugins 53.84% <ø> (ø)
unittests-flytepropeller 42.90% <ø> (ø)
unittests-flytestdlib 55.31% <0.00%> (-0.07%) ⬇️


Signed-off-by: wayner0628 <[email protected]>
@wayner0628 wayner0628 marked this pull request as ready for review September 5, 2024 19:14
Contributor

@wild-endeavor wild-endeavor left a comment

Thanks @wayner0628 - I think this is good. I want to get @eapolinario or @EngHabu to take a quick look at this as well, though. This PR changes a pretty core interface.

@@ -78,6 +78,9 @@ type RawStore interface {
// Head gets metadata about the reference. This should generally be a light weight operation.
Head(ctx context.Context, reference DataReference) (Metadata, error)

// GetItems retrieves the paths of all items from the Blob store or an error
GetItems(ctx context.Context, reference DataReference) ([]string, error)
Contributor

Would this be more accurately named ListItems? Also, what is retrieved: the path relative to the reference input? Can we add a comment?

flytecopilot/data/download.go Show resolved Hide resolved
@@ -54,6 +55,23 @@ func (s *InMemoryStore) Head(ctx context.Context, reference DataReference) (Meta
}, nil
}

func (s *InMemoryStore) GetItems(ctx context.Context, reference DataReference) ([]string, error) {
var items []string
prefix := string(reference) + "/"
Contributor

will reference ever already have a /?
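
To make the trailing-slash concern concrete, here is a small Python sketch of prefix listing over an in-memory map. It is an illustration under assumptions, not the `InMemoryStore` code: the `STORE` contents and the function names `list_naive` and `list_normalized` are hypothetical.

```python
# Hypothetical in-memory store keyed by full object path.
STORE = {
    "mem://data/dir/a.txt": b"1",
    "mem://data/dir/b.txt": b"2",
    "mem://data/dir2/c.txt": b"3",
}


def list_naive(reference: str) -> list[str]:
    """Append "/" unconditionally, mirroring `prefix := string(reference) + "/"`."""
    prefix = reference + "/"
    return sorted(k for k in STORE if k.startswith(prefix))


def list_normalized(reference: str) -> list[str]:
    """Strip any trailing "/" first so the prefix ends in exactly one slash."""
    prefix = reference.rstrip("/") + "/"
    return sorted(k for k in STORE if k.startswith(prefix))
```

In this sketch, `list_naive("mem://data/dir/")` builds the prefix `mem://data/dir//` and matches nothing, while `list_normalized` returns the same two items with or without the trailing slash. Keeping the final `/` in the prefix also prevents `dir` from matching the sibling `dir2`.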

Member

@Future-Outlier Future-Outlier left a comment

Hi, @wayner0628
Can you test cases like the ones in this PR?
flyteorg/flytekit#2258
To be more specific, this case:

flyte_dir_io = ContainerTask(
    name="flyte_dir_io",
    input_data_dir="/var/inputs",
    output_data_dir="/var/outputs",
    inputs=kwtypes(inputs=FlyteDirectory),
    outputs=kwtypes(out=FlyteDirectory),
    image="futureoutlier/rawcontainer:0320",
    command=[
        "python",
        "write_flytedir.py",
        "{{.inputs.inputs}}",
        "/var/outputs/out",
    ],
)

If possible, please provide a screenshot, thank you.

@wild-endeavor
Contributor

There is also this PR, https://github.com/flyteorg/flyte/pull/5674/files which I think we should merge first. The change to core api should probably be done separately.

@wild-endeavor
Contributor

@wayner0628 #5741 was just merged, adding a List API to the storage client. Mind using the new interface for this?

@wayner0628
Contributor Author

@wild-endeavor No problem, I'll update this PR to align with the new interface.

Member

@Future-Outlier Future-Outlier left a comment

Tips for developing copilot with the single binary.

  1. config
plugins:
  logs:
    dynamic-log-links:
      - comet-ml-execution-id:
          displayName: Comet
          templateUris: "{{ .taskConfig.host }}/{{ .taskConfig.workspace }}/{{ .taskConfig.project_name }}/{{ .executionName }}{{ .nodeId }}{{ .taskRetryAttempt }}{{ .taskConfig.link_suffix }}"
      - comet-ml-custom-id:
          displayName: Comet
          templateUris: "{{ .taskConfig.host }}/{{ .taskConfig.workspace }}/{{ .taskConfig.project_name }}/{{ .taskConfig.experiment_key }}"

    kubernetes-enabled: true
    kubernetes-template-uri: http://localhost:30080/kubernetes-dashboard/#/log/{{.namespace }}/{{ .podName }}/pod?namespace={{ .namespace }}
    cloudwatch-enabled: false
    stackdriver-enabled: false
  k8s:
    default-env-vars:
      - FLYTE_AWS_ENDPOINT: "http://flyte-sandbox-minio.flyte:9000"
      - FLYTE_AWS_ACCESS_KEY_ID: minio
      - FLYTE_AWS_SECRET_ACCESS_KEY: miniostorage
      - MLFLOW_TRACKING_URI: postgresql+psycopg2://postgres:@postgres.flyte.svc.cluster.local:5432/flyteadmin
    co-pilot:
          image: "localhost:30000/copilot-flytefile:0603"
  2. How to build the copilot image?
    Use Dockerfile.flytecopilot to build it.

@wayner0628 wayner0628 closed this Sep 15, 2024
@wayner0628 wayner0628 reopened this Sep 15, 2024
@wayner0628
Contributor Author

wayner0628 commented Oct 28, 2024

Hi all,

I've tested the downloader with a Python task that writes to a FlyteDirectory, and the raw container is able to read it successfully, as shown below: Screenshot 2024-10-27 at 5 38 29 PM

Additionally, I tested the scenario where a raw container writes a directory, and a downstream Python task reads it. It appears that the sidecar (uploader for the container task) does not currently support multi-blob tasks. Addressing this limitation will require more time, but since we already have the downloader with directory support, I suggest we proceed with merging this PR and address the sidecar support in a follow-up PR. (#5924)

I'll update the testing .py code and this screenshot in the description, thank you!

@Future-Outlier
Member

> Hi all,
>
> I've tested the downloader with a Python task that writes to a FlyteDirectory, and the raw container is able to read it successfully, as shown below: Screenshot 2024-10-27 at 5 38 29 PM
>
> Additionally, I tested the scenario where a raw container writes a directory, and a downstream Python task reads it. It appears that the sidecar (uploader for the container task) does not currently support multi-blob tasks. Addressing this limitation will require more time, but since we already have the downloader with directory support, I suggest we proceed with merging this PR and address the sidecar support in a follow-up PR. (#5924)
>
> I'll update the testing .py code and this screenshot in the description, thank you!

It's looking pretty good. I'll come back to this on Thursday or Friday. Thank you, Wayner.

Signed-off-by: wayner0628 <[email protected]>
Member

@Future-Outlier Future-Outlier left a comment

todo:

  1. 1 download error, return error in handleBlob
  2. upload todo: recursive directory upload (create an issue first)
  3. os.MkdirAll(dir, 0777) add comments (it will handle case if dir exist already)
  4. success count error handling?
  5. readerSuccess, writer success
  6. add comments to explain life cycles and error handling
    https://github.com/flyteorg/flytekit/blob/master/flytekit/core/type_engine.py#L2177-L2201
  7. comments (goroutine leaks, error handling...)
  8. add comments to tell others list api handle recursive case already
  9. add comments to tell user the down code is single blob
  10. todo: add comments that we should have timeout
  11. create an issue about sidecar rename to upload
  12. ping fabio and Buğra Gedik to take a look at mem store changes
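
Several of the items above (fail on the first download error, avoid leaking workers, add a timeout) describe one error-handling pattern. The following Python sketch shows that pattern in general terms; it is not the Go implementation in `download.go`. The name `download_all`, the `fetch` callback, and the default timeout are assumptions for illustration.

```python
from concurrent.futures import FIRST_EXCEPTION, ThreadPoolExecutor, wait


def download_all(keys, fetch, timeout_s=30.0, max_workers=8):
    """Run fetch(key) for every key in parallel.

    Raises RuntimeError if any fetch fails and TimeoutError if the deadline
    passes before all fetches finish; otherwise returns {key: result}.
    """
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(fetch, key): key for key in keys}
    try:
        # Returns as soon as one future raises, all finish, or the timeout expires.
        done, pending = wait(futures, timeout=timeout_s, return_when=FIRST_EXCEPTION)
        for fut in done:
            err = fut.exception()
            if err is not None:
                raise RuntimeError(f"download failed for {futures[fut]}") from err
        if pending:
            raise TimeoutError(f"{len(pending)} downloads did not finish in {timeout_s}s")
        return {futures[fut]: fut.result() for fut in done}
    finally:
        # Drop queued work; fetches that are already running still run to completion.
        pool.shutdown(wait=False, cancel_futures=True)
```

A single failing item surfaces as one error for the whole directory, which matches the idea that a partially downloaded directory should not be reported as a success.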

Member

@Future-Outlier Future-Outlier left a comment

Review from @eapolinario

  1. Can the stow API download more than one file at a time? (We think not.)
  2. Let's not set a limit on the number of files we can download for now (maybe add a comment suggesting this in case someone hits a rate limit in the future).
  3. Follow-up: download in chunks (no need to do it in this PR).

Thank you @wayner0628, lots of people have wanted this for a LONG TIME.
let's add comments and fix the error handling then merge it
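
One possible reading of the "chunk" follow-up is to split the listed items into fixed-size batches so that only a bounded number of downloads are in flight at once. That interpretation is an assumption, as are the names `chunked` and `download_in_chunks` and the default sizes. A minimal Python sketch of that idea:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def chunked(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def download_in_chunks(keys: Iterable[str], fetch: Callable[[str], object],
                       chunk_size: int = 100, max_workers: int = 8) -> list:
    """Fetch keys batch by batch; results keep the input order."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in chunked(keys, chunk_size):
            # extend() consumes the map iterator, so a batch finishes before the next starts.
            results.extend(pool.map(fetch, batch))
    return results
```

If rate limits ever become a problem, a batch size like this would be one place to add a cap without changing the rest of the download flow.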

@wayner0628
Contributor Author

Hi @Future-Outlier, I’ve pushed some modifications based on your feedback—please take a look!

> can the stow API install more than one file at a time? (We think not)

I’m not sure what this means. Could you clarify what changes you’d like here?

@Future-Outlier
Member

> Hi @Future-Outlier, I’ve pushed some modifications based on your feedback—please take a look!
>
> > can the stow API install more than one file at a time? (We think not)
>
> I’m not sure what this means. Could you clarify what changes you’d like here?

Yes, no problem, I'll do it today.
For the stow API question, you can ignore it for now.
We just want to make sure there's no way to batch download files.

Co-authored-by: Han-Ru Chen (Future-Outlier) <[email protected]>
Signed-off-by: Wei-Yu Kao <[email protected]>
wayner0628 and others added 3 commits November 7, 2024 21:07
Signed-off-by: Future-Outlier <[email protected]>
Signed-off-by: Future-Outlier <[email protected]>
Member

@Future-Outlier Future-Outlier left a comment

thank you Wayner

@Future-Outlier Future-Outlier changed the title from "Support multipart blob download" to "[copilot][flytedirectory] multipart blob download" on Nov 8, 2024
@Future-Outlier Future-Outlier enabled auto-merge (squash) November 8, 2024 05:37
@Future-Outlier Future-Outlier merged commit b5f23a6 into flyteorg:master Nov 8, 2024
50 of 51 checks passed
@wayner0628 wayner0628 deleted the feature/download-multipart-blob branch November 8, 2024 06:28