Add support for Data Explorer functionality (list, add) #381

Closed
5 tasks done
robnewman opened this issue Jan 11, 2024 · 21 comments
Labels
API (New things that have the API that are not yet supported by the CLI), enhancement (New feature or request)

Comments

@robnewman
Member

robnewman commented Jan 11, 2024

Add a new command:

tw data-links

that interacts with the new Data Explorer data-links API endpoint in the Seqera Platform. Some suggested functionality (need to add all the auth syntactic sugar):

tw data-links list --workspace=<workspaceId>                             # list data-links in a workspace
tw data-links list --workspace=<workspaceId> --provider=<cloudProvider>  # subset to a specific cloud provider
tw data-links list --workspace=<workspaceId> --type <cloud|custom>       # subset to auto cloud or custom data links
tw data-links add --workspace=<workspaceId> --name=<dataLinkName> --credentials=<credentials> --description=<description> --provider=<cloudProvider> # add a custom data link to a workspace
tw data-links delete --datalink=<datalinkId> --workspace=<workspaceId>   # delete a datalink in a workspace
tw data-links cp --datalink=<datalinkId>:/path/to/object.txt object.txt  # copy/download a single object (defined by prefix path) from the data link to your localhost 
tw data-links cp /path/to/samplesheet.csv --datalink=<datalinkId> --workspace=<workspaceId> # upload a file from your localhost to the data link
tw data-links cp /path/to/folder --datalink=<datalinkId> --workspace=<workspaceId> --recursive # upload all files in a folder from your localhost to the data link

Tasks

  1. weronikasosnowskaseqera
  2. weronikasosnowskaseqera
  3. weronikasosnowskaseqera
  4. weronikasosnowskaseqera
@ewels
Member

ewels commented Jan 11, 2024

Presumably tw data-explorer cp downloads to the current working directory? Could be nice to support alternative destinations too.. 🤔 Potentially as a separate tw data-explorer sync command, or a flag, or just a second positional argument..

@pditommaso
Contributor

For me it's a -1. Why bloat the CLI with this?

@ewels
Member

ewels commented Jan 11, 2024

You could argue that it's not worth having a CLI at all, there's a perfectly good API!

Having it in the CLI makes it faster and easier to work with datasets from the terminal. It improves developer / user experience.

@ewels
Member

ewels commented Jan 11, 2024

Also the technical reason for having it in the CLI for downloading files:

Presigned URLs expire after a short-ish window (@swampie thought it was 1 hour). If downloading a large dataset, the download could easily run for many hours. A generated bash script would therefore fail, however the CLI could request the presigned URLs one at a time in series, meaning that they're always fresh and continue to work.
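A minimal sketch of the serial approach described above, written in Python for illustration. `fetch_presigned_url` and `download` are hypothetical callables standing in for a Platform API request and an HTTP download; neither is an actual Seqera API or CLI function.

```python
from typing import Callable, Iterable, List


def download_in_series(
    paths: Iterable[str],
    fetch_presigned_url: Callable[[str], str],  # hypothetical: asks the API for a fresh URL
    download: Callable[[str, str], None],       # hypothetical: downloads a URL to a local path
) -> List[str]:
    """Download each object using a URL requested immediately before use."""
    completed = []
    for path in paths:
        # Request the URL only when this file's download is about to start,
        # so the expiry window begins per file rather than per batch.
        url = fetch_presigned_url(path)
        download(url, path)
        completed.append(path)
    return completed
```

A pre-generated script would instead have to request every URL up front, which is where the expiry problem comes from.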

@evanfloden
Member

Adding a use case to download/list a dataset, with a flag to download/list the files inside the dataset csv/tsv/table. For example:

tw dataset cp <dataset_id> --files

This downloads the dataset table (csv/tsv) plus the files. In this way the user only has to be concerned with passing around the dataset object, and they can download/list the files at any time. Think a dataset can also be an output so it becomes a packaging mechanism.
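As a rough sketch of what a `--files` flag might need to do first, the Python snippet below collects the cloud paths referenced in a dataset table. The function name and the list of URI schemes are assumptions for illustration, not part of the CLI or the API.

```python
import csv
import io
from typing import List

# Assumed set of URI schemes; adjust for the providers in use.
CLOUD_SCHEMES = ("s3://", "gs://", "az://")


def extract_cloud_paths(table_text: str, delimiter: str = ",") -> List[str]:
    """Return every cell in a CSV/TSV table that looks like a cloud object path."""
    paths = []
    reader = csv.reader(io.StringIO(table_text), delimiter=delimiter)
    for row in reader:
        for cell in row:
            value = cell.strip()
            # str.startswith accepts a tuple of prefixes.
            if value.startswith(CLOUD_SCHEMES):
                paths.append(value)
    return paths
```

Pass `delimiter="\t"` for TSV tables. The returned paths would still need to be checked against credentials the workspace actually has.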

Note: Today, the auth to access/download/list files in a dataset is not guaranteed as users can create whatever s3:// paths they want in a csv. This issue also exists when launching a pipeline.

@pditommaso
Contributor

Fair enough

@swampie
Member

swampie commented Jan 18, 2024

For upload and download, why use the Seqera CLI when you can use the standard cloud tooling?

@ewels
Member

ewels commented Jan 18, 2024

  • No need to maintain cloud credentials locally
  • Support multiple compute env types (clouds) with a consistent command and single CLI tool
  • Download via consistent Seqera identifiers, less risk of sample or file mixup
  • Better user experience if we add nice things as suggested by Evan, e.g. downloading all data paths within the CSV

@mbosio85
Member

Considering the ongoing work to extend the Data Explorer availability to personal workspaces, these new CLI capabilities should be implemented for those as well.

@swampie
Member

swampie commented Jan 25, 2024

I agree with Paolo that the complexity is not justified for the time being, but I'm open to discussion.

@evanfloden
Member

Adding a very key point being lost here.

Our end users shouldn't need cloud console or cloud provider CLI access. They likely don't have cloud credentials. This is the point of having different roles with WS admins adding credentials and CEs.

End users want to upload data, run pipelines, and download results.

@pditommaso
Contributor

I agree 💯 that CLI should have first-class support. However, my understanding is that the feature highlighted here does not come for free, it may require some specific endpoints.

@robnewman
Member Author

Updated original request to match the Data Explorer data-links API endpoint name

@mbosio85 added this to the v1.0.0 milestone on Apr 10, 2024
@robnewman
Member Author

TBD - pagination is always returned by the API, need to account for this in the CLI commands.
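One way a list command could account for this is to keep requesting pages until none remain. The Python sketch below assumes a token-based scheme; `fetch_page` and the `items` / `nextPageToken` field names are hypothetical placeholders and should be checked against the actual data-links API response.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_data_links(
    fetch_page: Callable[[Optional[str]], Dict[str, Any]],  # hypothetical API call
) -> Iterator[dict]:
    """Yield every data link across all pages, one page request at a time."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        # "items" and "nextPageToken" are assumed field names.
        yield from page.get("items", [])
        token = page.get("nextPageToken")
        if not token:
            break
```

Because this is a generator, a caller can stop early (for example after the first match) without fetching the remaining pages.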

@robnewman added the API (New things that have the API that are not yet supported by the CLI) label on Apr 23, 2024
@jordeu
Member

jordeu commented May 10, 2024

I feel a bit weird about naming this subcommand data-links. I've checked the data explorer's UI, and there, you can upload files without any mention of the "data link" concept. And you can create new "data links" also without any mention of that concept. Why should we use this name in the CLI?

The sub-title where you can list your "data links" says, "Browse remote data repositories and data for use in Seqera Cloud," with no reference to this "data link" concept. Overall, this "data link" concept is misleading.

I'd call it "data source", and then the command line can be tw data-source... with tw ds... alias. Also, the tw data-source add ... subcommand would be more meaningful.

But because naming is difficult and what sounds good to me may sound terrible to others, I suggest reviewing this naming before hardcoding it into the command line interface. Or at least, if "data link" is chosen as the best way of naming it, the web UI should be consistent and call that section "data links" instead of "data explorer" with explicit references to the "data link" concept when you add a new one.

@robnewman
Member Author

@jordeu Thanks for the feedback! The Data Explorer API endpoint is called data-links and we were being consistent with that. I think it would be more confusing to have the API endpoint named differently to the CLI interface (when both are publicly accessible). I agree that the term "data-link" is widely used internally but not directly surfaced externally. I would be in favor of explicitly referencing that term in our docs, but open to feedback.

@weronikasosnowskaseqera
Contributor

@robnewman we are missing a method here to list the contents of a data link

@robnewman
Member Author

@weronikasosnowskaseqera Please add. I wasn't necessarily comprehensive - just that the functionality needs to exist and reflect the API functionality.

@weronikasosnowskaseqera changed the title from "Add support for Data Explorer functionality (list, add, cp)" to "Add support for Data Explorer functionality (list, add)" on May 29, 2024

canny bot commented Jun 13, 2024

This issue has been unlinked from a Canny post: Add datasets directly from s3 / data explorer to the platform 😢

@robnewman
Member Author

robnewman commented Aug 20, 2024

This is now done except for the tw data-link cp command. The other commands are part of the v0.9.4 release.

@weronikasosnowskaseqera
Contributor

tw data-link cp (download/upload) will be handled with another task: https://seqera.atlassian.net/browse/PLAT-289


9 participants