Support streaming from cloud storage #270

Open
SHuang-Broad opened this issue Sep 13, 2021 · 5 comments

Comments

@SHuang-Broad
Contributor

Hi,

We routinely use Nanoplot in our cloud-native pipelines and would love to see it support streaming from cloud storage.

Based on a quick glimpse of the code, it looks like this would require at least one dependency, namely pysam, to support streaming as well.
Are there any other "patches" necessary to support streaming?

Thanks,
Steve

@wdecoster
Owner

Hi Steve,

Interesting suggestion! I have to admit I don't immediately know how to adapt the code for this. Since you mention pysam, are you mainly interested in BAM/CRAM files as input, which you would then specify using a URL?

Cheers,
Wouter

@SHuang-Broad
Contributor Author

Our current pipeline uses Google Cloud Storage (gs://...), but I could see users benefit from support for all major cloud service providers, e.g. AWS, Azure.

If Nanoplot only accesses the BAM through pysam, then that is probably the dependency that needs to support streaming, and the change should be minimal.
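
A minimal sketch of that idea, not Nanoplot's actual code: if the input path is simply handed to `pysam.AlignmentFile`, remote URLs are handled by the underlying htslib, provided it was built with libcurl and the relevant cloud plugins. The helper names `is_remote` and `open_alignment` and the scheme list are assumptions for illustration, and whether `gs://` or `s3://` works depends on the local pysam/htslib build and on credentials (for GCS, htslib reads a token from the `GCS_OAUTH_TOKEN` environment variable).

```python
import os
from urllib.parse import urlparse

# Schemes htslib can typically stream when built with libcurl and its
# S3/GCS support. This list is an assumption about the local build.
REMOTE_SCHEMES = {"http", "https", "ftp", "s3", "gs"}


def is_remote(path):
    """Return True if the path looks like a URL that htslib might stream."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


def open_alignment(path):
    """Open a local or remote BAM file (hypothetical helper)."""
    import pysam  # imported lazily so is_remote() works without pysam

    # Only local paths can be checked on disk; remote ones are left to htslib.
    if not is_remote(path) and not os.path.exists(path):
        raise FileNotFoundError(path)
    # "rb" reads BAM; a CRAM input would use "rc" instead.
    return pysam.AlignmentFile(path, "rb")
```

With this shape, the only change on the calling side is to skip any local file-existence check when `is_remote(path)` is true.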

This is definitely an optimization, so it's not an urgent need.

@SHuang-Broad
Contributor Author

Regarding support for gs://... paths, I think the following link might be useful:
pysam-developers/pysam#592
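
As a possible fallback where a pysam build does not understand `gs://` directly, a publicly readable object can also be addressed over plain HTTPS through `storage.googleapis.com`. The sketch below only rewrites the path; the function name `gs_to_https` is hypothetical, and it assumes the object needs no authentication.

```python
from urllib.parse import quote


def gs_to_https(gs_url):
    """Rewrite gs://bucket/object to its public HTTPS form (hypothetical helper)."""
    prefix = "gs://"
    if not gs_url.startswith(prefix):
        raise ValueError(f"not a gs:// URL: {gs_url}")
    # Split "bucket/path/to/object" at the first slash.
    bucket, _, obj = gs_url[len(prefix):].partition("/")
    if not bucket or not obj:
        raise ValueError(f"expected gs://bucket/object, got: {gs_url}")
    # Percent-encode the object name but keep "/" as a path separator.
    return f"https://storage.googleapis.com/{bucket}/{quote(obj)}"
```

The resulting URL could then be passed to any reader that already accepts `https://` inputs.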

@wdecoster
Owner

Do you have such a (public?) gs://... path for me to test things on? All our data is processed locally.

@SHuang-Broad
Contributor Author

We don't have any public data to share (downloading data from cloud storage incurs costs for the owner of the data unless something like requester pays is specified, so a public bucket could easily be abused by malicious actors).

I think these files from the DeepVariant team might work, but they may require you to set up a Google Cloud account:
https://console.cloud.google.com/storage/browser/deepvariant/pacbio-case-study-testdata?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false

I'm sorry if this is too much trouble.
Thanks for getting on top of this!
Steve
