Link up to AWS S3 buckets #17

Open
ewels opened this issue Jan 27, 2020 · 1 comment
Comments

@ewels
Owner

ewels commented Jan 27, 2020

The NIH has recently started to make SRA data available directly on AWS S3. It would be cool if SRA-Explorer could also link to these.

The complication is that not all datasets are available, and those that are available are spread across more than one S3 bucket. I think the only way to get the URLs is to take the accession number, build a "guess" S3 URI, and then test whether it exists.

The current buckets are:

An example URL to a specific BAM file: http://sra-pub-src-1.s3.us-east-1.amazonaws.com/DRZ000036/F10-DA.bam.1 (possible to directly download without authentication).
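As a rough sketch of the "build a guess and test it" idea, the snippet below assembles a URL in the same shape as the example above and checks it with a plain HTTP HEAD request. The function names `build_s3_url` and `url_exists` are made up for illustration, and the `us-east-1` default is only taken from the example URL; other buckets may need a different region.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_s3_url(bucket, accession, filename, region="us-east-1"):
    """Build a virtual-hosted-style URL shaped like the example in this issue.

    Hypothetical helper: the bucket and region must be supplied by the caller.
    """
    key = f"{accession}/{filename}"
    # quote() percent-encodes unusual characters but leaves "/" intact.
    return f"http://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def url_exists(url, timeout=10):
    """Return True if an anonymous HEAD request to the URL gets a 2xx response."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # Covers HTTPError (e.g. 403/404), URLError and socket timeouts.
        return False
```

Calling `url_exists(build_s3_url("sra-pub-src-1", "DRZ000036", "F10-DA.bam.1"))` would test the example file. This only works when the file name is already known, which may not be the case for most accessions.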

The buckets should allow public and anonymous access, so we should be able to use an AWS SDK to ping the expected files to see if they exist. @wleepang gave a nice example in Python:

>>> import boto3
>>> from botocore import UNSIGNED
>>> from botocore.client import Config
>>> s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
>>> s3.head_bucket(Bucket='sra-pub-src-1')
{'ResponseMetadata': {'HTTPStatusCode': 200, 'RetryAttempts': 0, 'HostId': 'sE4i9sSiQmHwuBeBAKp8JUOsDq09BIoX/WtNQlmO+7qvmTe9/bwJfBqkCdAE0cdDg8Fspcbmddc=', 'RequestId': '931ABD9E2B59BA63', 'HTTPHeaders': {'date': 'Wed, 22 Jan 2020 20:38:56 GMT', 'x-amz-id-2': 'sE4i9sSiQmHwuBeBAKp8JUOsDq09BIoX/WtNQlmO+7qvmTe9/bwJfBqkCdAE0cdDg8Fspcbmddc=', 'server': 'AmazonS3', 'transfer-encoding': 'chunked', 'x-amz-request-id': '931ABD9E2B59BA63', 'x-amz-bucket-region': 'us-east-1', 'content-type': 'application/xml'}}}

Note that the files within each accession directory seem to be randomly named and quite variable. There are BAM files, FASTQ files, FASTA files, all sorts. So we need a big warning notice that (a) tells the user it's up to them to curate the file list they get and (b) reports how many datasets we were unable to find.
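Because the file names can't be predicted, one possible approach is to list whatever sits under each accession prefix and keep a tally of accessions with no hits. The sketch below reuses the unsigned client from the boto3 example above. `make_anonymous_client` and `find_accession_files` are hypothetical names, the list of buckets has to be passed in by the caller, and it assumes the buckets allow anonymous listing (the example above only confirms that `head_bucket` works). If listing is not permitted, boto3 would raise a `ClientError`.

```python
def make_anonymous_client():
    """Create an unsigned S3 client, as in the boto3 example in this issue."""
    # Imported here so the helper below can be used with any client-like object.
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    return boto3.client("s3", config=Config(signature_version=UNSIGNED))


def find_accession_files(s3, accessions, buckets):
    """Return (found, missing) for a list of accessions.

    found maps accession -> list of (bucket, key) pairs;
    missing lists the accessions with no objects in any bucket.
    """
    found, missing = {}, []
    for accession in accessions:
        hits = []
        for bucket in buckets:
            # Only the first page of results is read (up to 1000 keys per call).
            response = s3.list_objects_v2(Bucket=bucket, Prefix=f"{accession}/")
            hits.extend((bucket, obj["Key"]) for obj in response.get("Contents", []))
        if hits:
            found[accession] = hits
        else:
            missing.append(accession)
    return found, missing
```

The length of `missing` gives the count for the warning notice, and the keys in `found` are the uncurated file list that the user would still need to filter by file type.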

@ewels
Owner Author

ewels commented Jan 27, 2020

Open data page for this is now up at https://registry.opendata.aws/ncbi-sra/
