The NIH has recently started to make SRA data available directly on AWS S3. It would be cool if SRA-Explorer could also link to these.
The complication is that not all datasets are available, and those that are available are spread across more than one S3 bucket. I think the only way to get the URLs is to take the accession number, build a "guess" S3 URI from it, and then test whether that URI exists.
The current buckets are:
An example URL to a specific BAM file: http://sra-pub-src-1.s3.us-east-1.amazonaws.com/DRZ000036/F10-DA.bam.1 (possible to directly download without authentication).
The buckets should allow public and anonymous access, so we should be able to use an AWS SDK to ping the expected files to see if they exist. @wleepang gave a nice example in Python:
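The snippet itself is not reproduced here. As a stand-in, here is a minimal sketch of the same idea, which should not be read as @wleepang's code. It assumes `boto3` is installed, that the buckets permit anonymous listing, and that files live under an `<accession>/` prefix as in the example URL above. The function names are hypothetical, and only the one bucket named in this issue is listed.

```python
# Only the bucket named in this issue; the full bucket list is an assumption to fill in.
BUCKETS = ["sra-pub-src-1"]


def make_anonymous_client(region="us-east-1"):
    """Create an S3 client that sends unsigned (anonymous) requests."""
    # Imported lazily so the lookup helper below can be used without boto3.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    return boto3.client(
        "s3", region_name=region, config=Config(signature_version=UNSIGNED)
    )


def find_accession_files(accession, client, buckets=BUCKETS, region="us-east-1"):
    """Return public URLs for objects found under `<accession>/` in each bucket.

    Only the first page of results (up to 1000 keys) is read per bucket.
    """
    urls = []
    for bucket in buckets:
        resp = client.list_objects_v2(Bucket=bucket, Prefix=accession + "/")
        # "Contents" is absent when nothing matches the prefix.
        for obj in resp.get("Contents", []):
            urls.append(f"https://{bucket}.s3.{region}.amazonaws.com/{obj['Key']}")
    return urls
```

Usage would look something like `find_accession_files("DRZ000036", make_anonymous_client())`. An empty list means the accession was not found in any of the buckets tried.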
Note that the files within each accession directory seem to be arbitrarily named and quite variable. There are BAM files, FastQ files, Fasta files, all sorts. So we need a big warning notice that (a) lets the user know it's up to them to curate the file list they're getting, and (b) counts and reports how many datasets we were unable to find.
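One possible shape for that notice, as a rough sketch only: the function name, the message wording, and the input format (a dict mapping each accession to the list of URLs found for it) are all assumptions, not anything SRA-Explorer currently does.

```python
def summarise_lookup(results):
    """Build a warning message from a dict of accession -> list of found URLs.

    Returns the sorted list of accessions with no files and the message text.
    """
    missing = sorted(acc for acc, urls in results.items() if not urls)
    n_files = sum(len(urls) for urls in results.values())
    n_found = len(results) - len(missing)

    lines = [
        f"Found {n_files} file(s) for {n_found} of {len(results)} accession(s).",
        "File names and types vary (BAM, FastQ, Fasta, ...); "
        "please check the list before downloading.",
    ]
    if missing:
        lines.append(
            f"Not found on S3: {len(missing)} accession(s): {', '.join(missing)}"
        )
    return missing, "\n".join(lines)
```

Returning the missing accessions alongside the message keeps the count available to the interface, so it could be shown separately from the file list.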