--geo-location flag runs into invalid zip archive error #326

Open
joverlee521 opened this issue Feb 28, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@joverlee521

Hi NCBI Datasets team,

Today I tried a few geographic locations with the --geo-location flag and ran into the "invalid zip archive" error every time.

My attempt with state level "WA"
$ ./datasets download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
2024/02/28 19:03:50 
GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip


2024/02/28 19:03:51 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/02/28 19:03:51 
POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 19:03:51 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/02/28 19:03:51 
POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 189
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location WA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"WA","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 19:03:51 
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 19:03:51 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 55DD9889E6F9F0E2D8D045A9.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


Downloading: data/ncbi_dataset.zip    112kB done
Downloading: data/ncbi_dataset.zip    112kB invalid zip archive
Validating package []

Use datasets download virus genome taxon <command> --help for detailed help about a command.
My attempt with country level "USA"
$ ./datasets download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
2024/02/28 18:50:55 
GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip


2024/02/28 18:50:56 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/02/28 18:50:56 
POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 18:50:56 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/02/28 18:50:56 
POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 190
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: 76BF10892A975A708F9C4692
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location USA --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"USA","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 18:50:56 
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 18:50:56 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: 76BF10892A975A708F9C4692.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


Downloading: data/ncbi_dataset.zip    8.25MB done
Downloading: data/ncbi_dataset.zip    8.25MB invalid zip archive
Validating package []

Use datasets download virus genome taxon <command> --help for detailed help about a command.
My attempt with continent level "Africa"
$ ./datasets download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
2024/02/28 19:02:41 
GET /datasets/v2alpha/taxonomy/taxon_suggest/sars-cov-2?exact_match=true&tax_rank_filter=higher_taxon&taxon_resource_filter=TAXON_RESOURCE_FILTER_ALL HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Accept: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip


2024/02/28 19:02:42 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.1.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/02/28 19:02:42 
POST /datasets/v2alpha/taxonomy HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 53
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"returned_content":"METADATA","taxons":["2697049"]}

2024/02/28 19:02:42 
HTTP/2.0 200 OK
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/json
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.2.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


2024/02/28 19:02:42 
POST /datasets/v2alpha/virus/genome/download HTTP/1.1
Host: api.ncbi.nlm.nih.gov
User-Agent: OpenAPI-Generator/16.6.0/go
Content-Length: 193
Accept: application/zip
Accept: application/json
Content-Type: application/json
Ncbi-Phid: E35746682FB5DDAAA893F10F
X-Datasets-Client: datasets-cli
X-Datasets-Client-Arch: amd64
X-Datasets-Client-Cmd: download virus genome taxon sars-cov-2 --geo-location Africa --filename data/ncbi_dataset.zip --debug
X-Datasets-Client-Os: linux
X-Datasets-Client-Version: 16.6.0
Accept-Encoding: gzip

{"annotated_only":false,"complete_only":false,"format":"tsv","geo_location":"Africa","host":"","include_sequence":["GENOME"],"pangolin_classification":"","refseq_only":false,"taxon":"2697049"}

2024/02/28 19:02:42 
HTTP/2.0 200 OK
Content-Disposition: attachment; filename=ncbi_dataset.zip
Content-Security-Policy: upgrade-insecure-requests
Content-Type: application/zip
Date: Wed, 28 Feb 2024 19:02:42 GMT
Grpc-Metadata-Logging-Activity: download
Grpc-Metadata-Logging-Annotated_only: False
Grpc-Metadata-Logging-Refseq_only: False
Grpc-Metadata-Logging-Service: virus
Grpc-Metadata-Logging-Taxon: 2697049
Grpc-Metadata-Via: h2 linkerd
Ncbi-Phid: E35746682FB5DDAAA893F10F.3.1
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Vary: Accept-Encoding
X-Datasets-Version: 16.6.0
X-Ua-Compatible: IE=Edge
X-Xss-Protection: 1; mode=block


Downloading: data/ncbi_dataset.zip    855B invalid zip archive
Downloading: data/ncbi_dataset.zip    855B invalid zip archive
Validating package []

Use datasets download virus genome taxon <command> --help for detailed help about a command.
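
The debug output above shows HTTP 200 and Content-Type: application/zip for each download, so the logs alone do not reveal what the saved file actually contains. A minimal sketch for inspecting it, assuming the file is still on disk after validation fails (not confirmed in this thread); `looks_like_zip` is a hypothetical helper, not part of the datasets CLI:

```shell
# Hypothetical helper: succeed if the file starts with the ZIP
# local-file-header signature "PK\003\004" (hex 50 4b 03 04).
# Note: an empty zip archive starts with a different signature and would fail this check.
looks_like_zip() {
  [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "504b0304" ]
}

f=data/ncbi_dataset.zip
if [ -f "$f" ] && ! looks_like_zip "$f"; then
  # Not a zip: show the first bytes, which may or may not be human-readable
  head -c 300 "$f"; echo
fi
```

If the first bytes turn out to be readable text, including them in the report may help the maintainers narrow down the failure.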
@ericcox1 added the bug label Feb 28, 2024
@ericcox1
Collaborator

Hi @joverlee521,

Thanks for opening this issue. We are aware of this bug but haven't yet scheduled a time to get it fixed.

As a workaround, you can download the cached SARS-CoV-2 genome data package (a highly compressed archive containing all SARS-CoV-2 sequences), use grep to identify the sequences with the geographic location of interest, and extract those sequences with samtools.

When I last looked at this about a year ago, the grep command below worked well for narrowing the list of genomes down to those isolated in Washington state, but you should verify that it still works for your data.

Here is what I suggest:

# Download all SARS-CoV-2 genomes
datasets download virus genome taxon sars-cov-2 --filename sars2.zip
 
# Extract all SARS-CoV-2 genome sequences from the downloaded zip archive
unzip -qc sars2.zip ncbi_dataset/data/genomic.fna > sars2-genomic.fna
 
# From the downloaded zip archive, use dataformat to generate a table of genome accessions and geo-location and filter for genomes from Washington state
dataformat tsv virus-genome --package sars2.zip --fields accession,geo-location | \
grep "USA: WA\|USA: Washington\|USA:.*[, ]WA$" | \
grep -v "ID\|Idaho\|DC\|DISTRICT OF COLUMBIA" > sars2-WA-clean.tsv
 
# Copy the accessions to a new file
cut -f1 sars2-WA-clean.tsv > sars2-WA-clean-acc.list
 
# Use samtools to copy the SARS-CoV-2 genomes from Washington state, from the file containing all SARS-CoV-2 genomes to a new file
samtools faidx --region-file sars2-WA-clean-acc.list --output sars2_WA_genomes.fna sars2-genomic.fna 
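
The two grep calls above match anywhere on each line, so the exclusion patterns (for example "ID") also apply to the accession column. A possible variant, sketched with awk, applies the same patterns to the geo-location column only. It assumes the table is tab-separated with the accession in column 1 and the geo-location in column 2, as requested by `--fields accession,geo-location`; `filter_wa` is a hypothetical helper name.

```shell
# Hypothetical helper: keep rows whose geo-location (column 2) matches the
# Washington-state patterns and none of the exclusion patterns.
# Same patterns as the grep commands, but restricted to column 2.
filter_wa() {
  awk -F'\t' '
    $2 ~ /USA: WA|USA: Washington|USA:.*[, ]WA$/ &&
    $2 !~ /ID|Idaho|DC|DISTRICT OF COLUMBIA/
  '
}

# Intended use in place of the two grep commands (not run here):
#   dataformat tsv virus-genome --package sars2.zip --fields accession,geo-location | \
#     filter_wa > sars2-WA-clean.tsv
```

As with the grep version, spot-check the output against the geo-location strings that actually occur in your package.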

I hope that helps.

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI
[email protected]

@DOH-PXC5303

I ran into the same issue for sars-cov-2 (and have gotten around it by downloading the full dataset as suggested) but wanted to note I've had no issues using the geo-location flag for mpox and other taxa. Do you know if the bug is specific to sars-cov-2 @ericcox1?

@ericcox1
Collaborator

Hi @DOH-PXC5303,

I'm not aware of this issue affecting other taxa. The bug could be related to the large number of genome records we have for SARS-CoV-2, currently more than 8.7 million.

-Eric

@skylarwalters

skylarwalters commented Jul 10, 2024

Hi! I've been having this issue too when I run this command:
datasets download virus genome taxon Viruses --complete-only --host human --geo-location Senegal --filename geo.zip
Do you have any recommendations for how I may be able to get around the invalid zip archive? I'm confident it is not a space or connection issue. Thank you so much!!

@ericcox1
Collaborator

Hi @skylarwalters,

We haven't yet had a chance to implement better support for geographic location filtering due to other institutional priorities.

In the meantime, here's an alternative workflow that you can try:

  1. Download the list of nucleotide accessions (with versions) representing virus genome sequences isolated in Senegal from the NCBI Virus web page
  2. Use this downloaded list of nucleotide accessions with the datasets CLI to download the genome sequences, for example:
    datasets download virus genome accession --inputfile sequences.acc --filename senegal-viruses.zip
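
Since step 1 asks for accessions with versions, it may be worth checking the downloaded list before passing it to `--inputfile`. A minimal sketch, assuming one accession per line and that the file may contain blank lines or Windows line endings; `check_acc_list` and the accession pattern are illustrative assumptions, not an official format check:

```shell
# Hypothetical helper: print a cleaned accession list to stdout and report
# (on stderr) any entry that lacks a ".version" suffix.
# $1: path to the accession list, one accession per line.
check_acc_list() {
  tr -d '\r' < "$1" | awk '
    NF == 0 { next }                      # skip blank lines
    $1 !~ /^[A-Za-z_]+[0-9]+\.[0-9]+$/ {  # e.g. MN908947 without ".3"
      bad++
      print "no version: " $1 > "/dev/stderr"
      next
    }
    { print $1 }
    END { exit (bad > 0) }                # non-zero status if any entry was rejected
  '
}
```

For example, `check_acc_list sequences.acc > sequences.clean.acc` would write the cleaned list to a new file (the output filename is arbitrary), which can then be given to `--inputfile`.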

Best,
Eric

@skylarwalters

Hi Eric! Thank you so much for the help!!

4 participants