--include none and --chromosomes all #298
Comments
Hi @conchoecia, Thanks for opening this issue.
This is a bug. We will try to fix this soon. In the meantime, I suggest that you try the following to only download the chromosome sequences:
Thanks again for opening this issue. I'll comment on this thread when we have a bug fix ready.
Best,
Eric Cox, PhD [Contractor] (he/him/his)
Hi @ericcox1,
This solution works well, thanks! I will adjust my scripts to do this instead of parsing the sequence report .json file.
-Darrin
Update: I found that this process also pulls scaffolds that are known to be localized to specific chromosomes but are not actually placed. For example, there is a bird genome,
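The distinction raised in the update (placed chromosomes versus scaffolds that are only localized to a chromosome) can be checked in the sequence report. The following is a minimal sketch, not a confirmed recipe: it assumes the JSON Lines report carries a `role` field whose value is `assembled-molecule` for placed chromosomes, and the sample records and `EXAMPLE` accessions are hypothetical stand-ins for real `datasets summary` output.

```shell
#!/bin/bash
# Sketch: keep only fully placed chromosome records from a sequence report
# in JSON Lines form. The field name "role" and the value
# "assembled-molecule" are assumptions about the report schema; verify them
# against your own output before relying on this filter.

keep_assembled() {
    # Allow optional whitespace around the colon in the JSON.
    grep -E '"role"[[:space:]]*:[[:space:]]*"assembled-molecule"'
}

# Hypothetical sample records, standing in for the output of:
#   ./datasets summary genome accession ${ASSEMBLY} --report sequence --as-json-lines
sample='{"chr_name":"1","role":"assembled-molecule","genbank_accession":"EXAMPLE0001.1"}
{"chr_name":"1","role":"unlocalized-scaffold","genbank_accession":"EXAMPLE0002.1"}
{"chr_name":"Un","role":"unplaced-scaffold","genbank_accession":"EXAMPLE0003.1"}'

# Prints only the record whose role is assembled-molecule.
printf '%s\n' "$sample" | keep_assembled
```

Filtering on the role rather than on the word "Chromosome" matters here because an unlocalized scaffold can still be assigned to a chromosome, which is the situation described in the update.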
Hi @ericcox1,
I identified a place where this breaks: for some assemblies, rehydrating still downloads the entire genome assembly fasta file, in addition to the chromosome-scale scaffolds as individual files as I requested. Here is a minimal example that uses the latest release of datasets:
#!/bin/bash
# For the genome assembly GCA_933207985.1, it appears that downloading the chromosome-scale scaffolds resulted in two errors:
# - The first error is that all of the chromosome-scale scaffolds downloaded twice.
# - The second error is that all of the non-chromosome-scale scaffolds downloaded more than once.
ASSEMBLY=GCA_933207985.1
# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets
# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# The way it works is by downloading a dehydrated dataset, then downloading to select only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr
The resulting files are:
However, the file
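One way to spot an unwanted whole-genome file after rehydration is to list every FASTA file that does not match the `chr` naming pattern. This is a rough sketch: it assumes per-chromosome files are named `chr*.fna`, and the function name `unexpected_fasta` and the demo directory layout are hypothetical.

```shell
#!/bin/bash
# Sketch: list FASTA files under a directory that are NOT per-chromosome files.
# Assumes per-chromosome files are named chr*.fna; adjust the pattern if your
# package uses a different naming scheme.

unexpected_fasta() {
    find "$1" -type f -name '*.fna' ! -name 'chr*.fna' | sort
}

# Self-contained demo on a throwaway directory with hypothetical file names.
demo=$(mktemp -d)
mkdir -p "$demo/data/EXAMPLE"
touch "$demo/data/EXAMPLE/chr1.fna" \
      "$demo/data/EXAMPLE/chr2.fna" \
      "$demo/data/EXAMPLE/EXAMPLE_genomic.fna"

# Prints only the path of EXAMPLE_genomic.fna.
unexpected_fasta "$demo"
rm -rf "$demo"
```

On a real package, calling `unexpected_fasta TEST` after the rehydrate step should print nothing if only chromosome files were downloaded.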
I found another place where this breaks. Some assemblies, despite having chromosome-scale scaffolds, have the error 'Found no files for rehydration' after running this. The assembly that causes this error is the one set as ASSEMBLY in the minimal example below:
#!/bin/bash
# For the record GCF_905220415.1, there is some problem where the final fasta file is empty when using this method.
# Closer inspection reveals that the database correctly identifies certain scaffolds as being chromosome-scale, but
# they are not downloaded correctly
ASSEMBLY=GCF_905220415.1
# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets
# Check if there are chromosome-scale scaffolds
./datasets summary genome accession ${ASSEMBLY} --report sequence --as-json-lines | grep 'Chromosome' | head -5
# remove old files from previous runs
rm -rf TEST/ TEST.zip
# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# The way it works is by downloading a dehydrated dataset, then downloading to select only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr
Here are the results of running the above script, showing that there are chromosome-scale scaffolds, but the rehydration did not work.
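A 'Found no files for rehydration' error can be narrowed down by checking whether the dehydrated package lists any entries matching the pattern before rehydrating. The sketch below assumes the package contains a fetch list (commonly `ncbi_dataset/fetch.txt`) with one file per line, and that `--match` is compared against those entries; both points are assumptions, and the helper name and sample lines are hypothetical.

```shell
#!/bin/bash
# Sketch: count entries in a dehydrated package's fetch list that contain a
# pattern. The location and layout of the fetch list are assumptions; inspect
# the unzipped package to confirm them.

count_fetch_matches() {
    local fetch_file="$1" pattern="$2"
    # grep -c prints 0 and exits non-zero when nothing matches, so mask the status.
    grep -c -- "$pattern" "$fetch_file" || true
}

# Self-contained demo on a hypothetical fetch list.
tmp=$(mktemp)
printf '%s\n' \
    'https://example.invalid/a 100 data/EXAMPLE/chr1.fna' \
    'https://example.invalid/b 100 data/EXAMPLE/chr2.fna' \
    'https://example.invalid/c 100 data/EXAMPLE/sequence_report.jsonl' > "$tmp"

# Number of fetch entries containing "chr".
count_fetch_matches "$tmp" 'chr'
rm -f "$tmp"
```

On a real package this would be called as `count_fetch_matches TEST/ncbi_dataset/fetch.txt chr`. A count of zero would suggest that the package never listed per-chromosome files, rather than that the rehydrate step itself failed.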
Hello,
I would like to use the --chromosomes all option when I download a genome to only get the chromosomes. I noticed that using this option also automatically downloads the complete genome fasta file (I think because --include genome appears to be the default). For example, when I run this command:
datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip
these are the resulting files:
I do not want to download GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna.
I thought that trying --chromosomes all --include none would allow me to download the fasta files of just the scaffolds designated as chromosomes, but it doesn't download any sequence.
Do you have any suggestions on how to download just the chromosome scaffolds without having to filter based on the info in the sequence report? I am using datasets v15.29.0.
Thank you!
Darrin