--include none and --chromosomes all #298

conchoecia · 2023-12-19T15:40:31Z

Hello,

I would like to use the --chromosomes all option when downloading a genome so that I get only the chromosomes. I noticed that this option also automatically downloads the complete genome FASTA file (I think because --include genome appears to be the default). For example, when I run the command datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip, these are the resulting files:

Archive:  TEST.zip
  inflating: README.md
  inflating: ncbi_dataset/data/assembly_data_report.jsonl
  inflating: ncbi_dataset/data/GCA_940337035.1/GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr1.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr2.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr3.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr4.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr5.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr6.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr7.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr8.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr9.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/chr10.fna
  inflating: ncbi_dataset/data/GCA_940337035.1/unplaced.scaf.fna

I do not want to download GCA_940337035.1_PGI_AGRIOTES_LIN_V1_genomic.fna.

I thought that trying --chromosomes all --include none would allow me to download the fasta files of just the scaffolds designated as chromosomes, but it doesn't download any sequence.

Do you have any suggestions on how to download just the chromosome scaffolds without having to filter based on the info in the sequence report? I am using datasets v15.29.0.

Thank you!
Darrin


ericcox1 · 2023-12-19T19:58:58Z

Hi @conchoecia,

Thanks for opening this issue.

> I noticed that using this option also automatically downloads the complete genome fasta file

This is a bug. We will try to fix this soon. In the meantime, I suggest that you try the following to only download the chromosome sequences:

1. Download a dehydrated package:
   datasets download genome accession GCA_940337035.1 --chromosomes all --filename TEST.zip --dehydrated
2. Unzip the downloaded package:
   unzip TEST.zip -d TEST
3. Rehydrate the extracted package, using --match to selectively download filenames that include "chr":
   datasets rehydrate --directory TEST --match chr

Thanks again for opening this issue. I'll comment on this thread when we have a bug fix ready.

Best,
Eric

Eric Cox, PhD [Contractor] (he/him/his)
NCBI Datasets
Sequence Enhancements, Tools and Delivery (SeqPlus)
NIH/NLM/NCBI

conchoecia · 2023-12-20T12:17:30Z

Hi @ericcox1,

This solution works well - thanks! I will adjust my scripts to do this instead of parsing the sequence report .json file.

-Darrin

Update: I found that this process also pulls scaffolds that are known to belong to a specific chromosome but are not actually placed on it.

For example, the bird genome GCA_027574665.1 has named chromosomes with the properties {"assignedMoleculeLocationType":"Chromosome", "role":"assembled-molecule"}. It also has scaffolds that are known to be on a specific chromosome but are not placed within it. These scaffolds are all shorter than 1 Mbp and have the properties {"assignedMoleculeLocationType":"Chromosome", "role":"unlocalized-scaffold"}. I'm not sure yet whether I want to exclude the second type from my analysis, but this would be a good reason to parse the sequence report obtained with datasets download genome accession GCA_027574665.1 --include seq-report
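If the unlocalized scaffolds need to be excluded, the filtering on "role" described above could be sketched roughly as follows. This is only an illustration: the helper name keep_assembled_molecules is hypothetical, and it assumes the sequence report is available as JSON Lines with one record per line and a key spelled "role", as in the properties quoted above. A real JSON parser would be more robust than grep.

```shell
#!/bin/bash
# Sketch only: reduce a sequence report (JSON Lines on stdin) to the records
# whose role is "assembled-molecule", dropping unlocalized and unplaced scaffolds.
# Assumption: one JSON object per line, with the key spelled "role".

# Hypothetical helper name, not part of the datasets CLI.
keep_assembled_molecules() {
    # Allow optional spaces after the colon so both compact and
    # lightly formatted JSON lines match.
    grep -E '"role": *"assembled-molecule"'
}
```

It could be used on any JSON Lines sequence report, for example by piping the report into keep_assembled_molecules and redirecting the result to a file.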

conchoecia · 2023-12-30T17:58:13Z

Hi @ericcox1,

I identified a place where this breaks - for some assemblies, rehydrating still downloads the entire genome assembly fasta file, in addition to the chromosome-scale scaffolds as individual files as I requested.

Here is a minimal example that uses the latest release of datasets:

#!/bin/bash

# For the genome assembly GCA_933207985.1, downloading the chromosome-scale scaffolds appears to result in two errors:
#  - The first error is that all of the chromosome-scale scaffolds are downloaded twice.
#  - The second error is that all of the non-chromosome-scale scaffolds are downloaded more than once.

ASSEMBLY=GCA_933207985.1

# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets

# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# It works by downloading a dehydrated dataset, then rehydrating with --match to select only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr

The resulting files are:

./TEST/ncbi_dataset/data/GCA_933207985.1/chr05.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr02.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr13.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr14.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr03.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr04.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr12.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr11.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr09.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr07.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr10.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr01.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr06.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna
./TEST/ncbi_dataset/data/GCA_933207985.1/chr08.fna

However, the file ./TEST/ncbi_dataset/data/GCA_933207985.1/GCA_933207985.1_aPelCul1.1_chrom_genomic.fna should not be present, judging by how this example behaves with other assembly accessions. I am not sure whether this happens for more than one accession. Thanks!
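Until this is fixed, a local check after rehydration can flag such extra files. The following is a minimal sketch, not a fix for the underlying behavior: the helper name list_unexpected_fasta is hypothetical, and it assumes that per-chromosome files are always named chr*.fna, as in the listing above.

```shell
#!/bin/bash
# Sketch only: list FASTA files under a rehydrated package that are not
# per-chromosome files.
# Assumption: per-chromosome files are named chr*.fna, as in this thread.

# Hypothetical helper name, not part of the datasets CLI.
list_unexpected_fasta() {
    local data_dir="$1"
    # Print every .fna file whose basename does not start with "chr".
    find "$data_dir" -type f -name '*.fna' ! -name 'chr*.fna' | sort
}
```

Running list_unexpected_fasta TEST/ncbi_dataset/data on the example above would be expected to print the path of the *_chrom_genomic.fna file, which can then be reviewed before deleting it.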

conchoecia · 2023-12-31T09:48:39Z

I found another place where this breaks. Some assemblies, despite having chromosome-scale scaffolds, produce the error 'Found no files for rehydration' when I run this workaround. The assembly that triggered this error for me was GCF_905220415.1.

Here is the minimal example:

#!/bin/bash

# For the record GCF_905220415.1, there is some problem where the final fasta file is empty when using this method.
# Closer inspection reveals that the database correctly identifies certain scaffolds as being chromosome-scale, but
#  they are not downloaded correctly

ASSEMBLY=GCF_905220415.1

# set up datasets
curl -o datasets 'ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/LATEST/linux-amd64/datasets'
chmod u+x ./datasets

# Check if there are chromosome-scale scaffolds
./datasets summary genome accession ${ASSEMBLY} --report sequence --as-json-lines | grep 'Chromosome' | head -5

# remove old files from previous runs
rm -rf TEST/ TEST.zip
# Using the example from this github issue: https://github.com/ncbi/datasets/issues/298
# It works by downloading a dehydrated dataset, then rehydrating with --match to select only the chromosome-scale scaffolds
./datasets download genome accession ${ASSEMBLY} --chromosomes all --filename TEST.zip --dehydrated
unzip TEST.zip -d TEST
./datasets rehydrate --directory TEST --match chr

Here are the results of running the above script, showing that there are chromosome-scale scaffolds, but the rehydration did not work.

 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.5M  100 17.5M    0     0  5424k      0  0:00:03  0:00:03 --:--:-- 5424k
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"1","gc_count":"5143305","gc_percent":34,"genbank_accession":"HG991959.1","length":15086434,"refseq_accession":"NC_059537.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"2","gc_count":"4472954","gc_percent":34,"genbank_accession":"HG991960.1","length":13248411,"refseq_accession":"NC_059538.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"3","gc_count":"4471753","gc_percent":34,"genbank_accession":"HG991961.1","length":13170806,"refseq_accession":"NC_059539.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"4","gc_count":"4360384","gc_percent":34,"genbank_accession":"HG991962.1","length":12846590,"refseq_accession":"NC_059540.1","role":"assembled-molecule"}
{"assembly_accession":"GCF_905220415.1","assembly_unit":"Primary Assembly","assigned_molecule_location_type":"Chromosome","chr_name":"5","gc_count":"4238058","gc_percent":33.5,"genbank_accession":"HG991963.1","length":12694599,"refseq_accession":"NC_059541.1","role":"assembled-molecule"}
Collecting 1 genome record [================================================] 100% 1/1
Downloading: TEST.zip    3.98kB valid zip structure -- files not checked
Validating package [================================================] 100% 4/4
Archive:  TEST.zip
  inflating: TEST/README.md
  inflating: TEST/ncbi_dataset/data/assembly_data_report.jsonl
  inflating: TEST/ncbi_dataset/fetch.txt
  inflating: TEST/ncbi_dataset/data/dataset_catalog.json
Found no files for rehydration
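One way to look into this before rehydrating is to inspect the fetch.txt file that the dehydrated package unpacks (TEST/ncbi_dataset/fetch.txt in the output above) and count how many of its entries contain the string passed to --match. The sketch below is only a diagnostic aid: the helper name count_fetch_entries is hypothetical, it assumes fetch.txt lists one downloadable file per line, and it does not claim to reproduce how --match is applied internally.

```shell
#!/bin/bash
# Sketch only: count the lines of a dehydrated package's fetch.txt that
# contain a given fixed string, before running rehydrate.
# Assumption: fetch.txt has one entry per downloadable file.

# Hypothetical helper name, not part of the datasets CLI.
count_fetch_entries() {
    local fetch_file="$1" pattern="$2"
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the helper's exit status at 0 in that case.
    grep -c -F -- "$pattern" "$fetch_file" || true
}
```

For example, count_fetch_entries TEST/ncbi_dataset/fetch.txt chr printing 0 would suggest that the package lists no per-chromosome files for this assembly.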
