Preparing Genbank Assembly File for Queued Submissions

The raw GenBank metadata file was downloaded and filtered as follows:

awk -F "," '{print $6"\t"$15}' 2019-08-15-all-genbank-genomes.csv | tail -n +2 | sed 's/"//g' | awk -F "/" '{print $1"/"$2"/"$3"/"$4"/"$5"/"$6"/"$7"/"$8"/"$9"/"$10"/"$10"_genomic.fna.gz"}' > 2018-08-16-genbank-accessions-ftp-list.tsv

This creates a tab-delimited file with one line per GenBank assembly: the assembly name, then the FTP path that the job downloads with wget.
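To make the path rewriting concrete, here is a small sketch that runs only the final awk stage on a single made-up row. The assembly name and FTP directory are hypothetical placeholders, and the sketch assumes the path splits into ten slash-separated fields so that field 10 is the assembly directory name.

```shell
# Hypothetical sample row: assembly name, a tab, then the FTP directory.
sample=$(printf 'ExampleAsm\tftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/234/GCA_000001234.1_ExampleAsm')

# Same final awk stage as the pipeline above: rejoin the first ten
# slash-separated fields, then append "<field 10>_genomic.fna.gz".
result=$(printf '%s\n' "$sample" | awk -F "/" '{print $1"/"$2"/"$3"/"$4"/"$5"/"$6"/"$7"/"$8"/"$9"/"$10"/"$10"_genomic.fna.gz"}')

printf '%s\n' "$result"
```

Note that the first stage splits on every comma, so it assumes no quoted field before column 15 contains a comma; if one did, the selected columns would shift for that row.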

To then split the metadata file into fixed-size batches, where each line is one job and the -l value sets the number of jobs per submission:

# in the metadata folder
mkdir splits
split -a 5 -l 50 -d 2018-08-16-genbank-accessions-ftp-list.tsv splits/genomes-
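As a quick sanity check of how the split options behave, the sketch below runs the same flags on a throwaway 120-line file in a temporary directory. The file name demo-list.tsv is a hypothetical stand-in, and the -d flag assumes GNU coreutils split.

```shell
# Work in a throwaway directory so nothing in metadata/ is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical stand-in for the accessions list: 120 numbered lines.
seq 1 120 > demo-list.tsv

mkdir splits
# -l 50: 50 lines per file; -d: numeric suffixes; -a 5: five-digit suffixes.
split -a 5 -l 50 -d demo-list.tsv splits/genomes-

n_files=$(ls splits | wc -l | tr -d ' ')
last_lines=$(wc -l < splits/genomes-00002 | tr -d ' ')
total_lines=$(cat splits/genomes-* | wc -l | tr -d ' ')
echo "$n_files files, last has $last_lines lines, $total_lines lines total"
```

The last file simply holds the remainder, so no lines are dropped. On the real list, the same command writes splits/genomes-00000, splits/genomes-00001, and so on.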

The resulting files are the ones stored in the metadata/splits folder. They shouldn't need regenerating often, since GenBank updates are incremental and I don't expect the list to change much over short periods. This is how the list of assemblies and FTP paths, and the split files derived from it, are created to feed batch submission jobs to HTCondor.
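For illustration only, here is a minimal sketch of how one batch file could be read line by line to build a wget call per job. It is a dry run that only prints the commands. The batch contents, the example.org URLs, and the output file naming are all hypothetical, and this is not the actual HTCondor job script.

```shell
# Hypothetical two-line batch file in the same name<TAB>path layout.
batch=$(mktemp)
printf 'AsmA\tftp://example.org/a/a_genomic.fna.gz\nAsmB\tftp://example.org/b/b_genomic.fna.gz\n' > "$batch"

# Split each line on tabs only: field 1 is the assembly name, field 2 the FTP path.
tab=$(printf '\t')
commands=$(
  while IFS="$tab" read -r name url; do
    # Dry run: print the wget command instead of running it.
    echo "wget -q -O ${name}_genomic.fna.gz $url"
  done < "$batch"
)

printf '%s\n' "$commands"
n_cmds=$(printf '%s\n' "$commands" | wc -l | tr -d ' ')
```

Splitting on tabs rather than any whitespace keeps assembly names that contain spaces in one field, though such names would still need quoting before being used as output file names.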