A pipeline for masking a genome, starting from a fasta file or a gff file.

By Pengkai Zhu

Institution: Fujian Agriculture and Forestry University

Email: [email protected]

Cite: Zhu, P., He, T., Zheng, Y., and Chen, L. (2023). The need for masked genomes in gymnosperms. Frontiers in Plant Science 14. doi: 10.3389/fpls.2023.1309744.


Ultra-large genomes often strain computational resources during alignment or indexing, which leads to problems in downstream analysis. However, many analyses focus on specific genome regions, such as exons, introns, UTRs, and key loci, which may represent only 50% or less of the total genome size. Aligning against the entire genome therefore uses resources unnecessarily. I propose masking repetitive regions to shrink the effective reference genome, making the analysis more efficient and lowering resource demands for large-genome alignments.

1. Software

  1. Red
  2. BEDOPS
  3. bedtools2
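
Before running either workflow, it may help to confirm that the tools are on your `PATH`. The following is a minimal sketch; the executable names `Red`, `gff2bed`, and `bedtools` are taken from the commands used later in this README, and `missing_tools` is a hypothetical helper name, so adjust both to your installation.

```shell
#!/bin/sh
# Print the name of every requested tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of any of the packages above.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names are assumed from the commands in this README.
missing=$(missing_tools Red gff2bed bedtools)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing >&2
fi
```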

2. Workflow (begin with a FASTA file)

1. Creating a Directory to Store Output

mkdir -p OUTPUT

2. Predicting Repetitive Sequences from the Genome

Red -gnm /path/to/genome/dir/ -msk ./OUTPUT -rpt ./OUTPUT

3. Converting Soft-Masked Genome to Hard-Masked Genome

awk '!/^>/ {gsub(/[atcg]/,"N")} 1' ./OUTPUT/genome.msk > ./OUTPUT/genome.hardmasked.fa
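
To check how much of the genome ended up masked, one option is to count `N` bases in the hard-masked file. This is a rough sketch rather than part of the original pipeline; `masked_fraction` is a hypothetical helper name, and any `N` already present in the input assembly (for example, scaffold gaps) is counted as well.

```shell
#!/bin/sh
# Print three tab-separated values for a FASTA file:
# total bases, N/n bases, and the N fraction.
# masked_fraction is a hypothetical helper used for illustration only.
masked_fraction() {
    awk '!/^>/ {
             total += length($0)       # count every base on this sequence line
             n += gsub(/[Nn]/, "")     # gsub returns how many N or n it replaced
         }
         END {
             if (total > 0) printf "%d\t%d\t%.4f\n", total, n, n / total
         }' "$1"
}

# Usage, with the output path from the step above:
# masked_fraction ./OUTPUT/genome.hardmasked.fa
```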

3. Workflow (begin with a FASTA file and a repeat annotation file)

1. Converting the GFF File to a BED File

gff2bed < LTR.gff3 > LTR.bed
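
The conversion matters because GFF3 coordinates are 1-based and inclusive, while BED coordinates are 0-based and half-open, so the start column shifts by one. If BEDOPS is not available, the sketch below shows the same coordinate shift with `awk`. It is a simplified stand-in, not a replacement for `gff2bed`: it keeps only three columns, and `gff_to_bed3` is a hypothetical helper name.

```shell
#!/bin/sh
# Minimal GFF3 -> 3-column BED conversion (illustrative sketch only).
# GFF3 start/end are 1-based inclusive; BED is 0-based half-open,
# so only the start coordinate is shifted by 1.
gff_to_bed3() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
         /^#/ || NF < 9 { next }       # skip comments, directives, and short lines
         { print $1, $4 - 1, $5 }' "$1" |
    LC_ALL=C sort -k1,1 -k2,2n         # sort by sequence name, then by start
}

# Usage, with the file names from the step above:
# gff_to_bed3 LTR.gff3 > LTR.bed
```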

2. Masking genome.fa

bedtools maskfasta -fi genome.fa -bed LTR.bed -fo genome.hardmasked.fasta
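
As a sanity check, the number of bases covered by the BED file can be compared with the number of `N` bases in `genome.hardmasked.fasta`. The sketch below only sums interval lengths; it assumes the intervals do not overlap (overlapping intervals would be counted twice), and `bed_covered_bases` is a hypothetical helper name.

```shell
#!/bin/sh
# Sum the lengths (end - start) of all intervals in a BED file.
# Assumes non-overlapping intervals; bed_covered_bases is a hypothetical helper.
bed_covered_bases() {
    awk -F '\t' '!/^(#|track|browser)/ { sum += $3 - $2 }   # skip header lines
         END { print sum + 0 }' "$1"
}

# Usage, with the BED file from the step above:
# bed_covered_bases LTR.bed
```

If the annotation contains overlapping features, merging them first (for example with `bedtools merge` on a sorted BED file) should give a total that is comparable to the masked base count.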