A pipeline for masking a genome, starting from a FASTA file or a GFF file.

By Pengkai Zhu

Institution: Fujian Agriculture and Forestry University

Email: pkzhu222@gmail.com

Cite: Zhu, P., He, T., Zheng, Y., and Chen, L. (2023). The need for masked genomes in gymnosperms. Frontiers in Plant Science 14. doi: 10.3389/fpls.2023.1309744.

Ultra-large genomes often strain computational resources during alignment or indexing, which leads to problems in downstream analysis. Many analyses, however, focus on specific genome regions such as exons, introns, UTRs, and key loci, which may represent only 50% or less of the total genome size. Aligning against the entire genome then uses resources unnecessarily. I therefore propose masking repetitive regions to reduce the effective size of the reference genome, making the analysis more efficient and lowering resource demands for large-genome alignments.

1. Software

Red
BEDOPS
bedtools2

2. Workflow (begin with a fasta file)

1. Creating a Directory to Store Output

mkdir -p OUTPUT

2. Predicting Repetitive Sequences from the Genome

Red -gnm /path/to/genome/dir/ -msk ./OUTPUT -rpt ./OUTPUT
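Note that -gnm takes a directory rather than a single file, and, as far as the Red documentation describes, only files ending in .fa are read. The following is a minimal sketch for staging a genome into such a directory. The function name stage_genome and the directory ./RED_INPUT are hypothetical, and the use of a symlink (to avoid copying a very large file) is an assumption; copy the file instead if symlinks cause trouble on your system.

```shell
# stage_genome SRC DIR  (hypothetical helper, not part of Red)
# Link SRC into DIR under a name ending in .fa and print the new path.
stage_genome() {
    src=$1
    dir=$2
    mkdir -p "$dir"
    base=$(basename "$src")
    # Strip a common FASTA extension, then append .fa
    name=${base%.fasta}
    name=${name%.fna}
    name=${name%.fa}
    # Use an absolute target so the link resolves from any working directory
    ln -sf "$(cd "$(dirname "$src")" && pwd)/$base" "$dir/$name.fa"
    echo "$dir/$name.fa"
}
```

For example, `stage_genome /path/to/genome.fasta ./RED_INPUT` followed by `Red -gnm ./RED_INPUT/ -msk ./OUTPUT -rpt ./OUTPUT`. The masked output name appears to follow the input file name, so an input called genome.fa should yield the genome.msk used in the next step.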

3. Converting Soft-Masked Genome to Hard-Masked Genome

awk '!/^>/ {gsub(/[a-z]/,"N")} 1' ./OUTPUT/genome.msk > ./OUTPUT/genome.hardmasked.fa
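To check how much of the genome ended up masked, the N content of the hard-masked file can be counted. This is a minimal sketch; the function name masked_fraction is hypothetical. Keep in mind that any N already present in the original assembly (for example, scaffold gaps) is counted as well, so the figure is an upper bound on what Red masked.

```shell
# masked_fraction FASTA  (hypothetical helper)
# Print: total bases <TAB> N bases <TAB> percent N
masked_fraction() {
    awk '!/^>/ {
            sub(/\r$/, "")              # tolerate Windows line endings
            total += length($0)
            n += gsub(/[Nn]/, "")       # gsub returns the number of replacements
         }
         END {
            if (total == 0) { print "0\t0\t0.00"; exit }
            printf "%d\t%d\t%.2f\n", total, n, 100 * n / total
         }' "$1"
}
```

Usage: `masked_fraction ./OUTPUT/genome.hardmasked.fa`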

3. Workflow (begin with a fasta file and a repeat annotation file)

1. Converting the GFF File to a BED File

gff2bed < LTR.gff3 > LTR.bed

2. Masking genome.fa

bedtools maskfasta -fi genome.fa -bed LTR.bed -fo genome.hardmasked.fasta
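As a sanity check, the number of bases covered by LTR.bed can be compared with the number of N bases in genome.hardmasked.fasta, which should be at least as large. The sketch below sums the covered bases while merging overlapping intervals. The function name bed_covered_bp is hypothetical, and it assumes the BED file is sorted by chromosome and then by start; gff2bed should emit sorted output by default, and a BED file from another source can be sorted first with BEDOPS sort-bed.

```shell
# bed_covered_bp BED  (hypothetical helper)
# Print the total number of bases covered, counting overlaps only once.
# Assumes input sorted by chromosome, then start.
bed_covered_bp() {
    awk -F'\t' '
        $1 != chrom || $2 + 0 > end {   # new chromosome or a gap: close the current run
            if (chrom != "") total += end - start
            chrom = $1; start = $2 + 0; end = $3 + 0
            next
        }
        $3 + 0 > end { end = $3 + 0 }   # overlapping interval: extend the run
        END {
            if (chrom != "") total += end - start
            print total + 0
        }' "$1"
}
```

Usage: `bed_covered_bp LTR.bed`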
