-
Notifications
You must be signed in to change notification settings - Fork 2
Building a reference sequence
Building a metagenomic reference sequence based on an already existing one (or you can create your own one from scratch)
We provide reference sequences - see the installation part of the wiki. You might want to adjust these, and we show you some tips and tricks for doing this on Linux on this page.
Check for complete genomes in 1 contig, typically from this page: https://www.ncbi.nlm.nih.gov/genbank/
Get organisms from old reference sequences (e.g. Neisseria bacilliformis):
# .fa.fai reference required
samtools faidx /path/to/old_reference_sequence.fa
# extract N. bacilliformis and save as 1.fa
samtools faidx /path/to/old_reference_sequence.fa NZ_CP059571.1_Neisseria_bacilliformis_strain_DSM_23338_chromosome__complete_genome > 1.fa
Join all new organisms using cat:
cat *.fa > new_genomes.fa
Statistics - how many genomes == Fasta headers exist? Did the number give you what you expected ? Else there might be plasmids or fragmented chromosomes or contigs in the assembly.
# count the chromosomes
grep -c ">" *.fa
# get a
samtools faidx x.fa
less x.fa.fai
Contamination, ie. of genomes with human or other DNA, or of sample fastq read files with environmental DNA, is probably the biggest problem in metagenomics. We wrote a tool, Blacklist, which obscures (masks with Ns) questionable regions in a bacterial reference metagenome. Blacklister works by aligning contaminant sequences to the metagenome, then masking those potentially contaminant regions which received alignments with Ns, so no further metagenome reads are aligned there. This led to the removal of many questionable alignments to obscure taxa from our results sets across many lung metagenomes.
# First change input file, reference and number of threads used in script
# Number of threads used in the blacklister script should match the number defined in
# the srun -c 24 command, eg 24 in this case.
# Reference FASTA needs a bowtie2 index
bowtie2-build -f myref.fa myref.fa
# run blacklister directly
bash blacklister.sh
# or run via SLURM
srun -c 24 bash blacklister.sh
- copy the current sequence in a folder (.fa and .fai file)
- remove row containing the organism from the .fai file
bash split_fasta_names_samtools.sh
cd out/
cat *.fa > ../new_name.fa
- join all .fa files containing parts of the reference using cat, e.g.:
srun cat /ngsssd1/rcug/refbuild2020/human_for_metagenomes.fa \
/ngsssd1/rcug/refbuild2020/bacteria_all_curated/2021_12/more_removed/all_bact.fa.masked.2021_12.fa \
/ngsssd1/rcug/refbuild2020/bacteria_refseq/ncbi-genomes-2020-09-08/3_select_genomes/2021_09/blacklister/ilona_add_2021_09.fa.masked.fa \
/ngsssd1/rcug/refbuild2020/fungi_all/2021_revise/fungi_2020_all.masked.revised.fa \
/ngsssd1/rcug/refbuild2020/2021_12_human/rm_fungus/4_fungidb_selected_masked_revised/fungidb_selected_masked_revised_2021_12.fa \
/ngsssd1/rcug/refbuild2020/archaea/all_archaea_cln_masked_1strainperspecies.fa \
/ngsssd1/rcug/refbuild2020/virus/viruses_31_plus_cold_2021_08.fa \
/ngsssd1/rcug/refbuild2020/2021_12_human/neu/bowtie/blacklister/6_neu.fasta.masked.fa \
> 2021_12_human_bact_arch_fungi_vir.fa &
# create FASTA index using samtools
samtools faidx refence_sequence.fa
An index is required for bwa to be able to align against the sequence/index
bwa index.fa
- copy the new reference sequence to your other ones
- add the reference sequence to the Wochenende
config.yaml
- and to
run_Wochenende_SLURM.sh