Skip to content

Building a reference sequence

Colin Davenport edited this page Jun 17, 2022 · 5 revisions

Building a metagenomic reference sequence based on an already existing one (or you can create your own one from scratch)

We provide reference sequences - see the installation part of the wiki. You might want to adjust these, and we show you some tips and tricks for doing this on Linux on this page.

Get new genomes from the web

Check for complete genomes in 1 contig, typically from this page: https://www.ncbi.nlm.nih.gov/genbank/

Add organisms from old reference sequences

Get organisms from old reference sequences (e.g. Neisseria bacilliformis):

# .fa.fai reference required
samtools faidx /path/to/old_reference_sequence.fa
# extract N. bacilliformis and save as 1.fa
samtools faidx /path/to/old_reference_sequence.fa NZ_CP059571.1_Neisseria_bacilliformis_strain_DSM_23338_chromosome__complete_genome > 1.fa 

Join all new organisms using cat:

cat *.fa > new_genomes.fa

Statistics - how many genomes == Fasta headers exist? Did the number give you what you expected ? Else there might be plasmids or fragmented chromosomes or contigs in the assembly.

# count the chromosomes
grep -c ">" *.fa

# get a 
samtools faidx x.fa
less x.fa.fai

Contamination

Contamination, ie. of genomes with human or other DNA, or of sample fastq read files with environmental DNA, is probably the biggest problem in metagenomics. We wrote a tool, Blacklist, which obscures (masks with Ns) questionable regions in a bacterial reference metagenome. Blacklister works by aligning contaminant sequences to the metagenome, then masking those potentially contaminant regions which received alignments with Ns, so no further metagenome reads are aligned there. This led to the removal of many questionable alignments to obscure taxa from our results sets across many lung metagenomes.

Blacklister

Github Blacklister

# First change input file, reference and number of threads used in script
# Number of threads used in the blacklister script should match the number defined in 
# the srun -c 24 command, eg 24 in this case.

# Reference FASTA needs a bowtie2 index
bowtie2-build -f myref.fa myref.fa

# run blacklister directly
bash blacklister.sh
# or run via SLURM
srun -c 24 bash blacklister.sh

remove organisms from the current reference sequence

  • copy the current sequence in a folder (.fa and .fai file)
  • remove row containing the organism from the .fai file
bash split_fasta_names_samtools.sh
cd out/
cat *.fa > ../new_name.fa

putting the reference together

  • join all .fa files containing parts of the reference using cat, e.g.:
srun cat /ngsssd1/rcug/refbuild2020/human_for_metagenomes.fa  \
/ngsssd1/rcug/refbuild2020/bacteria_all_curated/2021_12/more_removed/all_bact.fa.masked.2021_12.fa \
/ngsssd1/rcug/refbuild2020/bacteria_refseq/ncbi-genomes-2020-09-08/3_select_genomes/2021_09/blacklister/ilona_add_2021_09.fa.masked.fa \
/ngsssd1/rcug/refbuild2020/fungi_all/2021_revise/fungi_2020_all.masked.revised.fa \
/ngsssd1/rcug/refbuild2020/2021_12_human/rm_fungus/4_fungidb_selected_masked_revised/fungidb_selected_masked_revised_2021_12.fa \
/ngsssd1/rcug/refbuild2020/archaea/all_archaea_cln_masked_1strainperspecies.fa \
/ngsssd1/rcug/refbuild2020/virus/viruses_31_plus_cold_2021_08.fa \
/ngsssd1/rcug/refbuild2020/2021_12_human/neu/bowtie/blacklister/6_neu.fasta.masked.fa \
> 2021_12_human_bact_arch_fungi_vir.fa &
# create FASTA index using samtools
samtools faidx refence_sequence.fa

create bwa index

An index is required for bwa to be able to align against the sequence/index

bwa index.fa

add to Wochenende

  • copy the new reference sequence to your other ones
  • add the reference sequence to the Wochenende config.yaml
  • and to run_Wochenende_SLURM.sh