Skip to content

Latest commit

 

History

History
245 lines (205 loc) · 8.65 KB

README.md

File metadata and controls

245 lines (205 loc) · 8.65 KB

Post assembly curation

Please refer to Part 3 - Slides.

Requirements

This tutorial assumes T2T-Polish, seqrequester, Winnowmap, asset are installed under $tools. The rest are assumed to be set in your environment $PATH (~/.bash_profile) or loaded via modules.

1. Clean up

  1. Identify MT, chloroplast, or any other contamination
  1. Remove excessive small copies of rDNA, small repetitive sequences, ...

  2. Add ribotin rDNA (if available) and MT sequences

  3. Name chromosomes (eg. chr1_1, chr1_2, ...)

2. Coverage based analysis

Telomere, gaps, microsatellite annotation

mkdir -p pattern && cd pattern
asm=bH_v2
ver=v2    # Tip: I prefer the polished ones be v1.0 or v2.0. Set accordingly
ln -s ../assemblies/$asm.fasta
$tools/T2T-Polish/coverage/init.sh $asm.fa $ver
cd ../

Output files:

  $ver.bed          # size
  $ver.exclude.beds # N-gaps
  $ver.telo.bed     # telomere
  # Microsatellite repeats (pattern)
  $ver/$ver.microsatellite.AT.128.bw
  $ver/$ver.microsatellite.GA.128.bw
  $ver/$ver.microsatellite.GC.128.bw
  $ver/$ver.microsatellite.TC.128.bw
  # ... (and .bed for the above bw files)

Load them on IGV and investigate satellite repeat enriched region. Log along scaffolds that should be re-oriented to match the p and q arm definitions and/or match between haplotypes. Re-orienting can happen after read-mapping and polishing.

Here is an IGV session file. Navigate through chromosomes, and look for repeated patterns.

Mapping reads back to the assembly

Align all reads to

  • both haplotypes
  • each haplotype

As a result, there will be 3 x 3 alignments:
(HiFi, ONT, Illumina) x (Both haps, hap1, hap2)

mkdir -p mapping && cd mapping
ls /full/path/to/HiFi/*.fq.gz > HiFi.fofn
ls /full/path/to/ONT/*.fq.gz > ONT.fofn
ls /full/path/to/Illumina/*read-pair1.fq.gz > illumina1.fofn
ls /full/path/to/Illumina/*read-pair2.fq.gz > illumina2.fofn
paste illumina1.fofn illumina2.fofn > illumina.fofn
rm illumina1.fofn illumina2.fofn

For slurm environment, use the submitter wrapper. Otherwise, run each scripts described as in the README. Understand winnowmap submitter script

$tools/T2T-Polish/winnowmap/_submit.sh
Usage: ./_submit.sh ref.fasta prefix map [wm_opt]
  ref.fasta  reference to align
  prefix     output prefix
  map        mapping mode. map-pb for HiFi, map-ont for ONT, map-pb-clr for CLR.
  wm_opt     winnowmap optional arguments. i.e. -y for adding methylation tags.
  Required: input.fofn

For example, for hifi reads, submit for the 3 assemblies:

pf=HiFi
asm=bH_v2 # adjust
asm_fa=../assemblies/$asm
# For convenience, we assume ../assemblies/ has all 3 versions as $asm_fa.$hap.fa

for hap in dip hap1 hap2
do
  mkdir -p ${pf}_to_$hap && cd ${pf}_to_$hap
  ln -s ../HiFi.fofn input.fofn
  $tools/T2T-Polish/winnowmap/_submit.sh $asm_fa.$hap.fa ${pf}_to_$hap map-pb
  cd ../
done

Submit the same script for ONT reads; using ONT.fofn as input.fofn and replacing map-pb to map-ont.

Illumina reads can be mapped with bwa, as in this script.

3. k-mer based errors and haplotype switches

Let's get back to the upper directory from mapping and build kmer dbs for merqury. Re-use the input.fofn files in mapping. In this example, we will generate k=31 mers for evaluation.

cd ../
mkdir -p meryl && cd meryl
for pf in illumina HiFi ONT
do
  mkdir $pf && cd $pf
  ln -s ../../mapping/$pf.fofn
  $MERQURY/_submit_build.sh 31 $pf.fofn $pf # with slurm
  cd ../
done

Once the meryl dbs are built, they will appear as $pf.k31.meryl. Sym-link them all under meryl for convenience.

cd ../
for pf in illumina HiFi
do
  ln -s $pf/$pf.k31.meryl $pf.meryl
done

For trios

Apply the same for parental datasets; to obtain e.g. pat.k31.meryl and mat.k31.meryl if available. Sym-link as the rest of the meryl dbs.

Generate hapmers:

cd ../
mkdir -p hapmers && cd hapmers
$MERQURY/hapmers.sh ../mat.meryl ../pat.meryl ../illumina.meryl
cd ../

Hybrid db

In addition to the raw reads kmer sets, we need the hybrid kmer db - which is a merged version of illumina and HiFi kmers. See T2T-Polish for more details.

meryl union-sum [ greater-than 1 illumina.k31.meryl ] [ greater-than 1 HiFi.k31.meryl ] output hybrid.meryl

Merqury

Now we are ready to generate kmer profiles.

cd ../
mkdir -p merqury && cd merqury

Run Merqury on each platform: illumina, HiFi, hybrid.

ln -s ../meryl/illumina.meryl
ln -s ../meryl/hapmers/mat.hapmer.meryl mat.meryl
ln -s ../meryl/hapmers/pat.hapmer.meryl pat.meryl

# For trio based evaluation
mkdir -p illumina && cd illumina
$MERQURY/_submit_merqury.sh ../illumina.meryl ../mat.meryl ../pat.meryl ../../assemblies/$asm.hap1.fa ../../assemblies/$asm.hap2.fa illumina
cd ../

# HiFi and Hybrid - for QV estimates
for pf in HiFi hybrid # illumina if no hapmers are available
do
  ln -s ../meryl/$pf.meryl
  mkdir -p $pf && cd $pf
  $MERQURY/_submit_merqury.sh ../$pf.meryl ../../assemblies/$asm.hap1.fa ../../assemblies/$asm.hap2.fa $pf
  cd ../
done

Once Merqury completes, collect *_only.bed and *_only.wig.

cd hybrid
cat *_only.bed | bedtools merge -i - > $asm.error.bed
cd ../

Additionally, collect hapmer tracks for trios:

cd illumina
# Merge hapmer tracks
cat *.mat.wig > $asm.mat.wig
cat *.pat.wig > $asm.pat.wig

# Convert to bw files (optional, to reduce size)
wigToBigWig $asm.mat.wig ../../assemblies/$asm.fa.fai $asm.mat.bw
wigToBigWig $asm.pat.wig ../../assemblies/$asm.fa.fai $asm.pat.bw
cd ../

Link the $asm.error.bed under pattern.

cd ../pattern/
ln -s ../../merqury/hybrid/$asm.error.bed $asm.error.bed
cd ../

4. Issues track

Winnowmap submitter generates filtered bam and paf files at the end. The most informative issues comes from read_to_dip alignments.

This step requires a lot of files to be in place. Make sure they exist in the proper location. Again, don't forget to link $ver.error.bed from Merqury hybrid.

$tools/T2T-Polish/coverage/issues.sh
Usage: ./issues.sh in.paf name ver platform pattern
  in.paf   input paf file
  name     name to appear on the .wig files
  ver      assembly version
  platform HiFi, ONT or Hybrid
  pattern  path to pattern folder

Required: pattern/ver/, pattern/ver.bed, pattern/ver.error.bed, pattern/ver.exclude.bed, pattern/ver.telo.bed

The above script generates a lot of .wig files and .bed files, as well as useful coverage related statistics. Refer to the coverage description for more details. ver should match as it appears in the pattern.

for pf in HiFi ONT
do
  out=${pf}_to_dip
  mkdir -p $out && cd $out
  paf=`ls ../../mapping/$out/$out.*pri.paf`
  ln -s $paf
  paf=`basename $paf`
  $tools/T2T-Polish/coverage/issues.sh $paf $out $ver $pf ../../pattern
  cd ../
done

5. Evaluate on IGV

Load the *.issues.bed and *.cov.wig on IGV, along with the pattern *.bw files, and enjoy(?) reading the error profiles!

Here is another IGV session file containing all tracks for the Bighorn assembly evaluation, as in the slides.

Next steps

Use the alignments to call variants, inspect, and create correction .vcf files. Alternatively, the alignments could get further corrected using SeqPhase and polished with DeepPolisher. Note that DeepPolisher currently only works for HiFi reads.