fuggersbergerdavid edited this page Jul 3, 2013 · 14 revisions

Unclonable regions in C. burnetii

description

Accounts

ssh [email protected] -p 443
ssh ibisdb02
cd /dbraid

Updates uncloneableRegions

Tuesday 21.5., Wednesday 22.5., Friday 24.5.

General:

  • getting rid of methods not used (window, forward, reverse, start, stop coverage etc) and corresponding outputs.

Function: startEnd

  • maxLaenge (maximum length) of templates changed from 2 × library size to library size + 2.5 × Stdev
  • number of ti associated with a template is now calculated by pattern matching instead of dividing by 9, because ti lengths differ (9 or 10)
  • handling the case of FF mates and RR mates, with output in Errors.txt
  • adding invalid pairs to Errors.txt
  • output of valid circular read pairs in Ringschluss.txt
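
The length and orientation checks above can be sketched as follows. This is only an illustration under assumptions, not the actual startEnd code: the class name, method names and the Verdict enum are hypothetical, strands are assumed to be encoded as 'F' and 'R', and the handling of circular pairs (Ringschluss.txt) is left out.

```java
// Hypothetical sketch of the mate-pair checks described for startEnd.
final class MatePairCheck {

    // Verdict names mirror the error labels used in the Errorlog notes.
    enum Verdict { VALID, MAXLAENGE_ERROR, BEIDE_F_ERROR, BEIDE_R_ERROR }

    // New cutoff: library size + 2.5 * standard deviation.
    static double maxTemplateLength(double librarySize, double stdev) {
        return librarySize + 2.5 * stdev;
    }

    // strand1/strand2 are assumed to be 'F' or 'R'.
    static Verdict classify(char strand1, char strand2, int templateLength,
                            double librarySize, double stdev) {
        if (strand1 == 'F' && strand2 == 'F') {
            return Verdict.BEIDE_F_ERROR;   // both mates forward
        }
        if (strand1 == 'R' && strand2 == 'R') {
            return Verdict.BEIDE_R_ERROR;   // both mates reverse
        }
        if (templateLength > maxTemplateLength(librarySize, stdev)) {
            return Verdict.MAXLAENGE_ERROR; // template longer than the cutoff
        }
        return Verdict.VALID;
    }
}
```

For a library of size 4000 with Stdev 300 the cutoff is 4750, so a template of length 4500 passes and one of length 5000 is reported as too long.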

Function: minCompCov

  • cutoff for template coverage of uncloneable regions from 9 to 10
  • within regions of interest we don't look at every position but only at every 10th. Runtime reduced from ~1 h to ~20 min.
  • adding feature: detecting uncloneable regions by looking for regions with low coverage and no annotated protein. If a region of minimum length 20 is not covered by any template, try extending the region by 10. Output in MinRegions.txt.
  • adding feature: detecting regions of decreased coverage by looking for regions with low coverage and no annotated protein. If a region of minimum length 20 is not covered by more than 2 templates, try extending the region by 10. Output in DecreasedRegions.txt.
  • adding feature: detecting proteins of decreased coverage by looking for proteins with low coverage that are covered by only 1 or 2 templates. Output in DecreasedProt.txt. (Issue: no id, funct, hypothetical etc!)
  • adding feature: being able to show the coverage of decreased regions and proteins in the plot
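
A minimal sketch of the region search described above, assuming the template coverage is available as one int per contig position. The class and method names are hypothetical, and the "no protein annotated" filter is omitted. With maxCov = 0 it corresponds to the MinRegions case, with maxCov = 2 to the DecreasedRegions case.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: find low-coverage regions with a seed window
// (minLen, e.g. 20) that is extended in steps (step, e.g. 10).
final class CoverageRegions {

    // True if every position in [from, to) has coverage <= maxCov.
    private static boolean allAtMost(int[] cov, int from, int to, int maxCov) {
        for (int i = from; i < to; i++) {
            if (cov[i] > maxCov) {
                return false;
            }
        }
        return true;
    }

    // Returns half-open regions {start, end} with coverage <= maxCov.
    // Seeds are only tried at every step-th position, so region starts
    // are resolved to multiples of step, not to single bases.
    static List<int[]> find(int[] cov, int maxCov, int minLen, int step) {
        List<int[]> regions = new ArrayList<>();
        int i = 0;
        while (i + minLen <= cov.length) {
            if (!allAtMost(cov, i, i + minLen, maxCov)) {
                i += step;               // no seed here, move to next sample
                continue;
            }
            int end = i + minLen;
            // Extend the region by step as long as the extension is also low.
            while (end + step <= cov.length
                    && allAtMost(cov, end, end + step, maxCov)) {
                end += step;
            }
            regions.add(new int[] {i, end});
            i = end;
        }
        return regions;
    }
}
```

Because seeds and extensions move in steps of 10, a low-coverage stretch is reported slightly shorter than it really is at both ends; that is the price of the faster scan.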

Monday 27.5.

Function: statistik

  • adding feature: output number of single reads, multiple templates and valid templates.

Plot:

  • modifications concerning input format and colour scheme. No libraries are needed any more; before: perl data2plot.pl --organism --number of contigs --Library1 --Library2, now: perl data2plot.pl --organism --number of contigs

Tuesday 28.5.

Function: TraceInfo

  • now in own class

Function: startEnd

  • adding feature: disregarding mate-pairs if the start or stop position of a read is negative or greater than the contig length. Output in Errorlog as pos-Error.
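
The position check can be written down in one line. This is a sketch only; the class and method names are hypothetical, and whether positions are counted from 0 or 1 in the real code is not stated here.

```java
// Hypothetical sketch of the pos-Error condition.
final class PositionCheck {

    // True if start or stop is negative or greater than the contig length.
    static boolean hasPosError(int start, int stop, int contigLength) {
        return start < 0 || stop < 0
                || start > contigLength || stop > contigLength;
    }
}
```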

Function: statistik

  • adding feature: output of runtime, date, mean coverage, mean insert size compared to libraries, and number of invalid mate-pairs, i.e. mate-pairs that are in Errorlog due to Maxlaenge-Error, Beide-F-Error, Beide-R-Error or pos-Error.

Monday 03.06.

Function: minCompCov

  • increased decreased-coverage cutoff to 8.
  • fixed bug responsible for not showing COXBURSA331_A1292 in plot.

Monday 17.06.

Function: main/lesenVonASSEMBLY

  • replaced lesenVonASSEMBLY with lesenSam.einlesenSam
  • adding feature: output NichtGemappt.txt, which shows unmapped reads; these are now the reads with pos 0 in the .sam file. Formerly: reads which are in TRACEINFO.xml but not in ASSEMBLY.xml.

Tuesday 18.06.

Function: main

  • now needs sorted .sam file as input

Function: startEnd

  • fixed Ringschluss.txt also including FR and not only RF

Friday 21.06.

Function: main

  • added and edited NichtGemapped.txt output, which now shows unmapped valid reads

Monday 01.07.

Function: main

  • now takes cutoff value as input

Function: statistik

  • added several features concerning the mapping of mate reads. statistik is now only representative of mate.sam. Single reads in statistik.txt are actually mate reads whose mate is mapped in another contig and which are therefore seen as single (thus #single reads = #reads with mate in other contig).
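
One way to count such reads directly from a .sam file is to compare the RNAME column (3rd) with the RNEXT column (7th); in SAM, "=" in RNEXT means the mate is on the same reference and "*" means no information. The sketch below is an assumption about how this could be done, not the statistik code itself; the class and method names are hypothetical.

```java
// Hypothetical sketch: count reads whose mate is mapped to another contig.
final class MateContigStats {

    static int countMatesOnOtherContig(Iterable<String> samLines) {
        int count = 0;
        for (String line : samLines) {
            if (line.isEmpty() || line.startsWith("@")) {
                continue;                 // skip header lines
            }
            String[] f = line.split("\t");
            if (f.length < 11) {
                continue;                 // not a complete alignment line
            }
            String rname = f[2];          // contig of this read
            String rnext = f[6];          // contig of the mate
            boolean mateElsewhere = !rnext.equals("=")
                    && !rnext.equals("*")
                    && !rnext.equals(rname);
            if (!rname.equals("*") && mateElsewhere) {
                count++;
            }
        }
        return count;
    }
}
```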

Mapping

Week 03.06. - 09.06. and Week 10.06. - 16.06. and 17.06. - 23.06

  • downloaded .fasta, .qual and .xml files from NCBI TRACE
  • wrote class FastQual.java, which reads the .fasta, .qual and .xml files and writes .fasta and .qual files for single, mate (divided into a forward file, mate_f, and a reverse file, mate_r), multi, finishing, rr-reads and ff-reads (see Issues). No file for templates with finishing reads, i.e. non pre-assembled pseudo reads (not necessary?).
  • perl script allToFastq.pl (parameters: perl allToFastq.pl lib1 lib2), which uses tofastq.pl (parameters: perl tofastq Nameoffasta.fasta) to generate the corresponding .fastq files.
  • we then did:
    • MIRA de novo assembly: for 331 we got 88 contigs in the first step and could assemble further by contig joining (runme.sh), but there is no file that gives us good information about the positions of reads on the contigs
    • MIRA assembly-mapping: a lot of output files, but no file that gives us good information about the positions of reads on the contigs (runme_mapping.sh)
    • SOAPdenovo assembly: had to use the Biocluster BIMSC because it needs 5 GB RAM. Thousands of contigs, because it is not useful for long reads, only for reads < 200.
    • BWA mapping: also used BIMSC because of the previous runtime of days. Concatenated the .fasta files of the contigs to use as backbone for the mapping. The output of the mapping is a useful .sam file. The output can also be inspected with the viewer Tablet, which can generate a file of the mapping as well (right-click contig -> save summary).
  • in order to be able to use .sam files in uncloneableRegions instead of ASSEMBLY.txt we wrote lesenSam.java, which reads .sam files and returns a hashmap of ti -> F/R$Start$Stop. F/R depending on +/- in the .sam file. Sidenote: we looked again at the TRACEINFO.xml and ASSEMBLY.xml specifications after noticing that a lot of reads were not mapped and were therefore in TRACEINFO but not in ASSEMBLY. There seems to be an error concerning tiling_start/stop and consensus_start/stop. The tiling should be included in the consensus, but often it is only partly included and sometimes not at all, which contradicts the specifications by NCBI and should normally lead to rejection of the TIGR submission.
  • first output with the new mapping: in 331 we found for contig 1: with old mapping 0, new 0; for contig 2: with old mapping 47, new 48. So as a first impression the mapping worked fairly well. Reminder: we talked to Jonathan Hoser about the mapping of reads and about reads with no or multiple mates. He suggested finding out whether any filtering was done before, said that multiple reads are unlikely, and that we could try to take the reverse of one of the rr- and ff-mates, map again and also look at the orientation. He also has some scripts to use (Email!). Update: we got the scripts but they don't fit our format. Maybe useful for assembly.
  • new, better .sam file viewer: IGV. It needs a sorted .sam file, so we got samtools to convert to .bam, sort and reconvert to .sam for use in the viewer (howto.txt). Sidenote: display of "view as mates" is bugged, no useful tips in the Google+ support channel. One read is mapped two times? Split due to ring? Also, is only good qual (>20) shown in the viewer, or is it a bug that a read is 1000 long but only 100 in the viewer?
  • lesenSam.java now needs a sorted .sam file as input so that a read that has two occurrences gets fixed
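
A minimal sketch of how a .sam alignment line could be turned into the ti -> F/R$Start$Stop entry. This is not the actual lesenSam code, and it rests on several assumptions: the read name (QNAME) is taken as the ti, strand is read from FLAG bit 0x10 (set = reverse), reads with FLAG bit 0x4 or pos 0 are treated as unmapped, the stop position is derived from the CIGAR string, and on duplicate read names the first occurrence is kept. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of reading .sam lines into ti -> F/R$Start$Stop.
final class SamSketch {

    // Number of reference bases covered by a CIGAR string (M, D, N, =, X).
    static int referenceLength(String cigar) {
        int length = 0;
        int number = 0;
        for (char c : cigar.toCharArray()) {
            if (Character.isDigit(c)) {
                number = number * 10 + (c - '0');
            } else {
                if ("MDN=X".indexOf(c) >= 0) {
                    length += number;
                }
                number = 0;
            }
        }
        return length;
    }

    // Returns "F$start$stop" or "R$start$stop", or null for unmapped reads.
    static String encode(String samLine) {
        String[] f = samLine.split("\t");
        int flag = Integer.parseInt(f[1]);
        int pos = Integer.parseInt(f[3]);      // 1-based leftmost position
        if ((flag & 0x4) != 0 || pos == 0) {
            return null;                       // unmapped
        }
        char strand = (flag & 0x10) != 0 ? 'R' : 'F';
        int stop = pos + referenceLength(f[5]) - 1;
        return strand + "$" + pos + "$" + stop;
    }

    // Builds the map; header lines ("@...") and unmapped reads are skipped.
    static Map<String, String> read(Iterable<String> samLines) {
        Map<String, String> result = new HashMap<>();
        for (String line : samLines) {
            if (line.isEmpty() || line.startsWith("@")) {
                continue;
            }
            String value = encode(line);
            if (value != null) {
                // Assumption: keep the first occurrence of a read name.
                result.putIfAbsent(line.split("\t")[0], value);
            }
        }
        return result;
    }
}
```

Reading the strand from the FLAG field rather than from a +/- column may also be relevant for the orientation problem noted in the To-Do list, since the SAM alignment line itself has no +/- column.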

Week 24.06. - 30.06. and Week 01.07. - 07.07.

  • built pipeline pipeline.pl. Needs in one folder: pipeline.pl, /Input_Files, /Output_Files, /Code. /Input_Files needs to contain ASSEMBLY.xml (we could change that and use index.fasta instead), .fasta, .qual, index.fasta (>NZ_AAUP02000001 Sequ >Header2 Sequ etc), INSD.xml and fixedxml.xml. Parameters: perl pipeline.pl -OrgName -AnzahlLib -Lib1 -Lib2

  • take a close look at the xml file before running: look for updates, new problems, wrong libraries etc.!

  • To-Do:

  • update statistik.txt: take in other .sam files, fix multiple reads, look at the positions that are shown

  • orientations in .sam files (see issues) are wrongly extracted; fix that

  • make new output with new header format for .fastq (see issues) and therefore for all other files

  • make new output with clipped .fastq

  • integrate finishing reads blast in pipeline. We might need to use other parameters than bwa mem

Assembly

  • we scp'd Q154 and Q212 from Helmholtz but didn't find a way to use the .g1 files