dssr2017-huaqiedward

Our model organism is Trichomonas vaginalis, which causes the most common non-viral sexually transmitted infection worldwide. The whole genome sequence of its G3 strain is available, along with some transcriptomic, proteomic and microRNA data. It is widely accepted that the secondary structure of an RNA molecule affects its transcription, splicing, translation, turnover and localization, and many studies now focus on the regulatory effects mediated by different RNA structural patterns. The objective of my research is to scan the T. vaginalis transcriptome, calculate the average regional folding energies of its messenger and non-coding RNA transcripts, and identify characteristic structural signatures that may affect gene expression and regulation. To achieve this, I plan to adopt a “sliding window scheme”. The detailed research proposal follows:

1. Isolate three datasets of Trichomonas vaginalis messenger RNAs (mRNA) with varying degrees of reliability from TrichDB. Dataset 1 comprises all annotated reading frames within the TrichDB database; Dataset 2 contains all annotated reading frames with either evidence of transcription, from an expressed sequence tag (EST) database, or evidence of protein expression, from mass spectrometry (MS) data derived from proteomic studies; Dataset 3 is the union of both types of evidence used in Dataset 2.

2. Parse the downloaded files and remove incomplete sequences, rRNA and tRNA genes, and sequences without a canonical ATG start codon.

3. Interrogate the 50 nucleotides upstream of the start codon for putative core promoter elements. Divide the sequence data into 3 subgroups based on the core promoter elements:

a. The inr core promoter element: those with ‘TCANWY’ consensus sequences within 30 nucleotides of the start codon.

b. The m5 core promoter element: those with the invariant ‘CCTTT’ pentanucleotide motif within 20 nucleotides of the start codon.

c. m3 and m5 elements in tandem: those with an m3 element (DRCSGYTD) within 30 nucleotides of a unique m5 element.

4. Process databases to create putative transcripts. Both inr and m5 direct their respective RNA polymerases to a specific transcription start site (TSS) – the adenine of ‘TCANWY’ in the case of the inr, and the second cytosine of ‘CCTTT’ in the case of m5. Use this information to trim each sequence in the 9 unique databases to its putative transcript.

5. Use uclust to cluster sequences with ≥ 85% sequence identity.

6. Split each sequence into fragments following a “sliding window scheme”. Scheme 1: the first window of an mRNA molecule starts at its 5’ cap site and is 40 nucleotides (nt) long; each subsequent window is shifted 1 nt downstream of the previous one. Keep sliding until the first 100 nt of each transcript are covered. Scheme 2: change the window length to 30 nt and the step to 2 nt.

7. Produce randomized sequences for each window in each transcript. Scheme 1: keep the composition of each window unchanged and reshuffle its sequence in 100 different ways. Scheme 2: randomly reshuffle the first 100 nt of each transcript in 1000 different ways, preserving either dinucleotide or mononucleotide composition, then split each shuffled sequence into windows as in step 6.

8. Use the RNAfold program in the ViennaRNA package to calculate the local folding energies (ΔG) of each window and its randomized sequences. Calculate the z-score of the local 5’ stability (Z ΔG) for each sliding window by a formula in the reference.

9. Similarly, calculate the local GC composition for each sliding window and its randomized sequences and get the ZGC value.

10. Use R to draw line charts showing how ΔG, Z ΔG and ZGC change along the first 100 nt. Determine whether the differences between native and randomized sequences are significant.

11. Further divide the sequences into subgroups based on different traits (e.g. with or without miRNA binding sites, low or high expression level, sensitive or insensitive to stimuli) and compare the 5’ structural patterns between each pair of subgroups.
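The promoter classification in steps 3 and 4 could be sketched roughly as follows. This is only an illustration under several assumptions that the proposal does not settle: the function names are hypothetical, a motif counts as "within N nucleotides of the start codon" when it starts inside the last N nt of the 50 nt upstream region, "a unique m5 element" means exactly one ‘CCTTT’ in that region, the m3 to m5 distance is measured start to start, and the tandem class takes precedence over m5, which takes precedence over inr.

```python
import re

# IUPAC nucleotide codes needed for the three motifs in step 3
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "D": "[AGT]", "N": "[ACGT]",
}

def motif_regex(motif):
    # Wrap in a lookahead so overlapping occurrences are all reported
    return re.compile("(?=" + "".join(IUPAC[b] for b in motif) + ")")

INR = motif_regex("TCANWY")
M5 = motif_regex("CCTTT")
M3 = motif_regex("DRCSGYTD")

def hits(regex, seq):
    """Start indices of every occurrence of the motif in seq."""
    return [m.start() for m in regex.finditer(seq)]

def classify_upstream(upstream):
    """Classify the region 5' of the ATG (given 5'->3', DNA alphabet).

    Returns (label, tss_index), where tss_index is the position in
    `upstream` of the putative transcription start site, or (None, None).
    """
    n = len(upstream)
    inr = [i for i in hits(INR, upstream) if i >= n - 30]
    m5_all = hits(M5, upstream)
    m5 = [i for i in m5_all if i >= n - 20]
    m3 = hits(M3, upstream)
    # Assumed precedence: tandem m3/m5, then m5 alone, then inr
    if len(m5_all) == 1 and any(abs(j - m5_all[0]) <= 30 for j in m3):
        return "m3+m5", m5_all[0] + 1   # second cytosine of CCTTT
    if m5:
        return "m5", m5[-1] + 1         # second cytosine of CCTTT
    if inr:
        return "inr", inr[-1] + 2       # adenine of TCANWY
    return None, None

def putative_transcript(upstream, cds):
    """Trim upstream sequence to the putative TSS and prepend it to the CDS."""
    label, tss = classify_upstream(upstream)
    if tss is None:
        return None
    return upstream[tss:] + cds
```

When several candidate elements are present, the sketch picks the one closest to the start codon; that tie-breaking rule is also an assumption and would need to be checked against the literature on T. vaginalis promoters.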
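The windowing and randomization in steps 6 and 7 might look like the sketch below. The helper names are hypothetical, and it assumes that every window must lie entirely inside the first 100 nt of the transcript. Only mononucleotide shuffling is shown; preserving dinucleotide composition needs a dedicated algorithm and is not attempted here.

```python
import random

def sliding_windows(seq, width=40, step=1, span=100):
    """Return (start, window) pairs covering the first `span` nt of seq.

    Defaults follow Scheme 1 (40 nt windows, 1 nt step); use
    width=30, step=2 for Scheme 2 of step 6.
    """
    region = seq[:span]
    return [
        (i, region[i:i + width])
        for i in range(0, len(region) - width + 1, step)
    ]

def shuffled_copies(seq, n=100, rng=None):
    """Return n mononucleotide shuffles of seq (composition is preserved)."""
    rng = rng or random.Random()
    copies = []
    for _ in range(n):
        bases = list(seq)
        rng.shuffle(bases)
        copies.append("".join(bases))
    return copies
```

Under these assumptions, Scheme 1 of step 7 corresponds to calling `shuffled_copies(window, n=100)` on each window, while Scheme 2 corresponds to calling `shuffled_copies(transcript[:100], n=1000)` and then passing each shuffled sequence through `sliding_windows`.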
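For steps 8 and 9, the formula from the reference is not reproduced in this proposal, so the sketch below assumes the conventional z-score: the native value minus the mean of the randomized values, divided by their sample standard deviation. The function names are hypothetical. `mfe_with_vienna` relies on the ViennaRNA Python bindings (`RNA.fold`, which returns a structure and a minimum free energy); the other functions need only the standard library.

```python
import statistics

def gc_fraction(seq):
    """Fraction of G and C bases in a window."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def z_score(native, randomized):
    """Assumed z-score: (native - mean(randomized)) / stdev(randomized).

    Returns NaN when the randomized values have no spread.
    """
    mu = statistics.mean(randomized)
    sd = statistics.stdev(randomized)
    if sd == 0:
        return float("nan")
    return (native - mu) / sd

def mfe_with_vienna(seq):
    """Minimum free energy (kcal/mol) of seq; requires ViennaRNA installed."""
    import RNA  # ViennaRNA Python bindings, imported lazily
    _structure, mfe = RNA.fold(seq)
    return mfe

def window_z_scores(window, shuffled, energy=mfe_with_vienna):
    """Return (Z_dG, Z_GC) for one window and its randomized sequences."""
    z_dg = z_score(energy(window), [energy(s) for s in shuffled])
    z_gc = z_score(gc_fraction(window), [gc_fraction(s) for s in shuffled])
    return z_dg, z_gc
```

One consequence worth keeping in mind: if the randomized sequences are produced by shuffling within each window (Scheme 1 of step 7), their GC content is identical to that of the native window, so ZGC is undefined there. ZGC is only informative when the whole 100 nt region is shuffled before windowing (Scheme 2).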
