New data formats: Project 1 Create a larger nucleotide dictionary #24

iamciera · 2018-08-21T18:47:49Z

2. New data formats

Most of the energy and creativity in machine learning projects goes into organizing the data, optimizing the data structures, and extracting features. That being said, the current data structure is not what we will be using in the future.

See the projects below and pick whichever seems most interesting to you. A good strategy is to start with only a small subset of the data, run experiments and test accuracy, and then, and only then, build the whole dataset and run it on a larger machine.

Project 1: Create a larger nucleotide dictionary. It would be great if the neural network found which motifs were important on its own, without us giving the sequence a score. One way to do this would be to create a larger dictionary for the one-hot encoding step. Instead of single letters being the code, it would be better if the motifs were embedded into the dictionary. Please read Ng et al., 2017. This is based on the popular word2vec strategy used in Natural Language Processing. Then use a sliding window to assign the words to the sequences. The hardest part of this strategy is that the dataset will be HUGE. We need to find a way to prioritize which sequences are the most useful. One way would be to make a dictionary based on known TFBS, or on over-represented motifs found in these sequences. A lot to work with.
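To make the size concern concrete, here is a minimal sketch of one-hot encoding with a k-mer dictionary and a sliding window. It is not the project's encoder; the function names and the choice of k = 3 are assumptions for illustration only.

```python
from itertools import product

def build_kmer_vocab(k):
    """Map every possible k-mer over ACGT to an integer index (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}

def one_hot_kmers(seq, k, vocab):
    """Slide a window of width k over seq and one-hot encode each k-mer."""
    rows = []
    for start in range(len(seq) - k + 1):
        row = [0] * len(vocab)
        kmer = seq[start:start + k]
        if kmer in vocab:  # windows with N or other symbols stay all-zero
            row[vocab[kmer]] = 1
        rows.append(row)
    return rows

vocab = build_kmer_vocab(3)
encoded = one_hot_kmers("ATTACG", 3, vocab)
print(len(vocab), len(encoded))  # vocabulary size, number of windows
```

The vocabulary grows as 4^k, so each window of a 10-letter dictionary would need over a million columns. That is the motivation for restricting the dictionary to known TFBS or over-represented motifs.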

iamciera · 2018-09-28T21:21:12Z

Regarding one of the problems (dataset size): to proceed, I think it would be interesting to make the dictionary very small and made up only of known TFBS. This could be implemented pretty easily and would keep the data relatively small. We can then go forward depending on size.

Would it be possible to implement nucleotide sequences AND dictionary?

iamciera · 2018-10-02T18:40:27Z

About

See above for more context.

Below is a description of what could be done. Please read through and make sure you understand each step. Find the area that most interests you and come up with a strategy.

Step 1a: Simple Position Weight Matrix (PWM) Input for Dictionary

One of the simpler ways to implement a larger dictionary for mapping is to get the top motifs from an example file like this. It would be great if we automated this, but we could start by hand-coding the words. Example PWMs are found in /data/input/jaspar_pwm.

## PWM example: giant

>MA0447.1   gt
A  [    28      0      1     54      0      7      0     55     60      2 ]
C  [     5      0      1      0     53      0      6      3      0     25 ]
G  [    25      0      3      6      0     53      0      1      0      5 ]
T  [     2     60     55      0      7      0     54      1      0     28 ]

Words would be something like this: giant = [ATTACGTAAC, GTTACGTAAC, ATTACGTAAT, GTTACGTAAT]. Each PWM would be its own list. The lists have to be kept separate and represented separately, as though the words within one list were synonyms. We would ignore the letters that make up only a small percentage at each position.

Step 2: Make dictionary based on words
Step 3: Search sequence on forward and reverse strand using sliding window.
- Make sure that you keep this small, like 1,000 seq to start
- With the sliding window approach we lose the actual position. Our question becomes concerned only with which known words occur, not where. Is there a way to use the sliding window to map hits onto a backbone position of the nucleotide letters?
Step 4: Score Words
- what would this look like? Step 3 and 4 are intimately linked.
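One possible shape for Steps 2 to 4, sketched under assumptions: the dictionary maps a motif name to its list of words, the scan records the start position and strand of every hit so position is not lost, and the "score" is simply a per-motif count. The `toy` entry and all function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def scan_sequence(seq, dictionary):
    """Return (start, motif_name, strand) for every word hit in seq.

    A '-' hit means the window equals the reverse complement of a word;
    start is always given in forward-strand coordinates.
    """
    hits = []
    for name, words in dictionary.items():
        fwd = set(words)
        rev = {reverse_complement(w) for w in words}
        for width in {len(w) for w in words}:
            for start in range(len(seq) - width + 1):
                window = seq[start:start + width]
                if window in fwd:
                    hits.append((start, name, "+"))
                if window in rev:
                    hits.append((start, name, "-"))
    return sorted(hits)

def score_words(hits):
    """Simplest possible score: number of hits per motif."""
    return Counter(name for _, name, _ in hits)

dictionary = {
    "giant": ["ATTACGTAAC", "GTTACGTAAC", "ATTACGTAAT", "GTTACGTAAT"],
    "toy": ["AAAC"],  # hypothetical non-palindromic word, for illustration
}
hits = scan_sequence("GTTTCCATTACGTAACGG", dictionary)
print(hits)
print(score_words(hits))
```

One thing this surfaces: the four giant words form a set that is its own reverse complement, so every giant hit is reported on both strands. Counts for palindromic motifs like this may need to be halved or deduplicated by position.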

Step 1b. Alternative route to learn words in dictionary rather than define them

Find most common words?
Would we do this on both types of sequences? The negative and positive functional sequences?
Could we just use a de novo motif discovery program?
Skip-gram vs "bag of words"
Bag of words - order and structure of words is discarded.
We will have to learn about handling sparse vectors?
Could we use groups of motifs? This could be very interesting.
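For the "most common words" question, a minimal sketch is to count k-mers in the positive and negative sets and rank them by the difference in relative frequency, then represent a sequence as a bag-of-words count vector. The function names, the ranking rule, and the toy sequences are all assumptions; a de novo motif discovery program would replace the ranking step.

```python
from collections import Counter

def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            counts[seq[start:start + k]] += 1
    return counts

def enriched_kmers(positive, negative, k, top_n=5):
    """Rank k-mers by (positive frequency - negative frequency)."""
    pos = count_kmers(positive, k)
    neg = count_kmers(negative, k)
    pos_total = sum(pos.values()) or 1
    neg_total = sum(neg.values()) or 1
    diff = {kmer: pos[kmer] / pos_total - neg[kmer] / neg_total for kmer in pos}
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(diff, key=lambda kmer: (-diff[kmer], kmer))[:top_n]

def bag_of_words(seq, vocabulary):
    """Count vector over vocabulary; word order and position are discarded."""
    counts = count_kmers([seq], len(vocabulary[0]))
    return [counts[word] for word in vocabulary]

# Toy sequences, not project data.
positive = ["ACGTACGT", "TTACGTAA"]
negative = ["AAAAAAAA", "CCCCCCCC"]
top = enriched_kmers(positive, negative, k=4, top_n=3)
print(top)
print(bag_of_words("ACGTACGT", top))
```

Most entries of such a vector are zero once the vocabulary is large, which is where the sparse-vector question above comes in.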

Reiterate over Step 2 - 4: May need to modify

thethomaslane · 2018-10-02T21:01:42Z

I would like to work on Step 1b. I am looking into the de novo motif discovery programs, but I would like to focus first on finding the most common words, as I believe I can write that program fairly simply.

zhanyuanucb · 2018-10-05T18:01:14Z

I would like to start with Step 1a. But in order to apply this dictionary to encode our nucleotide sequences for classification, we need to assume that the frequency of a motif is correlated with whether a sequence is expressed in the early embryo.

iamciera · 2018-10-05T18:36:26Z

we need to assume that the frequency of motif is correlated to whether a sequence expresses in the early embryo

Yes @zhanyuanucb, the PWM sequences I would provide you would all be verified TFBS that are known to be in enhancers that direct gene expression in the early Drosophila embryo.

zhanyuanucb · 2018-10-05T18:53:41Z

Cool.
I also looked up the wiki for PWM (https://en.wikipedia.org/wiki/Position_weight_matrix), and I found that they apply a log-likelihood transform to the entries in the matrix. I guess we can do the same thing and use the transformed entries to decide whether a nucleotide should be considered at a given position (for example, by setting the threshold to 0)?
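A sketch of that idea, assuming a uniform background of 0.25 per base and a small pseudocount to avoid taking the log of zero (both values are assumptions, as are the function names). A score above 0 means the base is more frequent at that position than the background would predict.

```python
import math

# Counts copied from the MA0447.1 (gt) matrix above.
GIANT_PWM = {
    "A": [28, 0, 1, 54, 0, 7, 0, 55, 60, 2],
    "C": [5, 0, 1, 0, 53, 0, 6, 3, 0, 25],
    "G": [25, 0, 3, 6, 0, 53, 0, 1, 0, 5],
    "T": [2, 60, 55, 0, 7, 0, 54, 1, 0, 28],
}

def log_odds(pwm, pseudocount=0.25, background=0.25):
    """Convert counts to log2(probability / background) per base and position."""
    scores = {b: [] for b in "ACGT"}
    for pos in range(len(pwm["A"])):
        total = sum(pwm[b][pos] for b in "ACGT") + 4 * pseudocount
        for b in "ACGT":
            prob = (pwm[b][pos] + pseudocount) / total
            scores[b].append(math.log2(prob / background))
    return scores

def allowed_bases(scores, threshold=0.0):
    """For each position, list the bases whose score exceeds the threshold."""
    length = len(scores["A"])
    return ["".join(b for b in "ACGT" if scores[b][pos] > threshold)
            for pos in range(length)]

print(allowed_bases(log_odds(GIANT_PWM)))
```

For the giant matrix, a threshold of 0 keeps two bases at the first and last positions and one base everywhere else, which reproduces the four giant words from Step 1a.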

thethomaslane · 2018-10-05T23:26:26Z

http://meme-suite.org/
This has a lot of tools that could be useful for finding motifs

zhanyuanucb · 2018-10-19T12:03:10Z

I've implemented the notebook to create words from those position weight matrices. The next step is to apply those words to a toy sample.

zhanyuanucb · 2018-10-19T17:04:44Z

A link to the notebook:
https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/Generate_words_by_motif.ipynb

thethomaslane · 2018-10-26T19:28:46Z

I made a notebook that can create words from motifs found using the MEME suite. I made sure this would output the same data format as Sean's program.

https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/Generate_words_from_fasta_motifs.ipynb

thethomaslane · 2018-11-09T01:31:25Z

Here is a sample output from the MEME motif discovery. This one was checked against the JASPAR database and is found in Drosophila.
https://github.com/DiscoveryDNA/team_neural_network/blob/master/data/input/motif_1_fasta.txt

zhanyuanucb · 2018-11-09T09:35:04Z

Sample output of motif counts and location information generated by the following notebook:
https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/count_motifs.ipynb

zhanyuanucb · 2018-11-16T23:29:00Z

Feel free to check the reformatted data...
https://github.com/DiscoveryDNA/team_neural_network/blob/master/data/input/motif_freq.csv

iamciera changed the title ~~Creating new data formats~~ New data formats: Project 1 Create a larger nucleotide dictionary Aug 22, 2018

thethomaslane self-assigned this Oct 2, 2018

zhanyuanucb self-assigned this Oct 5, 2018
