New data formats: Project 1 Create a larger nucleotide dictionary #24

Open
iamciera opened this issue Aug 21, 2018 · 13 comments
@iamciera
Member

iamciera commented Aug 21, 2018

2. New data formats

Most of the energy and creativity in machine learning projects goes into organizing the data, optimizing the data structures, and extracting features. That said, the current data structure is not what we will be using in the future.

See the projects below and pick whichever seems most interesting to you. A good strategy is to start with a small subset of the data, run experiments and test accuracy, and only then build the whole dataset and run it on a larger machine.

  • Project 1: Create a larger nucleotide dictionary. It would be great if the neural network found which motifs were important on its own, without us giving the sequence a score. One way to do this would be to create a larger dictionary for the one-hot encoding step. Instead of single letters being the code, the motifs would be embedded into the dictionary. Please read Ng et al., 2017. This approach is based on the popular word2vec strategy used in Natural Language Processing. Then use a sliding window to assign dictionary words to the sequences. The hardest part of this strategy is that the dataset will be HUGE, so we need to find a way to prioritize which sequences are the most useful. One way would be to build the dictionary from known TFBS, or from over-represented motifs found in these sequences. A lot to work with.
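To make the size concern concrete, here is a minimal sketch of a k-mer dictionary and a sliding-window encoding. The function names (`build_kmer_vocab`, `encode`) and the choice of k = 3 are illustrative assumptions, not part of the project code. A dictionary of all k-mers has 4**k entries, which is why prioritizing words matters as k grows.

```python
from itertools import product


def build_kmer_vocab(k, alphabet="ACGT"):
    """Map every possible k-mer to an integer index (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}


def encode(sequence, vocab, k):
    """Slide a window of width k over the sequence and look up each window.

    Windows containing characters outside the alphabet (e.g. N or gaps)
    are mapped to -1 (an arbitrary placeholder chosen for this sketch).
    """
    return [vocab.get(sequence[i:i + k], -1)
            for i in range(len(sequence) - k + 1)]


vocab = build_kmer_vocab(3)
print(len(vocab))                # 64 possible 3-mers
print(encode("ACGT", vocab, 3))  # indices for ACG and CGT
```

The indices could then feed a one-hot or embedding layer. A dictionary restricted to known TFBS words would replace `build_kmer_vocab` with a much smaller hand-picked mapping.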
@iamciera iamciera changed the title Creating new data formats New data formats: Project 1 Create a larger nucleotide dictionary Aug 22, 2018
@iamciera
Member Author

One of the problems: to proceed, I think it would be interesting to make the dictionary very small and made up of known TFBS. This could be implemented fairly easily and would keep the data relatively small. We can then go forward depending on size.

  1. Would it be possible to implement nucleotide sequences AND dictionary?

@iamciera
Member Author

iamciera commented Oct 2, 2018

About

See above for more context.

Below is a description of what could be done. Please read through and make sure you understand each step. Find the area that most interests you and come up with a strategy.

Step 1a: Simple Position Weight Matrix (PWM) Input for Dictionary

One of the simpler ways to implement a larger dictionary for mapping is to get the top motifs from an example file like this. It would be great if we automated this, but we could start by hand-coding the words. Example PWMs are found in /data/input/jaspar_pwm.

## PWM example: giant

>MA0447.1   gt
A  [    28      0      1     54      0      7      0     55     60      2 ]
C  [     5      0      1      0     53      0      6      3      0     25 ]
G  [    25      0      3      6      0     53      0      1      0      5 ]
T  [     2     60     55      0      7      0     54      1      0     28 ]

Words would be something like this: giant = [ATTACGTAAC, GTTACGTAAC, ATTACGTAAT, GTTACGTAAT]. Here we ignore the letters that make up only a small percentage at each position. Each PWM would be another list. The lists have to be kept separate and represented separately, as though each word in a list is a synonym.
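As a rough sketch of how the hand-coding could be automated: keep, at each position, only the letters whose share of the column is above a cutoff, then enumerate every combination. The function name `pwm_to_words` and the 20% cutoff (`min_fraction`) are assumptions made for illustration; the counts are the giant matrix shown above.

```python
from itertools import product

# Counts copied from the giant (MA0447.1) matrix above.
GIANT_COUNTS = {
    "A": [28, 0, 1, 54, 0, 7, 0, 55, 60, 2],
    "C": [5, 0, 1, 0, 53, 0, 6, 3, 0, 25],
    "G": [25, 0, 3, 6, 0, 53, 0, 1, 0, 5],
    "T": [2, 60, 55, 0, 7, 0, 54, 1, 0, 28],
}


def pwm_to_words(counts, min_fraction=0.2):
    """Enumerate all words whose letters each reach min_fraction of a column."""
    length = len(counts["A"])
    choices = []
    for pos in range(length):
        total = sum(counts[nt][pos] for nt in "ACGT")
        # Drop letters that make up only a small share of this position.
        keep = [nt for nt in "ACGT" if counts[nt][pos] / total >= min_fraction]
        choices.append(keep)
    return ["".join(letters) for letters in product(*choices)]


giant = pwm_to_words(GIANT_COUNTS)
print(giant)
```

With this cutoff only the first and last positions are ambiguous (A/G and C/T), so the result is the same four words listed above. A lower cutoff would let in more letters and the word list grows multiplicatively, so the cutoff probably needs tuning per PWM.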

Step 2: Make dictionary based on words
Step 3: Search each sequence on the forward and reverse strands using a sliding window.
- Make sure that you keep this small, like 1,000 seq to start
- With the sliding window approach we lose the actual position. Our question becomes concerned only with which known words occur, not where. Is there a way to use the sliding window to map words onto a backbone position of the nucleotide letters?
Step 4: Score Words
- What would this look like? Steps 3 and 4 are intimately linked.
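One possible way to do Step 3 without losing position, sketched under assumptions: record the start index and strand of every hit, then project the hits back onto a per-nucleotide track that could sit next to the one-hot letters as an extra channel. The names `find_words` and `motif_track` are hypothetical, and the binary track is just one simple choice of score for Step 4.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase ACGT string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_words(sequence, words):
    """Slide a window over the sequence and report (start, strand, word).

    A '-' hit means the word occurs on the reverse strand; start is still
    given in forward-strand coordinates.
    """
    words = set(words)
    hits = []
    for k in sorted({len(w) for w in words}):
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            if window in words:
                hits.append((start, "+", window))
            rc = reverse_complement(window)
            if rc in words:
                hits.append((start, "-", rc))
    return hits


def motif_track(sequence, hits):
    """Mark every nucleotide covered by at least one hit with 1, else 0."""
    track = [0] * len(sequence)
    for start, _strand, word in hits:
        for i in range(start, start + len(word)):
            track[i] = 1
    return track


seq = "CCATTACGTAACGG"
hits = find_words(seq, ["ATTACGTAAC"])
print(hits)
print(motif_track(seq, hits))
```

Note that a word equal to its own reverse complement will be reported on both strands at the same position, and that the reverse complement of one synonym can be another synonym in the same list (this happens for giant), so counts may need de-duplicating before scoring.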

Step 1b: Alternative route to learn the words in the dictionary rather than define them

  • Find most common words?
  • Would we do this on both types of sequences? The negative and positive functional sequences?
  • Could we just use a de novo motif discovery program?
  • Skip-gram vs. "bag of words"
  • Bag of words - order and structure of words is discarded.
  • We will have to learn about handling sparse vectors?
  • Could we use groups of motifs? This could be very interesting.
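For the "most common words" and "bag of words" ideas above, a minimal sketch could look like the following. The function names and the choice of k = 4 are illustrative assumptions. Counting would be run separately on the positive and negative sequence sets so the two can be compared.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def bag_of_words(sequence, vocabulary, k):
    """Count vector over a fixed vocabulary of k-mers; word order is discarded."""
    counts = count_kmers([sequence], k)
    return [counts[word] for word in vocabulary]


# Toy example with made-up sequences.
counts = count_kmers(["ACGTACGT"], 4)
print(counts.most_common(3))
print(bag_of_words("ACGTACGT", ["ACGT", "TTTT"], 4))
```

With a large vocabulary most entries of the bag-of-words vector are zero, which is where the sparse-vector question comes in.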

Repeat Steps 2 - 4: may need to modify

@thethomaslane thethomaslane self-assigned this Oct 2, 2018
@thethomaslane
Contributor

I would like to work on Step 1b. I am looking into the de novo motif discovery programs, but I would like to focus first on finding the most common words, as I believe I can write that program fairly simply.

@zhanyuanucb zhanyuanucb self-assigned this Oct 5, 2018
@zhanyuanucb
Contributor

I would like to start with Step 1a. But in order to apply this dictionary to encode our nucleotide sequences for classification, we need to assume that the frequency of motif is correlated to whether a sequence expresses in the early embryo.

@iamciera
Member Author

iamciera commented Oct 5, 2018

we need to assume that the frequency of motif is correlated to whether a sequence expresses in the early embryo

Yes @zhanyuanucb, the PWM sequences I would provide would all be verified TFBS that are known to be in enhancers that direct gene expression in the early Drosophila embryo.

@zhanyuanucb
Contributor

Cool.
I also looked up the Wikipedia article on PWMs (https://en.wikipedia.org/wiki/Position_weight_matrix), and found that a log-likelihood transform is applied to the entries in the matrix. I guess we can do the same thing and use the transformed entries to decide whether a nucleotide should be considered at a given position (for example, by setting the threshold to 0)?
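A rough sketch of that idea, under assumptions: convert counts to probabilities with a pseudocount, take log2 of the ratio to a background frequency, and keep letters that score above 0. The uniform background of 0.25, the pseudocount of 1 per cell, and the function names are all assumptions made for this example; the counts are the giant matrix from the issue description.

```python
import math

# Counts copied from the giant (MA0447.1) matrix in the issue description.
GIANT_COUNTS = {
    "A": [28, 0, 1, 54, 0, 7, 0, 55, 60, 2],
    "C": [5, 0, 1, 0, 53, 0, 6, 3, 0, 25],
    "G": [25, 0, 3, 6, 0, 53, 0, 1, 0, 5],
    "T": [2, 60, 55, 0, 7, 0, 54, 1, 0, 28],
}


def log_odds(counts, background=0.25, pseudocount=1.0):
    """Return log2(p / background) per letter and position.

    The pseudocount avoids log(0) for letters never seen at a position.
    """
    length = len(counts["A"])
    scores = {nt: [] for nt in "ACGT"}
    for pos in range(length):
        total = sum(counts[nt][pos] for nt in "ACGT") + 4 * pseudocount
        for nt in "ACGT":
            p = (counts[nt][pos] + pseudocount) / total
            scores[nt].append(math.log2(p / background))
    return scores


def allowed_letters(scores, threshold=0.0):
    """List, per position, the letters scoring above the threshold."""
    length = len(scores["A"])
    return [[nt for nt in "ACGT" if scores[nt][pos] > threshold]
            for pos in range(length)]


print(allowed_letters(log_odds(GIANT_COUNTS)))
```

With a threshold of 0 a letter is kept only if it is more frequent than the background, so for giant only the first and last positions keep two letters each.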

@thethomaslane
Contributor

http://meme-suite.org/
This has a lot of tools that could be useful for finding motifs

@zhanyuanucb
Contributor

I've implemented a notebook that creates words from the position weight matrices. The next step is to apply those words to a toy sample.

@zhanyuanucb
Contributor

@thethomaslane
Contributor

I made a notebook that can create words from motifs found using the MEME suite. I made sure this would output the same data format as Sean's program.

https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/Generate_words_from_fasta_motifs.ipynb

@thethomaslane
Contributor

Here is a sample output from the MEME motif discovery. This motif was checked against the JASPAR database and is found in Drosophila.
https://github.com/DiscoveryDNA/team_neural_network/blob/master/data/input/motif_1_fasta.txt

@zhanyuanucb
Contributor

Sample output of motif counts and location information generated by the following notebook:
https://github.com/DiscoveryDNA/team_neural_network/blob/master/code/utility/count_motifs.ipynb

@zhanyuanucb
Contributor
