New data formats: Project 1 Create a larger nucleotide dictionary #24
Comments
One of the problems: I think that, to proceed, it would be interesting to make the dictionary very small and made up of known transcription factor binding sites (TFBS). This could be implemented pretty easily and would keep the data relatively small. We can then go forward depending on size.
|
About
See above for more context. Below is a description of what could be done. Please read through and make sure you understand each step. Find the area that most interests you and come up with a strategy.
Step 1a: Simple Position Weight Matrix (PWM) Input for Dictionary
One of the simpler ways to implement a larger dictionary for mapping is to get the top motifs from an example file like this. It would be great if we automated this, but we could start by hand-coding the words. Example PWMs are found in /data/input/jaspar_pwm.
Words would be something like this:
Step 2: Make dictionary based on words
Step 1b: Alternative route to learn the words in the dictionary rather than define them
Reiterate over Steps 2 - 4: May need to modify |
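As a rough illustration of Step 1a, here is a minimal Python sketch of turning a PWM into dictionary words. It assumes the PWM has already been parsed into a list of per-position probability dicts; the function names `consensus_word` and `top_words` and the values in `toy_pwm` are hypothetical and are not taken from the files in /data/input/jaspar_pwm. Enumerating every candidate word only scales to short motifs.

```python
import heapq
import math
from itertools import product

BASES = "ACGT"

def consensus_word(pwm):
    """Return the single most likely word: the top base at each position."""
    return "".join(max(BASES, key=lambda base: column[base]) for column in pwm)

def top_words(pwm, k):
    """Return the k most probable words under the PWM, best first.

    Every word of the motif length is scored (4**len(pwm) candidates),
    so this is only practical for short motifs.
    """
    scored = []
    for letters in product(BASES, repeat=len(pwm)):
        probability = math.prod(column[base] for column, base in zip(pwm, letters))
        scored.append((probability, "".join(letters)))
    return [word for _, word in heapq.nlargest(k, scored)]

# Hypothetical 4-position PWM; each column's probabilities sum to 1.
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]

print(consensus_word(toy_pwm))
print(top_words(toy_pwm, 2))
```

Keeping several top-scoring words per motif, rather than only the consensus, is one way to let the dictionary tolerate degenerate positions while staying small.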
I would like to work on Step 1b. I am looking into the de novo motif discovery programs, but I would like to first focus on finding the most common words, as I believe I can write that program fairly simply. |
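A minimal sketch of the "most common words" idea, assuming words are fixed-length k-mers counted over a list of sequences. The function name `most_common_words` and the example sequences are hypothetical; this is not the program written for this issue.

```python
from collections import Counter

VALID_BASES = set("ACGT")

def most_common_words(sequences, k, n):
    """Count every overlapping k-mer across the sequences and return the n most frequent."""
    counts = Counter()
    for sequence in sequences:
        sequence = sequence.upper()
        for start in range(len(sequence) - k + 1):
            word = sequence[start:start + k]
            # Skip windows containing ambiguous symbols such as N.
            if set(word) <= VALID_BASES:
                counts[word] += 1
    return counts.most_common(n)

# Hypothetical toy input, not real enhancer sequences.
toy_sequences = ["ACGTACGT", "TTACGA"]
print(most_common_words(toy_sequences, k=3, n=1))
```

Raw frequency tends to favor low-complexity repeats, so comparing counts against a background set of sequences may be needed before treating frequent words as motifs.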
I would like to start with Step 1a. But in order to apply this dictionary to encode our nucleotide sequences for classification, we need to assume that the frequency of a motif is correlated with whether a sequence is expressed in the early embryo. |
Yes @zhanyuanucb, the PWM sequences I would provide you would all be verified TFBS that are known to be in enhancers that direct gene expression in the early Drosophila embryo. |
Cool. |
http://meme-suite.org/ |
I've implemented the notebook to create words from the position weight matrices. The next step is to apply those words to a toy sample. |
I made a notebook that can create words from motifs found using the MEME suite. I made sure this would output the same data format as Sean's program. |
Here is a sample output from the MEME motif discovery. This one was checked against the JASPAR database and is found in Drosophila. |
Sample output of motif counts and location information generated by the following notebook: |
Feel free to check the reformatted data... |
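For readers without access to the notebooks, counting dictionary words and recording where they occur could be sketched as follows. The output layout (a dict with `count` and `positions` per word) and the function name `locate_words` are assumptions made for illustration; they are not the data format used by the notebooks in this thread.

```python
def locate_words(sequence, words):
    """Return, for each word, how often it occurs and the 0-based start positions.

    Overlapping occurrences are included because the search resumes one
    base after each hit.
    """
    hits = {}
    for word in words:
        positions = []
        start = sequence.find(word)
        while start != -1:
            positions.append(start)
            start = sequence.find(word, start + 1)
        hits[word] = {"count": len(positions), "positions": positions}
    return hits

# Hypothetical sequence and words, for illustration only.
toy_hits = locate_words("ATCGATCGAA", ["ATCG", "GAA", "TTT"])
print(toy_hits)
```

The per-word counts can serve as a fixed-length feature vector for classification, while the positions preserve the order of motifs along the sequence.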
2. New data formats
Most of the energy and creativity in machine learning projects goes into organizing the data, optimizing the data structures, and extracting features. That being said, the current data structure is not what we will be using in the future.
See the projects below and pick the one that seems most interesting to you. A good strategy would be to start with only a small subset of the data, run experiments and test accuracy, and then, and only then, build the whole dataset and run it on a larger machine.