Code Review Feb 27 #12

Open
iamciera opened this issue Feb 28, 2019 · 0 comments
Comments

@iamciera
Member

iamciera commented Feb 28, 2019

Goal

Since we have a list of where all the TFBS are in a sequence, we need to use that list to get the nucleotides in those regions from all 24 species (whether they have a TFBS there or not).

@ndesaraju, you can do it the clunkier but likely faster way described below. @joanne-chen, can you investigate whether this is possible within the code already written, when we make the raw_sequence.....

Steps for how I see this being done

  1. Scrape the unique align positions from the align_position column of the files in output/map_motif_bcd_with_threshold. Each file shows where the TFBS are found in each species across one entire region. A TFBS found at a position in even one species means we need to retrieve that position in all 24 species, even though more than one species may share the same align position. That is why we only need the unique align_position numbers.

  2. Now you should have a list of unique align_position numbers, each corresponding to the start of a TFBS. Next, get the raw position from the raw_position column for every species.
    You can find the raw_position that corresponds to every align_position in the files in map_motif_bcd_no_threshold. Sanity check: the number of raw positions retrieved should be the number of unique align_position numbers * 24 species. How you grab every species might be a little tricky, since you can't just grab the header information. But see below for a list of all 24 species.

  3. Now that you have the starting raw_position for each of the 24 species, you can use that number to grab, in each species: (1) length_of_the_TFBS nucleotides forward from that position, to capture the TFBS itself, and (2) n nucleotides on either side of it, from raw_position - n up to raw_position + length_of_the_TFBS + n. You will be grabbing the raw_sequence from the raw directory.
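
The three steps above could be sketched roughly as follows. This is only an illustration, not the project's code: it assumes each file has already been read into a list of dicts (for example with `csv.DictReader`), that each row carries a `species` field alongside `align_position` and `raw_position`, and that `raw_position` is 0-based. The function names and the toy data are made up.

```python
def unique_align_positions(thresholded_rows):
    """Step 1: collect each align_position once, however many species share it."""
    return sorted({int(row["align_position"]) for row in thresholded_rows})


def raw_positions(no_threshold_rows, align_positions):
    """Step 2: map (species, align_position) -> raw_position for the wanted positions."""
    wanted = set(align_positions)
    lookup = {}
    for row in no_threshold_rows:
        align = int(row["align_position"])
        if align in wanted:
            # "species" is an assumed column name, not confirmed by the files.
            lookup[(row["species"], align)] = int(row["raw_position"])
    return lookup


def extract_window(raw_sequence, raw_position, tfbs_length, n):
    """Step 3: slice the TFBS plus n flanking nucleotides on each side."""
    start = max(raw_position - n, 0)      # clamp so a negative start cannot wrap around
    end = raw_position + tfbs_length + n  # slicing already stops at the sequence end
    return raw_sequence[start:end]


# Toy rows standing in for the real files (values invented for illustration).
thresholded = [
    {"species": "Dkik", "align_position": "12", "raw_position": "4"},
    {"species": "MEMB002A", "align_position": "12", "raw_position": "5"},
]
no_threshold = thresholded + [
    {"species": "MEMB002B", "align_position": "12", "raw_position": "3"},
    {"species": "MEMB002B", "align_position": "30", "raw_position": "9"},
]

positions = unique_align_positions(thresholded)   # [12]
lookup = raw_positions(no_threshold, positions)   # three (species, 12) entries
window = extract_window("AAAATTTGGGCCCC", lookup[("Dkik", 12)], tfbs_length=3, n=2)
print(window)  # AATTTGG
```

The clamp in `extract_window` matters for TFBS near the start of a sequence: without it, `raw_position - n` can go negative and Python would slice from the wrong end.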

Note

It would be good to record, for each region we retrieve, which TFBSs in which species are on the original output/map_motif_bcd_with_threshold lists.
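
One possible way to keep track of this is to tag each retrieved window with a flag. The sketch below is hypothetical: the `species` field, the `in_thresholded` key, and the function name are assumptions, and rows are taken to be dicts already read from the files.

```python
def flag_thresholded(windows, thresholded_rows):
    """Mark each retrieved window with whether that species had a thresholded TFBS there."""
    hits = {
        (row["species"], int(row["align_position"]))
        for row in thresholded_rows
    }
    flagged = []
    for window in windows:
        key = (window["species"], window["align_position"])
        # Copy the record so the caller's list is left untouched.
        flagged.append(dict(window, in_thresholded=key in hits))
    return flagged


# Toy example (values invented): Dkik had a thresholded hit at 12, MEMB002B did not.
thresholded_rows = [{"species": "Dkik", "align_position": "12"}]
windows = [
    {"species": "Dkik", "align_position": 12, "sequence": "AATTTGG"},
    {"species": "MEMB002B", "align_position": 12, "sequence": "AATCTGG"},
]
flagged = flag_thresholded(windows, thresholded_rows)
```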

List of 24 species

Dkik
MEMB002A
MEMB002B
MEMB002C
MEMB002D
MEMB002E
MEMB002F
MEMB003A
MEMB003B
MEMB003C
MEMB003D
MEMB003E
MEMB003F
MEMB004A
MEMB004B
MEMB004E
MEMB005D
MEMB006B
MEMB006C
MEMB007A
MEMB007B
MEMB007C
MEMB007D
MEMB008C
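
Since the species names can't simply be read from a header, one option is to hard-code the list above and use it for the sanity check (number of unique align positions * 24). A minimal sketch, assuming the step 2 results are held in a dict keyed by (species, align_position); the function name is made up:

```python
SPECIES = [
    "Dkik",
    "MEMB002A", "MEMB002B", "MEMB002C", "MEMB002D", "MEMB002E", "MEMB002F",
    "MEMB003A", "MEMB003B", "MEMB003C", "MEMB003D", "MEMB003E", "MEMB003F",
    "MEMB004A", "MEMB004B", "MEMB004E",
    "MEMB005D",
    "MEMB006B", "MEMB006C",
    "MEMB007A", "MEMB007B", "MEMB007C", "MEMB007D",
    "MEMB008C",
]


def missing_pairs(lookup, align_positions, species=SPECIES):
    """Return the (species, align_position) pairs that have no raw_position yet."""
    expected = {(s, a) for s in species for a in align_positions}
    return sorted(expected - set(lookup))


# Toy check: one unique position, but only Dkik has been looked up so far.
lookup = {("Dkik", 12): 4}
missing = missing_pairs(lookup, [12])
expected_total = len([12]) * len(SPECIES)  # unique positions * 24 species
print(len(lookup) + len(missing) == expected_total)  # True
```

An empty result from `missing_pairs` means the count matches the sanity check; otherwise the returned pairs say exactly which species and positions still need a raw_position.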
