Goal
Since we have a list of where all the TFBS are in a sequence, we need to use that list to get the nucleotides in those regions from all 24 species (whether they have a TFBS or not).
@ndesaraju, you can do it the more clunky, but likely faster, way described below. @joanne-chen, can you investigate whether this is possible within the code already written, when we make the raw_sequence.....
Steps for how I see it can be performed
1. Scrape the unique align positions from the align_position column of the files in output/map_motif_bcd_with_threshold. These files show where all the TFBS are found in each species across the entire region (each file). We only need to know that a TFBS was found at a given position in at least one species to know we need to retrieve that position in all 24 species, even though more than one species may share the same align position. That is why we only need the unique align_position numbers.
2. Now you should have a list of unique align_position numbers, each corresponding to the start of a TFBS. Next, get the raw position from the raw_position column for every species.
3. You can find the raw_position that corresponds to every align_position in the files in map_motif_bcd_no_threshold. As a sanity check, the number of raw positions you end up with should be the number of unique align_position numbers * 24 species. How you grab every species might be a little tricky since you can't just grab the header information, but see below for a list of all 24 species.
4. Now that you have the starting raw_position for each of the 24 species, use that number to grab, in each species: (1) length_of_the_TFBS nucleotides forward from that position, to capture the TFBS; and (2) n nucleotides beyond the TFBS (up to raw_position + length_of_the_TFBS + n) and n nucleotides behind it (back to raw_position - n). You will be grabbing the raw_sequence from the raw directory.
Note
It would be good to know which TFBSs in which species are on the original output/map_motif_bcd_with_threshold lists.
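A rough sketch of the first three steps, in case it helps to see them as code. Only the align_position and raw_position column names come from the files described above. The `species` column, the CSV layout, the toy data, and the helper names (`read_rows`, `unique_align_positions`, `raw_positions_for`) are all assumptions made for illustration and will probably need adapting to the real files.

```python
import csv
import io


def read_rows(text):
    """Parse CSV text into a list of dicts (assumed layout, see note above)."""
    return list(csv.DictReader(io.StringIO(text)))


def unique_align_positions(tables):
    """Step 1: collect every align_position seen in any thresholded table."""
    positions = set()
    for rows in tables:
        for row in rows:
            positions.add(int(row["align_position"]))
    return sorted(positions)


def raw_positions_for(align_positions, no_threshold_rows):
    """Steps 2-3: map (species, align_position) -> raw_position."""
    wanted = set(align_positions)
    lookup = {}
    for row in no_threshold_rows:
        pos = int(row["align_position"])
        if pos in wanted:
            # "species" is a hypothetical column name
            lookup[(row["species"], pos)] = int(row["raw_position"])
    return lookup


# Toy data standing in for one with_threshold file and one no_threshold file.
with_threshold = read_rows(
    "species,align_position,raw_position\n"
    "Dkik,10,8\n"
    "MEMB002A,10,9\n"
    "MEMB002A,25,20\n"
)
no_threshold = read_rows(
    "species,align_position,raw_position\n"
    "Dkik,10,8\n"
    "Dkik,25,21\n"
    "MEMB002A,10,9\n"
    "MEMB002A,25,20\n"
    "MEMB002A,30,24\n"
)
toy_species = ["Dkik", "MEMB002A"]  # the real run would use all 24

positions = unique_align_positions([with_threshold])
lookup = raw_positions_for(positions, no_threshold)

# Sanity check from step 3: unique positions * number of species.
assert len(lookup) == len(positions) * len(toy_species)
```

Keeping the lookup keyed by (species, align_position) also makes it straightforward to flag afterwards which entries were on the original with_threshold lists, as mentioned in the note.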
List of 24 species
Dkik
MEMB002A
MEMB002B
MEMB002C
MEMB002D
MEMB002E
MEMB002F
MEMB003A
MEMB003B
MEMB003C
MEMB003D
MEMB003E
MEMB003F
MEMB004A
MEMB004B
MEMB004E
MEMB005D
MEMB006B
MEMB006C
MEMB007A
MEMB007B
MEMB007C
MEMB007D
MEMB008C
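The last step (grabbing the TFBS plus n nucleotides on either side) could look something like the sketch below. It assumes raw_position is a 0-based index into the raw sequence string, which is not confirmed anywhere above, and it clips the window at the sequence ends. `extract_window` and the toy sequence are hypothetical; reading the actual files from the raw directory is left out.

```python
# The 24 species listed above.
SPECIES = [
    "Dkik",
    "MEMB002A", "MEMB002B", "MEMB002C", "MEMB002D", "MEMB002E", "MEMB002F",
    "MEMB003A", "MEMB003B", "MEMB003C", "MEMB003D", "MEMB003E", "MEMB003F",
    "MEMB004A", "MEMB004B", "MEMB004E",
    "MEMB005D",
    "MEMB006B", "MEMB006C",
    "MEMB007A", "MEMB007B", "MEMB007C", "MEMB007D",
    "MEMB008C",
]


def extract_window(raw_sequence, raw_position, tfbs_length, n):
    """Return the TFBS plus up to n flanking nucleotides on each side.

    Assumes raw_position is a 0-based start index (an assumption, not
    something the source files are known to guarantee).
    """
    start = max(0, raw_position - n)  # n nucleotides behind, clipped at 0
    end = min(len(raw_sequence), raw_position + tfbs_length + n)  # n beyond
    return raw_sequence[start:end]


# Toy example: a 3-nt "TFBS" starting at index 3, with 2 nt of flank.
toy_sequence = "AAACCCGGGTTT"
window = extract_window(toy_sequence, raw_position=3, tfbs_length=3, n=2)

# Near the start of the sequence the left flank is simply shorter.
clipped = extract_window(toy_sequence, raw_position=0, tfbs_length=3, n=2)
```

If the positions in the files turn out to be 1-based, subtracting 1 from raw_position before slicing should be the only change needed.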