When will the version supporting indels come out? Why input 17-bp sequence when you train the model with 11bp sequence? #9
Comments
Why input 17-bp sequence when you train the model with 11bp sequence?
QBiC doesn’t support indels because it computes a linear combination of the OLS coefficient estimates for all 6-mers overlapping the SNP of interest. This is why an 11-bp sequence is needed: it covers all of those 6-mers. However, we also use the PBM E-score to predict the change in binding status (e.g. the SNP changes the TF from bound to unbound), which requires the 8-mer model overlapping the SNP. This leads to the 17-bp sequence input.

Instead of inputting a 17-bp sequence flanking an SNV, can we build another format to suit for Indels?
Unfortunately, right now we don’t have a way to support indels.

Can we input a table of two columns while the left column storing WT sequences and the right column storing MUT sequences?
Right now we don’t have a two-column WT/MUT feature. But we do support a text file that contains 17-mer DNA sequences with the "mutated from" nucleotide in the middle and the "mutated to" nucleotide on the right, separated by a space or tab character. See the example here: qbic.genome.duke.edu/download/QBiC-sequence-format-example-ELK1_17mers.txt . The two-column format shouldn’t be hard to implement, though; I will mark it as a TODO.

I thought you are just counting the 6-mer haplotypes and implement the count data into the linear regression to make the prediction?
Yes, we count the 6-mers and use the counts with the OLS coefficients to make the prediction. All 6-mers overlap the SNP of interest, which is the reason we can only calculate TF binding changes from SNPs.

More responses below from our lab PI:
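To make the input format and the 6-mer step above concrete, here is a minimal Python sketch. It is not QBiC's code: the function names are hypothetical, the coefficient table `coef` is a made-up stand-in for the fitted OLS estimates, and details such as reverse-complement handling or an intercept are left out because this thread does not describe them.

```python
from collections import Counter

def parse_line(line):
    """Split one input line: a 17-mer, then the 'mutated to' base (space or tab)."""
    seq, alt = line.split()
    seq, alt = seq.upper(), alt.upper()
    if len(seq) != 17 or len(alt) != 1:
        raise ValueError("expected a 17-mer followed by a single nucleotide")
    return seq, alt

def overlapping_kmers(seq, k):
    """Return the k-mers of seq that cover its center position."""
    center = len(seq) // 2
    first = max(0, center - k + 1)
    last = min(center, len(seq) - k)
    return [seq[i:i + k] for i in range(first, last + 1)]

def mutate_center(seq, alt):
    """Replace the center ('mutated from') base with the 'mutated to' base."""
    center = len(seq) // 2
    return seq[:center] + alt + seq[center + 1:]

def delta_score(wt17, alt, coef, k=6):
    """Linear combination of coefficients weighted by (mutant - wild-type) k-mer counts."""
    wt = Counter(overlapping_kmers(wt17, k))
    mut = Counter(overlapping_kmers(mutate_center(wt17, alt), k))
    # Counter returns 0 for missing keys, so the difference is defined for every k-mer.
    return sum((mut[kmer] - wt[kmer]) * coef.get(kmer, 0.0)
               for kmer in set(wt) | set(mut))

# Hypothetical usage with invented coefficients (not real QBiC estimates):
wt17, alt = parse_line("ACGTACGTACGTACGTA\tG")
toy_coef = {"TACGTA": 1.0, "TACGTG": 3.0}
print(len(overlapping_kmers(wt17, 6)), delta_score(wt17, alt, toy_coef))
```

With k = 6, the six k-mers covering the center of the 17-mer start at offsets 3 through 8, so together they span exactly the central 11 bases, which matches the 11-bp window mentioned above.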
Thanks for the info! But I'm a bit confused now.

The model is trained on 6-mers to get coefficients, and when we input WT and MUT sequences, it summarizes the difference in counts of all 6-mers in an 11-bp window and stores them in a vector c, which leads to delta S. Then the distribution of normalized intensities across all the observations is used to calculate the p-value and z-score. (BTW, how do you define a single observation? Several neighboring probes correspond to one fluorescent signal spot; can one signal spot be treated as one observation because the output signal intensity is certain?)

The E-score, according to the article you published in 2006 (Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities), is rank-based. I did not get all the details of how it is calculated, but I see how you rank the 8-mers from Figure 2a of that publication. The count of k-mers with E-scores > 0.45 is used to determine whether an L-bp sequence can bind a given TF.

Therefore, the delta S generated from 6-mers is only used to calculate the z-score and p-value to show the confidence of the change, while the absolute value of delta S is not used for classification. Instead, from the training PBM data, you know each 8-mer's E-score for a given TF. Given the input variants, you can fetch the 17-bp sequence with the SNV in the middle from the hg19/hg38 reference genome, and you know the changes in the counts of each 8-mer along with its E-score. Based on that, you can classify the 17-bp sequence as bound/unbound to a TF.

If what I stated above is correct, why use two different systems to calculate the p-value and to make the classification? Why not train on 8-mers in a 17-bp window? I'd really appreciate it if you could respond to this. Thanks!
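The classification step as described in this comment could be sketched roughly as follows. This only illustrates the commenter's reading, not QBiC's confirmed rule: the 0.45 cutoff is the figure quoted above, while the `min_hits` decision rule, the choice to scan every 8-mer in the 17-mer, the function names, and the toy E-score table are all assumptions.

```python
def kmers(seq, k=8):
    """All k-mers in seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def binding_status(seq, escores, threshold=0.45, min_hits=1):
    """Call 'bound' if at least min_hits 8-mers exceed the E-score threshold.

    escores maps 8-mer -> E-score for one TF; unseen 8-mers are treated
    as 0.0 here, which is an assumption of this sketch.
    """
    hits = sum(1 for kmer in kmers(seq, 8) if escores.get(kmer, 0.0) > threshold)
    return "bound" if hits >= min_hits else "unbound"

def status_change(wt17, alt, escores, **rule):
    """Return (wild-type status, mutant status) for a center substitution."""
    center = len(wt17) // 2
    mut17 = wt17[:center] + alt + wt17[center + 1:]
    return binding_status(wt17, escores, **rule), binding_status(mut17, escores, **rule)

# Hypothetical usage with an invented E-score table:
toy_escores = {"CGGAAGTG": 0.48}
print(status_change("TTGACCGGAAGTGCATC", "T", toy_escores))
```

In this toy case the only high-scoring 8-mer covers the center base, so replacing that base removes it from the mutant sequence and the call changes.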