Defining features and labels #59
Comments
I'm happy to see mediocre results for modeling some mutations. With gene expression, universally positive results are usually a good indication that you're overlooking something. @gwaygenomics or @cgreene would know better, but here's why it may be totally acceptable that a mutation doesn't have an expression signature:
I think @cgreene suspects most mutations will be difficult to classify. The ones that classify well are truly special and may point to eventual therapeutic targets.
Yes, good observation. In practice X could be subsetted if a user selects only samples with a certain cancer. But if you're using all samples, X will be the same for every model.
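As a rough illustration of the subsetting idea, here is a minimal sketch using NumPy. The array contents, the `cancer_type` labels, and the variable names are all made up for the example and are not taken from the project's code.

```python
import numpy as np

# Hypothetical data: 6 samples, 3 expression features.
X = np.arange(18, dtype=float).reshape(6, 3)
# Assumed per-sample cancer-type labels (illustrative acronyms).
cancer_type = np.array(["LUAD", "BRCA", "LUAD", "SKCM", "BRCA", "LUAD"])
# Assumed mutation status for one gene, aligned with the rows of X.
y = np.array([1, 0, 0, 1, 0, 1])

# Boolean mask selecting the samples of a single cancer type.
mask = cancer_type == "LUAD"

# Apply the same mask to features and labels so the rows stay aligned.
X_subset = X[mask]
y_subset = y[mask]
```

Without the mask, `X` is passed unchanged to every model, which is the "same X for every model" case.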
This is expected and makes sense to me. When you change your mutation, you are asking a different question that requires a different model.
Regarding the reliability of Y, I sort of feel like this is an upstream issue. We take what we get. However, the raw Xena mutation data does contain some sequencing replicates, which could allow you to estimate the sequencing fidelity. My impression is that the mutation calling is decent, right @gwaygenomics?
Thank you for your comments. I had misconceptions regarding the use of matrix X. Regarding the sequencing replicates, if I understood correctly, this would correspond to the multi-label issue that I mentioned earlier. Unfortunately, I am not aware of many papers that deal with it. The common approach is majority voting.
According to sklearn's docs:
I'm not sure how multilabel classification would fit in with sequencing replicates. I was thinking that the replicates would be most useful as a way of examining the reliability of the sequencing. Do two independent sequencing runs of the same sample yield the same mutations? To keep things simple, we probably don't want to venture into multilabel classification, ... but we could fit a model where each mutation was a separate "label". Maybe there would be certain benefits to fitting all models together, but I'm not sure.
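One possible way to ask "do two runs of the same sample yield the same mutations?" is to compute a per-sample overlap score between the two call sets. The sketch below uses Jaccard similarity; the sample IDs and mutation calls are invented, and Jaccard is only one of several reasonable concordance measures.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical mutated-gene calls from two sequencing runs of each sample.
replicates = {
    "sample_1": ({"TP53", "KRAS"}, {"TP53", "KRAS"}),
    "sample_2": ({"TP53", "BRAF"}, {"TP53"}),
    "sample_3": (set(), {"PIK3CA"}),
}

# Per-sample agreement between the two runs (1.0 = identical call sets).
concordance = {s: jaccard(r1, r2) for s, (r1, r2) in replicates.items()}

# A single summary number for the whole set of replicated samples.
mean_concordance = sum(concordance.values()) / len(concordance)
```

A low mean concordance would suggest that the labels in Y are noisy before any classifier is involved.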
Agree as a future interest. Transfer learning approaches should be very well suited here (and transferability is also interesting scientifically). Not sure how much we want to dig in at this time.
They are state-of-the-art exome mutation calls. Right now, six mutation-calling algorithms are applied to the data, and variants are removed if they are only called once.
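The filtering rule can be sketched roughly as follows. This assumes that "only called once" means a variant is supported by a single caller; the caller names, variant IDs, and the `consensus_variants` helper are hypothetical and do not reflect the actual upstream pipeline.

```python
from collections import defaultdict


def consensus_variants(calls, min_callers=2):
    """Keep variants reported by at least `min_callers` distinct callers."""
    support = defaultdict(set)
    for caller, variants in calls.items():
        for variant in variants:
            support[variant].add(caller)
    return {v for v, callers in support.items() if len(callers) >= min_callers}


# Hypothetical output: caller name -> set of variant IDs it reported.
calls = {
    "caller_1": {"var_A", "var_B"},
    "caller_2": {"var_A", "var_C"},
    "caller_3": {"var_A", "var_B"},
    "caller_4": set(),
    "caller_5": {"var_D"},
    "caller_6": {"var_A"},
}

# var_C and var_D each have a single supporting caller, so they are dropped.
kept = consensus_variants(calls)
```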
This issue is a follow-up to the results obtained for different genes in #52. It is still not clear why a few oncogenes produced such bad results. Before analyzing the genes themselves, I was puzzled by one thing in the code.
If we want to run the classifier for a different gene, the only part that currently changes is y, the vector of labels: y=Y[GENE]. Matrix X, which contains our feature values, remains the same. This means that one set of feature values can belong to class '0' in one iteration, while in another iteration the same set is labeled as class '1'. Even though each iteration corresponds to a different gene, the classifier sees it as just another combination of '0' and '1' labels for which a model has to be built.
If the matrix X is static, i.e., its values are completely reliable, I guess the main question is how reliable the labels in matrix Y are, and whether that reliability can be measured.
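To make the setup concrete, here is a minimal sketch on synthetic data. It assumes X is a samples-by-features expression matrix and Y holds one binary mutation column per gene; the gene names, matrix sizes, and the use of `LogisticRegression` are illustrative assumptions, not the project's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: 200 samples, 5 expression features (assumed sizes).
n_samples, n_features = 200, 5
X = rng.normal(size=(n_samples, n_features))

# Hypothetical mutation matrix Y with one 0/1 column per gene.
# GENE_A's status depends on feature 0; GENE_B's is unrelated to X.
Y = {
    "GENE_A": (X[:, 0] + 0.3 * rng.normal(size=n_samples) > 0).astype(int),
    "GENE_B": rng.integers(0, 2, size=n_samples),
}

models = {}
for gene, y in Y.items():
    # X is identical in every iteration; only the label vector y changes,
    # so each fit answers a different question with its own coefficients.
    models[gene] = LogisticRegression().fit(X, y)

# Training accuracy only, to keep the sketch short (not a proper evaluation).
train_accuracy = {gene: models[gene].score(X, Y[gene]) for gene in Y}
```

In this toy setup the same rows of `X` carry label 1 for one gene and 0 for another, and the two fitted models simply end up with different coefficients. A gene whose labels are unrelated to `X`, like the hypothetical `GENE_B`, would be expected to classify poorly.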