Defining features and labels #59

Open
brankaj opened this issue Oct 13, 2016 · 5 comments


brankaj commented Oct 13, 2016

This issue is a follow-up to the results obtained for different genes (#52). It is still not clear why a few oncogenes produced such bad results. Before analyzing the genes themselves, I was puzzled by one thing in the code.

If we want to run the classifier for a different gene, the only part that currently changes is y, i.e., the vector of labels y = Y[GENE]. Matrix X, which contains our feature values, remains the same. This means that one set of feature values can belong to class '0' in one iteration, while in another iteration the same set is denoted as class '1'. Even though each iteration corresponds to a different gene, the classifier sees it as just another combination of '0' and '1' labels for which a model has to be built.

If the matrix X is static, i.e., its values are completely reliable, I guess the main question is how reliable the labels given in matrix Y are, and whether it would be possible to measure that reliability.


dhimmel commented Oct 14, 2016

> It is still not clear why a few oncogenes produced such bad results.

I'm happy to see mediocre results for modeling some mutations. With gene expression, universally positive results are usually a good indication that you're overlooking something. @gwaygenomics or @cgreene would know better, but here's why it may be totally acceptable that a mutation doesn't have an expression signature:

  • the gene doesn't do much so whether it's mutated or not doesn't affect cellular function.
  • the mutations are mostly passenger mutations rather than driver mutations. Basically, they're along for the ride, but aren't in the driver's seat.
  • our mutation measure isn't fine-grained enough to be biologically meaningful. See "Extract detailed mutation information for TCGA samples" (cancer-data#15).

I think @cgreene suspects most mutations will be difficult to classify. The ones that classify well are truly special and may point to eventual therapeutic targets.

> X, which contains our feature values, remains the same

Yes, good observation. In practice X could be subsetted if a user selects only samples with a certain cancer. But if you're using all samples, X will be the same for every model.

> This means that one set of feature values can belong to class '0' in one iteration, while in another iteration the same set is denoted as class '1'.

This is expected and makes sense to me. When you change your mutation, you are asking a different question that requires a different model.
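To make that concrete, here is a minimal sketch rather than the project's actual code. It assumes a plain scikit-learn `LogisticRegression`, random stand-in data, and two made-up gene names (`GENE_A`, `GENE_B`). The same `X` is reused on every pass and only the label vector changes, so each gene gets its own model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 200 samples x 5 expression features, plus a
# binary mutation status per sample for two made-up genes.
X = rng.normal(size=(200, 5))
Y = {
    "GENE_A": (X[:, 0] > 0).astype(int),  # status tracks feature 0
    "GENE_B": (X[:, 1] > 0).astype(int),  # status tracks feature 1
}

models = {}
for gene, y in Y.items():
    # X is identical on every pass; only the labels differ.
    models[gene] = LogisticRegression().fit(X, y)

for gene, model in models.items():
    print(gene, np.round(model.coef_[0], 2))
```

The same row of `X` can be labeled 0 for one gene and 1 for the other without any conflict, because the two fits never see each other's labels.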

> I guess the main question is how reliable the labels given in matrix Y are, and whether it would be possible to measure that reliability.

Regarding the reliability of Y, I sort of feel like this is an upstream issue. We take what we get. However, the raw Xena mutation data does contain some sequencing replicates, which could allow you to estimate the sequencing fidelity. My impression is that the mutation calling is decent, right @gwaygenomics?


brankaj commented Oct 17, 2016

Thank you for your comments. I had misconceptions regarding the use of matrix X. Regarding the sequencing replicates, if I understood correctly, this would correspond to the multi-label problem that I mentioned earlier. Unfortunately, I am not aware of many papers that deal with this issue. The common approach is majority voting.
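As a minimal sketch of majority voting, assuming each sample has several replicate calls (0 or 1) for a single gene. The sample names and calls below are made up, and returning `None` on a tie is just one possible convention:

```python
from collections import Counter

def majority_vote(calls):
    """Return the most common label among replicate calls, or None on a tie."""
    if not calls:
        raise ValueError("need at least one call")
    counts = Counter(calls).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority label
    return counts[0][0]

# Hypothetical replicate calls per sample for one gene.
replicate_calls = {
    "sample_1": [1, 1, 0],
    "sample_2": [0, 0],
    "sample_3": [1, 0],
}
consensus = {s: majority_vote(c) for s, c in replicate_calls.items()}
print(consensus)
```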


dhimmel commented Oct 17, 2016

According to sklearn's docs:

> Multilabel classification assigns to each sample a set of target labels. This can be thought as predicting properties of a data-point that are not mutually exclusive, such as topics that are relevant for a document. A text might be about any of religion, politics, finance or education at the same time or none of these.

I'm not sure how multilabel classification would fit in with sequencing replicates. I was thinking that the replicates would be most useful as a way of examining the reliability of the sequencing. Do two independent sequencing runs of the same sample yield the same mutations?
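One rough way to put a number on that question is a Jaccard index over the mutated genes reported by two runs of the same sample. This is only a sketch: the gene sets below are invented, and how replicates are actually laid out in the Xena data is not assumed here.

```python
def jaccard(a, b):
    """Share of mutations reported by both runs among those reported by either."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty call sets agree trivially
    return len(a & b) / len(a | b)

# Hypothetical mutated-gene calls from two runs of one sample.
run_1 = {"TP53", "KRAS", "PIK3CA"}
run_2 = {"TP53", "KRAS", "EGFR"}

concordance = jaccard(run_1, run_2)  # 2 shared out of 4 distinct genes
print(concordance)
```

Averaging this value over all replicated samples would give a single, coarse estimate of how reproducible the calls are.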

To keep things simple, we probably don't want to venture into multilabel classification, but we could fit a model where each mutation is a separate "label". Maybe there would be certain benefits to fitting all models together, but I'm not sure.
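For reference, a minimal sketch of the "each mutation is a separate label" setup, using scikit-learn's `MultiOutputClassifier` on random stand-in data. Note that this wrapper simply fits one independent classifier per label column, so it shows the interface but would not by itself deliver any benefit from fitting the models together:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples x 5 features, and one binary label
# column per gene (two made-up genes here).
X = rng.normal(size=(200, 5))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0]).astype(int)

# One estimator per label column, wrapped in a single object.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)
pred = clf.predict(X)

print(pred.shape, len(clf.estimators_))
```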


cgreene commented Oct 17, 2016

Agree as a future interest. Transfer learning approaches should be very well suited here (and transferability is also interesting scientifically). Not sure how much we want to dig in at this time.


gwaybio commented Oct 17, 2016

> My impression is that the mutation calling is decent, right @gwaygenomics?

They are state-of-the-art exome mutation calls. Right now, six mutation-calling algorithms are applied to the data, and variants are removed if they are only called once.
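A minimal sketch of that kind of filter, assuming "called once" means a variant is reported by only one caller. The caller names and variant IDs are placeholders, not the real pipeline's:

```python
from collections import Counter

def filter_variants(calls_by_caller, min_callers=2):
    """Keep variants reported by at least `min_callers` distinct callers."""
    support = Counter()
    for variants in calls_by_caller.values():
        support.update(set(variants))  # count each caller at most once
    return {v for v, n in support.items() if n >= min_callers}

# Hypothetical calls from three callers (the thread mentions six).
calls_by_caller = {
    "caller_a": ["var_1", "var_2", "var_3"],
    "caller_b": ["var_1", "var_3"],
    "caller_c": ["var_1", "var_4"],
}
kept = filter_variants(calls_by_caller)
print(sorted(kept))
```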
