Scale model? #93

khkk378 · 2021-05-21T07:27:16Z

khkk378
May 21, 2021

I need a way to assess the biological reliability of the estimates. One way I was thinking of was to include a bunch of cell types that I know aren't part of my tissue of interest, and then use those estimates as a metric for reliability. That would include training on 100-200 samples, with maybe 100 cell types in total. Do you think I would need to scale the networks for that?

KevinMenden · 2021-05-21T07:45:31Z

KevinMenden
May 21, 2021
Maintainer

Hi @khkk378 ,

interesting idea - I'm not sure though whether that will give you the reliability estimates that you want. 100-200 samples would be too small a training dataset (but maybe you mean something different). I have never tried to estimate more than, say, 10-15 different cell types; beyond that it gets very tricky. 100 different cell types is extremely challenging, and I have the feeling that you would get rather nonsensical results from that :-)

Regarding scaling, the networks should be expressive enough to deal with that. Nevertheless, I don't think that would work.

I know that missing uncertainty estimates are a major drawback of Scaden currently, and I have planned to include something like this soon. If you're interested, the easiest way of including that now would be to run the different Scaden models with dropout enabled at prediction time, say 100 times, and then average the results. The standard deviation across those runs would give you some uncertainty estimate.
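As a rough illustration of the dropout-at-prediction idea above, here is a minimal NumPy sketch. It does not use Scaden's code or API: the toy network, its random weights, and the names `predict_with_dropout` and `mc_dropout` are all assumptions made up for this example. In a Keras model the usual way to get the same effect is to call the model with `training=True`, but whether Scaden exposes that directly is not something this thread confirms.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax so each sample's predicted fractions sum to 1."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def predict_with_dropout(x, weights, rate, rng):
    """One stochastic forward pass of a toy MLP with dropout left switched on."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)            # ReLU hidden layer
        keep = rng.random(h.shape) >= rate    # Bernoulli keep mask
        h = h * keep / (1.0 - rate)           # inverted-dropout rescaling
    return softmax(h @ weights[-1])


def mc_dropout(x, weights, rate=0.2, n_passes=100, seed=0):
    """Repeat the stochastic pass and summarise: mean = estimate, std = spread."""
    rng = np.random.default_rng(seed)
    draws = np.stack(
        [predict_with_dropout(x, weights, rate, rng) for _ in range(n_passes)]
    )
    return draws.mean(axis=0), draws.std(axis=0)


# Toy stand-in for a trained model (hypothetical sizes and random weights).
rng = np.random.default_rng(1)
n_genes, n_hidden, n_cell_types = 50, 16, 5
weights = [
    rng.normal(scale=0.1, size=(n_genes, n_hidden)),
    rng.normal(scale=0.1, size=(n_hidden, n_cell_types)),
]
x = rng.random((3, n_genes))                  # 3 made-up bulk samples

mean_frac, std_frac = mc_dropout(x, weights)
print(mean_frac.round(3))
print(std_frac.round(3))
```

The per-cell-type standard deviation only reflects variation induced by dropout, so it is a statistical spread, not a check on whether the estimate is biologically right.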

Let me know if you're interested to try this out, I wanted to test it too at some point. Turning this into a discussion.


khkk378 · 2021-05-21T07:58:29Z

khkk378
May 21, 2021
Author

So, I mean 100-200 donor/tissue combinations. So, say, 5 million cells in total. I'm not really talking about the uncertainty of the estimates from a statistical viewpoint (although that is also needed) but from a biological one. A model could give consistent but wrong results. Say I want to estimate cell type fractions in kidney. There are maybe 20 cell types there. Then I was thinking about also including, say, hepatocytes, pancreatic beta cells and so on in the model. Cell types that I know aren't in kidney. I expect zero estimates for all of those, and the deviation from that could be a metric for how prone the model is to picking up more technical aspects of the data (say damaged cells).
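The metric described above could be sketched roughly like this. The function name `out_of_scope_leakage`, the cell type labels, and the predicted fractions are all invented for illustration; nothing here comes from Scaden or from real data.

```python
import numpy as np


def out_of_scope_leakage(fractions, cell_types, out_of_scope):
    """Total predicted fraction assigned to cell types that cannot be present.

    fractions: (n_samples, n_cell_types) array of predicted fractions.
    cell_types: array of column labels, same order as the columns.
    out_of_scope: set of labels known to be absent from the tissue.
    """
    mask = np.isin(cell_types, list(out_of_scope))
    return fractions[:, mask].sum(axis=1)


# Made-up predictions for two kidney samples (rows sum to 1).
cell_types = np.array(
    ["proximal_tubule", "podocyte", "endothelial", "hepatocyte", "beta_cell"]
)
pred = np.array(
    [
        [0.60, 0.10, 0.28, 0.01, 0.01],   # little mass on absent cell types
        [0.30, 0.05, 0.25, 0.30, 0.10],   # a lot of mass on absent cell types
    ]
)

leak = out_of_scope_leakage(pred, cell_types, {"hepatocyte", "beta_cell"})
print(leak)
```

A sample with leakage near zero is at least not obviously wrong, while a high value flags an estimate that should not be trusted, whatever the cause.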

6 replies

khkk378 May 21, 2021
Author

Now I had a (maybe) great idea :) All these methods are evaluated first by their ability to accurately estimate synthetic data and then on some neat PBMC data with technical replicates and FACS fractions. My view is that it's more important to not be wrong than to be precisely right. I've been benchmarking a bunch of these methods on various clinical tissue datasets and often the results are nonsensical. The same is true if I look at issues raised in the various repos. That probably boils down to the model training on something that's largely technical, like the fraction of mitochondrial reads for cells that are easily damaged in the sample prep, or the amount of RNA for small cells. I think it would be possible to have a model that trains on cell types that you label "in scope" (from the tissue of interest) and those that are "out of scope" (cell types that can't be present in the tissue of interest). The cost function would then reward accurately predicting the "in scope" cell types when predicting fractions from only the tissue of interest, while at the same time estimating zeros for the "out of scope" cell types. Note that there is no need to estimate the individual fractions for the "out of scope" cell types. This could maybe be formulated as a GAN or similar. I think this could be highly resistant to the technical signal and train more purely on the biology. Could be a paper in there :)
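One possible reading of that cost function, written as a plain NumPy sketch rather than a GAN: a squared error on the "in scope" fractions plus a penalty on the total mass given to "out of scope" cell types, so their individual fractions never need to be estimated. The name `scope_loss`, the `weight` parameter, and the toy numbers are assumptions for illustration only.

```python
import numpy as np


def scope_loss(pred, true_in_scope, in_scope_mask, weight=1.0):
    """In-scope accuracy term plus a penalty on total out-of-scope mass.

    pred: (n_samples, n_cell_types) predicted fractions.
    true_in_scope: (n_samples, n_in_scope) known fractions for in-scope types.
    in_scope_mask: boolean array over columns, True for in-scope cell types.
    weight: hypothetical trade-off between the two terms.
    """
    in_err = np.mean((pred[:, in_scope_mask] - true_in_scope) ** 2)
    out_mass = pred[:, ~in_scope_mask].sum(axis=1)   # only the total matters
    return in_err + weight * np.mean(out_mass ** 2)


in_scope_mask = np.array([True, True, False, False])
true_in_scope = np.array([[0.7, 0.3]])

pred_good = np.array([[0.7, 0.3, 0.0, 0.0]])   # exact, no leakage
pred_bad = np.array([[0.5, 0.3, 0.1, 0.1]])    # 0.2 leaked out of scope

loss_good = scope_loss(pred_good, true_in_scope, in_scope_mask)
loss_bad = scope_loss(pred_bad, true_in_scope, in_scope_mask)
print(loss_good, loss_bad)
```

Penalising only the summed out-of-scope mass keeps the second term indifferent to which absent cell type received the leaked fraction, which matches the note that those individual fractions do not need to be estimated.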

KevinMenden May 21, 2021
Maintainer

I disagree that the model (or any deconvolution method for that matter) only uses technical information. Largely because, at least for Scaden, the model is trained on data simulated from scRNA-seq data and performs quite well on bulk RNA-seq data, although clearly there is a large distribution shift here. This shows that it must have learned the biology, or genes, that separate the cell types. We also discuss in the paper that closing this gap between simulated and bulk data would most likely help to improve performance.
Nevertheless, I agree that the data setting must be well defined, and the training tissue must be the same, for these methods to work well.

If I understand correctly, what you are suggesting basically boils down to generating one model for multiple tissues, right? Or are you suggesting to mix in other cell types and let the model learn to focus only on the relevant cell types, e.g. immune cells, as is commonly done?
Would definitely be interesting to test. I actually never tested this because I assumed it would be easier if you stay within a tissue. But it's certainly possible that the additional information could help.

Interesting stuff! Always great to hear some new ideas :-) I have a couple of small and not-so-small deviations from the original Scaden model in mind myself that might (or might not) perform better than the original one.

If you want to build a proof of concept for your idea, I'd certainly be happy to help. Maybe we can also discuss some ideas and come up with something new. Unfortunately I don't have much time for it though :/

khkk378 May 21, 2021
Author

Unclear wording by me there, I didn't mean that Scaden trains mainly on technical parameters. Quite the opposite, we're going with Scaden in our project specifically because our benchmarking showed it to be superior when it comes to translating between methods with large technical differences. What I meant was that when you get nonsensical estimates for a sample it (I think) has a lot to do with technical parameters.

I mean the latter, to mix in cell types to anti-select for in training. Then you would have multiple cell types with high mitochondrial content/fragile cells/small cells/high ribosomal content/proliferating cells and so on, and the model will learn not to take these large kind-of-technical features into account and rather train on the more specific expression patterns. Which would make it more robust when estimating unknown samples where the technical biases would be different. Or that is the idea :)

I would be very happy to discuss this or something related. I work in pharma and also have limited time for method development, but on the other hand I have a lot of tissue transcriptomics data :)

KevinMenden May 25, 2021
Maintainer

Sounds great - I will be on holiday for a good week now. Maybe we can set up a small Slack workspace or something like this where it's easier to discuss some ideas. If we set up some good test datasets we can maybe try to prototype some ideas. Would be fun! If it leads to something better, awesome, if not, also fine 😁

khkk378 May 25, 2021
Author

Sounds like a plan! I'll ping you here next week and we can chat on Slack or Teams. Enjoy your vacation!
