Question about OLS testing #547

shy218 · 2021-04-25T06:45:42Z

shy218
Apr 25, 2021

What would be a good way to test OLS with covariates? In general, what should the label column and the covariate columns be?
I am currently deciding on the dataset size for testing. I have some VCF files with millions of variants and over 1,000 samples. I want to control the number of variants when I test and compare, so I plan to subsample the VCF files to between 10,000 and 100,000 variants (?). Or what range of variant counts do you think would be best for testing?

eric-czech · 2021-04-25T08:24:52Z

eric-czech
Apr 25, 2021
Maintainer

There is a short GWAS (i.e. OLS) simulation that could be scaled up or down easily in https://nbviewer.jupyter.org/gist/eric-czech/bdff3402d27b5cccd7d8aacab0957a93. Covariates are typically clinical/behavioral parameters (age, sex, etc.) and principal components intended to reflect population structure. Labels are typically a phenotype of some sort, e.g. disease status.

It would be helpful to understand how performance changes as sample and variant counts increase, starting from around 100 or 1,000. Going as high as you can from there on a single machine would be great. I would also avoid VCF files for this, since the VCF I/O time will be hard to separate from the rest of the running time. I think it is safe to simply generate the data directly in Xarray, as the notebook above does.
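
As a rough illustration of this kind of benchmark input, here is a minimal sketch that simulates genotypes, covariates, and a phenotype in memory and runs a per-variant OLS with covariates. It is not the code from the linked notebook and does not use the sgkit or Hail APIs. It uses plain NumPy, and the function names (`simulate`, `ols_per_variant`), the covariates chosen (intercept, age, sex), and the effect sizes are all assumptions made up for the example. The arrays could be wrapped in an Xarray dataset afterwards if needed.

```python
import time

import numpy as np


def simulate(n_samples, n_variants, seed=0):
    """Simulate dosages G, covariates X and phenotype y (hypothetical setup)."""
    rng = np.random.default_rng(seed)
    # Alternate-allele dosages in {0, 1, 2}, one column per variant.
    maf = rng.uniform(0.1, 0.5, size=n_variants)
    G = rng.binomial(2, maf, size=(n_samples, n_variants)).astype(float)
    # Covariates: intercept, age, sex (illustrative choices).
    X = np.column_stack([
        np.ones(n_samples),
        rng.normal(50, 10, size=n_samples),
        rng.integers(0, 2, size=n_samples),
    ])
    # Phenotype: covariate effects + effect of variant 0 + Gaussian noise.
    y = X @ np.array([1.0, 0.02, 0.5]) + 0.8 * G[:, 0] + rng.normal(size=n_samples)
    return G, X, y


def ols_per_variant(G, X, y):
    """Fit y ~ X + g separately for every variant column g of G.

    Covariates are projected out of both G and y first, so each variant
    then needs only a one-dimensional regression.
    """
    n, c = X.shape
    Q, _ = np.linalg.qr(X)               # orthonormal basis for the covariates
    Gr = G - Q @ (Q.T @ G)               # genotypes with covariates removed
    yr = y - Q @ (Q.T @ y)               # phenotype with covariates removed
    gg = np.einsum("ij,ij->j", Gr, Gr)   # per-variant sum of squares
    beta = (Gr.T @ yr) / gg              # per-variant effect estimate
    dof = n - c - 1                      # samples - covariates - variant term
    rss = yr @ yr - beta**2 * gg         # per-variant residual sum of squares
    se = np.sqrt(rss / dof / gg)
    return beta, beta / se               # effect sizes and t-statistics


# Time the regression alone as sample and variant counts grow.
for n_samples, n_variants in [(500, 500), (1000, 1000), (2000, 2000)]:
    G, X, y = simulate(n_samples, n_variants)
    start = time.perf_counter()
    beta, t_stat = ols_per_variant(G, X, y)
    elapsed = time.perf_counter() - start
    print(f"{n_samples} samples x {n_variants} variants: {elapsed:.4f} s")
```

Because the data is generated in memory, the timed section covers only the regression, which keeps file I/O out of the measurement. The sizes in the loop are small placeholders and can be raised until a single machine runs out of memory or patience.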

5 replies

shy218 Apr 25, 2021
Author

Can we schedule a quick meeting if possible?

shy218 Apr 25, 2021
Author

Our testing will be on a single machine, with Hail in local mode, so I don't need to set up a master and workers across multiple machines, right?

eric-czech Apr 25, 2021
Maintainer

Our testing will be on a single machine, with Hail in local mode, so I don't need to set up a master and workers across multiple machines, right?

From my perspective, yes. Any thoughts on distributed benchmarks at this point @hammer?

shy218 Apr 25, 2021
Author

For now, I only know how to set up multiple machines for sgkit. I ran into a bug in Hail's distributed mode today: hail-is/hail#10352. Someone else reported the same bug 5 days ago but has not received any response. Distributed mode for Hail may not be usable at the moment. I will keep an eye on the bug report. For now I would prefer to compare only the local benchmarks, since all of these tools support running in parallel on a local machine.

hammer Apr 25, 2021
Maintainer

Fine to start with local mode and benchmark scale up. At some point we'll need to benchmark scale out though.
