I'm a statistician and data scientist. I'm currently at Brown University working as a postdoctoral research associate at the Data Science Institute with Sohini Ramachandran. My PhD work was at McGill University in Quantitative Life Sciences with Simon Gravel, where I studied topological data analysis methods for genetic data. You can find my published research on Google Scholar.
I also enjoy collecting data on a variety of topics. Some of my side projects include tracking the length of the Rideau Canal skating season and collecting news stories about traffic violence.
Some of my academic research:
UMAP is an efficient method to visualize biobank data. You can find structure in your data (i.e. population structure) related to factors like demographic history or biobank sampling methodology. When you colour in the visualizations with other data, like geography or phenotypic measures, you can see lots of patterns and study them further. You can also work in 3D and get creative, doing stuff like converting UMAP's
Paper: UMAP reveals cryptic population structure and phenotype heterogeneity in large genomic cohorts, Diaz-Papkovich et al., PLoS Genetics, 2019.
Related GitHub repositories:
- Code for the paper
- Interactive Python notebook with data from the 1000 Genomes Project
- I also have code for a review paper of UMAP in population genetics
Though UMAP tends to generate clusters, it is not a clustering algorithm. To extract clusters from UMAP data, we use a density-based method called HDBSCAN. We can use this for stratification to get a better grasp of the population structure in our data, study how methods like polygenic scores transfer between populations, and do QC on biobank data.
Preprint: Topological stratification of continuous genetic variation in large biobanks, Diaz-Papkovich et al., bioRxiv, 2023.
Related GitHub repositories: