This repository contains the implementation of our papers titled "Recurrent Deep Embedding Networks for Genotype Clustering and Ethnicity Prediction" and "Convolutional Embedded Networks for Population Scale Clustering and Bio-ancestry Inferencing". The former is available on arXiv as a pre-print (arXiv:1805.12218). The latter has been submitted to IEEE/ACM Transactions on Computational Biology and Bioinformatics and is under review.
This repo contains two different implementations: i) Deep Embedding Networks (DEC) and Recurrent Deep Embedding Networks (CDEC) using Keras, and ii) Spark and H2O implementations of our paper titled "Recurrent Deep Embedding Networks for Genotype Clustering and Ethnicity Prediction".
The proof of concept of our approach is implemented in Spark, ADAM, and Keras. In particular, for scalable and faster preprocessing of the huge number of genetic variants across all the chromosomes (i.e. 870 GB of data), we used ADAM and Spark to convert the genetic variants from VCF format to a Spark DataFrame. We then convert the Spark DataFrame into NumPy arrays. Finally, we use Keras to implement the CDEC and Conv-LSTM networks for population-scale clustering and ancestry inference, respectively.
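As a rough illustration of the "variants to numeric features" step, the sketch below turns VCF genotype (GT) strings into integer features by counting non-reference alleles. This encoding is an assumption made for illustration only; the actual feature extraction is done in Scala with ADAM and Spark, and the helper names `alt_allele_count` and `encode_sample` are hypothetical.

```python
from typing import List


def alt_allele_count(gt: str) -> int:
    """Count non-reference alleles in a VCF GT string such as '0|1' or '1/1'.

    Missing calls ('.') are counted as 0, which is a simplifying assumption.
    """
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def encode_sample(genotypes: List[str]) -> List[int]:
    """Encode one sample's genotypes as a list of alt-allele counts."""
    return [alt_allele_count(gt) for gt in genotypes]


# Toy genotypes for a single sample (not real data).
sample = ["0|0", "0|1", "1|1", "./.", "1/0"]
features = encode_sample(sample)
print(features)
```

Each sample then becomes one numeric row, which is the shape a clustering or classification network expects as input.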
Experiments were carried out on a computing cluster with 32 cores running 64-bit Ubuntu 14.04. The software stack consisted of Apache Spark v2.3.0, H2O v3.14.0.1, Sparkling Water v1.2.5, ADAM v0.22.0 and Keras v2.0.9 with the TensorFlow backend. We compare our approach with state-of-the-art tools such as ADMIXTURE and VariationSpark.
Network training was carried out on an Nvidia TitanX GPU with CUDA and cuDNN enabled to make the overall pipeline faster.
For this, first download the VCF files (containing the variants) and the panel file (containing the labels).
Then use featureExtractor.scala to extract the features and save them as a DataFrame in CSV format, to be used by the Keras-based DEC.
For this, make sure that you've configured Spark correctly on your machine. Alternatively, execute this script as a standalone Scala project from Eclipse or IntelliJ IDEA.
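Once the features are saved as CSV, the Python side needs to read them back into a feature matrix and a label vector. The sketch below uses only the standard library and an inline toy CSV. The column layout (a label column named `Region` followed by numeric variant columns) is an assumption, since the layout of the extracted CSV is not described here, and `load_features` is a hypothetical helper.

```python
import csv
import io
from typing import List, Tuple


def load_features(handle, label_column: str = "Region") -> Tuple[List[List[float]], List[str]]:
    """Read a feature CSV into (feature rows, labels).

    `label_column` is an assumed column name; adjust it to the real header.
    """
    reader = csv.DictReader(handle)
    features, labels = [], []
    for row in reader:
        # Remove the label first so only numeric columns remain.
        labels.append(row.pop(label_column))
        features.append([float(value) for value in row.values()])
    return features, labels


# Toy CSV standing in for the extracted features (not real data).
SAMPLE_CSV = """Region,v1,v2,v3
EUR,0,1,2
AFR,2,0,1
EAS,1,1,0
"""

X, y = load_features(io.StringIO(SAMPLE_CSV))
print(len(X), "samples,", len(X[0]), "features, labels:", y)
```

For a real file, pass `open(path, newline="")` instead of the `io.StringIO` wrapper; the resulting lists can be converted with `numpy.asarray` before being fed to Keras.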
There are several Python scripts and a sample genetic variants feature file in CSV format for the clustering and classification:
- genome.csv: the sample genetic variant features
- a script for creating the custom clustering layer in Keras
- a script for performing the convolutional unpooling operation for the convolutional autoencoder part of the network
- a script containing the data preparation helper modules
- a script for creating the CDEC network for the clustering
- the main class that encapsulates all the steps.
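For readers unfamiliar with what a DEC-style clustering layer computes, the sketch below shows the soft assignment and target distribution from the original DEC formulation: a Student's t kernel between each embedded point and each cluster centroid, followed by a sharpened target. It is written in plain Python for clarity. Whether the clustering layer in this repository matches it term for term is an assumption, and `soft_assign` and `target_distribution` are hypothetical names.

```python
def soft_assign(embeddings, centroids, alpha=1.0):
    """Soft cluster assignment q_ij using a Student's t kernel."""
    q = []
    for point in embeddings:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(point, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        # Normalise so each point's assignments sum to 1.
        q.append([value / total for value in row])
    return q


def target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, with f_j = sum_i q_ij."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


# Toy 2-D embeddings and two centroids (not real data).
z = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
mu = [[0.0, 0.0], [5.0, 5.0]]
q = soft_assign(z, mu)
p = target_distribution(q)
print([round(v, 3) for v in q[0]], [round(v, 3) for v in p[0]])
```

During training, a DEC-style model minimises the KL divergence between `p` and `q`, which pulls each point toward its most likely centroid.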
This implementation is partly based on
A modified version of the Keras-based DEC implementation proposed by Ali F. et al. is used in our approach. Network training was carried out on an Nvidia TitanX GPU with CUDA and cuDNN enabled to make the overall pipeline faster.
For this, first download the VCF files (containing the variants) and the panel file (containing the labels).
Then use featureExtractor.scala to extract the features and save them as a DataFrame in CSV format, to be used by the Keras-based DEC.
For this, make sure that you've configured Spark correctly on your machine. Alternatively, execute this script as a standalone Scala project from Eclipse or IntelliJ IDEA.
There are two Python scripts and a sample genetic variants feature file in CSV format for the clustering and classification, respectively:
- genome.csv: the sample genetic variant features
- a script for the clustering
- a script for the classification
For this, first download the VCF files (containing the variants) and the panel file (containing the labels). The Scala scripts are listed below:
- PopGenomicsClassificationSpark.scala: this is the Spark implementation of ethnicity prediction
- PopStratClassification.scala: this is the H2O implementation of ethnicity prediction
- PopStratClustering.scala: this is the H2O/Spark implementation of the genotype clustering, using K-means instead
For this, make sure that you've configured Spark and ADAM correctly on your machine. Alternatively, execute this script as a standalone Scala project from Eclipse or IntelliJ IDEA.
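To make the K-means baseline concrete, here is a minimal sketch of Lloyd's algorithm on toy genotype vectors. It is plain Python, not the H2O or Spark API used by the Scala scripts. The deterministic initialisation (first `k` points) is chosen only to keep the example reproducible, and `kmeans` and `sq_dist` are hypothetical helpers.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm with a deterministic init (first k points)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy alt-allele-count vectors for six samples (not real data).
genotypes = [
    [0, 0, 1], [2, 2, 1], [0, 1, 0],
    [2, 1, 2], [0, 0, 0], [2, 2, 2],
]
labels, centers = kmeans(genotypes, k=2)
print(labels)
```

On real genotype data the number of clusters and the initialisation strategy matter considerably, so a library implementation with multiple restarts is usually the safer choice.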
title={Recurrent Deep Embedding Networks for Genotype Clustering and Ethnicity Prediction},
author={Karim, Md and Cochez, Michael and Beyan, Oya Deniz and Zappa, Achille and Sahay, Ratnesh and Decker, Stefan and Schuhmann, Dietrich-Rebholz and others},
booktitle={arXiv preprint arXiv:1805.12218},
For any questions, feel free to open an issue or contact the authors by email.