Pipeline to carry out comprehensive population genomic analyses.
scalepopgen is a fully automated nextflow-based pipeline that takes vcf files or plink-generated bed files as input and employs a variety of open-source tools to carry out comprehensive population genomic analyses. Additionally, python and R scripts have been developed to combine and (wherever possible) plot the results of the various analyses.
Broadly, the pipeline consists of the following four “sub-workflows”:
- filtering and basic statistics
- explore genetic structure
- phylogeny using treemix
- signature of selection analysis
These four sub-workflows can be run separately or in combination with each other.
The pipeline can be run on any Linux operating system and requires three dependencies: Java, nextflow (Di Tommaso et al., 2017) and a software container or environment system such as conda, mamba, singularity, or docker. It can be run on a local Linux system as well as on high-performance computing (HPC) clusters. Note that all software dependencies of the pipeline are handled by nextflow once it is installed: the user installs only the three dependencies listed above, and nextflow automatically downloads the remaining tools needed for the analyses. scalepopgen was built and tested on nextflow version 22.10.6.5843, conda version 23.1.0 and singularity version 3.8.6.
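Before the first run, it can help to confirm that the three dependencies are visible on your PATH. The following is a minimal sketch, not part of the pipeline: the helper names `have` and `first_available` are hypothetical, and the script only checks that the commands exist, not that their versions match the ones listed above.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not shipped with scalepopgen.

# have: succeed if the given command is found on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# first_available: print the first of the given commands that is on PATH.
first_available() {
  for candidate in "$@"; do
    if have "$candidate"; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Java and nextflow are both required.
for tool in java nextflow; do
  if have "$tool"; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done

# Only one container or environment system is needed.
if backend=$(first_available mamba conda singularity docker); then
  echo "container/environment system: $backend"
else
  echo "missing: one of mamba, conda, singularity, docker"
fi
```

The order in `first_available mamba conda singularity docker` is an arbitrary choice for this sketch; whichever system you actually intend to use is the one to pass to `-profile`.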
To test the pipeline, simply run the following command:
nextflow run scalepopgen.nf -profile mamba,test_genstruct
The output folder will be created here: "…/test_genstruct_out/". The folder will contain interactive plots for PCA, the Fst-based NJ tree, and the IBS-based NJ tree. It will also contain plots for the “ADMIXTURE” analyses. These plots can be customized using the yaml file present inside the "./parameters/plots/" folder. A description of the inputs and outputs of the test run can be found here.
The workflow implements many programs and tools, with the aim of enabling users to perform a wide range of analyses. This also brings with it a large number of parameters that need to be set for each sub-workflow. To make this easier for the user, we developed a graphical user interface (GUI). The GUI is available as an executable in the "scp_config_generator" folder of this repo. With the GUI you can specify analyses and their options by moving through the tabs of each workflow section marked with the arrow. Once you have selected and specified the parameters for the analyses you want to perform, simply click on "File" and then "Save as" to save them as a yml file.
After that, run the pipeline with the command:
nextflow run scalepopgen.nf -params-file analyses.yml -profile <conda,mamba,singularity,docker> -qs <number of processes>
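As an illustration of how the placeholders in this command might be filled in, the sketch below builds the command line from a profile and a queue size. It assumes that `<conda,mamba,singularity,docker>` means choosing one of the four systems and that `analyses.yml` is the file saved from the GUI. The function name `scalepopgen_cmd` is hypothetical, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Hypothetical helper; prints the nextflow command instead of executing it.
# Usage: scalepopgen_cmd <profile> <queue_size> [params_file]
scalepopgen_cmd() {
  profile=$1
  queue_size=$2
  params_file=${3:-analyses.yml}

  # Accept only the four systems named in the placeholder.
  case "$profile" in
    conda|mamba|singularity|docker) ;;
    *) echo "unsupported profile: $profile" >&2; return 1 ;;
  esac

  # -qs expects a number of processes, so reject empty or non-numeric values.
  case "$queue_size" in
    ''|*[!0-9]*) echo "queue size must be a whole number: $queue_size" >&2; return 1 ;;
  esac

  printf 'nextflow run scalepopgen.nf -params-file %s -profile %s -qs %s\n' \
    "$params_file" "$profile" "$queue_size"
}

# Example: mamba profile with at most 4 processes in parallel.
scalepopgen_cmd mamba 4
```

Printing the command first makes it easy to check the arguments before starting a long run; the printed line can then be pasted into the terminal from the folder that contains `scalepopgen.nf`.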
A great advantage of this workflow is the interactive plots generated using bokeh, which are stored in the output folder of the respective analysis. They give the user a graphical interpretation of the results, making it possible to get an immediate impression of the genomic patterns in the analyzed samples. As an example, please take a look at the interactive plots created with cattle data in all the different analyses offered by the workflow.
Note that the read me associated with the workflow will be extensively updated in the coming days. Before using the tiles in your publication, do not forget to include the attribution. This package comes with tiles from Esri, so these rules must be followed: https://developers.arcgis.com/documentation/mapping-apis-and-services/deployment/basemap-attribution/. The papers to cite are listed in the respective read me documentation.
1). complete the read me for all the sub-workflows.
2). test the signature of selection extensively.
3). add validation for all the features of the workflow.
4). a gui to make parameter files