This document contains an overview of the pipeline parameters. They should be defined using the Nextflow configuration file (see below for details).
MINI-AC has 4 main inputs that need to be given as paths or folder names, two of which are optional:
- ACR files: Path of the folder with the BED files containing genomic coordinates corresponding to accessible chromatin regions (minimal format of 3 columns: chromosome, start, stop). This path should be given to the parameter ACR_dir.
- Output folder: Path where the results will be stored. This path should be given to the parameter OutDir.
- (Optional) DEGs file: Path of the folder with tab-separated txt files with differential expression data associated with the input ACRs. The first column must be the gene ID. There can be either a single DEGs file for all input ACR files, or one DEGs file paired with each ACR file. For more details see inputs format example. This path should be given to the parameter DE_genes_dir.
- (Optional) Expressed genes file: Path of the folder with one-column txt files with gene IDs for genes expressed in the biological context of the input ACRs, used to filter the inferred GRNs. There can be either a single expressed genes file for all input ACR files, or one expressed genes file paired with each ACR file. For more details see inputs format example. This path should be given to the parameter Set_genes_dir.
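A quick pre-flight check of the ACR folder can catch format problems before the pipeline starts. The sketch below is not part of MINI-AC: the helper name check_bed and the demo file are hypothetical, and requiring integer start and stop values is an assumption based on the usual BED convention, not something stated above.

```shell
# Hypothetical helper: succeeds only if every non-empty line of a BED file
# has at least 3 tab-separated columns with integer start and stop values.
check_bed() {
    awk -F '\t' '
        NF == 0 { next }   # skip blank lines
        NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ {
            printf "%s: line %d is malformed\n", FILENAME, NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demo on a throwaway folder standing in for ACR_dir.
acr_dir=$(mktemp -d)
printf 'chr1\t100\t250\nchr2\t300\t480\n' > "$acr_dir/sample.bed"

for bed in "$acr_dir"/*.bed; do
    if check_bed "$bed"; then
        echo "OK: $(basename "$bed")"
    fi
done
```

To use it on real data, replace the demo folder with the path that will be given to ACR_dir.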
MINI-AC has several optional parameters that affect the output and some aspects of the network inference process:
- Species: --species arabidopsis (command line) or species = "arabidopsis" (configuration file) for Arabidopsis, --species maize_v4 (command line) or species = "maize_v4" (configuration file) for maize genome version 4, and --species maize_v5 (command line) or species = "maize_v5" (configuration file) for maize genome version 5.
- MINI-AC mode: --mode genome_wide (command line) or mode = "genome_wide" (configuration file) for the genome-wide mode, and --mode locus_based (command line) or mode = "locus_based" (configuration file) for the locus-based mode.
- DEGs parameters: Since providing DEGs files is optional, the parameter DE_genes must state whether a path with DEGs files is provided, by setting DE_genes = true or DE_genes = false. Additionally, if there is only one DEGs file for all the input ACRs, set One_DE_set = true; otherwise set One_DE_set = false.
- Expressed genes files parameters: Since providing expressed genes files is optional, the parameter Filter_set_genes must state whether a path with expressed genes files is provided, by setting Filter_set_genes = true or Filter_set_genes = false. Additionally, if there is only one expressed genes file for all the input ACRs, set One_filtering_set = true; otherwise set One_filtering_set = false.
- Motif enrichment p-value cut-off: This is the p-value cut-off that determines which motifs are enriched and used for GRN building. We do not recommend changing this parameter. It has been internally pre-defined for each MINI-AC mode based on the p-value cut-offs with a false discovery rate of 0 (see publication). If desired, however, this p-value can be overridden by setting the parameter P_val to the chosen value, either in the configuration file (see below) or in the command line options. For example: nextflow -C mini_ac.config run mini_ac.nf --mode genome_wide --species maize_v4 --P_val 0.05
- Overlap criteria parameter: By default, MINI-AC computes motif enrichment by counting the motif matches within ACRs. This, however, is difficult if the ACRs are shorter than or of similar size to the motifs, which is the case for footprints. In this case, we observed that counting the absolute base-pair overlap is useful. Therefore, when using footprints or short ACRs (high resolution), we recommend setting Bps_intersect = true. Otherwise it should be kept as Bps_intersect = false.
- Annotation of second closest gene in genome-wide mode: The parameters Second_gene_annot and Second_gene_dist are only taken into account by the genome-wide mode. In the genome-wide mode the motif matches are annotated to the closest gene, but in genomes like maize there are very distal regulatory elements that regulate non-neighboring genes. Although we showed in the original publication that this does not improve results, we give the possibility of annotating the second closest genes that are within a certain distance from the motif match. To activate this option, set Second_gene_annot = true. If so, use Second_gene_dist to set the maximum distance (in base pairs) allowed between the motif match and the second-closest gene for that gene to be assigned as a target gene.
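The species and mode values listed above can be checked before launching a run. The following is a minimal sketch, not part of MINI-AC: the helper name build_cmd is hypothetical, it only prints the command instead of running it, and passing Bps_intersect on the command line assumes that it can be overridden there in the same way as P_val.

```shell
# Hypothetical helper: validate the choices, then print (not run) the command.
build_cmd() {
    case "$1" in
        genome_wide|locus_based) ;;
        *) echo "unknown mode: $1" >&2; return 1 ;;
    esac
    case "$2" in
        arabidopsis|maize_v4|maize_v5) ;;
        *) echo "unknown species: $2" >&2; return 1 ;;
    esac
    printf 'nextflow -C mini_ac.config run mini_ac.nf --mode %s --species %s --Bps_intersect %s\n' \
        "$1" "$2" "$3"
}

# Example: footprint input on maize v4, so base-pair overlap is switched on.
build_cmd genome_wide maize_v4 true
```

Printing the command first makes it easy to review the options before starting a long run.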
The configuration (or "config") file is a file that Nextflow uses to manage and specify the input and parameter settings of a pipeline. For more details, read the Nextflow documentation. Here we review the main aspects of the configuration file when running MINI-AC.
The code snippet below shows how to set the above-mentioned parameters in the configuration file, with the default and recommended settings:
params {
    //// Output folder
    OutDir = "/absolute/path/to/output/directory"

    //// Required input
    ACR_dir = "/absolute/path/to/acrs/directory"

    //// Optional input
    // Differential expression data
    DE_genes = true
    DE_genes_dir = "/absolute/path/to/degs_genes/directory"
    One_DE_set = true
    // Expression data
    Filter_set_genes = true
    Set_genes_dir = "/absolute/path/to/ex_files/directory"
    One_filtering_set = true

    //// Prediction parameters
    Bps_intersect = false
    // P_val = 0.01 // This is commented because we do not recommend changing it.

    //// Prediction parameters only genome-wide
    Second_gene_annot = false
    Second_gene_dist = 500
}
The work directory is where the temporary files are created and where Nextflow stores the files of the different processes. By default this directory is created in the folder where the pipeline is executed, but we recommend setting it to a scratch or tmp folder. To set it in the configuration file, add and edit the following line:
workDir = '/absolute/path/to/work/dir'
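One way to obtain a suitable scratch location is to create a private temporary directory and paste its path into the configuration file. This is only a sketch: the mini_ac_work prefix is an arbitrary, hypothetical name, and /tmp is assumed as the fallback when TMPDIR is not set.

```shell
# Create a private scratch directory; falls back to /tmp when TMPDIR is unset.
work_dir=$(mktemp -d "${TMPDIR:-/tmp}/mini_ac_work.XXXXXX")

# Print the line to paste into the configuration file.
printf "workDir = '%s'\n" "$work_dir"
```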
Singularity is a container platform. It allows the Nextflow pipeline to run in a pre-determined environment that ensures reproducibility. We created a Docker image with the necessary dependencies for MINI-AC to run, so the user does not need to install any of them. We strongly recommend always running MINI-AC using the specified Singularity container. To do so, include the following code in the configuration file:
process.container = "vibpsb/mini-ac:latest"
singularity {
    enabled = true
    cacheDir = "singularity_cache"
    autoMounts = true
}
Sometimes the temporary directory used by Singularity is not in the same root path as the pipeline, which can cause Singularity to struggle to find it. In this case, add the runOptions line below with the absolute path to the tmp folder. To find the absolute path to the tmp folder on Linux, run echo $TMPDIR on the command line. Then add it as shown below.
process.container = "vibpsb/mini-ac:latest"
singularity {
    enabled = true
    cacheDir = "singularity_cache"
    autoMounts = true
    runOptions = "--bind /absolute/path/to/tmp/folder"
}
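The runOptions line can be generated from the current environment instead of being typed by hand. The sketch below assumes that /tmp is the right fallback when TMPDIR is not set, which may not hold on every cluster.

```shell
# Resolve the temporary directory to an absolute, symlink-free path
# (assumes /tmp when TMPDIR is unset).
tmp_dir=$(cd "${TMPDIR:-/tmp}" && pwd -P)

# Print the line to paste inside the singularity { } scope.
printf 'runOptions = "--bind %s"\n' "$tmp_dir"
```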
The executor of the pipeline is the system where the pipeline processes run and which supervises their execution. It can be a computer, a cluster resource manager, or the cloud. The executor can be specified in the configuration file. To run the pipeline locally on a regular computer, add the code below to the configuration file:
executor {
    name = 'local'
}
MINI-AC was developed on an SGE computer cluster, for which we used the configuration below. This was used to run the genome-wide mode on maize using an input dataset of ~600,000 MOA-seq peaks. For smaller datasets, the memory values can be reduced further. For Arabidopsis, a species with a smaller genome, less memory can also be used.
executor {
    name = 'sge'
    queueSize = 25
}
process {
    withName: get_ACR_shufflings {
        clusterOptions = '-l h_vmem=4G'
    }
    withName: getStats {
        clusterOptions = '-l h_vmem=10G'
    }
    withName: getStats_bps {
        clusterOptions = '-l h_vmem=50G'
    }
    withName: getNetwork {
        clusterOptions = '-l h_vmem=20G'
    }
    withName: filterSetOfGenes {
        clusterOptions = '-l h_vmem=5G'
    }
    withName: GOenrichment {
        clusterOptions = '-l h_vmem=5G'
    }
    withName: getIntegrativeOutputs {
        clusterOptions = '-l h_vmem=3G'
    }
}
The MINI-AC Nextflow pipeline contains a set of pre-defined parameter files specified within the main pipeline script (mini_ac.nf). This is because they are fixed data files for each MINI-AC mode and species' genome version. However, there are cases where the user might want to change some of these files. Nextflow makes it easy to change these parameter files, either through the command line options or in the configuration file, thanks to a hierarchical prioritization of the configuration sources (listed from highest to lowest priority):
- Parameters specified on the command line (--something value)
- Config file specified using the -C mini_ac.config option
- The config file named nextflow.config in the current directory
- The config file named nextflow.config in the workflow project directory
- Values defined within the pipeline script itself (e.g. main.nf)
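The effect of this ordering can be illustrated with a small stand-alone sketch. It is an analogy only and does not reflect how Nextflow is implemented: the helper resolve_param is hypothetical, and the three values are illustrative placeholders rather than MINI-AC's real defaults.

```shell
# Hypothetical resolver: return the first non-empty value, checked in the
# same order as the priority list (command line, config file, script default).
resolve_param() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

cli_value=""            # nothing passed as --P_val on the command line
config_value="0.05"     # P_val = 0.05 set in the configuration file
script_default="0.01"   # illustrative placeholder for the in-script value

# The config value is used because no command-line value was given.
resolve_param "$cli_value" "$config_value" "$script_default"
```

If cli_value were set, it would be returned instead, mirroring how a command-line option takes precedence over the configuration file.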
Therefore, if the user wishes to change any of these parameters, it is possible to do so either through the command line options or in the config file.
There are two main cases in which the user might want to alter the internal MINI-AC files, which are explained below.
By default, the maize MINI-AC locus-based mode (for both genome versions) runs on the "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, we generated two additional motif mapping files for the locus-based mode of maize, which cover the "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns) and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. For Arabidopsis only the "medium" non-coding genomic space motif mapping file was generated because it already covers 73.5% of the whole non-coding genomic space (see publication). To use these files, they first need to be downloaded, and then the corresponding parameters of the motif mapping file (MotMapsFile) and the non-coding genomic space coordinates file (Promoter_file) should be modified either on the command line or in the configuration file.
To download the maize "large" motif mapping file and coordinates of the "large" non-coding genomic space:
For maize RefGen_v4 large locus-based mode files
wget https://zenodo.org/record/7974527/files/zma_locus_based_motif_mappings_15kbup_2.5kbdown.bed?download=1 -O data/zma_v4/zma_v4_locus_based_motif_mappings_15kbup_2.5kbdown.bed
wget https://zenodo.org/record/7974527/files/zma_promoter_15kbup_2.5kbdown_sorted.bed?download=1 -O data/zma_v4/zma_v4_promoter_15kbup_2.5kbdown_sorted.bed
For maize RefGen_v5 large locus-based mode files
wget https://zenodo.org/record/8386283/files/zma_v5_locus_based_motif_mappings_15kbup_2.5kbdown_sorted.bed?download=1 -O data/zma_v5/zma_v5_locus_based_motif_mappings_15kbup_2.5kbdown.bed
wget https://zenodo.org/record/8386283/files/zma_v5_promoter_15kbup_2.5kbdown_sorted.bed?download=1 -O data/zma_v5/zma_v5_promoter_15kbup_2.5kbdown_sorted.bed
To download the maize "small" motif mapping file and coordinates of the "small" non-coding genomic space:
For maize RefGen_v4 small locus-based mode files
wget https://zenodo.org/record/7974527/files/zma_locus_based_motif_mappings_1kbup_1kbdown.bed?download=1 -O data/zma_v4/zma_v4_locus_based_motif_mappings_1kbup_1kbdown.bed
wget https://zenodo.org/record/7974527/files/zma_promoter_1kbup_1kbdown_sorted.bed?download=1 -O data/zma_v4/zma_v4_promoter_1kbup_1kbdown_sorted.bed
For maize RefGen_v5 small locus-based mode files
wget https://zenodo.org/record/8386283/files/zma_v5_locus_based_motif_mappings_1kbup_1kbdown_sorted.bed?download=1 -O data/zma_v5/zma_v5_locus_based_motif_mappings_1kbup_1kbdown.bed
wget https://zenodo.org/record/8386283/files/zma_v5_promoter_1kbup_1kbdown_sorted.bed?download=1 -O data/zma_v5/zma_v5_promoter_1kbup_1kbdown_sorted.bed
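Because an interrupted download can leave a missing or empty file, it may be worth confirming that both files arrived before pointing the pipeline at them. The sketch below is not part of MINI-AC: the helper check_downloads is hypothetical, and the demo files are small stand-ins for the real downloads.

```shell
# Hypothetical helper: report files that are missing or empty.
check_downloads() {
    rc=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Demo with stand-in files; in practice pass the two paths given after -O.
demo_dir=$(mktemp -d)
printf 'chr1\t10\t20\n' > "$demo_dir/motif_mappings.bed"
printf 'chr1\t0\t1000\n' > "$demo_dir/promoter.bed"

if check_downloads "$demo_dir/motif_mappings.bed" "$demo_dir/promoter.bed"; then
    echo "downloads look complete"
fi
```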
Then (using the "small" definition as an example), change the parameters on the command line:
nextflow -C mini_ac.config run mini_ac.nf --mode locus_based --species maize_v4 --MotMapsFile data/zma_v4/zma_v4_locus_based_motif_mappings_1kbup_1kbdown.bed --Promoter_file data/zma_v4/zma_v4_promoter_1kbup_1kbdown_sorted.bed
or add them to the configuration file, along with the other parameters:
params {
    /// [Other parameters...]
    MotMapsFile = "$projectDir/data/zma_v4/zma_v4_locus_based_motif_mappings_1kbup_1kbdown.bed"
    Promoter_file = "$projectDir/data/zma_v4/zma_v4_promoter_1kbup_1kbdown_sorted.bed"
    /// [Other parameters...]
}
To perform the functional network analysis, an internal gene-GO annotation file is used for each species. They were obtained as described here for Arabidopsis and here and here for maize. However, if the user wants to use a custom GO-gene file, the parameter Feature_file should be modified either on the command line or in the configuration file.
nextflow -C mini_ac.config run mini_ac.nf --mode locus_based --species maize_v4 --Feature_file custom_go_gene.txt
params {
    /// [Other parameters...]
    Feature_file = "custom_go_gene.txt"
    /// [Other parameters...]
}
It is important, however, to make sure that the format is correct. The GO terms should be extended for parental terms, and this file should contain two tab-separated columns (no header), where the first column is the GO ID, and the second column is the gene ID, as shown here. It is vital that the gene IDs follow either the Araport11 (Arabidopsis) or the AGPv4/NAM5.0 (maize) annotation.
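The two-column layout can be verified with a short check before running the pipeline. This is a sketch under stated assumptions: the helper check_go_file is hypothetical, the demo rows are placeholders rather than real annotations, and the check covers only the layout (it cannot tell whether terms were extended to parental terms or whether gene IDs belong to the right annotation).

```shell
# Hypothetical helper: every line must be "GO:<digits><TAB><gene ID>",
# with exactly two columns and no header.
check_go_file() {
    awk -F '\t' '
        NF != 2 || $1 !~ /^GO:[0-9]+$/ || $2 == "" {
            printf "line %d is malformed: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Demo file with placeholder rows (not real GO annotations).
go_file=$(mktemp)
printf 'GO:0006355\tAT1G01010\nGO:0009733\tAT1G01020\n' > "$go_file"

if check_go_file "$go_file"; then
    echo "format OK"
fi
```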
This same principle can also be applied to other parameters that the user wants to change.
The configuration file of iCREs-based MINI-AC has a similar structure and input parameters to regular MINI-AC (given that it runs genome-wide MINI-AC "under the hood"). The parameter ACR_dir should be replaced by Gene_list_dir. This parameter should be the path to a directory containing files in ".txt" format, with each line containing a maize gene ID from the V4 or V5 genome version. One example can be found here. One GRN will be predicted for each input file.
There is an additional input parameter named --icres_set, which can be either all or maxf1. The value all uses a more comprehensive and complete collection of maize putative CREs, while maxf1 uses a set of putative CREs that is smaller but more precise (fewer false positives).
One example of the parameters configuration from the file mini_ac_icres.config can be found below:
params {
    //// Output folder
    OutDir = "$projectDir/example/outputs_icres"

    //// Required input
    Gene_list_dir = "$projectDir/example/inputs/gene_set_files"

    //// Optional input
    // Differential expression data
    DE_genes = false
    DE_genes_dir = "$projectDir/example/inputs/de_files"
    One_DE_set = true
    // Expression data
    Filter_set_genes = false
    Set_genes_dir = "$projectDir/example/inputs/exp_genes_files"
    One_filtering_set = true

    //// Prediction parameters
    Bps_intersect = false

    //// Prediction parameters only genome-wide
    Second_gene_annot = false
    Second_gene_dist = 500
}
This version of MINI-AC can also be run with DE_genes = true and Filter_set_genes = true. However, the DEGs and expressed genes files should be named accordingly, with the same name as the input file, followed by _icres_ and _degs_table.txt and/or _expressed_genes.txt. For example, in the case of the input file UP_gene_set.txt, the corresponding DEGs and expressed genes files should be named UP_gene_set_icres_degs_table.txt and UP_gene_set_icres_expressed_genes.txt, respectively.
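The naming rule can be applied mechanically to any gene list file. The helper below is a hypothetical sketch, not part of MINI-AC, that prints the two expected file names for a given input file.

```shell
# Hypothetical helper: derive the expected DEGs and expressed genes file
# names from a gene list file name ending in .txt.
expected_names() {
    base=$(basename "$1" .txt)
    printf '%s_icres_degs_table.txt\n' "$base"
    printf '%s_icres_expressed_genes.txt\n' "$base"
}

expected_names UP_gene_set.txt
```

For UP_gene_set.txt this prints the two names given in the example above, one per line.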