This README summarizes the format of the input and output files of MINI-AC in general and, more specifically, the files of the example test used to run MINI-AC. WARNING: some output files were shortened to only the first 100 lines so that they could be uploaded. To get the full output files, please run the pipeline test on your machine.
The INPUTS folder contains three sub-folders:
- acr_files: contains the BED files with the genomic coordinates of the ACRs in a minimal format of 3 columns: chromosome, start and stop
- bundle_sheath_marand_top10k.bed: cell-type specific ACRs of bundle sheath.
- mesophyll_marand_top10k.bed: cell-type specific ACRs of mesophyll.
- de_files: contains tab-separated tables with differential expression (DE) analysis results. The only format requirements are that the first row is the header (column names) and that the first column contains gene IDs. There is no requirement for the number of columns or their content, although they should contain statistics associated with a DE analysis. There can be one single file for all the input ACRs, in which case the name of the file should end with "*_degs_table.txt". Alternatively, there can be paired ACR-DE files, in which case each file needs to have the same name as the corresponding ACR file with "_degs_table.txt" added. For this specific example, paired ACR-DE files would be named bundle_sheath_marand_top10k_degs_table.txt and mesophyll_marand_top10k_degs_table.txt.
- maize_leaf_celltypes_degs_table.txt: DE analysis table.
- exp_genes_files: contains single-column txt files with the gene IDs that the functional networks should be filtered for. For example, if we are interested in a specific tissue or cell type, these can be the genes expressed in that tissue or cell type. There can be a single file for all the ACRs, or files paired with the ACRs. In either case, the naming format is the same as for the "de_files" described above, except that the file names have to end with "*_expressed_genes.txt".
- maize_leaf_expressed_genes.txt
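As an illustration of the naming rules above, the following minimal Python sketch derives the file names expected for paired datasets from an ACR file name, and checks that a line follows the minimal 3-column BED layout. The helper names `paired_input_names` and `is_minimal_bed_line` are hypothetical and not part of MINI-AC, and the check that start is smaller than stop is an assumption based on the usual BED convention.

```python
from pathlib import Path


def paired_input_names(acr_file):
    """Return the DE and expressed-genes file names paired with an ACR BED file."""
    stem = Path(acr_file).stem  # e.g. "mesophyll_marand_top10k"
    return {
        "de_table": f"{stem}_degs_table.txt",
        "expressed_genes": f"{stem}_expressed_genes.txt",
    }


def is_minimal_bed_line(line):
    """Check that a line has chromosome, start and stop with integer coordinates."""
    fields = line.split()
    if len(fields) < 3:
        return False
    start, stop = fields[1], fields[2]
    # Assumption: coordinates are non-negative integers with start < stop.
    return start.isdecimal() and stop.isdecimal() and int(start) < int(stop)


names = paired_input_names("bundle_sheath_marand_top10k.bed")
print(names["de_table"])
print(names["expressed_genes"])
print(is_minimal_bed_line("chr1\t1000\t1500"))
```

For the bundle sheath example, the derived DE table name matches the paired name given above for "de_files".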
The OUTPUTS folder contains four sub-folders:
- enrichment_stats: Contains one file per input ACR file with the motif enrichment statistics (bundle_sheath_marand_top10k_allshuff_sorted_miniac_stats.txt and mesophyll_marand_top10k_allshuff_sorted_miniac_stats.txt). Each row shows the enrichment statistics of one motif, in the following columns:
- dataset: dataset name (ACR file name).
- input_total_peaks: Total number of non-overlapping peaks in the dataset.
- motif: Motif ID from JASPAR 2020 or CisBP version 2.00.
- real_int: Number of motif matches within the ACRs.
- shuffled_int: Number of motif matches in the background set of ACRs generated by shuffling through the non-coding genomic space.
- p_val: p-value of motif enrichment significance.
- adj_pval: p-value of motif enrichment significance corrected for multiple testing using Benjamini-Hochberg method.
- enr_fold: The motif enrichment fold indicates how much more frequently motif matches occur within the ACRs than expected by chance (real_int divided by shuffled_int).
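The columns above can be read with a few lines of Python. This is only a sketch: it assumes the statistics file is tab-separated with a header row using the column names listed above, the 0.05 threshold is an arbitrary choice for illustration, and the rows are toy values rather than real MINI-AC output.

```python
import csv
import io

# Toy rows with made-up values, NOT real MINI-AC output.
TOY_STATS = (
    "dataset\tinput_total_peaks\tmotif\treal_int\tshuffled_int\t"
    "p_val\tadj_pval\tenr_fold\n"
    "toy\t100\tmotif_A\t40\t20\t0.001\t0.002\t2.0\n"
    "toy\t100\tmotif_B\t10\t10\t0.5\t0.5\t1.0\n"
)


def enrichment_fold(real_int, shuffled_int):
    """Enrichment fold as described above: real_int divided by shuffled_int."""
    return real_int / shuffled_int


def significant_motifs(handle, alpha=0.05):
    """Return the motif IDs whose adjusted p-value is below alpha."""
    # Assumption: tab-separated text whose first row holds the column names.
    reader = csv.DictReader(handle, delimiter="\t")
    return [row["motif"] for row in reader if float(row["adj_pval"]) < alpha]


hits = significant_motifs(io.StringIO(TOY_STATS))
print(hits)
print(enrichment_fold(40, 20))
```

To use it on a real file, replace the `io.StringIO` object with an open file handle.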
- networks: Contains one file per input ACR file with the predicted GRN in edge-list format. Each row is a transcription factor-target gene interaction. If a file with expressed genes is given, two files are generated: one with the full network and one with the network filtered for the genes provided (bundle_sheath_marand_top10k_network.txt, bundle_sheath_marand_top10k_network_filtered.txt, mesophyll_marand_top10k_network.txt and mesophyll_marand_top10k_network_filtered.txt). The columns are:
- TF: Gene ID of the transcription factor.
- TG: Gene ID of the target gene.
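An edge list like this is often easier to work with once the target genes are grouped per transcription factor (a regulon). The sketch below shows one way to do that, assuming a tab-separated file with a "TF" and "TG" header row; the function name `read_regulons` and the gene IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy edge list with made-up gene IDs, NOT real MINI-AC output.
TOY_NETWORK = "TF\tTG\ntf_1\tgene_a\ntf_1\tgene_b\ntf_2\tgene_a\n"


def read_regulons(handle):
    """Group target genes by transcription factor."""
    regulons = defaultdict(set)
    # Assumption: tab-separated file whose header row is "TF<TAB>TG".
    for row in csv.DictReader(handle, delimiter="\t"):
        regulons[row["TF"]].add(row["TG"])
    return dict(regulons)


regulons = read_regulons(io.StringIO(TOY_NETWORK))
for tf, targets in sorted(regulons.items()):
    print(tf, len(targets))
```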
- GO_enrichment: Contains one file per input ACR file with the functional enrichment analysis of the predicted GRNs (bundle_sheath_marand_top10k_GO_enrichment.txt and mesophyll_marand_top10k_GO_enrichment.txt). If a file with expressed genes is given, the functional analysis is done on the filtered network and not on the original one. The columns are:
- column 1: Transcription factor gene ID.
- column 2: Gene ontology term ID.
- column 3: p-value of GO enrichment.
- column 4: q-value of GO enrichment (p-value corrected for multiple testing with Benjamini-Hochberg method).
- column 5: Enrichment fold of gene ontology enrichment.
- column 6: Number of target genes in regulons with gene ontology annotation.
- column 7: Number of genes annotated to the gene ontology term indicated in column 2.
- column 8: Overlap between column 6 and column 7, meaning number of target genes in the regulon annotated to the gene ontology term indicated in column 2.
- column 9: Gene IDs of the target genes indicated in column 8.
- column 10: Description of the gene ontology term indicated in column 2.
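For readers unfamiliar with the Benjamini-Hochberg correction behind the q-values in column 4 (and behind adj_pval in the motif enrichment statistics), the following is a generic, textbook-style sketch of the procedure. It is not taken from the MINI-AC code, which may rely on a library implementation, and the p-values are toy values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(q_values)
```

Each p-value is multiplied by the number of tests and divided by its rank, and the running minimum keeps a larger p-value from receiving a smaller q-value than a smaller one.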
- integrative_outputs: Contains a group of files that, per input ACR file, integrate the motif enrichment, network and GO enrichment results with the expression data provided by the user.
- TF-centered output: Enrichment information per TF (bundle_sheath_marand_top10k_TF_centric.xlsx and mesophyll_marand_top10k_TF_centric.xlsx). Since one TF can be associated with multiple motifs, the motif enrichment statistics are collapsed per TF. It integrates motif enrichment, network, GO enrichment and DE information, as well as metadata of each TF. The association of one TF with multiple motifs can cause large jumps in motif rank. Below, the content of the columns is explained in groups.
- TF and motif rank:
- Dataset name.
- Gene ID.
- TF rank based on the rank of the motif with minimum enrichment rank associated with the TF.
- Motif ID of the motif with minimum rank associated with the TF.
- Motif enrichment rank of the motif in the 4th column.
- q-value of motif enrichment for the motif indicated in the 4th column.
- TFs metadata:
- TF family.
- Gene name (according to MaizeGDB).
- Gene description (according to MaizeGDB).
- Arabidopsis ortholog gene ID (according to PLAZA monocots 4.5).
- Arabidopsis gene name (according to TAIR).
- Maize gene name and Arabidopsis ortholog gene name combined.
- (Optional; if expressed genes provided) True if the TF is present in the user-provided list of expressed genes, False otherwise.
- (Optional; if DE table provided) Differential expression information. The first column is the gene ID. The remaining columns depend on the content of the user-provided table in the input folder "de_files".
- Network data:
- Total number of target genes for the TF in the predicted GRN.
- (Optional; if DE table provided) Total number of target genes for the TF in the predicted GRN that are DE.
- (Optional; if DE table provided) Percentage of target genes for the TF in the predicted GRN that are DE.
- Functional network data:
- Gene ontology terms that yielded enrichment for the TF's TGs (regulon).
- q-values associated with the gene ontology enrichment.
- Motif enrichment data:
- Motif IDs of the motifs associated with the TF.
- Enrichment folds of the motifs associated with the TF.
- q-values of the motifs associated with the TF.
- Motif enrichment ranks of the motifs associated with the TF (based on pi-value).
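The collapsing of motif statistics per TF described above can be sketched as follows: each TF keeps the associated motif with the lowest (best) enrichment rank, and the network columns count how many of its target genes are DE. Both helper functions and all IDs are hypothetical; how MINI-AC handles ties or TFs without a ranked motif is not covered here.

```python
def best_motif_per_tf(tf_to_motifs, motif_rank):
    """For each TF, keep the associated motif with the lowest (best) rank."""
    best = {}
    for tf, motifs in tf_to_motifs.items():
        ranked = [m for m in motifs if m in motif_rank]
        if ranked:
            top = min(ranked, key=motif_rank.get)
            best[tf] = (top, motif_rank[top])
    return best


def de_target_summary(targets, de_genes):
    """Return the number and the percentage of target genes that are DE."""
    targets = set(targets)
    n_de = len(targets & set(de_genes))
    pct = 100.0 * n_de / len(targets) if targets else 0.0
    return n_de, pct


# Toy inputs with made-up IDs, NOT real MINI-AC output.
tf_to_motifs = {"tf_1": ["motif_A", "motif_C"], "tf_2": ["motif_B"]}
motif_rank = {"motif_A": 5, "motif_B": 2, "motif_C": 1}

best = best_motif_per_tf(tf_to_motifs, motif_rank)
summary = de_target_summary(["gene_a", "gene_b", "gene_c", "gene_d"], ["gene_a", "gene_x"])
print(best)
print(summary)
```

In this toy case tf_1 is represented by motif_C (rank 1) even though its other motif, motif_A, only ranks 5th.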
- Motif-centered output: Enrichment information per motif (bundle_sheath_marand_top10k_motif_centric.xlsx and mesophyll_marand_top10k_motif_centric.xlsx). Since one motif can be associated with multiple TFs, the TF information is collapsed per motif. It integrates motif enrichment, associated TFs and expression information, which makes the enrichment statistics easier to explore.
- Motif enrichment statistics:
- Dataset name.
- Motif ID.
- Number of motif matches in ACR.
- Number of motif matches in the background set of ACRs generated by shuffling through the non-coding genomic space.
- Motif enrichment p-value.
- Motif enrichment fold.
- Motif enrichment q-value (p-value adjusted for multiple testing).
- Motif enrichment pi-value (-log10(p-value)*enrichment fold).
- Motif enrichment rank based on pi-value.
- TF family.
- (Optional; if DE table provided) Differential expression information. The first column is the gene ID, and the remaining columns depend on the content of the user-provided table in the input folder "de_files".
- TFs metadata:
- Gene name (according to MaizeGDB).
- Gene description (according to MaizeGDB).
- Arabidopsis ortholog gene ID (according to PLAZA monocots 4.5).
- Arabidopsis gene name (according to TAIR).
- Maize gene name and Arabidopsis ortholog gene name combined.
- (Optional; if expressed genes provided) True if any of the TFs associated with the motif is present in the user-provided list of expressed genes, False otherwise.
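The pi-value and the rank derived from it can be reproduced from the p-value and the enrichment fold. The sketch below follows the formula given above, -log10(p-value) * enrichment fold. It assumes that the highest pi-value receives rank 1 and that a p-value of 0 is treated as maximally significant; neither detail is stated in this README. The values are toy numbers.

```python
import math


def pi_value(p_val, enr_fold):
    """pi-value = -log10(p-value) * enrichment fold."""
    if p_val <= 0:
        # Assumption: a p-value of 0 is treated as maximally significant.
        return math.inf
    return -math.log10(p_val) * enr_fold


def rank_by_pi_value(stats):
    """Map motif ID -> rank, where rank 1 has the highest pi-value."""
    # Assumption: larger pi-values rank first; ties keep the input order.
    ordered = sorted(stats, key=lambda m: pi_value(*stats[m]), reverse=True)
    return {motif: rank for rank, motif in enumerate(ordered, start=1)}


# Toy (p-value, enrichment fold) pairs, NOT real MINI-AC output.
toy_stats = {
    "motif_A": (0.001, 2.0),
    "motif_B": (0.01, 4.0),
    "motif_C": (0.5, 1.0),
}
ranks = rank_by_pi_value(toy_stats)
print(ranks)
```

Here motif_B ranks above motif_A despite its larger p-value, because its higher enrichment fold gives it a larger pi-value.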
- GO enrichment output: GO enrichment results integrated with TF metadata and DE information (bundle_sheath_marand_top10k_GO_enrichment.xlsx and mesophyll_marand_top10k_GO_enrichment.xlsx).
- GO enrichment results:
- The first 3 and the last 7 columns are the same as in the raw GO enrichment output.
- TFs metadata:
- TF family.
- Gene ID of the transcription factor.
- Gene name (according to MaizeGDB).
- Gene description (according to MaizeGDB).
- Arabidopsis ortholog gene ID (according to PLAZA monocots 4.5).
- Arabidopsis gene name (according to TAIR).
- Maize gene name and Arabidopsis ortholog gene name combined.
- (Optional; if expressed genes provided) True if the TF is present in the user-provided list of expressed genes, False otherwise.
- (Optional; if DE table provided) Differential expression information. The first column is the gene ID, and the remaining columns depend on the content of the user-provided table in the input folder "de_files".
- Functional GRN: network formatted as an edge list with additional columns containing functional enrichment data, meant to be used for network visualization in Cytoscape (bundle_sheath_marand_top10k_functional_network.txt and mesophyll_marand_top10k_functional_network.txt). The columns are:
- TF: Gene ID of transcription factor.
- TG: Gene ID of target gene.
- GO term: GO term associated with the interaction, if the interaction was present in a regulon that yielded GO enrichment.
- q-value: q-value of the GO enrichment.
- enrichment_fold: enrichment fold of the GO enrichment.
- GRN node attributes: tab-separated file with information and metadata about the network genes, meant to be used for network visualization in Cytoscape (bundle_sheath_marand_top10k_node_attributes.txt and mesophyll_marand_top10k_node_attributes.txt). The columns are:
- Node gene ID.
- Type of node: TF if transcription factor or TG if target gene.
- Motif ID of the motif with minimum rank associated with the TF.
- Motif enrichment rank of the motif indicated in the 3rd column.
- q-value of motif enrichment for the motif in the 3rd column.
- Enrichment fold of the motif in the 3rd column.
- pi-value of motif enrichment for the motif in the 3rd column.
- Gene name (according to MaizeGDB).
- Gene description (according to MaizeGDB).
- Arabidopsis ortholog gene ID (according to PLAZA monocots 4.5).
- Arabidopsis gene name (according to TAIR).
- Maize gene name and Arabidopsis ortholog gene name combined.
- (Optional; if expressed genes provided) True if the TF is present in the user-provided list of expressed genes, False otherwise.
- (Optional; if DE table provided) Differential expression information. The first column is the gene ID, and the remaining columns depend on the content of the user-provided table in the input folder "de_files".
The outputs of the iCREs-based MINI-AC runs are identical in format to those of the default MINI-AC, as can be seen in the folder outputs_icres (not available until publication). However, two input parameters change:
- Instead of providing an input BED file with genomic coordinates, the input should be a list of gene IDs from version V4 or V5 of the maize genome, as in this example.
- There is an additional input parameter named "--icres_set" that can be either "all" or "maxf1". The value "all" uses a more comprehensive and complete collection of maize putative CREs, while "maxf1" uses a set of putative CREs that is smaller but more precise (fewer false positives). To download the files with the genomic coordinates of these two iCREs sets, the following commands should be executed in the top-level directory of the repository:
NOT AVAILABLE UNTIL PUBLICATION