Additional information

Gene descriptors and PCA variables

The following variables are computed for each gene and the results are written to gene_table_taxon_assignment.csv. Any combination of these can be used in the PCA.
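As an illustration of how a subset of these variables might be pulled out of the table before a PCA, here is a minimal sketch using only the Python standard library. It assumes the file is comma-separated with a header row and that all selected columns are numeric and free of missing values. The function name `load_pca_matrix`, the standardization step, and the toy values are illustrative assumptions, not part of the tool.

```python
import csv
import io
import statistics


def load_pca_matrix(handle, variables):
    """Return (gene names, standardized rows) for the chosen variables.

    Hypothetical helper: assumes a comma-separated table with a header
    row and purely numeric, non-missing values in the selected columns.
    """
    rows = list(csv.DictReader(handle))
    names = [row["g_name"] for row in rows]
    columns = []
    for var in variables:
        values = [float(row[var]) for row in rows]
        mean = statistics.mean(values)
        sd = statistics.pstdev(values)
        # A constant column has no variance; map it to zeros
        # instead of dividing by zero.
        columns.append([(v - mean) / sd if sd else 0.0 for v in values])
    # Transpose from one list per variable to one list per gene.
    matrix = [list(gene_row) for gene_row in zip(*columns)]
    return names, matrix


# Toy table with invented values, standing in for
# gene_table_taxon_assignment.csv.
toy_csv = io.StringIO(
    "g_name,c_name,g_len,g_cov,g_gc_cont\n"
    "gene1,contigA,900,30.0,41.0\n"
    "gene2,contigA,1500,28.5,39.5\n"
    "gene3,contigB,600,95.0,55.0\n"
)
names, matrix = load_pca_matrix(toy_csv, ["g_len", "g_cov"])
```

Each row of `matrix` then corresponds to one gene and each column to one selected variable, which is the shape most PCA implementations expect.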

Contig related

c_name name of the contig the given gene is located on

Positional
- c_num_of_genes number of genes annotated on the given contig
- c_len contig length
- c_pct_assembly_len percentage of assembly length: proportion of total assembly length contained in the given contig
- c_genelenm mean gene length on the given contig
- c_genelensd standard deviation of the gene lengths in the set of genes annotated on the given contig
Coverage
- c_cov mean read coverage of the given contig
- c_covsd standard deviation of read coverage along the given contig
- c_covdev extent to which 'c_cov' deviates from the mean assembly coverage, in units of total assembly coverage SD
- c_genecovm mean read coverage in the set of genes annotated on the given contig
- c_genecovsd standard deviation of the read coverage in the set of genes annotated on the given contig
Sequence composition
- c_pearson_r Pearson’s correlation coefficient for the tetranucleotide-derived z-score vectors of the given contig and the overall assembly
- c_pearson_p probability (p-value) for two random sequences to show a correlation that is at least as high as 'c_pearson_r'
- c_gc_cont percentage of GC content in the given contig
- c_gcdev extent to which 'c_gc_cont' deviates from the mean assembly GC content, in units of total assembly GC SD
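
The "deviation in units of SD" descriptors such as c_covdev and c_gcdev can be read as a z-score. The sketch below shows that reading under stated assumptions: the assembly mean and SD are taken over the per-contig values without length weighting, the population SD is used, and the sign of the deviation is kept. The tool may compute these differently (for example per base). The function name `deviation_in_sd` and the toy coverages are hypothetical.

```python
import statistics


def deviation_in_sd(value, reference_values):
    """Signed deviation of value from the reference mean, in units of SD."""
    mean = statistics.mean(reference_values)
    sd = statistics.pstdev(reference_values)
    if sd == 0:
        # All reference values are identical; report no deviation.
        return 0.0
    return (value - mean) / sd


# Toy per-contig mean coverages (invented values).
contig_cov = {"contigA": 30.0, "contigB": 32.0, "contigC": 90.0}
all_cov = list(contig_cov.values())
c_covdev = {name: deviation_in_sd(cov, all_cov) for name, cov in contig_cov.items()}
```

The same function could be applied to GC content to obtain a value in the spirit of c_gcdev.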

Gene related

g_name gene name (as given in the GFF file, 9th column, ID attribute)

Positional
- g_len length of the given gene
- g_lendev_c extent to which 'g_len' deviates from the corresponding c_genelenm, in units of 'c_genelensd'
- g_lendev_o extent to which 'g_len' deviates from the average gene length in the assembly, in units of the length SD over the complete set of genes in this assembly
- g_abspos absolute position of the gene (the closer to the contig centre, the closer 'g_abspos' is to 0; the closer to either contig end, the closer 'g_abspos' is to 1)
- g_terminal classifier of terminal genes (1 if gene is terminal, 0 if not)
- g_single classifier of single genes (1 if the given gene is the only gene annotated on its contig, 0 if not)
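
One way to obtain a position value with the behaviour described for g_abspos (0 at the contig centre, 1 at either end) is to scale the distance between the gene midpoint and the contig centre by half the contig length. This is an assumed formula for illustration, not necessarily the one the tool uses. The flags below likewise assume that "terminal" means the first or last annotated gene on a contig. All function names are hypothetical.

```python
def abs_position(gene_start, gene_end, contig_len):
    """Distance of the gene midpoint from the contig centre, scaled to [0, 1]."""
    centre = contig_len / 2
    midpoint = (gene_start + gene_end) / 2
    return abs(midpoint - centre) / centre


def positional_flags(gene_starts):
    """Assign g_terminal and g_single for the genes of one contig.

    gene_starts maps gene name -> start coordinate on that contig.
    Assumption: the first and last gene by start coordinate are terminal,
    so a single gene is also flagged as terminal.
    """
    ordered = sorted(gene_starts, key=gene_starts.get)
    terminal = {ordered[0], ordered[-1]}
    single = int(len(ordered) == 1)
    return {
        gene: {"g_terminal": int(gene in terminal), "g_single": single}
        for gene in ordered
    }


# Toy contig of length 1000 with three genes (invented coordinates).
flags = positional_flags({"gene1": 50, "gene2": 400, "gene3": 900})
central = abs_position(400, 600, 1000)
near_end = abs_position(900, 1000, 1000)
```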
Coverage
- g_cov mean read coverage of the given gene
- g_covsd standard deviation of the read coverage along the given gene
- g_covdev_c extent to which 'g_cov' deviates from the corresponding 'c_genecovm', in units of 'c_genecovsd'
- g_covdev_o extent to which 'g_cov' deviates from the average gene coverage in the assembly, in units of the coverage SD over the complete set of genes in this assembly
Sequence composition
- g_pearson_r_o Pearson’s correlation coefficient for the tetranucleotide-derived z-score vectors of the given gene and the overall assembly
- g_pearson_p_o probability (p-value) of a random data set to show a correlation with the overall assembly composition that is at least as high as 'g_pearson_r_o'
- g_pearson_r_c Pearson’s correlation coefficient for the tetranucleotide-derived z-score vectors of the given gene and its respective contig
- g_pearson_p_c probability (p-value) for two random sequences to show a correlation that is at least as high as 'g_pearson_r_c'
- g_gc_cont percentage of GC content in the given gene
- g_gcdev_c extent to which 'g_gc_cont' deviates from the mean gene GC content, in units of GC SD over the set of genes on the corresponding contig
- g_gcdev_o extent to which 'g_gc_cont' deviates from the mean GC content of the overall gene set, in units of GC SD over the complete set of genes in this assembly
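
The correlation step behind g_pearson_r_o and g_pearson_r_c can be sketched as a plain Pearson correlation between two equal-length vectors. The derivation of the tetranucleotide z-score vectors themselves is not shown here, and the function name `pearson_r` is a hypothetical helper; in practice the inputs would be the z-score vectors of the gene and of the contig or assembly.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when one vector is constant.
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Toy vectors standing in for z-score vectors (invented values).
r_same_trend = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```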

Taxonomic assignment information

fasta_header header of the FASTA record that was matched to the gene ID; the corresponding sequence is used for taxonomic assignment (the header is truncated at the first whitespace)
lca lowest common ancestor (LCA) of the hits within the 10% score range of the best hit
best_hit best hit of the sequence similarity search in terms of alignment score
bh_evalue e-value of best hit
bh_pident percentage of identical matches between best hit and query protein
refined_lca refined LCA: LCA of closest hit (to query) and query species
taxon_assignment final taxonomic assignment: LCA if LCA is not in query's lineage, else the refined LCA
plot_label final label for gene displayed in plot after merging of taxonomic assignments
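
The decision rule for taxon_assignment can be sketched as follows. The sketch assumes each lineage is a list of taxon names ordered from root to leaf, and it reads "10% score range" as "score of at least 90% of the best score". Both are interpretations rather than confirmed details of the tool. The function names and the toy lineages and scores are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages (each ordered root -> leaf)."""
    lca = None
    for ranks in zip(*lineages):
        if all(taxon == ranks[0] for taxon in ranks):
            lca = ranks[0]
        else:
            break
    return lca


def hits_in_score_range(hits, fraction=0.10):
    """Keep (name, score) hits scoring within `fraction` of the best score."""
    best = max(score for _, score in hits)
    return [hit for hit in hits if hit[1] >= best * (1 - fraction)]


def final_assignment(lca, refined_lca, query_lineage):
    """LCA if it is not in the query's lineage, else the refined LCA."""
    return refined_lca if lca in query_lineage else lca


# Toy example (invented hits and scores).
query_lineage = ["root", "Eukaryota", "Metazoa", "Arthropoda", "Insecta"]
hit_lineages = {
    "hitA": ["root", "Eukaryota", "Metazoa", "Arthropoda", "Arachnida"],
    "hitB": ["root", "Eukaryota", "Metazoa", "Chordata", "Mammalia"],
    "hitC": ["root", "Bacteria", "Proteobacteria"],
}
hits = [("hitA", 200.0), ("hitB", 185.0), ("hitC", 120.0)]

kept = hits_in_score_range(hits)
lca = lowest_common_ancestor([hit_lineages[name] for name, _ in kept])
# Here hitA is assumed to be the hit closest to the query.
refined_lca = lowest_common_ancestor([hit_lineages["hitA"], query_lineage])
taxon_assignment = final_assignment(lca, refined_lca, query_lineage)
```

In this toy case the LCA of the retained hits lies in the query's lineage, so the more specific refined LCA is used as the final assignment.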
