forked from fdarthen/taXaminer
Additional information
Freya Arthen edited this page May 29, 2024
·
2 revisions
The following variables are computed for each gene, and the results are written to gene_table_taxon_assignment.csv. Any combination of these variables can be used in the PCA.
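As a rough illustration of picking a subset of these variables for downstream analysis, the sketch below reads a few columns from a table shaped like gene_table_taxon_assignment.csv. It assumes a comma-separated file whose header row uses the variable names listed on this page; the inline data and the helper `load_numeric_columns` are hypothetical and not part of taXaminer.

```python
import csv
import io

# Hypothetical excerpt; a real table has many more columns and rows.
EXAMPLE_CSV = """g_name,c_name,c_cov,c_gc_cont,g_len,g_cov
gene1,contig1,30.5,41.0,1200,29.8
gene2,contig1,30.5,41.0,900,31.2
gene3,contig2,12.1,55.0,450,11.7
"""


def load_numeric_columns(handle, columns):
    """Return one list of floats per gene, restricted to the given columns."""
    reader = csv.DictReader(handle)
    return [[float(row[col]) for col in columns] for row in reader]


# Any combination of the numeric variables could be chosen here.
pca_columns = ["c_cov", "c_gc_cont", "g_len", "g_cov"]
matrix = load_numeric_columns(io.StringIO(EXAMPLE_CSV), pca_columns)
print(matrix[0])
```

To read an actual file, the `io.StringIO` object would be replaced by an open file handle. Non-numeric columns such as c_name and g_name are identifiers and are left out of the matrix.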
- c_name: name of the contig the given gene is located on
- Positional
  - c_num_of_genes: number of genes annotated on the given contig
  - c_len: contig length
  - c_pct_assembly_len: percentage of the total assembly length contained in the given contig
  - c_genelenm: mean gene length on the given contig
  - c_genelensd: standard deviation of the gene lengths in the set of genes annotated on the given contig
- Coverage
  - c_cov: mean read coverage of the given contig
  - c_covsd: standard deviation of the read coverage along the given contig
  - c_covdev: extent to which 'c_cov' deviates from the mean assembly coverage, in units of the total assembly coverage SD
  - c_genecovm: mean read coverage in the set of genes annotated on the given contig
  - c_genecovsd: standard deviation of the read coverage in the set of genes annotated on the given contig
- Sequence composition
  - c_pearson_r: Pearson's correlation coefficient for the tetranucleotide-derived z-score vectors of the given contig and the overall assembly
  - c_pearson_p: p-value, i.e. the probability that two random sequences show a correlation at least as high as 'c_pearson_r'
  - c_gc_cont: percentage of GC content in the given contig
  - c_gcdev: extent to which 'c_gc_cont' deviates from the mean assembly GC content, in units of the total assembly GC SD
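To make the Pearson-based composition variables more concrete, here is a minimal sketch of the correlation coefficient between two z-score vectors. It assumes the tetranucleotide z-scores have already been computed; how taXaminer derives them is not described on this page. The function `pearson_r` and the toy vectors are illustrative only, and real vectors would hold one entry per tetranucleotide.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Toy z-score vectors (hypothetical values).
contig_z = [1.2, -0.5, 0.3, 2.1]
assembly_z = [1.0, -0.4, 0.1, 1.8]
r = pearson_r(contig_z, assembly_z)
print(round(r, 3))
```

A value close to 1 means the contig's tetranucleotide usage resembles that of the overall assembly; lower values point to a deviating sequence composition.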
- g_name: gene name (as given in the GFF file, 9th column, ID attribute)
- Positional
  - g_len: length of the given gene
  - g_lendev_c: extent to which 'g_len' deviates from the corresponding 'c_genelenm', in units of 'c_genelensd'
  - g_lendev_o: extent to which 'g_len' deviates from the average gene length in the assembly, in units of the length SD over the complete set of genes in this assembly
  - g_abspos: absolute position of the gene (the closer to the contig centre, the closer 'g_abspos' is to 0; the closer to either contig end, the closer 'g_abspos' is to 1)
  - g_terminal: classifier of terminal genes (1 if the gene is terminal, 0 if not)
  - g_single: classifier of single genes (1 if the given gene is the only gene annotated on its contig, 0 if not)
- Coverage
  - g_cov: mean read coverage of the given gene
  - g_covsd: standard deviation of the read coverage along the given gene
  - g_covdev_c: extent to which 'g_cov' deviates from the corresponding 'c_genecovm', in units of 'c_genecovsd'
  - g_covdev_o: extent to which 'g_cov' deviates from the average gene coverage in the assembly, in units of the coverage SD over the complete set of genes in this assembly
- Sequence composition
  - g_pearson_r_o: Pearson's correlation coefficient for the tetranucleotide-derived z-score vectors of the given gene and the overall assembly
  - g_pearson_p_o: p-value, i.e. the probability that a random data set shows a correlation with the overall assembly composition at least as high as 'g_pearson_r_o'
  - g_pearson_r_c: Pearson's correlation coefficient for the tetranucleotide-derived z-score vectors of the given gene and its respective contig
  - g_pearson_p_c: p-value, i.e. the probability that two random sequences show a correlation at least as high as 'g_pearson_r_c'
  - g_gc_cont: percentage of GC content in the given gene
  - g_gcdev_c: extent to which 'g_gc_cont' deviates from the mean gene GC content, in units of the GC SD over the set of genes on the corresponding contig
  - g_gcdev_o: extent to which 'g_gc_cont' deviates from the mean GC content in the overall gene set, in units of the GC SD over the complete set of genes in this assembly
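The gene-level deviation variables (for example g_lendev_c, g_covdev_o, g_gcdev_c) and g_abspos can be pictured with the sketch below. Both formulas are assumptions inferred from the descriptions on this page, not taken from the taXaminer source: the deviation is written as a signed z-score, and the position is the distance of the gene midpoint from the contig centre, scaled by half the contig length. The function names are hypothetical.

```python
def deviation_in_sd(value, mean, sd):
    """Signed deviation of value from mean, in units of sd (assumed formula)."""
    if sd == 0:
        # Undefined when there is no spread, e.g. a contig with a single gene.
        return float("nan")
    return (value - mean) / sd


def abs_position(gene_start, gene_end, contig_len):
    """0 at the contig centre, 1 at either contig end (assumed formula)."""
    gene_mid = (gene_start + gene_end) / 2
    contig_mid = contig_len / 2
    return abs(gene_mid - contig_mid) / contig_mid


# A 1500 bp gene on a contig with mean gene length 1000 bp and SD 250 bp.
print(deviation_in_sd(1500, 1000, 250))

# A gene centred on a 1000 bp contig, and one close to its start.
print(abs_position(450, 550, 1000))
print(abs_position(0, 100, 1000))
```

The same `deviation_in_sd` helper covers the `_c` and `_o` variants: only the reference mean and SD change (genes on the same contig versus all genes in the assembly).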
- fasta_header: header in the FASTA file that was matched to the gene ID; the corresponding sequence is used for taxonomic assignment (the header is truncated at the first whitespace)
- lca: LCA of the hits in the 10% score range of the best hit
- best_hit: best hit of the sequence similarity search in terms of alignment score
- bh_evalue: e-value of the best hit
- bh_pident: percentage of identical matches between the best hit and the query protein
- refined_lca: refined LCA, i.e. the LCA of the closest hit (to the query) and the query species
- taxon_assignment: final taxonomic assignment: the LCA if the LCA is not in the query's lineage, else the refined LCA
- plot_label: final label for the gene displayed in the plot after merging of taxonomic assignments
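The relationship between lca, refined_lca and taxon_assignment can be sketched as follows. This is a simplified illustration under stated assumptions, not taXaminer's implementation: lineages are plain root-to-leaf lists of taxon names, and "10% score range" is read as "score at least 90% of the best score". All function names and the example taxa are hypothetical.

```python
def lca(lineages):
    """Deepest taxon shared by all root-to-leaf lineages (None if none)."""
    shared = None
    for ranks in zip(*lineages):
        if all(taxon == ranks[0] for taxon in ranks):
            shared = ranks[0]
        else:
            break
    return shared


def hits_in_score_range(hits, fraction=0.10):
    """Lineages of hits scoring within `fraction` of the best score (assumed reading)."""
    best = max(score for _, score in hits)
    return [lineage for lineage, score in hits if score >= best * (1 - fraction)]


def final_assignment(lca_taxon, refined_lca_taxon, query_lineage):
    """Use the LCA unless it lies in the query's lineage, else the refined LCA."""
    if lca_taxon not in query_lineage:
        return lca_taxon
    return refined_lca_taxon


query_lineage = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"]
hits = [
    (["cellular organisms", "Bacteria", "Proteobacteria", "Escherichia"], 200.0),
    (["cellular organisms", "Bacteria", "Proteobacteria", "Salmonella"], 190.0),
    (["cellular organisms", "Eukaryota", "Metazoa"], 120.0),  # outside the range
]

gene_lca = lca(hits_in_score_range(hits))
# The refined LCA passed here is a placeholder value for illustration.
print(final_assignment(gene_lca, "Arthropoda", query_lineage))
```

In this toy case the two top-scoring hits share "Proteobacteria", which is not part of the query's lineage, so it is kept as the final assignment. Had the LCA been, say, "Metazoa", the refined LCA would be reported instead.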