Additional information

Freya Arthen edited this page May 29, 2024 · 2 revisions

Gene descriptors and PCA variables

The following variables are computed for each gene and the results are written to gene_table_taxon_assignment.csv. Any combination of these can be used in the PCA.
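
As an illustration of choosing a combination of variables, the sketch below reads a table shaped like gene_table_taxon_assignment.csv and builds a standardised feature matrix of the kind a PCA is usually run on. The sample rows, the comma delimiter, the helper names `load_features` and `standardise`, and the decision to standardise each column are all assumptions made for this example; they are not taken from the tool itself.

```python
import csv
import io
import statistics

# Toy stand-in for gene_table_taxon_assignment.csv (values are made up;
# a comma delimiter is assumed).
SAMPLE = """g_name,c_name,g_len,g_cov,g_gc_cont
gene1,contig1,900,31.5,41.2
gene2,contig1,1500,29.8,40.1
gene3,contig2,600,88.0,62.7
"""

def load_features(handle, columns):
    """Return gene names and a row-per-gene matrix of the chosen variables."""
    names, rows = [], []
    for record in csv.DictReader(handle):
        names.append(record["g_name"])
        rows.append([float(record[col]) for col in columns])
    return names, rows

def standardise(rows):
    """Scale each column to zero mean and unit (sample) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Pick any subset of the numeric variables listed on this page.
names, matrix = load_features(io.StringIO(SAMPLE), ["g_len", "g_cov", "g_gc_cont"])
scaled = standardise(matrix)
```

For a real run, `io.StringIO(SAMPLE)` would be replaced by the opened CSV file, and `scaled` could be handed to whichever PCA implementation is preferred.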

Contig related

c_name name of the contig the given gene is located on

  • Positional
    • c_num_of_genes number of genes annotated on the given contig
    • c_len contig length
    • c_pct_assembly_len percentage of the total assembly length contained in the given contig
    • c_genelenm mean gene length on the given contig
    • c_genelensd standard deviation of the gene lengths in the set of genes annotated on the given contig
  • Coverage
    • c_cov mean read coverage of the given contig
    • c_covsd standard deviation of read coverage along the given contig
    • c_covdev extent to which 'c_cov' deviates from the mean assembly coverage, in units of total assembly coverage SD
    • c_genecovm mean read coverage in the set of genes annotated on the given contig
    • c_genecovsd standard deviation of the read coverage in the set of genes annotated on the given contig
  • Sequence composition
    • c_pearson_r Pearson’s correlation coefficient for the tetranucleotide-derived z-score vectors of the given contig and the overall assembly
    • c_pearson_p probability (p-value) for two random sequences to show a correlation that is at least as high as 'c_pearson_r'
    • c_gc_cont percentage of GC content in the given contig
    • c_gcdev extent to which 'c_gc_cont' deviates from the mean assembly GC content, in units of total assembly GC SD
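
The deviation variables above ('c_covdev', 'c_gcdev') read as z-score-style quantities: the distance of a value from a mean, expressed in standard deviations. A minimal sketch of that idea follows. The helper name `deviation`, the toy coverage values, the use of an unweighted mean and sample SD over contigs, and the zero-SD fallback are assumptions; the tool may compute the assembly-wide mean and SD differently (for example, weighted by contig length).

```python
import statistics

def deviation(value, mean, sd):
    """Signed distance of `value` from `mean`, in units of `sd`."""
    if sd == 0:
        return 0.0  # assumption: no spread means no measurable deviation
    return (value - mean) / sd

# Made-up mean coverage per contig (the role played by 'c_cov').
contig_cov = {"contig1": 30.0, "contig2": 32.0, "contig3": 90.0, "contig4": 28.0}

# Assumption: unweighted mean and sample SD across contigs.
mean_cov = statistics.mean(list(contig_cov.values()))
sd_cov = statistics.stdev(list(contig_cov.values()))

c_covdev = {name: deviation(cov, mean_cov, sd_cov) for name, cov in contig_cov.items()}
```

In this toy data, contig3 has a clearly positive deviation while the other contigs sit slightly below the mean. The same helper applies to 'c_gcdev' with GC content in place of coverage.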

Gene related

g_name gene name (as given in the GFF file, 9th column, ID attribute)

  • Positional
    • g_len length of the given gene
    • g_lendev_c extent to which 'g_len' deviates from the corresponding 'c_genelenm', in units of 'c_genelensd'
    • g_lendev_o extent to which 'g_len' deviates from the average gene length in the assembly, in units of the length SD over the complete set of genes in this assembly
    • g_abspos absolute position of the gene (the closer to the contig centre, the closer 'g_abspos' is to 0; the closer to either contig end, the closer 'g_abspos' is to 1)
    • g_terminal classifier of terminal genes (1 if gene is terminal, 0 if not)
    • g_single classifier of single genes (1 if the given gene is the only gene annotated on its contig, 0 if not)
  • Coverage
    • g_cov mean read coverage of the given gene
    • g_covsd standard deviation of the read coverage along the given gene
    • g_covdev_c extent to which 'g_cov' deviates from the corresponding 'c_genecovm', in units of 'c_genecovsd'
    • g_covdev_o extent to which 'g_cov' deviates from the average gene coverage in the assembly, in units of the coverage SD over the complete set of genes in this assembly
  • Sequence composition
    • g_pearson_r_o Pearson’s correlation coefficient for the tetranucleotide-derived z-score vectors of the given gene and the overall assembly
    • g_pearson_p_o probability (p-value) of a random data set to show a correlation with the overall assembly composition that is at least as high as 'g_pearson_r_o'
    • g_pearson_r_c Pearson’s correlation coefficient for the tetranucleotide-derived z-score vectors of the given gene and its respective contig
    • g_pearson_p_c probability (p-value) for two random sequences to show a correlation that is at least as high as 'g_pearson_r_c'
    • g_gc_cont percentage of GC content in the given gene
    • g_gcdev_c extent to which 'g_gc_cont' deviates from the mean GC content of the genes on the corresponding contig, in units of the GC SD over that set of genes
    • g_gcdev_o extent to which 'g_gc_cont' deviates from the mean GC content of the overall gene set, in units of the GC SD over the complete set of genes in this assembly
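
The positional descriptors 'g_abspos', 'g_terminal' and 'g_single' can be sketched as follows. The formula for 'g_abspos' (distance of the gene midpoint from the contig midpoint, scaled by half the contig length) is one formula consistent with the description above, not necessarily the one the tool uses. Treating "terminal" as "first or last gene on the contig in start order" is likewise an assumption, as are the function names and the toy coordinates.

```python
def abs_position(gene_start, gene_end, contig_len):
    """0 at the contig centre, 1 at either contig end (assumed formula)."""
    gene_mid = (gene_start + gene_end) / 2
    contig_mid = contig_len / 2
    return abs(gene_mid - contig_mid) / contig_mid

def positional_flags(genes_on_contig):
    """Map each gene name to a (g_terminal, g_single) pair.

    genes_on_contig: list of (name, start, end) tuples for one contig.
    Assumption: 'terminal' means first or last gene in start order.
    """
    ordered = sorted(genes_on_contig, key=lambda gene: gene[1])
    single = int(len(ordered) == 1)
    last = len(ordered) - 1
    return {
        name: (int(index in (0, last)), single)
        for index, (name, _start, _end) in enumerate(ordered)
    }

# Toy coordinates on a contig of length 1000.
flags = positional_flags([("geneB", 200, 300), ("geneA", 1, 100), ("geneC", 400, 500)])
centre = abs_position(450, 550, 1000)
near_end = abs_position(0, 100, 1000)
```

Under these assumptions a gene centred on the contig midpoint gets 0 and a gene at the contig edge approaches 1, matching the behaviour described for 'g_abspos'.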

Taxonomic assignment information

  • fasta_header header in the FASTA file that was matched to the gene ID; the corresponding sequence is used for taxonomic assignment [header is truncated at the first whitespace]
  • lca lowest common ancestor (LCA) of the hits within the 10% score range of the best hit
  • best_hit best hit of the sequence similarity search in terms of alignment score
  • bh_evalue e-value of best hit
  • bh_pident percentage of identical matches between best hit and query protein
  • refined_lca refined LCA: LCA of closest hit (to query) and query species
  • taxon_assignment final taxonomic assignment: LCA if LCA is not in query's lineage, else the refined LCA
  • plot_label final label for gene displayed in plot after merging of taxonomic assignments
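
The relationship between 'lca', 'refined_lca' and 'taxon_assignment' can be illustrated with a small sketch. Lineages are modelled here as lists ordered from root to leaf. The function names, the toy hits and scores, the placeholder refined LCA, and the reading of "10% score range" as "score at least 90% of the best score" are assumptions made for illustration. Only the final decision rule (LCA if it is not in the query's lineage, else the refined LCA) is taken directly from the description above.

```python
def hits_in_score_range(hits, fraction=0.10):
    """Keep lineages of hits scoring within `fraction` of the best score.

    hits: list of (lineage, score) pairs. The threshold rule is an assumption.
    """
    best = max(score for _lineage, score in hits)
    return [lineage for lineage, score in hits if score >= best * (1 - fraction)]

def lowest_common_ancestor(lineages):
    """Deepest taxon shared by all lineages (each ordered root -> leaf)."""
    shared = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared = ranks[0]
    return shared

def final_assignment(lca, refined_lca, query_lineage):
    """LCA if it lies outside the query's lineage, else the refined LCA."""
    return lca if lca not in query_lineage else refined_lca

# Toy example: a gene in an arthropod assembly whose best hits are bacterial.
query_lineage = ["cellular organisms", "Eukaryota", "Metazoa", "Arthropoda"]
hits = [
    (["cellular organisms", "Bacteria", "Pseudomonadota"], 200.0),
    (["cellular organisms", "Bacteria", "Bacillota"], 185.0),
    (["cellular organisms", "Eukaryota", "Metazoa"], 120.0),
]
lca = lowest_common_ancestor(hits_in_score_range(hits))
# "Metazoa" stands in for a refined LCA computed elsewhere (placeholder value).
taxon_assignment = final_assignment(lca, "Metazoa", query_lineage)
```

Here the two top-scoring hits share "Bacteria", which is not part of the query's lineage, so the LCA itself becomes the final assignment. Had the LCA been "Eukaryota", the refined LCA would have been used instead.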