When inputing parts of genes, gene `SGR00005418.1` could not find anchors in another species #566

YANGMEI0127 · 2023-05-10T12:56:31Z

Thanks for this great tool.

We encountered a problem while analyzing data using jcvi. Below are the steps and issues we analyzed.

Firstly, we prepared bed and cds files for 2 species. The format of the files for 2 species is the same. One species was named LHG and another species named Species2.

The format of bed is shown below (LHG_scaffold_3):

Hic_scaffold_3 31125864 31140016 SGR00007415.1 0 +
Hic_scaffold_3 31141175 31144628 SGR00007416.1 0 -
Hic_scaffold_3 31149146 31150574 SGR00007417.1 0 +
Hic_scaffold_3 31153232 31155626 SGR00007418.1 0 -

The format of cds is shown below (LHG_scaffold_3):

>SGR00007415.1
ATGGATTCAGTGGAGGCTATAGAGGAGTTGGCTCAGCTGTCGGACTCGATGCGGCAAGCTGCGGCTTTGCTTGCCGATGAGGATGTCGACGAAACTACCACCTCCGGTACTTCCTCTCGTAGGCTTTCCACTTTCCTCAACGTCGTGACACTGGGGAATGTCGGAGCTGGTAAATCTGCAGTGTTAAATAGCTTG……
>SGR00007416.1
ATGCAGCGGTCACGTAGAGCTCTTCTTCAGAGAAGAGCTTTGGAGAAAACTATCAGTGGGAGGAATCGTTTGTATATGTTTTCTCTGACCTTAGTTTTTGTTTTGTGGGGGCTTTTCTTCCTTTTTAGCTTATGGATCAGACAGGGCAACGGTCGTAGAGATGGATGTACTGCACTACCAGATAGCATATCAACT……

Secondly, we performed ortholog analysis for LHG with all other species using code like python -m jcvi.compara.catalog ortholog {main_species} {other_species} --no_strip_names to extract collinear information for all genes in LHG_scaffold_3 with all other species.

We have obtained the collinearity information file of LHG_scaffold_3 compared to species. We display the content of each file (displaying gene SGR00005418.1 information as it exists in each file) such as:

LHG_scaffold_3.Species2_v2_5.last

SGR00005418.1   Cla97C02G033340.2       80.71   1405    262     2       11      1406    14      1418    0       1.34e+03
SGR00005418.1   Cla97C02G033330.1       67.11   1420    440     3       7       1402    13      1429    4.1e-212        737
SGR00005418.1   Cla97C02G033320.1       65.94   1415    455     3       11      1401    17      1428    2.2e-195        682
……

LHG_scaffold_3.Species2_v2_5.last.filtered

SGR00005418.1   Cla97C02G033340.2   80.71   1405    262 2   11  1406    14  1418    0   1.34e+03
……

*LHG_scaffold_3.Species2_v2_5.anchors

SGR00005418.1   Cla97C02G033340.2       1340
……

*LHG_scaffold_3.Species2_v2_5.lifted.anchors

SGR00005418.1   Cla97C02G033340.2   1340
……

Thirdly, we extracted collinear information for several genes (include SGR00005418.1) in LHG_scaffold_3 instead of all genes in LHG_scaffold_3 and rerun the analysis.

LHG_UGTscaffold_3.Species2_v2_5.last:

SGR00005418.1   Cla97C02G033340.2       80.71   1405    262     2       11      1406    14      1418    0       1.34e+03

LHG_UGTscaffold_3.Species2_v2_5.last.filtered:

SGR00005418.1   Cla97C02G033340.2       80.71   1405    262     2       11      1406    14      1418    0       1.34e+03

LHG_UGTscaffold_3.Species2_v2_5.lifted.anchors: were not formed.
LHG_UGTscaffold_3.Species2_v2_5.anchors: The file have no content, and gene SGR00005418.1 A is not in the file.

The question is: in the second step, when inputting all genes, gene SGR00005418.1 could find anchors in another species. But in the third step when inputing parts of genes, gene SGR00005418.1 could not find anchors in another species. Is there any fltering steps for generating the anchor file from the last.filtered file. (By the way, I also tried to change --min_size to 1 or 0, still no luck.)

Would you please give me some suggestion? When we focus on specific gene sets, should we feed only these gene sets or whole scaffold to the jcvi.compara.catalog step?

Thank you very much.

The text was updated successfully, but these errors were encountered:

tanghaibao · 2023-05-11T03:16:38Z

@YANGMEI0127

In your third step, did you include the neighbors that formed the synteny block close to SGR00005418.1?

Filtering the input files to only subsets of the genes is not recommended, since jcvi relies on the LAST hits from other genes (even those that aren't forming the blocks) as part of the filtering decision. So subsetting may lead to unexpected results?

Would you please explain a bit why you want to run on subsets, rather than extracting what you need from the full run?

YANGMEI0127 · 2023-05-11T07:00:01Z

Hi @tanghaibao,

Thank you very much for your reply.

In my third step, I include the neighbors that formed the synteny block close to SGR00005418.1 , however, only two genes are close to the SGR00005418.1. Such as:

Hic_scaffold_3  5284539 5286279 SGR00005419.1   0   +
Hic_scaffold_3  5274281 5275706 SGR00005418.1   0   +
Hic_scaffold_3  5253515 5254979 SGR00005417.1   0   -

The reason I want to run on subsets is that the full data is abundant, I want to simplify the data, and analyze the data for the genes I am interested in. This could be more convenient.
To my surprise, the data processing results were different.

Thank you again for your reply.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When inputing parts of genes, gene `SGR00005418.1` could not find anchors in another species #566

When inputing parts of genes, gene `SGR00005418.1` could not find anchors in another species #566

YANGMEI0127 commented May 10, 2023

tanghaibao commented May 11, 2023

YANGMEI0127 commented May 11, 2023 •

edited

Loading

When inputing parts of genes, gene SGR00005418.1 could not find anchors in another species #566

When inputing parts of genes, gene SGR00005418.1 could not find anchors in another species #566

Comments

YANGMEI0127 commented May 10, 2023

tanghaibao commented May 11, 2023

YANGMEI0127 commented May 11, 2023 • edited Loading

When inputing parts of genes, gene `SGR00005418.1` could not find anchors in another species #566

When inputing parts of genes, gene `SGR00005418.1` could not find anchors in another species #566

YANGMEI0127 commented May 11, 2023 •

edited

Loading