Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

content search of AllTheBacteria with selected GTDB genomes #3478

Open
ctb opened this issue Jan 9, 2025 · 0 comments
Open

content search of AllTheBacteria with selected GTDB genomes #3478

ctb opened this issue Jan 9, 2025 · 0 comments
Labels
fyi Information that is interesting or useful tutorial links to tutorials

Comments

@ctb
Copy link
Contributor

ctb commented Jan 9, 2025

ATB issue: AllTheBacteria/AllTheBacteria#54 (comment)

hackmd for this issue: https://hackmd.io/-7KJTe14TZGooSW3TAhlgw?both

AllTheBacteria Buchnera and Sulcia search w/sourmash

ATB download: #3247

GTDB files: #3183 (comment)

Build a picklist for the GTDB genomes we're interested in (Sulcia and Buchnera):

head -1 gtdb-rs220.lineages.csv > sublineages.csv
egrep -i 'buchnera|sulcia' gtdb-rs220.lineages.csv   >> sublineages.csv

Extract just those genomes from GTDB RS220:

sourmash sig cat --picklist sublineages.csv:ident:ident gtdb-rs220-k21.zip -o sublineages.sig.zip

Search ATB v0.2 for overlaps:

sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv -k 21

(Takes about 10-15 minutes )

Parse results:

>>> df = pd.read_csv('sublineages.x.atb.csv')
>>> df2 = df[df['max_containment_ani'] >= 0.9]
>>> df2[['match_name', 'max_containment_ani']]
                                              match_name  max_containment_ani
23861  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.907215
23879  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.921266
23901  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.910469
23906  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.997631
23909  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             1.000000
23911  NZ_CP056771.1 Buchnera aphidicola (Aphis gossy...             0.999172

Eliminate all ATB genomes already labeled as Buchnera:

>>> df3 = df[~df['match_name'].str.contains('Buchnera')]
>>> len(df3)
26764
>>> df3['max_containment'].max()
np.float64(0.0448430493273542)
>>> df3.sort_values(by='max_containment')[['match_name', 'max_containment']]
                                              match_name  max_containment
21457  NZ_KB944588.1 Enterococcus faecalis ATCC 19433...         0.010017
96     NZ_BNJW01000001.1 Enterococcus lactis strain C...         0.010017
26119  NZ_LT700212.1 Actinomyces provencensis strain ...         0.010017
25967  NZ_MVFY01000010.1 Citrobacter portucalensis st...         0.010017
25963  NZ_PRDB01000030.1 Klebsiella michiganensis str...         0.010017
...                                                  ...              ...
23667  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
23596  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
23525  JAGGDU010000001.1 Trichodesmium erythraeum GBR...         0.042601
24492  NZ_JPOS01000001.1 Phaeodactylibacter xiamenens...         0.044643
24451  NZ_JPOS01000001.1 Phaeodactylibacter xiamenens...         0.044843

So, there's a 4.5% overlap in k-mers with some Buchnera queries for JAGGDU010000001.1 and NZ_JPOS01000001.1, which is probably not enough to call it a Buchnera genome... The estimated ANI is in the 80% range (column max_containment_ani).

tl;dr no matches to Sulcia, only one match to Buchnera, in all the ATB genomes.

Download the CSV file here for more exploration: https://farm.cse.ucdavis.edu/~ctbrown/buchnera-sulcia.manysearch.zip

benchmark info

with 4 CPUs:

        Command being timed: "sourmash scripts manysearch sublineages.sig.zip allthebacteria-r0.2-k21.zip -o sublineages.x.atb.csv.2 -k 21 -c 4"
        User time (seconds): 10967.28
        System time (seconds): 98.50
        Percent of CPU this job got: 385%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 47:51.17
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 20568304
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 506284
        Minor (reclaiming a frame) page faults: 4419875
        Voluntary context switches: 63584
        Involuntary context switches: 120076
        Swaps: 0
        File system inputs: 129128808
        File system outputs: 21096
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0
@ctb ctb added fyi Information that is interesting or useful tutorial links to tutorials labels Jan 9, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fyi Information that is interesting or useful tutorial links to tutorials
Projects
None yet
Development

No branches or pull requests

1 participant