gapseq_find taxonomy and unreviewed sequences use #122
Thanks again for gapseq. I was wondering if you could comment on why it was deemed necessary or desirable to have gapseq_find determine the (super)kingdom (Archaea/Bacteria) when it is not provided, and to blast only the corresponding sequences. Inter-kingdom HGT is known, mostly from Archaea to Bacteria; see for instance doi: 10.7717/peerj.3865. Was this done for computational efficiency? Is only a marginal gain expected from a Prokaryote sequence query (uniprot.sh supports this), or was only a marginal gain obtained in practice?

My second, related set of questions is about combining query sequence sources. Currently, gapseq_find uses only the "user" sequences if these are defined for a given EC number or reaction name, but most of these are 2-3 years old. Are these presumably curated sequences the final set to be considered, regardless of new UniProt reviewed sequences, for instance? Do the user sequences only contain additions, or have UniProt reviewed sequences been rejected as well?

In the default search strategy ("Quality": -z 2), unreviewed sequences are then considered only if no reviewed sequences are available. I think I understand the motivation, and diversity seems to have been favoured by downloading the seed sequences of UniRef 90% similarity clusters for the reviewed sequences instead of 50% for the unreviewed ones. But how representative are reviewed sequences of microbial diversity? I guess this is highly reaction- and habitat-dependent. The optional -z 3 does combine reviewed and unreviewed query sequences, but gapseq_find does not track the provenance. Wouldn't it be useful to add a confidence value for the reaction in the model, depending on the user/rev/unrev provenance?
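To make the last suggestion concrete, here is a minimal Python sketch of the selection order as I understand it, with the provenance kept alongside each query sequence. This is not gapseq code: `select_queries`, `reaction_confidence`, the `mode` argument (mirroring only the `-z 2` and `-z 3` behaviour described above) and the confidence weights are all hypothetical and chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical confidence weights per provenance (illustrative values only).
CONFIDENCE = {"user": 1.0, "rev": 0.8, "unrev": 0.5}


def select_queries(sources: Dict[str, List[str]], mode: int = 2) -> List[Tuple[str, str]]:
    """Return (sequence_id, provenance) pairs for one EC number or reaction.

    Assumed behaviour, taken from the description above:
      - user-defined sequences, if present, are used exclusively;
      - mode 2: reviewed sequences, falling back to unreviewed if none exist;
      - mode 3: reviewed and unreviewed sequences combined.
    """
    if sources.get("user"):
        return [(seq, "user") for seq in sources["user"]]
    rev = [(seq, "rev") for seq in sources.get("rev", [])]
    unrev = [(seq, "unrev") for seq in sources.get("unrev", [])]
    if mode == 3:
        return rev + unrev
    if mode == 2:
        return rev if rev else unrev
    raise ValueError(f"mode {mode} is not covered by this sketch")


def reaction_confidence(hit_provenances: List[str]) -> float:
    """Score a reaction by the best provenance among its supporting hits."""
    return max((CONFIDENCE[p] for p in hit_provenances), default=0.0)
```

With the provenance carried through to the hits, a reaction supported only by unreviewed queries could be flagged in the model with a lower score than one supported by user or reviewed sequences.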
Replies: 1 comment
hi,
good points! I agree HGT could be an issue. The reason for separating archaea and bacteria is that we got false positive hits when combining both sequence databases. This was especially the case for enzyme complexes, which can differ in their subunits between archaea and bacteria. On the other hand, if an HGT is described, it should also be contained in the respective database.
The user sequences are manually revised and updated sequences to improve the annotation. In default mode, they are preferred to other sources, and you are right that it could be a problem if new sequences are available but not used because of a user-defined sequence. This affects roughly 70 user-defined seq…