multiple smarts in --cut-smarts #15

acquaregia · 2019-08-30T17:43:41Z

I have a large database (500K compounds) and I am interested in finding only a few transforms.
Ideally I would like to give the transforms in the form of SMIRKS.
I understand that it might be easier to ask for a different fragmentation pattern and perform indexing on it.
I can translate the SMIRKS into SMARTS specifying specific bonds.
For the tool to be useful I would like to be able to provide more than one SMARTS to the --cut-smarts option.
It would be excellent if an option like --cache would allow using a fragmentation file and enhance it by specifying other cut patterns.
Thanks.
marco

KramerChristian · 2019-09-01T06:49:20Z

Andrew commented on this request on the RDKit-discuss mailing list as:
"""
I took a look at the code. It expects that there is only a single SMARTS, so there's no way to get what you want.

The SMARTS handling code only touches <50 lines of code. It does not seem that hard to have it take multiple --cut-smarts, apply each of the cuts, find the unique union of those cuts, and work with them.

Could you add that as an issue in the mmpdb tracker?

It is in principle possible to merge two fragment files together and index the result. However, it would be difficult to use the indexed database for analysis purposes, because any input/query structure would use the single SMARTS pattern defined in the database.
"""

KramerChristian · 2019-09-27T09:25:49Z

At the RDKit UGM Hackathon 2019, this question came up again. Participants wanted to use the RECAP rules for cutting. Creating a single SMARTS to match all 11 rules might theoretically be possible, but would result in an extremely complicated string which would then be hard to debug and modify. Extending RDKit such that a list of SMARTS is accepted appears to be the preferred long-term solution.

adalke · 2019-09-27T11:40:02Z

There are 12 rules, not 11:

>>> from rdkit.Chem import Recap
>>> len(Recap.reactionDefs)
12
>>> for rxn in Recap.reactionDefs:
...   print(rxn)
...
[#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]>>*[#7:1].[#7:2]*
[C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3]>>*[C:1]=[O:2].*[#7:3]
[C:1](=!@[O:2])!@[O;+0:3]>>*[C:1]=[O:2].[O:3]*
[N;!D1;+0;!$(N-C=[#7,#8,#15,#16])](-!@[*:1])-!@[*:2]>>*[*:1].[*:2]*
[#7;R;D3;+0:1]-!@[*:2]>>*[#7:1].[*:2]*
[#6:1]-!@[O;+0]-!@[#6:2]>>[#6:1]*.*[#6:2]
[C:1]=!@[C:2]>>[C:1]*.*[C:2]
[n;+0:1]-!@[C:2]>>[n:1]*.[C:2]*
[O:3]=[C:4]-@[N;+0:1]-!@[C:2]>>[O:3]=[C:4]-[N:1]*.[C:2]*
[c:1]-!@[c:2]>>[c:1]*.*[c:2]
[n;+0:1]-!@[c:2]>>[n:1]*.*[c:2]
[#7;+0;D2,D3:1]-!@[S:2](=[O:3])=[O:4]>>[#7:1]*.*[S:2](=[O:3])=[O:4]

How do people want to specify the cut with these? Is the cut match defined with the product side of the reaction, and the reactant side ignored?

Some of those SMARTS use more than two atoms. The first makes a cut between :1 and :2 while the second makes a cut between :2 and :3. That means that if the reaction side is ignored (eg, if the cut is always made between :1 and :2) then there will be problems.

It could do a more in-depth analysis of the transform to detect if there is a labeled pair on the product side which is not a labeled pair in the reactant side, and use that for the cut.

But that's overkill if people really just want --cut-smarts RECAP as an option, since that list could be hard-coded using only the product side SMARTS, and only with :1 and :2.
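One quick way to see which labeled atoms end up on opposite sides of a cut is to look at the product side of each reaction string. The following is a minimal string-level sketch, not a SMARTS parser: it assumes the product side uses "." only as the fragment separator, which holds for the twelve strings listed above. The helper name `product_fragment_labels` is hypothetical.

```python
import re

# Matches an atom-map label such as ":1]" at the end of a bracket atom.
MAP_LABEL = re.compile(r":(\d+)\]")

def product_fragment_labels(reaction):
    """Return the atom-map labels found in each product fragment.

    String-level only: assumes "." on the product side is always the
    fragment separator, as in the RECAP strings quoted above.
    """
    _reactant, product = reaction.split(">>")
    return [
        sorted(int(label) for label in MAP_LABEL.findall(fragment))
        for fragment in product.split(".")
    ]

urea = "[#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]>>*[#7:1].[#7:2]*"
amide = "[C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3]>>*[C:1]=[O:2].*[#7:3]"

print(product_fragment_labels(urea))   # [[1], [2]]
print(product_fragment_labels(amide))  # [[1, 2], [3]]
```

For the amide rule, :1 and :2 stay in the same fragment, so a fixed ":1 and :2" convention cannot describe the bond that is broken there.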

adalke · 2019-10-11T13:41:46Z

I'm thinking to support it as --cut-smarts RECAP, and have --cut-smarts support multiple SMARTS patterns, where either the SMARTS pattern defines two atoms and a single bond, or the SMARTS pattern contains atoms labeled :1 and :2 where the cut occurs between them - which must match a single bond.

Looking at the RECAP rules, there are several places where I see problems.

Pattern 1: [#7;+0;D2,D3:1]!@C(!@=O)!@[#7;+0;D2,D3:2]>>*[#7:1].[#7:2]* (urea)

Given NC(=O)N this removes the C(=O) to give N.N. Should the SMARTS be [#7;+0;D2,D3:1]!@[C:2](!@=O)!@[#7;+0;D2,D3], which will match and cut both of the N-C bonds?

cuts on "any" bond

The existing code only allows cuts on single bonds. The RECAP patterns use !@ to match any non-ring bonds. I want to change them to -!@ to enforce that it must match a single bond.

Note that [#7]=!@C(=O)!@[#7] matches nothing in ChEMBL. However, pattern 2 (amide), [C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3] does match. More specifically, if I replace the !@ between :2 and :3 with =!@ then I get matches like:

% obgrep '[C;\!$(C([#7])[#7]):1](=\!@[O:2])=\!@[#7;+0;\!D1:3]' ~/databases/chembl_23.rdkit.smi
O=C=NC1CCCCC1	CHEMBL26886
CCCCN=C=O	CHEMBL27104
CCCC(N=C=O)C(=O)OC	CHEMBL65298
COC(=O)C(CCSC)N=C=O	CHEMBL67787
CC(C)c1cccc(C(C)C)c1N=C=O	CHEMBL109470
CCc1cccc(CC)c1N=C=O	CHEMBL111198
[C-]#[N+][C@@]1(C)CC[C@@H]2[C@@H](C)C[C@H]3C[C@@H](C)[C@@](C)(N=C=O)[C@H]4CC[C@H]1[C@@H]2[C@H]34	CHEMBL169156
CC(=O)O[C@H]1CC[C@@]2(C)[C@@H](CC[C@]3(C)[C@@H]2CC=C2[C@@H]4[C@@H](C)[C@H](C)CC[C@]4(C)CC[C@@]32C)[C@@]1(C)N=C=O	CHEMBL235436
O=C=NCCc1ccccc1	CHEMBL2074871
CC(=O)O[C@@H]1CC[C@@]2(C)[C@@H](CC[C@]3(C)[C@@H]2CC=C2[C@@H]4[C@@H](C)[C@H](C)CC[C@]4(C)CC[C@@]32C)[C@@]1(C)N=C=O	CHEMBL237112
O=C=Nc1cccc2ccccc12	CHEMBL2074791
 ...

There are far more matches with -!@.

It looks like in those few cases where the non-ring bond type is not specified, it's okay for me to say it's a single bond, without changing the intent of matching an amide.

It also seems like that RECAP definition in RDKit is wrong, in that it is not supposed to match a double bond there.
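The rewrite from !@ to -!@ can be done mechanically on the pattern text. This is a hedged sketch using a regular expression rather than a SMARTS parser. The function name `force_single_bond` is hypothetical, and the code assumes `!@` only ever appears as a bond expression, which is true of the RECAP strings in this thread. A `!@` that already sits next to an explicit bond symbol or logical operator (as in `=!@` or `!@=`) is left alone.

```python
import re

# "!@" with no bond symbol or logical operator directly before or after it.
BARE_NON_RING_BOND = re.compile(r"(?<![-=#:~/\\,;&])!@(?![-=#:~/\\,;&])")

def force_single_bond(smarts):
    """Turn each bare non-ring bond "!@" into the single non-ring bond "-!@"."""
    return BARE_NON_RING_BOND.sub("-!@", smarts)

amide = "[C;!$(C([#7])[#7]):1](=!@[O:2])!@[#7;+0;!D1:3]"
print(force_single_bond(amide))
# [C;!$(C([#7])[#7]):1](=!@[O:2])-!@[#7;+0;!D1:3]
```

Only the C-N bond changes; the explicit `=!@` to the oxygen is untouched, and running the function a second time changes nothing.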

Match with explicit double bond

The pattern [C:1]=!@[C:2]>>[C:1]*.*[C:2] explicitly matches a double bond which is a non-ring bond. The underlying code says this is to handle olefins, so it really does want to match a double bond.

mmpdb cannot handle this case. Should I drop it?

adalke · 2019-10-11T18:42:44Z

Going back to acquaregia's request, can you give an example of the SMIRKS you are interested in?

I can see two steps that might be affected: 1) limit the fragment to just a few SMARTS patterns, and 2) limit the indexing to just a few SMIRKS patterns.

I would like to see some of the SMIRKS to get a better feel for how to handle this. For example, if all of the SMIRKS were transforms of R-groups to R-groups, where the R-groups could be defined as SMILES fragments with a single attachment point denoted *, then those SMILES could be merged into a single recursive SMARTS.

Otherwise, if multiple distinct SMARTS are needed, then the mmpdb file formats need to change somehow in order to store them. There could be multiple entries, one per definition, or they could be space/tab separated.

baoilleach · 2024-09-05T09:16:28Z

@adalke, I know you've stepped back from this, so this is more a record of my thoughts than a request...

I came across this issue today, when trying to work out why the database was 'missing' a matched pair compared to what I expected, and it led me to consider how the available fragmentation patterns differ from how Matsy worked. The Matsy fragmentation scheme is described in https://pubs.acs.org/doi/10.1021/jm500022q:

"The fragmentation scheme used involved a single cut at each acyclic single bond in turn if either end of the bond was involved in a ring or if the bond was between a non-sp2-hybridized carbon atom and a non-carbon atom."

This came from Roger. It's deceptively simple, but captures the synthetically aware sense of the RECAP rules. Subsequently, or at the same time, Antonio and Bajorath did work in this area explicitly using RECAP (here's one reference, https://pubs.rsc.org/en/content/articlelanding/2014/md/c3md00259d). We didn't use SMARTS, but I think the following is equivalent in mmpdb terms:

Must match
[!#6;!#0;!#1;!R]-[#6!R;$(*=*)]
or
[R]!@!=!#[!#0;!#1]    # this is the exocyclic one from mmpdb

I've been scratching my head wondering how to use the existing codebase with these patterns. Can I combine these into a single gnarly recursive SMARTS, where each end of the bond recursively contains the full pattern? It's not going to be pretty and it's not going to be fast - that's for sure. Since I'm only interested in indexing (i.e. identifying matched pairs), it sounds like I can build two separate dbs and merge them. I'll try that first.

adalke · 2024-09-05T14:15:58Z

They cannot be combined now. One option is to create a superset SMARTS grammar, e.g., use "%%" to separate SMARTS patterns ("%" cannot be used to start or end a SMARTS, and "%%" is not valid in SMARTS; alternative options exist, like the space character).

- Change fragment_types.py in parse_cut_smarts() so it returns a list of compiled SMARTS molecules, rather than a single one:

      smarts_terms = smarts.split("%%")
      if not smarts_terms:
          raise ValueError("cut smarts must not be empty")
      patterns = []
      for smarts_term in smarts_terms:
          pattern = Chem.MolFromSmarts(smarts_term)
          if pattern is None:
              raise ValueError("unable to parse cut SMARTS")
          if pattern.GetNumAtoms() != 2:
              raise ValueError("cut SMARTS must match exactly two atoms")
          if pattern.GetNumBonds() != 1:
              raise ValueError("cut SMARTS must connect both atoms")
          patterns.append(pattern)
      return patterns

  A fancier one, to give more narrowed-down error reporting, might be:

      smarts_terms = smarts.split("%%")
      if not smarts_terms:
          raise ValueError("cut smarts must not be empty")

      def get_value_error(msg, smarts_term):
          # if there are multiple terms then narrow it down
          if len(smarts_terms) == 1:
              extra = ""
          else:
              extra = f" term {smarts_term!r}"
          return ValueError(msg.format(extra=extra))

      patterns = []
      for smarts_term in smarts_terms:
          pattern = Chem.MolFromSmarts(smarts_term)
          if pattern is None:
              raise get_value_error("unable to parse cut SMARTS{extra}", smarts_term)
          if pattern.GetNumAtoms() != 2:
              raise get_value_error("cut SMARTS{extra} must match exactly two atoms", smarts_term)
          if pattern.GetNumBonds() != 1:
              raise get_value_error("cut SMARTS{extra} must connect both atoms", smarts_term)
          patterns.append(pattern)
      return patterns

- Change fragment_types.py "cut_pattern" to "cut_patterns" to reflect the change.

- Change fragment_types.py "get_cut_atom_pairs" to handle the new logic:

      seen = set()
      for pat in self.cut_patterns:
          for (atom1_idx, atom2_idx) in mol.GetSubstructMatches(pat):
              # put into canonical order so cuts are consistent across all patterns
              if atom1_idx < atom2_idx:
                  seen.add((atom1_idx, atom2_idx))
              else:
                  seen.add((atom2_idx, atom1_idx))
      return list(seen)

This would, I think, work for --cut-smarts on the command-line and as stored in the database, and be backwards compatible.

Andrew
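The deduplication step in get_cut_atom_pairs can be tried out without RDKit by feeding it match tuples directly. This is a minimal sketch of the same idea, not the mmpdb code: `merge_cut_pairs` is a hypothetical name, and the index pairs are made up to stand in for what `GetSubstructMatches()` would return for each pattern. The result is sorted here only to make the output deterministic.

```python
def merge_cut_pairs(matches_per_pattern):
    """Union the (atom1, atom2) matches of several patterns.

    Each pair is stored with the smaller index first, so a bond matched
    as (5, 6) by one pattern and (6, 5) by another is counted once.
    """
    seen = set()
    for matches in matches_per_pattern:
        for atom1_idx, atom2_idx in matches:
            seen.add((min(atom1_idx, atom2_idx), max(atom1_idx, atom2_idx)))
    return sorted(seen)

# Made-up matches for two patterns on the same molecule.
pattern_a = [(5, 6), (7, 6)]
pattern_b = [(6, 5), (6, 8)]
print(merge_cut_pairs([pattern_a, pattern_b]))  # [(5, 6), (6, 7), (6, 8)]
```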

baoilleach · 2024-09-05T15:34:49Z

Works like a dream::

$  mmpdb smifrag "c1ccccc1C(=O)Cl" --cut-smarts 'exocyclic%%[!#6;!#0;!#1;!R]-[#6!R;$(*=*)]'
                   |--------------  variable  --------------|       |---------------------  constant  --------------------
#cuts | enum.label | #heavies | symm.class | smiles         | order | #heavies | symm.class | smiles         | with-H
------+------------+----------+------------+----------------+-------+----------+------------+----------------+------------
  1   |     N      |    1     |      1     | *Cl            |    0  |    8     |      1     | *C(=O)c1ccccc1 | O=Cc1ccccc1
  1   |     N      |    8     |      1     | *C(=O)c1ccccc1 |    0  |    1     |      1     | *Cl            | Cl
  2   |     N      |    2     |     11     | *C(*)=O        |   01  |    7     |     12     | *Cl.*c1ccccc1  | -
  1   |     N      |    3     |      1     | *C(=O)Cl       |    0  |    6     |      1     | *c1ccccc1      | c1ccccc1
  1   |     N      |    6     |      1     | *c1ccccc1      |    0  |    3     |      1     | *C(=O)Cl       | O=CCl

I've checked it in over at https://github.com/baoilleach/mmpdb.

adalke · 2024-09-09T06:32:26Z

On Sep 5, 2024, at 17:35, baoilleach wrote: Works like a dream::

Good to hear! I like how it works to extend existing names.

$ mmpdb smifrag "c1ccccc1C(=O)Cl" --cut-smarts 'exocyclic%%[!#6;!#0;!#1;!R]-[#6!R;$(*=*)]'

In retrospect, I think "||" is a better separator than "%%": "|" is not used in SMARTS at all, "|" already has a meaning as "or", and "||" has an even stronger meaning as "or".

Andrew
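Splitting on the separator is a one-liner, but `str.split` never returns an empty list, so an emptiness check has to look at the individual terms. A minimal sketch, assuming "||" as the separator; `split_cut_smarts` is a hypothetical helper, not the function in mmpdb, and it does no SMARTS parsing or name lookup.

```python
def split_cut_smarts(smarts, separator="||"):
    """Split a --cut-smarts value into its individual terms."""
    terms = smarts.split(separator)
    for position, term in enumerate(terms, 1):
        if not term.strip():
            raise ValueError(f"cut SMARTS term {position} is empty")
    return terms

print(split_cut_smarts("exocyclic||[!#6;!#0;!#1;!R]-[#6!R;$(*=*)]"))
# ['exocyclic', '[!#6;!#0;!#1;!R]-[#6!R;$(*=*)]']
```

A value with no separator comes back as a one-element list, so existing single-pattern arguments would pass through unchanged.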

baoilleach · 2024-09-09T09:54:46Z

Indeed. Done.

KramerChristian added the enhancement label Sep 1, 2019

baoilleach added a commit to baoilleach/mmpdb that referenced this issue Sep 5, 2024

Support multiple SMARTS patterns in argument to --cut_smarts. See rdk…

f43568a

…it#15 for details.
