The charset #1
Comments
Great suggestion. Yes, SMILES is clearly suboptimal for this reason. The molecular autoencoder would almost certainly work better if we used a modified language that had fewer opportunities to produce invalid strings.
Dear Hsiao Yi,
you may find the following paper relevant; we submitted it to arXiv very recently:
https://arxiv.org/abs/1703.01925
By using a grammar and building the variational autoencoder on the production rules of that grammar, we avoid some of the problems that you mention.
Miguel.
(In reply to hsiao yi's message of Tue, Feb 28, 2017 at 8:14 PM.)
Hi, I tried to find the code for the Bayesian optimization used in the paper, but it does not seem to be included. Do you plan to share the BO code?
I tried using Bayesian optimization to find better molecules. But when I ran the BO search in the 292-dimensional space, I always got invalid SMILES, like the ones Hsiao Yi got. I guess this might be caused by the way the inducing points are chosen, right?
You were doing BayesOpt in a 292-dimensional space? We were already having a hard time with a 56D space. One thing you might want to look at is the lengthscales of each dimension. We found that they were often very long, and that the GP was basically just doing linear regression.
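One way to see the linear-regression remark is sketched below. It assumes a squared-exponential (RBF) kernel with one lengthscale per dimension; the thread does not say which kernel was used, and the function names are illustrative. When every lengthscale is much larger than the distance between inputs, `exp(-u)` is close to `1 - u`, and expanding `(a - b)**2` leaves `a * b / l**2` as the only term that couples the two inputs, which is a linear (dot-product) kernel.

```python
import math

def rbf_ard(x, y, lengthscales, variance=1.0):
    """Squared-exponential kernel with one lengthscale per dimension."""
    u = 0.5 * sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * math.exp(-u)

def first_order(x, y, lengthscales, variance=1.0):
    """First-order expansion exp(-u) ~ 1 - u of the same kernel."""
    u = 0.5 * sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, lengthscales))
    return variance * (1.0 - u)

# Two arbitrary points inside the box [-1, 1]^3 (made-up values).
x = [0.9, -0.4, 0.1]
y = [-0.8, 0.7, -1.0]

for scale in (1.0, 50.0):
    ls = [scale] * len(x)
    gap = abs(rbf_ard(x, y, ls) - first_order(x, y, ls))
    print(f"lengthscale={scale:>5}: |exact - first order| = {gap:.2e}")
```

With a lengthscale of 1 the two values differ substantially; with a lengthscale of 50 they agree to several decimal places, so the kernel carries little beyond the linear cross-term.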
I will try to upload the code for Bayesian optimization by next week. In our experiments we obtained a large number of invalid SMILES. At each point, we decoded a large number of SMILES strings (500) and kept only the valid ones.
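The decode-many-and-filter step described above can be sketched as follows. This is not the authors' code: `keep_valid`, `toy_decode`, and `looks_balanced` are hypothetical names, the decoder is a stand-in that just samples from a fixed pool, and the validity check is a deliberately crude one (balanced parentheses and paired ring-closure digits). In practice a cheminformatics parser would replace it, for example RDKit's `Chem.MolFromSmiles`, which returns `None` for input it cannot parse.

```python
import random

def looks_balanced(smiles):
    """Crude validity stand-in: parentheses balance and ring digits pair up."""
    depth = 0
    open_rings = set()
    in_bracket = False
    for ch in smiles:
        if in_bracket:                 # skip bracket-atom contents such as [13C]
            in_bracket = ch != "]"
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch.isdigit():
            open_rings ^= {ch}         # toggle: opened on first sight, closed on second
    return depth == 0 and not open_rings and not in_bracket

def keep_valid(decode, z, n_samples=500, is_valid=looks_balanced):
    """Decode the latent point `z` n_samples times; return the valid strings."""
    samples = (decode(z) for _ in range(n_samples))
    return [s for s in samples if is_valid(s)]

# Hypothetical stochastic decoder: ignores z and draws from a fixed pool.
rng = random.Random(0)
POOL = [
    "CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1",  # valid endpoint from the thread
    "CCNC(=O)CN(CC1((l)CN1c1ccc(OC)cc1",   # invalid sample from the thread
    "CC(C)(O)CCC1CCC(Cr)So2c1ccc(C)cc1",   # invalid sample from the thread
]

def toy_decode(z):
    return rng.choice(POOL)

kept = keep_valid(toy_decode, z=None, n_samples=500)
print(len(kept), "of 500 samples kept")
```

Passing the check as an argument keeps the sampling loop unchanged when the crude test is swapped for a real parser.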
Thank you very much for answering my questions. Yes, I tried 292 dimensions using GPyOpt. For the lengthscale of each dimension I used [-1, 1]; I guess this lengthscale might not be correct. I look forward to your BO code.
Can you explain why a 292-dimensional space is used? What's the logic behind it?
As I proposed in maxhodak/keras-molecules#54, I am interested in why the charset is designed like this. It's not straightforward. From the viewpoint of chemistry, the chlorine "Cl" should not be treated as "C" and "l". Maybe there would be some improvement if we re-designed the charset. I used the implementation from keras-molecules, and when I tried to interpolate between 2 chemical structures (CC=C(C(=CC)c1ccc(O)cc1)c1ccc(O)cc1 and CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1), I got invalid structures like the ones below, so I guess the charset is the reason for this.
CC(C)(O)CCC1CCC(Cr)So2c1ccc(C)cc1
CCNC(=O)CN(CC1((l)CN1c1ccc(OC)cc1
CN1C(=O)CN(CC1((#)CN1c1ccc(OC)cc1
CN1C(=O)CC(CC()(=O)C1c1ccc(Cl)cc1
CN1C(=O)CC(NC()(=O)C1c1ccc(Cl)cc1
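The charset redesign proposed above could look something like the sketch below, which splits a SMILES string into chemically meaningful tokens so that "Cl" and "Br" are single symbols. This is an illustration only, not what keras-molecules or this repository does; the regular expression and the `tokenize` helper are assumptions. It covers the common organic subset and does not check that the string is a valid molecule.

```python
import re

# Alternatives are tried left to right, so multi-character tokens come first.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                # two-letter halogens of the organic subset
    r"|%\d{2}"               # two-digit ring-closure labels such as %12
    r"|[A-Za-z]"             # remaining one-letter atoms, aliphatic or aromatic
    r"|\d"                   # single-digit ring-closure labels
    r"|[()=#\-+\\/.:@*~$]"   # branches, bonds, and other punctuation
)

def tokenize(smiles):
    """Split a SMILES string into tokens; raise if any character is skipped."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognised characters in {smiles!r}")
    return tokens

print(tokenize("c1ccc(Cl)cc1"))
# ['c', '1', 'c', 'c', 'c', '(', 'Cl', ')', 'c', 'c', '1']

# A charset built from tokens contains 'Cl' and no stray 'l'.
charset = sorted(set(tokenize("CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1")))
print(charset)
```

With such a vocabulary a decoder can no longer emit a lone "l", although it can still produce unbalanced branches or unclosed rings, which is the kind of problem the grammar-based approach mentioned in the comments targets.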