The charset #1

Open
hsiaoyi0504 opened this issue Feb 28, 2017 · 8 comments

Comments

@hsiaoyi0504

hsiaoyi0504 commented Feb 28, 2017

As I proposed in maxhodak/keras-molecules#54, I am interested in why the charset is designed like this. It's not straightforward. From a chemistry viewpoint, chlorine "Cl" should not be treated as the two separate characters "C" and "l". Redesigning the charset might bring some improvement. I used the implementation from keras-molecules, and when I tried to interpolate between two chemical structures (CC=C(C(=CC)c1ccc(O)cc1)c1ccc(O)cc1 and CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1), I got invalid structures like the ones below, so I suspect the charset is the cause:
CC(C)(O)CCC1CCC(Cr)So2c1ccc(C)cc1
CCNC(=O)CN(CC1((l)CN1c1ccc(OC)cc1
CN1C(=O)CN(CC1((#)CN1c1ccc(OC)cc1
CN1C(=O)CC(CC**()(=O)C1c1ccc(Cl)cc1
CN1C(=O)CC(NC()(=O)C1**c1ccc(Cl)cc1
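
To make the "Cl" point concrete, here is a minimal sketch of a token-level charset, assuming a regex tokenizer that matches two-letter halogens before single letters. The pattern, the `tokenize` helper, and the `build_charset` helper are illustrative assumptions, not code from keras-molecules or from this repository, and the pattern covers only a common subset of SMILES.

```python
import re

# Assumed token pattern: bracket atoms, then Br/Cl, then %NN ring closures,
# then single-letter atoms, bond/branch symbols, and single-digit ring closures.
# "Br" and "Cl" are listed before the single letters so that "Cl" is never
# split into "C" followed by "l".
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|%\d{2}|[BCNOPSFI]|[bcnops]|[=#$:/\\.()+\-@*~]|\d"
)


def tokenize(smiles):
    """Split a SMILES string into tokens; reject strings with unknown characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def build_charset(smiles_list):
    """Collect the sorted set of tokens that appear in a list of SMILES strings."""
    return sorted({tok for s in smiles_list for tok in tokenize(s)})


tokens = tokenize("CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1")
charset = build_charset([
    "CC=C(C(=CC)c1ccc(O)cc1)c1ccc(O)cc1",
    "CN1C(=O)CCS(=O)(=O)C1c1ccc(Cl)cc1",
])
```

With a charset built this way, a decoder can no longer emit a lone "l", although it can still produce unbalanced parentheses or ring closures.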

@duvenaud
Contributor

duvenaud commented Mar 1, 2017

Great suggestion. Yes, SMILES is clearly suboptimal for this reason. The molecular autoencoder would almost certainly work better if we used a modified language that had fewer opportunities to produce invalid strings.

@jmhernandezlobato

jmhernandezlobato commented Mar 12, 2017 via email

@yangxiufengsia

Hi, I tried to find the Bayesian optimization code used in this paper, but it does not seem to be included. Do you plan to share the BO code?

@yangxiufengsia

I tried to use Bayesian optimization to find better molecules. But when I ran the BO search in the 292-dimensional space, I always got invalid SMILES, like the ones Hsiao Yi got, so I guess this might be caused by the way the inducing points are chosen, right?

@duvenaud
Contributor

duvenaud commented Jun 7, 2017

You were doing BayesOpt in a 292-dimensional space? We were already having a hard time with a 56D space. One thing you might want to look at is the lengthscale of each dimension. We found that they were often very long, and that the GP was basically just doing linear regression.
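
As a rough illustration of what a very long lengthscale implies, here is a small sketch assuming a squared-exponential kernel with one lengthscale per dimension and a search box of [-1, 1] per dimension. The `ard_rbf` function and the numbers are hypothetical and are not taken from the paper's GP.

```python
import math


def ard_rbf(x, y, lengthscales, variance=1.0):
    """Squared-exponential kernel with a separate lengthscale per input dimension."""
    sq_dist = sum(((a - b) / ell) ** 2 for a, b, ell in zip(x, y, lengthscales))
    return variance * math.exp(-0.5 * sq_dist)


# Opposite corners of the box [-1, 1]^2.
corner_a = [-1.0, -1.0]
corner_b = [1.0, 1.0]

# Lengthscales shorter than the box: the two corners are almost uncorrelated.
k_short = ard_rbf(corner_a, corner_b, [0.5, 0.5])

# Lengthscales much longer than the box: the kernel is nearly flat everywhere,
# so the GP can only express slowly varying, close-to-linear trends.
k_long = ard_rbf(corner_a, corner_b, [50.0, 50.0])
```

Comparing the fitted lengthscales against the width of the search range in each dimension is one quick way to check whether the GP is in this regime.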

@jmhernandezlobato

I will try to upload the code for Bayesian optimization by next week. In our experiments we obtained a large number of invalid SMILES. At each point, we decoded a large number of SMILES (500) and, from those, kept only the valid ones.
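
The decode-many-and-filter step described above could be sketched as follows, assuming a stochastic decoder. This is not the authors' code: `decode_valid`, `toy_decoder`, and `looks_valid` are hypothetical names, and `looks_valid` is a toy parenthesis check standing in for a real parser such as RDKit's `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse.

```python
import random


def looks_valid(smiles):
    """Toy stand-in for a real validity check: non-empty with balanced parentheses."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return bool(smiles) and depth == 0


def decode_valid(z, decode_fn, is_valid=looks_valid, n_attempts=500):
    """Decode the latent point z n_attempts times; count each distinct valid string."""
    counts = {}
    for _ in range(n_attempts):
        smiles = decode_fn(z)
        if is_valid(smiles):
            counts[smiles] = counts.get(smiles, 0) + 1
    return counts


# Hypothetical stochastic decoder that ignores z and samples from a fixed pool.
_rng = random.Random(0)
_POOL = ["CCO", "CC(C)O", "CC((O", "c1ccc(Cl)cc1)", ""]


def toy_decoder(z):
    return _rng.choice(_POOL)


valid_counts = decode_valid([0.0] * 56, toy_decoder)
```

Keeping the counts rather than only the set of strings makes it possible to pick the most frequently decoded valid molecule for each latent point.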

@yangxiufengsia

Thank you very much for answering my questions. Yes, I tried 292 dimensions using GPyOpt. For the lengthscale of each dimension, I used [-1,1]; I guess this lengthscale might not be correct. I look forward to your BO code.

@abhik1368

Can you explain why we are using a 292-dimensional space? What's the logic behind it?
