ReactSeq

⚓ Environments

Need two virtual environments (opennmt3 and rdkit2019)

Environment 1：opennmt3（for training and inferencing）

conda create -n opennmt3 python==3.8
conda activate opennmt3
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple torch==2.0 numpy transformers pandas tqdm
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -U OpenNMT-py

OpenNMT-py requires:

Python >= 3.8
PyTorch >= 2.0 <2.1

Environment 2：rdkit2019 (for data processing related to rdkit and indigo)

conda create -n rdkit2019 python==3.7
conda activate rdkit2019
conda install -c rdkit rdkit=2019.03.2 -y
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple epam.indigo
pip install ipykernel --upgrade

rdkit2019 requires:

python <=3.7

🚀 Quick Start of Generating ReactSeq

For mapped and kekulized rxn_smiles, we can get their corresponding ReactSeq.

Here is an example:

mapped_rxn: 
[CH:5]1=[C:1]([C:2]([CH3:3])=[O:4])[CH:9]=[C:8]2[C:7](=[CH:6]1)[NH:12][CH:11]=[CH:10]2.[O:20]([C:21]([O:22][C:23]([CH3:24])([CH3:26])[CH3:25])=[O:27])[C:13](=[O:14])[O:15][C:16]([CH3:17])([CH3:18])[CH3:19]>>[C:1]1([C:2]([CH3:3])=[O:4])=[CH:5][CH:6]=[C:7]2[C:8](=[CH:9]1)[CH:10]=[CH:11][N:12]2[C:13](=[O:14])[O:15][C:16]([CH3:17])([CH3:18])[CH3:19]

SMILES of Product: 
C1(C(C)=O)=CC=C2C(=C1)C=CN2C(=O)OC(C)(C)C
ReactSeq: 
C1(C(C)=O)=CC=C2C(=C1)C=CN2!C(=O)OC(C)(C)C<><C(OC(C)(C)C)(=O)[O:1]>
ReactSeq (rxn): 
C1(C(C)=O)=CC=C2C(=C1)C=CN2C(=O)OC(C)(C)C>>>C1(C(C)=O)=CC=C2C(=C1)C=CN2!C(=O)OC(C)(C)C<><C(OC(C)(C)C)(=O)[O:1]>

More details related to generating ReactSeq and transforming ReactSeq to SMILES of Reactants can be found in Usage_Example_of_ReactSeq.ipynb

🛠️ Data and Preprocessing

The USPTO_50K raw data are sourced from typed_schneider50k and stored in

data/50k_raw

You can generate the augmentated data by using

python preprocess_data.py -data 50k -split train -augtime 100 -rxn_class False
python preprocess_data.py -data 50k -split val -augtime 20 -rxn_class False
python preprocess_data.py -data 50k -split test -augtime 20 -rxn_class False

Note: It is suggested to process data under rdkit2019 environment (rdkit version: 2019.03.2)

The processed data will be stored in

data/50k_ReactSeq/aug100_train
data/50k_ReactSeq/aug20_val
data/50k_ReactSeq/aug20_test

You can also download our pre-processed data from google_drive and put them into the above place.

Training

Before training, check out the settings in train.sh and corresponding .yml file in ./config. Then, run

bash train.sh

Inferencing

Before inferencing, check out the settings in inference.sh and corresponding .yml file in ./config. Then, run

bash inference.sh

Here, we inference the test set (augtime x20) by our model.

Transforming

The predictions of model are in the format of ReactSeq, need to be transformed to SMILES of reactants.

conda activate rdkit2019
python transform.py \
    -src "datasets/50k_ReactSeq/aug20_test/src_aug20_test.txt" \
    -tgt "output/tgt_50k_ReactSeq_aug100_train_aug20_test_infer.txt" \
    -output "output/pred_reactants_50k_ReactSeq_aug100_train_aug20_test_infer.txt"

Note: Transform need to be under rdkit2019 environment.

Calculating Top-k Accuracy

Run cal_top_k_accuracy.ipynb. The results are reproducible by placing our predictions from google_drive into output/.

🔥 Quick Retrosynthesis Prediction

Please download our trained model from google_drive and put them into trained_models/.

🙌 Acknowledgments

Special thanks to GraphRetro and OpenNMT-py for the code used in this project.

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
ELN		ELN
__pycache__		__pycache__
checkpoints		checkpoints
config		config
datasets		datasets
embedding		embedding
onmt		onmt
output		output
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
Usage_Example_of_ReactSeq.ipynb		Usage_Example_of_ReactSeq.ipynb
cal_top_k_accuracy.ipynb		cal_top_k_accuracy.ipynb
e_smiles.py		e_smiles.py
inference.py		inference.py
inference.sh		inference.sh
preprocess_data.py		preprocess_data.py
train.py		train.py
train.sh		train.sh
transform.py		transform.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ReactSeq

⚓ Environments

Environment 1：opennmt3（for training and inferencing）

Environment 2：rdkit2019 (for data processing related to rdkit and indigo)

🚀 Quick Start of Generating ReactSeq

🛠️ Data and Preprocessing

Training

Inferencing

Transforming

Calculating Top-k Accuracy

🔥 Quick Retrosynthesis Prediction

🙌 Acknowledgments

About

Releases 1

Packages

Contributors 2

Languages

License

jiachengxiong/ReactSeq

Folders and files

Latest commit

History

Repository files navigation

ReactSeq

⚓ Environments

Environment 1：opennmt3（for training and inferencing）

Environment 2：rdkit2019 (for data processing related to rdkit and indigo)

🚀 Quick Start of Generating ReactSeq

🛠️ Data and Preprocessing

Training

Inferencing

Transforming

Calculating Top-k Accuracy

🔥 Quick Retrosynthesis Prediction

🙌 Acknowledgments

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

Packages