- Please refer to our technical report for details of the implementation and performance.
ogb==1.3.0
rdkit>=2019.03.1
obabel>=3.1.0
torch>=1.7.0
paddlepaddle-gpu>=2.1.0
pgl>=2.1.4
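The pins above can be checked against the current environment before training. The following is a minimal sketch, not part of the repository: the helper names (`parse_requirement`, `version_tuple`, `satisfies`, `check`) are hypothetical, and the comparison only looks at the leading numeric parts of each version. It assumes the listed names match the installed distribution names; packages installed by other means (for example `rdkit` from conda, or `obabel` if it is only available as a command-line tool) may be reported as not found even when they are present.

```python
import re
from importlib import metadata

# Pins copied from the requirement list above.
REQUIREMENTS = [
    "ogb==1.3.0",
    "rdkit>=2019.03.1",
    "obabel>=3.1.0",
    "torch>=1.7.0",
    "paddlepaddle-gpu>=2.1.0",
    "pgl>=2.1.4",
]


def parse_requirement(line):
    """Split 'name>=1.2.3' into (name, operator, version)."""
    for op in ("==", ">="):
        if op in line:
            name, version = line.split(op, 1)
            return name.strip(), op, version.strip()
    raise ValueError(f"unsupported requirement: {line!r}")


def version_tuple(version):
    """Turn '2019.03.1' into (2019, 3, 1); stop at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two version strings under '==' or '>='."""
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so that '2.2' compares like '2.2.0'
    b += (0,) * (width - len(b))
    return a == b if op == "==" else a >= b


def check(requirements=REQUIREMENTS):
    """Print one status line per requirement."""
    for line in requirements:
        name, op, required = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not found via package metadata")
            continue
        status = "ok" if satisfies(installed, op, required) else "MISMATCH"
        print(f"{name}: installed {installed}, required {op}{required} -> {status}")


if __name__ == "__main__":
    check()
```

A `MISMATCH` line only signals that the installed version differs from the pin; whether the code still runs with that version is not something this check can tell.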
Under the root directory, run the following commands to download the original PCQM4M dataset, the DFT results for auxiliary tasks, and the cross-validation split indexes.
mkdir dataset && cd dataset
wget http://ogb-data.stanford.edu/data/lsc/pcqm4m_kddcup2021.zip
unzip pcqm4m_kddcup2021.zip
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/PCQM_pretrain/sdf.tar.gz
mv sdf.tar.gz pcqm_pyscf_sdf.tar.gz
tar -xzvf pcqm_pyscf_sdf.tar.gz
wget https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/PCQM_pretrain/cross_split.pkl
cd ..
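After the download, it can be useful to confirm that nothing is missing before starting the 30-minute preprocessing step. The sketch below is an illustration only: `missing_dataset_files` is a hypothetical helper, and it checks just the three file names that appear in the commands above, not the contents of the extracted archives.

```python
from pathlib import Path

# File names taken from the download commands above.
EXPECTED_FILES = (
    "pcqm4m_kddcup2021.zip",
    "pcqm_pyscf_sdf.tar.gz",
    "cross_split.pkl",
)


def missing_dataset_files(dataset_dir="dataset"):
    """Return the expected file names that are not present in dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_dataset_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All expected downloads are present.")
```

Run it from the root directory so that the relative `dataset` path resolves correctly.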
|-- ogbg_lsc
|-- README.md
|-- src # scripts
|-- models # model definition
|-- utils
|-- outputs # model predictions (generated by training)
|-- dataset # the original dataset and custom splits
|-- checkpoints # model checkpoints (generated by training)
|-- ensemble # submitted predictions and code for ensemble
|-- logs # train logs (generated by training)
Model hyper-parameters and other arguments are all defined in ./src/config.yaml.
Our pipeline consists of 4 steps:
- Data preprocessing
cd ./features
python mol_tree.py # takes about 30 min
- Model training
Train the model with 2-fold cross validation:
cd ./src
. ./cross_run.sh 0 1 # "0 1" defines the CUDA devices; training on the whole dataset will take about 10 days
Or simply train the model with the original validation set:
cd ./src
export CUDA_VISIBLE_DEVICES=0
python main.py --config config.yaml
- Test inference (optional)
There is no need to run the inference program separately, since it is included in the training program. If it is needed, set the infer_from hyper-parameter in ./src/config.yaml to the saved model path after training, then run the following commands:
cd ./src
python test.py --config config.yaml --output_path ./test_result
The test result will be saved in ./src/test_result.
- Ensemble
Copy the model predictions and run the ensemble:
cd ../outputs
rsync -av * ../ensemble/model_pred/new_run
cd ../ensemble
python ensemble.py
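This README does not describe how ensemble.py combines the predictions. As an illustration of the general idea only, the sketch below averages the predictions of several runs element by element, which is a common baseline for regression ensembles. The function `average_predictions` is hypothetical and is not taken from the repository; the actual script may weight or select models differently.

```python
def average_predictions(runs):
    """Element-wise mean of several prediction lists, one list per model run.

    This is an assumed, unweighted scheme for illustration, not the
    method used by ensemble.py.
    """
    if not runs:
        raise ValueError("at least one run is required")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must predict the same number of molecules")
    # zip(*runs) groups the predictions made for the same molecule.
    return [sum(values) / len(runs) for values in zip(*runs)]


if __name__ == "__main__":
    fold_0 = [5.0, 6.0, 4.0]
    fold_1 = [5.5, 5.0, 4.5]
    print(average_predictions([fold_0, fold_1]))
```

Averaging tends to help most when the individual models make partly independent errors, which is one motivation for training on different cross-validation folds.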
The whole training/ensemble pipeline is defined in ./src/main.sh. The shortcut to start the default training with 2-fold cross validation:
cd ./src
sh main.sh
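For readers unfamiliar with the term, 2-fold cross validation splits the data into two parts and trains twice, each time holding out a different part for validation. The sketch below shows that idea with generic indices. It is an assumption-laden illustration: `make_folds` and `cross_validation_rounds` are hypothetical helpers, and the real splits used by this project come from cross_split.pkl, whose format is not described here.

```python
import random


def make_folds(n_items, n_folds=2, seed=0):
    """Shuffle item indices and deal them into n_folds disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validation_rounds(folds):
    """Yield (train_indices, valid_indices), holding out each fold once."""
    for held_out, valid_idx in enumerate(folds):
        train_idx = [
            i for k, fold in enumerate(folds) if k != held_out for i in fold
        ]
        yield train_idx, valid_idx


if __name__ == "__main__":
    folds = make_folds(10, n_folds=2)
    for round_id, (train_idx, valid_idx) in enumerate(cross_validation_rounds(folds)):
        print(f"round {round_id}: {len(train_idx)} train, {len(valid_idx)} valid")
```

Each round produces one model, so a 2-fold run yields two sets of predictions that can later be combined in the ensemble step.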
Model | Test MAE | #Parameters | Hardware |
---|---|---|---|
LiteGEM | 0.1204 | 74M | Nvidia Tesla P40 (24GB GPU) |
GIN* | 0.1678 | 3.8M | GeForce RTX 2080 (11GB GPU) |
GIN-virtual* | 0.1487 | 6.7M | GeForce RTX 2080 (11GB GPU) |
GCN* | 0.1838 | 2.0M | GeForce RTX 2080 (11GB GPU) |
GCN-virtual* | 0.1579 | 4.9M | GeForce RTX 2080 (11GB GPU) |
MLP+Fingerprint* | 0.2068 | 16.1M | GeForce RTX 2080 (11GB GPU) |
* Results copied from the baseline performance for PCQM4M-LSC.
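The Test MAE column reports the mean absolute error between predicted and reference values, so lower is better. As a reminder of how the metric is computed, here is a minimal sketch; `mean_absolute_error` is written out by hand for illustration and is not the evaluator used to produce the numbers in the table.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |target - prediction| over all samples."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0]
    predictions = [1.5, 2.0, 2.0]
    print(mean_absolute_error(targets, predictions))  # prints 0.5
```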
@misc{fang2021chemrlgem,
title={ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction},
author={Xiaomin Fang and Lihang Liu and Jieqiong Lei and Donglong He and Shanzhuo Zhang and Jingbo Zhou and Fan Wang and Hua Wu and Haifeng Wang},
year={2021},
eprint={2106.06130},
archivePrefix={arXiv},
primaryClass={cs.LG}
}