Question related to Model generated after training complete #67

Vinceuwe · 2022-09-20T12:05:59Z

Hi
I am new to MACE. After training completes, why is my MACE model generated from the epoch with the lowest loss value rather than from the last epoch, with the lowest RMSE for energy and forces? Is this due to my training data set? (I am using oxides.)
In addition, after "Changing loss based on SWA", the loss does not get any lower (it stays > 0.5). Is there any possible solution for this?

The following are part of my output:
......
2022-09-20 10:29:03.008 INFO: Epoch 172: loss=0.2121, RMSE_E_per_atom=86.6 meV, RMSE_F=144.2 meV / A
2022-09-20 10:29:30.293 INFO: Epoch 174: loss=0.2073, RMSE_E_per_atom=84.6 meV, RMSE_F=142.7 meV / A
2022-09-20 10:29:56.333 INFO: Epoch 176: loss=0.2055, RMSE_E_per_atom=85.2 meV, RMSE_F=142.0 meV / A
2022-09-20 10:30:23.293 INFO: Epoch 178: loss=0.2055, RMSE_E_per_atom=84.9 meV, RMSE_F=141.9 meV / A
2022-09-20 10:30:49.278 INFO: Epoch 180: loss=0.2064, RMSE_E_per_atom=85.1 meV, RMSE_F=142.3 meV / A
2022-09-20 10:31:15.227 INFO: Epoch 182: loss=0.2080, RMSE_E_per_atom=85.4 meV, RMSE_F=142.9 meV / A
......
2022-09-20 12:44:32.619 INFO: Epoch 798: loss=0.2939, RMSE_E_per_atom=79.2 meV, RMSE_F=172.0 meV / A
2022-09-20 12:44:58.490 INFO: Epoch 800: loss=0.2940, RMSE_E_per_atom=79.3 meV, RMSE_F=172.0 meV / A
2022-09-20 12:44:58.490 INFO: Changing loss based on SWA
2022-09-20 12:45:24.412 INFO: Epoch 802: loss=4.4866, RMSE_E_per_atom=67.2 meV, RMSE_F=217.0 meV / A
2022-09-20 12:45:50.332 INFO: Epoch 804: loss=3.2585, RMSE_E_per_atom=56.8 meV, RMSE_F=290.4 meV / A
2022-09-20 12:46:16.266 INFO: Epoch 806: loss=2.5655, RMSE_E_per_atom=50.1 meV, RMSE_F=340.0 meV / A
......
2022-09-20 13:26:00.144 INFO: Epoch 990: loss=0.7132, RMSE_E_per_atom=23.5 meV, RMSE_F=363.9 meV / A
2022-09-20 13:26:25.996 INFO: Epoch 992: loss=0.7140, RMSE_E_per_atom=23.4 meV, RMSE_F=363.2 meV / A
2022-09-20 13:26:51.794 INFO: Epoch 994: loss=0.7081, RMSE_E_per_atom=23.3 meV, RMSE_F=362.3 meV / A
2022-09-20 13:27:17.722 INFO: Epoch 996: loss=0.7218, RMSE_E_per_atom=23.6 meV, RMSE_F=361.9 meV / A
2022-09-20 13:27:43.711 INFO: Epoch 998: loss=0.7198, RMSE_E_per_atom=23.5 meV, RMSE_F=362.8 meV / A
2022-09-20 13:27:56.566 INFO: Training complete
2022-09-20 13:27:56.570 INFO: Loading checkpoint: checkpoints/MACE_model_run-123_epoch-176.pt
2022-09-20 13:27:56.624 INFO: Loaded model from epoch 176
2022-09-20 13:27:56.624 INFO: Computing metrics for training, validation, and test sets
2022-09-20 13:28:11.828 INFO: Evaluating train ...
2022-09-20 13:28:28.318 INFO: Evaluating valid ...
2022-09-20 13:28:31.363 INFO: Evaluating Default ...
2022-09-20 13:28:33.182 INFO: Evaluating slab_MD ...
2022-09-20 13:28:33.662 INFO:
+-------------+---------------------+------------------+-------------------+
| config_type | RMSE E / meV / atom | RMSE F / meV / A | relative F RMSE % |
+-------------+---------------------+------------------+-------------------+
| train | 70.5 | 59.9 | 6.33 |
| valid | 85.2 | 142.0 | 15.79 |
| Default | 78.6 | 71.1 | 2364.92 |
| slab_MD | 41.8 | 166.3 | 13.01 |
+-------------+---------------------+------------------+-------------------+
2022-09-20 13:28:33.662 INFO: Saving model to checkpoints/MACE_model_run-123.model
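
To check which logged epoch has the lowest loss in output like the above, a small parser can help. This is a minimal sketch that assumes the `Epoch N: loss=...` line format shown in this log; the helper names `parse_epochs` and `lowest_loss_epoch` are made up for illustration and are not part of MACE. It only sees the rounded values printed in the log, so it may not agree with the checkpoint MACE actually keeps when two epochs print the same loss.

```python
import re

# Matches the epoch lines as they appear in the log excerpt above.
EPOCH_LINE = re.compile(
    r"Epoch (\d+): loss=([\d.]+), RMSE_E_per_atom=([\d.]+) meV, "
    r"RMSE_F=([\d.]+) meV / A"
)


def parse_epochs(log_text):
    """Return a list of (epoch, loss, rmse_e_meV, rmse_f_meV) tuples."""
    rows = []
    for match in EPOCH_LINE.finditer(log_text):
        epoch, loss, rmse_e, rmse_f = match.groups()
        rows.append((int(epoch), float(loss), float(rmse_e), float(rmse_f)))
    return rows


def lowest_loss_epoch(rows):
    """Pick the row with the smallest logged loss."""
    return min(rows, key=lambda row: row[1])


# A few lines copied from the log above.
log_excerpt = """
2022-09-20 10:29:30.293 INFO: Epoch 174: loss=0.2073, RMSE_E_per_atom=84.6 meV, RMSE_F=142.7 meV / A
2022-09-20 10:29:56.333 INFO: Epoch 176: loss=0.2055, RMSE_E_per_atom=85.2 meV, RMSE_F=142.0 meV / A
2022-09-20 10:30:49.278 INFO: Epoch 180: loss=0.2064, RMSE_E_per_atom=85.1 meV, RMSE_F=142.3 meV / A
2022-09-20 13:27:43.711 INFO: Epoch 998: loss=0.7198, RMSE_E_per_atom=23.5 meV, RMSE_F=362.8 meV / A
"""

rows = parse_epochs(log_excerpt)
best = lowest_loss_epoch(rows)
print(best)  # (176, 0.2055, 85.2, 142.0)
```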

davkovacs · 2022-09-20T12:36:07Z

Maintainer

Hi! Thank you for reporting back. Could you please send us the input line for MACE?

When SWA is switched on, the loss function changes, so its numerical value will be different. That in itself is not a problem.

ilyes319 · 2022-09-20T12:37:14Z

Maintainer

Hi @Vinceuwe,
Thanks for your interest in MACE!
The model that was saved corresponds to the model with the overall lowest loss. This means that it depends on how you weight the forces and energies in your loss. Because the default puts a larger weight on forces than on energies, the model with the best RMSE on forces (at epoch 176 for you) was saved.
This may be confusing because of the use of SWA. In your case it works as follows:

For the first 800 epochs, the loss was computed by putting a weight of 10 on the forces and 1 on the energies, hence your forces being much better.
After 800 epochs, the loss weights changed to 1000 on the energies and 1 on the forces. Hence the better energies.

Because it seems the model is struggling with learning energies, this results in a significant deterioration in the accuracy of the forces. The best model saved was thus the one at 176.
Could you please send us the input file for the trained model? Also, could you tell us your system size and check whether you are using the correct E0s?
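
The effect of the weight switch can be reproduced roughly from the numbers in the log. The sketch below assumes the loss is a weighted sum of the mean squared per-atom energy error and the mean squared force error, both in eV units; that is a simplification, and MACE's actual loss is computed per batch and may be normalised differently. The weights (1, 10) come from the `WeightedEnergyForcesLoss` line in the log, and (1000, 1) from the description above. The function name `approx_weighted_loss` is hypothetical.

```python
def approx_weighted_loss(rmse_e_mev_per_atom, rmse_f_mev_per_a,
                         energy_weight, forces_weight):
    """Approximate a weighted loss from RMSEs given in meV (converted to eV)."""
    mse_e = (rmse_e_mev_per_atom / 1000.0) ** 2
    mse_f = (rmse_f_mev_per_a / 1000.0) ** 2
    return energy_weight * mse_e + forces_weight * mse_f


# Epoch 176, first stage: energy weight 1, forces weight 10.
before_swa = approx_weighted_loss(85.2, 142.0,
                                  energy_weight=1.0, forces_weight=10.0)

# Epoch 998, after the switch: energy weight 1000, forces weight 1.
after_swa = approx_weighted_loss(23.5, 362.8,
                                 energy_weight=1000.0, forces_weight=1.0)

print(f"{before_swa:.3f}")  # 0.209 (logged loss: 0.2055)
print(f"{after_swa:.3f}")   # 0.684 (logged loss: 0.7198)
```

Both estimates land close to the logged values, which suggests that the jump in the loss at epoch 800 comes from the reweighting. Loss values from before and after the switch are on different scales and should not be compared directly.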

Vinceuwe · 2022-09-20T12:56:50Z

Author

Hi, thanks for your reply. The information you provided definitely helps me understand MACE better. The following is the submission script:
python /raven/u/hwan/mace/scripts/run_train.py \
    --name="MACE_model" \
    --train_file="atoms_training_32.xyz" \
    --valid_fraction=0.05 \
    --test_file="atoms_test_32.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --energy_key="DFT_energy" \
    --forces_key="DFT_forces" \
    --model="MACE" \
    --hidden_irreps='128x0e + 128x1o' \
    --r_max=5.0 \
    --batch_size=30 \
    --max_num_epochs=1000 \
    --swa \
    --start_swa=800 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda

2022-09-20 10:27:01.622 INFO: CUDA version: 11.1, CUDA device: 0
2022-09-20 10:27:07.698 INFO: Using isolated atom energies from training file
2022-09-20 10:27:07.725 INFO: Loaded 931 training configurations from 'atoms_training_32.xyz'
2022-09-20 10:27:07.725 INFO: Using random 5.0% of training set for validation
2022-09-20 10:27:07.864 INFO: Loaded 207 test configurations from 'atoms_test_32.xyz'
2022-09-20 10:27:07.864 INFO: Total number of configurations: train=885, valid=46, tests=[Default: 131, slab_MD: 76]
2022-09-20 10:27:07.870 INFO: AtomicNumberTable: (8, 77)
2022-09-20 10:27:07.871 INFO: Atomic energies: [-0.08969644, -0.33524439]
2022-09-20 10:27:24.751 INFO: WeightedEnergyForcesLoss(energy_weight=1.000, forces_weight=10.000)
2022-09-20 10:27:24.908 INFO: Average number of neighbors: 39.096

My training set contains structures ranging from 4 to 200 atoms. They are quite diverse, being the result of a GAP workflow run over 30 iterations.

ilyes319 · 2022-09-20T13:06:04Z

Maintainer

Thanks for your reply. Your input script seems correct to me. However, I think you might have a problem with your atomic energies. Could you please try running again with --E0s="average" added to your input script? This will do a linear fit on your training data to compute your E0s.
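
To illustrate what a linear fit for E0s means, here is a minimal sketch using NumPy least squares. It is not MACE's implementation. The function name `fit_average_e0s`, the compositions, and the energies are all made up for illustration; the two columns stand for the two elements in the log (Z = 8 and Z = 77).

```python
import numpy as np


def fit_average_e0s(compositions, total_energies):
    """Least-squares fit of per-element reference energies.

    compositions: (n_configs, n_elements) atom counts per configuration.
    total_energies: (n_configs,) total energy of each configuration.
    Solves compositions @ e0s ~= total_energies for e0s.
    """
    counts = np.asarray(compositions, dtype=float)
    energies = np.asarray(total_energies, dtype=float)
    e0s, *_ = np.linalg.lstsq(counts, energies, rcond=None)
    return e0s


# Made-up data: columns are the number of Z=8 and Z=77 atoms per configuration.
compositions = [[2, 1], [4, 2], [3, 1], [8, 4], [1, 1]]
# Made-up per-element energies used only to generate consistent totals.
assumed_e0s = np.array([-5.0, -9.0])
total_energies = np.array(compositions) @ assumed_e0s

e0s = fit_average_e0s(compositions, total_energies)
print(e0s)  # recovers the assumed per-element energies
```

With real data the fit does not reproduce the total energies exactly. The fitted values act as an average per-element energy offset, so the model only has to learn the remainder.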

Vinceuwe · 2022-09-21T08:53:17Z

Author

Hi, I tested --E0s="average".
With isolated atoms in my training set:
2022-09-20 20:22:45.071 INFO: Epoch 994: loss=4.3130, RMSE_E_per_atom=68.1 meV, RMSE_F=450.6 meV / A
2022-09-20 20:23:09.699 INFO: Epoch 996: loss=4.1301, RMSE_E_per_atom=66.6 meV, RMSE_F=445.2 meV / A
2022-09-20 20:23:34.471 INFO: Epoch 998: loss=4.0964, RMSE_E_per_atom=66.3 meV, RMSE_F=447.6 meV / A
2022-09-20 20:23:46.774 INFO: Training complete
2022-09-20 20:23:46.775 INFO: Loading checkpoint: checkpoints/MACE_model_run-123_epoch-460.pt
2022-09-20 20:23:47.353 INFO: Loaded model from epoch 460
2022-09-20 20:23:47.353 INFO: Computing metrics for training, validation, and test sets
2022-09-20 20:24:03.228 INFO: Evaluating train ...
2022-09-20 20:24:19.352 INFO: Evaluating valid ...
2022-09-20 20:24:22.500 INFO: Evaluating Default ...
2022-09-20 20:24:24.432 INFO: Evaluating slab_MD ...
2022-09-20 20:24:24.914 INFO:
+-------------+---------------------+------------------+-------------------+
| config_type | RMSE E / meV / atom | RMSE F / meV / A | relative F RMSE % |
+-------------+---------------------+------------------+-------------------+
| train | 110.3 | 211.1 | 1.49 |
| valid | 122.6 | 186.5 | 4.17 |
| Default | 103.0 | 101.5 | 3378.72 |
| slab_MD | 52.5 | 177.4 | 13.87 |
+-------------+---------------------+------------------+-------------------+
2022-09-20 20:24:24.914 INFO: Saving model to checkpoints/MACE_model_run-123.model

Without isolated atoms in my training set:
2022-09-20 20:44:35.017 INFO: Epoch 996: loss=219.9501, RMSE_E_per_atom=503.1 meV, RMSE_F=517.0 meV / A
2022-09-20 20:45:00.217 INFO: Epoch 998: loss=236.2850, RMSE_E_per_atom=521.5 meV, RMSE_F=523.2 meV / A
2022-09-20 20:45:12.731 INFO: Training complete
2022-09-20 20:45:12.732 INFO: Loading checkpoint: checkpoints/MACE_model_run-123_epoch-800.pt
2022-09-20 20:45:13.144 INFO: Loaded model from epoch 800
2022-09-20 20:45:13.144 INFO: Computing metrics for training, validation, and test sets
2022-09-20 20:45:28.456 INFO: Evaluating train ...
2022-09-20 20:45:44.218 INFO: Evaluating valid ...
2022-09-20 20:45:47.300 INFO: Evaluating Default ...
2022-09-20 20:45:49.142 INFO: Evaluating slab_MD ...
2022-09-20 20:45:49.613 INFO:
+-------------+---------------------+------------------+-------------------+
| config_type | RMSE E / meV / atom | RMSE F / meV / A | relative F RMSE % |
+-------------+---------------------+------------------+-------------------+
| train | 68.4 | 120.9 | 0.85 |
| valid | 475.9 | 281.9 | 6.30 |
| Default | 307.2 | 174.9 | 5819.96 |
| slab_MD | 44.9 | 185.4 | 14.50 |
+-------------+---------------------+------------------+-------------------+
2022-09-20 20:45:49.613 INFO: Saving model to checkpoints/MACE_model_run-123.model

These results still do not look satisfactory. Do you have any suggestions?

ilyes319 · 2022-09-21T10:56:03Z

Maintainer

Could you please send me a link to your full log file and your training file?

Vinceuwe · 2022-09-21T12:27:52Z

Author

Hi, can you provide your email?

ilyes319 · 2022-09-21T12:32:11Z

Maintainer

Yes, it is [email protected] .

gabor1 · 2023-01-16T21:13:26Z

Maintainer

This behaviour is also consistent with the situation when your energies and forces are not consistent. Where is your data from? Are you using the electronic free energy?
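
One way to test this kind of consistency is to compare the reference forces against a finite-difference derivative of the reference energies. The sketch below shows the idea only. The harmonic pair potential `toy_energy` and all function names are hypothetical stand-ins for the real reference calculator, which would have to be re-run on displaced structures.

```python
import numpy as np


def toy_energy(positions):
    """Hypothetical harmonic pair energy, a stand-in for a real calculator."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    upper = np.triu_indices(len(positions), k=1)  # each pair counted once
    return 0.5 * np.sum((dist[upper] - 1.0) ** 2)


def toy_forces(positions):
    """Analytic forces F = -dE/dr for the toy energy above."""
    forces = np.zeros_like(positions)
    for i in range(len(positions)):
        for j in range(len(positions)):
            if i == j:
                continue
            rij = positions[i] - positions[j]
            d = np.linalg.norm(rij)
            forces[i] -= (d - 1.0) * rij / d
    return forces


def numerical_forces(energy_fn, positions, step=1e-5):
    """Central finite-difference estimate of -dE/dr."""
    forces = np.zeros_like(positions)
    for i in range(positions.shape[0]):
        for k in range(positions.shape[1]):
            plus = positions.copy()
            minus = positions.copy()
            plus[i, k] += step
            minus[i, k] -= step
            forces[i, k] = -(energy_fn(plus) - energy_fn(minus)) / (2 * step)
    return forces


positions = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.0, 0.9, 0.3]])
max_dev = np.max(np.abs(toy_forces(positions)
                        - numerical_forces(toy_energy, positions)))
print(max_dev < 1e-6)  # True when energies and forces are consistent
```

If the same comparison on real reference data shows a large deviation, the stored energies and forces are not derivatives of one another, and a model trained on both cannot fit them simultaneously.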
