
Trainer framework #178

Closed
wants to merge 24 commits into from

Conversation


@ipcamit ipcamit commented May 23, 2024

I don't expect this PR to be immediately merged, but I think I can use a fresh pair of eyes on the trainer. It now has three trainers: a KIM trainer, a DNN trainer for descriptors, and a GNN trainer using torch lightning. At this point I can really use comments and recommendations.

@mjwen mjwen self-requested a review May 28, 2024 03:51
Collaborator

@mjwen mjwen left a comment


Looks great! Two major comments:

  1. Can we merge torch_trainer.py and lightning_trainer.py, to use lightning functions and avoid reinventing the wheel?

  2. For models, we support three modes, kim, tar, and graph. If I understand correctly, tar is just a local tarball of the kim model. So can we merge these two?

Detailed comments on these are provided in place.

kliff/dataset/dataset.py (three resolved review threads)
```python
# list of model drivers that are not supported by this trainer.
# example quip, torchml, etc.
# TODO: Get the complete list of unsupported model drivers.
UNSUPPORTED_MODEL_DRIVERS = [
```
Collaborator


Would it be more appropriate to use SUPPORTED_MODEL_DRIVERS to list the ones that the trainer supports? This might be easier to manage in the long run.

Author

@ipcamit ipcamit Jun 2, 2024


Let me try; the supported list might be too long, though. The idea was that in the longer run (perhaps the next 6 months or so) KLIFF will have a trainer for every possible model driver. For SNAP/GAP etc. the trainer will just use external dependencies like quip. All in all, KLIFF will provide the means to train every model that can be trained, and have a portable model driver. So the unsupported list might be shorter and more easily maintainable.
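For illustration, the check under discussion could look like the following sketch. The list contents and the helper name `is_driver_supported` are assumptions for this example, not the actual KLIFF API:

```python
# Hypothetical sketch of the unsupported-driver check being discussed.
# The list entries and is_driver_supported are illustrative only.
UNSUPPORTED_MODEL_DRIVERS = [
    "quip",     # SNAP/GAP-style models need external QUIP tooling
    "torchml",  # handled by the dedicated ML trainers instead
]

def is_driver_supported(driver_name: str) -> bool:
    """Return True if this trainer can handle the given KIM model driver."""
    return driver_name.lower() not in UNSUPPORTED_MODEL_DRIVERS
```

With a short unsupported list, maintenance reduces to appending a name when a new driver gets its own dedicated trainer.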

Example `param_manifest`:
```yaml
parameter:
- A # dict means the parameter is transformed
```
Collaborator


What does it mean that a dict means the parameter is transformed? Is this for sigma below?

"""
Compute the loss function for the given parameters. It will compute the loss
function. It seems like MPI might be only way to make it parallel as the
multiprocessing does not work with the KIM models. KIMPY models are not yet
Collaborator


The current version does support parallelism via multiprocessing. If I remember correctly, it gets around pickling the entire KIM model object.

But I agree -- let's focus on MPI in the future to avoid complicating it.

Author


That is interesting; I could not find any obvious way of pickling Pybind11 C++ bindings using pickle. Can you provide more information on it? I don't think I saw it in KIM. Just for personal interest.

kliff/trainer/kim_trainer.py (resolved review thread)
```python
        return MSE_residuals

    def save_kim_model(self):
        if self.export_manifest["model_type"].lower() == "kim":
```
Collaborator


To simplify, I don't think we need to distinguish kim and tar, because it seems the only difference is that the latter is the tarred version of the former. This requires changes to some other files as well.

I think we can generally call this a kim model. We have the option to tar it or not here. But when reading the kim model (in another place), we can use a utility function to check what is provided and then proceed accordingly. To be more specific, I can think of three modes:

  1. If a simple model name like SW_xxxxxx is provided, this will be regarded as the user wanting to use a model in a kim collection (e.g. user);
  2. If a path like ./SW_xxxxxx is provided, the local model at the corresponding path will be installed and used;
  3. If a tar path like ./SW_xxxxxx.tgz is provided, we untar and then install.

For the three modes, we can internally use a utility function to check whether it is a name, a path to a directory, or a path to a tarball.
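The proposed utility could be sketched as below. The function name `resolve_kim_model_source` and the returned labels are assumptions for illustration, not the actual KLIFF API:

```python
# Hypothetical sketch of the model-source resolution utility proposed above.
from pathlib import Path

def resolve_kim_model_source(model: str) -> str:
    """Classify a model specifier as a collection name, a directory, or a tarball."""
    path = Path(model)
    if path.suffix in (".tgz", ".tar", ".gz", ".txz"):
        return "tarball"          # e.g. ./SW_xxxxxx.tgz -> untar, then install
    if path.is_dir():
        return "directory"        # e.g. ./SW_xxxxxx -> install the local model
    return "collection-name"      # e.g. SW_xxxxxx -> look up in a kim collection
```

This keeps the user-facing manifest to a single "kim" mode while the trainer decides internally how to obtain the model.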

Author


It is a good point. The reason for tar was that KIM-Kit, which is supposed to be a Python package for managing KIM-API installations, tracking model history, etc., needs tarballs of models. I will remove it in the next iteration after discussing with Brendon.

kliff/trainer/torch_trainer.py (resolved review thread)
@@ -0,0 +1,366 @@
import os
Collaborator


Is there any major reason we want to implement torch_trainer.py? Can it be merged with the lightning trainer to simplify things?

Author


We can, but the real reason behind torch_trainer is to provide access to descriptor-based models, and a more flexible trainer for further custom training. The Lightning trainer, which most users will use, is limited to GNNs because 1) they are the clear favorite and most sought after, and 2) they can be trained at a scale where Lightning makes a difference.

@ipcamit
Author

ipcamit commented May 29, 2024

I will go through the comments in the coming week as I get some time (currently working on making the Mac package for libdescriptor so that all tests can pass). But regarding your major questions:

Can we merge torch_trainer.py and lightning_trainer.py, to use lightning functions and avoid reinventing the wheel?

The reason for two separate trainers is that Lightning can get a bit rigid at times, and there might be more complicated workflows. For example, EMA updates of weights need explicitly uploading them to the GPU, as Lightning did not have support for this yet. In such cases having a pure torch trainer might be useful. For large models, Lightning is essential, but for smaller, more innovative research efforts, like say TorchMD-Net etc., a torch-based trainer will be better suited.
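As a minimal sketch of the kind of custom step mentioned (an EMA weight update), with plain floats standing in for tensors; `ema_update` is an illustrative name, not a KLIFF or Lightning API:

```python
# Minimal pure-Python sketch of an exponential-moving-average (EMA) weight
# update. In a real torch trainer the weights would be tensors kept on the
# training device, which is where Lightning's abstractions get in the way.
def ema_update(ema_weights, model_weights, decay=0.999):
    """Blend the current model weights into the EMA copy, in place."""
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w
    return ema_weights

# Example: after one update the EMA moves slightly toward the new weights.
ema = {"w": 0.0}
ema_update(ema, {"w": 1.0}, decay=0.9)  # ema["w"] is now approximately 0.1
```

A hand-rolled loop like this is trivial in a plain torch training loop but requires callbacks or plugin hooks in Lightning, which is the flexibility argument being made here.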

For models, we support three modes, kim, tar, and graph. If I understand correctly, tar is just a local tarball of the kim model. So can we merge these two?

Yes, tar is just a KIM model that is tarballed. It can also be an ML model, in which case we pluck out the TorchScript file and use that as the model. So in that sense tar = KIM physics + KIM ML models. But you are right; I think I can merge them so that tarballs are treated as ML or KIM models based on the files inside.
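The "based on the files inside" classification could be sketched like this; `classify_model_tarball` and the file-extension heuristic are assumptions for illustration, not the actual KLIFF implementation:

```python
# Hypothetical sketch: decide whether a model tarball is an ML model or a
# plain KIM physics model by inspecting its member file names.
import tarfile

def classify_model_tarball(path: str) -> str:
    """Return "ml" if the tarball ships a TorchScript file, else "kim"."""
    with tarfile.open(path) as tar:
        names = tar.getnames()
    if any(n.endswith((".pt", ".pth")) for n in names):
        return "ml"   # pluck out the TorchScript file and use it as the model
    return "kim"      # plain KIM physics model: install via the KIM API
```

This lets a single tar mode serve both cases without the user declaring which kind of model the archive contains.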

@ipcamit
Author

ipcamit commented Jun 5, 2024

I have added some changes as requested. Some are pending for the next iteration (along with additions to torch lightning). I would like to discuss any changes regarding the API and design; once that is finalized, I can start cleaning up the code and making it more consistent.


codecov bot commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 9.75855% with 897 lines in your changes missing coverage. Please review.

Project coverage is 53.93%. Comparing base (9ee3ce8) to head (b7a157f).

| Files | Patch % | Lines |
| --- | --- | --- |
| kliff/trainer/base_trainer.py | 0.00% | 263 Missing ⚠️ |
| kliff/trainer/torch_trainer.py | 0.00% | 193 Missing ⚠️ |
| kliff/trainer/lightning_trainer.py | 0.00% | 173 Missing ⚠️ |
| kliff/models/kim.py | 13.26% | 85 Missing ⚠️ |
| kliff/trainer/kim_trainer.py | 0.00% | 79 Missing ⚠️ |
| kliff/dataset/dataset.py | 41.12% | 63 Missing ⚠️ |
| kliff/utils.py | 22.22% | 14 Missing ⚠️ |
| .../configuration_transforms/graphs/generate_graph.py | 56.52% | 10 Missing ⚠️ |
| kliff/dataset/weight.py | 64.28% | 5 Missing ⚠️ |
| kliff/trainer/kim_residuals.py | 0.00% | 5 Missing ⚠️ |

... and 3 more
Additional details and impacted files
```
@@            Coverage Diff             @@
##               v1     #178      +/-   ##
==========================================
- Coverage   63.14%   53.93%   -9.21%
==========================================
  Files          50       56       +6
  Lines        4718     5699     +981
==========================================
+ Hits         2979     3074      +95
- Misses       1739     2625     +886
```


@ipcamit
Author

ipcamit commented Jun 18, 2024

Added a fresh PR as discussed, containing only the lightning trainer for now. Will add KIM next. Closing this PR.

@ipcamit ipcamit closed this Jun 18, 2024
2 participants