
Trainer framework #178

Closed
wants to merge 24 commits into from

Conversation


@ipcamit ipcamit commented May 23, 2024

I don't expect this PR to be immediately merged, but I think I can use a fresh pair of eyes on the trainer. It now has three trainers: a KIM trainer, a DNN trainer for descriptors, and a GNN trainer using torch lightning. At this point I can really use comments and recommendations.

@mjwen mjwen self-requested a review May 28, 2024 03:51
Collaborator

@mjwen mjwen left a comment


Looks great! Two major comments:

  1. Can we merge torch_trainer.py and lightning_trainer.py, to use lightning functions and avoid reinventing the wheel?

  2. For models, we support three modes, kim, tar, and graph. If I understand correctly, tar is just a local tarball of the kim model. So can we merge these two?

Detailed comments on these are provided in place.

kliff/dataset/dataset.py (three resolved review threads)
```python
# list of model drivers that are not supported by this trainer.
# example quip, torchml, etc.
# TODO: Get the complete list of unsupported model drivers.
UNSUPPORTED_MODEL_DRIVERS = [
```
Collaborator


Would it be more appropriate to use SUPPORTED_MODEL_DRIVERS to list the ones that the trainer supports? This might be easier to manage in the long run.

Author

@ipcamit ipcamit Jun 2, 2024


Let me try; the supported list might be too long, though. The idea was that in the longer run (perhaps the next 6 months or so) KLIFF will have a trainer for every possible model driver. For SNAP/GAP etc. the trainer will just use external dependencies like quip. All in all, KLIFF will provide the means to train every model that can be trained, and have a portable model driver. So the unsupported list might be shorter and more easily maintainable.
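For illustration, the check under discussion could look like the following sketch. The list contents and the helper name `is_driver_supported` are assumptions for this example, not the actual KLIFF API:

```python
# Hypothetical sketch of the unsupported-driver check being discussed.
# The list entries and is_driver_supported are illustrative only.
UNSUPPORTED_MODEL_DRIVERS = [
    "quip",     # SNAP/GAP-style models need external QUIP tooling
    "torchml",  # handled by the dedicated ML trainers instead
]

def is_driver_supported(driver_name: str) -> bool:
    """Return True if this trainer can handle the given KIM model driver."""
    return driver_name.lower() not in UNSUPPORTED_MODEL_DRIVERS
```

With a short unsupported list, maintenance reduces to appending a name when a new driver gets its own dedicated trainer.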

Example `param_manifest`:
```yaml
parameter:
- A # dict means the parameter is transformed
```
Collaborator


What does it mean that a dict means the parameter is transformed? Is this for sigma below?

"""
Compute the loss function for the given parameters. It will compute the loss
function. It seems like MPI might be only way to make it parallel as the
multiprocessing does not work with the KIM models. KIMPY models are not yet
Collaborator


The current version does support parallelism via multiprocessing. If I remember correctly, it gets around pickling the entire KIM model object.

But I agree -- let's focus on MPI in the future to avoid complicating it.

Author


That is interesting; I could not find any obvious way of pickling Pybind11 C++ bindings using pickle. Can you provide more information on it? I don't think I saw it in KIM. Just for personal interest.

kliff/trainer/kim_trainer.py (resolved review thread)
```python
        return MSE_residuals

    def save_kim_model(self):
        if self.export_manifest["model_type"].lower() == "kim":
```
Collaborator


To simplify, I don't think we need to distinguish kim and tar, because it seems the only difference is that the latter is the tarred version of the former. This requires changes to some other files as well.

I think we can generally call this a kim model. We have the option to tar it or not here. But when reading the kim model (in another place), we can use a utility function to check what is provided and then proceed accordingly. To be more specific, I can think of three modes:

  1. If a simple model name like SW_xxxxxx is provided, this will be regarded as the user wanting to use a model in a kim collection (e.g. user);
  2. If a path like ./SW_xxxxxx is provided, the local model at the corresponding path will be installed and used;
  3. If a tar path like ./SW_xxxxxx.tgz is provided, we untar and then install.

For the three modes, we can internally use a utility function to check whether it is a name, a path to a directory, or a path to a tarball.
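The proposed utility could be sketched as below. The function name `resolve_kim_model_source` and the returned labels are assumptions for illustration, not the actual KLIFF API:

```python
# Hypothetical sketch of the model-source resolution utility proposed above.
from pathlib import Path

def resolve_kim_model_source(model: str) -> str:
    """Classify a model specifier as a collection name, a directory, or a tarball."""
    path = Path(model)
    if path.suffix in (".tgz", ".tar", ".gz", ".txz"):
        return "tarball"          # e.g. ./SW_xxxxxx.tgz -> untar, then install
    if path.is_dir():
        return "directory"        # e.g. ./SW_xxxxxx -> install the local model
    return "collection-name"      # e.g. SW_xxxxxx -> look up in a kim collection
```

This keeps the user-facing manifest to a single "kim" mode while the trainer decides internally how to obtain the model.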

Author


It is a good point. The reason for tar was that KIM-Kit, which is supposed to be a Python package for managing KIM-API installations, tracking model history, etc., needs tarballs of models. I will remove it in the next iteration after discussing with Brendon.

kliff/trainer/torch_trainer.py (resolved review thread)
@@ -0,0 +1,366 @@
import os
Collaborator


Is there any major reason we want to implement torch_trainer.py? Can it be merged with the lightning trainer to simplify things?

Author


We can, but the real reason behind torch_trainer is to provide access to descriptor-based models, and a more flexible trainer for further custom training. The Lightning trainer, which most users will use, is limited to GNNs because 1) they are the clear favorite and most sought after, and 2) they can be trained at a scale where Lightning makes a difference.

@ipcamit
Author

ipcamit commented May 29, 2024

I will go through the comments in the coming week as I get some time (currently working on making the Mac package for libdescriptor so that all tests can pass). But regarding your major questions:

Can we merge torch_trainer.py and lightning_trainer.py, to use lightning functions and avoid reinventing the wheel?

The reason for two separate trainers is that Lightning can get a bit rigid at times, and there might be more complicated workflows. For example, EMA updates of weights need explicitly uploading them to the GPU, as Lightning did not have support for this yet. In such cases having a pure torch trainer might be useful. For large models, Lightning is essential, but for smaller, more innovative research efforts, like say TorchMD-Net etc., a torch-based trainer will be better suited.
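As a minimal sketch of the kind of custom step mentioned (an EMA weight update), with plain floats standing in for tensors; `ema_update` is an illustrative name, not a KLIFF or Lightning API:

```python
# Minimal pure-Python sketch of an exponential-moving-average (EMA) weight
# update. In a real torch trainer the weights would be tensors kept on the
# training device, which is where Lightning's abstractions get in the way.
def ema_update(ema_weights, model_weights, decay=0.999):
    """Blend the current model weights into the EMA copy, in place."""
    for name, w in model_weights.items():
        ema_weights[name] = decay * ema_weights[name] + (1.0 - decay) * w
    return ema_weights

# Example: after one update the EMA moves slightly toward the new weights.
ema = {"w": 0.0}
ema_update(ema, {"w": 1.0}, decay=0.9)  # ema["w"] is now approximately 0.1
```

A hand-rolled loop like this is trivial in a plain torch training loop but requires callbacks or plugin hooks in Lightning, which is the flexibility argument being made here.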

For models, we support three modes, kim, tar, and graph. If I understand correctly, tar is just a local tarball of the kim model. So can we merge these two?

Yes, tar is just a KIM model that is tarballed. It can also be an ML model, in which case we pluck out the TorchScript file and use that as the model. So in that sense tar = KIM physics + KIM ML models. But you are right; I think I can merge them so that tarballs are treated as ML or KIM models based on the files inside.
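The "based on the files inside" classification could be sketched like this; `classify_model_tarball` and the file-extension heuristic are assumptions for illustration, not the actual KLIFF implementation:

```python
# Hypothetical sketch: decide whether a model tarball is an ML model or a
# plain KIM physics model by inspecting its member file names.
import tarfile

def classify_model_tarball(path: str) -> str:
    """Return "ml" if the tarball ships a TorchScript file, else "kim"."""
    with tarfile.open(path) as tar:
        names = tar.getnames()
    if any(n.endswith((".pt", ".pth")) for n in names):
        return "ml"   # pluck out the TorchScript file and use it as the model
    return "kim"      # plain KIM physics model: install via the KIM API
```

This lets a single tar mode serve both cases without the user declaring which kind of model the archive contains.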

@ipcamit
Author

ipcamit commented Jun 5, 2024

I have added some changes as requested. Some are pending for the next iteration (along with additions to torch lightning). I would like to discuss any changes regarding the API and design; once that is finalized, I can start cleaning up the code and making it more consistent.


codecov bot commented Jun 10, 2024

Codecov Report

Attention: Patch coverage is 9.75855% with 897 lines in your changes missing coverage. Please review.

Project coverage is 53.93%. Comparing base (9ee3ce8) to head (b7a157f).

| Files | Patch % | Lines |
| --- | --- | --- |
| kliff/trainer/base_trainer.py | 0.00% | 263 Missing ⚠️ |
| kliff/trainer/torch_trainer.py | 0.00% | 193 Missing ⚠️ |
| kliff/trainer/lightning_trainer.py | 0.00% | 173 Missing ⚠️ |
| kliff/models/kim.py | 13.26% | 85 Missing ⚠️ |
| kliff/trainer/kim_trainer.py | 0.00% | 79 Missing ⚠️ |
| kliff/dataset/dataset.py | 41.12% | 63 Missing ⚠️ |
| kliff/utils.py | 22.22% | 14 Missing ⚠️ |
| .../configuration_transforms/graphs/generate_graph.py | 56.52% | 10 Missing ⚠️ |
| kliff/dataset/weight.py | 64.28% | 5 Missing ⚠️ |
| kliff/trainer/kim_residuals.py | 0.00% | 5 Missing ⚠️ |

... and 3 more
Additional details and impacted files
```
@@            Coverage Diff             @@
##               v1     #178      +/-   ##
==========================================
- Coverage   63.14%   53.93%   -9.21%
==========================================
  Files          50       56       +6
  Lines        4718     5699     +981
==========================================
+ Hits         2979     3074      +95
- Misses       1739     2625     +886
```


@ipcamit
Author

ipcamit commented Jun 18, 2024

Added a fresh PR as discussed, containing only the lightning trainer for now. Will add KIM next. Closing this PR.

@ipcamit ipcamit closed this Jun 18, 2024
2 participants