Trainer framework #178
…r stress in training loss
Looks great! Two major comments:
- Can we merge `torch_trainer.py` and `lightning_trainer.py`, so as to use lightning functions and avoid reinventing the wheel?
- For models, we support three modes, `kim`, `tar`, and `graph`. If I understand correctly, `tar` is just a local tarball of the `kim` mode. So can we merge these two?

Detailed comments on these are provided inline.
# list of model drivers that are not supported by this trainer.
# example quip, torchml, etc.
# TODO: Get the complete list of unsupported model drivers.
UNSUPPORTED_MODEL_DRIVERS = [
Would it be more appropriate to use `SUPPORTED_MODEL_DRIVERS` to list the ones that the trainer supports? This might be easier to manage in the long run.
Let me try; the supported list might be too long. The idea was that in the longer run (perhaps in the next 6 months or so) KLIFF will have a trainer for every model driver possible. For SNAP/GAP etc. the trainer will just use external dependencies like quip. But all in all, KLIFF will provide the means to train all models that can be trained and have a portable model driver. So the unsupported list might be shorter and more easily maintainable.
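To illustrate the denylist approach being discussed, here is a minimal sketch of how a trainer could reject unsupported drivers up front; the entry names and the helper function are hypothetical, not the actual KLIFF code:

```python
# Hypothetical sketch of the denylist approach; entry names are illustrative only.
UNSUPPORTED_MODEL_DRIVERS = [
    "quip",     # e.g. GAP models trained via external quip tooling
    "torchml",  # placeholder entry, for illustration
]


def assert_driver_supported(model_name: str, driver_name: str) -> None:
    """Fail fast if the model's driver is on the denylist."""
    if driver_name.lower() in UNSUPPORTED_MODEL_DRIVERS:
        raise ValueError(
            f"Model {model_name!r} uses driver {driver_name!r}, "
            "which this trainer does not support."
        )
```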
Example `param_manifest`:
```yaml
parameter:
    - A # dict means the parameter is transformed
```
What does it mean by "dict means the parameter is transformed"? Is this for sigma below?
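Purely as a guess at the distinction being asked about, a bare entry versus a dict entry might look something like the following; the keys shown are hypothetical and not taken from the actual manifest schema:

```yaml
parameter:
  - A                       # bare name: parameter optimized as-is
  - sigma:                  # dict entry: parameter with a transform attached
      transform_name: log   # hypothetical key, for illustration only
      bounds: [[0.1, 5.0]]  # hypothetical key, for illustration only
```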
""" | ||
Compute the loss function for the given parameters. It will compute the loss | ||
function. It seems like MPI might be only way to make it parallel as the | ||
multiprocessing does not work with the KIM models. KIMPY models are not yet |
The current version does support parallelism via multiprocessing. If I remember correctly, it gets around to pickling the entire KIM model object.
But I agree -- let's focus on MPI in the future to avoid complicating it.
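For reference, a minimal mpi4py sketch of the kind of MPI parallelism being suggested: each rank evaluates residuals for its share of configurations and the partial losses are summed across ranks. The `compute_residuals` callable and the configuration list are placeholders, not the kliff API:

```python
from mpi4py import MPI


def parallel_loss(parameters, configurations, compute_residuals):
    """Sum squared residuals over configurations, with work split across MPI ranks."""
    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each rank handles every `size`-th configuration.
    local_loss = 0.0
    for config in configurations[rank::size]:
        residuals = compute_residuals(parameters, config)  # placeholder callable
        local_loss += float(sum(r * r for r in residuals))

    # Combine the partial losses so every rank sees the total.
    return comm.allreduce(local_loss, op=MPI.SUM)
```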
That is interesting; I could not find any obvious way of pickling Pybind11 C++ bindings using pickle. Can you provide more information on it? I don't think I saw it in KIM. Just for personal interest.
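One common workaround when an extension object cannot be pickled (shown only as a generic sketch, not as what kliff does) is to build the KIM model inside each worker via a pool initializer, so the model object itself never crosses process boundaries. `create_kim_model` and `compute_energy` below are hypothetical placeholders:

```python
import multiprocessing as mp

_worker_model = None  # one model instance per worker process


def _init_worker(model_name):
    """Construct the (un-picklable) KIM model inside the worker itself."""
    global _worker_model
    _worker_model = create_kim_model(model_name)  # hypothetical factory


def _evaluate(config):
    # Only the configuration is sent between processes, never the model.
    return _worker_model.compute_energy(config)  # hypothetical method


def parallel_energies(model_name, configurations, nprocs=4):
    with mp.Pool(nprocs, initializer=_init_worker, initargs=(model_name,)) as pool:
        return pool.map(_evaluate, configurations)
```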
        return MSE_residuals

    def save_kim_model(self):
        if self.export_manifest["model_type"].lower() == "kim":
To simplify it, I don't think we need to distinguish `kim` and `tar`, because it seems the only difference is that the latter is the tarred version of the former. This requires changes to some other files as well.

I think we can generally call this a `kim` model. We have the option to tar it or not here. But when reading the kim model (in another place), we can use a utility function to check what is provided and then proceed accordingly. To be more specific, I can think of three modes:

- If a simple model name like `SW_xxxxxx` is provided, this will be regarded as the user wanting to use a model from a kim collection (e.g. user);
- If a path like `./SW_xxxxxx` is provided, then the local model at the corresponding path will be installed and used;
- If a tar path like `./SW_xxxxxx.tgz` is provided, we untar and then install.

For the three modes, we can internally use a utility function to check whether it is a name, a path to a directory, or a path to a tarball (see the sketch below).
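A minimal sketch of the kind of utility function being proposed here; the function name and return values are illustrative, not an existing KLIFF API:

```python
import tarfile
from pathlib import Path


def classify_model_source(model: str) -> str:
    """Decide whether `model` is a KIM collection name, a local directory, or a tarball."""
    path = Path(model)
    if path.is_dir():
        return "path"     # e.g. ./SW_xxxxxx: install the local model and use it
    if path.is_file() and tarfile.is_tarfile(path):
        return "tarball"  # e.g. ./SW_xxxxxx.tgz: untar, install, then use
    return "name"         # e.g. SW_xxxxxx: resolve from a kim collection (e.g. user)
```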
It is a good point. The reason for tar was that KIM-Kit, which is supposed to be a Python package for managing KIM-API installations, tracking model history, etc., needs tarballs of models. I will remove it in the next iteration after discussing with Brendon once.
@@ -0,0 +1,366 @@
import os
Any major reason we want to implement `torch_trainer.py`? Can it be merged with the lightning trainer to simplify things?
We can, but the real reason behind `torch_trainer` is to provide access to descriptor-based models and a more flexible trainer for further custom training. The Lightning trainer, which most will use, is limited to GNNs because 1) they are the clear favorite and most sought after, and 2) they can be trained at a scale where Lightning makes a difference.
I will go through the comments in the coming week as I get some time (currently working on making the Mac package for libdescriptor so all tests can pass). But regarding your major questions:
The reason for two separate trainers is that Lightning can get a bit rigid at times, and there might be more complicated workflows. For example, EMA updates of weights need explicitly uploading them to the GPU, as Lightning did not have support for it yet. In such cases, having a pure torch trainer might be useful. For large models, Lightning is essential, but for smaller, more innovative research problems, like say TorchMD-Net etc., a torch-based trainer will be better to have on the side.
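For context, the EMA bookkeeping mentioned above usually follows a pattern like this generic PyTorch sketch (not kliff code): keep a shadow copy of the weights on the same device as the model and blend it toward the live weights after each optimizer step:

```python
import copy

import torch


class EMAWeights:
    """Maintain an exponential moving average of a model's parameters."""

    def __init__(self, model: torch.nn.Module, decay: float = 0.999):
        self.decay = decay
        # The shadow copy must live on the same device as the model (e.g. the GPU).
        self.shadow = copy.deepcopy(model).eval()
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        """Call after each optimizer step: shadow = decay * shadow + (1 - decay) * live."""
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)
```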
Yes, tar is just a KIM model packaged as a tarball. It can also be an ML model, in which case we will pluck out the TorchScript file and use that as the model. So in that sense tar = KIM physics + KIM ML models. But you are right, I think I can merge them such that tarballs are treated as ML or KIM models based on the files inside.
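A rough sketch of how tarballs could be treated as either an ML (TorchScript) model or a KIM physics model based on the files inside; the file conventions assumed here are guesses, not taken from the PR:

```python
import tarfile


def model_type_from_tarball(tar_path: str) -> str:
    """Guess whether a model tarball holds a TorchScript ML model or a KIM physics model."""
    with tarfile.open(tar_path) as tar:  # handles .tar, .tar.gz, .tgz transparently
        names = tar.getnames()
    if any(name.endswith(".pt") for name in names):
        return "ml"   # assumed convention: a TorchScript file marks an ML model
    return "kim"      # otherwise treat it as a portable KIM physics model
```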
I have added some changes as requested. Some are pending for the next iteration (along with additions to torch lightning). I would like to discuss any changes regarding the API and design; once that is finalized, I can start cleaning up the code and making it more consistent.
…ckpt resume capabilities + docstrings
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@            Coverage Diff            @@
##               v1     #178      +/-   ##
==========================================
- Coverage   63.14%   53.93%    -9.21%
==========================================
  Files          50       56        +6
  Lines        4718     5699      +981
==========================================
+ Hits         2979     3074       +95
- Misses       1739     2625      +886

☔ View full report in Codecov by Sentry.
Added a fresh PR as discussed, only containing the lightning trainer for now. Will add KIM next. Removing this PR.
I don't expect this PR to be immediately merged, but I think I can use a fresh pair of eyes on the trainer. It now has three trainers: a KIM trainer, a DNN trainer for descriptors, and a GNN trainer using torch lightning. I think at this point I can really use comments and recommendations.