-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Turn ParallelAnalysisBase into dask custom collection #136
base: master
Are you sure you want to change the base?
Conversation
Hello @yuxuanzhuang! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:
Comment last updated at 2020-08-20 10:39:58 UTC |
This PR really depends on PR #132 so we should look at that one first. Then you can rebase this one and it will become much cleaner. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks like a really interesting way to move forward. This, together with the notebook, is a good study for how the next version of PMDA could look like.
I have a bunch of initial questions/comments inline.
Also note that we would first need to merge PR #132 before really moving forward here.
We would also need to remove Python 2 as soon as we become dependent on MDA 2.0.0 (but that's for PR #132).
Tests will obviously be needed...
return self._keys | ||
|
||
# it uses multiprocessing scheduler in default | ||
__dask_scheduler__ = staticmethod(dask.multiprocessing.get) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though multiprocessing is the default scheduler, one can still use distributed, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, it can either be a global dask config, a context manager, or an arg in self.compute()
. (https://docs.dask.org/en/latest/scheduler-overview.html#configuring-the-schedulers)
pmda/parallel.py
Outdated
np.array([el[5] for el in res])) | ||
|
||
# this is crucial if the analysis does not iterate over | ||
# the whole trajectory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this crucial? What would happen? Add more comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
discussed here
https://github.com/MDAnalysis/pmda/pull/132/files#r455247843
|
||
def __dask_postpersist__(self): | ||
# we don't need persist implementation. | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will it not be possible to persist
?
Presumably, that would have been possible previously if we had chosen persist
in run
instead of compute()
.
pmda/parallel.py
Outdated
times_io), np.sum(times_compute) | ||
|
||
@staticmethod | ||
def _reduce(res, result_single_frame): | ||
""" 'append' action for a time series""" | ||
res.append(result_single_frame) | ||
return res | ||
|
||
def __getstate__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does DaskMixin require the whole class to be picklable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes...I mean in the old implementation, the whole class has to be picklable as well.
FYI, the code here is not needed anymore after MDAnalysis/mdanalysis#2893 is merged
pmda/parallel.py
Outdated
@@ -284,6 +281,69 @@ def _single_frame(self, ts, atomgroups): | |||
""" | |||
raise NotImplementedError | |||
|
|||
def prepare_jobs(self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prepare_jobs sounds confusing to me – what "jobs"? If it's part of the documented workflow then it could just be prepare.
prepare_dask would be more explicit but also a bit pointless because PMDA is fully intertwined with dask so that's the only thing we would be preparing for. create_dask_graph is too long and really talks to much about implementation details.
All in all, I'd just call it prepare and add more docs stating clearly what is being prepared and under which circumstances a user needs to run it.
Fixes #135
Note the only file changes from #132 is
parallel.py
You can read https://github.com/yuxuanzhuang/pmda/pull/1/files to get the actual changes.Changes made in this Pull Request:
PR Checklist