WIP: Refactor of subsampling module #98
base: master
Conversation
All three of our subsampling functions should now be able to consume any DataFrame in our two standard `u_nk` and `dHdl` forms and produce meaningful results. Each of these functions now performs a `groupby` with all non-time indexes, and separately performs its operation on each group (see the sketch below).

Some notes on the implementation:

1. The `series` kwarg is now the `column` arg for `statistical_inefficiency` and `equilibrium_detection`. A column name is now what we use, as requiring a column of the DataFrame offers guarantees around index matching, ordering, etc. that we previously had to do backflips to provide. It is still possible to use another series as the basis for either operation, but it must be grafted onto the DataFrame as a column first.
2. Since `column` is an arg, it is now required for the above functions. Therefore, those functions no longer behave like `slicing` when `column` isn't specified. My logic for breaking this is to favor explicitness and to avoid surprising users with behavior that throws no warnings but may silently not be doing what they want.
3. I've added some kwargs to `force` ignoring exceptions. The only exception these functions now throw is one due to duplicate time values in the groupings, as this tends to indicate that two or more timeseries have been dumped together, which is unlikely to be what one wants with subsampling functions that assume correlation to work as intended.

Some other nice things:

1. I will be adding the minimal assumptions the standard forms make to their doc page as part of this change's PR. One of the assumptions I want to propagate is that the outermost index is *always* time, or at least something implying time (frame, sample, whatever). It needn't be called 'time', however, for all the tooling to work.
2. These subsamplers now explicitly call out that they *can't be expected* to preserve any custom ordering of rows. They will always perform sorting of some kind, so the time index values are meaningful.
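To make the per-group strategy concrete, here is a minimal sketch (assumed names, not the PR's actual code) of the groupby-then-subsample pattern described above, assuming a DataFrame whose outermost index level is time and whose remaining levels are the lambda values:

```python
# Minimal sketch of the groupby-per-group strategy described above
# (hypothetical helper; not alchemlyb's implementation).
import pandas as pd

def subsample_per_group(df, pick):
    """Apply `pick` (a function mapping a group to its subsampled rows)
    to each group formed by all non-time index levels, then recombine."""
    non_time = list(df.index.names[1:])  # every level except the time level
    pieces = [pick(group) for _, group in df.groupby(level=non_time)]
    return pd.concat(pieces)
```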
Codecov Report

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
- Coverage   97.26%   94.71%   -2.55%
==========================================
  Files          12       12
  Lines         695      776      +81
  Branches      141      158      +17
==========================================
+ Hits          676      735      +59
- Misses          5       18      +13
- Partials       14       23       +9
```

View full report in Codecov by Sentry.
This gist demonstrates the behavior of the refactored functions.
Not getting all tests passing yet, and I'm not sure I'm happy with how these work. There are clear shortcomings with the static column approach, perhaps requiring manual effort on the part of the user to work around. This is counter to alchemlyb's philosophy of making the 'most right' thing easy to do.
In the spirit of wanting to make it easy for users to do the 'most right' thing, I have looked at the details of what `alchemical-analysis` currently does for subsampling with `statisticalInefficiency`, and it uses an approach similar to the one now documented in each of our `statistical_inefficiency` and `equilibrium_detection` functions. Next steps are to *actually* implement this functionality. We are going to make our lives easier by adding an `alchemform` attribute to the DataFrames our parsers produce, and add this to the standard form spec. This attribute will have 'dHdl' or 'u_nk' as its value, depending on the form of data we're dealing with. Our estimators can also use this as a check or warning.
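A sketch of what this tagging could look like (`alchemform` is the attribute name from this comment; the downstream check shown is hypothetical):

```python
# Hypothetical sketch of the proposed `alchemform` tagging via DataFrame.attrs.
import pandas as pd

u_nk = pd.DataFrame({"fep-lambda": [0.0, 0.1]})  # stand-in for parser output
u_nk.attrs["alchemform"] = "u_nk"                # would be set by the parser

# An estimator or preprocessor could then check the form explicitly:
if u_nk.attrs.get("alchemform") != "u_nk":
    raise ValueError("expected data in u_nk standard form")
```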
`how` methods implemented; these still require explicit tests. We should also add logging to each of these functions, using a module-level logger. This would allow for auditability later in case downstream issues are discovered.
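A module-level logger along the lines suggested here might look like the following (a sketch; the helper name is hypothetical):

```python
# Standard module-level logger pattern; `_log_subsampling` is a hypothetical
# helper illustrating the kind of audit trail suggested above.
import logging

logger = logging.getLogger(__name__)

def _log_subsampling(n_before, n_after):
    logger.info("subsampled %d rows down to %d", n_before, n_after)
```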
About to break out the generic bits to make the same changes to equilibrium_detection.
…iciency

All tests pass, but we should think carefully about the tests we really want to add to ensure these are working as expected.
Using something like `pd.concat` on a set of DataFrames doesn't propagate forward the `.attrs` dictionary, so having an `alchemform` key here created by the parsers is not sufficient for signalling the type of data to preprocessors. We will likely remove the 'auto' option from the subsamplers and insist on explicit values for `how`. This is not *necessarily* bad, as it promotes some understanding for the user of what they are choosing to do here.
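A quick demonstration of the `.attrs` propagation issue described above (behavior at the time of this discussion; newer pandas releases may propagate `attrs` when all inputs agree):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1, 2]})
df1.attrs["alchemform"] = "u_nk"
df2 = pd.DataFrame({"x": [3, 4]})
df2.attrs["alchemform"] = "u_nk"

combined = pd.concat([df1, df2])
print(combined.attrs)  # {} -- the metadata does not survive concat here
```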
Alternatively, they can choose a `column`. I think this is appropriate, as it is still important that users are consciously aware of some of the high-level choices the software is making. There will always be touchpoints like this, though we should try to minimize them where possible.
These changes apply to tests of `statistical_inefficiency` and `equilibrium_detection`. I also set `nskip=5` for `equilibrium_detection` cases, which required some changes to test parameters. This change speeds up the tests at no real coverage cost.
The changes I've made to the codebase are ready for review! I've updated the PR description with the current state of the design, as it took a few iterations to arrive at. I would like feedback on this design as soon as possible so changes can be made to accommodate needs. This PR is also a bit of a blocker for #94 and #99, since subsampling is a key component of both. I am now working on an example notebook that will make its way into our examples docs; this will showcase how to use these functions, as well as when each makes sense to use. I don't consider this notebook a requirement for review of the implementation, but I consider it important for making sure the feel of this approach is right. I think I can turn it around and add it to this PR by the end of this week.
Hi @dotsdl, I had a quick look and overall this looks like great work. Some comments inline.
```rst
.. math::

   g \equiv (1 + 2\tau) > 1

where :math:`\tau` is the autocorrelation time of the timeseries.
```
Tau has to be dimensionless for this to make sense. Define what tau is.
Ah, good call. Tau in this usage is measured in frames, or ordered rows, of a timeseries. I will change the wording to reflect this.
```rst
The ``how`` parameter indicates the choice of observable for performing the calculation of :math:`g`.

* For :math:`u_{nk}` datasets, the choice of ``'right'`` is recommended: the column immediately to the right of the column corresponding to the group's lambda index value is used as the observable.
```
Where do these recommendations come from? Cite a paper or provide more detail.
I'd be happy to, but they come from my attempt to emulate what is already done in alchemical-analysis.py, and I don't know if a reference exists for the choices made there. @mrshirts may be able to provide some direction on this.
I do believe I understand the reason for the choices, however: in principle, any column in a timeseries would be fine to base decorrelation on, so long as the column is not uniformly zero or insensitive to time/frame for the given lambda window being sampled.
- The approach of choosing the column to the right of the lambda window being sampled for `u_nk` avoids cases where the `u_nk` of calculated lambdas are relative to the sampled one, giving all zeros for the column corresponding to the lambda actually sampled. An example of this observed:

  ```python
  from alchemtest.gmx import load_water_particle_without_energy
  from alchemlyb.parsing.gmx import extract_u_nk
  import pandas as pd

  ds = load_water_particle_without_energy()
  print(pd.concat(extract_u_nk(i, T=300) for i in ds.data['AllStates']).sort_index(level=[1, 2, 0]))
  ```

  This gives:

  ```
                                (0.0, 0.0)  ...   (1.0, 1.0)
  time   coul-lambda vdw-lambda            ...
  0.0    0.0         0.0          0.000000 ...   148.347412
  18.6   0.0         0.0          0.000000 ...  1983.809792
  37.2   0.0         0.0          0.000000 ...   205.930556
  55.8   0.0         0.0          0.000000 ...  4512.816690
  74.4   0.0         0.0          0.000000 ... 25939.040202
  ...                                  ... ...          ...
  9913.8 1.0         1.0         45.353152 ...     0.000000
  9932.4 1.0         1.0         41.207448 ...     0.000000
  9951.0 1.0         1.0         44.375182 ...     0.000000
  9969.6 1.0         1.0         45.505032 ...     0.000000
  9988.2 1.0         1.0         37.779798 ...     0.000000

  [20444 rows x 38 columns]
  ```

  Note that if we chose any old column, for one of the groupings of the `('coul-lambda', 'vdw-lambda')` indices we would end up trying to decorrelate on all zeros. The `how="right"` choice sidesteps this (as does `how="left"`).

- The approach of summing all `dHdl` columns is not as clear to me, and I wouldn't be surprised if I've misinterpreted the implementation in `alchemical-analysis.py`. It seemed vaguely like a `DH = (dH/dl_0) Dl_0 + (dH/dl_1) Dl_1 + ...`, where `Dl_0, Dl_1, ... == 1`, but the detailed motivation for this choice isn't clear to me from the code or the git logs.
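For what it's worth, that row-wise sum interpretation is a one-liner in pandas. A sketch with toy data (not the PR's code):

```python
import pandas as pd

# toy dHdl frame: one dH/dl column per lambda component
dHdl = pd.DataFrame({"coul-lambda": [0.1, 0.2, 0.3],
                     "vdw-lambda": [1.0, 0.9, 0.8]})
observable = dHdl.sum(axis=1)  # one summed dH/dl value per frame
print(observable)
```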
```rst
* For :math:`u_{nk}` datasets, the choice of ``'right'`` is recommended: the column immediately to the right of the column corresponding to the group's lambda index value is used as the observable.
* For :math:`\frac{dH}{d\lambda}` datasets, the choice of ``'sum'`` is recommended: the columns are simply summed, and the resulting :py:class:`pandas.Series` is used as the observable.

See the API documentation below on the possible values for ``how``, as well as more detailed explanations for each choice.
```
That's not good enough when you explicitly tell users what they should be doing. Be clear here why you're recommending one over the other.
Agreed, I would like some assistance with this. See my notes above.
```rst
Equilibrium Detection
---------------------

The :func:`~alchemlyb.preprocessing.subsampling.equilibrium_detection` function subsamples each timeseries in the dataset using the equilibration detection approach developed by John Chodera (see reference below).
```
instead of "reference below", cite it
I would generally avoid mentioning people by name, as it sounds too much like trying to convince readers through appeal to authority instead of facts (i.e., papers).
Understood, I will change this accordingly.
```rst
See :py:func:`pymbar.timeseries.detectEquilibration` for more details; this is used internally by this subsampler.

Please reference the following if you use this function in your research:

[1] John D. Chodera. A simple method for automated equilibration detection in molecular simulations. Journal of Chemical Theory and Computation, 12:1799, 2016.
```
Use proper reST citation syntax.
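For context, a hedged sketch of how the pymbar routine referenced in the snippet above is typically used (pymbar 3 API; the PR's internal integration may differ):

```python
import numpy as np
from pymbar import timeseries

A_t = np.random.randn(1000).cumsum()  # toy correlated timeseries

# t: first index of the equilibrated region; g: statistical inefficiency of
# that region; Neff_max: effective number of uncorrelated samples.
# nskip=5 evaluates every 5th origin (the value used to speed up tests here).
t, g, Neff_max = timeseries.detectEquilibration(A_t, nskip=5)

A_equil = A_t[t:]
indices = timeseries.subsampleCorrelatedData(A_equil, g=g)
```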
```rst
``pymbar.timeseries.statisticalInefficiency()``; previously, the statistical
inefficiency was _rounded_ (instead of ``ceil()``) and thus one could
end up with correlated data.

``pymbar.timeseries.subsampleCorrelatedData()``; previously, the
```
Did pymbar change the function name or was this a mistake?
This was a mistake in the docstring that I fixed here; the same mistake was not made in the equivalent line for `equilibrium_detection`.
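To make the rounding point above concrete, a small sketch with toy data (not the docstring's code): taking `ceil(g)` as the subsampling stride guarantees consecutive kept samples are at least `g` frames apart, whereas rounding `g` down can leave residual correlation.

```python
import math
import numpy as np
from pymbar import timeseries

A_t = np.random.randn(1000).cumsum()            # toy correlated timeseries
g = timeseries.statisticalInefficiency(A_t)     # may be non-integer

stride = math.ceil(g)   # conservative: samples never closer than g frames
subsampled = A_t[::stride]
```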
```python
return indices, calculated

return _decorrelator(
```
I like the more modular approach here.
Thank you!
```
is used. If there is no column to the left, then the column to the
right is used. If there is no column corresponding to the group's
lambda index value, then 'random' is used for that group (see below).

'random'
```
Not a fan... 👎
```python
# results can vary slightly with different machines
# so possibly do
# delta = 10
# assert size - delta < len(sliced) < size + delta
```
You could also turn it into a floating point comparison, e.g., `size/len(sliced)` as the fraction, and then just use `assert_almost_equal(..., decimals=2)` or whatever. Either way, if you already know that there can be variation then put it in the test and comment it.
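A sketch of that suggestion (note that numpy's keyword is `decimal`, not `decimals`; the helper name is hypothetical):

```python
from numpy.testing import assert_almost_equal

def check_subsample_size(sliced, size):
    # compare as a ratio so small machine-dependent variations still pass
    assert_almost_equal(len(sliced) / size, 1.0, decimal=2)
```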
```diff
-def test_conservative(self, data, size, conservative):
-    sliced = self.slicer(data, series=data.iloc[:, 0], conservative=conservative)
+def test_conservative(self, data, size, conservative, how):
+    sliced = self.subsampler(data, how=how, conservative=conservative)
```
less stringent test?
Thank you for your review @orbeckst! Your comments are incredibly helpful in steering this refactor to a satisfying state; I will be addressing the points you have raised over the next several days.
@dotsdl what do we need here to get this in? I'm asking because if we want to do breaking changes then I'd rather do it sooner rather than later. Or we do 0.4 #113 and then do this one for a 0.5 with an eye towards 1.0. Overall it looks to me that alchemlyb has sufficiently stabilized to just declare the API stable soon.
Is this PR still relevant? @xiki-tempula how much of what's here was already addressed in 1.0.1/2.0.0-dev?
I don't think there are significant features here that are not implemented in 2.0.0. But I do feel that the documentation could benefit from this work.
Addresses #95, #79, #91
Required for merge:

- `alchemlyb.preprocessing.subsampling` module
- `alchemtest`
This is an API breaking change.
All three of our subsampling functions should now be able to consume any DataFrame in our two standard `u_nk` and `dHdl` forms and produce meaningful results. Each of these functions now performs a `groupby` with all non-time indexes, and separately performs its operation on each group. The resulting subsampled groups are concatenated and returned.

Some notes on the implementation:

1. The `series` kwarg is now the `column` kwarg for `statistical_inefficiency` and `equilibrium_detection`. This can take either a column name in the DataFrame or a Series object (the index of the Series must match exactly that of the DataFrame). Taking a Series object allows code that uses the existing usage of e.g. `statistical_inefficiency(u_nk, u_nk[<column_name>])` to still work as before.
2. `statistical_inefficiency` and `equilibrium_detection` now have a `how` kwarg. I expect this to be the main usage we suggest for users of these functions (see the usage sketch after this list). I took a look at how `alchemical_analysis` currently handles `dHdl` and `u_nk` cases, and to replicate that as closely as possible, we recommend using `how='right'` for `u_nk` data and `how='sum'` for `dHdl` data. Please see the docstrings for `statistical_inefficiency` and `equilibrium_detection` for details on these treatments.
3. Either `column` or `how` must be used for `statistical_inefficiency` and `equilibrium_detection`. The functions can no longer default to simple slicing if only `data` is given as an input. This is to avoid (potentially new) users applying the functions without specifying these inputs, with no obvious failure but with the function doing nothing to decorrelate their data (essentially a silent failure).
4. I've added a `force` kwarg to `statistical_inefficiency` and `equilibrium_detection`. The only exception these functions will skip with `force=True` is one due to duplicate time values in the groupings, as this tends to indicate that two or more timeseries have been dumped together, which is unlikely to be what one wants with subsampling functions that assume correlation to work as intended.
5. Setting `return_calculated=True` for `statistical_inefficiency` and `equilibrium_detection` will return the calculated values used for subsampling each lambda grouping: for `statistical_inefficiency`, `statinef`; for `equilibrium_detection`, `t`, `statinef`, and `Neff_max`.
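A hypothetical usage sketch of the API described above (call signatures taken from this description; the merged implementation may differ):

```python
# Sketch only: `how=` and `return_calculated=` are the kwargs proposed in
# this PR and may not match the released alchemlyb API.
from alchemtest.gmx import load_benzene
from alchemlyb.parsing.gmx import extract_u_nk
from alchemlyb.preprocessing.subsampling import statistical_inefficiency

u_nk = extract_u_nk(load_benzene().data['Coulomb'][0], T=300)

# recommended observable choice for u_nk data, per this description
u_nk_sub = statistical_inefficiency(u_nk, how='right')

# optionally also retrieve the calculated values for each lambda grouping
u_nk_sub, calculated = statistical_inefficiency(u_nk, how='right',
                                                return_calculated=True)
```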
Some other nice things: