
WIP: Refactor of subsampling module #98 (Open)

dotsdl wants to merge 9 commits into master
Conversation

@dotsdl (Member) commented Jan 29, 2020

Addresses #95, #79, #91

Required for merge:

  • reference documentation for the functions included in the alchemlyb.preprocessing.subsampling module
  • example notebook showing usage, as well as rationale for usage, of each function
  • tests across more datasets in alchemtest
  • changelog entry

This is an API-breaking change.

All three of our subsampling functions should now be able to consume any DataFrame in our two standard u_nk and dHdl forms and produce meaningful results. Each of these functions now performs a groupby with all non-time indexes, and separately performs its operation on each group. The resulting subsampled groups are concatenated and returned.
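As a rough sketch of this pattern (the subsample_group helper here is a placeholder, not the actual implementation):

import pandas as pd

def subsample(df, subsample_group):
    # all index levels except the outermost (assumed to be time) define the groups
    group_levels = list(df.index.names[1:])
    groups = [subsample_group(group)
              for _, group in df.groupby(level=group_levels)]
    # concatenate the subsampled groups and return the result
    return pd.concat(groups)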

Some notes on the implementation:

  1. The series kwarg is now the column kwarg for statistical_inefficiency and equilibrium_detection. It can take either a column name in the DataFrame or a Series object (the index of the Series must match that of the DataFrame exactly). Accepting a Series object allows existing usage such as statistical_inefficiency(u_nk, u_nk[<column_name>]) to keep working as before.
  2. statistical_inefficiency and equilibrium_detection now have a how kwarg. I expect this to be the main usage we suggest for users of these functions. I took a look at how alchemical_analysis currently handles the dHdl and u_nk cases, and to replicate that as closely as possible we recommend how='right' for u_nk data and how='sum' for dHdl data. See the docstrings of statistical_inefficiency and equilibrium_detection for details on these treatments, and the usage sketch after this list.
  3. Either column or how must be given for statistical_inefficiency and equilibrium_detection. These functions no longer default to simple slicing when only data is given. This is to prevent (potentially new) users from applying the functions without specifying these inputs and getting no obvious failure, while the function does nothing to decorrelate their data (essentially a silent failure).
  4. I've added the force kwarg to statistical_inefficiency and equilibrium_detection. The only exception these functions will still raise with force=True is the one due to duplicate time values within a grouping, as duplicates tend to indicate that two or more timeseries have been dumped together, which is unlikely to be what one wants with subsampling functions that assume correlated timeseries.
  5. Setting return_calculated=True for statistical_inefficiency, equilibrium_detection will return calculated values used for subsampling each lambda grouping:
    • statistical_inefficiency: statinef
    • equilibrium_detection: t, statinef, Neff_max
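A usage sketch based on the notes above, where u_nk and dHdl are DataFrames in the standard forms (only the kwargs discussed in this description are grounded; the rest is illustrative):

from alchemlyb.preprocessing.subsampling import (statistical_inefficiency,
                                                 equilibrium_detection)

# recommended observables, mirroring alchemical_analysis
u_nk_sub = statistical_inefficiency(u_nk, how='right')
dHdl_sub = statistical_inefficiency(dHdl, how='sum')

# alternatively, decorrelate on an explicit column
# (a column name, or a Series whose index matches the DataFrame exactly)
u_nk_sub = equilibrium_detection(u_nk, column=u_nk.columns[0])

# retrieve the values calculated for each lambda grouping
dHdl_sub, calculated = equilibrium_detection(dHdl, how='sum',
                                             return_calculated=True)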

Some other nice things:

  1. As part of this PR I will be adding the minimal assumptions the standard forms make to their doc page. One of the assumptions I want to propagate is that the outermost index is always time, or at least something implying time (frame, sample, whatever); it needn't be called 'time' for the tooling to work.
  2. These subsamplers now explicitly call out that they can't be expected to preserve any custom ordering of rows. They will always do sorting of some kind, so the time index values are meaningful (see the sketch below).
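For illustration, a minimal dHdl-style frame with time as the outermost index level (the index and column names here are made up, since only the ordering assumption matters):

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[0.0, 10.0, 20.0], [0.0, 0.5, 1.0]],
                                 names=['time', 'fep-lambda'])
dHdl = pd.DataFrame({'fep': np.random.rand(len(idx))}, index=idx)

# the subsamplers sort rows themselves, so only the time index values matter
dHdl = dHdl.sort_index(level='time')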

codecov bot commented Jan 29, 2020

Codecov Report

Attention: Patch coverage is 81.96721% with 22 lines in your changes missing coverage. Please review.

Project coverage is 94.71%. Comparing base (7fcdf7a) to head (dc1b3ff).
Report is 197 commits behind head on master.

Files with missing lines                        Patch %   Lines
src/alchemlyb/preprocessing/subsampling.py      81.96%    13 missing, 9 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
- Coverage   97.26%   94.71%   -2.55%     
==========================================
  Files          12       12              
  Lines         695      776      +81     
  Branches      141      158      +17     
==========================================
+ Hits          676      735      +59     
- Misses          5       18      +13     
- Partials       14       23       +9     


@dotsdl (Member Author) commented Jan 29, 2020

This gist demonstrates the behavior of the refactored functions.

Not getting all passes yet, and I'm not sure I'm happy with how these work.
There are clear shortcomings with the static column approach, perhaps
requiring manual effort on the part of the user to work around.
This is counter to alchemlyb's philosophy of making the 'most right'
thing easy to do.
In the spirit of making it easy for users to do the 'most right' thing,
I have looked at the details of what `alchemical-analysis` currently
does for subsampling with statisticalInefficiency, and it uses an
approach similar to the one now documented in each of our
`statistical_inefficiency` and `equilibrium_detection` functions.

Next steps are to *actually* implement this functionality.
We are going to make our lives easier by adding an `alchemform`
attribute to the dataframes our parsers produce, and adding this to the
standard form spec. This attribute will have 'dHdl' or 'u_nk' as its
value, depending on the form of data we're dealing with.
Our estimators can also use this as a check or warning.
The `how` methods are implemented; they still require explicit tests.
We should also add logging to each of these functions, using a module-level logger.
This would allow for auditability later in case downstream issues are
discovered.
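A minimal sketch of what that module-level logger could look like (the message content is illustrative):

import logging

# one logger per module, e.g. 'alchemlyb.preprocessing.subsampling'
logger = logging.getLogger(__name__)

def log_group_result(group_key, statinef):
    # record the statistical inefficiency calculated for a lambda grouping,
    # so downstream issues can be audited later
    logger.info("group %s: statistical inefficiency g = %.3f",
                group_key, statinef)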
About to break out the generic bits to make the same changes to
equilibrium_detection.

All tests pass, but we should think carefully about which tests we
really want to add to ensure these are working as expected.
Using something like `pd.concat` on a set of DataFrames doesn't
propagate the `.attrs` dictionary, so having an `alchemform` key created
by the parsers is not sufficient for signalling the type of data to
preprocessors.
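A quick demonstration of the problem (this reflects pandas behavior at the time; newer releases may propagate attrs when they match across inputs):

import pandas as pd

df1 = pd.DataFrame({'fep': [0.1, 0.2]})
df1.attrs['alchemform'] = 'dHdl'
df2 = pd.DataFrame({'fep': [0.3, 0.4]})
df2.attrs['alchemform'] = 'dHdl'

combined = pd.concat([df1, df2])
print(combined.attrs)  # {} -- the 'alchemform' key is lost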

We will likely remove the 'auto' option from the subsamplers and insist
on explicit values for `how`. This is not *necessarily* bad, as it
promotes some understanding on the user's part of what they are choosing
to do here.
Alternatively, they can choose a `column`.
I think this is appropriate, as it is still important that users are
consciously aware of some of the high-level choices the software is making.
There will always be touchpoints like this, though we should try to
minimize them where possible.
These changes apply to the tests of `statistical_inefficiency` and `equilibrium_detection`.
I also set `nskip=5` for the `equilibrium_detection` cases, which
required some changes to test parameters. This speeds up the tests at no
real coverage cost.
@dotsdl (Member Author) commented Mar 11, 2020

The changes I've made to the codebase are ready for review! I've updated the PR description with the current state of the design, which took a few iterations to arrive at. I would like some feedback on this design as soon as possible so changes can be made to accommodate needs. This PR is also a bit of a blocker for #94 and #99, since subsampling is a key component of the alchemlyb workflow.

I am now working on an example notebook that will make its way into our examples docs; this will showcase how to use these functions, as well as when each makes sense to use. I don't consider this notebook a requirement for review of the implementation, but consider it important to make sure the feel of this approach is right. I think I can turn it around and add to this PR by the end of this week.

@dotsdl (Member Author) commented Mar 11, 2020

@orbeckst and @mrshirts I've requested your eyes to give this implementation a pass. In particular, I would like a check on the how parameter options for statistical_inefficiency and equilibrium_detection, and whether they make sense or appear correct.

@orbeckst (Member) left a comment:

Hi @dotsdl , I had a quick look and overall this looks like great work. Some comments inline.


g \equiv (1 + 2\tau) > 1

where :math:`\tau` is the autocorrelation time of the timeseries.
@orbeckst (Member): Tau has to be dimensionless for this to make sense. Define what tau is.

@dotsdl (Member Author): Ah, good call. Tau in this usage is measured in frames, or ordered rows, of a timeseries. I will change the wording to reflect this.
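For reference, a sketch of how the docstring could define both quantities in its own reST notation (with τ in frames; this wording is a suggestion, not the merged text):

.. math::

    g \equiv 1 + 2\tau > 1, \qquad
    \tau = \sum_{t=1}^{T-1} \left(1 - \frac{t}{T}\right) C(t)

where :math:`C(t)` is the normalized autocorrelation function of the timeseries and :math:`T` its length in frames.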


The ``how`` parameter indicates the choice of observable for performing the calculation of :math:`g`.

* For :math:`u_{nk}` datasets, the choice of ``'right'`` is recommended: the column immediately to the right of the column corresponding to the group's lambda index value is used as the observable.
@orbeckst (Member): Where do these recommendations come from? Cite a paper or provide more detail.

@dotsdl (Member Author) Mar 16, 2020: I'd be happy to, but they come from my attempt to emulate what is already done in alchemical-analysis.py, and I don't know if a reference exists for the choices made there. @mrshirts may be able to provide some direction on this.

I do believe I understand the reason for the choices, however: in principle, any column in a timeseries would be fine to base decorrelation on, so long as the column is not uniformly zero or insensitive to time/frame for the given lambda window being sampled.

  1. Choosing the column to the right of the lambda window being sampled for u_nk avoids cases where the u_nk of calculated lambdas are relative to the sampled one, giving all zeros for the column corresponding to the lambda actually sampled. An example of this can be observed with:
from alchemtest.gmx import load_water_particle_without_energy
from alchemlyb.parsing.gmx import extract_u_nk
import pandas as pd

ds = load_water_particle_without_energy()

print(pd.concat(extract_u_nk(i, T=300) for i in ds.data['AllStates']).sort_index(level=[1, 2, 0]))

This gives:

                               (0.0, 0.0)  ...    (1.0, 1.0)
time   coul-lambda vdw-lambda              ...              
0.0    0.0         0.0           0.000000  ...    148.347412
18.6   0.0         0.0           0.000000  ...   1983.809792
37.2   0.0         0.0           0.000000  ...    205.930556
55.8   0.0         0.0           0.000000  ...   4512.816690
74.4   0.0         0.0           0.000000  ...  25939.040202
...                                   ...  ...           ...
9913.8 1.0         1.0          45.353152  ...      0.000000
9932.4 1.0         1.0          41.207448  ...      0.000000
9951.0 1.0         1.0          44.375182  ...      0.000000
9969.6 1.0         1.0          45.505032  ...      0.000000
9988.2 1.0         1.0          37.779798  ...      0.000000

[20444 rows x 38 columns]


Note that if we chose an arbitrary column, then for one of the groupings of the ('coul-lambda', 'vdw-lambda') indices we would end up trying to decorrelate on all zeros. The how="right" choice sidesteps this (as does how="left").

  2. The approach of summing all dHdl columns is not as clear to me, and I wouldn't be surprised if I've misinterpreted the implementation in alchemical-analysis.py. It looks vaguely like :math:`\Delta H = (dH/d\lambda_0)\,\Delta\lambda_0 + (dH/d\lambda_1)\,\Delta\lambda_1 + \ldots` with :math:`\Delta\lambda_0 = \Delta\lambda_1 = \ldots = 1`, but the detailed motivation for this choice isn't clear to me from the code or the git logs. A sketch of both observable choices follows below.
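For concreteness, a sketch of how these two observables could be built for a single lambda grouping (group_u_nk, group_dHdl, and lam are assumed to hold the group's frame and its lambda index value; this is illustrative, not the actual implementation):

# 'right': use the column immediately to the right of the sampled lambda
cols = list(group_u_nk.columns)
observable = group_u_nk[cols[cols.index(lam) + 1]]

# 'sum': collapse all dH/dl columns into a single Series observable
observable = group_dHdl.sum(axis=1)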

* For :math:`u_{nk}` datasets, the choice of ``'right'`` is recommended: the column immediately to the right of the column corresponding to the group's lambda index value is used as the observable.
* For :math:`\frac{dH}{d\lambda}` datasets, the choice of ``'sum'`` is recommended: the columns are simply summed, and the resulting :py:class:`pandas.Series` is used as the observable.

See the API documentation below on the possible values for ``how``, as well as more detailed explanations for each choice.
@orbeckst (Member): That's not good enough when you explicitly tell users what they should be doing. Be clear here why you're recommending one over the other.

@dotsdl (Member Author): Agreed, I would like some assistance with this. See my notes above.


Equilibrium Detection
---------------------
The :func:`~alchemlyb.preprocessing.subsampling.equilibrium_detection` function subsamples each timeseries in the dataset using the equilibration detection approach developed by John Chodera (see reference below).
@orbeckst (Member): Instead of "reference below", cite it.

@orbeckst (Member): I would generally avoid mentioning people by name, as it sounds too much like trying to convince readers through appeal to authority instead of facts (i.e., papers).

@dotsdl (Member Author): Understood, I will change this accordingly.

See :py:func:`pymbar.timeseries.detectEquilibration` for more details; this is used internally by this subsampler.
Please reference the following if you use this function in your research:

[1] John D. Chodera. A simple method for automated equilibration detection in molecular simulations. Journal of Chemical Theory and Computation, 12:1799, 2016.
@orbeckst (Member): Use proper reST citation syntax.

``pymbar.timeseries.statisticalInefficiency()``; previously, the statistical
inefficiency was _rounded_ (instead of ``ceil()``) and thus one could
end up with correlated data.
``pymbar.timeseries.subsampleCorrelatedData()``; previously, the
@orbeckst (Member): Did pymbar change the function name or was this a mistake?

@dotsdl (Member Author): This was a mistake in the docstring that I fixed here; the same mistake was not made in the equivalent line for equilibrium_detection.


return indices, calculated

return _decorrelator(
@orbeckst (Member): I like the more modular approach here.

@dotsdl (Member Author): Thank you!

is used. If there is no column to the left, then the column to the
right is used. If there is no column corresponding to the group's
lambda index value, then 'random' is used for that group (see below).
'random'
@orbeckst (Member): Not a fan... 👎

# results can vary slightly with different machines
# so possibly do
# delta = 10
# assert size - delta < len(sliced) < size + delta
@orbeckst (Member): You could also turn it into a floating point comparison, e.g., size/len(sliced) as the fraction, and then just use assert_almost_equal(..., decimals=2) or whatever. Either way, if you already know that there can be variation, then put it in the test and comment it.
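For instance, a hedged sketch of the ratio-based check (the tolerance is a judgment call, and self.subsampler follows the test class quoted below):

from numpy.testing import assert_almost_equal

def test_conservative(self, data, size, conservative, how):
    sliced = self.subsampler(data, how=how, conservative=conservative)
    # compare sizes as a ratio so small machine-dependent variation still passes
    assert_almost_equal(len(sliced) / size, 1.0, decimal=2)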

- def test_conservative(self, data, size, conservative):
-     sliced = self.slicer(data, series=data.iloc[:, 0], conservative=conservative)
+ def test_conservative(self, data, size, conservative, how):
+     sliced = self.subsampler(data, how=how, conservative=conservative)
@orbeckst (Member): less stringent test?

@dotsdl (Member Author) commented Mar 16, 2020

Thank you for your review @orbeckst! Your comments are incredibly helpful in steering this refactor to a satisfying state; I will be addressing the points you have raised over the next several days.

@orbeckst (Member) commented:
@dotsdl what do we need here to get this in? I'm asking because if we want to do breaking changes then I'd rather do it sooner rather than later.

Or we do 0.4 #113 and then do this one for a 0.5 with an eye towards 1.0. Overall it looks to me that alchemlyb has sufficiently stabilized to just declare the API stable soon.

@orbeckst (Member) commented:
Is this PR still relevant?

@xiki-tempula how much of what's here was already addressed in 1.0.1/2.0.0-dev?

@xiki-tempula (Collaborator) commented:
I don't think there are significant features here that are not implemented in 2.0.0. But I do feel that the documentation could benefit from what's in this PR.
