WIP: Refactor of subsampling module #98

dotsdl · 2020-01-29T05:57:52Z

Addresses #95, #79, #91

Required for merge:

reference documentation for the functions included in the alchemlyb.preprocessing.subsampling module
example notebook showing usage, as well as rationale for usage, of each function
tests across more datasets in alchemtest
changelog entry

This is an API breaking change.

All three of our subsampling functions should now be able to consume any DataFrame in our two standard u_nk and dHdl forms and produce meaningful results. Each of these functions now performs a groupby with all non-time indexes, and separately performs its operation on each group. The resulting subsampled groups are concatenated and returned.

Some notes on the implementation:

The series kwarg is now the column kwarg for statistical_inefficiency, equilibrium_detection. This can take either a column name in the DataFrame or a Series object (the index of the Series must match exactly that of the DataFrame). Taking a Series object allows for code that uses the existing usage of e.g. statistical_inefficiency(u_nk, u_nk[<column_name>]) to still work as before.
statistical_inefficiency and equilibrium_detection now have a how kwarg. I expect this to be the main usage we suggest for users of these functions. I took a look at how alchemical_analysis currently handles dHdl and u_nk cases, and to replicate that as closely as possible, we recommend using how='right' for u_nk data, and how='sum' for dHdl data. Please see the docstrings for statistical_inefficiency, equilibrium_detection for details on these treatments.
Either column or how must be used for statistical_inefficiency, equilibrium_detection. The function can no longer default to simple slicing if only data is given as an input. This is to avoid (potentially new) users from applying the functions without specifying these inputs with no obvious failure, but with the function doing nothing to decorrelate their data (essentially a silent failure).
I've added the force kwarg to statistical_inefficiency, equilibrium_detection. The only exception these functions will skip with force=True is one due to duplicate time values in the groupings, as this tends to indicate two or more timeseries have been dumped together, which is unlikely what one wants to do with these subsampling functions that assume correlation to work as intended.
Setting return_calculated=True for statistical_inefficiency, equilibrium_detection will return calculated values used for subsampling each lambda grouping:
- statistical_inefficiency: statinef
- equilibrium_detection: t, statinef, Neff_max

Some other nice things:

I will be adding the minimal assumptions the standard forms make to their doc page as part of this change's PR. One of the assumptions I want to propagate is that the outermost index is always time, or at least something implying time (frame, sample, whatever). It needn't be called 'time', however, for all the tooling to work.
These subsamplers now explicitly call out that they can't be expected to preserve any custom ordering of rows. They will always be doing sorting of some kind, so the time index values are meaningful.

All three of our subsampling functions should now be able to consume any dataframe in our two standard u_nk and dHdl forms and produce meaningful results. Each of these functions now performs a groupby with all non-time indexes, and separately performs its operation on each group. Some notes on the implementation: 1. The `series` kwarg is now the `column` arg for `statistical_inefficiency`, `equilibrium_detection`. A column name is now what we use, as requiring a column of the dataframe offers guarantees around the index matching, ordering, etc. that we had to do backflips to guarantee before. It is still possible to use another series as the basis for either operation, but it should be grafted on to the dataframe as a column to do so. 2. Since `column` is an arg, it is now required for the above functions. Therefore, those functions no longer behave like `slicing` when `column` isn't specified. My logic with breaking this is to favor explicitness and to avoid surprising users with behavior that throws no warnings but silently may not be doing what they want. 3. I've added some kwargs to `force` ignoring exceptions. The only exception these functions throw now is one due to duplicate time values in the groupings, as this tends to indicate two or more timeseries have been dumped together, which is unlikely what one wants to do with these subsampling functions that assume correlation to work as intended. Some other nice things: 1. I will be adding the minimal assumptions the standard forms make to their doc page as part of this change's PR. One of the assumptions I want to propagate is that the outermost index is *always* time, or at least something implying time (frame, sample, whatever). It needn't be called 'time', however, for all the tooling to work. 2. These subsamplers now explicitly call out that they *can't be expected* to preserve any custom ordering of rows. They will always be doing sorting of some kind, so the time index values are meaningful.

codecov · 2020-01-29T06:01:16Z

Codecov Report

Attention: Patch coverage is 81.96721% with 22 lines in your changes missing coverage. Please review.

Project coverage is 94.71%. Comparing base (7fcdf7a) to head (dc1b3ff).
Report is 197 commits behind head on master.

Files with missing lines	Patch %	Lines
src/alchemlyb/preprocessing/subsampling.py	81.96%	13 Missing and 9 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #98      +/-   ##
==========================================
- Coverage   97.26%   94.71%   -2.55%     
==========================================
  Files          12       12              
  Lines         695      776      +81     
  Branches      141      158      +17     
==========================================
+ Hits          676      735      +59     
- Misses          5       18      +13     
- Partials       14       23       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dotsdl · 2020-01-29T06:07:53Z

This gist demonstrates the behavior of the refactored functions.

Not getting all passes yet, and not sure I'm happy with how these work. There are clear shortcomings with the static column approach, requiring perhaps manual effort on the part of the user to work around. This is counter to alchemlyb's philosophy of making the 'most right' thing easy to do.

In the spirit of wanting to make it easy for users to do the 'most right' thing, I have looked at the details of what `alchemical-analysis` currently does for subsampling with statisticalInefficiency, and it uses an approach similar to the one documented in each of our `statistical_inefficiency` and `equilbrium_detection` functions now. Next steps are to *actually* implement this functionality. We are going to make our lives easier by adding an `alchemform` attribute to dataframes our parsers produce, and add this to the standard form spec. This attribute will have 'dHdl' or 'u_nk' as its value, depending on the form of data we're dealing with. Our estimators can also use this as a check or warning.

`how` methods implemented; these require explicit tests yet. We should also add logging to each of these functions, using a module-level logger. This would allow for auditability later in case downstream issues are discovered.

About to break out the generic bits to make the same changes to equilibrium_detection.

…iciency All tests pass, but we should try and think carefully about tests that we really want to add to ensure these are working as expected.

Using something like `pd.concat` on a set of DataFrames doesn't propagate forward the `.attrs` dictionary, so having an `alchemform` key here created by the parsers is not sufficient for signalling to preprocessors the type of data. We will likely remove the 'auto' option from the subsamplers and insist on explicit values for `how`. This is not *necessarily* bad, as it promotes some understanding for the user of what they are choosing to do here.

Alternatively, they can choose a `column`. I think this is appropriate, as it is still important that users are consciously aware of some of the high-level choices the software is making. There will always be touchpoints like this, though we should try to minimize them where possible.

These changes apply to tests of `statistical_inefficiency`, `equilibrium_detection`. I also set `nskip=5` for `equilibrium_detection` cases, which required some changes to test parameters. This change speeds up the tests, however, at no real coverage cost.

dotsdl · 2020-03-11T23:39:35Z

The changes I've made to the codebase are ready for review! I've updated the PR description with the current state of the design, as it took a few iterations for me to arrive at. I would like some feedback on this design as soon as possible so changes can be made to accommodate needs. This PR is also a bit of a blocker for #94 and #99, since subsampling is a key component of the alchemlyb workflow.

I am now working on an example notebook that will make its way into our examples docs; this will showcase how to use these functions, as well as when each makes sense to use. I don't consider this notebook a requirement for review of the implementation, but consider it important to make sure the feel of this approach is right. I think I can turn it around and add to this PR by the end of this week.

dotsdl · 2020-03-11T23:42:40Z

@orbeckst and @mrshirts I've requested your eyes to give this implementation a pass. In particular, I would like a check on the how parameter options for statistical_inefficiency and equilibrium_detection, and whether they make sense or appear correct.

orbeckst

Hi @dotsdl , I had a quick look and overall this looks like great work. Some comments inline.

orbeckst · 2020-03-15T22:45:39Z