This issue was moved to a discussion.
Bootstrapping: should it be at the alchemlyb level, or pymbar level? #80
Comments
If there's a common function that we can use for "everything" (and just plug in the estimator), then alchemlyb would be a good place, I think – along the lines that we only have to write and test the code once. I generally like building blocks that I can freely combine. Something like

```python
def bootstrapped(data, estimator):
    ...
    return mean, error
```

Alternatively, we could hide the machinery in the estimators themselves. One advantage of doing it at the alchemlyb level is that it might not be difficult to run alchemlyb with dask (essentially, use the dask DataFrame), and then the bootstrapping could be parallelized without effort. A while ago @dotsdl played around with alchemlyb and dask – I can't quite remember how much would need to be changed.
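A minimal sketch of such a building block (the function name follows the snippet above; the choice to resample rows with replacement and to return the bootstrap mean and standard deviation is my assumption, not something the thread fixes):

```python
import numpy as np

def bootstrapped(data, estimator, n_boot=200, seed=None):
    """Bootstrap an estimator over the rows of `data`.

    data      : 2-D array of samples (one observation per row)
    estimator : callable mapping an array of samples to a scalar
    n_boot    : number of bootstrap resamples
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    # resample rows with replacement and re-run the estimator each time
    results = np.array([
        estimator(data[rng.integers(0, n, size=n)])
        for _ in range(n_boot)
    ])
    return results.mean(), results.std(ddof=1)

# toy usage: the "estimator" is just the sample mean here
data = np.random.default_rng(0).normal(size=(1000, 1))
mean, error = bootstrapped(data, lambda x: x.mean(), n_boot=100, seed=1)
```

Because the estimator is passed in as a plain callable, the same loop could later be handed a dask-backed array without changing its signature.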
That might be a good idea. It could return a dictionary of all the bootstrapped results, along with the uncertainty estimate. I'll think about how to organize this. One issue is that the data will look different for each estimator, thus requiring fairly different conditionals inside the bootstrapping code. Also, if one were analyzing K states, calculating the free energy with BAR executed pairwise, one would want to bootstrap over the entire data set of K states; i.e. you would need to bootstrap the entire procedure, not a single estimator.
@mrshirts do you have a paper or writeup you can point to for this approach? I'd be happy to prototype something. We may be able to steal design inspiration from …
So, I don't really have a good simple paper. http://www.alchemistry.org/wiki/Analyzing_Simulation_Results#Bootstrap_Sampling is a good summary. I agree that something like `bootstrapped(data, estimator)` would work. After the bootstrap sampling with replacement, everything else is pretty trivial. You calculate your function on each of the bootstrapped data sets. You then have a set of results (these could be multivalued returns), and you can simply return a list of all the answers. You can optionally return various statistical measures of this list for each of the results – mean, standard deviation, confidence intervals. One could make decorrelation of the data sets part of the algorithm, but it would perhaps be more modular to do the decorrelation as a separate step.
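The post-processing step described above – keep the full list of bootstrapped results and optionally summarize it – might look like the sketch below. The percentile confidence interval is one common choice, not something the thread prescribes, and the function name is mine:

```python
import numpy as np

def summarize_bootstrap(boot_results, ci=0.95):
    """Summarize a list of bootstrapped estimates.

    Returns the mean, the standard deviation (the bootstrap error
    estimate), and a percentile confidence interval.
    """
    boot_results = np.asarray(boot_results)
    alpha = (1.0 - ci) / 2.0
    lower, upper = np.quantile(boot_results, [alpha, 1.0 - alpha])
    return {
        "mean": boot_results.mean(),
        "std": boot_results.std(ddof=1),
        "ci": (lower, upper),
    }

# toy usage: pretend these are 500 bootstrapped free-energy estimates
boot = np.random.default_rng(2).normal(loc=5.0, scale=0.1, size=500)
stats = summarize_bootstrap(boot)
```

Returning the raw list alongside the summary keeps the two concerns (resampling and statistics) modular, in the same spirit as doing decorrelation as a separate step.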
Dear alchemlyb team!
Hey all, after discussions with @wildromi, I've committed to working on this issue over the next two weeks. I expect the first iteration to be usable but probably not the approach we end up with. I'll post a WIP PR as soon as I can.
Hi, David – I'd love to talk some more about this, as I've been dealing with similar setups for a while. Shoot me an email at the CU email and we can strategize some more? A key is bootstrapping simultaneously over multiple time series, for example.
@mrshirts sent! I'm looking forward to leveraging your experience to jumpstart the approach.
@dotsdl: take a look at https://github.com/choderalab/pymbar/blob/pmf/pymbar/pmf.py, lines 590 to 615, to get a sense of how bootstrapping works in a complicated case (in this case, calculating a potential of mean force).
I met with @mrshirts yesterday, and we aligned on an approach. I have started a WIP PR in #94. There is a list of things to do yet, but we have the start of our implementation. You can check out how things work so far in this gist. Comments welcome! Please don't use this in production work until we have tests ensuring that …
The gist for #94 has been updated; it requires components of #98, which can be played with on this branch.
Should bootstrapping be implemented at the alchemlyb level, or the pymbar level? For MBAR, it would be better at the pymbar level, since it can be easily encapsulated (the user doesn't have to worry about it), and one can request either uncertainty estimate.
For BAR over several states, the bootstrapping needs to be done at the level ABOVE the BAR call, since we need to bootstrap all of the data simultaneously before feeding it into BAR. The same holds for EXP applied to a string of states.
Thoughts?
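To illustrate why the multi-state case forces the loop to sit above the estimator, here is a sketch that resamples all K state trajectories simultaneously and only then runs a pairwise estimator. The `pairwise_estimate` callable is a hypothetical placeholder standing in for BAR between adjacent states (calling the real pymbar BAR is beyond this sketch), and the function name is mine:

```python
import numpy as np

def bootstrap_pairwise(state_data, pairwise_estimate, n_boot=100, seed=None):
    """Bootstrap a chain of pairwise free-energy estimates over K states.

    state_data        : list of K per-state sample arrays
    pairwise_estimate : callable(samples_i, samples_j) -> scalar delta-f
                        (placeholder for BAR between adjacent states)
    """
    rng = np.random.default_rng(seed)
    totals = []
    for _ in range(n_boot):
        # resample every state's data *before* any pairwise call, so one
        # consistent bootstrap dataset feeds the whole chain of estimates
        resampled = [s[rng.integers(0, len(s), size=len(s))]
                     for s in state_data]
        total = sum(pairwise_estimate(resampled[i], resampled[i + 1])
                    for i in range(len(resampled) - 1))
        totals.append(total)
    totals = np.asarray(totals)
    return totals.mean(), totals.std(ddof=1)

# toy usage: 4 states whose "free energies" are their sample means,
# so the pairwise estimator is just a difference of means
rng = np.random.default_rng(0)
states = [rng.normal(loc=k, scale=0.5, size=400) for k in range(4)]
df, ddf = bootstrap_pairwise(states, lambda a, b: b.mean() - a.mean(), seed=1)
```

Note that if each pairwise call resampled its own inputs independently, the per-segment errors would no longer refer to one consistent dataset; resampling once per bootstrap iteration, above the estimator, is what the comment argues for.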