
batching in MultiDiracDeterminant::mw_accept_rejectMove #5071

Merged: 9 commits merged into QMCPACK:develop on Aug 23, 2024

Conversation

@kgasperich (Contributor) commented Jun 28, 2024

Proposed changes

Restructure MultiDiracDeterminant::mw_accept_rejectMove to handle all accepted moves together and all rejected moves together (minimize branching).
A first pass through all walker moves is done to get a vector of accepted walker indices and a vector of rejected walker indices, then the necessary data movement is done first for all accepted moves and then for all rejected moves.

The code introduced in this PR retains the same data movement patterns as the old code (which is still present in acceptMove and restore), but there are some differences in how it's done; a minimal sketch of the new loop structure follows the bullets below.

  • I separated accepted and rejected moves so that similar work is all done together (i.e. loop over all accepted then all rejected, rather than loop over all walkers and branch based on accept/reject for each walker)
  • I built pointer lists for the data that needs to be copied
    • old
      • loop over walkers, for each walker, do necessary copies with std::copy and do H2D transfer
    • new:
      • loop over accepted walkers, build list of pointers to the start of each set of data elements (psiMinv_temp, psiMinv, psiV, TpsiM, etc.) that need to be copied
      • loop over accepted walkers, copy with BLAS::copy using pointer lists (can be trivially replaced with copy_batched in subsequent PR when data movement requirements are figured out)
      • loop over accepted walkers, do H2D transfer
      • similar for rejected walkers
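
A minimal, self-contained sketch of the new structure (the walker struct, member names, and std::copy here are illustrative stand-ins, not the actual QMCPACK classes; only the psiMinv_temp -> psiMinv copy is shown, and the real code uses BLAS::copy where this sketch uses std::copy):

#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using ValueType = double;

// illustrative stand-in for one walker's MultiDiracDeterminant data
struct FakeWalkerData
{
  std::vector<ValueType> psiMinv_temp; // proposed data, filled when the move was proposed
  std::vector<ValueType> psiMinv;      // current data, overwritten on accept
};

void mw_accept_reject_sketch(std::vector<FakeWalkerData>& walkers, const std::vector<bool>& isAccepted)
{
  assert(walkers.size() == isAccepted.size());

  // first pass: split walkers into accepted / rejected index lists
  std::vector<std::size_t> acc_idx, rej_idx;
  for (std::size_t iw = 0; iw < walkers.size(); ++iw)
    (isAccepted[iw] ? acc_idx : rej_idx).push_back(iw);

  // build pointer lists for the accepted walkers (one source/destination pair per walker)
  std::vector<const ValueType*> src_list;
  std::vector<ValueType*> dst_list;
  for (std::size_t iw : acc_idx)
  {
    src_list.push_back(walkers[iw].psiMinv_temp.data());
    dst_list.push_back(walkers[iw].psiMinv.data());
  }

  // do all the accepted-walker copies together; in the PR this loop uses BLAS::copy
  // and is the natural place for a future copy_batched call
  for (std::size_t i = 0; i < acc_idx.size(); ++i)
  {
    const std::size_t n = walkers[acc_idx[i]].psiMinv_temp.size();
    std::copy(src_list[i], src_list[i] + n, dst_list[i]);
  }

  // a single H2D transfer pass over the accepted walkers would follow here, and then
  // the analogous pointer-list / copy / transfer steps for the rejected walkers
  // (restoring the temporaries from the current data)
}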

todo in subsequent PR:

  • change copy loop to copy_batched (if available on platform?)
  • consider pre-allocating some work space for acc/rej idx lists and pointer lists
  • look into upstream/downstream providers/users of this data to see if we can get away with fewer data transfers
  • if quantities like psiMinv_temp, psiV, dpsiV, new_ratios_to_ref, etc are already up to date on device when this is called, then just do the device copy_batched instead of copy on host followed by H2D transfer
  • maybe some opportunity for async copy/copy_batched with H2D transfer? (similar to what's described here)

There is a clear need for more refactoring and optimization of this and related functions, and that is why some things are structured the way that they are. I've made some choices (and left in some comments) that I think will make some of the next steps clearer/easier, but if anyone is opposed to that I can tidy it up more.

current code behavior

here's a table that summarizes the current behavior of acceptMove:

  • the first column shows which data is copied
  • the next three columns show for which cases of UpdateMode the relevant copy occurs (* denotes copies that only occur when using spinors)
  • the 'updateTo' column shows which arrays are updated on the device after the host copy
    • these are the same in all cases of UpdateMode, so even when UpdateMode==ORB_PBYP_RATIO, an updateTo is called for dpsiM, even though dpsiM is not modified by this function for that value of UpdateMode
  • 'in mw_res' shows which arrays are already in the mw resource collection (same for both arrays in each row).
    • ✅: in mw_res
    • ❌*: not in mw_res, but is dual-space allocated
    • ❌: not in mw_res and not dual-space allocated
| copy | ratio | partial | default | updateTo | in mw_res |
| --- | --- | --- | --- | --- | --- |
| psiMinv_temp -> psiMinv | ✅ | ✅ | ✅ | ✅ | ✅ |
| psiV[:] -> TpsiM[:,WorkingIndex] | ✅ | ✅ | ✅ | ✅ | ✅ |
| new_ratios_to_ref_ -> ratios_to_ref_ | ✅ | ✅ | ✅ | ✅ | ❌* |
| psiV[:] -> psiM[WorkingIndex,:] | ✅ | ✅ | ✅ | ✅ | ✅ |
| dpsiV[:] -> dpsiM[WorkingIndex,:] | | ✅ | ✅ | ✅ | ✅ |
| d2psiV[:] -> d2psiM[WorkingIndex,:] | | | ✅ | | ❌* |
| dspin_psiV[:] -> dspin_psiM[WorkingIndex,:] | | ✅* | ✅* | | ❌ |
| new_grads -> grads | | ✅ | ✅ | | ❌ |
| new_lapls -> lapls | | | ✅ | | ❌ |
| new_spingrads -> spingrads | | ✅* | ✅* | | ❌ |

restore is much simpler:

| copy | updateTo | in mw_res |
| --- | --- | --- |
| psiMinv -> psiMinv_temp | | ✅ |
| psiM[WorkingIndex,:] -> TpsiM[:,WorkingIndex] | | ✅ |

What type(s) of changes does this code introduce?

  • Refactoring (no functional changes, no api changes)

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Checklist

  • No. This PR is up to date with the current state of 'develop'
  • Yes. Code added or changed in the PR has been clang-formatted
  • N/A. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • N/A. Documentation has been added (if appropriate)

@kgasperich (Contributor, Author)

There are three different cases for data members of MultiDiracDeterminant that are affected by mw_accept_rejectMove:
The ones marked with (*) are updated on the device in the current code (in MultiDiracDeterminant::acceptMove)

  • already included in the MW resource
    • psiMinv_temp
    • psiV
    • dpsiV
    • (*) psiMinv
    • (*) TpsiM
    • (*) psiM
    • (*) dpsiM
  • not included in the MW resource but are dual-space allocated
    • new_ratios_to_ref_
    • (*) ratios_to_ref_
    • d2psiV
    • d2psiM
  • not included in the MW resource and not dual-space allocated
    • dspin_psiV
    • dspin_psiM
    • new_grads
    • grads
    • new_lapls
    • lapls
    • new_spingrads
    • spingrads

In this PR, I currently have it implemented so that anything that is in the MW resource or handled by a dual-space allocator is updated on the device only. Anything that is not dual-space allocated is only updated on the host.

Currently, there is no separation based on anything like ENABLE_OFFLOAD or Phi->isOMPoffload(), but I can do that if needed; I'm not sure what the state of the data is when this function is called or what it's supposed to be when this function returns (I've assumed for now that any dual-space data is up-to-date on the device when this function is called, and I update it on the device before returning; anything that is not dual-space allocated is assumed to be up-to-date on the host when this function is called, and I update it on the host before returning).
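
As a minimal, self-contained illustration of that assumed policy (DualVector and the function below are stand-ins for the real dual-space containers and members, not QMCPACK code):

#include <algorithm>
#include <vector>

using ValueType = double;

struct DualVector // stand-in for a dual-space allocated container
{
  std::vector<ValueType> host;
  std::vector<ValueType> device; // placeholder for the device allocation
};

void accept_copy_policy(DualVector& ratios_to_ref, const DualVector& new_ratios_to_ref,
                        std::vector<ValueType>& grads, const std::vector<ValueType>& new_grads)
{
  // dual-space / mw-resource data: assumed current on the device, so update the device copy only
  std::copy(new_ratios_to_ref.device.begin(), new_ratios_to_ref.device.end(),
            ratios_to_ref.device.begin());

  // not dual-space allocated: assumed current on the host, so update the host copy only
  std::copy(new_grads.begin(), new_grads.end(), grads.begin());
}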

If the platform is added as a template parameter, at least the copy_batched calls should be handled cleanly by that.
I assume the pointers from device_data() and from the mw resource collection deviceptr_lists should also be correct already (i.e. if compiled without offload, device_data() and data() should both return the same (host) pointer), but it would be good to have verification from someone more familiar with those parts of the code.

@ye-luo @anbenali if you know what data is supposed to be up-to-date upon entering and exiting this function, I'd appreciate any feedback (if not, then I'll start looking at what happens upstream and downstream from here and check with you to see if it makes sense)

I can also add any of the data in the list above to the resource collection if it makes sense to do that, but that could also be a separate PR.

@ye-luo (Contributor) commented Jun 28, 2024

Test this please

@PDoakORNL (Contributor) left a comment:

Preliminary review.

Consider factoring some of this out into more manageable private functions.

for (int iw = 0; iw < wfc_list.size(); iw++)
const int nw = wfc_list.size();
// separate accepted/rejected walker indices
const int n_accepted = std::count(isAccepted.begin(), isAccepted.begin() + nw, true);
Contributor:

Please have this follow the common idiom of

std::count(isAccepted.begin(), isAccepted.end(), true)

This exceptional-looking code led me, and likely other careful readers, to wonder just why isAccepted could be longer than wfc_list. Also, you are tossing the implicit bounds checking that using the begin()/end() iterator pair provides.

The "multi walker" arguments of mw functions must always be the same size; you can and should write the code as if that were true. If you have doubts during development, use an assert; definitely don't write code that "defends" against this defect.
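
In the context of the diff above, the suggested form (with the development-time assert mentioned above) would read:

assert(isAccepted.size() == wfc_list.size()); // mw argument lists are always consistent
const int n_accepted = std::count(isAccepted.begin(), isAccepted.end(), true);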

Contributor (Author):

Yeah, this makes sense; sometimes I try too hard to replicate exactly what the prior logic was, even when (like here) that results in some ugly, non-standard code.

// setup device pointer lists
switch (wfc_leader.UpdateMode)
{
default:
Contributor:

Because there are no real unit tests for this class (just a massive procedural test for the multi-Slater determinant), I'm not sure whether this coverage warning means none of this default case is covered. My reading is that the default case is not covered. I think the ORB_PBYP_PARTIAL/RATIO cases are covered. Seems like a bunch of important code with no coverage.

Before doing a big refactor, why not write the missing unit tests? What's the expected outcome of calling this function?

It seems like this function basically just copies a bunch of state data in response to the isAccepted vector and the internal WorkingIndex state. Significantly, there are three modes of this.

Fill members with test data, initialize WorkingIndex, call, and check the side effects are correct. You will probably need a testing friend class to break the encapsulation. See src/Estimators/OneBodyDensityMatricesTests.h.
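
A hypothetical skeleton of such a test, in the Catch2 style used by the existing unit tests (the accessor class and the checks are placeholders sketching the suggestion, not existing code):

#include "catch.hpp"

namespace qmcplusplus
{
// hypothetical friend of MultiDiracDeterminant used to reach its private members,
// following the pattern of src/Estimators/OneBodyDensityMatricesTests.h
class MultiDiracDeterminantTestAccess;

TEST_CASE("MultiDiracDeterminant::mw_accept_rejectMove side effects", "[wavefunction][fermion]")
{
  // Common setup: build a small MultiDiracDeterminant, fill psiV, psiMinv_temp,
  // new_ratios_to_ref_, ... with known values, set WorkingIndex, and choose an
  // isAccepted pattern that mixes accepted and rejected walkers.

  SECTION("ORB_PBYP_RATIO")
  {
    // call mw_accept_rejectMove, then CHECK that exactly the arrays marked for this
    // mode in the table above were overwritten for accepted walkers, and that
    // rejected walkers kept their previous values
  }
  SECTION("ORB_PBYP_PARTIAL")
  {
    // same pattern for the partial-update mode
  }
  SECTION("default UpdateMode")
  {
    // same pattern for the default mode
  }
}
} // namespace qmcplusplus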

Contributor (Author):

> Before doing a big refactor, why not write the missing unit tests? What's the expected outcome of calling this function?

My plan going into this refactor was that it wasn't going to be a huge change, and that I was just planning to group some similar work together (the rejects and accepts) without any other change in behavior. I don't yet have a complete picture of everything else upstream and downstream from this function and what different cases might need to be handled.

> It seems like this function basically just copies a bunch of state data in response to the isAccepted vector and the internal WorkingIndex state.

This is also my take on what's happening, but the part I'm not sure about (and what I asked about in my comment above) is which state data I can assume to be up to date on the host/device before entering this function, and what needs to be up to date (and where) after exiting.
This is the only point where I'd planned to possibly deviate from the current functionality. For example, in the ORB_PBYP_RATIO case for accepted moves, we do the relevant copying (host to host) to psiMinv, psiM, TpsiM, and ratios_to_ref_, and then we do an H2D update of all of those in addition to dpsiM. If there are cases where all of these (and psiV, psiMinv_temp, and new_ratios_to_ref_) are already up to date on the device, and if anything downstream from here will only use them on the device, then it would make sense to just do the copy on the device.

}

// pointers to data for only accepted walkers
OffloadVector<ValueType*> psiMinv_temp_acc_deviceptr_list;
Contributor:

Since these aren't retained, I'm not sure why the copy and pointer-list building cases aren't fused, with these declared inside those cases. Put blocks in the cases and these lists will be constructed only if you are actually using them.
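
Roughly what that suggestion amounts to (a fragment reusing names from the diff above; only the scoping is the point):

switch (wfc_leader.UpdateMode)
{
case ORB_PBYP_RATIO:
{
  // declared and built only inside the case that actually uses it
  OffloadVector<ValueType*> psiMinv_temp_acc_deviceptr_list;
  // ... resize, fill, and use the list for this mode only ...
  break;
}
default:
  // ... the other modes, each with their own scoped lists ...
  break;
}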

Contributor (Author):

Yeah, this got a bit messy; I got stuck between not wanting to duplicate code and not wanting to make the fall-through logic of the switches hard to follow, so I retained the original logic for the actual copying and used different logic for the pointer lists to avoid duplication (and to only resize the ones I'm actually using), and I ended up with something worse than if I'd just committed to one or the other.

@prckent (Contributor) commented Jul 16, 2024

@kgasperich Are you clear on the next steps?

@kgasperich (Contributor, Author)

> @kgasperich Are you clear on the next steps?

@prckent yes, I talked to Ye a few days ago and I think the next steps are mostly clear. I'm going to resolve some of the data movement issues by more closely mirroring the behavior of the existing accept function, and then performance improvements on top of that (e.g. not doing unnecessary H2D transfers in cases where we can just do a device copy) will be added in a subsequent PR.

@kgasperich (Contributor, Author) commented Aug 19, 2024

I've updated this so that it has the same data movement patterns as the old code, but there are some differences in how it's done compared to the old code.

  • I separated accepted and rejected moves so that similar work is all done together (i.e. loop over all accepted then all rejected, rather than loop over all walkers and branch based on accept/reject for each walker)
  • I built pointer lists for the data that needs to be copied
    • old
      • loop over walkers, for each walker, do necessary copies with std::copy and do H2D transfer
    • new:
      • loop over accepted walkers, build list of pointers to the start of each set of data elements (psiMinv_temp, psiMinv, psiV, TpsiM, etc.) that need to be copied
      • loop over accepted walkers, copy with BLAS::copy using pointer lists (can be trivially replaced with copy_batched in subsequent PR when data movement requirements are figured out)
      • loop over accepted walkers, do H2D transfer
      • similar for rejected walkers

todo in subsequent PR:

  • change copy loop to copy_batched (if available on platform?)
  • consider pre-allocating some work space for acc/rej idx lists and pointer lists
  • look into upstream/downstream providers/users of this data to see if we can get away with fewer data transfers
  • if quantities like psiMinv_temp, psiV, dpsiV, new_ratios_to_ref, etc are already up to date on device when this is called, then just do the device copy_batched instead of copy on host followed by H2D transfer
  • maybe some opportunity for async copy/copy_batched with H2D transfer? (similar to what's described here)

There is a clear need for more refactoring and optimization of this and related functions, and that is why some things are structured the way that they are. I've made some choices (and left in some comments) that I think will make some of the next steps clearer/easier, but if anyone is opposed to that I can tidy it up more.

@ye-luo (Contributor) commented Aug 19, 2024

Test this please

@ye-luo (Contributor) commented Aug 20, 2024

CI passes. @kgasperich could you update the PR description to reflect what the current code does?

@kgasperich (Contributor, Author)

> CI passes. @kgasperich could you update the PR description to reflect what the current code does?

I edited the description to summarize what it does and add some more info that might be useful in future refactoring

@kgasperich changed the title from "[WIP] batching in MultiDiracDeterminant::mw_accept_rejectMove" to "batching in MultiDiracDeterminant::mw_accept_rejectMove" on Aug 21, 2024
@prckent (Contributor) left a comment:

Thanks Kevin. I think this is more than enough of a step forward for this PR! For posterity it would be helpful to note any before/after timings that you have for this.

@prckent (Contributor) commented Aug 23, 2024

Test this please

@prckent merged commit 9a4dd3b into QMCPACK:develop on Aug 23, 2024
39 of 41 checks passed