dump walkers periodically for post-processing #4940

Open
wants to merge 6 commits into base: develop
Conversation

Paul-St-Young (Contributor) commented Mar 13, 2024

Proposed changes

Repurpose unused input "record_configs" to set frequency of dumping walkers for post-processing.

Example input of a simple harmonic oscillator tb38_msho.zip

A normal checkpoint file {prefix}.config.h5 is overwritten at every checkpoint and contains

block                    Dataset {SCALAR}
number_of_walkers        Dataset {SCALAR}
walker_partition         Dataset {2}
walker_weights           Dataset {65}
walkers                  Dataset {65, 1, 3}

The walkers dataset contains the snapshot at checkpoint.

With "record_configs" > 0, each addition to config.h5 is preserved. Instead of a single walkers dataset, the file now identifies the block at which each set of walkers was dumped:

block                    Dataset {SCALAR}
number_of_walkers        Dataset {SCALAR}
walker_partition         Dataset {2}
walker_partition32       Dataset {2}
walker_partition64       Dataset {2}
walker_weights           Dataset {64}
walker_weights32         Dataset {63}
walker_weights64         Dataset {64}
walkers                  Dataset {64, 1, 3}
walkers32                Dataset {63, 1, 3}
walkers64                Dataset {64, 1, 3}

The suffix 32 in the name walkers32 indicates that these walkers were dumped at block 32.
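For post-processing, the dumped blocks can be recovered from the dataset names alone. Below is a minimal sketch in Python; the `dumped_blocks` helper is hypothetical (not part of this PR), and the hard-coded name list stands in for the keys of the opened HDF5 file.

```python
import re

def dumped_blocks(dataset_names):
    """Return the sorted block indices encoded in walkers{block} names.

    The bare 'walkers' dataset is the latest checkpoint snapshot and
    carries no block suffix, so it is skipped.
    """
    blocks = []
    for name in dataset_names:
        m = re.fullmatch(r"walkers(\d+)", name)
        if m:
            blocks.append(int(m.group(1)))
    return sorted(blocks)

# Names as listed above for a run with record dumps at blocks 32 and 64.
names = ["block", "number_of_walkers", "walker_partition",
         "walker_partition32", "walker_partition64",
         "walker_weights", "walker_weights32", "walker_weights64",
         "walkers", "walkers32", "walkers64"]
print(dumped_blocks(names))  # [32, 64]
```

Given a block index b, the matching weights and coordinates would then be read from walker_weights{b} and walkers{b}.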

What type(s) of changes does this code introduce?

  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Intel workstation

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with the current state of 'develop'
  • No. Code added or changed in the PR has been clang-formatted
  • Yes/No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes/No. Documentation has been added (if appropriate)

@Paul-St-Young Paul-St-Young changed the title [WIP] dump walkers periodically for post-processing dump walkers periodically for post-processing Mar 14, 2024
prckent (Contributor) commented Mar 15, 2024

Hi Paul. Thanks for adding/restoring this ghost feature. As we have discussed, this functionality is useful for a variety of purposes and could clearly be evolved further.

For now, two questions:

  1. How much testing have you done that the energies and weights are correct and that, when averaged, they match what is in the scalar files?
  2. Does this work with vmc_batched?

ye-luo (Contributor) left a comment

I only did an initial pass and don't fully understand the change yet. I have requested more documentation.

@@ -52,7 +52,7 @@ class HDFWalkerOutput
/** dump configurations
* @param w walkers
*/
bool dump(const WalkerConfigurations& w, int block);
bool dump(const WalkerConfigurations& w, int block, const bool identify_block=false);
Contributor

Please document what identify_block=true does.

std::vector<std::string> names = {"block", hdf::num_walkers, partition_name, dataset_name, weights_name};
for (auto aname : names)
if (hout.is_dataset(aname)) hout.unlink(aname);
}
Contributor

Could you mention the dataset name used for recording instead of checkpointing?

Contributor Author

The dataset name for recording is walkers{block_idx}, whereas the name for a checkpoint is simply walkers.

@@ -62,7 +62,7 @@ class HDFWalkerOutput
std::array<BufferType, 2> RemoteData;
std::array<std::vector<QMCTraits::FullPrecRealType>, 2> RemoteDataW;
int block;
void write_configuration(const WalkerConfigurations& W, hdf_archive& hout, int block);
void write_configuration(const WalkerConfigurations& W, hdf_archive& hout, int block, const bool identify_block);
Contributor

Please document what identify_block=true does.

ye-luo (Contributor) commented Mar 26, 2024

The ASan failure was found to be caused by the CI OS image. It has been fixed.

ye-luo (Contributor) commented Mar 26, 2024

I also noticed that the added change only affects the legacy drivers. Out of curiosity, what is missing for you to adopt the batched driver?

PDoakORNL (Contributor)
With regard to the batched drivers: I want to remove WalkerConfigurations as an object and heavily refactor away the "Walker-->Particle" objects. The Walker object, and some of the data it stuffs into ParticleSet, is full of values that should be contiguous vectors at crowd or even rank scope, but are instead stored as an AOS with respect to walkers and mapped into a byte buffer. As it stands, all of this walker-dumping code is predicated on that layout, with pointer arithmetic and so on. The less dependency there is on this, the better. I think we would be better off writing an "estimator" that writes the per-walker values out. The current approach exposes legacy walker implementation details that we don't want to carry forward; to the extent that some still exist, it is because refactoring ParticleSet has been blocked for several years now.

If you want to dump walkers, it should be a crowd- or population-level operation so it can be synchronized properly and done efficiently. If you want to use parallel HDF5, the appropriate level would be MCPopulation. Ideally it would be done from a separate I/O thread while real work is being done by the main crowd threads.
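The separate-I/O-thread idea above can be sketched as a simple producer/consumer pattern. This is an illustrative toy, not QMCPACK code: `io_writer` and `run_blocks` are hypothetical names, and the "write" is just a list append standing in for an HDF5 dump.

```python
import threading
import queue

# Sentinel signalling that no more snapshots will arrive.
_DONE = object()

def io_writer(snapshots, written):
    """Consume walker snapshots off the compute threads.

    In a real driver this would call into (parallel) HDF5; here we just
    record (block, walker count) to illustrate the threading pattern.
    """
    while True:
        item = snapshots.get()
        if item is _DONE:
            break
        block, walkers = item
        written.append((block, len(walkers)))  # stand-in for an HDF5 dump

def run_blocks(num_blocks, record_period):
    """Main 'crowd' loop: hand off a snapshot every record_period blocks."""
    snapshots = queue.Queue()
    written = []
    writer = threading.Thread(target=io_writer, args=(snapshots, written))
    writer.start()
    for block in range(1, num_blocks + 1):
        walkers = [[0.0, 0.0, 0.0]] * 64  # fake per-walker coordinates
        if block % record_period == 0:
            snapshots.put((block, walkers))  # enqueue; compute continues
    snapshots.put(_DONE)
    writer.join()
    return written

print(run_blocks(64, 32))  # [(32, 64), (64, 64)]
```

The compute loop never blocks on I/O beyond the queue handoff, which is the property being argued for here; a real implementation would additionally need to coordinate the dump across MPI ranks.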

Paul-St-Young (Contributor, Author) commented Apr 11, 2024

@prckent apologies for the delayed response. To answer your questions:

How much testing have you done that the energies and weights are correct and that, when averaged, they match what is in the scalar files?

I did not test the energies and weights. The same dump function used for a normal checkpoint is called, so I just assumed it would be correct.

Does this work with vmc_batched?

No, I did not add this feature to the batched drivers.

The current implementation breaks down in an MPI run. I think this is due to problems with parallel HDF5 file writing.
@ye-luo I added some more details of this feature in the PR description. Does this clear up the meaning of identify_block?
I will also add some comments in the source code.

@ye-luo @PDoakORNL I would love some help on getting this feature to work.
It's not clear to me exactly what needs to be done to make parallel HDF5 writing correct.

prckent mentioned this pull request Jun 5, 2024
prckent (Contributor) commented Jun 28, 2024

Hi Paul - Is #5019 working for you? If so, we can close this PR.

ye-luo (Contributor) commented Jun 28, 2024

@prckent please keep this PR alive. It is needed for different uses such as replay.

Paul-St-Young (Contributor, Author)
@prckent yes, #5019 does what I need.
It outputs a bit too much information, but it seems straightforward to modify, so I can go from there.
I don't mind if you decide to close this PR.

@ye-luo would it be easier to add a "replay" flag to the new WalkerLog class?
I think it just needs to output walker coordinates and weights at every step without all the other debugging contents.

ye-luo (Contributor) commented Jun 30, 2024

WalkerLog has a very different design logic and use cases. For example, it writes one file per MPI rank instead of using scalable I/O. The data are not organized by step and thus require extra work to recover a true replay.
