dump walkers periodically for post-processing #4940

Open
wants to merge 6 commits into base: develop
Conversation

Paul-St-Young (Contributor) commented Mar 13, 2024

Proposed changes

Repurpose unused input "record_configs" to set frequency of dumping walkers for post-processing.

Example input of a simple harmonic oscillator tb38_msho.zip

A normal checkpoint file {prefix}.config.h5 is overwritten at every checkpoint and contains

block                    Dataset {SCALAR}
number_of_walkers        Dataset {SCALAR}
walker_partition         Dataset {2}
walker_weights           Dataset {65}
walkers                  Dataset {65, 1, 3}

The walkers dataset contains the snapshot at checkpoint.

With "record_configs" > 0, each addition to config.h5 is preserved. Instead of a single walkers dataset, the file now identifies the block at which each set of walkers was dumped:

block                    Dataset {SCALAR}
number_of_walkers        Dataset {SCALAR}
walker_partition         Dataset {2}
walker_partition32       Dataset {2}
walker_partition64       Dataset {2}
walker_weights           Dataset {64}
walker_weights32         Dataset {63}
walker_weights64         Dataset {64}
walkers                  Dataset {64, 1, 3}
walkers32                Dataset {63, 1, 3}
walkers64                Dataset {64, 1, 3}

The suffix 32 in the name walkers32 indicates that these walkers were dumped at block 32.
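For post-processing, the dumped blocks can be recovered from the dataset names alone. Below is a minimal sketch in Python; the `dumped_blocks` helper is hypothetical (not part of this PR), and the hard-coded name list stands in for the keys of the opened HDF5 file.

```python
import re

def dumped_blocks(dataset_names):
    """Return the sorted block indices encoded in walkers{block} names.

    The bare 'walkers' dataset is the latest checkpoint snapshot and
    carries no block suffix, so it is skipped.
    """
    blocks = []
    for name in dataset_names:
        m = re.fullmatch(r"walkers(\d+)", name)
        if m:
            blocks.append(int(m.group(1)))
    return sorted(blocks)

# Names as listed above for a run with record dumps at blocks 32 and 64.
names = ["block", "number_of_walkers", "walker_partition",
         "walker_partition32", "walker_partition64",
         "walker_weights", "walker_weights32", "walker_weights64",
         "walkers", "walkers32", "walkers64"]
print(dumped_blocks(names))  # [32, 64]
```

Given a block index b, the matching weights and coordinates would then be read from walker_weights{b} and walkers{b}.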

What type(s) of changes does this code introduce?

  • New feature

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

Intel workstation

Checklist

Update the following with a yes where the items apply. If you're unsure about any of them, don't hesitate to ask. This is
simply a reminder of what we are going to look for before merging your code.

  • Yes. This PR is up to date with the current state of 'develop'
  • No. Code added or changed in the PR has been clang-formatted
  • Yes/No. This PR adds tests to cover any new code, or to catch a bug that is being fixed
  • Yes/No. Documentation has been added (if appropriate)

@Paul-St-Young Paul-St-Young changed the title [WIP] dump walkers periodically for post-processing dump walkers periodically for post-processing Mar 14, 2024
prckent (Contributor) commented Mar 15, 2024

Hi Paul. Thanks for adding/restoring this ghost feature. As we have discussed, this functionality is useful for a variety of purposes and could clearly be evolved further.

For now, two questions:

  1. How much testing have you done that the energies and weights are correct and that, when averaged, they match what is in the scalar files?
  2. Does this work with vmc_batched?

ye-luo (Contributor) left a comment

I only did an initial pass and don't fully understand the change yet. I have requested more documentation.

@@ -52,7 +52,7 @@ class HDFWalkerOutput
/** dump configurations
* @param w walkers
*/
bool dump(const WalkerConfigurations& w, int block);
bool dump(const WalkerConfigurations& w, int block, const bool identify_block=false);
Contributor

Please document what identify_block=true does.

std::vector<std::string> names = {"block", hdf::num_walkers, partition_name, dataset_name, weights_name};
for (auto aname : names)
if (hout.is_dataset(aname)) hout.unlink(aname);
}
Contributor

Could you mention the dataset name used for recording instead of checkpointing?

Contributor Author

The dataset name for recording is walkers{block_idx}, whereas the name for a checkpoint is simply walkers.

@@ -62,7 +62,7 @@ class HDFWalkerOutput
std::array<BufferType, 2> RemoteData;
std::array<std::vector<QMCTraits::FullPrecRealType>, 2> RemoteDataW;
int block;
void write_configuration(const WalkerConfigurations& W, hdf_archive& hout, int block);
void write_configuration(const WalkerConfigurations& W, hdf_archive& hout, int block, const bool identify_block);
Contributor

Please document what identify_block=true does.

ye-luo (Contributor) commented Mar 26, 2024

The ASan failure was found to be caused by the CI OS image. It has been fixed.

ye-luo (Contributor) commented Mar 26, 2024

I also noticed that the added change only affects the legacy drivers. Out of curiosity, what is missing for you to adopt the batched driver?

PDoakORNL (Contributor)
With regard to the batched drivers: I want to remove WalkerConfigurations as an object and heavily refactor away the "Walker-->Particle" objects. The Walker object, and some of the data it stuffs into ParticleSet, is full of values that should be contiguous vectors at crowd or even rank scope, but are instead stored as an AOS with respect to walkers and mapped into a byte buffer. As it stands, all of this walker-dumping code is predicated on that layout, with pointer arithmetic and so on. The less dependency there is on this, the better. I think we would be better off writing an "estimator" that writes the per-walker values out. The current approach exposes legacy walker implementation details that we don't want to carry forward; to the extent that some still exist, it is because refactoring ParticleSet has been blocked for several years now.

If you want to dump walkers, it should be a crowd- or population-level operation so it can be synchronized properly and done efficiently. If you want to use parallel HDF5, the appropriate level would be MCPopulation. Ideally it would be done from a separate I/O thread while real work is being done by the main crowd threads.
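The separate-I/O-thread idea above can be sketched as a simple producer/consumer pattern. This is an illustrative toy, not QMCPACK code: `io_writer` and `run_blocks` are hypothetical names, and the "write" is just a list append standing in for an HDF5 dump.

```python
import threading
import queue

# Sentinel signalling that no more snapshots will arrive.
_DONE = object()

def io_writer(snapshots, written):
    """Consume walker snapshots off the compute threads.

    In a real driver this would call into (parallel) HDF5; here we just
    record (block, walker count) to illustrate the threading pattern.
    """
    while True:
        item = snapshots.get()
        if item is _DONE:
            break
        block, walkers = item
        written.append((block, len(walkers)))  # stand-in for an HDF5 dump

def run_blocks(num_blocks, record_period):
    """Main 'crowd' loop: hand off a snapshot every record_period blocks."""
    snapshots = queue.Queue()
    written = []
    writer = threading.Thread(target=io_writer, args=(snapshots, written))
    writer.start()
    for block in range(1, num_blocks + 1):
        walkers = [[0.0, 0.0, 0.0]] * 64  # fake per-walker coordinates
        if block % record_period == 0:
            snapshots.put((block, walkers))  # enqueue; compute continues
    snapshots.put(_DONE)
    writer.join()
    return written

print(run_blocks(64, 32))  # [(32, 64), (64, 64)]
```

The compute loop never blocks on I/O beyond the queue handoff, which is the property being argued for here; a real implementation would additionally need to coordinate the dump across MPI ranks.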

Paul-St-Young (Contributor, Author) commented Apr 11, 2024

@prckent apologies for the delayed response. To answer your questions:

How much testing have you done that the energies and weights are correct and that, when averaged, they match what is in the scalar files?

I did not test the energies and weights. The same dump function used for a normal checkpoint is called, so I just assumed it would be correct.

Does this work with vmc_batched?

No, I did not add this feature to the batched drivers.

The current implementation breaks down in an MPI run. I think this is due to problems with parallel HDF5 file writing.
@ye-luo I added some more details of this feature in the PR description. Does this clear up the meaning of identify_block?
I will also add some comments in the source code.

@ye-luo @PDoakORNL I would love some help on getting this feature to work.
It's not clear to me exactly what needs to be done to make parallel HDF5 writing correct.

prckent mentioned this pull request Jun 5, 2024
prckent (Contributor) commented Jun 28, 2024

Hi Paul - Is #5019 working for you? If so, we can close this PR.

ye-luo (Contributor) commented Jun 28, 2024

@prckent please keep this PR alive. It is needed for different uses such as replay.

Paul-St-Young (Contributor, Author)
@prckent yes, #5019 does what I need.
It outputs a bit too much information, but it seems straightforward to modify, so I can go from there.
I don't mind if you decide to close this PR.

@ye-luo would it be easier to add a "replay" flag to the new WalkerLog class?
I think it just needs to output walker coordinates and weights at every step without all the other debugging contents.

ye-luo (Contributor) commented Jun 30, 2024

WalkerLog has a very different design logic and use cases. For example, it writes one file per MPI rank instead of using scalable I/O. The data are not organized by step and thus require extra work to recover a true replay.
