
Fix a number of problems when running in multiprocessing mode #312

Merged
merged 25 commits from multiproc_fixes into main on Dec 11, 2022

Conversation

@rmjarvis (Contributor) commented on Dec 8, 2022

The existing code base doesn't work when output.nproc != 1; the problems are mostly related to input objects and proxies. This PR fixes the following things:

  1. The initial problem was with the proxy objects sometimes raising ConnectionErrors when accessing methods. I don't understand what multiprocessing.managers is doing with all this stuff, but I managed to come up with a fix to the GalSim InputProxy function that works. For now, this is done as a monkey patch, so we don't have to rely on a new version of GalSim, but I'll upstream this fix, so eventually we can just depend on GalSim >= v2.4.7 and remove this bit.
  2. We used opsim_meta[...] syntax in a couple of places, which isn't allowed when opsim_meta is a proxy; we need to use get instead (see the sketch after this list).
  3. The TreeRing input object had trouble with self.logger for some reason related to the proxy stuff. Again, I don't really understand this, but I just removed self.logger from that class.
  4. I realized that not only was the atmospheric PSF being built separately in each process, but it was using the file_num rng each time, which is different for each CCD. That's not what we want. We want the AtmosphericPSF to be built once in the main process and then used by all the child processes working on each CCD. Unfortunately, I couldn't figure out how to get Josh's screen sharing trick to work correctly. So for now, I have the main process build the atmosphere and then save it to a file (or just read it in if it's already made), and each child reads in that file (see the sketch after the config example below). So we're still duplicating memory for the atmospheres, but at least it's doing the correct thing and using the same physical atmosphere for all CCDs.
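
To illustrate item 2, here is a minimal, self-contained sketch of why bracket indexing fails through a multiprocessing manager proxy while an explicit get() works. MetaDict and Manager are made-up names, not imSim classes: by default a manager proxy only forwards the wrapped object's public methods, so dunder methods like __getitem__ are not available on it.

from multiprocessing.managers import BaseManager

class MetaDict:
    """Stand-in for an input object holding per-visit metadata."""
    def __init__(self):
        self._data = {'observationId': 12345, 'band': 'r'}
    def get(self, key):
        return self._data[key]
    def __getitem__(self, key):
        return self._data[key]

class Manager(BaseManager):
    pass

Manager.register('MetaDict', MetaDict)

if __name__ == '__main__':
    manager = Manager()
    manager.start()
    meta = manager.MetaDict()   # this is a proxy, not the real object
    print(meta.get('band'))     # works: get() is an exposed public method
    try:
        print(meta['band'])     # fails: __getitem__ is not forwarded by the proxy
    except TypeError as err:
        print('indexing the proxy failed:', err)
    manager.shutdown()

The actual proxies in GalSim are more elaborate than this, but the upshot is the same: opsim_meta[...] has to become opsim_meta.get(...).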

Other small fixes I made along the way:

  1. I got rid of the gaussianFWHM option, since we don't do things that way anymore.
  2. I have it ignore a FITS warning about ROTTELPOS being more than 8 characters, since it was annoying me.
  3. I made the test files all runnable, since many of them weren't.
  4. I have the pytest run of test_wcs_fit only do 1 realization rather than 30, since it was taking ages to run at NERSC.
  5. I fixed a logging error about the sky level that was trying to cast the sky image to a float.

I still need to add some new tests that exercise the code in multiprocessing mode and compare it to single-processing. But @jchiang87, I think this runs your config file on Cori successfully with one addition:

input.tree_rings.only_dets: [R22_S11, R22_S12]
input.atm_psf.save_file:
    type: FormattedStr
    format : atm_psf_%08d-%1d-%s.pkl
    items:
        - { type: OpsimMeta, field: observationId }
        - { type: OpsimMeta, field: snap }
        - { type: OpsimMeta, field: band }

Feel free to change that to some other file name of course, but there needs to be something there.
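
For reference, here is a rough sketch of the save_file behavior described in item 4 above. The function names (build_atmosphere, get_atmosphere) and the dictionary stand-in are made up for illustration; they are not the actual imSim implementation.

import os
import pickle

def build_atmosphere(rng_seed):
    # Placeholder for the expensive atmospheric screen construction.
    return {'seed': rng_seed, 'screens': ['layer0', 'layer1']}

def get_atmosphere(save_file, rng_seed):
    # Main process: build the atmosphere once and cache it to disk.
    # Child processes: the file already exists, so they just read it back,
    # which guarantees all CCDs share the same physical atmosphere.
    if os.path.exists(save_file):
        with open(save_file, 'rb') as f:
            return pickle.load(f)
    atm = build_atmosphere(rng_seed)
    with open(save_file, 'wb') as f:
        pickle.dump(atm, f)
    return atm

if __name__ == '__main__':
    atm = get_atmosphere('atm_psf_00012345-0-r.pkl', rng_seed=57721)
    print(atm['seed'])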

@rmjarvis (Contributor, Author) commented on Dec 8, 2022

Oh, also, I rebased this onto your branch with the atm_psf save_file stuff, although I then moved that functionality around a bit.

@jchiang87 (Collaborator) left a comment

@rmjarvis Apart from the InputProxy change, which I can't really comment on, everything looks good to me.

I've tried this with (output.nproc, output.nfiles) = (2, 3), (3, 2), (8, 9), (32, 189) and everything works as expected. However, when I use output.nproc=2 and output.nfiles=1, I get this error:

  File "/global/u1/j/jchiang8/dev/imSim/imsim/skycat.py", line 265, in SkyCatObj
    obj = skycat.getObj(index, gsparams=gsparams, rng=rng, exp_time=exp_time)
  File "<string>", line 2, in getObj
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_50/conda/envs/lsst-scipipe-5.0.0-ext/lib/python3.10/multiprocessing/managers.py", line 833, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Unserializable message: Traceback (most recent call last):
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_50/conda/envs/lsst-scipipe-5.0.0-ext/lib/python3.10/multiprocessing/managers.py", line 308, in serve_client
    send(msg)
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_50/conda/envs/lsst-scipipe-5.0.0-ext/lib/python3.10/multiprocessing/connection.py", line 211, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2022_50/conda/envs/lsst-scipipe-5.0.0-ext/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'SED._mul_scalar.<locals>.<lambda>'

---------------------------------------------------------------------------

This is from running on Cori, and I see the same error running on my laptop. Given that (output.nproc=2, output.nfiles=1) seems to be a corner case and wouldn't be a common use case anyway, I'd be inclined to go ahead and merge this, since we need the functionality to run in the near term.

@rmjarvis (Contributor, Author) commented on Dec 10, 2022

> I'd be inclined to go ahead and merge this since we need the functionality to run in the near term.

I actually got a different error (related to AtmosphericPSF). But I think in general, this class of error comes from input objects that really ought to be loaded separately in each process; when nfiles=1, GalSim thinks it doesn't have to reload the ones that are already built in the main process. When that happens, it needs to be able to pickle things to communicate them from the main process to the worker processes, which some of our input types don't do reliably.
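
As a minimal illustration of this class of error (this is not GalSim's actual SED code, just the same failure mode): pickle can only serialize a function by reference to a module-level name, so a lambda defined inside another function cannot be sent between processes.

import pickle

def scale_spectrum(factor):
    # Returns a closure, analogous in spirit to the lambda that
    # SED._mul_scalar builds internally.
    return lambda wave: factor * wave

scaled = scale_spectrum(2.0)
try:
    pickle.dumps(scaled)
except (AttributeError, pickle.PicklingError) as err:
    print('cannot pickle the local lambda:', err)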

So let's go ahead and merge this for your upcoming runs, since it seems to work for that use case, and keep making (nproc>1, nfiles=1) work for everything as a stretch goal. I think some of the fix needs to be in GalSim, and some in imSim (e.g., #1187 is relevant for your error).

@jchiang87 merged commit 1c6a598 into main on Dec 11, 2022
@cwwalter deleted the multiproc_fixes branch on June 8, 2023