Replies: 14 comments 16 replies
-
I confirm that I can see a Dask status page at http://localhost:8787/status, but the graphs for the actual workers don't change much in the baseline computation. I think I'll need to dig deeper into ogcore.execute to understand more. I noticed that the calibration computation takes about 10 minutes before getting to the actual policy computation, so I make a pickle file of the initially created Calibration object so I can just skip to the policy computation.
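A minimal sketch of that caching step, assuming the Calibration object comes from the OG-USA calibrate module used with the OG-Core example scripts; the import path, constructor arguments, and cache file name here are assumptions, not the exact code used:

```python
import os
import pickle

CACHE = "calibration.pkl"  # assumed cache file name

def get_calibration(p):
    """Build the slow Calibration object once and reuse the pickle afterwards."""
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    from ogusa.calibrate import Calibration  # assumed import path
    c = Calibration(p)  # the ~10 minute step
    with open(CACHE, "wb") as f:
        pickle.dump(c, f)
    return c
```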
-
@talumbau Great, thanks for looking into this. Please let us know if there are specific questions we can help you with.
-
Even the baseline policy has been running for about 24 hours. That's unexpected, at least for me. The first significant part of the calculation appears to be here in SS: https://github.com/PSLmodels/OG-Core/blob/master/ogcore/SS.py#L229-L259 where some root finding is done. The loop has no inter-loop dependencies, so the idea is just to create a series of lazy computations and give them to the Dask client to execute in parallel. That seems reasonable, but I'm not sure it's making good progress. Some print statements that come later in that file don't appear in my hung output, so I think something is going wrong in this part of the code.
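For reference, here is a minimal sketch of that pattern (not OG-Core's exact code): each iteration becomes a lazy task with no inter-loop dependencies, and the whole batch is handed to the Dask client at once. The toy solver and target values are placeholders.

```python
from dask import delayed
from distributed import Client
import scipy.optimize as opt

def solve_one(target, guess):
    # hypothetical per-iteration root find, standing in for the real solver
    return opt.root(lambda x: x**2 - target, guess).x

if __name__ == "__main__":
    client = Client(n_workers=4)
    targets = [2.0, 3.0, 5.0, 7.0]
    lazy = [delayed(solve_one)(t, 1.0) for t in targets]  # no inter-loop dependencies
    futures = client.compute(lazy)    # starts work on the workers
    results = client.gather(futures)  # blocks here until every task finishes
    print(results)
    client.close()
```

If the blocking gather step never returns, the symptom would look like exactly this: a run that appears idle on the dashboard without ever raising an error.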
-
@talumbau This is not my experience. I've had some hanging like this in the reform policy run, but not in the baseline. I will try to reproduce this today or over the weekend. Can I ask: what hardware and version of Python are you using?
-
Yes, please do. It might be that something changed recently that only shows up with a completely clean environment setup. The machine is a pretty beefy Linux machine: 256 GB of RAM and a 64-core AMD Ryzen Threadripper.
-
The strange situation continues... I have discovered that the computation is not proceeding past the scatter of the Parameters object here:
So for some reason we never return from the scatter call.
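As a point of reference, a minimal sketch of that scatter pattern (not OG-Core's exact call; the small dict is a stand-in for the real Parameters object):

```python
from distributed import Client

if __name__ == "__main__":
    client = Client(n_workers=4)
    p = {"frisch": 0.4, "beta": 0.96}             # stand-in for the Parameters object
    p_remote = client.scatter(p, broadcast=True)  # ship the object to every worker once
    # p_remote (a Future) is then passed to tasks instead of the full object;
    # a hang at this call suggests the data never finishes replicating to the workers
    client.close()
```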
-
@talumbau Today I:
-
Yep, same versions of
-
The only difference I can see between our setups would be the Mac/Linux difference (and I guess that your Python was built with clang and mine with gcc). I'm not sure whether that would make a difference in the operation of Dask.
-
OK, I re-ran on my machine with the above fix and just used the basic cProfile tool:
Then I used snakeviz to take a look. The claim from this profiling is that TPI takes the majority of the time. The section for SS (the long rectangle) is basically dominated by root finding, which makes sense. The section for TPI shows time spent in either the
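For anyone repeating this step, a minimal, self-contained sketch of profiling with cProfile and viewing the result in snakeviz; the driver function here is just a stand-in for the actual OG-Core baseline run:

```python
import cProfile
import pstats

def run_example():
    # stand-in for the OG-Core baseline run
    return sum(i * i for i in range(10**6))

with cProfile.Profile() as pr:
    run_example()

pr.dump_stats("example.prof")  # then inspect with: snakeviz example.prof
pstats.Stats(pr).sort_stats("cumulative").print_stats(10)
```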
-
I figured out how to use Dask pre-loading to get cProfile data from every worker. See this PR if you want to try this: However, I accidentally did a test run while still using the built-in Dask profiling capability (so I had double profiling going on).
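The preload hook itself can be quite small. A sketch of what such a worker preload script might look like; the file name and output naming are assumptions, and this is not necessarily what the PR does:

```python
# profile_preload.py -- pass to the workers via the --preload option
import cProfile

profiler = cProfile.Profile()

def dask_setup(worker):
    # called in each worker process as it starts
    profiler.enable()

def dask_teardown(worker):
    # called as the worker shuts down; write one profile per worker
    profiler.disable()
    profiler.dump_stats(f"cprofile-worker-{worker.name}.prof")
```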
-
I put some wrappers around the root finding calls to get some data on how often they are called, average time per call, etc. in this PR: https://github.com/PSLmodels/OG-Core/pull/904/files
I dumped that to a file and processed the log with pandas in this gist: https://gist.github.com/talumbau/5ae4f134bbda11ccf231b2b06fa1d63c
That gives this data:
So one observation is that there is one very long call. TPI definitely has more time spent in root finding, although it's interesting that a given call is not very time consuming. At the end of this run, the output was:
so I would not say I totally understand how the sum of the time spent in root finding can exceed the wall-clock time of the computation for some of the worker processes. Each process is set up to have multiple threads, but I don't believe that should help for this kind of computation. A bad situation would be if somehow we are starting multiple calls to the root finder at the same time within a single process.
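A minimal sketch of the kind of timing wrapper described above; the decorator and log format here are illustrative rather than the exact code in PR #904:

```python
import functools
import time

def log_root_find(label, logfile="root_finding.log"):
    """Wrap a root-finding call and append its wall time to a log file."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            with open(logfile, "a") as f:
                f.write(f"{label},{elapsed:.6f}\n")  # simple CSV, easy to load with pandas
            return result
        return wrapper
    return decorator

# usage sketch: wrapped_root = log_root_find("SS_inner_loop")(opt.root)
```

Note that summing per-call times from such a log counts overlapping calls separately, so if several threads in one worker process run root finds concurrently, the total can exceed that process's wall-clock time.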
-
We determined that the long call to The other calls to
-
I made some progress here. Since you mentioned that Parameters doesn't change between iterations, I am using Next step would be to try something similar for TPI.
-
To get started, I found that I wasn't able to easily capture the output of the dask distributed job. I found that I needed to use PYTHONUNBUFFERED in order to get output to the console and to tee:

While the job is running, it looks like you can get the status by pointing a browser to http://localhost:8787/status. That will be my next place to look.

I have found that when I run the example reform policy starting here, the job will not actually run to completion, even after 24+ hours. So the goal will be to get profiling information just from the baseline policy in line 62.
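A minimal sketch of one way to capture that output from Python itself, mirroring the PYTHONUNBUFFERED-plus-tee approach above; the driver script name is an assumption:

```python
import os
import subprocess
import sys

env = dict(os.environ, PYTHONUNBUFFERED="1")  # force unbuffered output from the child
with open("run.log", "w") as log:
    proc = subprocess.Popen(
        [sys.executable, "run_ogcore_example.py"],  # assumed example driver script
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    for line in proc.stdout:
        sys.stdout.write(line)  # echo to the console...
        log.write(line)         # ...and to the log file, like tee
    proc.wait()
```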