Question from Wouter in ePIC compSW MatterMost channel:

Wouter Deconinck (he/him), 8:49 PM:
@david Lawrence On the EICrecon timeout and warmup timeout, is it technically possible to print a warning for instances when we hit, say, over 80% of the timeout but still succeed, including some info from the 'recovered' thread about what it was doing that took so long? That would be very helpful in debugging timeouts.
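For reference, the first part of the request amounts to timing the work and comparing against the timeout budget. A minimal sketch (hypothetical names throughout, not EICrecon's actual code) of what such a near-timeout warning could look like:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the real per-event (or warmup) work.
void DoEventProcessing() {
    std::this_thread::sleep_for(std::chrono::milliseconds(850));
}

// Time the work; if it succeeds but uses more than 80% of the allotted
// timeout, emit a warning so that near-misses become visible.
void ProcessWithNearTimeoutWarning(double timeout_s) {
    using clock = std::chrono::steady_clock;
    auto start = clock::now();

    DoEventProcessing();

    double elapsed_s = std::chrono::duration<double>(clock::now() - start).count();
    if (elapsed_s > 0.8 * timeout_s) {
        std::cerr << "Warning: event took " << elapsed_s << " s, which is over 80% of the "
                  << timeout_s << " s timeout\n";
    }
}

int main() {
    ProcessWithNearTimeoutWarning(1.0);  // 0.85 s > 0.8 s, so the warning fires
}
```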
The second part is harder since by the time we recognize that we are approaching the timeout, it's too late to turn on the call stack recording to find out which source/factory/processor is taking too long. I'm still thinking through the design implications, but here goes, ordered from "Nathan is optimistic" to "Nathan is pessimistic":
1. We could just include instructions in the warning message about how to enable call stack recording to get the timing information for each factory/processor the next time they run. The only downside I see is that users don't read error messages or follow the directions contained therein.
2. We could tweak the call stack recorder so that we always know what our stack is, and expose this information to the arrow timeout mechanism (see the first sketch after this list). This adds a bit of overhead. (Note: we would not add this information to the overall call graph unless the parameter record_call_stack=1 is set, which saves us a lot of string compares and memory use.)
3. We could enable call stack recording for the next n events and hope that the other threads encounter the same slowdown. I don't consider this particularly likely, since the timeout is often caused by loading a big geometry file from a network disk or connecting to a calibrations database over a bad network, which happens once overall (as opposed to once per thread or 'fresh' event).
4. We could capture the 'real' stack trace for the slow thread the same way we do with the USR1 signal (see the second sketch after this list). The problem is that these stack traces are usually useless: we compile with optimization and without debug symbols, and the frames know nothing about the source/factory/processor abstractions, which means the top 10 or so frames will be deep in the internals of ROOT or whatever JSON library we are using.
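To illustrate option 2, here is a minimal sketch under the assumption that each worker thread owns a slot holding the name of the component it is currently running. The names (CurrentComponent, ComponentScope, and the factory name) are invented for illustration, not JANA2's actual internals. The worker pays one atomic store on component entry and exit, and the timeout supervisor can read the slot at any time:

```cpp
#include <atomic>
#include <iostream>

// One slot per worker thread. The worker writes the name of the component
// it is about to run; the supervisor reads it when a timeout approaches.
// An atomic pointer keeps the cross-thread read race-free without locking.
struct CurrentComponent {
    std::atomic<const char*> name{"idle"};
};

// RAII guard the worker uses to mark entry/exit of a component call.
// The stored strings must have static storage duration (string literals,
// interned component names) since another thread dereferences the pointer.
class ComponentScope {
public:
    ComponentScope(CurrentComponent& slot, const char* name)
        : m_slot(slot), m_prev(slot.name.exchange(name)) {}
    ~ComponentScope() { m_slot.name.store(m_prev); }
private:
    CurrentComponent& m_slot;
    const char* m_prev;  // restore the caller's name, so nesting works
};

int main() {
    CurrentComponent slot;  // in JANA2 this would live alongside each worker/arrow
    {
        ComponentScope scope(slot, "CalorimeterClusterFactory::Process");
        // ... long-running work; meanwhile the supervisor can do:
        std::cout << "Supervisor sees worker in: " << slot.name.load() << "\n";
    }
    std::cout << "Supervisor sees worker in: " << slot.name.load() << "\n";
}
```

Restoring the previous name in the destructor means nested factory calls report the innermost component, which is usually the one responsible for the stall.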
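And to illustrate why option 4's stack traces disappoint: on Linux/glibc, a USR1-style dump boils down to something like the sketch below (not the actual signal handler). Without debug symbols the output is raw addresses and mangled names from whatever libraries happen to be on the stack:

```cpp
#include <csignal>
#include <execinfo.h>
#include <unistd.h>

// Write up to 32 raw frames to stderr. backtrace_symbols_fd does not
// allocate (unlike backtrace_symbols), so it is usable from a handler.
void DumpBacktrace() {
    void* frames[32];
    int n = backtrace(frames, 32);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

int main() {
    std::signal(SIGUSR1, [](int) { DumpBacktrace(); });
    std::raise(SIGUSR1);  // simulate someone sending USR1 to the slow process
}
```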
I'm going to start by implementing (1) and see what happens when I try for (2).