Partial "Orphaned" Stacktraces Emitted By Agent #178
A likely cause is the limit on stack depth; see if you can avoid this issue by increasing it to 1024 or something.
I may try that. However, I would guess (with fairly high confidence) from the nature of what I'm profiling that these particular stacks are exact duplicates of some other stacks in the log, which would then be exactly as deep. I have to verify this (maybe tomorrow), but if I'm right, it's more likely to be a timer issue.
By the way, wouldn't it make more sense if I tested the converse? Set maxFrames to a low number (e.g. 5) and see what happens? If I end up with the bottom 5 lines of all stacks, that would be the best proof that you're right and that in a couple of instances the depth probably got exceeded. Or am I missing something?
I tried setting maxFrames to 5. Now I indeed get a lot of "stack bottoms" - that is, every "stack" is 5 deep. And they all seem to be "real bottoms": the bottom frame is practically always a "final" frame which doesn't have anything left to call anyway. This indicates that the behaviour of the agent really is to limit the number of frames dumped per stack to maxFrames. If in my original example I was running up against the default value of maxFrames (I don't know what it is), the expected result of such behaviour would be that the "orphaned" partial stacks in those logs have depth == maxFrames. That is not the case: their depth is a lot less, typically between 2 and 6 frames in the example I investigated. If the default maxFrames were something like 50, these orphans would have contained 50 frames. And @nitsanw - thanks for the quick reaction and the interesting suggestion!
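A minimal sketch of this kind of check, assuming the log has already been decoded into one list of method names per dumped trace (the input shape and the set of known root frames here are illustrative assumptions, not honest-profiler API). If maxFrames truncation were the cause, every bucket in the resulting histogram would sit at exactly maxFrames:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative check only: counts how many "orphaned" traces exist per depth.
// A trace is treated as orphaned when its bottom frame is not a known thread
// root frame (e.g. java.lang.Thread.run).
final class OrphanDepthCheck {

    static Map<Integer, Long> orphanDepthHistogram(List<List<String>> traces,
                                                   Set<String> knownRootFrames) {
        Map<Integer, Long> histogram = new TreeMap<>();
        for (List<String> frames : traces) {
            if (frames.isEmpty()) {
                continue;
            }
            // The bottom frame is the last one dumped for a trace.
            String bottom = frames.get(frames.size() - 1);
            if (!knownRootFrames.contains(bottom)) {
                // Orphaned (partial) trace: record its depth.
                histogram.merge(frames.size(), 1L, Long::sum);
            }
        }
        return histogram;
    }
}
```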
@PhRX good reasoning, and indeed from your evidence I would say you are correct to assume max depth is not the issue.
At the bottom you'll find an extract from a real HP log, or at least the output I got by modifying the LogCollector to print each THREADSTART followed by the thread id, and each FRAME followed by the method name and id. (As an aside, the log is a fixed file run through sequentially by the LogCollector, so there are no concurrency issues in play here.)
I did this because I systematically saw some unexplained "top-level" methods in the tree view just under the thread level (i.e. siblings of, for example, the java.lang.Thread.run method), which clearly didn't belong there.
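A rough sketch of that kind of debug printout, using a hypothetical event model that only mirrors the THREADSTART/FRAME lines shown in the extract below (these are not the actual LogCollector types):

```java
// Hypothetical event model; illustrative only, not honest-profiler API.
sealed interface LogEvent permits ThreadStart, Frame {}
record ThreadStart(long threadId) implements LogEvent {}
record Frame(String methodName, long methodId) implements LogEvent {}

final class DebugDump {
    // Prints each event in the order it appears in the log file, producing
    // output in the same shape as the extract at the end of this report.
    static void dumpEvents(Iterable<LogEvent> events) {
        for (LogEvent event : events) {
            if (event instanceof ThreadStart start) {
                System.out.println("THREADSTART " + start.threadId());
            } else if (event instanceof Frame frame) {
                System.out.println("FRAME " + frame.methodName() + " " + frame.methodId());
            }
        }
    }
}
```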
The 2 lines in the middle, between 2 THREADSTART events, look like the bottom of a stack (in other cases I saw up to 5-6 lines like this, but always in fairly "deep" stacks). Inspection of 2 different cases (comparing what's in the log with the original source code being profiled) shows that these lines are "orphaned": they are not the "bottom" of the trace following the next THREADSTART, which rules out the possibility that spurious THREADSTARTs are somehow emitted in the middle of a stack being dumped.
My theory is that the agent starts emitting a stacktrace and gets interrupted by the timer asking for the next stacktrace, so the rest of the current stacktrace gets devnulled. I didn't look at the C++ code, however; this is just a guess.
I can probably work around this in the LogCollector with a heuristic (peek at the last frame before a new THREADSTART; if it isn't one of the very few top-level frames for the thread it belongs to, throw the stack away), but IMHO it should get fixed properly in the agent.
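A sketch of what that heuristic could look like, reusing the hypothetical ThreadStart/Frame types from the sketch above; the per-thread sets of acceptable top-level frames would have to be supplied by hand, and none of this is existing LogCollector code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Buffer the frames of the current trace; when the next THREADSTART (or the
// end of the log) arrives, keep the trace only if its bottom frame is a known
// top-level frame for the thread it belongs to. Otherwise it is treated as an
// orphaned partial trace and discarded.
final class OrphanFilter {

    static List<List<Frame>> keepCompleteTraces(Iterable<LogEvent> events,
                                                Map<Long, Set<String>> rootFramesByThread) {
        List<List<Frame>> complete = new ArrayList<>();
        List<Frame> current = new ArrayList<>();
        long currentThread = -1;
        for (LogEvent event : events) {
            if (event instanceof ThreadStart start) {
                flush(currentThread, current, rootFramesByThread, complete);
                current = new ArrayList<>();
                currentThread = start.threadId();
            } else if (event instanceof Frame frame) {
                current.add(frame);
            }
        }
        flush(currentThread, current, rootFramesByThread, complete);
        return complete;
    }

    private static void flush(long threadId, List<Frame> trace,
                              Map<Long, Set<String>> rootFramesByThread,
                              List<List<Frame>> complete) {
        if (trace.isEmpty()) {
            return;
        }
        // Frames are dumped top-down, so the last frame is the stack bottom.
        Frame bottom = trace.get(trace.size() - 1);
        Set<String> roots = rootFramesByThread.getOrDefault(threadId, Set.of());
        if (roots.contains(bottom.methodName())) {
            complete.add(trace);
        }
    }
}
```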
[Start of dump extract, anonymized a tad ;-) ]
[...]
FRAME com.[...].doExecute 139658604373232
FRAME com.[...].execute 139658589859688
FRAME com.[...].execute 139658590061520
FRAME com.[...].run 139658590030528
FRAME java.lang.Thread.run 139658581930536
THREADSTART 31885
FRAME com.[...].getNameBaseInternalPrivate 139656096479528
FRAME com.[...].getNextSiblingRefPrivate 139656096658576
THREADSTART 31885
FRAME com.[...].getContentPrivate 139656099318112
FRAME com.[...].getContent 139656099318384
FRAME com.[...].getNodeValue 139656099314848
FRAME com.[...].serializeTextNode 139656099112320
FRAME com.[...].serializeNode 139656099112520
[...]