Tracking lock acquisition/exit events #5413

julealgon · 2024-03-05T23:03:22Z

Our current application suffers a bit from thread contention, and we believe this is in part due to lock utilization around caching, which is used in many places of our APIs.

I thought about tracking lock acquisitions and lock exits in telemetry, specifically as span events: every time a lock is requested, an event would be recorded; when the lock is acquired, another event would be recorded; and finally, when the critical section is over, a lock exit event would be recorded.

From what I checked thus far, there is no way to "be notified" of those events using the lock keyword or the Monitor static methods, so I'm assuming I'd have to create my own wrapper lock classes that could then either have Activity enrichment logic built-in, or that could be decorated to do so.
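
A minimal sketch of what such a wrapper could look like, assuming a hypothetical TracedLock class and hypothetical event names (lock.requested, lock.acquired, lock.released) and tag name (lock.name). None of these come from .NET or OpenTelemetry; only Monitor, Activity, ActivityEvent and ActivityTagsCollection are real APIs. Events are only added when the current Activity is sampled, so unsampled requests pay little more than the Monitor call itself.

```csharp
#nullable enable
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical wrapper around Monitor that records span events
// on the Activity that was current when the lock was requested.
public sealed class TracedLock
{
    private readonly object _gate = new object();
    private readonly string _name;

    public TracedLock(string name) => _name = name;

    // Blocks until the lock is acquired; dispose the result to release it.
    public Releaser Enter()
    {
        Activity? activity = Activity.Current;
        AddEvent(activity, "lock.requested");
        Monitor.Enter(_gate);
        AddEvent(activity, "lock.acquired");
        return new Releaser(this, activity);
    }

    private void AddEvent(Activity? activity, string eventName)
    {
        // Skip all work when there is no Activity or it was not sampled.
        if (activity is null || !activity.IsAllDataRequested)
        {
            return;
        }

        var tags = new ActivityTagsCollection { { "lock.name", _name } };
        activity.AddEvent(new ActivityEvent(eventName, tags: tags));
    }

    public readonly struct Releaser : IDisposable
    {
        private readonly TracedLock _owner;
        private readonly Activity? _activity;

        internal Releaser(TracedLock owner, Activity? activity)
        {
            _owner = owner;
            _activity = activity;
        }

        public void Dispose()
        {
            // Release first so that recording the event does not extend the critical section.
            Monitor.Exit(_owner._gate);
            _owner.AddEvent(_activity, "lock.released");
        }
    }
}
```

Call sites would change from lock (gate) { ... } to using (tracedLock.Enter()) { ... }, so this approach still requires touching every lock statement that should be traced.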

Alternatively, I wondered if the new .NET 8 interceptors capability could achieve something like this without touching the original code, by adding the additional tracing to Monitor.Enter and Monitor.Exit calls.

My questions are as follows:

1. Do you believe tracking lock scenarios as span events like this makes sense for observability, considering our scenario of heavy thread contention? Or should I consider actually using a span to track the entire critical section duration?
2. Do you know of any other mechanism that I could use to add those events without needing to rewrite all lock calls in the code with a wrapper object?
3. And finally, is this something the team has ever considered providing as a "native" instrumentation somehow?

Somewhat related:

Answered by noahfalk


cijothomas · 2024-03-05T23:23:22Z

Collaborator

For 2:
Check if this helps:

I have used https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Instrumentation.Runtime#processruntimedotnetmonitorlock_contentioncount to first find that an app is suffering from too much contention. Then dotnet-trace collect --clrevents contention -n <name_of_your_app> was used to find the exact points. Since dotnet-trace is able to collect the lock events, the runtime must already be firing them, and you can consume the same events to turn them into Activities. But...

For 1: I am not sure Activity is well suited for this purpose... Activity itself introduces some contention, and it may not be worth the cost (cost here = the cost of storing the spans somewhere, like a vendor).

For 3: I'll tag the .NET runtime owners who can help with this.

4 replies

julealgon Mar 5, 2024
Author

> And then dotnet-trace collect --clrevents contention -n <name_of_your_app> was used to find the exact points. Since dotnet-trace is able to collect the lock events, it must be already firing some events and you can steal the same to turn them into Activity. But....

That's interesting, I had no idea there were some events related to locks... I tried searching for this and nothing came up. Thanks for that info!

> For 1: I am not sure if using Activity for this purpose is well suited... Activity itself introduce some contention, and it may not be worth the cost (cost here = the cost of storing the spans somewhere, like a vendor).

You mean if I opted for a full-blown span, right (basically a child activity)? But what about the initial proposal, of tracking these as span events instead? Would that also be too costly you think?

cijothomas Mar 6, 2024
Collaborator

I won't be able to make a firm recommendation, as the cost is subjective and depends a lot on other factors!
(Some vendors convert each span event into a telemetry item of its own, and I have seen complaints about span events causing too much billing, e.g. open-telemetry/opentelemetry-dotnet-contrib#416)

julealgon Mar 6, 2024
Author

@cijothomas is this why stuff like exception recording is disabled by default in the AspNetCore instrumentation, or is there more to it? I know this is a bit off topic so if you want me to post a separate discussion I definitely could.

cijothomas Mar 6, 2024
Collaborator

That is one reason. Instrumentations follow OTel semantic conventions, and those conventions do not require exceptions to be recorded by default. Pretty much everything not explicitly listed as required/recommended by the semantic conventions will require some opt-in. Exception recording is one such example.

cijothomas · 2024-03-05T23:26:10Z

Collaborator

https://github.com/dotnet/runtime/blob/4822e3c3aa77eb82b2fb33c9321f923cf11ddde6/src/libraries/System.Diagnostics.DiagnosticSource/src/System/Diagnostics/ActivitySource.cs#L423 This is the contention which I ultimately found in my case! (As must be obvious now, I was trying to find contention within the OTel SDK itself.) I believe in your case you already know about the contention, but you want to track it better and make changes.

1 reply

julealgon Mar 5, 2024
Author

> I believe in your case, you already know of contention, but you want to better track it and make changes..

Basically, yes. We have a lot of telemetry in the environment and some of it comes from older Datadog Agent integrations, including thread contention metrics. What we were trying to achieve with this one is just to "point" those out in traces to see if our suspicions are correct. We have a few cases where traces go a long time without apparently processing anything, like a significant "blank" area inside aspnet traces without db calls or anything, and we suspect at least some of those are due to waiting on locks. If those events could be surfaced alongside those traces, it would be possible to clearly see that correlation (and potentially optimize/get rid of the locking in a few of those places).

noahfalk · 2024-03-06T07:22:50Z

> Do you believe tracking lock scenarios as span events like this makes sense for observability, considering our scenario of heavy thread contention? Or should I consider actually using a span to track the entire critical section duration?

I suspect for many scenarios the telemetry this generates would be very verbose. .NET supports workloads that acquire and release locks millions of times per second. There certainly could be other scenarios where the verbosity is much lower or where the dev is willing to take high overheads, but my initial guess is that it would be somewhat limited.

> Do you know of any other mechanism that I could use to add those events without needing to rewrite all lock calls in the code with a wrapper object?

If you wanted to experiment with it, you can also use EventListener to listen to those same contention events @cijothomas mentioned. You could then encode the events as logs (or ActivityEvents, or anything else) and include them with other telemetry. One gotcha is that, because the contention events are captured from the runtime's native code implementation, the events are placed in a buffer and dispatched asynchronously from another thread. All the events include a timestamp and thread ID for the thread where the contention originally occurred. However, other context that you might find interesting, such as a callstack or a reference to the originating thread's current Activity object, isn't something that EventListener supports.
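
A minimal sketch of the EventListener approach, under a few assumptions: the runtime provider is named Microsoft-Windows-DotNETRuntime, its contention keyword is 0x4000, and the relevant event names start with "Contention" (for example ContentionStart and ContentionStop variants). The ContentionEventListener class name is hypothetical, and writing to the console stands in for whatever log or telemetry sink you would actually use.

```csharp
using System;
using System.Diagnostics.Tracing;

// Hypothetical listener that prints runtime lock contention events.
public sealed class ContentionEventListener : EventListener
{
    // Assumption: keyword bit for contention events in the runtime provider.
    private const EventKeywords ContentionKeyword = (EventKeywords)0x4000;

    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Only subscribe to the runtime's own provider.
        if (eventSource.Name == "Microsoft-Windows-DotNETRuntime")
        {
            EnableEvents(eventSource, EventLevel.Informational, ContentionKeyword);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        string name = eventData.EventName ?? string.Empty;
        if (!name.StartsWith("Contention", StringComparison.Ordinal))
        {
            return;
        }

        // This callback runs on a dispatcher thread, not on the thread that
        // contended, so Activity.Current here is NOT the originating Activity.
        // The timestamp and OS thread ID describe the original occurrence.
        Console.WriteLine(
            $"{eventData.TimeStamp:O} {name} osThreadId={eventData.OSThreadId}");
    }
}
```

The listener only receives events while an instance is alive, so it would typically be created once at startup and kept in a long-lived field.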

There are some other ways to get those events depending on how far down the rabbit hole you want to go. @cijothomas mentioned dotnet-trace, which is very straightforward if you can run additional tools on the production machine. There are also the DiagnosticsClient and TraceEvent libraries (example) that dotnet-trace is built on top of, or the ICorProfiler APIs if you want to get down to the metal. (Traditionally ICorProfiler is complicated enough to use that very few people do it outside of dedicated profiling tool authors.)

> And finally, is this something the team has ever considered providing as a "native" instrumentation somehow?

In terms of telemetry that would be uploaded and stored, I assume it is too high volume to be effective for general use. For better understanding long application pauses/hangs/poor performance, I suspect stack sampling somewhere between 1 and 1000 Hz would be useful while collecting a lower total amount of telemetry. I believe the OTel profiling working group is exploring scenarios like that.

HTH!

2 replies

julealgon Mar 6, 2024
Author

> I suspect for many scenarios the telemetry this generates would be very verbose. .NET supports workloads that acquire and release locks millions of times per second. There certainly could be other scenarios where the verbosity is much lower or where the dev is willing to take high overheads, but my initial guess is that it would be somewhat limited.

So @noahfalk, if I were to record these as activity events though, wouldn't the default sampler already take care of not pushing a ton of stuff anyway? Wouldn't the sampler be the limiter on how much actual verbosity there is?

This is one of the reasons why using events (instead of logs, for example) came to my mind. It would give us data in the problematic traces, but it would only show up when traces are sampled too.

Additionally, to be fair, we are, at least currently, interested in our own locking calls and not so much in library or framework locks. Even if there was a general mechanism to be notified of all lock events, I'd still probably want to filter them down to only our instances for sanity, and because we believe the problems we are currently having really are due to our own locks.

> If you wanted to experiment with it, you can also use EventListener to listen to those same contention events @cijothomas mentioned. You could then encode the events as logs (or ActivityEvents, or anything else) and include them with other telemetry. One gotcha is that because the contention events are captured from the runtime's native code implementation the events are placed in a buffer and dispatched asynchronously from another thread. All the events include a timestamp and thread ID for the thread where the contention originally occurred. However other context that you might find interesting such as a callstack or a reference to originating thread's current Activity object isn't something that EventListener supports.

Interesting. It is a bummer though that there would be no way to attach the event to the originating activity. I think that would defeat the purpose if only a little. Will still take a look though and really appreciate the references there!

cijothomas Mar 6, 2024
Collaborator

> events are captured from the runtime's native code implementation the events are placed in a buffer and dispatched asynchronously from another thread

This means we cannot attach it to an existing Activity as an ActivityEvent, so it would have to be emitted as a new Activity of its own (or through another logging API).
