Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support for OkHttp thread metrics #1022

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

ianbotsf
Copy link
Contributor

@ianbotsf ianbotsf commented Jan 4, 2024

Issue #

#894

Description of changes

This change adds metrics for the OkHttp engine's thread pool. It should be considered WIP until the following open questions are addressed:

  • How can we best test this in an automated fashion?
  • This change only adds support for OkHttp. CRT doesn't seem to have support for interrogating metrics about HTTP client thread counts. Is that good enough for now or should we push for enhanced support in CRT?
  • The additions to HttpClientMetrics eschew the existing pattern of pushing metrics to intermediate refs and the polling them from the async counter in favor of pulling the source data directly via a passed callback. This seems cleaner to me since it removes unnecessary state but I'm curious whether others think it's a good idea (and whether we'd potentially want to extend it to other async counters as appropriate).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

"smithy.client.http.threads.count",
{ measurement -> recordThreadState(measurement, callback) },
"{thread}",
"Current state of HTTP engine threads (active, running)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are "running" threads different from "active" ones? I see only "total" and "active" are recorded.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, "active" and "running" are not the correct terms here. I can change those to "total" and "active".

@lauzadis
Copy link
Contributor

lauzadis commented Jan 8, 2024

  1. I think unit tests for the metrics being emitted would be a good start. I'm not sure, is it easy to change the number of active threads in the OkHttp client, or is it more an internal implementation detail in the engine? If the latter, I think it would be sufficient to just check that the metrics are being emitted.
  2. It would be nice to have both engines' metrics aligned but it can probably be solved on the longer term
  3. I think it's a good idea

@ianbotsf
Copy link
Contributor Author

ianbotsf commented Jan 8, 2024

  1. I think unit tests for the metrics being emitted would be a good start. I'm not sure, is it easy to change the number of active threads in the OkHttp client, or is it more an internal implementation detail in the engine? If the latter, I think it would be sufficient to just check that the metrics are being emitted.

To be clear, you mean verify that a metric value exists but not what it is? Because this is multi-threaded behavior and we don't control the implementation of the underlying thread pool, it'll be difficult to achieve a stable expectation for number of total/active threads in a test case.

@lauzadis
Copy link
Contributor

lauzadis commented Jan 8, 2024

  1. I think unit tests for the metrics being emitted would be a good start. I'm not sure, is it easy to change the number of active threads in the OkHttp client, or is it more an internal implementation detail in the engine? If the latter, I think it would be sufficient to just check that the metrics are being emitted.

To be clear, you mean verify that a metric value exists but not what it is? Because this is multi-threaded behavior and we don't control the implementation of the underlying thread pool, it'll be difficult to achieve a stable expectation for number of total/active threads in a test case.

Yes, just testing that a metric value exists if we don't have predictable behavior to test on.

Copy link
Contributor

@aajtodd aajtodd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can we best test this in an automated fashion?
I'm not sure we can in every scenario. I had to do a lot of testing locally with an observability stack setup to actually view the metrics working. Let me know if you discover something better.

This change only adds support for OkHttp. CRT doesn't seem to have support for interrogating metrics about HTTP client thread counts. Is that good enough for now or should we push for enhanced support in CRT?

I'd say do what we can for now (although see my comment about getting this in SRA first before finalizing it).

The additions to HttpClientMetrics eschew the existing pattern of pushing metrics to intermediate refs and the polling them from the async counter in favor of pulling the source data directly via a passed callback. This seems cleaner to me since it removes unnecessary state but I'm curious whether others think it's a good idea (and whether we'd potentially want to extend it to other async counters as appropriate).

The reason it's structured this way is so that engines can update the values however they see fit (whether that be inc/dec or absolute value changes). See my other comment here but I don't think I see a reason to special case this new metric and would likely prefer to keep them consistent.

@@ -39,6 +44,7 @@ public object HttpClientMetricAttributes {
public class HttpClientMetrics(
scope: String,
public val provider: TelemetryProvider,
threadStateCallback: (() -> ThreadState?)? = null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems different for no apparent reason. It's not wrong I guess but feels odd that we would special case it since each engine can adapt however they want to the current way metrics are exposed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I concede it's very different from what we have elsewhere and feels very specialized right now.

I think one problem is that we don't directly control the thread pool. We don't decide when new threads are created or when they swap from idle to active. All we can do is inspect the state at any given moment to know the current value. That sounds like what an AsyncUpDownCounter is intended to do via a callback.

We could start a background thread in the OkHttp engine to periodically inspect the thread pool and stuff the idle/active counts into HttpClientMetrics, which would then be read when the callback is run. But this means the value will be out of sync by whatever interval the background thread uses. If, for instance, the background thread runs every 10 seconds but the meter polls every 5 seconds, then up to two datapoints will be reported inaccurately.

Is your greater concern that this change exposes the concept of callbacks outside of HttpClientMetrics and inverts the collection model from a push to a pull? Is it the inconsistency of some metrics being exposed outside of HttpClientMetrics as callbacks while others are mutable properties inside? Or is it that this may not be flexible enough to adapt to other engines where thread updates are more deterministic?

@@ -25,8 +25,13 @@ public object HttpClientMetricAttributes {
public val AcquiredConnection: Attributes = attributesOf { "state" to "acquired" }
public val QueuedRequest: Attributes = attributesOf { "state" to "queued" }
public val InFlightRequest: Attributes = attributesOf { "state" to "in-flight" }
public val TotalThread: Attributes = attributesOf { "state" to "total" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correctness: The state attribute for this should probably follow the usage convention where the sub states total to some limit.
e.g. thread state usage metric smithy.client.http.threads.usage with states of idle | consumed and a separate metric for the limit smithy.client.http.threads.limit.

Also I suppose we should probably get this defined in the SRA before codifying it in an implementation.

We may not know the total thread limit in every case or thread pools may be dynamic and spin down threads which would break the "should" part of the convention but I think that's fine. In the ideal case the convention applies and even when it doesn't it's still how you'd want it graphed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'll change this to idle/active threads.

Copy link

sonarcloud bot commented Jan 12, 2024

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants