Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create separate worker usage data collection and move hardware emit there #1293

Open
wants to merge 22 commits into
base: master
Choose a base branch
from

Conversation

timl3136
Copy link
Contributor

@timl3136 timl3136 commented Nov 14, 2023

What changed?
Create a dedicated worker usage collector
Move hardware usage emitting functionality from base worker to the worker usage collector

Why?
We want to create a separate component responsible for collecting worker usage rather than a huge code block in the base worker.

How did you test it?
Tested locally as well as tested in staging env to ensure metrics consistency and no goroutine leak.

Potential risks
Instead of using Sync.Once to ensure the goroutine is run once per host, we move the hardware emitting to decision worker only as the Sync.Once might cause test timeout as it would keep other goroutine wait until the current one returns. So if a host does not have decision worker (impossible at the moment), it's hardware metrics won't be emitted.


func (w *workerUsageCollector) Start() {
w.wg.Add(1)
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to spawn a goroutine per worker? Why not ensure only 1 running?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the hardware emitting is once per host, all other metrics will be worker-specific. (e.g activity poll response vs. decision poll response)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I see only w.collectHardwareUsage() which will just spawn bunch of data into the same scope. I would suggest separating hardware emitter and worker specific metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's current design, for each type of metrics based on their origin, I will create a separate gorountine for each of them. But they would be contained under a single workerusagecollector so that their result can be collected and sent in one place

@timl3136 timl3136 marked this pull request as ready for review November 16, 2023 18:44
internal/internal_worker_base.go Outdated Show resolved Hide resolved
internal/internal_worker_usage_collector.go Outdated Show resolved Hide resolved
internal/internal_worker_usage_collector.go Outdated Show resolved Hide resolved
case <-ticker.C:
// Given that decision worker and activity worker are running in the same host, we only need to collect
// hardware usage from one of them.
if w.workerType == "DecisionWorker" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might not be future proof and also if customer is running separate processes for decision and activity workers then we will not have the hardware usage of those hosts that only runs activity workers. we should also not create no-op workerUsageCollectors if only one of them will do the work.
@Groxx what would be your recommendation for host level metric reporting on the client side? I would like to avoid global static variables but this use case probably requires one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried Sync.Once before, but that would cause issues with unit testing as it will just wait indefinitely for this routine to stop while blocking all other goroutine from closing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can override it in the unit tests

type once interface {
   Do(func())
}

var collectHardwareUsageOnce once

in typical startup this would be set to sync.Once:

collectHardwareUsageOnce = sync.Once{}

in test code you can initialize this to a fake implementation

collectHardwareUsageOnce = myFakeOnce{}  // myFakeOnce implements Do(func())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your suggestion. I have implemented that in the latest commit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see EmitOnce being used in workerUsageCollector. We should only have one (singleton) instance of workerUsageCollector which would be lazily created by the first worker instance. Rest of the workers would create a noOpUsageCollector. This lazy initialization logic should be hidden from workers. Worker just calls newWorkerUsageCollector() and that function should determine whether it's first time or not. Let's discuss offline if more clarification needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our usecase, only the hardware info are once per host collected. Other worker type (decision worker and activity worker) should have different workerUsageCollector as they track different task type behaviors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what type of information are you planning to collect per worker basis in this workerUsageCollector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tasklist backlog/poll response since decision and activity worker have their own pollers and that need to be scaled independently

zap.String(tagPanicStack, st))
}
}()
defer w.wg.Done()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a few things problematic about this goroutine closure

  1. this wg.Done() will be called once goroutine for go w.runHardwareCollector() is started. It shouldn't be marked as done until runHardwareCollector() terminates so should be moved there
  2. no need for a panic recovery here
  3. no need for a goroutine to invoke runHardwareCollector.

case <-ticker.C:
// Given that decision worker and activity worker are running in the same host, we only need to collect
// hardware usage from one of them.
if w.workerType == "DecisionWorker" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what type of information are you planning to collect per worker basis in this workerUsageCollector?

Comment on lines +274 to +279

// Optional: This implementation ensures that a specific function is executed only once per instance.
// The mechanism can be overridden by other interfaces that implement the 'Do()' method.
//
// default: nil, that would ensure some functions are executed only once
Sync oncePerHost
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is user visible worker options. we shouldn't expose oncePerHost here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do not exposed that here, we might need to insert that as part of the workerExecutionParameters and add the Sync.Once as part of the parameter into function "NewWorker" and "newAggregatedWorker". What do you think about that idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it makes sense to add to workerExecutionParameters as it is not exposed outside. I'd also recommend looking at WithSomeOption pattern in go https://www.sohamkamani.com/golang/options-pattern/.

Copy link

codecov bot commented May 2, 2024

Codecov Report

Attention: Patch coverage is 67.01031% with 32 lines in your changes are missing coverage. Please review.

Project coverage is 73.23%. Comparing base (7f81710) to head (822564b).

Additional details and impacted files
Files Coverage Δ
internal/internal_worker.go 79.65% <100.00%> (+0.08%) ⬆️
internal/worker.go 14.28% <ø> (ø)
internal/internal_worker_base.go 75.54% <89.65%> (+9.04%) ⬆️
internal/internal_worker_usage_collector.go 53.22% <53.22%> (ø)

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f81710...822564b. Read the comment docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants