ll_schedule: add a DSP load tracker to ll_tasks_execute #4943
Conversation
Add a lightweight load calculation to ll_tasks_execute() that prints out the average every 2^10 runs (roughly once per second on most systems).

Example log output:

--cut--
[ 6950848.421715] ( 809129.187500) c0 ll-schedule ./schedule/ll_schedule.c:139 INFO ll avg 19720/38400
[ 8868838.449667] ( 679100.000000) c0 ll-schedule ./schedule/ll_schedule.c:139 INFO ll avg 19658/38400
[ 10802857.643650] ( 565147.187500) c0 ll-schedule ./schedule/ll_schedule.c:139 INFO ll avg 19822/38400
--cut--

UPDATE: example trace output in a case where the DSP is under >95% load:
https://sof-ci.01.org/sofpr/PR4943/build10955/devicetest/?model=ADLP_RVP_NOCODEC&testcase=multiple-pipeline-all

Signed-off-by: Kai Vehmanen <[email protected]>
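In outline, the mechanism amounts to something like the following minimal, self-contained sketch (the struct and function names are illustrative stand-ins, and printf() stands in for the PR's tr_info() trace call):

```c
#include <stdint.h>
#include <stdio.h>

#define CHECKS_WINDOW_SIZE 10	/* average over 2^10 = 1024 runs */

/* illustrative stand-in for the per-scheduler load state */
struct ll_load {
	uint32_t checks;	/* runs counted in the current window */
	uint32_t cycles_sum;	/* DSP cycles accumulated over the window */
};

/* called once per ll_tasks_execute() run; "cycles" is what that run
 * consumed, "max" is the cycle budget of one tick period
 */
void ll_load_track(struct ll_load *load, uint32_t cycles, uint32_t max)
{
	load->cycles_sum += cycles;

	if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
		/* average via a shift, so no division on the hot path */
		load->cycles_sum >>= CHECKS_WINDOW_SIZE;
		/* the PR logs this as tr_info(&ll_tr, "ll avg %u/%u", ...) */
		printf("ll avg %u/%u\n", load->cycles_sum, max);
		load->cycles_sum = 0;
		load->checks = 0;
	}
}
```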
We have similar trace output in SOF Zephyr builds, and it has proven very useful for DSP algorithm debugging and for tracking performance changes in CI and daily tests.
They have also proven very verbose, and they prevent the DSP from going to sleep. I comment them out in my workspace.
@marc-hb wrote:

> Why is there no bug about this? It sounds like an error if it prevents the DSP from going to sleep.

It's a good question what the default should be for reporting. We can put this under Kconfig, but it should be enabled in CI for this to serve as a useful tool to track DSP load.
Related to older discussion in thesofproject/sof-test#726
@kv2019i thanks for thinking of this. The idea here is simple enough, though using CONFIG_SYSTICK_PERIOD may not always be accurate, as we are not always scheduled on a 1 ms period. OTOH, we already have a similar mechanism (no percentage data though): see perf_cnt_xxx(). If we select CONFIG_PERFORMANCE_COUNTERS in Kconfig, we get logs like "perf ll_work peak plat %u cpu %u" at info level, although it only tracks the peak, not the average value. To add average or percentage data support, it would be better to do that within the existing mechanism. If we want accurate percentage data, we might need to leverage the new_target_tick value, e.g. as sketched below.
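A hypothetical sketch of that idea (only new_target_tick is named above; every other identifier here is invented for illustration): accumulate the real per-run tick budget instead of assuming a fixed 1 ms period.

```c
#include <stdint.h>
#include <stdio.h>

#define CHECKS_WINDOW_SIZE 10

/* hypothetical state: track the real tick budget alongside the cycles */
struct ll_load_acc {
	uint32_t checks;
	uint64_t cycles_sum;	/* cycles consumed over the window */
	uint64_t ticks_sum;	/* scheduled cycle budget over the window */
};

/* last_tick and new_target_tick are the previous and next scheduling
 * points in cycles, so their difference is the real budget of this run
 */
void ll_load_track_acc(struct ll_load_acc *load, uint32_t cycles,
		       uint64_t last_tick, uint64_t new_target_tick)
{
	load->cycles_sum += cycles;
	load->ticks_sum += new_target_tick - last_tick;

	if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
		/* one division per window keeps the cost bounded */
		uint32_t pct = (uint32_t)(load->cycles_sum * 100 /
					  load->ticks_sum);

		printf("ll avg load %u%%\n", pct);
		load->cycles_sum = 0;
		load->ticks_sum = 0;
		load->checks = 0;
	}
}
```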
@keyonjie wrote:

Good point. I ported this from the new Zephyr-based scheduler, which has a fixed tick. But indeed, especially if we want to flag errors from the data, the target tick should be taken into account. That doesn't affect the other part:
I looked again at this, and the main point is that this is not designed to be only a debugging feature. There is definite value if the DSP load tracking can be made lightweight enough that we can leave it on at least in CI builds, and even in product builds. This does limit what we can do.

The existing mechanisms can print out traces at a much higher rate (when problems arise), and they do more calculations per trace. The peak data is useful for debugging in-progress algorithms, but I'd argue a periodic DSP load measurement is more useful for debugging issues with e.g. tplg changes and/or regressions in CI.

What this PR implements is not zero cost either, but due to the fixed reporting interval, the cost is at least deterministic. That does mean we don't really want to spend cycles pretty-printing the traces, and e.g. accounting for the target tick makes this a bit more challenging, but let's have a look; maybe we can bake it in. Of course, if the consensus is that even this is too much for non-R&D builds, then we might be better off just adding this functionality to the existing PERFORMANCE_COUNTERS. Looking forward to input.
```c
if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
	load->cycles_sum >>= CHECKS_WINDOW_SIZE;
	max = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK, CONFIG_SYSTICK_PERIOD);
```
This should be lightweight, but the above line performs run-time divisions and multiplications with constants, even if only once every N runs. Since clock_us_to_ticks() is a real function, I don't think this will be pre-calculated and optimised away by the compiler. But since it's always the same calculation (as long as the clock frequency doesn't change), it should be calculated once and stored. To also take frequency changes into account, we can store the frequency (index) that was used for the calculation and recalculate if it changes; see the sketch below. But yes, this would also somehow have to account for imprecise XTOS ticks.
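A minimal sketch of that caching suggestion (clock_us_to_ticks() and the two macros come from the snippet above; the frequency-index getter and all other names are hypothetical):

```c
#include <stdint.h>

/* approximate prototype repeated here so the sketch is self-contained */
uint32_t clock_us_to_ticks(int clock, uint32_t us);
/* hypothetical way to observe a clock frequency change */
int clock_get_freq_idx(int clock);

#define PLATFORM_DEFAULT_CLOCK 0	/* placeholder value */
#define CONFIG_SYSTICK_PERIOD 1000	/* placeholder: 1 ms period in us */

/* compute the tick budget once and recompute only when the frequency
 * (index) behind the cached value changes
 */
uint32_t ll_load_max_ticks(void)
{
	static uint32_t cached_max;
	static int cached_idx = -1;
	int idx = clock_get_freq_idx(PLATFORM_DEFAULT_CLOCK);

	if (idx != cached_idx) {
		cached_max = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK,
					       CONFIG_SYSTICK_PERIOD);
		cached_idx = idx;
	}

	return cached_max;
}
```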
The only part doing the div/mult is the clock_us_to_ticks() call, and this will only happen once per second (and you are right, we could do this at init since it's all const); all the rest is shifts, which are optimal on xtensa.
```c
if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
	load->cycles_sum >>= CHECKS_WINDOW_SIZE;
	max = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK, CONFIG_SYSTICK_PERIOD);
	tr_info(&ll_tr, "ll avg %u/%u", load->cycles_sum, max);
```
For automatic parsing, using a blank as the separator between the numbers instead of "/" could be nicer?
LGTM, but we could do the clock-us conversion at init since it's all const.
No time to progress with this, so closing.