Plugin consumes even more CPU when idle #3

Open
AnClark opened this issue Jan 3, 2024 · 25 comments
Labels
help wanted Extra attention is needed

Comments

AnClark commented Jan 3, 2024

Hi Wasted Audio Team,

I've encountered a strange issue when using WSTD EQ in REAPER for Linux. While the plugin is processing audio, CPU usage stays below 1.0% on average. However, when I click "Stop" in REAPER, CPU usage jumps to 7.0%.

See the following screenshots:

  • While processing:

[Image]

  • When idle:

[Image]


My system environment:

  • PC: ThinkPad R400
  • CPU: Intel(R) Core(TM)2 Duo CPU P9500 @ 2.53GHz
  • OS: Arch Linux
  • DAW: REAPER v6.83
  • WSTD EQ version: v1.0 (official release)
Author

AnClark commented Jan 3, 2024

This is a Linux perf capture taken while testing with REAPER (VST3 edition). I left the plugin idle for a fairly long time.

perf.data.tar.gz.

Here are some screenshots:

[Image]

[Image]

Contributor

dromer commented Jan 3, 2024

Hey @AnClark thank you for the detailed report.

This will take some dedicated time to figure out. I haven't seen this issue before and I can't directly think of what could cause it.

It might be related to the framework we use. Have you used any other DPF based plugins before that show a similar load increase when transport is stopped?

Contributor

dromer commented Jan 3, 2024

Hmm, so from your perf inspection it seems that the biquad filters take a lot of time on your machine.

I quickly tried this with the v1.0 release in REAPER on my AMD Ryzen 5 (quite a bit more performant than your ancient C2D), and I don't see any such discrepancy:

Idle:

Active:

Occasionally I see a tiny "jump", when stopping, to 0.03% but it quickly goes down to 0.02% again.
Not sure how else I could reproduce.

I am considering enabling SSE4.1 for all plugins this year, which should give a nearly 4x performance increase. The C2D at least supports this instruction set, so maybe we can do some preliminary tests to see whether it improves things a bit for you.

Contributor

dromer commented Jan 3, 2024

Btw it seems your perf.data file is incompatible with my system, so I cannot read the output myself.

I'm guessing those visual stats are also an extra feature of that version, which I don't seem to have.

dromer added the "help wanted" label Jan 3, 2024
Author

AnClark commented Jan 4, 2024

> Occasionally I see a tiny "jump", when stopping, to 0.03% but it quickly goes down to 0.02% again.
>
> Not sure how else I could reproduce.

Here's another way you can reproduce the issue:

  1. Add a new track, and load WSTD EQ;
  2. Create a new MIDI item, and add JSFX "White Noise Generator" to Take FX;
  3. Switch on repeat (activate the "Toggle Repeat" button);
  4. Play.

WSTD EQ still consumes more CPU after I stop playback.


It's strange that the biquad filter works perfectly during processing. Only after I stop the transport does the filter begin to consume CPU.

Is it possible that any inappropriate samples were being processed by DSP, which made it misbehave?

Author

AnClark commented Jan 4, 2024

> It might be related to the framework we use. Have you used any other DPF based plugins before that show a similar load increase when transport is stopped?

Yes. I'm porting some LV2 plugins to DPF. Both of the following plugins used to have a similar issue:

Both of them have a Moog-style filter. When I stop the transport, the filter increases the CPU load tremendously. I haven't figured out why this happens yet, so I just added a workaround: bypass the filters when the oscillators are not sending samples to them.
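
A workaround of this kind (skipping the filter when nothing is fed into it) might look roughly like the sketch below. It is not taken from either ported plugin: `blockIsSilent`, `GuardedFilter`, the one-pole stand-in filter, and the `1e-15f` threshold are all hypothetical names and values chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Returns true when every sample in the block is below `threshold` in magnitude.
// The default threshold is an arbitrary illustrative value.
bool blockIsSilent(const float* in, std::size_t frames, float threshold = 1e-15f) {
  for (std::size_t i = 0; i < frames; ++i) {
    if (std::fabs(in[i]) >= threshold) return false;
  }
  return true;
}

// Hypothetical stand-in for a recursive filter: y[n] = x[n] + coeff * y[n-1].
struct GuardedFilter {
  float state = 0.0f;
  float coeff = 0.9f;

  void process(float* buf, std::size_t frames) {
    assert(buf != nullptr || frames == 0);
    if (blockIsSilent(buf, frames)) {
      // Bypass: clear the feedback state and output exact silence,
      // so the recursion never runs on a silent block.
      state = 0.0f;
      for (std::size_t i = 0; i < frames; ++i) buf[i] = 0.0f;
      return;
    }
    for (std::size_t i = 0; i < frames; ++i) {
      state = buf[i] + coeff * state;
      buf[i] = state;
    }
  }
};
```

The trade-off of this approach is that the filter's tail is cut off as soon as a fully silent block arrives, which may or may not be audible depending on the filter.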

Contributor

dromer commented Jan 4, 2024

> Here's another way you can reproduce the issue:
>
> 1. Add a new track, and load WSTD EQ;
> 2. Create a new MIDI item, and add JSFX "White Noise Generator" to Take FX;
> 3. Switch on repeat (activate the "Toggle Repeat" button);
> 4. Play.

I tried following these instructions. I have a MIDI section, JS: White Noise Generator, then VST3: WSTD EQ. With this selection on repeat, CPU usage doesn't get beyond 0.03%, whether it is playing or not.

Author

AnClark commented Jan 4, 2024

I’ve checked __hv_biquad_f() in the generated code.

It seems that newer CPUs like your Ryzen 5 use the code paths optimized with AVX or SSE 4.1, while my ancient C2D only supports SSE and SSE2, so it falls back to this simple scalar path:

  const float y = bIn*bX0 + o->xm1*bX1 + o->xm2*bX2 - o->ym1*bY1 - o->ym2*bY2;
  o->xm2 = o->xm1; o->xm1 = bIn;
  o->ym2 = o->ym1; o->ym1 = y;
  *bOut = y;

However, it's still strange: this code performs quite well while the transport is running, but CPU load increases when the transport stops.

Full `__hv_biquad_f()` code in WSTD EQ
#if _WIN32 && !_WIN64
void __hv_biquad_f_win32(SignalBiquad *o, hv_bInf_t *_bIn, hv_bInf_t *_bX0, hv_bInf_t *_bX1, hv_bInf_t *_bX2, hv_bInf_t *_bY1, hv_bInf_t *_bY2, hv_bOutf_t bOut) {
  hv_bInf_t bIn = *_bIn;
  hv_bInf_t bX0 = *_bX0;
  hv_bInf_t bX1 = *_bX1;
  hv_bInf_t bX2 = *_bX2;
  hv_bInf_t bY1 = *_bY1;
  hv_bInf_t bY2 = *_bY2;
#else
void __hv_biquad_f(SignalBiquad *o, hv_bInf_t bIn, hv_bInf_t bX0, hv_bInf_t bX1, hv_bInf_t bX2, hv_bInf_t bY1, hv_bInf_t bY2, hv_bOutf_t bOut) {
#endif
#if HV_SIMD_AVX
  __m256 x = _mm256_permute_ps(bIn, _MM_SHUFFLE(2,1,0,3));  // [3 0 1 2 7 4 5 6]
  __m256 y = _mm256_permute_ps(o->x, _MM_SHUFFLE(2,1,0,3)); // [d a b c h e f g]
  __m256 n = _mm256_permute2f128_ps(y,x,0x21);              // [h e f g 3 0 1 2]
  __m256 xm1 = _mm256_blend_ps(x, n, 0x11);                 // [h 0 1 2 3 4 5 6]

  x = _mm256_permute_ps(bIn, _MM_SHUFFLE(1,0,3,2));  // [2 3 0 1 6 7 4 5]
  y = _mm256_permute_ps(o->x, _MM_SHUFFLE(1,0,3,2)); // [c d a b g h e f]
  n = _mm256_permute2f128_ps(y,x,0x21);              // [g h e f 2 3 0 1]
  __m256 xm2 = _mm256_blend_ps(x, n, 0x33);          // [g h 0 1 2 3 4 5]

  __m256 a = _mm256_mul_ps(bIn, bX0);
  __m256 b = _mm256_mul_ps(xm1, bX1);
  __m256 c = _mm256_mul_ps(xm2, bX2);
  __m256 d = _mm256_add_ps(a, b);
  __m256 e = _mm256_add_ps(c, d); // bIn*bX0 + o->x1*bX1 + o->x2*bX2

  float y0 = e[0] - o->ym1*bY1[0] - o->ym2*bY2[0];
  float y1 = e[1] - y0*bY1[1] - o->ym1*bY2[1];
  float y2 = e[2] - y1*bY1[2] - y0*bY2[2];
  float y3 = e[3] - y2*bY1[3] - y1*bY2[3];
  float y4 = e[4] - y3*bY1[4] - y2*bY2[4];
  float y5 = e[5] - y4*bY1[5] - y3*bY2[5];
  float y6 = e[6] - y5*bY1[6] - y4*bY2[6];
  float y7 = e[7] - y6*bY1[7] - y5*bY2[7];

  o->x = bIn;
  o->ym1 = y7;
  o->ym2 = y6;

  *bOut = _mm256_set_ps(y7, y6, y5, y4, y3, y2, y1, y0);
#elif HV_SIMD_SSE
  __m128 n = _mm_blend_ps(o->x, bIn, 0x7); // [a b c d] [e f g h] = [e f g d]
  __m128 xm1 = _mm_shuffle_ps(n, n, _MM_SHUFFLE(2,1,0,3)); // [d e f g]
  __m128 xm2 = _mm_shuffle_ps(o->x, bIn, _MM_SHUFFLE(1,0,3,2)); // [c d e f]

  __m128 a = _mm_mul_ps(bIn, bX0);
  __m128 b = _mm_mul_ps(xm1, bX1);
  __m128 c = _mm_mul_ps(xm2, bX2);
  __m128 d = _mm_add_ps(a, b);
  __m128 e = _mm_add_ps(c, d);

  const float *const bbe = (float *) &e;
  const float *const bbY1 = (float *) &bY1;
  const float *const bbY2 = (float *) &bY2;

  float y0 = bbe[0] - o->ym1*bbY1[0] - o->ym2*bbY2[0];
  float y1 = bbe[1] - y0*bbY1[1] - o->ym1*bbY2[1];
  float y2 = bbe[2] - y1*bbY1[2] - y0*bbY2[2];
  float y3 = bbe[3] - y2*bbY1[3] - y1*bbY2[3];

  o->x = bIn;
  o->ym1 = y3;
  o->ym2 = y2;

  *bOut = _mm_set_ps(y3, y2, y1, y0);
#elif HV_SIMD_NEON
  float32x4_t xm1 = vextq_f32(o->x, bIn, 3);
  float32x4_t xm2 = vextq_f32(o->x, bIn, 2);

  float32x4_t a = vmulq_f32(bIn, bX0);
  float32x4_t b = vmulq_f32(xm1, bX1);
  float32x4_t c = vmulq_f32(xm2, bX2);
  float32x4_t d = vaddq_f32(a, b);
  float32x4_t e = vaddq_f32(c, d);

  float y0 = e[0] - o->ym1*bY1[0] - o->ym2*bY2[0];
  float y1 = e[1] - y0*bY1[1] - o->ym1*bY2[1];
  float y2 = e[2] - y1*bY1[2] - y0*bY2[2];
  float y3 = e[3] - y2*bY1[3] - y1*bY2[3];

  o->x = bIn;
  o->ym1 = y3;
  o->ym2 = y2;

  *bOut = (float32x4_t) {y0, y1, y2, y3};
#else
  const float y = bIn*bX0 + o->xm1*bX1 + o->xm2*bX2 - o->ym1*bY1 - o->ym2*bY2;
  o->xm2 = o->xm1; o->xm1 = bIn;
  o->ym2 = o->ym1; o->ym1 = y;
  *bOut = y;
#endif
}

Contributor

dromer commented Jan 5, 2024

As I said we do not build with SIMD optimizations yet (only on ARM).

Your CPU should support SSE4.1 which I might enable later this year.
C2D is about 15 years old now.

You could try this optimization by adding -msse4.1 to the CXXFLAGS in plugin/source/Makefile.
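
Because the branch of `__hv_biquad_f()` that gets compiled is selected by preprocessor macros, the CPU in the machine only matters once the build flags enable the instruction set. A small sketch for checking what a given set of flags enables is shown below. It relies on the compiler-defined macros `__AVX__`, `__SSE4_1__`, `__SSE2__`, and `__ARM_NEON` (GCC/Clang). `compiledSimdLevel` and `buildUsesSimd` are hypothetical helpers, and how hvcc maps these macros onto `HV_SIMD_*` is not shown in this thread.

```cpp
#include <cassert>
#include <cstring>

// Reports the widest SIMD level the compiler was allowed to target,
// based only on predefined macros (a compile-time property, not a CPU probe).
const char* compiledSimdLevel() {
#if defined(__AVX__)
  return "avx";
#elif defined(__SSE4_1__)
  return "sse4.1";
#elif defined(__SSE2__) || defined(_M_X64)
  return "sse2";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
  return "neon";
#else
  return "scalar";
#endif
}

// True when any SIMD level was enabled at compile time.
bool buildUsesSimd() {
  return std::strcmp(compiledSimdLevel(), "scalar") != 0;
}
```

Compiling this once with and once without `-msse4.1` should show whether the flag actually reached the compiler.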

Author

AnClark commented Jan 5, 2024

I have a newer ThinkPad X201 Tablet. It has a first-generation Core processor (Core i7 L 640).

I enabled -msse4.1 and tested again. Although the SIMD instructions reduced idle CPU usage by 1.0%, the problem still exists.

It sounds like the problem has something to do with the algorithm.

Author

AnClark commented Jan 5, 2024

For reference, here's a Moog-style filter from RaffoSynth, which has the same problem as I described:

// does the same as the asm version
void equalizer(float* buffer, float* prev_vals, uint32_t sample_count, float psuma0, float psuma2, float psuma3, float ssuma0, float ssuma1, float ssuma2, float ssuma3, float factorSuma2){
  float psuma1 = psuma0 * 2;
  for (int i = 0; i < sample_count; i++) {
    // low-pass filter

    float temp = buffer[i];
    buffer[i] *= psuma0;  // psuma0 == factorsuma1
    buffer[i] += psuma0 * prev_vals[0] + psuma1 * prev_vals[1]
               + psuma2 * prev_vals[2] + psuma3 * prev_vals[3];
    prev_vals[0] = prev_vals[1];
    prev_vals[1] = temp;

    // peaking EQ (resonance)
    float temp2 = buffer[i];

    buffer[i] *= factorSuma2;
    buffer[i] += ssuma0 * prev_vals[2] + ssuma1 * prev_vals[3]
               + ssuma2 * prev_vals[4] + ssuma3 * prev_vals[5];
    prev_vals[2] = prev_vals[3];
    prev_vals[3] = temp;
    prev_vals[4] = prev_vals[5];
    prev_vals[5] = buffer[i];
  }
}

Contributor

dromer commented Jan 5, 2024

I got a hint from FalkTX on what could be going on.
Can you perhaps try the following?

To the top of WSTD_EQ/plugin/source/HeavyDPF_WSTD_EQ.cpp add

#include "extra/ScopedDenormalDisable.hpp"

And in the run function set the following:

  const ScopedDenormalDisable sdd;
  const TimePosition& timePos(getTimePosition());

Rebuild and try again.
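
For readers unfamiliar with this kind of guard, here is a minimal sketch of the general idea: an RAII object that switches the FPU into flush-to-zero mode for the duration of a scope and restores the previous mode afterwards. This is not DPF's implementation. `ScopedFlushDenormals`, `readFpControl`, and `flushScopeRestoresControlWord` are hypothetical names, and the sketch only handles x86 with SSE (via `_mm_getcsr`/`_mm_setcsr`), doing nothing elsewhere.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64)
# include <xmmintrin.h>
# define SKETCH_HAS_MXCSR 1
#else
# define SKETCH_HAS_MXCSR 0
#endif

// Reads the SSE control/status register, or 0 where it is not available.
inline unsigned readFpControl() {
#if SKETCH_HAS_MXCSR
  return _mm_getcsr();
#else
  return 0u;
#endif
}

// While alive, results that would be denormal are flushed to zero (FTZ)
// and denormal inputs are treated as zero (DAZ).
class ScopedFlushDenormals {
public:
  ScopedFlushDenormals() : saved_(readFpControl()) {
#if SKETCH_HAS_MXCSR
    // MXCSR bit 15 = FTZ, bit 6 = DAZ (assumes the CPU supports DAZ).
    _mm_setcsr(saved_ | 0x8040u);
#endif
  }
  ~ScopedFlushDenormals() {
#if SKETCH_HAS_MXCSR
    _mm_setcsr(saved_);  // restore the caller's floating-point mode
#endif
  }
  ScopedFlushDenormals(const ScopedFlushDenormals&) = delete;
  ScopedFlushDenormals& operator=(const ScopedFlushDenormals&) = delete;

private:
  unsigned saved_;
};

// Checks that the guard leaves the control word as it found it.
bool flushScopeRestoresControlWord() {
  const unsigned before = readFpControl();
  { ScopedFlushDenormals guard; }
  return readFpControl() == before;
}
```

Restoring the old mode on scope exit matters in a plugin, because the host and other plugins may run on the same thread and expect their own floating-point settings.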

Author

AnClark commented Jan 6, 2024

@dromer OK. I'll try it tonight (Beijing time) and report back.

Contributor

dromer commented Jan 6, 2024

@AnClark you can try this build when it finishes: https://github.com/Wasted-Audio/wstd-eq/actions/runs/7431093334

Author

AnClark commented Jan 6, 2024

> I got a hint from FalkTX on what could be going on. Can you perhaps try the following?
>
> To the top of WSTD_EQ/plugin/source/HeavyDPF_WSTD_EQ.cpp add
>
>     #include "extra/ScopedDenormalDisable.hpp"
>
> And in the run function set the following:
>
>     const ScopedDenormalDisable sdd;
>     const TimePosition& timePos(getTimePosition());
>
> Rebuild and try again.

Great! After adding those lines and building with the -O3 CXX flag, the problem is resolved. CPU usage is now about 0.6% when idle.

[Image]

Author

AnClark commented Jan 6, 2024

> @AnClark you can try this build when it finishes: https://github.com/Wasted-Audio/wstd-eq/actions/runs/7431093334

I've also tested your build.

Your build performs better than mine: CPU usage does not go beyond 0.5% when idle. So disabling denormal numbers really works.
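
A likely reason why silence is the expensive case: once the input stops, the state of a recursive filter keeps decaying until it reaches the subnormal (denormal) range, and arithmetic on subnormals is much slower on many CPUs. The sketch below reuses the scalar recursion quoted earlier in the thread, with hypothetical coefficients that reduce it to y[n] = x[n] + 0.9·y[n-1], and counts how many outputs are subnormal after a single impulse. `BiquadState`, `biquadTick`, and `countSubnormalOutputs` are illustrative names, and the sketch assumes the default floating-point mode (no flush-to-zero, no `-ffast-math`).

```cpp
#include <cassert>
#include <cmath>

// Scalar biquad state, mirroring the fields used by the fallback branch.
struct BiquadState {
  float xm1 = 0.0f, xm2 = 0.0f, ym1 = 0.0f, ym2 = 0.0f;
};

// One sample of the same recursion as the scalar fallback.
float biquadTick(BiquadState& o, float in,
                 float bX0, float bX1, float bX2, float bY1, float bY2) {
  const float y = in*bX0 + o.xm1*bX1 + o.xm2*bX2 - o.ym1*bY1 - o.ym2*bY2;
  o.xm2 = o.xm1; o.xm1 = in;
  o.ym2 = o.ym1; o.ym1 = y;
  return y;
}

// Feeds one unit impulse followed by `silentSamples` zeros and counts the
// outputs that are subnormal (nonzero, but smaller than the smallest normal float).
int countSubnormalOutputs(int silentSamples) {
  assert(silentSamples >= 0);
  BiquadState s;
  // Hypothetical coefficients: y[n] = x[n] + 0.9 * y[n-1].
  const float bX0 = 1.0f, bX1 = 0.0f, bX2 = 0.0f, bY1 = -0.9f, bY2 = 0.0f;
  biquadTick(s, 1.0f, bX0, bX1, bX2, bY1, bY2);
  int subnormal = 0;
  for (int i = 0; i < silentSamples; ++i) {
    const float y = biquadTick(s, 0.0f, bX0, bX1, bX2, bY1, bY2);
    if (std::fpclassify(y) == FP_SUBNORMAL) ++subnormal;
  }
  return subnormal;
}
```

With these coefficients a short silent run such as `countSubnormalOutputs(500)` stays entirely in the normal range, while a longer one such as `countSubnormalOutputs(5000)` reports subnormal outputs. With flush-to-zero enabled, those values become exact zeros instead.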

Contributor

dromer commented Jan 6, 2024

Cool! Thank you for confirming. I guess on older systems such as yours this really makes a difference.
On my machines I couldn't spot any significant change.

Now the question is how best to apply this, since setting this option can potentially break things as well.

Author

AnClark commented Jan 6, 2024

My pleasure!

It would be helpful if there were some documentation for ScopedDenormalDisable. This is the first time I've heard of this API. I wonder whether FalkTX and the contributors have proven it stable.

You could also run more tests on other platforms, including Apple Silicon. None of my machines is newer than a 5th-gen Core i5.

Contributor

dromer commented Jan 6, 2024

I do not own any Windows or MacOS machines, so doing "proper" testing on those is not possible.
What I generally do is pass builds to friends and ask them to report if it works 🤷

Contributor

dromer commented Jan 6, 2024

Btw the only documentation for this class is in the code: https://github.com/DISTRHO/DPF/blob/main/distrho/extra/ScopedDenormalDisable.hpp

Author

AnClark commented Jan 6, 2024

I've found a solution: add a new entry to the HVCC JSON metadata (e.g. dpf.enable_denormal_number_fix, or some better name) to control whether this fix is enabled. That way we can apply the fix only to WSTD EQ and leave other products unaffected.

What's more, we could also provide two builds of WSTD EQ starting with the next release: one that applies this fix, and one that stays as-is.

Contributor

dromer commented Jan 6, 2024

I don't see any reason to provide two completely separate builds of the same plugin, that doesn't make any sense.
Either such a patch will be in place, or it won't.

Having it as a configurable option in the json is a nice idea, so it won't be put there automatically for all DPF builds.
I'd like to know more about the implications of the patch and how it could disrupt plugin and host behavior before moving forward with a permanent solution.

Author

AnClark commented Jan 6, 2024

Maybe I can help test on Windows (as well as Wine). I have a Hewlett-Packard Pavilion with Windows 11 and MSYS2 installed (though it uses an i7-5500U).

What's more, it would also help a lot if WSTD and HVCC had unit tests (or benchmark tests).

Contributor

dromer commented Jan 6, 2024

HVCC does have some testing in place (although not everything works), but that's a discussion for a different project :)

Author

AnClark commented Jan 7, 2024

So how should we do the tests? Maybe we can make a roadmap for testing plugins (not necessarily limited to WSTD EQ), for example by specifying test cases and target DAWs.
