Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Poor performance on prime multipliers #88

Open
vitalsong opened this issue Apr 20, 2023 · 0 comments
Open

Poor performance on prime multipliers #88

vitalsong opened this issue Apr 20, 2023 · 0 comments

Comments

@vitalsong
Copy link

Hi, I was wondering what method you use for FFT size not equal to a power of two (chirp z-transform or factorization). I picked up even lengths with prime factors and ran a test.

The results confused me:

Run on (12 X 2600 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB
  L1 Instruction 32 KiB
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB
Load Average: 2.70, 3.20, 3.63
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_FFT_KISS/1018          1387 us         1386 us          444
BM_FFT_KISS/2246          6771 us         6771 us          103
BM_FFT_KISS/11200          337 us          337 us         2090
BM_FFT_KISS/11202        62048 us        61918 us           12
BM_FFT_KISS/11182       171242 us       171165 us            4
BM_FFT_KISS/11146       169774 us       169752 us            4
BM_FFT_KISS/16384          201 us          201 us         3476

Factorization result:

factor(1018) = [2, 509]
factor(2246) = [2, 1123]
factor(11200) = [2, 2, 2, 2, 2, 2, 5, 5, 7]
factor(11202) = [2, 3, 1867]
factor(11182) = [2, 5591]
factor(11146) = [2, 5573]
factor(16384) = 2^14

The difference is 500-800 times between the worst and the best option. Most users don't think about implementation details and can get serious performance issues.

The fftw3 implementation does not have this problem:

BM_FFTW3_FLOAT/1018        11.7 us         11.7 us        59338
BM_FFTW3_FLOAT/2246        43.4 us         43.4 us        16293
BM_FFTW3_FLOAT/11200       24.2 us         24.2 us        29191
BM_FFTW3_FLOAT/11202        230 us          230 us         2918
BM_FFTW3_FLOAT/11182        257 us          257 us         2734
BM_FFTW3_FLOAT/11146        260 us          260 us         2727
BM_FFTW3_FLOAT/16384       36.7 us         36.7 us        19199

As I understand it, you use factorization and calculate the FFT for each factor, and then combine the results. It's probably worth adding a separate stage for large prime factors (e.g. czt ) to smooth out this problem.

In my implementation, I use the czt algorithm. This gives the worst result on good multipliers, but it has a fixed difficulty.
Here is an example of my benchmark:

BM_FFT_DSPLIB/1018         46.0 us         46.0 us        15287
BM_FFT_DSPLIB/2246          230 us          230 us         3031
BM_FFT_DSPLIB/11200        1100 us         1100 us          634
BM_FFT_DSPLIB/11202        1138 us         1138 us          636
BM_FFT_DSPLIB/11182        1132 us         1132 us          628
BM_FFT_DSPLIB/11146        1107 us         1107 us          618
BM_FFT_DSPLIB/16384         222 us          222 us         3236

PS: All bencmarks for complex type

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant