imlib/jpege: Minor DCT speedup. #2157
Force-pushed from 189718d to 2a3eb33.
With no SIMD changes on the RT1062... just changing
Baseline is:
The image size output is the same, though. On the H7 Plus, baseline:
And with a
The image size changes, which is odd given I'm using colorbar to test. The slowdown seems to be smaller on the H7 than on the RT1062.
Okay, I just checked for overflows beyond int16_t range by adding an overflow comparison after every operation, and there were none. So, it's not a math issue. Maybe it's the location of that array.
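For reference, a range check of this kind can be sketched as below. This is a minimal illustration, not the code used in the PR: `check_int16_range`, `butterfly_checked`, and `dct_overflow_count` are hypothetical names, and the butterfly is just a stand-in for one DCT operation.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical debug counter: intermediate values seen outside int16_t range. */
static unsigned long dct_overflow_count = 0;

/* Returns true if v fits in int16_t; otherwise records the overflow. */
static bool check_int16_range(int v) {
    if (v < INT16_MIN || v > INT16_MAX) {
        dct_overflow_count++;
        return false;
    }
    return true;
}

/* Stand-in for one DCT step: computed in int, with both results range-checked. */
static void butterfly_checked(int a, int b, int *sum, int *diff) {
    *sum = a + b;
    *diff = a - b;
    check_int16_range(*sum);
    check_int16_range(*diff);
}
```

If `dct_overflow_count` is still zero after encoding a full frame, every checked intermediate would have fit in an int16_t.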
Okay... figured out something interesting: Moving the DU array to .data on the H7 Plus:
And then as
So, it seems array access is faster in .data than on the stack. However, for whatever reason, 16-bit accesses are slow. Reading/writing ints is just faster. I guess at the cache level it maybe can't perform 16-bit/8-bit updates directly and has to do things 32 bits at a time.
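A rough sketch of the stack-versus-static placement being compared, assuming a C11 compiler. The names `DU_static`, `du_buffer`, and `du_load` are made up for illustration, and whether a static array ends up in .data or .bss depends on its initializer and the linker script, so this is not necessarily how the array was moved in the experiment above.

```c
#include <stdint.h>

#define DU_LEN 64

/* Static storage keeps the block off the stack. The 32-byte alignment is an
 * assumption chosen to match a typical Cortex-M7 cache line. */
static _Alignas(32) int DU_static[DU_LEN];

/* Hypothetical accessor so callers do not depend on where the block lives. */
static int *du_buffer(void) {
    return DU_static;
}

/* Load one 8x8 block of pixels, applying the usual JPEG level shift of -128. */
static void du_load(const uint8_t *pixels) {
    int *DU = du_buffer();
    for (int i = 0; i < DU_LEN; i++) {
        DU[i] = (int)pixels[i] - 128;
    }
}
```

The trade-off is that a single static block is not reentrant, so this only works if one encode runs at a time.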
Okay, after removing 16-bit writes, making DU the first item on the stack, and making sure it's aligned... there is no positive performance change. Packing the 16-bit writes into 32-bit writes, though, appears to reduce performance further. So... yeah, I'm out of ideas on why the code generates JPEGs slower on the RT1062, while also knowing that the JPEG DCT cycle time was in fact reduced by the optimization (measured via the processor cycle counter).
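Packing two 16-bit writes into one 32-bit write might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and the lane order in memory assumes a little-endian target (which both the Cortex-M7 parts here and x86 hosts are).

```c
#include <stdint.h>
#include <string.h>

/* Pack two signed 16-bit values into one word: lo in bits 0..15, hi in 16..31. */
static inline uint32_t pack_i16x2(int16_t lo, int16_t hi) {
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

static inline int16_t unpack_lo(uint32_t w) {
    return (int16_t)(uint16_t)(w & 0xFFFFu);
}

static inline int16_t unpack_hi(uint32_t w) {
    return (int16_t)(uint16_t)(w >> 16);
}

/* Store two adjacent coefficients with a single 32-bit write.
 * memcpy avoids strict-aliasing problems; compilers typically lower it to
 * one store. On little-endian targets dst[0] == a and dst[1] == b. */
static inline void store_pair(int16_t *dst, int16_t a, int16_t b) {
    uint32_t w = pack_i16x2(a, b);
    memcpy(dst, &w, sizeof w);
}
```

The extra shift and OR needed to build the word may be part of why packing did not help, though that is only a guess.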
I noticed an issue. When you set the resolution to something less than 320x240, you notice issues with the image. I think I can fix this and then the performance will be good.
Force-pushed from 2a3eb33 to ce38de0.
I fixed the issue, and it's still a slowdown on the RT1062. Real head-scratcher.
This PR provides a minor speedup of up to 5% on the software JPEG code. I had hoped it would provide more, given that the vertical DCT has 50% fewer instructions over two column passes.
This means any further optimization of the software JPEG encoder has to target the bit-packing logic.
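The PR's actual SIMD code is not shown here, so the following is only an illustration of how handling two columns per pass can halve the instruction count: two 16-bit lanes share one 32-bit word, and a single lane-wise add or subtract updates both columns. The functions are portable emulations with hypothetical names; on a Cortex-M7 the equivalent operations would be the SADD16/SSUB16 instructions.

```c
#include <stdint.h>

/* Pack two signed 16-bit lanes into one word: lo in bits 0..15, hi in 16..31. */
static inline uint32_t pack16x2(int16_t lo, int16_t hi) {
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Lane-wise wrapping add: carries never cross from the low lane to the high. */
static inline uint32_t add16x2(uint32_t x, uint32_t y) {
    uint32_t lo = (x + y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) + (y >> 16)) << 16;
    return hi | lo;
}

/* Lane-wise wrapping subtract: borrows never cross between lanes. */
static inline uint32_t sub16x2(uint32_t x, uint32_t y) {
    uint32_t lo = (x - y) & 0x0000FFFFu;
    uint32_t hi = ((x >> 16) - (y >> 16)) << 16;
    return hi | lo;
}

/* One butterfly stage applied to two columns at once:
 * column c sits in the low lane and column c+1 in the high lane. */
static inline void butterfly16x2(uint32_t a, uint32_t b,
                                 uint32_t *sum, uint32_t *diff) {
    *sum = add16x2(a, b);
    *diff = sub16x2(a, b);
}
```

With real SIMD instructions each of these helpers collapses to one instruction, which is where the per-pass saving would come from.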
...
Tests are performed on the color bar image and at 90% quality to generate the most work.
On the STM32H7 with hardware jpeg disabled:
VGA, color, @ 90% quality = 84ms (before 87 ms) - 3% improvement
VGA, grayscale, @ 90% quality = 26ms (before 27 ms) - 3% improvement
SVGA, color, @ 90% quality = 130ms (before 136 ms) - 4% improvement
SVGA, grayscale, @ 90% quality = 40ms (before 42 ms) - 5% improvement
XGA, color, @ 90% quality = 216ms (before 225 ms) - 4% improvement
XGA, grayscale, @ 90% quality = 70ms (before 73 ms) - 4% improvement
HD, color, @ 90% quality = 249ms (before 259 ms) - 4% improvement
HD, grayscale, @ 90% quality = 76ms (before 80 ms) - 5% improvement
...
I found weird behavior on the RT1062. While the code provides a minor speedup on the STM32H7, on the RT1062 it causes a slowdown across the board, ending up worse than baseline.
VGA, color, @ 90% quality = 74ms (before 70 ms) - 5% slow down
VGA, grayscale, @ 90% quality = 23ms (before 22 ms) - 5% slow down
SVGA, color, @ 90% quality = 116ms (before 110 ms) - 5% slow down
SVGA, grayscale, @ 90% quality = 36ms (before 35 ms) - 3% slow down
XGA, color, @ 90% quality = 194ms (before 184 ms) - 5% slow down
XGA, grayscale, @ 90% quality = 59ms (before 56 ms) - 5% slow down
HD, color, @ 90% quality = 217ms (before 205 ms) - 5% slow down
HD, grayscale, @ 90% quality = 67ms (before 64 ms) - 5% slow down
After applying #2142 (thinking maybe something weird was happening because of the interrupt overhead), the numbers become:
VGA, color, @ 90% quality = 70ms (before 66 ms) - 5% slow down
VGA, grayscale, @ 90% quality = 22ms (before 20 ms) - 10% slow down
SVGA, color, @ 90% quality = 109ms (before 102 ms) - 7% slow down
SVGA, grayscale, @ 90% quality = 34ms (before 32 ms) - 5% slow down
XGA, color, @ 90% quality = 184ms (before 173 ms) - 5% slow down
XGA, grayscale, @ 90% quality = 56ms (before 52 ms) - 8% slow down
HD, color, @ 90% quality = 207ms (before 195 ms) - 6% slow down
HD, grayscale, @ 90% quality = 65ms (before 61 ms) - 7% slow down
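The percentages in the lists above are relative changes against the "before" time. A small helper for recomputing them is sketched here; `percent_change` is a hypothetical name, and the result is unrounded.

```c
/* Relative change of after_ms versus before_ms, in percent.
 * Negative means faster (improvement), positive means slower. */
static double percent_change(double before_ms, double after_ms) {
    return (after_ms - before_ms) / before_ms * 100.0;
}
```

For example, `percent_change(87.0, 84.0)` is about -3.4 (the STM32H7 VGA color case) and `percent_change(20.0, 22.0)` is about 10 (the RT1062 VGA grayscale case after #2142).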
I'm not sure what's causing the slowdown on the RT1062. Something interesting I noticed, though: changing
int DU[64];
to
int16_t DU[64];
causes a massive slowdown on both the STM32 and the RT1062, before doing any SIMD work. Maybe the issue is that there's an overflow in the DU array causing the quantization to break.
Just for sanity's sake, I added a cycle counter to the DCT loop on the RT1062. Without the new code, the horizontal and vertical loops take about 606 clock cycles on average. With the new code they take 578 clock cycles. So, the code does indeed speed up the loop by about 5%. So... there must be some issue with using an
int16_t DU[64]
array.
Will continue to debug this.
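A sketch of how an average cycle count per DCT pass could be collected, written so it also runs on a host. All names are hypothetical. On a Cortex-M7 the reader function would typically return the DWT cycle counter (`DWT->CYCCNT` in CMSIS); here a fake counter stands in for it, and the 100-cycle cost is an arbitrary placeholder.

```c
#include <stdint.h>

typedef uint32_t (*cycle_read_fn)(void);

typedef struct {
    uint64_t total_cycles; /* sum of all measured intervals */
    uint32_t samples;      /* number of intervals measured */
} cycle_stats_t;

/* Time one call to work() and accumulate the elapsed cycles.
 * Unsigned subtraction keeps the result correct across counter wraparound. */
static void cycle_stats_time(cycle_stats_t *s, cycle_read_fn now,
                             void (*work)(void)) {
    uint32_t start = now();
    work();
    uint32_t elapsed = now() - start;
    s->total_cycles += elapsed;
    s->samples++;
}

/* Mean cycles per measured call, or 0 if nothing was measured. */
static uint32_t cycle_stats_mean(const cycle_stats_t *s) {
    return s->samples ? (uint32_t)(s->total_cycles / s->samples) : 0;
}

/* Host-side stand-ins: a fake counter and a fake workload costing 100 cycles. */
static uint32_t fake_now_value = 0;
static uint32_t fake_now(void) { return fake_now_value; }
static void fake_work(void) { fake_now_value += 100; }
```

Averaging over many blocks matters because individual passes can vary with cache state and interrupts.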