imlib/jpege: Minor DCT speedup. #2157

kwagyeman · 2024-02-18T22:43:59Z

This PR provides a minor 5% speedup on the software jpeg code. I had hoped it would provide more speed given the vertical DCT has 50% less instructions over two column passes.

This means any more software optimization of the software jpeg encoder has to improve the bit packing logic.

...

Tests are performed on the color bar image and at 90% quality to generate the most work.

On the STM32H7 with hardware jpeg disabled:

VGA, color, @ 90% quality = 84ms (before 87 ms) - 3% improvement
VGA, grayscale, @ 90% quality = 26ms (before 27 ms) - 3% improvement

SVGA, color, @ 90% quality = 130ms (before 136 ms) - 4% improvement
SVGA, grayscale, @ 90% quality = 40ms (before 42 ms) - 5% improvement

XGA, color, @ 90% quality = 216ms (before 225 ms) - 4% improvement
XGA, grayscale, @ 90% quality = 70ms (before 73 ms) - 4% improvement

HD, color, @ 90% quality = 249ms (before 259 ms) - 4% improvement
HD, grayscale, @ 90% quality = 76ms (before 80 ms) - 5% improvement

...

I found weird behavior on the RT1062. While the code provides a minor speedup on the STM32H7 it appears to cause a slow down across the board and is worse than baseline.

VGA, color, @ 90% quality = 74ms (before 70 ms) - 5% slow down
VGA, grayscale, @ 90% quality = 23ms (before 22 ms) - 5% slow down

SVGA, color, @ 90% quality = 116ms (before 110 ms) - 5% slow down
SVGA, grayscale, @ 90% quality = 36ms (before 35 ms) - 3% slow down

XGA, color, @ 90% quality = 194ms (before 184 ms) - 5% slow down
XGA, grayscale, @ 90% quality = 59ms (before 56 ms) - 5% slow down

HD, color, @ 90% quality = 217ms (before 205 ms) - 5% slow down
HD, grayscale, @ 90% quality = 67ms (before 64 ms) - 5% slow down

After applying #2142 the numbers become (thinking maybe there's something weird happening because of the interrupt overhead):

VGA, color, @ 90% quality = 70ms (before 66 ms) - 5% slow down
VGA, grayscale, @ 90% quality = 22ms (before 20 ms) - 10% slow down

SVGA, color, @ 90% quality = 109ms (before 102 ms) - 7% slow down
SVGA, grayscale, @ 90% quality = 34ms (before 32 ms) - 5% slow down

XGA, color, @ 90% quality = 184ms (before 173 ms) - 5% slow down
XGA, grayscale, @ 90% quality = 56ms (before 52 ms) - 8% slow down

HD, color, @ 90% quality = 207ms (before 195 ms) - 6% slow down
HD, grayscale, @ 90% quality = 65ms (before 61 ms) - 7% slow down

I'm not sure what's causing the slow down in performance on the RT1062. Something interesting I noticed though, changing int DU[64]; to int16_t DU[64]; causes a massive slow down. On both the STM32 and RT1062 before doing any SIMD work. Maybe the issue is that there's an overflow in the DU array causing the quantization to break.

Just for sanity sake, I added a cycle counter to the DCT loop on the RT1062. Without the new code the horizontal and vertical loops take about ~606 clock cycles on average. After the new code they take 578 clock cycles. So, the code does indeed speeup to the loop time by 5%. So... there must be some issue with using an int16_t DU[64] array.

Will continue to debug this.

kwagyeman · 2024-02-18T23:10:50Z

With no SIMD changes on the RT1062... just changing int DU[64] to int16_t DU[64]:

time: 230 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 230 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 228 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 229 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 228 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

Baseline is:

{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 209 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 209 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 209 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 208 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 207 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 207 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 206 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

The image size output is the same, though.

On the H7 Plus, baseline:

time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 260 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 260 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 260 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 259 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

And with a int16_t DU[64]:

time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 267 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}
time: 268 ms
{"w":1280, "h":720, "type":"jpeg", "size":47425}

The image size changes which is odd given I'm using colorbar to test. The slow down seems to be less for the H7 than the RT1062.

kwagyeman · 2024-02-18T23:26:52Z

Okay, I just checked if there were any overflows beyond int16_t range by doing a comparison for an overflow after every operation and there was not. So, it's not a math issue.

Maybe it's the location of that array.

kwagyeman · 2024-02-18T23:38:31Z

Okay... figured out something interesting:

Moving the DU array to .data on the H7 Plus:

time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 242 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 241 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

And then as int16_t DU[64]:

time: 262 ms
{"w":1280, "h":720, "type":"jpeg", "size":47748}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}
time: 265 ms
{"w":1280, "h":720, "type":"jpeg", "size":48123}

So, it seems array access is faster in .data than on the stack. However, for whatever reason, doing 16-bit accesses is messed up. Reading/writing to ints is just faster. I guess at the cache level maybe it can't perform 16-bit/8-bit updates instantly and has to do things 32-bits at a time.

kwagyeman · 2024-02-19T02:09:39Z

Okay, after removing 16-bit writes and ensuring DU is the first item in the stack and making sure it's aligned... there is no positive performance change. Packing the 16-bit writes into 32-bits, though, appears to reduce performance more.

So... yeah, I'm out of ideas on why the code generates JPEGs slower on the RT1062 while at the same time also knowing that the JPEG DCT cycle time was infact reduced by the optimization (measured via the processor cycle counter).

kwagyeman · 2024-02-20T01:29:30Z

I noticed an issue. When you set the resolution to something less than 320x240 you noticed issues with the image. I think I can fix this and then the performance will be good.

kwagyeman · 2024-02-21T01:49:58Z

I fixed the issue, and it's still a slowdown on the RT1062.

Real head-scratcher.

kwagyeman force-pushed the kwabena/minor_jpeg_du_speedup branch from 189718d to 2a3eb33 Compare February 18, 2024 22:47

kwagyeman changed the title ~~imlib/jpege: 5% software speedup compression speedup.~~ imlib/jpege: Minor DCT speedup. Feb 18, 2024

imlib/jpege: Minor DCT speedup.

ce38de0

kwagyeman force-pushed the kwabena/minor_jpeg_du_speedup branch from 2a3eb33 to ce38de0 Compare February 21, 2024 01:40

kwagyeman added the enhancement label Apr 14, 2024

iabdalkader added the simd label Jul 16, 2024

iabdalkader added this to the v4.5.7 milestone Jul 19, 2024

kwagyeman added the experimental Work in progress. label Aug 2, 2024

iabdalkader removed this from the v4.5.9 milestone Aug 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

imlib/jpege: Minor DCT speedup. #2157

imlib/jpege: Minor DCT speedup. #2157

kwagyeman commented Feb 18, 2024 •

edited

Loading

kwagyeman commented Feb 18, 2024

kwagyeman commented Feb 18, 2024 •

edited

Loading

kwagyeman commented Feb 18, 2024

kwagyeman commented Feb 19, 2024 •

edited

Loading

kwagyeman commented Feb 20, 2024

kwagyeman commented Feb 21, 2024

imlib/jpege: Minor DCT speedup. #2157

Are you sure you want to change the base?

imlib/jpege: Minor DCT speedup. #2157

Conversation

kwagyeman commented Feb 18, 2024 • edited Loading

kwagyeman commented Feb 18, 2024

kwagyeman commented Feb 18, 2024 • edited Loading

kwagyeman commented Feb 18, 2024

kwagyeman commented Feb 19, 2024 • edited Loading

kwagyeman commented Feb 20, 2024

kwagyeman commented Feb 21, 2024

kwagyeman commented Feb 18, 2024 •

edited

Loading

kwagyeman commented Feb 18, 2024 •

edited

Loading

kwagyeman commented Feb 19, 2024 •

edited

Loading