[Usage recommendation request] Decompressing large EXRs in real-time. #1755

GeorgeTattersallFn · 2024-05-15T11:37:12Z

Hi,
We're doing a virtual production based research project at Foundry. As part of that, we're investigating OpenEXR based solutions that meet the following criteria:

1. High resolution (8k, 16k, and beyond), HDR, RGBA image sequences (deep data and other channels not necessary).
2. File sizes as small as possible.
3. Real-time (24Hz+) playback.

It's fine to presume we have the PCIe bandwidth to perform transfers at rate, as well as good processing power on both the CPU and GPU front.

We're aware that this is very much "having one's cake, and eating it", however lots of OpenEXR-based options seem close, and we're wondering if the EXR experts can see a trick we've missed.

We're fine with lossy compression, which opens the doors to B44[A] and DWA[A/B]. B44 satisfies 1. & 3., but produces relatively large files when compared to DWA. DWA satisfies 1. & 2., but we would need better performance on the decompression.

With 16 threads, for 8k (8192x4096) scanline DWA I'm seeing around 60ms to read a frame, and for 16k I'm seeing around 230ms. Are there any missed tricks on either the encoding or decoding side which could be used to speed up this process? We've thought about a GPU implementation of DWA decoding, but from what we can tell, DWA's entropy encoding uses a combination of Huffman, RLE, deflate and zip. None of these is particularly GPU friendly, and all together they sound very GPU unfriendly.
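To put the numbers above against the 24Hz target, here is a rough back-of-the-envelope sketch. It assumes half-float RGBA (2 bytes per channel, 4 channels) and that "16k" means 16384x8192; neither is stated explicitly in the thread, and the function names are purely illustrative.

```cpp
#include <cstdint>

// Time available per frame, in milliseconds, at a given playback rate.
constexpr double frameBudgetMs(double hz) { return 1000.0 / hz; }

// Decoded size of one RGBA frame, assuming 16-bit half-float channels.
constexpr std::uint64_t rgbaHalfFrameBytes(std::uint64_t width, std::uint64_t height) {
    return width * height * 4 /* channels */ * 2 /* bytes per half */;
}

// Decoded bytes per second that must be produced to sustain `hz` playback.
constexpr double decodedBytesPerSecond(std::uint64_t width, std::uint64_t height, double hz) {
    return static_cast<double>(rgbaHalfFrameBytes(width, height)) * hz;
}

// Factor by which a measured per-frame read time exceeds the frame budget.
// Values above 1.0 mean the read is too slow for serial real-time playback.
constexpr double requiredSpeedup(double readMs, double hz) {
    return readMs / frameBudgetMs(hz);
}
```

Under these assumptions the budget at 24Hz is about 41.7ms per frame, so the 60ms 8k read is roughly 1.44x over budget and the 230ms 16k read roughly 5.5x over. A decoded 8k frame is 256 MiB, which works out to 6 GiB/s of decoded pixels at 24Hz.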

Alternatively, are we missing a trick with regards to any of the other compression methods which could help us meet our criteria?

Many thanks,
George


meshula · 2024-05-16T03:26:28Z

As a baseline, do you have metrics for uncompressed scanline and tiled reads?

GeorgeTattersallFn · 2024-05-16T14:28:18Z

Sure, I should have included some more metrics initially - here's what you asked for:

Type	2k	4k	8k	16k
Uncompressed Scanlines	11.32ms	38.18ms	143.57ms	554.91ms
Uncompressed Tiled	12.11ms	43.66ms	156.71ms	560.20ms
DWAB Scanlines	13.33ms	32.94ms	67.87ms	248.48ms
DWAB Tiled	9.79ms	23.24ms	67.52ms	235.73ms

Those results were taken with calls to RgbaInputFile::readPixels and TiledRgbaInputFile::readTiles, surrounded by QueryPerformanceCounter. I saw a lot of your suggestions in #1717 for a similar-ish (inverse) problem, and I don't think our method of loading from storage should be an issue: we're able to saturate PCIe in our actual app (not the test app I used for these results). So I'm mainly wondering about decompression performance.
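For anyone wanting to reproduce this kind of measurement, a minimal timing helper might look like the sketch below. It swaps QueryPerformanceCounter for the portable std::chrono::steady_clock, and the names timeRunsMs and medianMs are hypothetical, not part of OpenEXR or of the test app described above.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run `work` `repeats` times and return each wall-clock duration in milliseconds.
template <typename Work>
std::vector<double> timeRunsMs(Work&& work, std::size_t repeats) {
    using Clock = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        const auto start = Clock::now();
        work();
        const auto stop = Clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return samples;
}

// Median of the samples; taken by value because nth_element reorders them.
double medianMs(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    const double upper = samples[mid];
    if (samples.size() % 2 == 1) return upper;
    // Even count: average the two middle values.
    const double lower = *std::max_element(samples.begin(), samples.begin() + mid);
    return 0.5 * (lower + upper);
}
```

The callable passed in would wrap the read being measured, for example a lambda that calls readPixels on an already opened RgbaInputFile with its frame buffer set. Reporting the median over several runs reduces the influence of one-off outliers such as the first, cold read.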

Any thoughts appreciated.

meshula · 2024-05-18T21:01:48Z

This is fantastic data, thank you. Are you aware of the staging/cpp-rewrite branch? The Core has been rewritten in C with concurrency and general performance in mind; it will be merged to main once it matures. It would be great to have the same metrics for the branch, although I hate to ask you to take on more work.

GeorgeTattersallFn · 2024-05-20T11:44:42Z

I wasn't aware of that branch - thanks for pointing it out. I've given it a test run and here's the equivalent data:

Type	2k	4k	8k	16k
Uncompressed Scanlines	12.31ms	33.26ms	96.85ms	275.59ms
Uncompressed Tiled	7.44ms	19.96ms	77.50ms	253.73ms
DWAB Scanlines	21.48ms	42.18ms	107.85ms	380.17ms
DWAB Tiled	14.86ms	31.81ms	99.20ms	391.36ms
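The two tables can be compared directly by dividing the earlier timings by the branch timings. The sketch below does that for the figures posted in this thread; the struct and function names are illustrative only, and "baseline" simply means the first table.

```cpp
#include <array>
#include <cstdio>

// One table row: timings in milliseconds for 2k, 4k, 8k and 16k frames.
struct Row {
    const char* type;
    std::array<double, 4> baselineMs; // first table in the thread
    std::array<double, 4> branchMs;   // staging branch table
};

// Figures copied from the two tables posted above.
const Row kRows[] = {
    {"Uncompressed Scanlines", {11.32, 38.18, 143.57, 554.91}, {12.31, 33.26, 96.85, 275.59}},
    {"Uncompressed Tiled",     {12.11, 43.66, 156.71, 560.20}, {7.44, 19.96, 77.50, 253.73}},
    {"DWAB Scanlines",         {13.33, 32.94, 67.87, 248.48},  {21.48, 42.18, 107.85, 380.17}},
    {"DWAB Tiled",             {9.79, 23.24, 67.52, 235.73},   {14.86, 31.81, 99.20, 391.36}},
};

// Ratio above 1.0 means the branch is faster than the baseline.
double speedup(double baselineMs, double branchMs) { return baselineMs / branchMs; }

// Print one speedup factor per resolution for every row.
void printSpeedups() {
    std::printf("%-24s %6s %6s %6s %6s\n", "Type", "2k", "4k", "8k", "16k");
    for (const Row& row : kRows) {
        std::printf("%-24s", row.type);
        for (std::size_t i = 0; i < row.baselineMs.size(); ++i) {
            std::printf(" %5.2fx", speedup(row.baselineMs[i], row.branchMs[i]));
        }
        std::printf("\n");
    }
}
```

For the 16k column this gives roughly 2.01x for uncompressed scanlines and about 0.65x for DWAB scanlines, i.e. the branch is about twice as fast on raw data but noticeably slower on DWAB in these measurements.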

The speedups for uncompressed data are quite incredible! Unfortunately, as things currently stand, it seems like DWA compressions have a slight decompression performance hit on my system. I've double checked everything and tried to include a brief look at where my CPU is spending time on staging/cpp_core_rewrite:

16k raw scanline:
|--------------------------------------------------| total CPU time 100%
|--------------------------------------|             unpack_16bit_4chan_interleave_rev 75.11%
|--------|                                           default_read_func 15.61%
|-----|                                              other ~9.28%

16k raw tile:
|--------------------------------------------------| total CPU time 100%
|----------------------------------------|           unpack_16bit_4chan_interleave_rev 79.97%
|----------|                                         default_read_func 19.19%
|-|                                                  other ~0.84%

16k DWAB scanline:
|--------------------------------------------------| total CPU time 100%
|-------------------------------------|              DwaCompressor_uncompress 73.6%
|----------|                                         unpack_16bit_4chan_interleave_rev 19.21%
|--|                                                 DwaCompressor_destroy 4.68%
|-|                                                  other ~2%


16k DWAB tiles:
|--------------------------------------------------| total CPU time 100%
|----------------------------|                       DwaCompressor_uncompress 56.55%
|----------------|                                   unpack_16bit_4chan_interleave_rev 31.85%
|-----|                                              DwaCompressor_destroy 10.82%
|-|                                                  other ~1%

A slightly finer grained, but still brief, look at DwaCompressor_uncompress on staging/cpp_core_rewrite:

DwaCompressor_uncompress (scanline & tiled are similar)
|--------------------------------------------------| total CPU time 100%
|----------|                                         fromHalfZigZag_scalar ~20%
|----------|                                         DwaCompressor_initializeBuffers ~20% (scanline 22%, tiled 18%)
|---------|                                          convertFloatToHalf64_scalar ~17%
|------|                                             internal_huf_decompress ~11%
|----------------|                                   other ~32%
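As context for the fromHalfZigZag_scalar entry: judging by its name, that step undoes the zigzag ordering of an 8x8 block of coefficients (and converts half to float). The sketch below shows only the reordering part, using the JPEG-style zigzag scan as an assumption; the actual table and signature in OpenEXR may differ, and makeZigZagOrder and fromZigZag are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>

// Build a JPEG-style zigzag scan order for an 8x8 block:
// order[k] is the row-major index of the k-th coefficient along the scan.
std::array<int, 64> makeZigZagOrder() {
    std::array<int, 64> order{};
    int k = 0;
    for (int s = 0; s <= 14; ++s) {          // s = row + column of the anti-diagonal
        const int lo = std::max(0, s - 7);   // smallest valid row on this diagonal
        const int hi = std::min(s, 7);       // largest valid row on this diagonal
        if (s % 2 == 0) {
            // Even diagonals are walked from bottom-left to top-right.
            for (int r = hi; r >= lo; --r) order[k++] = r * 8 + (s - r);
        } else {
            // Odd diagonals are walked from top-right to bottom-left.
            for (int r = lo; r <= hi; ++r) order[k++] = r * 8 + (s - r);
        }
    }
    return order;
}

// Scatter 64 values stored in zigzag order back into row-major order.
// A real decoder would also convert each 16-bit half to float here.
void fromZigZag(const std::uint16_t* src, std::uint16_t* dst) {
    static const std::array<int, 64> order = makeZigZagOrder();
    for (int k = 0; k < 64; ++k) dst[order[k]] = src[k];
}
```

The scattered writes in the loop are a fixed permutation, which is why this kind of step is usually a candidate for a SIMD shuffle-based implementation rather than a scalar loop.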

I notice that, as you mentioned, the cpp_core_rewrite DWA codepaths are quite different from the main ones, so it might not be much use to draw a comparison, but here's a brief attempt:

DwaCompressor::uncompress (scanline & tiled are similar)
|--------------------------------------------------| total CPU time 100%
|---|                                                Imf_3_3::<anon namespace>::fromHalfZigZag_scalar ~6%
|--------------|                                     Imf_3_3::<anon namespace>::convertFloatToHalf64_scalar ~27%
|------|                                             Imf_3_3::hufUncompress ~12%
|----------------------------|                       other ~55%
(Unable to find an equivalent to DwaCompressor_initializeBuffers; Imf_3_3::DwaCompressor::initializeBuffers seems to take up ~0.07% of DwaCompressor::uncompress, so I imagine this functionality happens elsewhere.)

How do these timings line up with what you'd expect from the branch? I don't believe my CPU is making use of a lot of the optimisations in internal_dwa_simd.h, which is a shame.

Aside from that, if you have any more suggestions to try, they'd be appreciated. Though, if this is the best we'll get on CPU, that's fine - it feels like real-time is quite far away, and I realise I'm asking for long shots.

GeorgeTattersallFn changed the title ~~[Usage recommendation request] Decompressing huge EXRs in real-time.~~ [Usage recommendation request] Decompressing large EXRs in real-time. May 15, 2024
