Hello, I modified the main loop of `render_sample` to handle non-uniform sampling. It works fine, even with passes, but with passes the memory cost becomes very high (and abnormal). The adaptive sampling value is retrieved from a Tensor, and the loop looks like this:

```cpp
dr::Loop<Bool> sampling_loop("Adaptive sampling", local_sampling, adaptive_sampling, pos);
while (sampling_loop(local_sampling < adaptive_sampling)) {
    dr::scatter(pos, coords, offset + local_sampling);
    local_sampling += 1;
}
```

For passes, I filter the values with […]. Is there a way to improve this structure to avoid cache flushes?
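A minimal sketch of what "retrieved from a Tensor" could look like with Dr.Jit's C++ API, for context (a CUDA backend is assumed, and `spp_tensor`, `pixel_index`, and `per_pixel_spp` are placeholder names, not code from the post):

```cpp
#include <drjit/cuda.h>    // dr::CUDAArray
#include <drjit/tensor.h>  // dr::Tensor

namespace dr = drjit;

using Float    = dr::CUDAArray<float>;
using UInt32   = dr::CUDAArray<uint32_t>;
using TensorXf = dr::Tensor<Float>;

// Look up one target sample count per pixel from the tensor's flat storage.
UInt32 per_pixel_spp(const TensorXf &spp_tensor, const UInt32 &pixel_index) {
    Float spp = dr::gather<Float>(spp_tensor.array(), pixel_index);
    return UInt32(spp); // truncate to an integer sample budget
}
```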
Hi @Angom8

I think I understood what you're doing.

The high memory usage is somewhat expected, I believe. With this approach you effectively need to write `pos` to global memory, so that requires a storage of `N_rays * 3 * 4` bytes. In general, you want to avoid storing anything that scales with your number of rays (that's one of the goals of megakernels). You should be able to write this without any `scatter` to generate `pos`.

I'm confused about the passes here. Are you not updating the positions between them? If so, why are they run separately? I don't think this matters in any case, but I think I'm misunderstanding your explanation.

Fundamentally, I don't think there is any reason why you shouldn't be able to achieve perfect cache reuse here. Every render step has the same set of computations; they only differ in what their initial rays are.
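To put the `N_rays * 3 * 4` bytes figure above in perspective: at 1920×1080 with 64 samples per pixel, that is already roughly 1.6 GB for the positions alone. Below is a minimal sketch of the scatter-free structure suggested here (a CUDA backend is assumed; `adaptive_render` and `render_one_sample` are hypothetical names, not part of Mitsuba's API): each lane keeps its running sample index and radiance accumulator as loop state, so nothing that scales with rays × samples is ever written to global memory.

```cpp
#include <drjit/cuda.h>   // dr::CUDAArray
#include <drjit/loop.h>   // dr::Loop

namespace dr = drjit;

using Float  = dr::CUDAArray<float>;
using UInt32 = dr::CUDAArray<uint32_t>;
using Bool   = dr::mask_t<Float>;

// Hypothetical helper: maps (pixel index, sample index) to a radiance sample.
extern Float render_one_sample(const UInt32 &pixel_index, const UInt32 &sample_idx);

// One lane per pixel; each lane consumes its own adaptive sample budget.
Float adaptive_render(const UInt32 &pixel_index, const UInt32 &adaptive_spp) {
    size_t n = dr::width(pixel_index);
    UInt32 sample_idx = dr::zeros<UInt32>(n);
    Float  accum      = dr::zeros<Float>(n);

    dr::Loop<Bool> loop("Adaptive sampling", sample_idx, accum);
    while (loop(sample_idx < adaptive_spp)) {
        // The sample position is derived on the fly from (pixel, sample index);
        // it stays in registers and is never scattered to a global 'pos' buffer.
        accum += render_one_sample(pixel_index, sample_idx);
        sample_idx += 1;
    }

    // Average over the per-pixel budget (assumed to be at least one sample).
    return accum / Float(adaptive_spp);
}
```

The per-pixel budget from the question could be passed in as `adaptive_spp` here; the only remaining global write would be the final accumulation into the film, which scales with the number of pixels rather than with the number of samples.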