Custom BSDF: Accessing the rows of a TensorXf in the megakernel mode #1156

sapo17 · 2024-04-30T17:54:07Z


Summary

Once again, thanks for the amazing work and sorry for another silly question.

I am trying to implement a custom BRDF in which I look up rows of a mi.TensorXf, with the row indices computed from the sample2 value.

I would appreciate it if you could give me some feedback.

System configuration

System information:

OS: Windows-10
CPU: Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
GPU: NVIDIA GeForce RTX 4080
Python: 3.11.8 | packaged by Anaconda, Inc. | (main, Feb 26 2024, 21:34:05) [MSC v.1916 64 bit (AMD64)]
NVidia driver: 551.86
CUDA: 12.1.66
LLVM: 18.1.1

Dr.Jit: 0.4.4
Mitsuba: 3.5.0
Is custom build? False
Compiled with: MSVC 19.38.33133.0
Variants:
scalar_rgb
scalar_spectral
cuda_ad_rgb
llvm_ad_rgb

Installed Mitsuba with:

   ... conda venv.yml file
  - pip
  - pip:
    - mitsuba

Installed PyTorch with:

conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Description

My original code is more involved: it uses dr.binary_search and some computation on the sample2.x value to find the relevant indices of the mi.TensorXf that I am trying to access.

For simplicity, here is a silly reproducer that should roughly give you an idea of what I am trying to do.

Unless I set the following flags

dr.set_flag(dr.JitFlag.VCallRecord, False)
dr.set_flag(dr.JitFlag.LoopRecord, False)

the following lines of code do not work on my custom bsdf's sample() or eval_pdf() method:

    # ... my custom BSDF
    def sample(
        self: mi.BSDF,
        ctx: mi.BSDFContext,
        si: mi.SurfaceInteraction3f,
        sample1: float,
        sample2: mi.Point2f,
        active: bool = True,
    ):

        cos_theta_i = mi.Frame3f.cos_theta(si.wi)

        # fill up the BSDFSample
        bs = mi.BSDFSample3f()

        active &= cos_theta_i > 0.0
        """ NOTE: The original code is like this:
            if (unlikely(dr::none_or<false>(active) ||
                     !ctx.is_enabled(BSDFFlags::DiffuseReflection)))
            return { bs, 0.f };

            # Somehow cannot combine `active` with the expression below
        """
        if not ctx.is_enabled(mi.BSDFFlags.DiffuseReflection):
            return (bs, 0.0)
        
        ### here is the reproducer
        a, b = 10, 9
        tmp_idx = dr.clamp(mi.UInt(sample2.x * a), 0, a - 1)
        tmp_data = dr.zeros(mi.TensorXf, (a, b))
        value = tmp_data[tmp_idx]
        ###

        bs.wo = mi.warp.square_to_cosine_hemisphere(sample2)
        bs.pdf = mi.warp.square_to_cosine_hemisphere_pdf(bs.wo)
        bs.eta = 1.0
        bs.sampled_type = mi.UInt32(+mi.BSDFFlags.DiffuseReflection)
        bs.sampled_component = 0

        value = self.m_albedo.eval(si, active)

        return (
            bs,
            dr.select(active & (bs.pdf > 0.0), mi.depolarizer(value), 0.0),
        )

I also tried the following to disable the loop record locally like this but this did not work:

### reproducer
loop_record = dr.flag(dr.JitFlag.LoopRecord)
vcall_record = dr.flag(dr.JitFlag.VCallRecord)
dr.set_flag(dr.JitFlag.LoopRecord, False)
dr.set_flag(dr.JitFlag.VCallRecord, False)

a, b = 10, 9
tmp_idx = dr.clamp(mi.UInt(sample2.x * a), 0, a - 1)
tmp_data = dr.zeros(mi.TensorXf, (a, b))
value = tmp_data[tmp_idx]

dr.set_flag(dr.JitFlag.LoopRecord, loop_record)
dr.set_flag(dr.JitFlag.VCallRecord, vcall_record)
###

I also tried flattening the mi.TensorXf and accessing it using dr.gather(), but unfortunately that also did not work.

For efficiency reasons, I am trying to run things in megakernel (recorded) mode. I've seen different discussions related to this (#1004, #866), but I am still wondering whether it is possible to access the mi.TensorXf after having done some computation on the sample2.x value to obtain the indices of the rows I need.

Is there a way to do this without disabling the loop record? I would really appreciate your feedback!

PS Please let me know if you need further information


merlinND · 2024-05-02T13:00:33Z

Collaborator

Hello @sapo17,

From the reproducer I am not sure I understand the role of tmp_data.
Is it a fully separate buffer, whose size and contents have nothing to do with the current rays being traced? If so, it should definitely be possible to create & fill it beforehand (e.g. in the constructor of your BSDF) and gather from it in the sample / eval method in megakernel mode, just like we read from e.g. textures.

Note however that you will have to gather values "one by one" (from the point of view of each thread), you cannot get a dynamically-sized row of the TensorXf inside of a megakernel (tmp_data[tmp_idx]).

Please also remember to always share full error messages when reporting issues or asking questions, it helps narrow down the problem.

3 replies

sapo17 May 2, 2024
Author

Hello @merlinND,

Thank you for your quick response, I really appreciate it!

Is it a fully separate buffer, whose size and contents have nothing to do with the current rays being traced?

Yep. In the original code, we initialize these tensors in the constructor. Our goal is to use some of the rows during sample() or eval_pdf(). However, we choose the rows dynamically: row = sample.x * self.a.
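The row selection described above can be sketched outside of Mitsuba. The snippet below is a minimal illustration that uses NumPy as a stand-in for Dr.Jit arrays (an assumption, made so that it runs without a renderer). `sample_to_row` is a hypothetical helper name, and the clamp mirrors the `dr.clamp(mi.UInt(sample2.x * a), 0, a - 1)` expression from the reproducer.

```python
import numpy as np


def sample_to_row(u: np.ndarray, a: int) -> np.ndarray:
    """Map uniform samples in [0, 1] to row indices in [0, a - 1].

    Hypothetical helper; NumPy stands in for Dr.Jit here.
    """
    # Truncate toward zero, like the mi.UInt(...) conversion in the reproducer.
    row = (u * a).astype(np.int64)
    # Clamp so that a sample of exactly 1.0 still yields a valid row.
    return np.clip(row, 0, a - 1)


u = np.array([0.0, 0.05, 0.5, 0.95, 1.0])
rows = sample_to_row(u, a=10)
```

Each entry of `rows` is one index per sample, so the result has the same width as the input samples.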

I extended the reproducer below. Please note, however, that it is just a reproducer and accessing tmp_data in this example is meaningless. To give you a broad idea: we have some frozen distributions (i.e., mi.TensorXf's) that we would like to use for sampling and pdf evaluation. Unfortunately, I can't go into much detail, but I hope it makes more sense.

I put everything (scene, custom BRDF, jupyter notebook etc.) in this google drive, I hope it helps.

Jupyter Notebook:

# %%
import matplotlib.pyplot as plt
import mitsuba as mi
import drjit as dr
import torch

# dr.set_flag(dr.JitFlag.VCallRecord, False)
# dr.set_flag(dr.JitFlag.LoopRecord, False)

# %%
spp=1

# %%
mi.set_variant("cuda_ad_rgb")

# %%
a, b = 10, 9
tmp_data = dr.arange(mi.Float, 0, a * b)
tmp_data = mi.TensorXf(tmp_data, shape=(a, b))

# %%
tmp_data.torch()

# %%
def get_coefficients(
    tensor: mi.TensorXf,
    row: mi.UInt,
) -> mi.TensorXf:
    """
    Args:
        - tensor (mi.TensorXf): Expecting a [a, b] shaped tensor. 
        - row (mi.UInt): Row indices to access corresponding coefficients.

    Returns:
        - mi.TensorXf: A [n, b] shaped tensor, where n is the length of row.
    """

    # get meta information and flatten
    _, b = tensor.shape
    tensor = dr.ravel(tensor)
    n = len(row)

    start_indices = row * b
    range_indices = dr.arange(mi.UInt, b)
    indices = dr.tile(range_indices, n) + dr.repeat(start_indices, b)
    result = dr.gather(mi.Float, tensor, indices)
    result = mi.TensorXf(result, shape=(n, b))

    return result

# %%
n = 2
sample2 = mi.Point2f(torch.rand((n, 2)).cuda())
tmp_idx = dr.clamp(mi.UInt(sample2.x * a), 0, a - 1)

# %%
tmp_idx

# %%
tmp_data[tmp_idx].torch()

# %%
get_coefficients(tmp_data, tmp_idx).torch()

In the example above, everything works as expected. We generate some arbitrary samples (sample2) and use them to access the corresponding rows in tmp_data. Both gathering (i.e., using get_coefficients()) and direct indexing (i.e., tmp_data[tmp_idx]) give equivalent results. The full notebook can be found here.

Now, the custom BSDF is defined like this:

import mitsuba as mi
import drjit as dr
import torch


class CustomDiffuseBSDF(mi.BSDF):
    def __init__(self: mi.BSDF, props: mi.Properties) -> None:
        mi.BSDF.__init__(self, props)

        self.m_albedo = props.get("reflectance", [0.5])

        reflection_flags = (
            mi.BSDFFlags.DiffuseReflection | mi.BSDFFlags.FrontSide
        )
        self.m_components = [reflection_flags]
        self.m_flags = reflection_flags

        ### just some data for reproducibility
        self.a, self.b = 10, 9
        self.some_data = dr.arange(mi.Float, 0, self.a * self.b)
        self.some_data = mi.TensorXf(self.some_data, shape=(self.a, self.b))
        ###

    def sample(
        self: mi.BSDF,
        ctx: mi.BSDFContext,
        si: mi.SurfaceInteraction3f,
        sample1: float,
        sample2: mi.Point2f,
        active: bool = True,
    ):

        cos_theta_i = mi.Frame3f.cos_theta(si.wi)

        # fill up the BSDFSample
        bs = mi.BSDFSample3f()

        active &= cos_theta_i > 0.0
        """ NOTE: The original code is like this:
            if (unlikely(dr::none_or<false>(active) ||
                     !ctx.is_enabled(BSDFFlags::DiffuseReflection)))
            return { bs, 0.f };

            # Somehow cannot combine `active` with the expression below
        """
        if not ctx.is_enabled(mi.BSDFFlags.DiffuseReflection):
            return (bs, 0.0)
        
        ###
        tmp_idx = dr.clamp(mi.UInt(sample2.x * self.a), 0, self.a - 1)
        # value = self.some_data[tmp_idx]
        value = self.get_coefficients(self.some_data, tmp_idx)

        # in the original code we use these obtained rows (`value`)
        # for something else which we can ignore for this example
        ###

        bs.wo = mi.warp.square_to_cosine_hemisphere(sample2)
        bs.pdf = mi.warp.square_to_cosine_hemisphere_pdf(bs.wo)
        bs.eta = 1.0
        bs.sampled_type = mi.UInt32(+mi.BSDFFlags.DiffuseReflection)
        bs.sampled_component = 0

        value = self.m_albedo.eval(si, active)

        return (
            bs,
            dr.select(active & (bs.pdf > 0.0), mi.depolarizer(value), 0.0),
        )

    def eval(
        self: mi.BSDF,
        ctx: mi.BSDFContext,
        si: mi.SurfaceInteraction3f,
        wo: mi.Vector3f,
        active: bool = True,
    ) -> mi.Color3f:
        if not ctx.is_enabled(mi.BSDFFlags.DiffuseReflection):
            return 0.0

        cos_theta_i = mi.Frame3f.cos_theta(si.wi)
        cos_theta_o = mi.Frame3f.cos_theta(wo)

        active &= (cos_theta_i > 0.0) & (cos_theta_o > 0.0)

        value = self.m_albedo.eval(si, active) * dr.inv_pi * cos_theta_o

        return mi.depolarizer(value) & active

    def pdf(
        self: mi.BSDF,
        ctx: mi.BSDFContext,
        si: mi.SurfaceInteraction3f,
        wo: mi.Vector3f,
        active: bool = True,
    ) -> float:
        if not ctx.is_enabled(mi.BSDFFlags.DiffuseReflection):
            return 0.0

        cos_theta_i = mi.Frame3f.cos_theta(si.wi)
        cos_theta_o = mi.Frame3f.cos_theta(wo)

        pdf = mi.warp.square_to_cosine_hemisphere_pdf(wo)

        return dr.select(
            dr.and_(cos_theta_i > 0.0, cos_theta_o > 0.0), pdf, 0.0
        )

    def eval_pdf(
        self: mi.BSDF,
        ctx: mi.BSDFContext,
        si: mi.SurfaceInteraction3f,
        wo: mi.Vector3f,
        active: bool = True,
    ):
        if not ctx.is_enabled(mi.BSDFFlags.DiffuseReflection):
            return 0.0, 0.0

        cos_theta_i = mi.Frame3f.cos_theta(si.wi)
        cos_theta_o = mi.Frame3f.cos_theta(wo)

        active &= (cos_theta_i > 0.0) & (cos_theta_o > 0.0)

        value = self.m_albedo.eval(si, active) * dr.inv_pi * cos_theta_o
        pdf = mi.warp.square_to_cosine_hemisphere_pdf(wo)

        return (mi.depolarizer(value) & active, dr.select(active, pdf, 0.0))

    def traverse(self: mi.Object, callback) -> None:
        callback.put_parameter(
            "reflectance", self.m_albedo, mi.ParamFlags.Differentiable
        )

    def parameters_changed(self: mi.Object, keys=...) -> None:
        print("There is nothing to do here")

    def to_string(self):
        return (
            "CustomDiffuseBSDF[\n" "    reflectance=%s,\n" "]" % (self.m_albedo)
        )

    def square_to_uniform_disk(self, sample: torch.Tensor):
        radius = torch.sqrt(sample[:, 0])
        theta = 2 * torch.pi * sample[:, 1]
        x = radius * torch.cos(theta)
        y = radius * torch.sin(theta)
        return torch.stack((x, y), dim=1)

    def square_to_cosine_hemisphere(self, sample: torch.Tensor):
        p = self.square_to_uniform_disk(sample)
        z = torch.clamp(1.0 - torch.norm(p, p=2) ** 2, min=0.0)
        z = z.view(1, 1)
        return mi.Vector3f([p[:, 0], p[:, 1], z[:, 0]])
    
    def get_coefficients(
        self,
        tensor: mi.TensorXf,
        row: mi.UInt,
    ) -> mi.TensorXf:
        """
        Args:
            - tensor (mi.TensorXf): Expecting a [a, b] shaped tensor. 
            - row (mi.UInt): Row indices to access corresponding coefficients.

        Returns:
            - mi.TensorXf: A [n, b] shaped tensor, where n is the length of row.
        """

        # get meta information and flatten
        _, b = tensor.shape
        tensor = dr.ravel(tensor)
        n = len(row)

        start_indices = row * b
        range_indices = dr.arange(mi.UInt, b)
        mi.Log(mi.LogLevel.Warn, f"Here1!")
        indices = dr.tile(range_indices, n) + dr.repeat(start_indices, b)
        mi.Log(mi.LogLevel.Warn, f"Here2!")
        result = dr.gather(mi.Float, tensor, indices)
        result = mi.TensorXf(result, shape=(n, b))

        return result

Now, if I try to render a scene that uses this custom BSDF without disabling the megakernel, unfortunately, the kernel crashes. I managed to find the line where the kernel gives up:

mi.Log(mi.LogLevel.Warn, f"Here1!")
indices = dr.tile(range_indices, n) + dr.repeat(start_indices, b) # this line causes the crash
mi.Log(mi.LogLevel.Warn, f"Here2!")

The only error message that I get is:

2024-05-02 17:24:22 INFO main [xml.cpp:1380] Loading XML file "..\scenes\matpreview\scene_custom_diffuse.xml" with variant "cuda_ad_rgb"..
2024-05-02 17:24:22 INFO main [Scene] Building scene in OptiX ..
2024-05-02 17:24:22 INFO main [Scene] OptiX ready. (took 74ms)
2024-05-02 17:24:22 INFO main [xml.cpp:1398] Done loading XML file "..\scenes\matpreview\scene_custom_diffuse.xml" (took 122ms).
2024-05-02 17:24:22 INFO main [SamplingIntegrator] Starting render job (683x512, 1 sample)
2024-05-02 17:24:22 WARN main [custom_diffuse_brdf.py:176] get_coefficients(): Here1!
The Kernel crashed while executing code in the current cell or a previous cell. 
Please review the code in the cell(s) to identify a possible cause of the failure. 
Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. 
View Jupyter [log](command:jupyter.viewOutput) for further details.

My naive guess is that we can't use sample.x (or start_indices, as it relies on sample.x) because it is not evaluated yet, but I might be off.

My overall question is: Is it somehow possible/legal to do this operation? If yes, how can we do it?

Once again thank you very much!

merlinND May 2, 2024
Collaborator

Thanks for the additional detail @sapo17.

I think that the key thing to understand is that in symbolic mode, e.g. inside of a symbolic (recorded) loop of the path tracer, the wavefront size (== width of the variables == number of threads the kernel will be launched with) is fixed.
Outside of a symbolic loop, Dr.Jit would automatically introduce a kernel boundary for you, and launch the different kernels with their required widths. But in the body of a recorded loop, that is not possible.

In your example code, you have:

dr.width(tmp_idx) == dr.width(sample2.x): the wavefront size (= number of rays), which is fine
And then you attempt to create arrays of size wavefront_size * n (?) in get_coefficients()

So the key constraint to keep in mind is that all variables must be of the same width. The fact that there is a mi.TensorXf type doesn't change that. In fact, mi.TensorXf is just a thin wrapper around a plain flat array (tmp_data.array).
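To make the width constraint concrete, here is a small sketch, again with NumPy standing in for Dr.Jit (an assumption for illustration; NumPy happily computes the wider array, whereas a recorded loop cannot change width). It shows that the tile/repeat construction from get_coefficients() yields n * b indices, while a per-thread index array for a fixed column has exactly n entries.

```python
import numpy as np

a, b = 10, 9   # tensor shape, as in the reproducer
n = 4          # stand-in for the wavefront size (number of rays)

flat = np.arange(a * b, dtype=np.float32)   # plays the role of tensor.array
row = np.array([0, 3, 3, 9])                # one row index per thread, width n

# Construction used in get_coefficients(): one entry per (thread, column) pair.
start = row * b
wide_indices = np.tile(np.arange(b), n) + np.repeat(start, b)

# Per-thread alternative: one index per thread for a fixed column k.
k = 2
per_thread_indices = start + k

print(wide_indices.shape[0], per_thread_indices.shape[0])
```

The first width is b times larger than the wavefront, which is the mismatch described above; the second matches the wavefront and is the shape a gather inside the recorded loop can work with.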

However, it is possible to:

Gather and scatter from / to arrays that have different dimensions (as long as the number of values you are reading / writing matches the wavefront size)
Do a Python loop to repeatedly gather the n values you need, especially since n seems to be fixed here (not data-dependent).

So I would expect the corrected code to look something like:

coeffs = [
    dr.gather(mi.Float, tensor.array, indices + k, active)
    for k in range(n)
]

or possibly you will be able to use a notation like tensor[indices, k] directly, but you should double-check in isolation that this notation works as you expect first.
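As a sanity check of the index arithmetic, the following sketch compares a per-column gather with plain row indexing. It uses NumPy in place of Dr.Jit so that it can run anywhere (an assumption for illustration). `gather_row_columns` is a hypothetical name, `flat[idx]` plays the role of `dr.gather(mi.Float, tensor.array, idx, active)`, and masked lanes are simply set to zero here. The loop runs over the b columns and each thread's start index is row * b, which is one reading of `indices + k` in the snippet above.

```python
import numpy as np


def gather_row_columns(flat, row, b, active):
    """Return a list of b arrays, each holding one value per thread."""
    start = row * b  # offset of each thread's row in the row-major buffer
    return [
        # Masked lanes read index 0 and are zeroed out afterwards.
        np.where(active, flat[np.where(active, start + k, 0)], 0.0)
        for k in range(b)
    ]


a, b = 10, 9
tensor = np.arange(a * b, dtype=np.float32).reshape(a, b)
flat = tensor.ravel()                         # plays the role of tensor.array

row = np.array([0, 3, 3, 9])                  # one row index per thread
active = np.array([True, True, False, True])  # third lane is masked out

coeffs = gather_row_columns(flat, row, b, active)

# Stacking the columns recovers tensor[row] for the active lanes.
stacked = np.stack(coeffs, axis=1)
```

In a BSDF, the buffer would be created once in the constructor and the b per-column arrays consumed directly in sample() or eval_pdf(), so that every variable keeps the wavefront width.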

Answer selected by sapo17

sapo17 May 3, 2024
Author

Thank you very much for your detailed answer. I will now think about it and depending on my progress either close the discussion or come back with another question :-)
