Chong Zeng · Yue Dong · Pieter Peers · Youkang Kong · Hongzhi Wu · Xin Tong
Project Page | arXiv | Model | Demo
DiLightNet is a novel method for exerting fine-grained lighting control during text-driven, diffusion-based image generation. It controls lighting through a three-stage process: provisional image generation, foreground synthesis, and background inpainting. In this repo, we open-source the ControlNet model used in the second stage of DiLightNet: a neural network that takes a provisional image, a mask, and radiance hints as input and generates a foreground image under the target lighting. For the provisional image generation and background inpainting stages, you can use any off-the-shelf models (e.g. Stable Diffusion, Depth ControlNet, ...) or services (e.g. DALL·E 3, MidJourney, ...).
- Environment Setup
- Usage
- Training
- Community Contributions
- Citation
We use the Blender Python binding bpy for radiance hint rendering. bpy requires at least Python 3.10, and the bpy version we use (3.6, LTS) only supports Python 3.10. We therefore recommend using conda to create a new environment with Python 3.10 as well as the CUDA and PyTorch dependencies.
conda create --name dilightnet python=3.10 pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia
conda activate dilightnet
git clone https://github.com/iamNCJ/DiLightNet
cd DiLightNet
pip install -r requirements.txt
from diffusers.utils import get_class_from_dynamic_module
NeuralTextureControlNetModel = get_class_from_dynamic_module(
    "dilightnet/model_helpers",
    "neuraltexture_controlnet.py",
    "NeuralTextureControlNetModel"
)
neuraltexture_controlnet = NeuralTextureControlNetModel.from_pretrained("DiLightNet/DiLightNet")
The base model of DiLightNet is stabilityai/stable-diffusion-2-1, so you can easily set up an inference pipeline with our DiLightNet ControlNet model:
import torch
from diffusers import StableDiffusionControlNetPipeline

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", controlnet=neuraltexture_controlnet,
)
cond_image = torch.randn((1, 16, 512, 512))  # random placeholder conditioning; see the channel layout below
image = pipe("some text prompt", image=cond_image).images[0]
Please check the simple example for usage with real condition images.
The input tensor to the controlnet model should be a torch.Tensor of shape (BS, 16, H, W), range (0, 1), where H and W are the height and width of the image, respectively. The 16 channels are, in order:
- Provisional Image: torch.Tensor of shape (1, 3, H, W), range (0, 1)
- Mask: torch.Tensor of shape (1, 1, H, W), range (0, 1)
- Radiance Hints: torch.Tensor of shape (1, 12, H, W), in the order of diffuse, specular (r=0.05, r=0.13, r=0.34), range (0, 1)
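For reference, below is a minimal sketch (not part of this repo; the file names are placeholders) of how such a conditioning tensor could be assembled from individual images:

```python
# Hypothetical assembly of the 16-channel conditioning tensor; file names are placeholders.
import imageio.v3 as iio
import numpy as np
import torch

def to_tensor(path, channels):
    img = iio.imread(path).astype(np.float32) / 255.0    # (H, W) or (H, W, C), range (0, 1)
    if img.ndim == 2:
        img = img[..., None]
    img = img[..., :channels]                             # keep the expected number of channels
    return torch.from_numpy(img).permute(2, 0, 1)[None]   # (1, C, H, W)

prov = to_tensor("provisional.png", 3)                    # (1, 3, H, W)
mask = to_tensor("mask.png", 1)                           # (1, 1, H, W)
hints = torch.cat([to_tensor(p, 3) for p in (
    "hint_diffuse.png", "hint_ggx0.05.png", "hint_ggx0.13.png", "hint_ggx0.34.png",
)], dim=1)                                                # (1, 12, H, W): diffuse, then specular r=0.05/0.13/0.34

cond_image = torch.cat([prov, mask, hints], dim=1)        # (1, 16, H, W)
```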
(Figure: example provisional image, mask, diffuse radiance hint, specular radiance hints at r=0.05, r=0.13, r=0.34, and the corresponding output image.)
Note: We have switched to DUSt3R for monocular metric depth estimation and camera intrinsics estimation. The results in the paper were produced with ZoeDepth and fixed camera intrinsics (fov=55.0). Hence, the released version should produce better results, though they may differ slightly from those in the paper.
python3 infer_img.py --prov_img examples/provisional_img/futuristic_soldier.png --prompt "futuristic soldier with advanced armor weaponry and helmet" --env_map examples/env_map/grace.exr --out_vid ./output/soldier_grace.mp4
Please check the test script for more examples.
--prov_img str Path to the provisional image (default: None)
--prompt str Prompt for the generated images (default: )
--num_imgs_per_prompt int
Number of images to generate per prompt (default: 4)
--out_vid [str] Path to the output video, defaults to the input image path (default: None)
--seed int Seed for the generation (default: 3407)
--steps int Number of steps for the diffusion process (default: 20)
--cfg float CFG for the diffusion process (default: 3.0)
--fov [float] Field of view for the mesh reconstruction, none for auto estimation from the image (default: None)
--mask_path [str] Path to the mask for the image (default: None)
--use_sam bool, --nouse_sam bool
Use SAM for background removal (default: True)
--mask_threshold float
Mask threshold for foreground object extraction (default: 25.0)
--pl_pos_r float Rotation radius of the point light (default: 5.0)
--pl_pos_h float Height of the point light (default: 3.0)
--power float Power of the point light (default: 1200.0)
--inpaint bool, --noinpaint bool
Inpaint the background of generated point light images (default: False)
--env_map [str] Environment map for the rendering; None uses a white point light (default: None)
--frames int Number of frames for lighting controlled video (default: 120)
--use_gpu_for_rendering bool, --nouse_gpu_for_rendering bool
Use GPU for radiance hints rendering (default: True)
--cache_radiance_hints bool, --nocache_radiance_hints bool
Cache the radiance hints for the video (default: True)
--radiance_hints_path [str]
Path to pre-rendered radiance hints (default: None)
- Foreground mask: By default, we use U2Net to generate an initial mask and refine it with SAM. If this does not work well on your image, you can provide a mask image with --mask_path. The mask image can be RGBA or grayscale; the last channel is used directly as the mask (see the small sketch after this list).
- Background inpainting: For environment map lighting, the script automatically inpaints the background with the background color of the environment map. For point light lighting, you can pass --inpaint to inpaint the background of the generated images with the Stable Diffusion inpainting model, but we suggest manual intervention for better inpainting results.
- Randomness: Due to the ambiguity in the provisional image (e.g. shape, original lighting, fine-grained material properties), the generated results can and should show diversity. Hence, you can try different seeds and prompts to get the desired results, just as with any diffusion model. The script generates num_imgs_per_prompt (default: 4) images for each prompt.
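As a small illustration of the custom mask format (hypothetical file names; this snippet is not part of the repo), you could take the last channel of an RGBA cutout and save it as a grayscale mask:

```python
# Hypothetical example: use the alpha (last) channel of an RGBA cutout as the mask.
import imageio.v3 as iio

rgba = iio.imread("cutout.png")          # (H, W, 4)
iio.imwrite("mask.png", rgba[..., -1])   # the last channel is used as the mask
```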
If you have the mesh of an object, you can use it to render finer radiance hints and use a depth-conditioned diffusion model to generate the provisional image. We provide a simple script to render radiance hints and a ground-truth depth map from a mesh:
python3 mesh_to_hints.py --mesh_path examples/mesh/chair.glb --output_dir tmp/chair --cam_azi 150 --env_map examples/env_map/garage.exr --cam_elev 20
Generating a provisional image with a depth-conditioned model is also easy with off-the-shelf models such as the SDXL Depth ControlNet:
import cv2
import imageio
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-depth-sdxl-1.0",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")
# convert metric depth to inverse relative depth
depth = imageio.v3.imread('tmp/chair/depth0001.exr')
depth = cv2.resize(depth, (1024, 1024), interpolation=cv2.INTER_NEAREST)[None]
inv_depth = 1.0 / depth
max_depth = inv_depth.max()
min_depth = inv_depth.min()
inv_depth = (inv_depth - min_depth) / (max_depth - min_depth)
# you may want to tune your prompt and other parameters to get the desired results
image = pipe(
    "a wooden chair", image=inv_depth, num_inference_steps=30, controlnet_conditioning_scale=0.5,
).images[0]
Then you can use the rendered radiance hints to generate the final video:
python3 infer_img.py --prov_img examples/depth_cond/prov_imgs/chair_0.png --radiance_hints_path ./tmp/chair/ --prompt "a wooden chair" --out_vid output/chair_0.mp4
Note that the provided rendering script is a simplified version; you can develop a more sophisticated pipeline to render the radiance hints under more complex camera/lighting conditions. For more examples, please refer to the test script.
python3 gradio_app.py
An online version of this demo can be found in our Hugging Face Space.
Due to the massive size of the training data and the numerous small files it comprises, we can only release the lists of the 3D models and the scripts we used to generate the training data.
All the 3D models in our training data are selected from Objaverse-LVIS. The training set contains three categories of 3D models, with 25K models in total:
- Filtered 3D models with PBR (13K): these models have PBR materials and are directly used to generate the training data.
- Randomly selected 3D models with albedo only (2K * 2): these models only have a diffuse texture; we randomly pick one from each LVIS category and apply two homogeneous specular components (one more metallic, one less metallic) to them for specular material augmentation.
- Most liked 3D models (1K * 8): these models are selected based on the number of likes reported by Objaverse. For each object, we apply 4 randomly selected SVBRDFs from the INRIA-Highres SVBRDF dataset and 4 randomly sampled homogeneous materials (two more metallic, two less metallic).

Furthermore, we provide the object ids of our test set. These objects are manually selected from Objaverse (without LVIS tags), and all of them contain high-quality PBR materials.
We provide a simplified example rendering script without infrastructure-specific dependencies or optimizations. For each 3D model, the script will render images under the following lighting conditions:
1. white point lighting
2. multi white point lighting
3. area lighting
4. white environmental lighting (keeping only intensity)
5. environmental lighting
Note: To avoid lighting and material ambiguity, only renders under conditions 1-4 serve as the provisional images during training.
The generated file structure should look like the tree below, with mesh normalization info, the camera-to-world (c2w) matrix for each view, and the lighting condition for each case:
.
├── normalize.json
├── view_0
│ ├── cam.json
│ ├── rgb_for_depth.png
│ ├── env_0
│ │ ├── env.json
│ │ ├── gt_diffuse.png
│ │ ├── gt_ggx0.34.png
│ │ ├── gt_ggx0.13.png
│ │ ├── gt_ggx0.05.png
│ │ └── gt.png
│ ├── env_1
│ │ └── ...
│ ├── white_env_0
│ │ └── ...
│ ├── white_env_1
│ │ └── ...
│ ├── white_pl_0
│ │ └── ...
│ ├── white_pl_1
│ │ └── ...
│ ├── multi_pl_0
│ │ └── ...
│ ├── multi_pl_1
│ │ └── ...
│ ├── area_0
│ │ └── ...
│ └── area_1
│ └── ...
├── view_1
│ └── ...
└── view_2
└── ...
The number of views and each lighting condition for each object can be set in the script.
This forms a jsonl file, with each line being a JSON object with the following fields:
- image: the path to the target image
- hint: [path1, path2, ...], the paths to the four radiance hints
- ref: [path1, path2, ...], all possible provisional images (excluding the target image itself)
- text: the prompt, generated with BLIP2 from the brightened rgb_for_depth.png
{
"image": "/absolute/path/to/your/file/view_0/white_pl_0/gt.png",
"hint": [
"/absolute/path/to/your/file/view_0/white_pl_0/gt_diffuse.png",
"/absolute/path/to/your/file/view_0/white_pl_0/gt_ggx0.05.png"
"/absolute/path/to/your/file/view_0/white_pl_0/gt_ggx0.13.png"
"/absolute/path/to/your/file/view_0/white_pl_0/gt_ggx0.34.png"
],
"ref": [
"/absolute/path/to/your/file/view_0/white_pl_1/gt.png",
"/absolute/path/to/your/file/view_0/env_0/gt.png",
"/absolute/path/to/your/file/view_0/env_1/gt.png",
"..."
],
"text": "some text description generated by BLIP2"
}
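For illustration only (this is not the repo's data-preparation script; the function and paths below are placeholders), a jsonl like the above could be assembled from the rendered directory layout roughly as follows:

```python
# Hypothetical sketch of building the training jsonl from the directory layout above.
import json
from pathlib import Path

def make_records(object_dir: Path, caption: str):
    # One record per (view, lighting condition); `caption` would in practice be a
    # BLIP2 caption of the brightened rgb_for_depth.png of the corresponding view.
    records = []
    for view_dir in sorted(object_dir.glob("view_*")):
        light_dirs = sorted(d for d in view_dir.iterdir() if d.is_dir())
        for light_dir in light_dirs:
            records.append({
                "image": str(light_dir / "gt.png"),
                "hint": [str(light_dir / f"gt_{n}.png")
                         for n in ("diffuse", "ggx0.05", "ggx0.13", "ggx0.34")],
                "ref": [str(d / "gt.png") for d in light_dirs if d != light_dir],
                "text": caption,
            })
    return records

with open("data.jsonl", "w") as f:
    for rec in make_records(Path("/path/to/rendered/object"), "some text description"):
        f.write(json.dumps(rec) + "\n")
```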
We use fp16 and 8-bit Adam to speed up training. In our experiments, bf16 sometimes caused numerical instability, leading to training collapse. We use wandb to log the training process; logging to tensorboard is not tested. Additional dependencies for training are listed in train/requirements.txt.
To launch the training process, you can use the following command:
accelerate launch --mixed_precision="fp16" --multi_gpu train_controlnet.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-2-1" \
  --output_dir=/path/to/output_dir \
  --exp_id="exp_id" \
  --dataset_name="/path/to/your/jsonl/data.jsonl" \
  --aug_dataset_name="/path/to/your/jsonl/data.jsonl" \
  --test_dataset_name="/path/to/your/jsonl/data.jsonl" \
  --resolution=512 \
  --shading_hint_channels=12 \
  --learning_rate=1e-5 \
  --train_batch_size=8 \
  --add_mask \
  --mask_weight=0.2 \
  --dataloader_num_workers=24 \
  --report_to=wandb \
  --checkpointing_steps=500 \
  --validation_steps=5000 \
  --max_train_steps=300000 \
  --proportion_empty_prompts=0.5 \
  --proportion_channel_aug=0.2 \
  --proportion_pred_normal=0.1 \
  --gradient_checkpointing \
  --gradient_accumulation_steps=1 \
  --use_8bit_adam \
  --mixed_precision fp16 \
  --set_grads_to_none \
  --resume_from_checkpoint=latest
We welcome all kinds of contributions to this project, including but not limited to:
- interesting usage examples
- improved image generation pipeline, e.g. better sampling strategies
- integrating DiLightNet into AUTOMATIC1111 or ComfyUI
- adaptation to community models
- community versions of the controlnet model, e.g. dedicated support for portrait images or anime characters
- ...
Please feel free to open an issue or submit a pull request!
If you find our work useful in your research, please consider citing:
@inproceedings{zeng2024dilightnet,
  title = {DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation},
  author = {Chong Zeng and Yue Dong and Pieter Peers and Youkang Kong and Hongzhi Wu and Xin Tong},
  booktitle = {ACM SIGGRAPH 2024 Conference Papers},
  year = {2024}
}