Are you aware of any academic torrents for the training set? #18
Comments
Hello, have you seen that repo? https://github.com/xiaobai1217/Awesome-Video-Datasets |
Interesting idea with the pointwise conv3d. I'm trying to play with it too. |
My post on reddit is blowing up. I guess we're going to find out if Claude 3 is a fraud or not pretty soon. https://www.reddit.com/r/StableDiffusion/comments/1bh970h/claude_3_thinks_4_lines_of_code_changes_will/ |
I made it work only with this setup:

```python
class InflatedConv3d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None):
        super().__init__()
        self.kernel_size = (kernel_size, kernel_size, kernel_size)
        self.stride = (1, stride, stride)
        self.padding = (padding, padding, padding)
        # depthwise 3D conv over (frames, height, width), followed by a 1x1x1 pointwise conv
        self.depthwise_conv = nn.Conv3d(in_channels, in_channels, self.kernel_size, self.stride, self.padding, groups=in_channels)
        self.pointwise_conv = nn.Conv3d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)
        return x
```

Currently I get noise, because there are no trained parameters for the newly added conv layers. Now I need to retrain these modules in the UNet. This means we can't reuse SD 1.5 checkpoints, because all those conv layers live in the UNet, not in the motion_module. The motion module is a transformer architecture. |
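A quick shape sanity check (my own sketch, not from the thread; the SD-like channel, frame, and resolution sizes are just assumptions) that the depthwise-plus-pointwise variant above keeps the usual output shape on a (b, c, f, h, w) tensor:

```python
import torch

# assumes the InflatedConv3d class defined above is in scope
conv = InflatedConv3d(320, 320, kernel_size=3, padding=1)
x = torch.randn(1, 320, 16, 32, 32)  # (batch, channels, frames, height, width)
print(conv(x).shape)                  # torch.Size([1, 320, 16, 32, 32])
```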
I got these results: about 25% more efficient at inference. |
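For a rough sense of where the saving comes from, here is a layer-level parameter comparison (my own sketch, with an SD-like channel count of 320 as an assumption). The end-to-end speedup is much smaller than the per-layer saving because, as discussed later in the thread, most of the compute sits in the attention layers:

```python
import torch.nn as nn

in_ch, out_ch = 320, 320
dense = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
depthwise = nn.Conv3d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
pointwise = nn.Conv3d(in_ch, out_ch, kernel_size=1)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(dense))                            # 2,765,120
print(n_params(depthwise) + n_params(pointwise))  # 111,680
```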
In part 2, the groupnorm, it may help because it does away with tensor resizing and takes a shortcut by using a view instead. |
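If I read the groupnorm remark right, the point is that nn.GroupNorm already accepts any (N, C, *) input, so it can run directly on the 5D video tensor instead of rearranging to ((b f), c, h, w) and back; the caveat is that the statistics are then pooled over frames as well. A minimal sketch of that reading (tensor sizes are just placeholders):

```python
import torch
import torch.nn as nn
from einops import rearrange

x = torch.randn(2, 320, 16, 32, 32)  # (b, c, f, h, w)
norm = nn.GroupNorm(num_groups=32, num_channels=320)

# current pattern: fold frames into the batch, normalize per frame, unfold again
y_per_frame = rearrange(norm(rearrange(x, "b c f h w -> (b f) c h w")),
                        "(b f) c h w -> b c f h w", b=2)

# shortcut: GroupNorm works on (N, C, *) directly, so no reshuffling is needed,
# but the normalization statistics are then shared across all frames of a sample
y_shared = norm(x)
print(y_per_frame.shape, y_shared.shape)
```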
Nope, it is too costly to train the conv layers on a single GPU. I trained for 50,000 steps and it is not even close to producing any recognizable picture. The conv layers in SD are feature extractors and we completely replace them; training them from scratch is a task on the scale of training the complete SD model... |
Would it be possible to initialize them with the same weights as the layers they're supposed to replace? Then it'd be more of a finetuning task. |
Initialize? I don't think so, not directly. Maybe there is some correlation between the conv2d weights and the two conv3d layers, but recovering it looks like a problem from asymmetric cryptography (a private/public key relationship). I think it could be distilled from SD, but that would need a second GPU that I don't have, and even then distillation would require enormous effort for a single dev. I got only a 25% performance improvement. That is good but not 100x, and it is expected, because the main computation in SD is in the attention layers, which are O(n^2). With fewer parameters, the pointwise conv could also lose to an ordinary conv2d at feature extraction, so the model overall could perform worse. I'd like the option of two transformer blocks in AnimateDiff, but that is too costly in VRAM, and I have to choose between video length and transformer block count. Anyway, we will all end up inferring our videos on Sora-like models, which is very promising. |
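On the possible correlation between the original conv2d weights and the depthwise/pointwise pair: one standard warm-start trick (my own sketch, not something from this repo, and only an approximation) is a per-input-channel rank-1 SVD of the dense kernel, since the composed separable conv has effective kernel pointwise[o, i] * depthwise[i]:

```python
import torch

def factorize_depthwise_separable(weight):
    """Approximate a dense kernel of shape (out, in, k, k) by a depthwise kernel
    (in, 1, k, k) followed by a pointwise kernel (out, in, 1, 1), taking the best
    rank-1 fit per input channel via SVD."""
    out_ch, in_ch, kh, kw = weight.shape
    depthwise = weight.new_empty(in_ch, 1, kh, kw)
    pointwise = weight.new_empty(out_ch, in_ch, 1, 1)
    for i in range(in_ch):
        m = weight[:, i].reshape(out_ch, kh * kw)       # (out, k*k) slice for channel i
        u, s, vh = torch.linalg.svd(m, full_matrices=False)
        depthwise[i, 0] = vh[0].reshape(kh, kw)         # shared spatial filter
        pointwise[:, i, 0, 0] = u[:, 0] * s[0]          # per-output mixing weights
    return depthwise, pointwise
```

The original kernels are generally not rank-1 per channel, so this would still need finetuning, but it should start much closer to the pretrained features than a random or uniform init; the original bias can simply be copied onto the pointwise conv.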
I asked Claude 3 (maybe my question was not clear enough and it's hallucinating); I'll give it another shot later. This part is wrong:

Yes, it is possible to initialize the depthwise separable convolution layers with the same weights as the layers they are supposed to replace. This way, you can start with a model that has the same initial behavior as the original model and then fine-tune it to adapt to the depthwise separable convolution architecture. A DepthwiseSeparableInflatedConv3d class that initializes the weights from the original InflatedConv3d layer:

```python
class DepthwiseSeparableInflatedConv3d(nn.Module):
    def __init__(self, inflated_conv3d_layer):
        super().__init__()
        in_channels = inflated_conv3d_layer.in_channels
        out_channels = inflated_conv3d_layer.out_channels
        kernel_size = inflated_conv3d_layer.kernel_size
        stride = inflated_conv3d_layer.stride
        padding = inflated_conv3d_layer.padding
        self.depthwise_conv = InflatedConv3d(in_channels, in_channels, kernel_size, stride, padding, groups=in_channels)
        self.pointwise_conv = InflatedConv3d(in_channels, out_channels, kernel_size=1)
        # Initialize the depthwise convolution weights from the original InflatedConv3d layer
        depthwise_conv_weights = inflated_conv3d_layer.weight.clone()
        depthwise_conv_weights = depthwise_conv_weights.view(in_channels, 1, *kernel_size)
        self.depthwise_conv.weight = nn.Parameter(depthwise_conv_weights)
        if inflated_conv3d_layer.bias is not None:
            self.depthwise_conv.bias = nn.Parameter(inflated_conv3d_layer.bias.clone())
        # Initialize the pointwise convolution weights from the original InflatedConv3d layer
        pointwise_conv_weights = torch.ones(out_channels, in_channels, 1, 1) / in_channels
        self.pointwise_conv.weight = nn.Parameter(pointwise_conv_weights)

    def forward(self, x):
        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)
        return x
```

In this updated implementation:

```python
optimized_layers = []
for layer in original_layers:
    optimized_layer = DepthwiseSeparableInflatedConv3d(layer)
    optimized_layers.append(optimized_layer)
optimized_model = nn.Sequential(*optimized_layers)
```

By initializing the depthwise separable convolution layers with the weights from the original InflatedConv3d layers, you can start with a model that behaves similarly to the original model. You can then fine-tune this model to adapt to the depthwise separable convolution architecture and potentially achieve better performance or efficiency.

UPDATE

```python
class DepthwiseSeparableInflatedConv3d(nn.Module):
    def __init__(self, inflated_conv3d_layer):
        super().__init__()
        in_channels = inflated_conv3d_layer.in_channels
        out_channels = inflated_conv3d_layer.out_channels
        kernel_size = inflated_conv3d_layer.kernel_size
        stride = inflated_conv3d_layer.stride
        padding = inflated_conv3d_layer.padding
        self.depthwise_conv = nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, groups=in_channels)
        self.pointwise_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Initialize the depthwise convolution weights from the original InflatedConv3d layer
        depthwise_conv_weights = inflated_conv3d_layer.weight.clone()
        depthwise_conv_weights = depthwise_conv_weights.view(in_channels, 1, *kernel_size)
        self.depthwise_conv.weight = nn.Parameter(depthwise_conv_weights)
        if inflated_conv3d_layer.bias is not None:
            self.depthwise_conv.bias = nn.Parameter(inflated_conv3d_layer.bias.clone())
        # Initialize the pointwise convolution weights from the original InflatedConv3d layer
        pointwise_conv_weights = torch.ones(out_channels, in_channels, 1, 1) / in_channels
        self.pointwise_conv.weight = nn.Parameter(pointwise_conv_weights)

    def forward(self, x):
        video_length = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")
        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)
        x = rearrange(x, "(b f) c h w -> b c f h w", f=video_length)
        return x
```

In this updated implementation:

```python
optimized_layers = []
for layer in original_layers:
    optimized_layer = DepthwiseSeparableInflatedConv3d(layer)
    optimized_layers.append(optimized_layer)
optimized_model = nn.Sequential(*optimized_layers)
```

By initializing the depthwise and pointwise convolution layers with the weights from the original InflatedConv3d layers, the optimized model will start with the same weights as the original model. You can then fine-tune the optimized model to adapt to the depthwise separable convolution architecture.

UPDATE 3. I asked if this is bad for performance: "Initializing the depthwise separable convolution layers with the weights from the original InflatedConv3d layers can have both advantages and disadvantages in terms of performance. Let's discuss them:"

UPDATE 4 - currently testing to see the disk saving...

```python
import torch
import torch.nn as nn
from einops import rearrange

# UNet3DConditionModel and InflatedConv3d are assumed to come from the AnimateDiff codebase

# Define the DepthwiseSeparableInflatedConv3d class
class DepthwiseSeparableInflatedConv3d(nn.Module):
    def __init__(self, inflated_conv3d_layer):
        super().__init__()
        in_channels = inflated_conv3d_layer.in_channels
        out_channels = inflated_conv3d_layer.out_channels
        kernel_size = inflated_conv3d_layer.kernel_size
        stride = inflated_conv3d_layer.stride
        padding = inflated_conv3d_layer.padding
        self.depthwise_conv = nn.Conv2d(in_channels, in_channels, kernel_size, stride, padding, groups=in_channels)
        self.pointwise_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Initialize the depthwise convolution weights from the original InflatedConv3d layer
        depthwise_conv_weights = inflated_conv3d_layer.weight.clone()
        depthwise_conv_weights = depthwise_conv_weights.view(in_channels, 1, *kernel_size)
        self.depthwise_conv.weight = nn.Parameter(depthwise_conv_weights)
        if inflated_conv3d_layer.bias is not None:
            self.depthwise_conv.bias = nn.Parameter(inflated_conv3d_layer.bias.clone())
        # Initialize the pointwise convolution weights from the original InflatedConv3d layer
        pointwise_conv_weights = torch.ones(out_channels, in_channels, 1, 1) / in_channels
        self.pointwise_conv.weight = nn.Parameter(pointwise_conv_weights)

    def forward(self, x):
        video_length = x.shape[2]
        x = rearrange(x, "b c f h w -> (b f) c h w")
        x = self.depthwise_conv(x)
        x = self.pointwise_conv(x)
        x = rearrange(x, "(b f) c h w -> b c f h w", f=video_length)
        return x

# Load the pretrained AnimateDiff model
original_model = UNet3DConditionModel.from_pretrained("path/to/animatediff/checkpoint")

# Create an optimized model with DepthwiseSeparableInflatedConv3d layers
optimized_model = UNet3DConditionModel(
    sample_size=original_model.sample_size,
    in_channels=original_model.in_channels,
    out_channels=original_model.out_channels,
    center_input_sample=original_model.center_input_sample,
    flip_sin_to_cos=original_model.flip_sin_to_cos,
    freq_shift=original_model.freq_shift,
    down_block_types=original_model.down_block_types,
    up_block_types=original_model.up_block_types,
    block_out_channels=original_model.block_out_channels,
    layers_per_block=original_model.layers_per_block,
    downsample_padding=original_model.downsample_padding,
    mid_block_scale_factor=original_model.mid_block_scale_factor,
    act_fn=original_model.act_fn,
    norm_num_groups=original_model.norm_num_groups,
    norm_eps=original_model.norm_eps,
    cross_attention_dim=original_model.cross_attention_dim,
    attention_head_dim=original_model.attention_head_dim,
)

# Replace InflatedConv3d layers with DepthwiseSeparableInflatedConv3d layers
for name, module in optimized_model.named_modules():
    if isinstance(module, InflatedConv3d):
        inflated_conv3d_layer = getattr(original_model, name)
        depthwise_separable_conv3d_layer = DepthwiseSeparableInflatedConv3d(inflated_conv3d_layer)
        setattr(optimized_model, name, depthwise_separable_conv3d_layer)

# Save the checkpoint of the optimized model
torch.save(optimized_model.state_dict(), "path/to/optimized/checkpoint.pth")
```

UPDATE: I got some other fires I need to attend to, then I'll switch back to this.

UPDATE 6 |
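For the UPDATE 4 disk-saving test, a quick way to compare the two models (a hedged sketch reusing the placeholder names and paths from the script above) is to count parameters and check the saved file size:

```python
import os

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(f"original:  {n_params(original_model) / 1e6:.1f}M parameters")
print(f"optimized: {n_params(optimized_model) / 1e6:.1f}M parameters")
print(os.path.getsize("path/to/optimized/checkpoint.pth") / 1e6, "MB on disk")
```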
Did you successfully train on the 10M web videos?
I'm looking at exploring this suggestion for the 4-line code change.
guoyww/AnimateDiff#308