This repository has been archived by the owner on Aug 30, 2018. It is now read-only.

ONNX Tutorial: filter.dim32(i + 2) == kernel_[i] #12

Open
Pavel-Akapian opened this issue Sep 12, 2017 · 12 comments

Comments

@Pavel-Akapian

Hello!
We're trying to replicate the PyTorch ONNX Super-Resolution tutorial. Conversion seems to work OK, but when we deploy the model to iOS, an error occurs (on predictor->run):

[MC] Reading from public effective user settings. 
libc++abi.dylib: terminating with uncaught exception of type caffe2::EnforceNotMet: [enforce fail at conv_op_impl.h:37] filter.dim32(i + 2) == kernel_[i].  Error from operator:  
input: "9" input: "1" output: "11" name: "" type: "Conv" arg { name: "kernels" ints: 5 ints: 5 } arg { name: "strides" ints: 1 ints: 1 } arg { name: "pads" ints: 2 ints: 2 ints: 2 ints: 2 } arg { name: "dilations" ints: 1 ints: 1 } arg { name: "group" i: 1 }
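For reference, the failing enforce compares each spatial dimension of the Conv filter blob (its second input, "1" in the message above) with the "kernels" argument (5 x 5). The plain-Python sketch below mimics only that comparison, assuming NCHW order and a filter laid out as (out_channels, in_channels, kH, kW); check_conv_filter is a hypothetical name, not Caffe2 code. It shows one way the message can arise: if an image tensor ends up in the filter slot, its 224 x 224 spatial size is compared against the 5 x 5 kernel.

```python
from typing import Sequence


def check_conv_filter(filter_shape: Sequence[int], kernels: Sequence[int]) -> None:
    """Mimic the NCHW check filter.dim32(i + 2) == kernel_[i].

    Hypothetical helper for illustration only; filter_shape is assumed to be
    (out_channels, in_channels, kH, kW).
    """
    for i, k in enumerate(kernels):
        actual = filter_shape[i + 2]
        if actual != k:
            raise ValueError(f"filter dim {i + 2} is {actual}, but kernels[{i}] is {k}")


# conv1 weight from SuperResolutionNet: 64 output channels, 1 input channel, 5x5 kernel.
check_conv_filter((64, 1, 5, 5), kernels=(5, 5))  # passes silently

# An image placed in the filter slot fails, whatever its channel count.
try:
    check_conv_filter((1, 3, 224, 224), kernels=(5, 5))
except ValueError as err:
    print(err)  # filter dim 2 is 224, but kernels[0] is 5
```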

We can run the original Caffe2 models on the device. When we compared hand-written Caffe2 models with the model produced by the conversion tool, we noticed that the conversion tool adds the following (maybe this is relevant to the issue?):

device_option {
  device_type: 0
  cuda_gpu_id: 0
}

This problem also reproduces with simpler examples.

@prigoyal

@Pavel-Akapian thank you for trying out the tutorial. To be able to help further, I need some information on how you are running the model.

Can you describe which version of the super-resolution model you are using? The tutorial highlights 1) a small model that is also available in the PyTorch examples and 2) the SRResNet model.

What image processing did you use, if any? What are the dimensions of the input image fed to the model?

The error seems to indicate that the input data is not what the model expects. Have you been able to run the tutorial successfully up to the mobile-execution part, using pdb? That would be the first step to get right.

@Pavel-Akapian
Author

Pavel-Akapian commented Sep 13, 2017

@prigoyal thank you for the quick reply.
I'm experimenting with the model below:

import torch.nn as nn
import torch.nn.init as init


class SuperResolutionNet(nn.Module):
    def __init__(self, upscale_factor, inplace=False):
        super(SuperResolutionNet, self).__init__()

        self.relu = nn.ReLU(inplace=inplace)
        self.conv1 = nn.Conv2d(1, 64, (5, 5), (1, 1), (2, 2))
        self.conv2 = nn.Conv2d(64, 64, (3, 3), (1, 1), (1, 1))
        self.conv3 = nn.Conv2d(64, 32, (3, 3), (1, 1), (1, 1))
        self.conv4 = nn.Conv2d(32, upscale_factor ** 2, (3, 3), (1, 1), (1, 1))
        self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

        self._initialize_weights()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.conv2(x))
        x = self.relu(self.conv3(x))
        x = self.pixel_shuffle(self.conv4(x))
        return x

    def _initialize_weights(self):
        init.orthogonal(self.conv1.weight, init.calculate_gain('relu'))
        init.orthogonal(self.conv2.weight, init.calculate_gain('relu'))
        init.orthogonal(self.conv3.weight, init.calculate_gain('relu'))
        init.orthogonal(self.conv4.weight)

# Create the super-resolution model by using the above model definition.
torch_model = SuperResolutionNet(upscale_factor=3)

I don't quite understand what you mean by two versions (I didn't notice SRResNet).
I can run the tutorial from the very start up to the lines

# Save the image, we will compare this with the output image from mobile device
final_img.save("./_static/img/cat_superres.jpg")

with no errors, and 'cat_superres.jpg' is successfully created on the server machine.
I also tried a 'dummy model' with only one layer (conv), default initialization and no training, and got the same error.
The input shape is 1x3x224x224, with no pre-processing.

const int predHeight = 224;
const int predWidth = 224;
const int crops = 1;
const int channels = 3;

input.Resize(std::vector<int>({crops, channels, predHeight, predWidth}));
I'm going to upload the pb files later; I expect they will help a lot.

@Pavel-Akapian
Author

Pavel-Akapian commented Sep 13, 2017

@prigoyal My colleagues and I compared the pb files and found a way to resolve this problem.
The end of predict_net is:

external_input: "1"
external_input: "2"
external_input: "3"
external_input: "4"
external_input: "5"
external_input: "6"
external_input: "7"
external_input: "8"
external_input: "9"
external_output: "27"
external_output: "_onnx_dummy1"
external_output: "_onnx_dummy2"

We need to move the last entry, external_input: "9", before external_input: "1":

external_input: "9"
external_input: "1"
external_input: "2"
external_input: "3"
external_input: "4"
external_input: "5"
external_input: "6"
external_input: "7"
external_input: "8"
external_output: "27"
external_output: "_onnx_dummy1"
external_output: "_onnx_dummy2"

In fact, the image is loaded into the first external_input.
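The manual edit above amounts to a small list transformation. This plain-Python sketch assumes the data blob's name ("9" here) is already known; move_input_to_front is a hypothetical helper, not part of Caffe2 or onnx-caffe2. To apply the same order to a real NetDef, one could copy list(predict_net.external_input), clear the repeated field with del predict_net.external_input[:], and extend it with the new order. The maintainers below treat this kind of pb editing as a workaround rather than a fix.

```python
from typing import List


def move_input_to_front(external_inputs: List[str], data_blob: str) -> List[str]:
    """Return a copy of external_inputs with data_blob moved to index 0.

    Hypothetical helper mirroring the manual pb edit; the input list is not modified.
    """
    if data_blob not in external_inputs:
        raise KeyError(f"{data_blob!r} is not an external input")
    rest = [name for name in external_inputs if name != data_blob]
    return [data_blob] + rest


exported = [str(i) for i in range(1, 10)]  # "1" ... "9", as in the exported predict_net
reordered = move_input_to_front(exported, "9")
print(reordered)  # ['9', '1', '2', '3', '4', '5', '6', '7', '8']
```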

@ezyang
Collaborator

ezyang commented Sep 13, 2017

OK, the fact that the PyTorch exporter places the actual inputs at the end of the input list (rather than at the beginning) is a known wart. onnx-caffe2 handles this if you don't use the protobuf manually, but we plan to fix it. EDIT: This doesn't seem to be the actual problem here.

@prigoyal
Copy link

@Pavel-Akapian the issue is actually quite simple. There is no need to modify the pb here. This version of the super-resolution model requires an input image of dimension 1x1x224x224, and the reason for that is explained in the tutorial. The error you were getting also indicates that the filter dimension is not right. Can you please try passing the correct input without modifying the pb manually?
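The single-channel requirement follows from the model definition quoted earlier: conv1 is nn.Conv2d(1, 64, (5, 5), (1, 1), (2, 2)), so it accepts exactly one input channel. The plain-Python sketch below (not PyTorch or Caffe2 code; the helper names are hypothetical) propagates NCHW shapes through SuperResolutionNet with upscale_factor=3, using the usual convolution output-size formula with dilation 1.

```python
from typing import Tuple

Shape = Tuple[int, int, int, int]  # (N, C, H, W)


def conv2d_shape(x: Shape, in_ch: int, out_ch: int, k: int, stride: int = 1, pad: int = 0) -> Shape:
    """Output shape of a square-kernel Conv2d with dilation 1, checking channels."""
    n, c, h, w = x
    if c != in_ch:
        raise ValueError(f"expected {in_ch} input channels, got {c}")
    out_h = (h + 2 * pad - k) // stride + 1
    out_w = (w + 2 * pad - k) // stride + 1
    return (n, out_ch, out_h, out_w)


def pixel_shuffle_shape(x: Shape, r: int) -> Shape:
    """PixelShuffle maps (N, C*r*r, H, W) to (N, C, H*r, W*r)."""
    n, c, h, w = x
    if c % (r * r) != 0:
        raise ValueError(f"channels {c} not divisible by {r * r}")
    return (n, c // (r * r), h * r, w * r)


def super_resolution_shape(x: Shape, upscale_factor: int = 3) -> Shape:
    # Layer parameters copied from SuperResolutionNet above.
    x = conv2d_shape(x, 1, 64, k=5, pad=2)
    x = conv2d_shape(x, 64, 64, k=3, pad=1)
    x = conv2d_shape(x, 64, 32, k=3, pad=1)
    x = conv2d_shape(x, 32, upscale_factor ** 2, k=3, pad=1)
    return pixel_shuffle_shape(x, upscale_factor)


print(super_resolution_shape((1, 1, 224, 224)))  # (1, 1, 672, 672)
```

Passing (1, 3, 224, 224) instead raises the channel-count error in conv1. Note that this is a different check from the filter-size enforce in the iOS log.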

@Pavel-Akapian
Copy link
Author

@prigoyal this also happens with 1x1x224x224.

@prigoyal

prigoyal commented Sep 13, 2017

@Pavel-Akapian it's actually slightly odd that you were able to execute the nets up to

# Save the image, we will compare this with the output image from mobile device
final_img.save("./_static/img/cat_superres.jpg")

as you mentioned, without any manual tampering with the pb, whereas executing on iOS requires it. Can you create a simple repro of the error so we can look into it further? Tampering with the pb is not the right solution; the underlying problem should be fixed properly. I have not been able to reproduce this issue with the tutorial yet. Also, were you able to deploy on an Android device instead, following the adb instructions in the tutorial?

@Pavel-Akapian
Author

Pavel-Akapian commented Sep 13, 2017

Here are the pb files generated by running the tutorial exactly as written; they run on the server but not on iOS. (The '.txt' extension is fake so that GitHub accepts the upload.)
init_net.pb.txt
predict_net.pb.txt
@prigoyal Unfortunately, we aren't developing an Android app at the moment, so it would be difficult to compare.

@bwasti

bwasti commented Sep 13, 2017

Hey @Pavel-Akapian how are you running the network?

There are a couple of ways to do it, but I am guessing you are using the predictor API? It requires external_input[0] (the first entry) to be the input data. As you correctly determined, that is not what the PyTorch exporter produced.
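To make the ordering problem concrete, here is a toy model of positional input binding. It assumes, as described above, that the predictor feeds the i-th input tensor into external_input[i]; bind_positionally is a hypothetical name, and tensors are replaced by plain strings. With the exported order, the image is bound to blob "1", which the Conv operator in the original error message uses as its filter.

```python
from typing import Dict, List, Sequence


def bind_positionally(external_inputs: Sequence[str], inputs: List[object]) -> Dict[str, object]:
    """Bind inputs[i] to external_inputs[i], as a positional predictor would.

    Hypothetical sketch of the behaviour described in this thread, not the Caffe2 API.
    """
    if len(inputs) > len(external_inputs):
        raise ValueError("more inputs than external_input entries")
    return {name: value for name, value in zip(external_inputs, inputs)}


exported = [str(i) for i in range(1, 10)]  # the data blob "9" comes last
bound = bind_positionally(exported, ["image"])
print(bound)  # {'1': 'image'}: the image lands in conv1's weight blob, not in "9"
```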

To verify this, you can instead populate the blob "9" and run the network with workspace.RunNet(predict_net).

@jerryzh168
Copy link
Collaborator

@bwasti we should update the predictor to be more flexible, similar to this: https://github.com/onnx/onnx-caffe2/blob/master/onnx_caffe2/backend.py#L318-L321

@bwasti

bwasti commented Sep 14, 2017

@jerryzh168 what do you mean by more flexible?

One thing that might be nice is a TensorMap that maps string->Tensor for inputs, so we can use blob names instead of relying on the rather annoying ordering of external_input.
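A string-to-tensor map sidesteps the ordering entirely, because the caller names the blob it is feeding. The sketch below only illustrates that idea in plain Python; it is not an existing Caffe2 API, and bind_by_name and the string stand-ins for tensors are hypothetical.

```python
from typing import Dict, Mapping, Sequence


def bind_by_name(external_inputs: Sequence[str], feeds: Mapping[str, object]) -> Dict[str, object]:
    """Bind inputs by blob name instead of by position (the TensorMap idea).

    Hypothetical sketch; names that are not external inputs are rejected so typos fail loudly.
    """
    unknown = [name for name in feeds if name not in external_inputs]
    if unknown:
        raise KeyError(f"not external inputs: {unknown}")
    return dict(feeds)


exported = [str(i) for i in range(1, 10)]
print(bind_by_name(exported, {"9": "image"}))  # {'9': 'image'}, regardless of list order
```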

This is the culprit code btw: https://github.com/caffe2/caffe2/blob/master/caffe2/core/predictor.cc#L48

@jerryzh168
Collaborator

jerryzh168 commented Sep 14, 2017

@bwasti Yeah, by flexible I mean the predictor shouldn't depend on the ordering of external_input. We can figure out which blobs are missing by calling workspace.HasBlob. Accepting a string->Tensor map as input would be a nice extension too.
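The HasBlob idea can be sketched as follows: after running init_net, any external input that still has no blob in the workspace must be a data input. Here a Python set stands in for the workspace (membership replaces workspace.HasBlob), and find_data_inputs is a hypothetical name. The assumption that init_net creates exactly blobs "1" to "8" matches the exported external_input list quoted earlier but has not been checked against the attached pb files.

```python
from typing import List, Sequence, Set


def find_data_inputs(external_inputs: Sequence[str], initialized: Set[str]) -> List[str]:
    """Return the external inputs that init_net did not create, in declared order.

    `initialized` stands in for the blobs for which workspace.HasBlob(name) is True
    after running init_net; hypothetical helper, not Caffe2 code.
    """
    return [name for name in external_inputs if name not in initialized]


exported = [str(i) for i in range(1, 10)]
weights = {str(i) for i in range(1, 9)}  # assumption: init_net creates "1" ... "8"
print(find_data_inputs(exported, weights))  # ['9']
```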
