Fixing xsmm runner dynamic load #146

Merged: 2 commits merged into mlir on Jul 23, 2024

Conversation

@slyalin (Owner) commented Jul 22, 2024

TODO: Still doesn't work in Python.

@rengolin (Collaborator) commented Jul 22, 2024

Not sure what the problem with Python is. This is what fixed it for us in tpp-mlir, but that was a C++ application, not a shared object. I have copied the libtpp_xsmm_runner_utils.so libraries to the install directory and I still get the same problem:

Created MLIR op: extension::MLIROp MLIROp_2179 (opset1::Parameter a[0]:f32[?,?], opset1::Constant self.linear.weight[0]:f32[128,1024]) -> (f32[?,128])
JIT session error: Symbols not found: [ xsmm_unary_invoke, xsmm_unary_dispatch, xsmm_brgemm_invoke, xsmm_brgemm_dispatch ]
JIT invocation failed

Note: the library path is set correctly by running . ./install/setupvars.sh.

@rengolin (Collaborator):

Looking more at this, Python can load the library:

openat(AT_FDCWD, "/home/rengolin/devel/intel/openvino/build/install/runtime/lib/intel64/libtpp_xsmm_runner_utils.so.19.0git", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832

But later on the JIT fails to find the symbols.

JIT session error: Symbols not found: [ xsmm_unary_invoke, xsmm_unary_dispatch, xsmm_brgemm_invoke, xsmm_brgemm_dispatch ]
JIT invocation failed
Program aborted due to an unhandled Error:
Failed to materialize symbols: { (main, { entry, _mlir_entry }) }

The way we fixed this in C++ was to pre-load the library into the tpp-run binary, while mlir-cpu-runner has the --shared-libs option, which makes the libraries available during execution.

Unfortunately, looking at Orc (LLVM's JIT compiler), the error messages are triggered by helper classes, emitted by some other loader. I imagine the openvino binary is the one that needs to load that library and tell the JIT where it is.
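For illustration only (this is not the actual OpenVINO plugin code), here is a minimal C++ sketch of the pre-loading approach, assuming the JIT searches the host process for symbols (MLIR's ExecutionEngine attaches Orc's DynamicLibrarySearchGenerator for the current process by default). The function name and library path below are made up:

#include <llvm/Support/DynamicLibrary.h>
#include <iostream>
#include <string>

// Load the runner library into the current process before any JIT invocation,
// so its exported xsmm_* symbols become visible when Orc searches the host
// process for definitions.
bool preloadXsmmRunner(const std::string &path) {
  std::string err;
  // LoadLibraryPermanently returns true on failure.
  if (llvm::sys::DynamicLibrary::LoadLibraryPermanently(path.c_str(), &err)) {
    std::cerr << "Failed to load " << path << ": " << err << "\n";
    return false;
  }
  return true;
}

// Hypothetical call site, e.g. during plugin initialization:
// preloadXsmmRunner("/path/to/libtpp_xsmm_runner_utils.so");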

Comment on lines +41 to +42
#FIXME: Provide platform-independent way of doing that:
install(FILES ${TPP_MLIR_DIR}/lib/libtpp_xsmm_runner_utils.so ${TPP_MLIR_DIR}/lib/libtpp_xsmm_runner_utils.so.19.0git DESTINATION ${OV_CPACK_RUNTIMEDIR})
@slyalin (Owner, Author):

@rengolin, please suggest a proper alternative.

@rengolin (Collaborator):

That's actually not a bad idea, tbh. An alternative is to change the setupvars.sh to add the TPP build directory to the LD_LIBRARY_PATH.

Unless TPP can be installed as a proper library (on system path), there's not much else we can do.

@slyalin (Owner, Author):

The idea is to have a self-contained openvino package, as it is now, to model a final product without any extra dependencies. This is how the binary size of the package will be calculated, and that is one of the important product-level metrics.

@slyalin (Owner, Author):

I initially intended to provide a proper CMake statement without all these hard-coded .so names. Do we have a normal way to include TPP-MLIR with find_package, similar to what we have for LLVM/MLIR?

@slyalin marked this pull request as ready for review on July 23, 2024, 10:46
@slyalin (Owner, Author) commented Jul 23, 2024

Now it works for Linux and C++ only. Needed libraries are installed in the target ov directory.

@rengolin (Collaborator):

> Now it works for Linux and C++ only. Needed libraries are installed in the target ov directory.

How can I test this in C++?

@slyalin (Owner, Author) commented Jul 23, 2024

> Now it works for Linux and C++ only. Needed libraries are installed in the target ov directory.
>
> How can I test this in C++?

To test it in C++ you need two programs: one to emit the OpenVINO IR for the desired model (the Python part that uses PyTorch), and a second to run that IR in a C++ application. It is not very convenient, but you cannot convert a PyTorch model in a C++ app; Python is a requirement in this case. And now C++ is a requirement for the xsmm runner part, so we need two programs. I would like to see a PR that shows how a library can be registered for JIT in the MLIR/LLVM world and makes it functional for both Python and C++.

The first program:

import torch
import torch.nn as nn
import openvino as ov

# Define a synthetic model
class LinearModel(nn.Module):
    def __init__(self, input_size, output_size):
        super(LinearModel, self).__init__()
        self.linear = nn.Linear(input_size, output_size)
    def forward(self, a):
        # some random element-wise stuff first just to see how it can be combined with MatMul
        b = a*a + 2.0
        x = ((a+a) * (a-b)) / a
        out = self.linear(x)
        return out

# Create an instance of the model
input_size = 1024
output_size = 128
model = LinearModel(input_size, output_size)
# Generate random weights
model.linear.weight.data.normal_(0, 0.01)
model.linear.bias.data.fill_(0.01)

input_data = torch.tensor(range(1, input_size*output_size+1)).to(torch.float32).view(output_size, input_size)

with torch.no_grad():
    reference = model(input_data)
    print('Reference:\n', reference)

ov_model = ov.convert_model(model, example_input=input_data)
ov.save_model(ov_model, "simple_model.matmul.1024x128.xml")

The second program:

#include <openvino/openvino.hpp>
#include <iostream>

int main () {
  ov::Core core;
  auto compiled_model = core.compile_model("simple_model.matmul.1024x128.xml");
  auto infer_request = compiled_model.create_infer_request();

  auto input_tensor_1 = infer_request.get_input_tensor(0);
  size_t size1 = 128;
  size_t size2 = 1024;
  input_tensor_1.set_shape({size1, size2});
  auto data_1 = input_tensor_1.data<float>();
  for(size_t i = 0; i < size1*size2; ++i)
    data_1[i] = i+1;

  infer_request.infer();

  auto output_tensor = infer_request.get_output_tensor(0);
  auto output_data = output_tensor.data<float>();
  for(size_t i = 0; i < output_tensor.get_size(); ++i) {
      std::cout << "[" << i << "]: " << output_data[i] << "\n";
  }
}

You can build it with

g++ example.cpp -I/where/openvino/installed/runtime/include -lopenvino -L/where/openvino/installed/runtime/lib/intel64

Source setupvars.sh before that, then run the resulting binary.
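As a side note on registering a library for JIT in the MLIR/LLVM world: if the extension constructs an mlir::ExecutionEngine itself, the programmatic counterpart of mlir-cpu-runner's --shared-libs flag is the sharedLibPaths field of mlir::ExecutionEngineOptions. Below is a minimal sketch under that assumption; the module is expected to be lowered to the LLVM dialect already, and the .so path is a placeholder rather than the real install location.

#include <mlir/ExecutionEngine/ExecutionEngine.h>
#include <llvm/ADT/SmallVector.h>
#include <llvm/Support/Error.h>
#include <memory>

// Sketch only: create a JIT engine that resolves xsmm_* symbols from the
// runner library, the same way --shared-libs does for mlir-cpu-runner.
llvm::Expected<std::unique_ptr<mlir::ExecutionEngine>>
createEngineWithXsmmRunner(mlir::ModuleOp module) {
  llvm::SmallVector<llvm::StringRef, 1> libs{
      "/path/to/libtpp_xsmm_runner_utils.so"};
  mlir::ExecutionEngineOptions options;
  options.sharedLibPaths = libs;
  return mlir::ExecutionEngine::create(module, options);
}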

@slyalin merged commit a7f652e into mlir on Jul 23, 2024
14 of 30 checks passed
@rengolin (Collaborator):

> To test it in C++ you need two programs: one to emit the OpenVINO IR for the desired model (the Python part that uses PyTorch), and a second to run that IR in a C++ application. It is not very convenient, but you cannot convert a PyTorch model in a C++ app; Python is a requirement in this case. And now C++ is a requirement for the xsmm runner part, so we need two programs. I would like to see a PR that shows how a library can be registered for JIT in the MLIR/LLVM world and makes it functional for both Python and C++.

OK, I think we can go with that for now. The important points of the process are:

  • It must come from PyTorch, and not a "made-up" graph. Importing through Python and converting to XML is fine.
  • It must pass through tpp-mlir and emit calls to XSMM. The default pipeline is doing that.
  • It must be able to load libxsmm and wrappers at runtime. The C++ program can do that.

Now we need a set of benchmarks:

  1. Roofline: Static shape, matmul (no transpose), bias Add (no broadcast), ReLU. This should achieve performance similar to libxsmm-dnn.
  2. Baseline: Matmul transposed, bias Add with broadcast, ReLU. This should be the slowest of the bunch, but still >50% of peak.
  3. Fixups for (2) above: Correctly tile a transposed matmul, use linalg.generic for broadcast element-wise ops.

Aiming for these performance targets:

  • Roofline: ~90% peak AMX for BF16
  • Baseline: >50% of the Roofline
  • Fixups: >80% of the Roofline

Later on (or in parallel) we can work on the Python issues, but these are not critical to demonstrate impact.
