
GC-GPU integration #169

Merged: 21 commits into slyalin:mlir on Oct 15, 2024

Conversation

@dchigarev commented Sep 6, 2024

This PR adds an integration with graph-compiler's GPU pipeline. The integration passes GPU buffers (USM pointers / cl_mem), the CL queue, and CL events as-is to GC for execution.

A set of sanity tests was also added to verify the integration.

How to build and run tests
  1. Build LLVM with IMEX patches:

    git clone https://github.com/intel/graph-compiler.git
    cd graph-compiler
    ./scripts/compile.sh --dev --llvm --imex
    export LLVM_INST_PATH=$(pwd)/externals/llvm-project/build
    
  2. Build OV from this branch:

    git clone https://github.com/dchigarev/openvino.git
    cd openvino && git checkout gc-gpu
    mkdir build && cd build
    # ENABLE_INTEL_GPU=ON enables the GPU capabilities of graph compiler
    cmake .. -G Ninja \
    	-DLLVM_DIR=$LLVM_INST_PATH/lib/cmake/llvm \
    	-DMLIR_DIR=$LLVM_INST_PATH/lib/cmake/mlir \
    	-DENABLE_GRAPH_COMPILER=ON \
    	-DENABLE_INTEL_GPU=ON \
    	-DENABLE_TESTS=ON
    
  3. Run sanity tests:

    OV_MLIR_MODE=GC_GPU ./bin/intel64/Release/ov_gpu_func_tests --gtest_filter=MLIRExecution.*
    
  4. Run benchmark_app:

    OV_MLIR_MODE=GC_GPU ./bin/intel64/Debug/benchmark_app -m ./src/plugins/intel_gpu/tests/functional/mlir_op/models/matmul_64_128_f16.xml -d GPU -use_device_mem -ip f16 -infer_precision f16 -niter 100 -hint none -nstreams 1 -nthreads 1
    

What was changed and how it works

1. Common MLIREvaluate class was split into two

There are now two classes: MLIREvaluate (generic evaluation) and MLIREvaluateGcGPU. Both implement the MLIREvaluateBase interface, and the actual instance is created based on the mlir_mode parameter in MLIREvaluateBase::create().

This split was made because the two evaluation classes operate on different objects to lower and invoke the received MLIR module. The generic MLIREvaluate works with mlir::ExecutionEngine and mlir::Module, while MLIREvaluateGcGPU works with gc-specific runtime objects (mlir::gc::OclModuleBuilder, mlir::gc::OclModule, etc.).
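
A minimal sketch of the dispatch described above, using stand-in types; the real create() signature and constructor arguments in the PR may differ:

    #include <memory>
    #include <string>

    // Stand-ins for the PR's classes; only the dispatch pattern is shown.
    struct MLIREvaluateBase {
        virtual ~MLIREvaluateBase() = default;
        static std::unique_ptr<MLIREvaluateBase> create(const std::string& mlir_mode);
    };

    // Generic path: lowers/invokes via mlir::ExecutionEngine and mlir::Module.
    struct MLIREvaluate : MLIREvaluateBase {};
    // GC path: uses mlir::gc::OclModuleBuilder / mlir::gc::OclModule.
    struct MLIREvaluateGcGPU : MLIREvaluateBase {};

    std::unique_ptr<MLIREvaluateBase> MLIREvaluateBase::create(const std::string& mlir_mode) {
        if (mlir_mode == "GC_GPU")
            return std::make_unique<MLIREvaluateGcGPU>();
        return std::make_unique<MLIREvaluate>();
    }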

2. Context/device information is now forwarded to MLIREvaluateBase::create()

We need the context and device information in the gc-gpu runtime in order to build a module. That is why we now extract ocl_context and cl_device_id from RemoteContextImpl in TransformationsPipeline and forward them all the way to MLIREvaluateBase::create() via the ov::EvaluationContext map.
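
A rough sketch of the forwarding under stated assumptions: the key names are illustrative (not necessarily the ones the PR uses), and a plain string-to-pointer map stands in for ov::EvaluationContext:

    #include <CL/cl.h>
    #include <map>
    #include <string>

    // Stand-in for ov::EvaluationContext.
    using LoweringContext = std::map<std::string, void*>;

    // Handles extracted from RemoteContextImpl in TransformationsPipeline...
    LoweringContext make_lowering_context(cl_context ctx, cl_device_id dev) {
        LoweringContext lowering;
        lowering["ocl_context"]  = static_cast<void*>(ctx);  // illustrative key
        lowering["cl_device_id"] = static_cast<void*>(dev);  // illustrative key
        return lowering;
    }
    // ...and later read back in MLIREvaluateBase::create() to build the module.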

3. Separation between MLIREvaluate::invoke and MLIREvaluate::invoke_packed

A new invocation method, ::invoke(), was added to the MLIREvaluateBase interface. In contrast to ::invoke_packed(), which accepts memref arguments in the MemrefDescriptor format, ::invoke() takes the tensor vectors as-is.

The GC-GPU runtime expects arguments in a non-packed format (pointers only) if all memrefs in the compiled MLIR module have static shapes. Otherwise it expects the "packed" format (MemrefDescriptors).

A query method was added to determine which MLIREvaluate method to call.

(@AndreyPavlenko may provide more info on why we need this separation)
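
For reference, the "packed" MemrefDescriptor format follows MLIR's standard strided memref ABI; below is a rank-2 sketch of the descriptor layout:

    #include <cstdint>

    // What ::invoke_packed() passes per memref argument ("packed" format).
    template <typename T, int Rank>
    struct StridedMemRefType {
        T* basePtr;             // allocated pointer
        T* data;                // aligned pointer
        int64_t offset;         // element offset into 'data'
        int64_t sizes[Rank];    // shape, e.g. {64, 128}
        int64_t strides[Rank];  // row-major strides, e.g. {128, 1}
    };

    // When all memrefs in the compiled module have static shapes, the sizes,
    // strides, and offset are known at compile time, so the GC-GPU runtime
    // only needs the raw pointers: the "non-packed" format that ::invoke() passes.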

4. Actual OCL implementations of cldnn::stream/buffer/event are now exposed to intel_gpu/src/plugin/ops/mlir_op.cpp

The base classes of stream/buffer/event have no method that returns a handle to the actual underlying object (cl_queue/cl_mem/cl_event). To obtain these handles and pass them to the gc-gpu runtime, instances of these abstract objects are dynamic-cast to their presumed implementations (ocl::gpu_buffer / ocl_stream / ocl_base_event). To do that, we have to expose the declarations of these OCL-specific implementations to ops/mlir_op.cpp by modifying its include directories. Are we okay with this?

^--- this was replaced with the one below

4. cldnn::stream/buffer/event/device are now able to return an underlying ocl handle

In order to get the actual CL object and pass it to graph compiler's GPU runtime, a void* get_handle() method was added to the cldnn::stream/buffer/device/event interfaces. The method returns an OpenCL C-API handle (cl_mem, cl_command_queue, cl_device_id, ...), since the gc-gpu runtime takes these instead of the C++ wrappers.
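
A minimal sketch of the idea with stand-in class names; whether get_handle() returns the handle itself cast to void* or a pointer to it is a detail of the actual PR:

    #include <CL/cl.h>

    // Stand-in for the abstract cldnn::stream interface.
    struct stream {
        virtual ~stream() = default;
        virtual void* get_handle() const = 0;  // raw C-API handle for gc-gpu
    };

    // Stand-in for the OCL implementation.
    struct ocl_stream : stream {
        cl_command_queue queue{};
        // cl_command_queue is itself a pointer type, so it can be handed out
        // as void* without a wrapper object.
        void* get_handle() const override { return static_cast<void*>(queue); }
    };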

5. cldnn::stream::create_base_event(...) can now take a pointer to cl_event

(in order to propagate the cl_event returned from the gc-gpu runtime to the cldnn::event that is returned from MLIROp::evaluate())
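
An illustrative sketch of that flow with stand-in types (the actual cldnn signatures may differ):

    #include <CL/cl.h>
    #include <memory>

    // Stand-in for cldnn::event, holding the OpenCL event it wraps.
    struct base_event {
        explicit base_event(cl_event e) : handle(e) {}
        cl_event handle;
    };

    struct stream_stub {
        // New overload: wrap an existing cl_event (e.g. one returned by the
        // gc-gpu runtime) instead of creating a fresh event, so that
        // MLIROp::evaluate() can hand it back to the caller.
        std::shared_ptr<base_event> create_base_event(cl_event* ev) {
            return std::make_shared<base_event>(ev ? *ev : nullptr);
        }
    };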

dchigarev and others added 3 commits September 6, 2024 09:25
Co-authored-by: Andrey Pavlenko <[email protected]>
Signed-off-by: dchigarev <[email protected]>
Comment on lines +248 to +262
    shape(module_input_shape.begin(), module_input_shape.end()) {
        if (shape.size() != tensor.get_shape().size()) {
            // validate that the shape difference is due to trailing '1's
            for (size_t i = 0; i < shape.size(); ++i) {
                if (shape[i] != tensor.get_shape()[i]) {
                    OPENVINO_THROW("Mismatch in shape sizes");
                }
            }
            for (size_t i = shape.size(); i < tensor.get_shape().size(); ++i) {
                if (tensor.get_shape()[i] != 1) {
                    OPENVINO_THROW("Mismatch in shape sizes");
                }
            }
        }
        strides.resize(shape.size());
@dchigarev (Author) commented:

This is needed because GPU memory formats hold at least 4 dimensions, which causes trailing extra dims (<64x128x1x1> instead of <64x128>). This code compares the input tensor's dimensions with the input dimensions of the MLIR module and trims the extra dims.

src/plugins/intel_gpu/src/runtime/ocl/ocl_ext.hpp: review thread resolved (outdated)
@@ -38,21 +51,71 @@ void CreateMLIRSubgraphOp(ProgramBuilder& p, const std::shared_ptr<ov::op::mlir:

@dchigarev (Author) commented Sep 6, 2024:

We probably don't need this synchronization anymore since we pass the same queue to GC and submit our kernels to it.

@@ -11,7 +11,8 @@ namespace ov {

namespace pass {

-void TRANSFORMATIONS_API transformMLIR(std::shared_ptr<ov::Model> model);
+void TRANSFORMATIONS_API transformMLIR(std::shared_ptr<ov::Model> model,
+                                       std::shared_ptr<ov::EvaluationContext> loweringContext);
@dchigarev (Author) commented:

loweringContext stores the ocl_context for mlir_op::gpu.

src/plugins/intel_gpu/CMakeLists.txt: review thread resolved (outdated)
-OpenCL(bool out_of_order_queue = true)
-{
+OpenCL(bool out_of_order_queue = true) {
@dchigarev (Author) commented:

This was fixed by OpenVINO's linter.

@@ -23,8 +23,7 @@ struct OpenCL {
bool _supports_usm;
@dchigarev (Author) commented:

Moved this class from tests/unit_tests/utils to tests/common/utils in order to reuse it in the sanity tests for the GPU integration.

@@ -95,6 +93,20 @@ struct OpenCL {
_queue = cl::CommandQueue(_context, _device, props);
}

OpenCL(cl_context context, bool out_of_order_queue = true)
@dchigarev (Author) commented:

It's more convenient to construct this object from a cl_context in the sanity tests for the GPU integration, since we can simply request the context from the compiled model and construct this class from it.

cmake/graph-compiler.cmake: review thread resolved (outdated)
@dchigarev marked this pull request as ready for review on October 1, 2024.
@dchigarev (Author) commented:

@vladimir-paramuzov @slyalin @kurapov-peter @AndreyPavlenko

I think this PR is now in a state where it can be reviewed.

@slyalin (Owner) commented Oct 7, 2024:

Should we merge #167 before merging this PR? Are both PRs ready to be merged? If they don't have obvious breaking changes, it is more convenient to continue development in the main mlir branch. I have a merged version of the mlir branch and the master branch from the main openvino repository. So, to save you from fighting merge conflicts on your side, I would recommend merging the two PRs mentioned, after which I will redo the merge with the master openvino branch on my side. @kurapov-peter, @AndreyPavlenko, @dchigarev?

@kurapov-peter (Collaborator) commented:

#167 isn't ready. It still contains experimental code that needs to be cleaned up and points to a fork. @niuxiaog, could you please prepare it for the merge?

@dchigarev (Author) commented:

> Are both PRs ready to be merged? If they don't have obvious breaking changes, it is more convenient to continue development in the main mlir branch.

I think this PR is already in a state where it can be merged. There is one more question, though, regarding the exposure of gpu-runtime headers that I would like to discuss.

The question is whether it's okay to include the GPU runtime headers in the openvino_intel_gpu_plugin target so that we can access the definitions of the OCL-specific engine/buffer/stream implementations and extract the actual OCL handles from them. We may also need these headers in transformation_pipeline.cpp in order to extract a device id from the context. If this header exposure is not okay, what alternatives do we have for extracting the OCL handles? @vladimir-paramuzov

@kurapov-peter (Collaborator) left a review comment:

Looks good to me. Would TPP need anything from evaluation context btw?

@slyalin (Owner) commented Oct 14, 2024:

@vladimir-paramuzov, please approve explicitly and we will merge.

@dchigarev (Author) commented:

@slyalin I believe we've got all the approvals we needed.

@slyalin merged commit 6d28d16 into slyalin:mlir on Oct 15, 2024 (1 of 2 checks passed).