
GC-GPU integration #169

Merged: 21 commits into slyalin:mlir on Oct 15, 2024

Conversation

@dchigarev commented Sep 6, 2024

This PR adds an integration with graph-compiler's GPU pipeline. The integration passes GPU buffers (USM pointers / cl_mem), the CL queue, and CL events as-is to GC for execution.

A set of sanity tests was also added to verify the integration.

How to build and run tests
  1. Build LLVM with IMEX patches:

    git clone https://github.com/intel/graph-compiler.git
    cd graph-compiler
    ./scripts/compile.sh --dev --llvm --imex
    export LLVM_INST_PATH=$(pwd)/externals/llvm-project/build
    
  2. Build OV from this branch:

    git clone https://github.com/dchigarev/openvino.git
    cd openvino && git checkout gc-gpu
    mkdir build && cd build
    # ENABLE_INTEL_GPU=ON enables the GPU capabilities of graph compiler
    cmake .. -G Ninja \
    	-DLLVM_DIR=$LLVM_INST_PATH/lib/cmake/llvm \
    	-DMLIR_DIR=$LLVM_INST_PATH/lib/cmake/mlir \
    	-DENABLE_GRAPH_COMPILER=ON \
    	-DENABLE_INTEL_GPU=ON \
    	-DENABLE_TESTS=ON
    
  3. Run sanity tests:

    OV_MLIR_MODE=GC_GPU ./bin/intel64/Release/ov_gpu_func_tests --gtest_filter=MLIRExecution.*
    
  4. Run benchmark_app:

    OV_MLIR_MODE=GC_GPU ./bin/intel64/Debug/benchmark_app -m ./src/plugins/intel_gpu/tests/functional/mlir_op/models/matmul_64_128_f16.xml -d GPU -use_device_mem -ip f16 -infer_precision f16 -niter 100 -hint none -nstreams 1 -nthreads 1
    

What was changed and how it works

1. Common MLIREvaluate class was split into two

There are now two classes: MLIREvaluate (generic evaluation) and MLIREvaluateGcGPU. Both implement the MLIREvaluateBase interface, and the actual instance is created based on the mlir_mode parameter in MLIREvaluateBase::create().

This split was made because the two evaluation classes operate on different objects to lower and invoke the received MLIR module. The generic MLIREvaluate works with mlir::ExecutionEngine and mlir::Module, while MLIREvaluateGcGPU works with gc-specific runtime objects (mlir::gc::OclModuleBuilder, mlir::gc::OclModule, etc.).
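
A minimal sketch of the dispatch described above, using stand-in types; the real create() signature and constructor arguments in the PR may differ:

    #include <memory>
    #include <string>

    // Stand-ins for the PR's classes; only the dispatch pattern is shown.
    struct MLIREvaluateBase {
        virtual ~MLIREvaluateBase() = default;
        static std::unique_ptr<MLIREvaluateBase> create(const std::string& mlir_mode);
    };

    // Generic path: lowers/invokes via mlir::ExecutionEngine and mlir::Module.
    struct MLIREvaluate : MLIREvaluateBase {};
    // GC path: uses mlir::gc::OclModuleBuilder / mlir::gc::OclModule.
    struct MLIREvaluateGcGPU : MLIREvaluateBase {};

    std::unique_ptr<MLIREvaluateBase> MLIREvaluateBase::create(const std::string& mlir_mode) {
        if (mlir_mode == "GC_GPU")
            return std::make_unique<MLIREvaluateGcGPU>();
        return std::make_unique<MLIREvaluate>();
    }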

2. Context/device information is now forwarded to MLIREvaluateBase::create()

We need the context and device information in the gc-gpu runtime in order to build a module. That is why we now extract ocl_context and cl_device_id from RemoteContextImpl in TransformationsPipeline and forward them all the way to MLIREvaluateBase::create() via the ov::EvaluationContext map.
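
A rough sketch of the forwarding under stated assumptions: the key names are illustrative (not necessarily the ones the PR uses), and a plain string-to-pointer map stands in for ov::EvaluationContext:

    #include <CL/cl.h>
    #include <map>
    #include <string>

    // Stand-in for ov::EvaluationContext.
    using LoweringContext = std::map<std::string, void*>;

    // Handles extracted from RemoteContextImpl in TransformationsPipeline...
    LoweringContext make_lowering_context(cl_context ctx, cl_device_id dev) {
        LoweringContext lowering;
        lowering["ocl_context"]  = static_cast<void*>(ctx);  // illustrative key
        lowering["cl_device_id"] = static_cast<void*>(dev);  // illustrative key
        return lowering;
    }
    // ...and later read back in MLIREvaluateBase::create() to build the module.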

3. Separation between MLIREvaluate::invoke and MLIREvaluate::invoke_packed

A new invocation method, ::invoke(), was added to the MLIREvaluateBase interface. In contrast to ::invoke_packed(), which accepts memref arguments in the MemrefDescriptor format, ::invoke() takes the tensor vectors as-is.

The GC-GPU runtime expects arguments in a non-packed format (pointers only) if all memrefs in the compiled MLIR module have static shapes. Otherwise it expects the "packed" format (MemrefDescriptors).

A query method was added to determine which MLIREvaluate method to call.

(@AndreyPavlenko may provide more info on why we need this separation)
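
For reference, the "packed" MemrefDescriptor format follows MLIR's standard strided memref ABI; below is a rank-2 sketch of the descriptor layout:

    #include <cstdint>

    // What ::invoke_packed() passes per memref argument ("packed" format).
    template <typename T, int Rank>
    struct StridedMemRefType {
        T* basePtr;             // allocated pointer
        T* data;                // aligned pointer
        int64_t offset;         // element offset into 'data'
        int64_t sizes[Rank];    // shape, e.g. {64, 128}
        int64_t strides[Rank];  // row-major strides, e.g. {128, 1}
    };

    // When all memrefs in the compiled module have static shapes, the sizes,
    // strides, and offset are known at compile time, so the GC-GPU runtime
    // only needs the raw pointers: the "non-packed" format that ::invoke() passes.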

4. Actual OCL implementations of cldnn::stream/buffer/event are now exposed to intel_gpu/src/plugin/ops/mlir_op.cpp

The base classes of stream/buffer/event have no method that returns a handle to the actual underlying object (cl_queue/cl_mem/cl_event). To obtain these handles and pass them to the gc-gpu runtime, instances of these abstract objects are dynamic-cast to their presumed implementations (ocl::gpu_buffer / ocl_stream / ocl_base_event). To do that, we have to expose the declarations of these OCL-specific implementations to ops/mlir_op.cpp by modifying its include directories. Are we okay with this?

^--- this was replaced with the one below

4. cldnn::stream/buffer/event/device are now able to return an underlying ocl handle

In order to get the actual CL object and pass it to graph compiler's GPU runtime, a void* get_handle() method was added to the cldnn::stream/buffer/device/event interfaces. The method returns an OpenCL C-API handle (cl_mem, cl_command_queue, cl_device_id, ...), since the gc-gpu runtime takes these instead of the C++ wrappers.
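
A minimal sketch of the idea with stand-in class names; whether get_handle() returns the handle itself cast to void* or a pointer to it is a detail of the actual PR:

    #include <CL/cl.h>

    // Stand-in for the abstract cldnn::stream interface.
    struct stream {
        virtual ~stream() = default;
        virtual void* get_handle() const = 0;  // raw C-API handle for gc-gpu
    };

    // Stand-in for the OCL implementation.
    struct ocl_stream : stream {
        cl_command_queue queue{};
        // cl_command_queue is itself a pointer type, so it can be handed out
        // as void* without a wrapper object.
        void* get_handle() const override { return static_cast<void*>(queue); }
    };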

5. cldnn::stream::create_base_event(...) can now take a pointer to cl_event

(in order to propagate the cl_event returned from the gc-gpu runtime to the cldnn::event that is returned from MLIROp::evaluate())
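
An illustrative sketch of that flow with stand-in types (the actual cldnn signatures may differ):

    #include <CL/cl.h>
    #include <memory>

    // Stand-in for cldnn::event, holding the OpenCL event it wraps.
    struct base_event {
        explicit base_event(cl_event e) : handle(e) {}
        cl_event handle;
    };

    struct stream_stub {
        // New overload: wrap an existing cl_event (e.g. one returned by the
        // gc-gpu runtime) instead of creating a fresh event, so that
        // MLIROp::evaluate() can hand it back to the caller.
        std::shared_ptr<base_event> create_base_event(cl_event* ev) {
            return std::make_shared<base_event>(ev ? *ev : nullptr);
        }
    };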

dchigarev and others added 3 commits September 6, 2024 09:25
Co-authored-by: Andrey Pavlenko <[email protected]>
Signed-off-by: dchigarev <[email protected]>
Comment on lines +248 to +262
    shape(module_input_shape.begin(), module_input_shape.end()) {
        if (shape.size() != tensor.get_shape().size()) {
            // validate that the shape difference is due to trailing '1's
            for (size_t i = 0; i < shape.size(); ++i) {
                if (shape[i] != tensor.get_shape()[i]) {
                    OPENVINO_THROW("Mismatch in shape sizes");
                }
            }
            for (size_t i = shape.size(); i < tensor.get_shape().size(); ++i) {
                if (tensor.get_shape()[i] != 1) {
                    OPENVINO_THROW("Mismatch in shape sizes");
                }
            }
        }
        strides.resize(shape.size());
@dchigarev (Author) commented:

This is needed because GPU memory formats hold at least 4 dimensions, which causes trailing extra dims (<64x128x1x1> instead of <64x128>). This code compares the input tensor's dimensions with the input dimensions of the MLIR module and trims the extra dims.

src/plugins/intel_gpu/src/runtime/ocl/ocl_ext.hpp: review thread resolved (outdated)
@@ -38,21 +51,71 @@ void CreateMLIRSubgraphOp(ProgramBuilder& p, const std::shared_ptr<ov::op::mlir:

@dchigarev (Author) commented Sep 6, 2024:

We probably don't need this synchronization anymore since we pass the same queue to GC and submit our kernels to it.

@@ -11,7 +11,8 @@ namespace ov {

namespace pass {

-void TRANSFORMATIONS_API transformMLIR(std::shared_ptr<ov::Model> model);
+void TRANSFORMATIONS_API transformMLIR(std::shared_ptr<ov::Model> model,
+                                       std::shared_ptr<ov::EvaluationContext> loweringContext);
@dchigarev (Author) commented:

loweringContext stores the ocl_context for mlir_op::gpu.

src/plugins/intel_gpu/CMakeLists.txt: review thread resolved (outdated)
-OpenCL(bool out_of_order_queue = true)
-{
+OpenCL(bool out_of_order_queue = true) {
@dchigarev (Author) commented:

This was fixed by OpenVINO's linter.

@@ -23,8 +23,7 @@ struct OpenCL {
bool _supports_usm;
@dchigarev (Author) commented:

Moved this class from tests/unit_tests/utils to tests/common/utils in order to reuse it in the sanity tests for the GPU integration.

@@ -95,6 +93,20 @@ struct OpenCL {
_queue = cl::CommandQueue(_context, _device, props);
}

OpenCL(cl_context context, bool out_of_order_queue = true)
@dchigarev (Author) commented:

It's more convenient to construct this object from a cl_context in the sanity tests for the GPU integration, since we can simply request the context from the compiled model and construct this class from it.

cmake/graph-compiler.cmake: review thread resolved (outdated)
@dchigarev marked this pull request as ready for review on October 1, 2024.
@dchigarev (Author) commented:

@vladimir-paramuzov @slyalin @kurapov-peter @AndreyPavlenko

I think this PR is now in a state where it can be reviewed.

@slyalin (Owner) commented Oct 7, 2024:

Should we merge #167 before merging this PR? Are both PRs ready to be merged? If they don't have obvious breaking changes, it is more convenient to continue development in the main mlir branch. I have a merged version of the mlir branch and the master branch from the main openvino repository. So, to save you from fighting merge conflicts on your side, I would recommend merging the two PRs mentioned, after which I will redo the merge with the master openvino branch on my side. @kurapov-peter, @AndreyPavlenko, @dchigarev?

@kurapov-peter (Collaborator) commented:

#167 isn't ready. It still contains experimental code that needs to be cleaned up and points to a fork. @niuxiaog, could you please prepare it for the merge?

@dchigarev (Author) commented:

> Are both PRs ready to be merged? If they don't have obvious breaking changes, it is more convenient to continue development in the main mlir branch.

I think this PR is already in a state where it can be merged. There is one more question, though, regarding the exposure of gpu-runtime headers that I would like to discuss.

The question is whether it's okay to include the GPU runtime headers in the openvino_intel_gpu_plugin target so that we can access the definitions of the OCL-specific engine/buffer/stream implementations and extract the actual OCL handles from them. We may also need these headers in transformation_pipeline.cpp in order to extract a device id from the context. If this header exposure is not okay, what alternatives do we have for extracting the OCL handles? @vladimir-paramuzov

@kurapov-peter (Collaborator) left a review comment:

Looks good to me. Would TPP need anything from evaluation context btw?

@slyalin (Owner) commented Oct 14, 2024:

@vladimir-paramuzov, please approve explicitly and we will merge.

@dchigarev (Author) commented:

@slyalin I believe we've got all the approvals we needed.

@slyalin merged commit 6d28d16 into slyalin:mlir on Oct 15, 2024 (1 of 2 checks passed).