GEMM tuning #53

daniandtheweb · 2024-06-04T19:25:10Z

daniandtheweb
Jun 4, 2024

GPUs that are not officially supported by ROCm only get generic, unoptimized GEMM kernels.

There's a tool installed alongside rocBLAS, rocblas-gemm-tune, that can help generate optimized kernels for a specific card. However, it's poorly documented and requires quite a bit of knowledge to use.

If anyone can work out a simple guide for tuning GPUs with that tool, we can try to reach higher performance on some of the cards that are not officially supported, like the Navi 5000.

I think this is the right place to discuss it, as there doesn't seem to be any other conversation about GEMM tuning around.

The folder that contains all the optimized kernels in the source code of rocBLAS is this: rocBLAS/library/src/blas3/Tensile/Logic/asm_full.

Here's some documentation: https://github.com/ROCm/Tensile/wiki.

lamikr · 2024-06-06T12:40:45Z

lamikr
Jun 6, 2024
Maintainer

Some documentation, which also contains an example of generating the logic file from an example configuration, is available in src_projects/Tensile/tuning_docs. A PDF can be generated from there by running build.sh.

lamikr · 2024-06-22T00:00:48Z

lamikr
Jun 22, 2024
Maintainer

A generated PDF version of this document is available at docs/tutorial/kernel_tuning/tensile_tuning.pdf.

daniandtheweb · 2024-06-22T01:27:34Z

daniandtheweb
Jun 22, 2024
Author

I've spent the last few days trying to follow the steps described in the document. Unfortunately, the initial benchmark fails because CMake is unable to find all the required dependencies: even after manually setting CMAKE_PREFIX_PATH, it still fails to find rocm_smi, which makes the whole benchmark fail and crash.

lamikr · 2024-07-17T06:43:29Z

lamikr
Jul 17, 2024
Maintainer

This step from the chapter 2 basic example works for me:

source /opt/rocm_sdk_612/bin/env_rocm.sh
cd /opt/rocm_sdk_612/docs/tutorial/kernel_tuning
cp examples/example_vega10_tuning.yaml .
~/rocm/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/bin/Tensile example_vega10_tuning.yaml . > tuning.out 2>&1

--> results are in directories:

0_Build
1_BenchmarkProblems
2_BenchmarkData

2 replies

jeroen-mostert Jul 17, 2024

On my machine, this errors out with a cmake error when Tensile tries to create the build environment, saying it can't find FindAMDDeviceLibs.cmake. I can get the cmake itself to work by creating a symlink from /lib64/librocm_smi64.so to /lib/librocm_smi64.so and invoking ROCM_ROOT=${ROCM_PATH} cmake Tensile/Source, but merely adding ROCM_ROOT=${ROCM_PATH} does not allow Tensile itself to run.

If it works on your configuration, it's interesting to know why. :P

We would also still need a "working" tuning configuration that's at least on par with what rocBLAS itself incorporates, though producing something even better is of course no sin...

lamikr Jul 18, 2024
Maintainer

Just to verify: do you run
source /opt/rocm_sdk_612/bin/env_rocm.sh before calling Tensile?

Another issue is that Tensile seems to have a problem if the ROCm build environment was also built for GPUs other than the target GPU specified in the Tensile YAML config file. In my case it built the gfx1036 libs even though I had specified gfx1035 in my YAML file. It then failed because it could not load and execute the gfx1036 library on my laptop (which does not have that GPU).

lamikr · 2024-07-17T17:34:38Z

lamikr
Jul 17, 2024
Maintainer

Just noticed that the Tensile tuning starts for me but actually fails to finish. It completes 3 of the 5 steps and fails on CSV file generation, so I have these output directories created:

0_Build/
1_BenchmarkProblems
2_BenchmarkData

Inside of these I have files like
1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/Data/00_Final.yaml
./1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/Kernels.so-000-gfx1035.hsaco

But these 2 directories are not generated

3_LibraryLogic
4_LibraryClient

The end of the build log shows an error because, for some reason, the CSV file is not generated. I have not yet found where in the Tensile code it is supposed to be created.

+ ERR=0
+ /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/0_Build/client/tensile_client --config-file /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/../source/ClientParameters.ini
loading config file /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/../source/ClientParameters.ini
Loading /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/Kernels.so-000-gfx1036.hsaco
Loading /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/TensileLibrary_gfx1036.co
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error 209(hipErrorNoBinaryForGpu) /home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/Source/client/main.cpp:347: 
retError
no kernel image is available for execution on the device

/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/run.sh: line 5: 198393 Aborted                 (core dumped) /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/0_Build/client/tensile_client --config-file /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/../source/ClientParameters.ini
Tensile::WARNING: ClientWriter Benchmark Process exited with code 134
Tensile::WARNING: BenchmarkProblems: Benchmark Process exited with code 134
################################################################################
# Cijk_Ailk_Bljk_SB_00
# 00_Final: End - 195.419s
################################################################################

clientExit=1 (ERROR) for ['/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/example_gfx1035_tuning.yaml']
Traceback (most recent call last):
  File "/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/bin/Tensile", line 39, in <module>
    Tensile.main()
  File "/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/Tensile.py", line 314, in main
    Tensile(sys.argv[1:])
  File "/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/Tensile.py", line 297, in Tensile
    executeStepsInConfig(config)
  File "/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/Tensile.py", line 57, in executeStepsInConfig
    BenchmarkProblems.main(config["BenchmarkProblems"], config["UseCache"])
  File "/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/BenchmarkProblems.py", line 408, in main
    shutil.copy(resultsFileName, newResultsFileName)
  File "/opt/rocm_sdk_612/lib/python3.11/shutil.py", line 431, in copy
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/opt/rocm_sdk_612/lib/python3.11/shutil.py", line 256, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/Data/00_Final.csv'

Attached is the full log file in case it helps you with the first steps.
tuning_log.txt
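
A quick way to see how far a run got is to check which of the five stage directories exist and whether each benchmark has a result CSV. The following is only a minimal sketch: the directory names are the ones listed in this thread, the helper names are hypothetical, and it assumes that every Data/NAME.yaml should end up with a Data/NAME.csv next to it (which is what the traceback above suggests, but is not confirmed against the Tensile source).

```python
from pathlib import Path

# Stage directories named in this thread; assumed to be the full set
# for the example tuning configuration.
STAGES = [
    "0_Build",
    "1_BenchmarkProblems",
    "2_BenchmarkData",
    "3_LibraryLogic",
    "4_LibraryClient",
]


def missing_stages(workdir):
    """Return the stage directories that do not exist under workdir."""
    root = Path(workdir)
    return [name for name in STAGES if not (root / name).is_dir()]


def missing_result_csvs(workdir):
    """Return expected Data/*.csv paths whose .yaml sibling exists but the .csv does not.

    Assumption: each benchmark's Data/NAME.yaml is paired with Data/NAME.csv.
    """
    yaml_files = sorted(Path(workdir).glob("1_BenchmarkProblems/*/Data/*.yaml"))
    expected = [path.with_suffix(".csv") for path in yaml_files]
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    # Run from the tuning working directory.
    print("missing stages:", missing_stages("."))
    for path in missing_result_csvs("."):
        print("missing result file:", path)
```

A missing CSV together with a non-zero exit code from tensile_client points at the benchmark client crashing rather than at the copy step that finally raises the FileNotFoundError.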

2 replies

jeroen-mostert Jul 17, 2024

Link is broken.

lamikr Jul 17, 2024
Maintainer

I reuploaded

lamikr · 2024-07-17T17:42:23Z

lamikr
Jul 17, 2024
Maintainer

I believe that the original files used for creating the kernel tuning for gfx1030 are in the directory

rocm_sdk_builder/src_projects/Tensile/Tensile/Configs/navi21

And tuning config files like

rocm_sdk_builder_612/src_projects/Tensile/Tensile/Configs/rocblas_sgemm_asm_full.yaml

indicate that the same files can be used for multiple GPUs by running the tuning for each of them individually, changing only the ScheduleName, DeviceNames and ArchitectureName.

And within one architecture, multiple chip IDs can be specified (vega10, for example).
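
As a rough illustration of that retargeting, the sketch below swaps the three values in the text of a config file. The navi21 and rembrandt values are the ones that appear elsewhere in this thread; the function and dictionary names are hypothetical, and plain text replacement is a blunt tool that assumes these strings do not occur anywhere else in the file.

```python
# Values taken from this thread; the dictionary layout is just for illustration.
NAVI21 = {"schedule": "navi21", "device": "Device 73a2", "arch": "gfx1030"}
REMBRANDT = {"schedule": "rembrandt", "device": "Device 1681", "arch": "gfx1035"}


def retarget_config(text, old, new):
    """Replace every occurrence of the old schedule, device and arch names."""
    for key in ("schedule", "device", "arch"):
        text = text.replace(old[key], new[key])
    return text


SAMPLE = """LibraryLogic:
   ScheduleName: "navi21"
   DeviceNames: ["Device 73a2"]
   ArchitectureName: "gfx1030"
"""

if __name__ == "__main__":
    print(retarget_config(SAMPLE, NAVI21, REMBRANDT))
```

Working on the text instead of parsing the YAML keeps the rest of the file, including comments and ordering, byte-for-byte unchanged.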

lamikr · 2024-07-17T19:34:52Z

lamikr
Jul 17, 2024
Maintainer

Getting one step further with a very ugly, hackish workaround...

My latest ROCm SDK build on this machine targets gfx1030, gfx1035 and gfx1036.
In my example_gfx1035_tuning.yaml I correctly specify gfx1035 as the target:

LibraryLogic:
   ScheduleName: "rembrandt"
   DeviceNames: ["Device 1681"]
   ArchitectureName: "gfx1035"

When I now run Tensile with the command:

/home/lamikr/own/rocm/src/sdk/rocm_sdk_builder_612/src_projects/Tensile/Tensile/bin/Tensile example_gfx1035_tuning.yaml . > tuning2.out 2>&1

instead of creating

/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/TensileLibrary_gfx1035.co

it creates the

/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/TensileLibrary_gfx1036.co

It also refers to those files in the 1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/ClientParameters.ini that it creates:

code-object=/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/Kernels.so-000-gfx1036.hsaco
code-object=/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/TensileLibrary_gfx1036.co

I also noticed the following lines in the output:

# Restoring default globalParameters
# Detected local GPU with ISA: gfx1030
# Detected local GPU with ISA: gfx1035
# Detected local GPU with ISA: gfx1036

If I now apply an ugly hack and force it to detect only gfx1035 in Tensile/Common.py, I get further:

diff --git a/Tensile/Common.py b/Tensile/Common.py
index 3920717e..f61be3f7 100644
--- a/Tensile/Common.py
+++ b/Tensile/Common.py
@@ -2145,9 +2145,11 @@ def detectGlobalCurrentISA():
       for line in process.stdout.decode().split("\n"):
         arch = gfxArch(line.strip())
         if arch is not None:
-          if arch in globalParameters["SupportedISA"]:
-            print1("# Detected local GPU with ISA: " + gfxName(arch))
-            globalParameters["CurrentISA"] = arch
+          print("arch" + str(arch))
+          if str(arch) == "(10, 3, 5)":
+            if arch in globalParameters["SupportedISA"]:
+              print1("# Detected local GPU with ISA: " + gfxName(arch))
+              globalParameters["CurrentISA"] = arch
     if (process.returncode):
       printWarning("%s exited with code %u" % (globalParameters["ROCmAgentEnumeratorPath"], process.returncode))
     return process.returncode
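
Why the unpatched loop ends up on gfx1036 can be sketched in isolation: every supported ISA it sees overwrites CurrentISA, so the last detected one wins. The code below is a simplified stand-in, not Tensile's code. parse_gfx only approximates what gfxArch appears to return (a tuple such as (10, 3, 5), judging by the patch above), and all function names are hypothetical.

```python
import re


def parse_gfx(name):
    """Turn 'gfx1035' into (10, 3, 5); return None for anything else.

    Stand-in for Tensile's gfxArch: the last two characters are read as
    hexadecimal minor and stepping, the rest as the major version.
    """
    match = re.fullmatch(r"gfx([0-9]+)([0-9a-f])([0-9a-f])", name)
    if match is None:
        return None
    major, minor, step = match.groups()
    return (int(major), int(minor, 16), int(step, 16))


def detect_last_wins(lines, supported):
    """Mimic the unpatched loop: each supported match overwrites the result."""
    current = None
    for line in lines:
        arch = parse_gfx(line.strip())
        if arch is not None and arch in supported:
            current = arch
    return current


def detect_preferred(lines, supported, wanted):
    """Return wanted only if it is both detected and supported, else None."""
    for line in lines:
        arch = parse_gfx(line.strip())
        if arch == wanted and arch in supported:
            return arch
    return None


# The three ISAs reported in the output above.
detected = ["gfx1030", "gfx1035", "gfx1036"]
supported = {(10, 3, 0), (10, 3, 5), (10, 3, 6)}

print(detect_last_wins(detected, supported))              # (10, 3, 6)
print(detect_preferred(detected, supported, (10, 3, 5)))  # (10, 3, 5)
```

Passing the wanted ISA in as a parameter avoids hardcoding "(10, 3, 5)" as the patch does, at the cost of needing some way to supply it.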

Now the execution gets further, as it generates the correct gfx1035 files and also puts them in the ini file.
But it still fails with another error:

+ /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/0_Build/client/tensile_client --config-file /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/../source/ClientParameters.ini
loading config file /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/../source/ClientParameters.ini
Loading /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/Kernels.so-000-gfx1035.hsaco
Loading /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/library/TensileLibrary_gfx1035.co
Log level: Debug
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
/opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/run.sh: line 5: 296369 Aborted                 (core dumped) /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/0_Build/client/tensile_client --config-file /opt/rocm_sdk_612/docs/tutorial/kernel_tuning/1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/build/../source/ClientParameters.ini
Tensile::WARNING: ClientWriter Benchmark Process exited with code 134
Tensile::WARNING: BenchmarkProblems: Benchmark Process exited with code 134

At one point I was somehow able to get the 3_ and 4_ directories generated, but now it failed. Hopefully you can get Tensile this far...

I think I will do a new build with only gfx1035 as a target to investigate this further.

lamikr · 2024-07-18T02:46:45Z

lamikr
Jul 18, 2024
Maintainer

It took hours, but I found the reason for the stoi problem, and I now have Tensile crunching through the optimization for the first logic file for gfx1035 in the background... (138 of 435 done...)

We need to add gfx1035, gfx1036 and gfx1103 to the HardwareMonitor=false group in Tensile/Common.py. Otherwise the listener in ResultFileReporter.cpp receives some string values that cannot be converted to int with stoi.

Debugging and tracing this is painful, because the build of these files is a soup involving:

- Tensile's Python code
- tensile_client, which this Python code builds to 0_Build/client/tensile_client
- makefiles and shell scripts that are generated by this tensile_client
- Python code that calls these generated makefiles and shell scripts

I will push some example improvements to docs/tutorial showing how I got this running.

lamikr · 2024-07-18T08:02:01Z

lamikr
Jul 18, 2024
Maintainer

I have now managed to generate one kernel tuning file, so I pushed to Tensile and docs/tutorial/kernel_tuning all the changes I know of that you should need to do the same on gfx1036 (if you can solve the Tensile library search problem).
I have not tested the generated logic file yet.

I also pushed the binfo file for llama_cpp but did not add it to the binfo_list file. I think I got it working with cmake without needing patches by setting the proper env variables. Are you able to verify that it works for you too?

jeroen-mostert · 2024-07-18T11:13:54Z

jeroen-mostert
Jul 18, 2024

I got Tensile to run; export CMAKE_PREFIX_PATH=${ROCM_PATH}/lib64/cmake was all that was needed (obvious in hindsight, I got turned around when invoking it manually). The ISA and hardware monitor problems can be avoided without code changes by passing them as options, though in typical Tensile style this needs some fiddling. Editing the files is also a massive pain so let's not do that manually.

cp ~rocm_sdk_builder/src_projects/Tensile/Tensile/Configs/navi21/*.yaml .
sed 's:navi21:raphael: ; s:Device 73a2:Device 164e: ; s:gfx1030:gfx1036:' -i *.yaml
find *.yaml|xargs -I '{}' ~rocm_sdk_builder/src_projects/Tensile/Tensile/bin/Tensile --global-parameters "CurrentISA=(10,3,6)" HardwareMonitor=False Device=2 -- "{}" .

We can package this up in a little Python script for end users, maybe (using rocminfo to find the right device -- I don't think we should let Tensile try to generate things for all devices at once, even if it technically can) but let's tackle that later.
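
A first cut of such a script might look like the sketch below. It only builds the command line used above and picks gfx names out of whatever text a tool like rocminfo prints; it does not run anything. The helper names are hypothetical, and the assumption that the tool output contains plain gfx names (for example gfx1036) is not verified here.

```python
import re
import shlex


def gfx_names(text):
    """Collect distinct gfx names from tool output, keeping first-seen order."""
    seen = []
    for name in re.findall(r"\bgfx[0-9a-f]+\b", text):
        if name not in seen:
            seen.append(name)
    return seen


def tensile_argv(tensile_bin, config, outdir, isa, device):
    """Build the Tensile command line with the overrides shown in this comment."""
    isa_text = "(" + ",".join(str(part) for part in isa) + ")"
    return [
        tensile_bin,
        "--global-parameters",
        f"CurrentISA={isa_text}",
        "HardwareMonitor=False",
        f"Device={device}",
        "--",
        config,
        outdir,
    ]


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    argv = tensile_argv("Tensile", "example.yaml", ".", (10, 3, 6), 2)
    print(shlex.join(argv))
```

Building the arguments as a list and only joining them for display sidesteps the shell quoting that the parentheses in CurrentISA otherwise need.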

And now I play the waiting game... well, not so fast. Testing with a smaller file first, it appears the results in 3_LibraryLogic are overwritten rather than merged, like I thought they would be. So before I spend hours on results that are overwritten, let's look at what's actually necessary.

I'll also test gfx1030 first because it's a lot faster. This already gives me an issue where it barfs out "could not find a solution" for rocblas_hgemm_gb_nn_asm_full, which is not promising. I will prioritize finding out why the same tuning doesn't give the same results first (#114), then verify that tuning for gfx1030 gives the same results as AMD's vendor-supplied files, then if all that checks out, ~~waste~~ spend the hours running tuning for gfx1036 specifically to see if it makes a meaningful difference. Of course other people may do the same if they get to it first. :)
