GEMM tuning #53
Replies: 10 comments 4 replies
-
Some documentation, which also contains an example of generating the logic file from an example configuration, is available in src_projects/Tensile/tuning_docs. A PDF can be generated from there by running build.sh.
-
A generated PDF version of this document is available at docs/tutorial/kernel_tuning/tensile_tuning.pdf
-
I've spent the last few days trying to follow the steps described in the document. Unfortunately, the initial benchmark fails because CMake cannot find all the required dependencies: even after manually setting CMAKE_PREFIX_PATH, it still fails to find rocm_smi, which causes the whole benchmark to crash.
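For anyone debugging the same failure, here is a minimal sketch (not part of Tensile or the tutorial) that checks whether any directory listed in CMAKE_PREFIX_PATH actually contains a CMake package config file for rocm_smi. It relies only on CMake's general naming convention for config files (`<Name>Config.cmake` or `<lower-name>-config.cmake`); the helper names `find_cmake_package_configs` and `prefixes_from_env` are made up for this example.

```python
import os
from pathlib import Path


def find_cmake_package_configs(package, prefixes):
    """Return config files for `package` found under the given prefix directories."""
    # CMake config-mode packages ship <Name>Config.cmake or <lower-name>-config.cmake.
    wanted = {f"{package}Config.cmake", f"{package.lower()}-config.cmake"}
    hits = []
    for prefix in prefixes:
        root = Path(prefix)
        if not root.is_dir():
            continue  # a prefix that does not exist cannot contain the package
        for path in root.rglob("*.cmake"):
            if path.name in wanted:
                hits.append(path)
    return sorted(hits)


def prefixes_from_env(environ=os.environ):
    """Split the CMAKE_PREFIX_PATH environment variable into directories."""
    raw = environ.get("CMAKE_PREFIX_PATH", "")
    return [p for p in raw.split(os.pathsep) if p]
```

Calling `find_cmake_package_configs("rocm_smi", prefixes_from_env())` and getting an empty list would suggest that none of the configured prefixes contain the package, which narrows the problem down to the install location rather than to Tensile itself.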
-
This step from the basic example in chapter 2 works for me:
--> results are in directories:
-
I just noticed that the TensileTuning starts for me but actually fails to finish. It completes 3 of the 5 steps and fails on CSV file generation. So I have these output directories created:
Inside these I have files like
But these 2 directories are not generated
The build log ends with an error because, for some reason, the CSV file is not generated. I have not yet found where in the Tensile code it is supposed to be created.
The full log file is attached in case it helps you with the first steps.
-
I believe that the original files used for creating the kernel tuning for gfx1030 are in directory
And tuning config files like
indicate that the same files can be used for multiple GPUs by running the tuning on each of them individually, changing only the ScheduleName, DeviceName and ArchitectureName. Within one architecture, multiple chip IDs can be specified (vega10, for example).
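To illustrate the idea of reusing one tuning config for several GPUs, here is a rough sketch that copies a config section and overrides only the identification fields. The key names are taken from the comment above and may not match the exact spelling in a real Tensile config; the helper name `retarget` and all example values are hypothetical placeholders.

```python
import copy


def retarget(logic_section, **overrides):
    """Return a copy of a config section with selected keys replaced.

    Unknown keys raise KeyError, so a misspelled field name is caught
    instead of being silently added next to the real one.
    """
    unknown = set(overrides) - set(logic_section)
    if unknown:
        raise KeyError(f"keys not present in the section: {sorted(unknown)}")
    updated = copy.deepcopy(logic_section)
    updated.update(overrides)
    return updated


# Placeholder values only; real ones come from the existing config files.
base = {
    "ScheduleName": "schedule_a",
    "DeviceName": "device_a",
    "ArchitectureName": "gfx1030",
}
other_gpu = retarget(
    base,
    ScheduleName="schedule_b",
    DeviceName="device_b",
    ArchitectureName="gfx1035",
)
```

The original section is left untouched, so the same base config can be retargeted once per GPU and each result written out as its own tuning file.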
-
Getting one step further with a very ugly, hackish workaround... My latest ROCm SDK build on this machine targets gfx1030, gfx1035 and gfx1036.
When I now run Tensile with the command:
it creates the
And it also refers to those in the 1_BenchmarkProblems/Cijk_Ailk_Bljk_SB_00/00_Final/source/ClientParameters.ini file that it creates
I also noticed the following lines in the output:
If I now do an ugly hack and force it to detect only gfx1035 in Tensile/Common.py, I get further:
Now the execution gets further, as it generates the correct gfx1035 files and also puts them in the ini file.
At one point I somehow managed to get the 3_ and 4_ directories generated, but now it fails. Hopefully you can get Tensile this far... I think I will do a new build with only gfx1035 as a target to investigate this further.
-
It took hours, but I found the reason for the stoi problem, and now I have Tensile crunching the optimization for the first Logic file in the background for gfx1035 (138 of 435 done...). We need to add gfx1035, gfx1036 and gfx1103 to the HardwareMonitor=false group in Tensile/Common.py. Otherwise the listener in ResultFileReporter.cpp receives some string values that cannot be converted to int with stoi. Debugging and tracing that is painful, as the build of these files is a soup involving
I will push an example improvement to docs/tutorial showing how I got this running.
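As a rough illustration of the two halves of this fix, here is a Python stand-in (it is not the actual Tensile/Common.py or ResultFileReporter.cpp code). The set name, both function names and the sample string "N/A" are hypothetical; only the three architecture names come from the comment above. Python's `int()` raising ValueError on a non-numeric string plays the role of `std::stoi` throwing on the same kind of input.

```python
# Architectures for which hardware monitoring would be switched off,
# as named in the comment above. The set itself is a hypothetical stand-in.
ARCHS_WITHOUT_HARDWARE_MONITOR = {"gfx1035", "gfx1036", "gfx1103"}


def hardware_monitor_enabled(arch_name):
    """Return False for architectures in the no-monitoring group."""
    return arch_name not in ARCHS_WITHOUT_HARDWARE_MONITOR


def parse_int_or_none(text):
    """Convert a reported value to int, or None if it is not numeric."""
    try:
        return int(text)
    except ValueError:
        # Non-numeric strings are the kind of input that breaks a bare conversion.
        return None


print(hardware_monitor_enabled("gfx1035"))
print(parse_int_or_none("138"))
print(parse_int_or_none("N/A"))
```

Disabling the monitor for the affected architectures avoids producing the unparsable values in the first place, while a tolerant conversion like `parse_int_or_none` would only hide them, which is presumably why the change belongs in the architecture list.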
-
I have now managed to generate 1 kernel tuning file, so I pushed to Tensile and docs/tutorial/kernel_tuning all the changes I know of that you should need to do the same on gfx1036 (if you can solve the Tensile library search problem). I also pushed the binfo file for llama_cpp but did not add it to the binfo_list file. I think I got it working with cmake, without any need for patching, by setting the proper env variables. Are you able to verify that it works for you too?
-
I got Tensile to run;
We can package this up in a little Python script for end users, maybe (using
And now I play the waiting game... well, not so fast. Testing with a smaller file first, it appears the results in
I'll also test gfx1030 first because it's a lot faster. This already gives me an issue where it barfs out "could not find a solution" for
-
GPUs that are not officially supported by ROCm get generic, unoptimized GEMM kernels.
There's a tool installed alongside rocBLAS, called rocblas-gemm-tune, that can help generate optimized kernels for a specific card; however, it's poorly documented and requires quite a bit of knowledge to use.
If anyone can figure out a simple guide to tuning GPUs with that tool, we can try to reach higher performance on some of the officially unsupported cards like the Navi 5000.
I think this is the right place to discuss it, as there doesn't seem to be any other conversation about GEMM tuning around.
The folder that contains all the optimized kernels in the rocBLAS source code is this:
rocBLAS/library/src/blas3/Tensile/Logic/asm_full
Here's some documentation: https://github.com/ROCm/Tensile/wiki
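A quick way to see whether tuned logic already exists for a given card is to search that folder by name. The sketch below assumes the logic files are YAML files whose file names contain the architecture or schedule name; both that assumption and the helper name `find_logic_files` are mine, so check them against your checkout.

```python
from pathlib import Path


def find_logic_files(logic_dir, name_fragment):
    """List YAML files under logic_dir whose file name contains name_fragment."""
    root = Path(logic_dir)
    fragment = name_fragment.lower()
    return sorted(
        path for path in root.rglob("*.yaml") if fragment in path.name.lower()
    )
```

For example, `find_logic_files("rocBLAS/library/src/blas3/Tensile/Logic/asm_full", "gfx1035")` returning an empty list would be consistent with that card falling back to the generic kernels.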