nikiyadav/Loop-interchange-pass-in-MLIR
Developed a loop interchange pass in MLIR driven by an analytical cost model that optimizes for locality (spatial and temporal, both self and group) and for parallelism on multicores (so as to minimize the frequency of synchronization).

References:
[1] "Improving Data Locality with Loop Transformations", https://scholarworks.umass.edu/cgi/viewcontent.cgi?article=1139&context=cs_faculty_pubs
[2] "Improving Locality and Parallelism in Nested Loops", http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.79.9865&rep=rep1&type=pdf
[3] "Optimizations for Parallelism, Locality and More", https://www.csa.iisc.ac.in/~udayb/slides/uday-polyhedral-opt.pdf
[4] Chapter 11, "Optimizing for Parallelism and Locality", in Aho, Lam, Sethi, Ullman, "Compilers: Principles, Techniques, & Tools", second edition.

The algorithm below was put together with the help of the above references.

Algorithm:
1. For each loop l, compute groups of array references. Two references ref1 and ref2 belong to the same group with respect to loop l if:
   a. they refer to the same array and have exactly the same access function;
   b. or they refer to the same array and differ only in the l-th dimension by at most cacheLineSize;
   c. or they refer to the same array and differ by at most cacheLineSize in the last dimension.
   Cases a and b correspond to group temporal reuse; case c corresponds to group spatial reuse. After this step we have a set of reference groups for each loop l.
2. Find the valid loop permutations.
3. For each loop l, compute the number of memory accesses made when l is the innermost loop. For the innermost loop, the number of memory accesses contributed by a reference is 1 when the reference is loop invariant, tripCount/cacheLineSize when the reference has spatial reuse for loop l, and tripCount otherwise (a standalone sketch of this computation follows the assumptions below).
   a. For each reference group, choose a representative reference R:
      i. if R has spatial reuse on loop l, add tripCount/cacheLineSize to the number of memory accesses;
      ii. else, if R has temporal reuse on loop l, add 1 to the number of memory accesses;
      iii. else, add tripCount to the number of memory accesses.
   b. Multiply the resulting number of memory accesses by the trip counts of all the remaining loops.
4. Choose the loop with the least number of memory accesses as the innermost loop; call it L.
5. Find the valid loop permutations that have loop L as the innermost loop.
6. Find the loops that are parallel, i.e., that do not carry any loop dependence.
7. For each loop permutation found in step 5, calculate the synchronization cost. The synchronization cost is computed for each parallel loop: for a given loop, synchronization cost = product of the trip counts of all loops placed outside it in the permutation.
8. Choose the permutation with the least synchronization cost as the best permutation.

Assumptions and intuition behind the algorithm:
1. Only the best innermost loop is chosen because, in general, when the innermost loop has a large number of iterations and accesses a large amount of data, only reuse within the innermost loop can be exploited; in other words, only the innermost loop is included in the localized vector space.
2. Reuse in multiple directions cannot be exploited by loop interchange.
3. The cost model consists of the number of memory accesses (loopCost) and the synchronization cost. loopCost is calculated for every possible innermost loop, and the best candidate is chosen as the innermost loop. The loopCost calculation needs only the cache line size.
4. loopCost is lowest for the innermost loop that converts the most reuses into locality.
5. The synchronization cost is lowest for the loop permutation with the most coarse-grained parallelism.
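To make the cost model concrete, here is a small standalone C++ sketch of loopCost (step 3) and the synchronization cost (step 7). It is illustrative only: it does not use the MLIR APIs of the actual pass, and the names and simplified per-loop reuse flags (Loop, RefGroup, loopCost, syncCost, spatialReuse, temporalReuse) are assumptions made for this example; in the real pass this information would come from the affine access functions and dependence analysis.

```cpp
// Illustrative sketch of the cost model (steps 3-8). The reuse flags below
// are assumed inputs; the pass itself derives them from affine accesses.
#include <cstdint>
#include <iostream>
#include <vector>

struct Loop {
  uint64_t tripCount;
  bool isParallel; // carries no loop dependence
};

// One representative reference per reference group, with reuse information
// per loop (indexed by the loop's position in the nest).
struct RefGroup {
  std::vector<bool> spatialReuse;  // spatialReuse[l]: spatial reuse w.r.t. loop l
  std::vector<bool> temporalReuse; // temporalReuse[l]: temporal reuse / invariance w.r.t. loop l
};

// Step 3: number of memory accesses when loop `l` is placed innermost.
uint64_t loopCost(const std::vector<Loop> &loops,
                  const std::vector<RefGroup> &groups, unsigned l,
                  uint64_t cacheLineSize) {
  uint64_t innerAccesses = 0;
  for (const RefGroup &g : groups) {
    if (g.spatialReuse[l])
      innerAccesses += loops[l].tripCount / cacheLineSize;
    else if (g.temporalReuse[l])
      innerAccesses += 1;
    else
      innerAccesses += loops[l].tripCount;
  }
  // Step 3b: multiply by the trip counts of all remaining loops.
  uint64_t cost = innerAccesses;
  for (unsigned i = 0; i < loops.size(); ++i)
    if (i != l)
      cost *= loops[i].tripCount;
  return cost;
}

// Step 7: for every parallel loop in the permutation (outermost first),
// add the product of the trip counts of the loops placed outside it.
uint64_t syncCost(const std::vector<Loop> &loops,
                  const std::vector<unsigned> &perm) {
  uint64_t cost = 0;
  for (unsigned pos = 0; pos < perm.size(); ++pos) {
    if (!loops[perm[pos]].isParallel)
      continue;
    uint64_t outerProduct = 1;
    for (unsigned outer = 0; outer < pos; ++outer)
      outerProduct *= loops[perm[outer]].tripCount;
    cost += outerProduct;
  }
  return cost;
}

int main() {
  // Hypothetical 2-deep nest (i, j) accessing A[i][j] (row-major):
  // one reference group with spatial reuse only along j.
  std::vector<Loop> loops = {{1024, true}, {1024, false}};
  std::vector<RefGroup> groups = {{{false, true}, {false, false}}};
  uint64_t cacheLineSize = 16;

  // Steps 3-4: pick the candidate innermost loop with the least loopCost.
  for (unsigned l = 0; l < loops.size(); ++l)
    std::cout << "loopCost with loop " << l << " innermost: "
              << loopCost(loops, groups, l, cacheLineSize) << "\n";

  // Steps 5-8: among permutations keeping that loop innermost,
  // choose the one with the least synchronization cost.
  std::cout << "syncCost of permutation (i, j): " << syncCost(loops, {0, 1})
            << "\n";
  return 0;
}
```

For this example the sketch reports a much lower loopCost with j innermost (65536 vs. 1048576 accesses), and a synchronization cost of 1 when the parallel loop i stays outermost, which is the permutation the pass is meant to prefer.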
Test cases not passed:

How to build and run:
1. git clone https://github.com/llvm/llvm-project
2. git checkout 4ba7ae85da314e3f14a5bf26bf80dc29410b3e71
3. git apply --check patch.file
4. git apply patch.file
5. Build and run (https://mlir.llvm.org/getting_started/):

       mkdir llvm-project/build
       cd llvm-project/build
       cmake -G Ninja ../llvm \
          -DLLVM_ENABLE_PROJECTS=mlir \
          -DLLVM_BUILD_EXAMPLES=ON \
          -DLLVM_TARGETS_TO_BUILD="X86;NVPTX;AMDGPU" \
          -DCMAKE_BUILD_TYPE=Release \
          -DLLVM_ENABLE_ASSERTIONS=ON
          # -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DLLVM_ENABLE_LLD=ON
       cmake --build . --target check-mlir
       bin/mlir-opt interchange.mlir -affine-loop-interchange
       bin/mlir-opt ../../interchange4.mlir --affine-loop-interchange | bin/FileCheck ../../interchange4.mlir

How to create the patch:
- git add .
- git commit -m "msg"
- git format-patch HEAD~