forked from tobiasgrosser/polyhedral.info
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPublications.bib
3268 lines (3086 loc) · 170 KB
/
Publications.bib
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
@Comment{{
We try to keep the formatting in this bibtex file consistent. Please
try to follow the following style guide.
- Order: The entries of this file are ordered by year of appearance and
then by the bibtex tags (newest entries at the top).
- Keys: Use the style firstauthor.lastname + year + optional-tag.
E.g. [Feautrier1992multi]
- '{}': Use a single pair of braces and embrace individual words/letters
that should always remain uppercase.
- Abbreviations: Do not abbreviate conferences and journal names.
- Abstracts: Include abstracts, if available.
- ACM style: For all remaining style issues, we to follow the style used by ACM
(see e.g., Baskaran2009)
!! The style rules are necessarily incomplete, if you would like to improve
the style of this file, feel free to provide a patch that both extends
the style guide and fixes the existing entries.
}}
@inproceedings{Alias2021dpn,
title={Data-aware process networks},
author={Alias, Christophe and Plesco, Alexandru},
booktitle={Proceedings of the 30th ACM SIGPLAN International Conference on Compiler Construction},
pages={1--11},
year={2021},
keywords = {High-Level Synthesis, FPGA, Automatic Parallelization, Polyhedral Model}
}
@inproceedings{Baghdadi:2019:TPC:3314872.3314896,
author = {Baghdadi, Riyadh and Ray, Jessica and Romdhane, Malek Ben and Del Sozzo, Emanuele and Akkas, Abdurrahman and Zhang, Yunming and Suriana, Patricia and Kamil, Shoaib and Amarasinghe, Saman},
title = {Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code},
booktitle = {Proceedings of the 2019 IEEE/ACM International Symposium on Code Generation and Optimization},
series = {CGO 2019},
year = {2019},
isbn = {978-1-7281-1436-1},
location = {Washington, DC, USA},
pages = {193--205},
numpages = {13},
url = {https://arxiv.org/pdf/1804.10694.pdf},
acmid = {3314896},
publisher = {IEEE Press},
address = {Piscataway, NJ, USA},
keywords = {Code Generation, Code Optimization, Deep Learning, Distributed Systems, GPU, Polyhedral Model, Tensors}
}
@INPROCEEDINGS{7429301,
author={R. {Baghdadi} and U. {Beaugnon} and A. {Cohen} and T. {Grosser} and M. {Kruse} and C. {Reddy} and S. {Verdoolaege} and A. {Betts} and A. F. {Donaldson} and J. {Ketema} and J. {Absar} and S. v. {Haastregt} and A. {Kravets} and A. {Lokhmotov} and R. {David} and E. {Hajiyev}},
booktitle={2015 International Conference on Parallel Architecture and Compilation (PACT)},
title={PENCIL: A Platform-Neutral Compute Intermediate Language for Accelerator Programming},
year={2015},
volume={},
number={},
pages={138-149},
keywords={application program interfaces;graphics processing units;parallel architectures;parallel programming;program compilers;specification languages;platform-neutral compute intermediate language;accelerator programming;GPUs;low-level APIs;CUDA;automatic parallelization;domain specific languages;performance portability;GNU C99;portable implementation language;DSL compilers;PENCIL-to-OpenCL backend;polyhedral compiler;data-dependent control flow;nonaffine array accesses;image processing kernels;Rodinia suites;SHOC suites;DSL embedding scenarios;linear algebra;signal processing radar applications;SpearDE;AMD Radeon HD 5670 GPU platform;R9 285 GPU platform;NVIDIA GTX 470 GPU platform;ARM Mali-T604 GPU platform;DSL;Optimization;Kernel;Image processing;Graphics processing units;Benchmark testing;Arrays;automatic optimization;intermediate language;polyhedral model;domain specific languages;OpenCL},
doi={10.1109/PACT.2015.17},
ISSN={1089-795X},
month={Oct},
url = {https://ieeexplore.ieee.org/document/7429301}
}
@article{Baghdadi:2013:ILT:2400682.2400711,
author = {Baghdadi, Riyadh and Cohen, Albert and Verdoolaege, Sven and Trifunovi\'{c}, Konrad},
title = {Improved Loop Tiling Based on the Removal of Spurious False Dependences},
journal = {ACM Trans. Archit. Code Optim.},
issue_date = {January 2013},
volume = {9},
number = {4},
month = jan,
year = {2013},
issn = {1544-3566},
pages = {52:1--52:26},
articleno = {52},
numpages = {26},
url = {http://doi.acm.org/10.1145/2400682.2400711},
doi = {10.1145/2400682.2400711},
acmid = {2400711},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Tiling, compiler, expansion, false dependences, memory-based dependences}
}
@techreport{baghdadi:hal-01154812,
TITLE = {{PENCIL Language Specification}},
AUTHOR = {Baghdadi, Riyadh and Cohen, Albert and Grosser, Tobias and Verdoolaege, Sven and Lokhmotov, Anton and Absar, Javed and Van Haastregt, Sven and Kravets, Alexey and Donaldson, Alastair},
URL = {https://hal.inria.fr/hal-01154812},
TYPE = {Research Report},
NUMBER = {RR-8706},
PAGES = {37},
INSTITUTION = {{INRIA}},
YEAR = {2015},
MONTH = May,
KEYWORDS = {PENCIL ; DSL ; Accelerator ; Domain Specific Language ; Intermediate Language ; OpenCL},
PDF = {https://hal.inria.fr/hal-01154812/file/RR-8706.pdf},
HAL_ID = {hal-01154812},
HAL_VERSION = {v3}
}
@article{Sukumaran-Rajam2015NonLinearLoops,
author = {Sukumaran-Rajam, Aravind and Clauss, Philippe},
title = {The Polyhedral Model of Nonlinear Loops},
journal = {ACM Transactions on Architecture and Code Optimization},
issue_date = {January 2016},
volume = {12},
number = {4},
month = dec,
year = {2015},
issn = {1544-3566},
pages = {48:1--48:27},
articleno = {48},
numpages = {27},
url = {http://doi.acm.org/10.1145/2838734},
doi = {10.1145/2838734},
acmid = {2838734},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Speculative and dynamic loop parallelization, nonlinear memory references, polyhedral model},
abstract = {
Runtime code optimization and speculative execution are becoming increasingly
prominent to leverage performance in the current multi- and many-core era.
However, a wider and more efficient use of such techniques is mainly hampered
by the prohibitive time overhead induced by centralized data race detection,
dynamic code behavior modeling, and code generation. Most of the existing
Thread Level Speculation (TLS) systems rely on naively slicing the target loops
into chunks and trying to execute the chunks in parallel with the help of a
centralized performance-penalizing verification module that takes care of data
races. Due to the lack of a data dependence model, these speculative systems
are not capable of doing advanced transformations, and, more importantly, the
chances of rollback are high. The polyhedral model is a well-known mathematical
model to analyze and optimize loop nests. The current state-of-art tools limit
the application of the polyhedral model to static control codes. Thus, none of
these tools can generally handle codes with while loops, indirect memory
accesses, or pointers. Apollo (Automatic POLyhedral Loop Optimizer) is a
framework that goes one step beyond and applies the polyhedral model
dynamically by using TLS. Apollo can predict, at runtime, whether the codes are
behaving linearly or not, and it applies polyhedral transformations on-the-fly.
This article presents a novel system that enables Apollo to handle codes whose
memory accesses and loop bounds are not necessarily linear. More generally,
this approach expands the applicability of the polyhedral model at runtime to a
wider class of codes. Plugging together both linear and nonlinear accesses to
the dependence prediction model enables the application of polyhedral loop
optimizing transformations even for nonlinear code kernels while also allowing
a low-cost speculation verification.
}
}
@inproceedings{Mullapudi2015asplos,
author = {Mullapudi, Ravi Teja and
Vasista, Vinay and
Bondhugula, Uday.},
title = {PolyMage: Automatic Optimization for Image Processing Pipelines},
booktitle = {International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)},
year = {2015},
url = {http://mcl.csa.iisc.ernet.in/polymage.html},
abstract = {
This paper presents the design and implementation of PolyMage, a
domain-specific language and compiler for image processing pipelines.
An image processing pipeline can be viewed as a graph of interconnected
stages which process images successively. Each stage typically performs
one of point-wise, stencil, reduction or data-dependent operations on
image pixels. Individual stages in a pipeline typically exhibit abundant
data parallelism that can be exploited with relative ease. However, the
stages also require high memory bandwidth preventing effective
utilization of parallelism available on modern architectures. For
applications that demand high performance, the traditional options are
to use optimized libraries like OpenCV or to optimize manually. While
using libraries precludes optimization across library routines, manual
optimization accounting for both parallelism and locality is very
tedious.
The focus of our system, PolyMage, is on automatically generating
high-performance implementations of image processing pipelines expressed
in a high-level declarative language. Our optimization approach
primarily relies on the transformation and code generation capabilities
of the polyhedral compiler framework. To the best of our knowledge, this
is the first model-driven compiler for image processing pipelines that
performs complex fusion, tiling, and storage optimization automatically.
Experimental results on a modern multicore system show that the
performance achieved by our automatic approach is up to 1.81$\times$
better than that achieved through manual tuning in Halide, a
state-of-the-art language and compiler for image processing pipelines.
For a camera raw image processing pipeline, our performance is
comparable to that of a hand-tuned implementation.}
}
@inproceedings{Acharya2015ppopp,
author = {Acharya, Aravind and
Bondhugula, Uday.},
title = {Pluto+: Near-Complete Modeling of Affine Transformations for Parallelism and Locality},
booktitle = {ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},
year = {2015},
url = {http://mcl.csa.iisc.ernet.in/downloads/publications/acharya15ppopp.pdf},
abstract = {
Affine transformations have proven to be very powerful for loop restructuring
due to their ability to model a very wide range of transformations. A
single multi-dimensional affine function can represent a long and complex
sequence of simpler transformations. Existing affine transformation
frameworks like the Pluto algorithm, that include a cost function for
modern multicore architectures where coarse-grained parallelism and
locality are crucial, consider only a sub-space of transformations to avoid
a combinatorial explosion in finding the transformations. The ensuing
practical tradeoffs lead to the exclusion of certain useful
transformations, in particular, transformation compositions involving loop
reversals and loop skewing by negative factors. In this paper, we propose
an approach to address this limitation by modeling a much larger space of
affine transformations in conjunction with the Pluto algorithm's cost
function. We perform an experimental evaluation of both, the effect on
compilation time, and performance of generated codes. The evaluation shows
that our new framework, Pluto+, provides no degradation in performance in
any of the Polybench benchmarks. For Lattice Boltzmann Method (LBM) codes
with periodic boundary conditions, it provides a mean speedup of 1.33$\times$ over
Pluto. We also show that Pluto+ does not increase compile times
significantly. Experimental results on Polybench show that Pluto+
increases overall polyhedral source-to-source optimization time only by
15%. In cases where it improves execution time significantly, it increased
polyhedral optimization time only by 2.04$\times$.}
}
@inproceedings{Stock2014,
author = {Stock, Kevin and
Kong, Martin and
Grosser, Tobias and
Pouchet, Louis-No{\"e}l and
Rastello, Fabrice and
Ramanujam, J. and
Sadayappan, P.},
title = { A Framework for Enhancing Data Reuse via Associative Reordering},
booktitle = {Conference on Programming Language Design and Implementation (PLDI)},
year = {2014},
}
@inproceedings{Tavarageri2014,
author = {Tavarageri, Sanket and
Krishnamoorthy, Sriram and
Sadayappan, P.},
title = {Compiler-Assisted Detection of Transient Memory Errors},
booktitle = {Conference on Programming Language Design and Implementation (PLDI)},
year = {2014},
}
@inproceedings{Juega2014cgo,
author = {Juega, Carlos and P\'{e}rez, Jos\'{e} Ignacio G\'{o}mez and Tenllado, Christian and Catthoor, Francky Catthoor},
title = {Adaptive Mapping and Parameter Selection Scheme to Improve Automatic
Code Generation for GPUs},
booktitle = {{International Symposium on Code Generation and Optimization (CGO)}},
year = 2014,
address = {Orlando, FL, United States},
}
@inproceedings{Grosser2014cgo,
author = {Grosser, Tobias and Cohen, Albert and Holewinski, Justin and Sadayappan, P. and Verdoolaege, Sven},
title = {{Hybrid Hexagonal/Classical Tiling for GPUs}},
booktitle = {{International Symposium on Code Generation and Optimization (CGO)}},
year = 2014,
address = {Orlando, FL, United States},
url = {http://hal.inria.fr/hal-00911177}
}
@inproceedings{Mehta2014ppopp,
author = {Mehta, Sanyam and Lin, Pei-Hung and Yew, Pen-Chung},
title = {Revisiting Loop Fusion in the Polyhedral Framework},
booktitle = {Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)},
series = {PPoPP '14},
year = {2014},
url = {http://www-users.cs.umn.edu/~sanyam/publications/p233-mehta.pdf},
abstract = {Loop fusion is an important compiler optimization for improving memory hierarchy performance through enabling data reuse. Traditional compilers have approached loop fusion in a manner decoupled from other high-level loop optimizations, missing several interesting solutions. Recently, the polyhedral compiler framework with its ability to compose complex transformations, has proved to be promising in performing loop optimizations for small programs. However, our experiments with large programs using state-of-the-art polyhedral compiler frameworks reveal suboptimal fusion partitions in the transformed code. We trace the reason for this to be lack of an effective cost model to choose a good fusion partitioning among the possible choices, which increase exponentially with the number of program statements. In this paper, we propose a fusion algorithm to choose good fusion partitions with two objective functions - achieving good data reuse and preserving parallelism inherent in the source code. These objectives, although targeted by previous work in traditional compilers, pose new challenges within the polyhedral compiler framework and have thus not been addressed. In our algorithm, we propose several heuristics that work effectively within the polyhedral compiler framework and allow us to achieve the proposed objectives. Experimental results show that our fusion algorithm achieves performance comparable to the existing polyhedral compilers for small kernel programs, and significantly outperforms them for large benchmark programs such as those in the SPEC benchmark suite.}
}
@Article{Jimborean2014speculative,
author="Jimborean, Alexandra
and Clauss, Philippe
and Dollinger, Jean-Fran{\c{c}}ois
and Loechner, Vincent
and Martinez Caama{\~{n}}o, Juan Manuel",
title="Dynamic and Speculative Polyhedral Parallelization Using Compiler-Generated Skeletons",
journal="International Journal of Parallel Programming",
year="2014",
volume="42",
number="4",
pages="529--545",
abstract="We propose a framework based on an original generation and use of algorithmic skeletons, and dedicated to speculative parallelization of scientific nested loop kernels, able to apply at run-time polyhedral transformations to the target code in order to exhibit parallelism and data locality. Parallel code generation is achieved almost at no cost by using binary algorithmic skeletons that are generated at compile-time, and that embed the original code and operations devoted to instantiate a polyhedral parallelizing transformation and to verify the speculations on dependences. The skeletons are patched at run-time to generate the executable code. The run-time process includes a transformation selection guided by online profiling phases on short samples, using an instrumented version of the code. During this phase, the accessed memory addresses are used to compute on-the-fly dependence distance vectors, and are also interpolated to build a predictor of the forthcoming accesses. Interpolating functions and distance vectors are then employed for dependence analysis to select a parallelizing transformation that, if the prediction is correct, does not induce any rollback during execution. In order to ensure that the rollback time overhead stays low, the code is executed in successive slices of the outermost original loop of the nest. Each slice can be either a parallel version which instantiates a skeleton, a sequential original version, or an instrumented version. Moreover, such slicing of the execution provides the opportunity of transforming differently the code to adapt to the observed execution phases, by patching differently one of the pre-built skeletons. The framework has been implemented with extensions of the LLVM compiler and an x86-64 runtime system. Significant speed-ups are shown on a set of benchmarks that could not have been handled efficiently by a compiler.",
issn="1573-7640",
doi="10.1007/s10766-013-0259-4",
url="http://dx.doi.org/10.1007/s10766-013-0259-4"
}
@inproceedings{Jimborean2014cgo,
author = {Jimborean, Alexandra and Koukos, Konstantinos and Spiliopoulos, Vasileios and
Black-Schaffer, David and Kaxiras, Stefanos},
title = {{Fix the code. Don't tweak the hardware: A new compiler approach to Voltage-Frequency scaling}},
booktitle = {{International Symposium on Code Generation and Optimization (CGO)}},
year = 2014,
address = {Orlando, FL, United States},
}
@inproceedings{Venkat2014cgo,
author = {Venkat, Anand and Shantharam, Manu and Hall, Mary and Strout, Michelle},
title = {{Non-affine Extensions to Polyhedral Code Generation}},
booktitle = {{International Symposium on Code Generation and Optimization (CGO)}},
year = 2014,
address = {Orlando, FL, United States},
}
@InProceedings{Grosser2014Relation,
author = {Tobias Grosser and Sven Verdoolaege and Albert Cohen and P. Sadayappan},
title = {The relation between diamond tiling and hexagonal tiling},
booktitle = {1st International Workshop on High-Performance Stencil Computations (HiStencils 2014)},
address = {Vienna, Austria},
month = jan,
year = {2014},
url = {http://www.exastencils.org/histencils/histencils2014.pdf#page=75},
abstract = {
Iterative stencil computations are important in scientific computing and more
and more also in the embedded and mobile domain. Recent publications have shown
that tiling schemes that ensure concurrent start provide efficient ways to
execute these kernels. Diamond tiling and hybrid-hexagonal tiling are two
successful tiling schemes that enable concurrent start. Both have different
advantages: diamond tiling is integrated in a general purpose optimization
framework and uses a cost function to choose among tiling hyperplanes, whereas
the more flexible tile sizes of hybrid-hexagonal tiling have proven to be
effective for the generation of GPU code.
We show that these two approaches are even more interesting when combined. We
revisit the formalization of diamond and hexagonal tiling, present the effects
of tile size and wavefront choices on tile-level parallelism, and formulate
constraints for optimal diamond tile shapes. We then extend the diamond tiling
formulation into a hexagonal tiling one, combining the benefits of both. The
paper closes with an outlook of hexagonal tiling in higher dimensional spaces,
an important generalization suitable for massively parallel architectures.
}
}
@inproceedings{Darte2014impact,
author = {Darte, Alain and Isoard, Alexandre},
title = {Parametric Tiling with Inter-Tile Data Reuse},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-darte.pdf},
abstract = {
Loop tiling is a loop transformation widely used to improve spatial and
temporal data locality, increase computation granularity, and enable blocking
algorithms, which are particularly useful when offloading kernels on
platforms with small memories. When hardware caches are not available, data
transfers must be software-managed: they can be reduced by exploiting data
reuse between tiles and, this way, avoid some useless external communications.
An important parameter of loop tiling is the sizes of the tiles, which impact
the size of the necessary local memory. However, for most analyzes that
involve several tiles, which is the case for intertile data reuse, the tile
sizes induce non-linear constraints, unless they are numerical constants.
This complicates or prevents a parametric analysis. In this paper, we show
that, actually, parametric tiling with inter-tile data reuse is nevertheless
possible, i.e., it is possible to determine, at compile-time and in a
parametric fashion, the copy-in and copy-out data sets for all tiles, with
inter-tile reuse, as well as the sizes of the induced local memories, without
the need to analyze the code for each tile size.
}
}
@inproceedings{Guo2014impact,
author = {Guo, Jing and Bernecky, Robert and
Thiyagalingam, Jeyarajan and Scholz, Sven-Bodo},
title = {Polyhedral Methods for Improving Parallel Update-in-Place},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-guo.pdf},
abstract = {
We demonstrate an optimization, denoted as polyhedral reuse analysis (PRA),
that uses polyhedral methods to improve the analysis of in-place update for
single-assignment arrays. The PRA optimization attempts to determine when
parallel array operations that jointly deffine new arrays from existing ones can
reuse the memory of the existing arrays, rather than creating new ones.
Polyhedral representations and related dependency inference methods facilitate
that analysis.
In the context of SaC , we demonstrate the impact of this
optimisation using two non-trivial benchmarks evaluated on conventional shared
memory machines and on GPUs, obtaining performance improvements of 2-8 times
for LU Decomposition and of 2-10 times for Needleman-Wunsch, over the same
computations with PRA disabled.
}
}
@inproceedings{Iooss2014impact,
author = {Iooss, Guillaume and Rajopadhye, Sanjay and
Alias, Christophe and Zou, Yun},
title = {CART: Constant Aspect Ratio Tiling},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-iooss.pdf},
abstract = {
Parametric tiling is a well-known transformation which is widely used to improve
locality, parallelism and granularity. However, parametric tiling is also a
non-linear transformation and this prevents polyhedral analysis or further
polyhedral transformation after parametric tiling. It is therefore generally
applied during the code generation phase.
In this paper, we present a method
to remain polyhedral, in a special case of parametric tiling, where all the
dimensions are tiled and all the tile sizes are constant multiples of a single
tile size parameter. We call this Constant Aspect Ratio Tiling . We show how to
mathematically transform a polyhedron and an affine function into their tiled
counterpart, which are the two main operations needed in such transformation.
}
}
@inproceedings{Li2014impact,
author = {Li, Peng and Pouchet, Louis-No{\"e}l and Cong, Jason},
title = {Throughput Optimization for High-Level Synthesis
Using Resource Constraints},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria}
}
@inproceedings{Mullapudi2014impact,
author = {Mullapudi, Ravi Teja and Bondhugula, Uday},
title = {Tiling for Dynamic Scheduling},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-mullapudi.pdf},
abstract = {
Tiling is a key transformation used for coarsening the granularity of
parallelism and improving locality. It is known that current state-of-the-art
compiler approaches for tiling affine loop nests make use of sufficient, i.e.,
conservative conditions for the validity of tiling. These conservative
conditions, which are used for static scheduling, miss tiling schemes for which
the tile schedule is not easy to describe statically. However, the partial
order of the tiles can be expressed using dependence relations which can be
used for dynamic scheduling at runtime. Another set of opportunities are missed
due to the classic reason that finding valid tiling hyperplanes is often harder
than checking whether a given tiling is valid.
Though the conservative
conditions for validity of tiling have worked in practice on a large number of
codes, we show that they fail to find the desired tiling in several cases –
some of these have dependence patterns similar to real world problems and
applications. We then look at ways to improve current techniques to address
this issue. To quantify the potential of the improved techniques, we manually
tile two dynamic programming algorithms – the Floyd-Warshall algorithm, and
Zuker’s RNA secondary structure prediction and report their performance on a
shared memory multicore. Our 3-d tiled dynamically scheduled implementation of
Zuker’s algorithm outperforms an optimized multi-core implementation GTfold by
a factor of 2.38. Such a 3-d tiling was possible only by reasoning with more
precise validity conditions
}
}
@inproceedings{Simbuerger2014impact,
author = {Simb{\"u}rger, Andreas and Gr{\"o}{\ss}liger, Armin},
title = {On the Variety of Static Control Parts in Real-World Programs:
from Affine via Multi-dimensional to Polynomial and Just-in-Time},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-simbuerger.pdf},
abstract = {
The polyhedron model has been used successfully for automatic parallelization
of code regions with loop nests satisfying certain restrictions, so-called
static control parts. A popular implementation of this model is Polly (an
extension of the LLVM compiler), which is able to identify static control parts
in the intermediate representation of the compiler. We look at static control
parts found in 50 real-world programs from different domains. We study whether
these programs are amenable to polyhedral optimization by Polly at compile time
or at run time. We report the number of static control parts with uniform or
authorne dependences found and study extensions of the current implementation
in Polly . We consider extensions which handle multi-dimensional arrays with
parametric sizes and arrays represented by "pointer-to-pointer" constructs. In
addition, we extend the modeling capabilities of Polly to a model using
semi-algebraic sets and real algebra instead of polyhedra and linear algebra.
We do not only consider the number and size of the code regions found but
measure the share of the run time the studied programs spend in the identified
regions for each of the classes of static control parts under study.
}
}
@inproceedings{Verdoolaege2014impact,
author = {Verdoolaege, Sven and Guelton, Serge and
Grosser, Tobias and Cohen, Albert},
title = {Schedule Trees},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-verdoolaege.pdf},
abstract = {
Schedules in the polyhedral model, both those that represent the original
execution order and those produced by scheduling algorithms, naturally have the
form of a tree. Generic schedule representations proposed in the literature
encode this tree structure such that it is only implicitly available.
Following the internal representation of isl , we propose to represent
schedules as explicit trees and further extend the concept by introducing
different kinds of nodes. We compare our schedule trees to other
representations in detail and illustrate how they have been successfully used
to simplify the implementation of a non-trivial polyhedral compiler.
}
}
@inproceedings{Wang2014impact,
author = {Wang, Wei and Cavazos, John and Porterfield, Allan},
title = {Energy Auto-tuning using the Polyhedral Approach},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-wang.pdf},
abstract = {
As the HPC community moves into the exascale computing era, application energy
has become a big concern. Tuning for energy will be essential in the effort to
overcome the limited power envelope. How is tuning for lower energy related to
tuning for faster execution? Understanding that relationship can guide both
performance and energy tuning for exascale. In this paper, a strong
correlation is presented between the two that allows tuning for execution to be
used as a proxy for energy tuning. We also show that polyhedral compilers can
ectively tune a realistic application for both time and energy.
For a large
number of variants of the Polybench programs and LULESH energy consumption is
strongly correlated with total execution time. Optimizations can increase the
power and energy required between variants, but the variant with minimum
execution time also has the lowest energy usage. The polyhedral framework was
also used to optimize a 2D cardiac wave propagation simulation application.
Various loop optimizations including fusion, tiling, vectorization, and
auto-parallelization, achieved a 20% speedup over the baseline OpenMP
implementation, with an equivalent reduction in energy on an Intel Sandy Bridge
system. On an Intel Xeon Phi system, improvements as high as 21% in execution
time and 19% reduction in energy are obtained
}
}
@inproceedings{Yuki2014impact,
author = {Yuki, Tomofumi},
title = {Understanding {PolyBench/C} 3.2 Kernels},
booktitle = {Proceedings of the
4th International Workshop on Polyhedral Compilation Techniques},
editor = {Rajopadhye, Sanjay and Verdoolaege, Sven},
year = 2014,
month = Jan,
address = {Vienna, Austria},
url = {http://impact.gforge.inria.fr/impact2014/papers/impact2014-yuki.pdf},
abstract = {
In this position paper, we argue the need for more rigorous specification of
kernels in the PolyBench/C benchmark suite. Currently, the benchmarks are
mostly specified by their implementation as C code, with a one sentence
description of what the code is supposed to do. While this is suchcient in the
context of automated loop transformation, the lack of precise specification may
have let some questionable behaviors as benchmark kernels remain unnoticed.
As
an extreme example, two kernels in PolyBench/C 3.2 exhibit parametric speed up
with respect to the problem size when its questionable properties are used.
Abusing such properties can provide arbitrary speedup, which can be some factor
of millions, potentially threatening the credibility of any experimental
evaluation using PolyBench.
}
}
@article{ShirakOil,
title={Oil and Water can mix! Experiences with integrating Polyhedral and AST-based Transformations},
author={Shirako, Jun and Sarkar, Vivek},
booktitle = {17th Workshop Compilers for Parallel Computing (CPC)},
year = 2013
}
@inproceedings{Konstantinidis2013parametric,
title={Parametric GPU code generation for affine loop programs},
author={Konstantinidis, Athanasios and Kelly, Paul HJ and Ramanujam, J and Sadayappan, P},
booktitle={International Workshop on Languages and Compilers for Parallel Computing},
pages={136--151},
year={2013},
organization={Springer},
url={https://parasol.tamu.edu/lcpc2013/papers/lcpc2013_submission_21.pdf},
abstract={
Partitioning a parallel computation into finitely sized chunks for effective
mapping onto a parallel machine is a critical concern for source-to-source
compilation. In the context of OpenCL and CUDA, this translates to the definition
of a uniform hyper-rectangular partitioning of the parallel execution space
where each partition is subject to a fine-grained distribution of resources that has
a direct yet hard to estimate impact on performance. This paper develops the first
compilation scheme for generating parametrically tiled codes for affine loop programs
on GPUs which facilitates run-time exploration of partitioning parameters
as a fast and portable way of finding the ones that yield maximum performance.
Our approach is based on a parametric tiling scheme for producing wavefronts
of parallel rectangular partitions of parametric size and a novel runtime system
that manages wavefront execution and local memory usage dynamically through
an inspector-executor mechanism. Our experimental evaluation demonstrates the
effectiveness of our approach for wavefront as well as rectangularly-parallel partitionings.
}
}
@inproceedings{Kong2013polyhedral,
title={When polyhedral transformations meet SIMD code generation},
author={Kong, Martin and Veras, Richard and Stock, Kevin and Franchetti, Franz and Pouchet, Louis-No{\"e}l and Sadayappan, P},
booktitle={Proceedings of the 34th ACM SIGPLAN conference on Programming language design and implementation},
pages={127--138},
year={2013},
organization={ACM},
url={http://users.ece.cmu.edu/~franzf/papers/pldi13.pdf},
abstract={
Data locality and parallelism are critical optimization objectives for
performance on modern multi-core machines. Both coarse-grain parallelism (e.g.,
multi-core) and fine-grain parallelism (e.g., vector SIMD) must be effectively
exploited, but despite decades of progress at both ends, current compiler
optimization schemes that attempt to address data locality and both kinds of
parallelism often fail at one of the three objectives.
We address this problem by proposing a 3-step framework, which aims for
integrated data locality, multi-core parallelism and SIMD execution of
programs. We define the concept of vectorizable codelets, with properties
tailored to achieve effective SIMD code generation for the codelets. We
leverage the power of a modern high-level transformation framework to
restructure a program to expose good ISA-independent vectorizable codelets,
exploiting multi-dimensional data reuse. Then, we generate ISA-specific
customized code for the codelets, using a collection of lower-level
SIMD-focused optimizations.
We demonstrate our approach on a collection of numerical kernels that we
automatically tile, parallelize and vectorize, exhibiting significant
performance improvements over existing compilers.
}
}
@inproceedings{Grosser2013split,
title={Split tiling for {GPU}s: automatic parallelization using trapezoidal tiles},
author={Grosser, Tobias and Cohen, Albert and Kelly, Paul HJ and Ramanujam, J and Sadayappan, P and Verdoolaege, Sven},
booktitle={GPGPU-6},
pages={24--31},
year={2013},
organization={ACM},
pdf = {http://hal.inria.fr/hal-00786812/PDF/paper.pdf},
abstract = {
Tiling is a key technique to enhance data reuse. For computations structured as
one sequential outer "time" loop enclosing a set of parallel inner loops,
tiling only the parallel inner loops may not enable enough data reuse in the
cache. Tiling the inner loops along with the outer time loop enhances data
locality but may require other transformations like loop skewing that inhibit
inter-tile parallelism.
One approach to tiling that enhances data locality without inhibiting
inter-tile parallelism is split tiling, where tiles are subdivided into a
sequence of trapezoidal computation steps. In this paper, we develop an
approach to generate split tiled code for GPUs in the PPCG polyhedral code
generator. We propose a generic algorithm to calculate index-set splitting that
enables us to perform tiling for locality and synchronization avoidance, while
simultaneously maintaining parallelism, without the need for skewing or
redundant computations. Our algorithm performs split tiling for an arbitrary
number of dimensions and without the need to construct any large integer linear
program. The method and its implementation are evaluated on standard stencil
kernels and compared with a state-of-the-art polyhedral compiler and with a
domain-specific stencil compiler, both targeting CUDA GPUs.
}
}
@article{Mehta2013taco,
author = {Mehta, Sanyam and Beeraka, Gautham and Yew, Pen-Chung},
title = {Tile Size Selection Revisited},
journal = {ACM Transactions on Architecture and Code Optimization (TACO)},
issue_date = {December 2013},
volume = {10},
number = {4},
month = dec,
year = {2013},
url = {http://www-users.cs.umn.edu/~sanyam/publications/a35-mehta.pdf},
abstract = {Loop tiling is a widely used loop transformation to enhance data locality and allow data reuse. In the tiled code, however, tiles of different sizes can lead to significant variation in performance. Thus, selection of an optimal tile size is critical to performance of tiled codes.
In the past, tile size selection has been attempted using both static analytical and dynamic empirical (auto-tuning) models. Past work using static models assumed a direct-mapped cache for the purpose of analysis and thus proved to be less robust. On the other hand, the auto-tuning models involve an exhaustive search in a large space of tiled codes. In this article, we propose a new analytical model for tile size selection that leverages the high set associativity in modern caches to minimize conflict misses. Our tile size selection model targets data reuse in multiple levels of cache. In addition, it considers the interaction of tiling with the SIMD unit in modern processors in estimating the optimal tile size. We find that these factors, not considered in previous models, are critical in developing a robust model for tile size selection. We implement our tile size selection model in a polyhedral compiler and test it on 12 benchmark kernels using two different problem sizes. Our model outperforms the previous analytical models that are based on reusing data in a single level of cache and achieves an average performance improvement of 9.7\% and 20.4\%, respectively, over the best square (cubic) tiles for the two problem sizes. In addition, the tile size chosen by our tile size selection algorithm is similar to the best performing size obtained through an extensive search, validating the analytical model underlying the algorithm.}
}
@ARTICLE{Gonzalez2013tpds,
author = "A. Gonzalez-Escribano and Y. Torres and J. Fresno and D. Llanos",
title = "An extensible system for multilevel automatic data partition and mapping",
journal = "IEEE Transactions on Parallel and Distributed Systems ",
year = "2013",
month = "March",
doi="10.1109/TPDS.2013.83",
abstract = "
Automatic data distribution is a key feature to obtain efficient
implementations from abstract and portable parallel codes. We present a highly
efficient and extensible runtime library that integrates techniques for
automatic data partition and mapping. It uses a novel approach to define an
abstract interface and a plug-in system to encapsulate different types of
regular and irregular techniques, helping to generate codes which are
independent of the exact mapping functions selected. Currently, it supports
hierarchical tiling of arrays with dense and stride domains, that allows the
implementation of both data and task parallelism using a SPMD model. It
automatically computes appropriate domain partitions for a selected virtual
topology, mapping them to available processors with static or dynamic
load-balancing techniques. Our library also allows the construction of reusable
communication patterns that efficiently exploit MPI communication capabilities.
The use of our library greatly reduces the complexity of data distribution and
communication, hiding the details of the underlying architecture. The library
can be used as an abstract layer for building generic tiling operations as
well. Our experimental results show that the use of this library allows to
achieve similar performance as carefully-implemented manual versions for
several, well-known parallel kernels and benchmarks in distributed and
multicore systems, and substantially reduces programming effort.
",
url = "http://www.computer.org/csdl/trans/td/preprint/06482561-abs.html"
}
@ARTICLE{Fresno2013js,
author = "J. Fresno and A. Gonzalez-Escribano and D. Llanos",
title = "Extending a Hierarchical Tiling Arrays Library to Support Sparse Data Partitioning",
journal = "The Journal of Supercomputing",
year = "2013",
volume = "64",
number = "1",
pages = "59--68",
month = "April",
doi = "10.1007/s11227-012-0757-y",
abstract = "
Layout methods for dense and sparse data are often seen as two separate
problems with their own particular techniques. However, they are based on the
same basic concepts. This paper studies how to integrate automatic data-layout
and partition techniques for both dense and sparse data structures. In
particular, we show how to include support for sparse matrices or graphs in
Hitmap, a library for hierarchical tiling and automatic mapping of arrays. The
paper shows that it is possible to offer a unique interface to work with both
dense and sparse data structures. Thus, the programmer can use a single and
homogeneous programming style, reducing the development effort and simplifying
the use of sparse data structures in parallel computations. Our experimental
evaluation shows that this integration of techniques can be effectively done
without compromising performance.
",
url = "http://link.springer.com/article/10.1007%2Fs11227-012-0757-y"
}
@INPROCEEDINGS{Torres2013pdpta,
author = "Y. Torres and A. Gonzalez-Escribano and D. Llanos",
title = "Automatic Run-time Mapping of Polyhedral Computations to Heterogeneous Devices with Memory-size Restrictions",
booktitle = "PDPTA'13 - The 2013 International Conference on Parallel and Distributed Processing Techniques and Applications",
year = "2013",
volume = "2",
month = "July",
publisher = "CSREA Press",
isbn = "1-60132-256-9, 1-60132-257-7 (1-60132-258-5)",
abstract = "
Tools that aim to automatically map parallel computations to heterogeneous and
hierarchical systems try to divide the whole computation in parts with
computational loads adjusted to the capabilities of the target devices. Some
parts are executed in node cores, while others are executed in accelerator
devices. Each part requires one or more data-structure pieces that should be
allocated in the device memory during the computation.
In this paper we present a model that allows such automatic mapping tools to
transparently assign computations to heterogeneous devices with different
memory size restrictions. The model requires the programmer to specify the
access patterns of the computation threads in a simple abstract form. This
information is used at run-time to determine the second-level partition of the
computation assigned to a device, ensuring that the data pieces required by
each sub-part fit in the target device memory, and that the number of kernels
launched is minimal. We present experimental results with a prototype
implementation of the model that works for regular polyhedral expressions. We
show how it works for different example applications and access patterns,
transparently executing big computations in devices with different memory size
restrictions.
",
url = {http://www.infor.uva.es/~diego/docs/torres13.pdf}
}
@techreport{Grosser2013Promises,
hal_id = {hal-00848691},
url = {http://hal.inria.fr/hal-00848691},
title = {{The Promises of Hybrid Hexagonal/Classical Tiling for GPU}},
author = {Grosser, Tobias and Verdoolaege, Sven and Cohen, Albert and Sadayappan, P.},
abstract = {
Time-tiling is necessary for efficient execution of iterative
stencil computations. But the usual hyper-rectangular tiles cannot
be used because of positive/negative dependence distances along the
stencil's spatial dimensions. Several prior efforts have addressed
this issue. However, known techniques trade enhanced data reuse
for other causes of inefficiency, such as unbalanced parallelism,
redundant computations, or increased control flow overhead incompatible
with efficient GPU execution. We explore a new path to maximize the
effectivness of time-tiling on iterative stencil computations. Our
approach is particularly well suited for GPUs. It does not require
any redundant computations, it favors coalesced global-memory access
and data reuse in shared-memory/cache, avoids thread divergence, and
extracts a high degree of parallelism. We introduce hybrid hexagonal
tiling, combining hexagonal tile shapes along the time (sequential)
dimension and one spatial dimension, with classical tiling for other
spatial dimensions. An hexagonal tile shape simultaneously enable
parallel tile execution and reuse along the time dimension. Experimental
results demonstrate significant performance improvements over existing
stencil compilers.
},
affiliation = {PARKAS - INRIA Paris-Rocquencourt, Department of Computer
Science and Engineering - CSE},
type = {Rapport de recherche},
institution = {INRIA},
number = {RR-8339},
year = {2013},
month = Jul,
pdf = {http://hal.inria.fr/hal-00848691/PDF/RR-8339.pdf},
}
@article{Verdoolaege2013PPCG,
title = {Polyhedral parallel code generation for {CUDA}},
author = {Verdoolaege, Sven and Juega, Juan Carlos and Cohen, Albert and
G\'{o}mez, Jos{\'e} Ignacio and Tenllado, Christian and
Catthoor, Francky},
journal = {ACM Transactions on Architecture and Code Optimization},
issue_date = {January 2013},
volume = {9},
number = {4},
month = jan,
year = {2013},
issn = {1544-3566},
pages = {54:1--54:23},
doi = {10.1145/2400682.2400713},
acmid = {2400713},
publisher = {ACM},
address = {New York, NY, USA},
}
@proceedings{impact2013,
title = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://nbn-resolving.de/urn:nbn:de:bvb:739-opus-26930",
note = "http://impact.gforge.inria.fr/impact2013/"
}
@inproceedings{feld.2013.impact,
author = "Feld, Dustin and Soddemann, Thomas and J{\"u}nger, Michael and Mallach, Sven",
title = "{F}acilitate {SIMD}-{C}ode-{G}eneration in the {P}olyhedral {M}odel by {H}ardware-aware {A}utomatic {C}ode-{T}ransformation",
pages = "45--54",
booktitle = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://impact.gforge.inria.fr/impact2013/papers/impact2013_facilitate_simd_code_generation.pdf",
abstract = {
Although Single Instruction Multiple Data (SIMD) units are available in general
purpose processors already since the 1990s, state-of-the-art compilers are
often still not capable to fully exploit them, i.e., they may miss to achieve
the best possible performance.
We present a new hardware-aware and adaptive
loop tiling approach that is based on polyhedral transformations and explicitly
dedicated to improve on auto-vectorization. It is an extension to the tiling
algorithm implemented within the PluTo framework [4, 5]. In its default
setting, PluTo uses static tile sizes and is already capable to enable the use
of SIMD units but not primarily targeted to optimize it. We experimented with
differnt tile sizes and found a strong relationship between their choice, cache
size parameters and performance. Based on this, we designed an adaptive
procedure that speciffically tiles vectorizable loops with dynamically
calculated sizes. The blocking is automatically fitted to the amount of data
read in loop iterations, the available SIMD units and the cache sizes. The
adaptive parts are built upon straightforward calculations that are
experimentally verified and evaluated. Our results show significant
improvements in the number of instructions vectorized, cache miss rates and,
finally, running times.
}
}
@inproceedings{yuki.2013.impact,
author = "Yuki, Tomofumi and Rajopadhye, Sanjay",
title = "{M}emory {A}llocations for {T}iled {U}niform {D}ependence {P}rograms",
pages = "13--22",
booktitle = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://impact.gforge.inria.fr/impact2013/papers/impact2013_memory_allocations_for_tiled_uniform_dependence_programs.pdf",
abstract = {
In this paper, we develop a series of extensions to schedule-independent
storage mapping using Quasi-Universal Occupancy Vectors (QUOVs) targeting tiled
execution of polyhedral programs. By quasi-universality, we mean that we
restrict the \universe" of the schedule to those that correspond to tiling.
This provides the following benefits: (i) the shortest QUOVs may be shorter
than the fully universal ones, (ii) the shortest QUOVs can be found without any
search, and (iii) multi-statement programs can be handled. The resulting
storage mapping is valid for tiled execution by any tile size.
}
}
@inproceedings{fassi.2013.impact,
author = "Fassi, Im{\`e}n and Clauss, Philippe and Kuhn, Matthieu and Slama, Yosr",
title = "{M}ultifor for {M}ulticore",
pages = "37--44",
booktitle = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://impact.gforge.inria.fr/impact2013/papers/impact2013_multifor_for_multicore.pdf",
abstract = {
We propose a new programming control structure called "multifor", allowing to
take advantage of parallelization models that were not naturally attainable
with the polytope model before. In a multifor-loop, several loops whose bodies
are run simultaneously can be defined. Respective iteration domains are mapped
onto each other according to a run frequency - the grain - and a relative
position - the offset -. Execution models like dataflow, stencil computations or
MapReduce can be represented onto one referential iteration domain, while still
exhibiting traditional polyhedral code analysis and transformation
opportunities. Moreover, this construct provides ways to naturally exploit
hybrid parallelization models, thus significantly improving parallelization
opportunities in the multicore era. Traditional polyhedral software tools are
used to generate the corresponding code. Additionally, a promising perspective
related to nonlinear mapping of iteration spaces is also presented, yielding to
run a loop nest inside any other one by solving the problem of inverting
"ranking Ehrhart polynomials".
}
}
@inproceedings{verdoolaege.2013.impact,
author = "Verdoolaege, Sven and Nikolov, Hristo and Stefanov, Todor",
title = "{O}n {D}emand {P}arametric {A}rray {D}ataflow {A}nalysis",
pages = "23--36",
booktitle = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://impact.gforge.inria.fr/impact2013/papers/impact2013_on_demand_parametric_array_dataflow_analysis.pdf",
abstract = {
We present a novel approach for exact array data ow analysis in the presence of
constructs that are not static authorne. The approach is similar to that of
fuzzy array data ow analysis in that it also introduces parameters that
represent information that is only available at run-time, but the parameters
have a diㄦent meaning and are analyzed before they are introduced. The
approach was motivated by our work on process networks, but should be generally
useful since fewer parameters are introduced on larger inputs. We include some
preliminary experimental results.
}
}
@inproceedings{wonnacott.2013.impact,
author = {Wonnacott, David G. and Mills Strout, Michelle},
title = "{O}n the {S}calability of {L}oop {T}iling {T}echniques",
pages = "3--11",
booktitle = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://impact.gforge.inria.fr/impact2013/papers/impact2013_on_the_scalability_of_loop_tiling_techniques.pdf",
abstract = {
The Polyhedral model has proven to be a valuable tool for improving memory
locality and exploiting parallelism for optimizing dense array codes. This
model is expressive enough to describe transformations of imperfectly nested
loops, and to capture a variety of program transformations, including many
approaches to loop tiling. Tools such as the highly successful PLuTo automatic
parallelizer have provided empirical confirmation of the success of
polyhedral-based optimization, through experiments in which a number of
benchmarks have been executed on machines with small- to medium-scale
parallelism.
In anticipation of ever higher degrees of parallelism, we have
explored the impact of various loop tiling strategies on the asymptotic degree
of available parallelism. In our analysis, we consider “weak scaling” as
described by Gustafson, i.e., in which the data set size grows linearly with
the number of processors available. Some, but not all, of the approaches to
tiling provide weak scaling. In particular, the tiling currently performed by
PLuTo does not scale in this sense.
In this article, we review approaches to
loop tiling in the published literature, focusing on both scalability and
implementation status. We find that fully scalable tilings are not available in
general-purpose tools, and call upon the polyhedral compilation community to
focus on questions of asymptotic scalability. Finally, we identify ongoing work
that may resolve this issue.
}
}
@inproceedings{doerfert.2013.impact,
author = "Doerfert, Johannes and Hammacher, Clemens and Streit, Kevin and Hack, Sebastian",
title = "{SP}olly: {S}peculative {O}ptimizations in the {P}olyhedral {M}odel",
pages = "55--60",
booktitle = "{P}roceedings of the 3rd {I}nternational {W}orkshop on {P}olyhedral {C}ompilation {T}echniques",
editor = "Gr{\"o}{\ss}linger, Armin and Pouchet, Louis-No{\"e}l",
year = 2013,
month = Jan,
address = "Berlin, Germany",
url = "http://impact.gforge.inria.fr/impact2013/papers/impact2013_spolly.pdf",
abstract = {
The polyhedral model is only applicable to code regions that form static