% vim: ts=4 sw=4 expandtab ft=tex
SECTION 1 Automatic Parallelism for Mercury
SUBSECTION Introduction
SLIDE Motivation --- Multicore computing
CPU clock speeds have traditionally increased exponentially.
However, due to physical limitations, this trend no longer continues.
Manufacturers now ship multicore processors to continue to deliver
better-performing processors without increasing clock speeds.
Programmers who want to take advantage of the extra cores on these
processors must write parallel programs.
SLIDE Motivation --- Threaded programming
Threads are the most common method of parallel programming.
When using threads,
programmers use critical sections to protect shared resources from
concurrent access.
Critical sections are normally protected by locks, but it is easy to make
errors when using locks.
\begin{itemize}
\item Forgetting to use locks can put the program into an inconsistent state,
corrupt memory and crash the program.
\item Using multiple locks in different orders in different places
can lead to deadlocks.
\item Critical sections are not composable: nesting critical sections
may acquire locks in different orders in different places.
\item Misplacing lock operations can lead to critical sections
that are too wide (causing poor performance)
or too narrow (causing data corruption and crashes).
\end{itemize}
SLIDE Automatic parallelism
A good compiler performs many optimisations on behalf of the
programmer.
Programmers rarely think about:
\begin{itemize}
\item register allocation,
\item inlining,
\item simplifications such as constant propagation \& strength
reduction.
\end{itemize}
We believe that parallelisation is just another optimisation,
and that it is best left to the compiler,
so that, like any other optimisation, we need not think about it.
SLIDE About Mercury
\begin{itemize}
\item Mercury is a \textbf{pure} logic/functional language
designed to support the creation of large, reliable, efficient programs.
\item It has a syntax similar to Prolog's,
but its operational semantics are very different.
\item It is strongly typed using a Hindley-Milner type system.
\item It also has mode and determinism systems.
% making many bugs impossible.
\end{itemize}
\begin{verbatim}
:- pred map(pred(T, U), list(T), list(U)).
:- mode map(pred(in, out) is det, in, out) is det.
map(_, [], []).
map(P, [X | Xs], [Y | Ys]) :-
    P(X, Y),
    map(P, Xs, Ys).
\end{verbatim}
SLIDE Effects in Mercury
In Mercury, all effects are explicit,
which helps programmers as well as the compiler.
\begin{verbatim}
main(IO0, IO) :-
    write_string("Hello ", IO0, IO1),
    write_string("world!\n", IO1, IO).
\end{verbatim}
The I/O state represents
the state of the world outside of this process.
Mercury ensures that only one version
is alive at any given time.
This program has three versions of that state:
\begin{description}
\item[\code{IO0}]
represents the state before the program is run
\item[\code{IO1}]
represents the state after printing \verb|Hello |
\item[\code{IO}]
represents the state after printing \verb|world!\n|.
\end{description}
SLIDE Data dependencies
\begin{verbatim}
qsort([], []).
qsort([Pivot | Tail], Sorted) :-
    partition(Pivot, Tail, Bigs0, Smalls0),  %1
    qsort(Bigs0, Bigs),                      %2
    qsort(Smalls0, Smalls),                  %3
    Sorted = Smalls ++ [Pivot | Bigs].       %4
\end{verbatim}
\parbox{0.5\textwidth}{
\epsfbox{dep_graph.eps}
}~\parbox{0.5\textwidth}{
\begin{itemize}
\item Steps 2 and 3 are independent.
\item \red{This is easy to prove because
there are never any \emph{side} effects.}
\item They may be executed in parallel.
\end{itemize}
}
SLIDE Explicit Parallelism
Mercury allows explicit, deterministic parallelism
via the parallel conjunction operator \red{\&}.
\vspace{1em}
\begin{tabular}{l}
\code{qsort([], []).} \\
\code{qsort([Pivot $|$ Tail], Sorted) :-} \\
\code{~~~~partition(Pivot, Tail, Bigs0, Smalls0), } \\
\code{~~~~(} \\
\code{~~~~~~~~qsort(Bigs0, Bigs) } \\
\code{~~~~\red{\&}} \\
\code{~~~~~~~~qsort(Smalls0, Smalls)} \\
\code{~~~~),} \\
\code{~~~~Sorted = Smalls ++ [Pivot $|$ Bigs].}
\end{tabular}
SLIDE Why make this automatic?
We might expect parallelism to yield a speedup in the quicksort example,
but it does not.
The above parallelisation creates $N$ parallel tasks for a list of length $N$.
Most of these tasks are trivial and the overheads of managing them
slow the program down.
Programmers rarely understand the performance of their programs,
even when they think they do.
SUBSECTION Runtime system changes
SLIDE Runtime system changes
Before we can automatically parallelise programs effectively we need to
be able to manually parallelise them effectively.
This meant making several improvements to the runtime system.
The RTS has several kinds of objects used in parallel Mercury programs
(a rough sketch in C follows the list).
\begin{description}
\item[Engines] represent abstract CPUs; the RTS creates as many
engines as there are processors in the system, and controls each one
from a POSIX thread.
\item[Contexts] represent a computation in progress;
they contain the stacks for that computation,
and a copy of the engine's registers when the context is suspended.
\item[Sparks] are very small structures representing computations
that have not yet been started,
and therefore have no allocated stack space.
\end{description}
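For illustration, the relationships between these objects might be sketched
in C roughly as follows; all field names here are invented for exposition,
not the actual Mercury RTS definitions.
\small
\begin{verbatim}
#include <pthread.h>

/* Illustrative only: field names are guesses, not the RTS's own. */
typedef struct {
    void *resume_code;   /* where the computation starts executing    */
    void *sync_term;     /* bookkeeping for its parallel conjunction  */
} Spark;                 /* tiny: no stack space allocated yet        */

typedef struct {
    char *det_stack;     /* the stacks of the computation in progress */
    char *nondet_stack;
    void *saved_regs;    /* engine registers, saved while suspended   */
} Context;

typedef struct {
    pthread_t thread;    /* one POSIX thread per processor            */
    Context  *current;   /* context this engine is running, if any    */
} Engine;
\end{verbatim}
\normalsize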
SLIDE Work stealing
Peter Wang introduced sparks and a partial work stealing implementation.
Work stealing reduces contention on a global queue of work by allowing
each context to maintain its own work stack.
Contexts can:
\begin{itemize}
\item Push a spark onto their own stack.
\item Pop a spark off their own stack.
\item Steal a spark from the cold end of another's stack.
\end{itemize}
All of these operations are lock-free;
the first two are wait-free and use no atomic
operations.
The stealing operation uses an atomic compare-and-swap that may
busy-wait.
Credit: 80\% Peter Wang, 20\% myself, excluding the queue data structure.
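For illustration, here is a minimal Chase-Lev-style work stealing deque in
C11. This is a textbook sketch, not the Mercury RTS code, and it differs in
one detail from the description above: its pop needs a compare-and-swap when
taking the final spark.
\small
\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>

#define DEQUE_SIZE 1024          /* sketch only: overflow unhandled */
typedef struct Spark Spark;      /* as sketched earlier */

typedef struct {
    atomic_long top;             /* cold end: thieves steal here    */
    atomic_long bottom;          /* hot end: owner pushes and pops  */
    Spark *buf[DEQUE_SIZE];
} SparkDeque;

/* Owner only: push a spark onto the hot end. */
void spark_push(SparkDeque *q, Spark *s) {
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed);
    q->buf[b % DEQUE_SIZE] = s;
    atomic_store_explicit(&q->bottom, b + 1, memory_order_release);
}

/* Owner only: pop a spark from the hot end. */
bool spark_pop(SparkDeque *q, Spark **out) {
    long b = atomic_load_explicit(&q->bottom, memory_order_relaxed) - 1;
    atomic_store_explicit(&q->bottom, b, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    long t = atomic_load_explicit(&q->top, memory_order_relaxed);
    if (t > b) {                 /* deque was already empty */
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        return false;
    }
    *out = q->buf[b % DEQUE_SIZE];
    if (t == b) {                /* last spark: race with thieves */
        bool won = atomic_compare_exchange_strong(&q->top, &t, t + 1);
        atomic_store_explicit(&q->bottom, b + 1, memory_order_relaxed);
        return won;
    }
    return true;
}

/* Any other engine: steal a spark from the cold end. */
bool spark_steal(SparkDeque *q, Spark **out) {
    long t = atomic_load_explicit(&q->top, memory_order_acquire);
    atomic_thread_fence(memory_order_seq_cst);
    long b = atomic_load_explicit(&q->bottom, memory_order_acquire);
    if (t >= b)
        return false;            /* nothing to steal */
    *out = q->buf[t % DEQUE_SIZE];
    return atomic_compare_exchange_strong(&q->top, &t, t + 1);
}
\end{verbatim}
\normalsize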
SLIDE Dependent Parallelism
Mercury can handle dependencies between parallel conjuncts.
\emph{Shared variables} are produced in one conjunct and consumed in
another.
\begin{tabular}{l}
\code{map\_foldl(\_, \_, [], Acc, Acc).} \\
\code{map\_foldl(M, F, [X $|$ Xs], Acc0, Acc) :-} \\
\code{~~~~(} \\
\code{~~~~~~~~M(X, Y),} \\
\code{~~~~~~~~F(Y, Acc0, \red{Acc1})} \\
\code{~~~~) \&} \\
\code{~~~~map\_foldl(M, F, Xs, \red{Acc1}, Acc).} \\
\end{tabular}
\code{\red{Acc1}} will be replaced with a \emph{future}.
If the second conjunct attempts to read from the future before the first
conjunct writes to it,
its context will be blocked, and resumed once the first conjunct has placed a
value into the future.
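A future can be sketched with a mutex and a condition variable.
This is a simplification: the real runtime suspends the lightweight
\emph{context} rather than the OS thread, and avoids locking when the
value is already available.
\small
\begin{verbatim}
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  signalled;
    bool            ready;
    void           *value;
} Future;

/* Called by the producing conjunct when it binds the shared variable. */
void future_signal(Future *f, void *value) {
    pthread_mutex_lock(&f->lock);
    f->value = value;
    f->ready = true;
    pthread_cond_broadcast(&f->signalled);
    pthread_mutex_unlock(&f->lock);
}

/* Called by a consuming conjunct: blocks until the value arrives. */
void *future_wait(Future *f) {
    pthread_mutex_lock(&f->lock);
    while (!f->ready)
        pthread_cond_wait(&f->signalled, &f->lock);
    void *value = f->value;
    pthread_mutex_unlock(&f->lock);
    return value;
}
\end{verbatim}
\normalsize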
SLIDE Right-recursive parallel code
Mode correctness requires that all producers of variables occur before
consumers in conjunctions.
Programmers are encouraged to make their code tail-recursive.
This means that the recursive call is placed last in a conjunction so that it
can become a tail call.
A parallel conjunction $G_1~\&~G_2~\&~\ldots~\&~G_N$ will be executed by spawning
off $G_2~\&~\ldots~\&~G_N$, then executing $G_1$ immediately.
In the common case, where the forked-off task is not taken up by another
engine, a dependency between the tasks does not require a context switch.
However, if the forked-off task was taken by another engine,
the original context must be suspended until that task completes.
When the last conjunct is a tail call, it often takes far longer to execute
than the other conjuncts,
causing the original context to be blocked for a long time.
SLIDE Decomposing a parallel conjunction
Pseudo compiler output:
\code{case\_label:} \\
\code{~~~~SyncTerm st;} \\
\code{~~~~\red{init\_sync\_term}(\&st);} \\
\code{~~~~\red{spawn\_off}(spawn\_off\_label, \&st);} \\
\code{~~~~M(X, Y);} \\
\code{~~~~F(Y, Acc0, Acc1);} \\
\code{~~~~\red{join\_and\_continue}(resume\_label, \&st);} \\
\code{spawn\_off\_label:} \\
\code{~~~~map\_foldl(M, F, Xs, Acc1, Acc);} \\
\code{~~~~\red{join\_and\_continue}(resume\_label, \&st);} \\
\code{resume\_label:} \\
\code{~~~~return;} \\
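The barrier behind \code{join\_and\_continue} can be sketched as an atomic
countdown in C. This is a simplification: the real barrier must also decide
whether the arriving task can continue at \code{resume\_label} directly or
must suspend or look for other work.
\begin{verbatim}
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_int outstanding;   /* conjuncts yet to reach the barrier */
} SyncTerm;

void init_sync_term(SyncTerm *st, int num_conjuncts) {
    atomic_init(&st->outstanding, num_conjuncts);
}

/* Returns true for the conjunct that arrives last; only that one
   proceeds to resume_label.  Earlier arrivals find other work. */
bool join_and_continue(SyncTerm *st) {
    return atomic_fetch_sub(&st->outstanding, 1) == 1;
}
\end{verbatim}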
SLIDE Execution of right-recursive parallel code
\vspace{-2mm}
Blocking the original context can create a pathological worst-case behaviour:
the same behaviour will occur at each level of recursion.
This will cause it to use a number of contexts \textbf{linear} in the depth of
the recursion.
\vspace{-2mm}
\begin{center}
\epsfbox{linear_context_usage.eps}
\end{center}
\vspace{-4mm}
If each context contains 4MB of stack space,
a loop of only 256 iterations will consume 1GB!
SLIDE Workaround --- Reorder conjuncts
The compiler understands recursion including mutual recursion.
Therefore,
it can move conjuncts with recursive calls to the left of those without.
This avoids the problem above.
\begin{verbatim}
map(P, [X | Xs], [Y | Ys]) :-
    map(P, Xs, Ys) &
    P(X, Y).
\end{verbatim}
However, where dependencies exist, conjuncts cannot be moved without violating
the program's mode-correctness.
Therefore, this solution works only in the rare cases where
parallel conjunctions are independent.
SLIDE Workaround --- The max-contexts limit
To reduce the impact of this,
Peter added a maximum limit on the number of contexts that could be in memory
at the same time.
When the limit is reached, no new context can be created;
the spawned-off computation is instead executed sequentially.
This limit trades parallelism for lower memory usage.
SLIDE Management of work queues
Work queues were originally owned by contexts:
if there were hundreds of active contexts, then there were hundreds of work
queues to steal from, most of them usually empty.
This made work stealing slower.
Furthermore, as contexts were created and destroyed, the number of work queues
changed, which required a global lock to manage the set of work queues.
Work queues are now owned by engines, which means that there is a small,
fixed number of queues.
This greatly simplifies the work stealing code.
More importantly, it is faster in the pathological right-recursive case.
SUBSECTION Automatic parallelism
SLIDE Our approach
\begin{itemize}
\item Profile the program to find the expensive parts.
\item Analyse the program to determine what parts \emph{can} be run in
parallel.
\item Select only the parts that can be parallelised \emph{profitably}.
This may involve trial and error when done by hand.
% Originally the above item was:
%\item Determine which parts it is \emph{profitable} run in parallel.
% This may involve trial and error when done by hand.
\item Continue introducing parallel evaluation until all processors are
fully utilised or there is no profitable parallelism left.
\end{itemize}
\begin{center}
\epsfbox{prof_fb_talk.eps}
\end{center}
SLIDE Deep profiler call graphs
The deep profiler records profiling data not just for a call,
but for a call-chain.
The chain is an alternating list of call and call-site objects.
\vspace{-2mm}
\begin{center}
\epsfbox{call_tree.eps}
\end{center}
\vspace{-2mm}
The two calls to $f$ are recorded separately, as the first one
goes via $g$ and the second does not.
This even works with higher-order calls, such as the one in \code{map}.
SLIDE Deep profiler call graphs
Because recursive calls are included in this chain,
multiple calls to the same procedure from a single procedure will have their
profiling data recorded separately.
\vspace{-2mm}
\begin{center}
\epsfbox{call_tree2.eps}
\end{center}
\vspace{-2mm}
If $g$ contains two call sites that both call $f$,
then these also record separate profiling data for $f$ and its descendants.
SLIDE Deep profiler call graphs
This is very powerful and helps to collect a lot of very useful data
about a program's performance.
However, it does not separate recursions or mutual recursions.
The call tree:
\vspace{-2mm}
\begin{center}
\epsfbox{call_tree3.eps}
\end{center}
\vspace{-2mm}
is actually recorded as:
\vspace{-2mm}
\begin{center}
\epsfbox{call_tree4.eps}
\end{center}
\vspace{-2mm}
This creates a tree of strongly connected components (SCCs).
SLIDE Recursion types
\begin{tabular}{lrr}
\textbf{Recursion type} &
\textbf{\% of procedures} &
\textbf{\% of runtime}\\
\hline
Not recursive & 78.04\% & 57.53\% \\
Single recursion & 12.91\% & 28.96\% \\
Mutual recursion (3 procs) & 2.52\% & 4.78\% \\
Mutual recursion (2 procs) & 1.09\% & 2.20\% \\
Divide \& Conquer & 0.51\% & 3.27\% \\
Mutual recursion (4 procs) & 0.29\% & 1.39\% \\
Mutual recursion (5 procs) & 0.28\% & 1.10\% \\
\hline
Recursion with levels 0, 2, 3, 4 & 0.15\% & 1.04\% \\
\hline
Unclassified/unknown & 3.09\% & 0.26\% \\
\end{tabular}
Taken from an analysis of the Mercury compiler.
Unclassified/unknown recursions include builtin code and code with
determinisms the analysis cannot handle.
Most recursions with multiple levels such as 0, 2, 3, 4 are in the
\code{tree234} module of the standard library.
SLIDE Cost of recursive calls
\vspace{-2mm}
For a singly recursive procedure we know:
\begin{tabular}{ll}
no. of outside calls & no. of inside (recursive) calls \\
cost of base case & cost of recursive case (minus recursive call) \\
cost of shared code
\end{tabular}
We can calculate:
\vspace{-4mm}
\begin{eqnarray*}
MaxDepth & = & \frac{NumInsideCalls}{NumOutsideCalls} \\
AvgDepth & = & \frac{MaxDepth}{2} \\
RecCost(Depth) & = & Shared + Base + Depth ( Shared + RecBranch )
\end{eqnarray*}
\vspace{-4mm}
Depending on the situation, we can calculate the cost of a recursive
call at any depth.
Different equations can be derived for each recursion type.
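A worked example with invented numbers: suppose a singly recursive procedure
receives 100 calls from outside its SCC and 4,900 recursive calls, with
$Base = 10$, $Shared = 5$ and $RecBranch = 100$ (all in call sequence counts).
Then:
\begin{eqnarray*}
MaxDepth & = & \frac{4900}{100} = 49
    \qquad AvgDepth ~=~ \frac{49}{2} = 24.5 \\
RecCost(24.5) & = & 5 + 10 + 24.5 \times (5 + 100) = 2587.5
\end{eqnarray*}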
SLIDE Finding parallelisation candidates
The deep profiler's call graph is a tree of strongly connected components
(SCCs).
Each SCC is a group of mutually recursive calls.
The automatic parallelism analysis proceeds as follows
(a sketch in C follows the list):
\begin{itemize}
\item Recurse depth-first down the call graph from \code{main/2}.
\item Analyse each procedure of each SCC, identifying conjunctions that have
two or more goals whose cost is greater than a configurable
threshold.
\item Stop recursing into children if either:
\begin{itemize}
\item the child's cost is below another configurable threshold; or
\item there is no free processor to exploit any parallelism the child
may have.
\end{itemize}
\end{itemize}
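A sketch of this traversal in C; the types, names and threshold are invented
for exposition, and the real analysis works over the deep profiler's data
structures.
\small
\begin{verbatim}
#include <stddef.h>

typedef struct Scc {
    double        cost;          /* total cost in call sequence counts */
    size_t        num_children;
    struct Scc  **children;
} Scc;

#define CHILD_COST_THRESHOLD 2000.0     /* configurable in reality */

static void find_costly_conjunctions(Scc *scc) {
    /* For each procedure in the SCC, identify conjunctions with two
       or more goals above the cost threshold (omitted). */
    (void)scc;
}

/* Depth-first traversal from the SCC containing main/2. */
void analyse_scc(Scc *scc, double free_processors) {
    find_costly_conjunctions(scc);
    for (size_t i = 0; i < scc->num_children; i++) {
        Scc *child = scc->children[i];
        if (child->cost < CHILD_COST_THRESHOLD)
            continue;            /* too cheap to be worth exploring   */
        if (free_processors < 1.0)
            continue;            /* no spare core for its parallelism */
        analyse_scc(child, free_processors - 1.0);
    }
}
\end{verbatim}
\normalsize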
SUBSECTION Overlap analysis
SLIDE Overlap
Dependencies between variables can greatly affect the amount of
parallelism that can be gained.
\begin{verbatim}
map_foldl(_, _, [], Acc, Acc).
map_foldl(M, F, [X | Xs], Acc0, Acc) :-
    M(X, Y),
    F(Y, Acc0, Acc1),
    map_foldl(M, F, Xs, Acc1, Acc).
\end{verbatim}
\code{F} needs \code{Y} from \code{M},
and the recursive call needs \code{Acc1} from \code{F}.
There are three ways to parallelise the conjunction in \code{map\_foldl}.
SLIDE Parallelising \code{map\_foldl}
\code{Y} is produced at the very end of \code{M}
and consumed at the very start of \code{F},
so the execution of these two calls cannot overlap.
\code{Acc1} is produced at the end of \code{F},
but it is \emph{not} consumed at the start of the recursive call,
so some overlap \emph{is} possible.
\begin{verbatim}
map_foldl(_, _, [], Acc, Acc).
map_foldl(M, F, [X | Xs], Acc0, Acc) :-
    (
        M(X, Y),
        F(Y, Acc0, Acc1)
    ) &
    map_foldl(M, F, Xs, Acc1, Acc).
\end{verbatim}
SLIDE \code{map\_foldl} overlap
The recursive call needs \code{Acc1} only when it calls \code{F}.
The calls to \code{M} can be executed in parallel.
\vspace{1em}
\epsfbox{mapfoldl-overlap.eps}
% SLIDE \code{map\_foldl} overlap
%
% The more expensive \code{M} is relative to \code{F},
% the bigger the overall speedup.
%
% \vspace{1em}
% \epsfbox{mapfoldl-overlap2.eps}
SLIDE Simple overlap example
\vspace{-1em}
We conceptually split each task into sections,
each section ending with the production or consumption of a shared variable.
$pR$ and $qR$ denote the time that the tasks take to compute any non-shared
variables needed after the conjunction.
\vspace{-1mm}
\begin{center}
\epsfbox{overlap4.eps}
\end{center}
SLIDE Pessimal overlap example
If variables are produced or consumed at different times within goals,
then the overlap can vary greatly.
\vspace{-1mm}
\begin{center}
\epsfbox{overlap6.eps}
\end{center}
SLIDE Overlap with more than one dependency
We calculate the execution time of \code{q} by iterating over its sections.
In this case, that means iterating over the variables it consumes
in the order that it consumes them.
\begin{center}
\epsfbox{overlap2-swap.eps}
\end{center}
SLIDE Overlap algorithm --- Loop over variables
\vspace{-2mm}
\small
\begin{verbatim}
find_conjunct_par_time(Goal, SeqTime,
        inout CurParTime, inout ProdTimeMap):
    ProdConsList := get_sorted_var_uses(Goal)
    CurSeqTime := 0
    forall (Var_j, Time_j) in ProdConsList:
        Duration_j := Time_j - CurSeqTime
        CurSeqTime := CurSeqTime + Duration_j
        if Goal produces Var_j:
            CurParTime := CurParTime + Duration_j + ...
            ProdTimeMap[Var_j] := CurParTime
        else Goal must consume Var_j:
            ParWantTime := CurParTime + Duration_j + ...
            CurParTime := max(ParWantTime, ProdTimeMap[Var_j]) + ...
    DurationRest := SeqTime - CurSeqTime
    CurParTime := CurParTime + DurationRest
\end{verbatim}
\normalsize
The \verb|...|s represent estimates of overheads.
SLIDE Overlap of more than two tasks
A task that consumes a variable can occur only on the \emph{right}
of the task that generates its value.
Therefore, we process conjuncts from left to right.
\vspace{1em}
\epsfbox{overlap3.eps}
SLIDE Overlap algorithm --- Loop over conjuncts
\begin{verbatim}
find_par_time(Conjs, SeqTimes) returns TotalParTime:
    N := length(Conjs)
    ProdTimeMap := empty
    TotalParTime := 0
    for i in 1 to N:
        CurParTime := 0 + ...
        find_conjunct_par_time(Conjs[i], SeqTimes[i],
            CurParTime, ProdTimeMap)
        TotalParTime := max(TotalParTime, CurParTime)
    TotalParTime := TotalParTime + ...
\end{verbatim}
The \verb|...|s represent estimates of overheads.
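For concreteness, the two loops above can be rendered as real code. This is a
hypothetical C version: the constant \code{overhead} stands in for all the
\verb|...| terms, and an array indexed by variable number stands in for
\code{ProdTimeMap}.
\small
\begin{verbatim}
#include <stddef.h>

#define MAX_VARS 64

typedef struct {                /* one production or consumption       */
    int    var;                 /* which shared variable               */
    double time;                /* when it happens, in sequential time */
    int    is_production;
} VarUse;

typedef struct {
    double  seq_time;           /* sequential cost of the conjunct     */
    VarUse *uses;               /* uses, sorted by time                */
    size_t  num_uses;
} Conjunct;

static const double overhead = 1.0;   /* stand-in for the "..." terms  */

/* Parallel completion time of one conjunct; prod_time[] maps each
   shared variable to the time its value becomes available. */
static double conjunct_par_time(const Conjunct *c, double cur_par_time,
                                double prod_time[MAX_VARS]) {
    double cur_seq_time = 0.0;
    for (size_t j = 0; j < c->num_uses; j++) {
        double duration = c->uses[j].time - cur_seq_time;
        cur_seq_time += duration;
        if (c->uses[j].is_production) {
            cur_par_time += duration + overhead;
            prod_time[c->uses[j].var] = cur_par_time;
        } else {                /* consumption: may block on a future  */
            double want = cur_par_time + duration + overhead;
            double have = prod_time[c->uses[j].var];
            cur_par_time = (want > have ? want : have) + overhead;
        }
    }
    return cur_par_time + (c->seq_time - cur_seq_time);
}

/* Estimated parallel execution time of the whole conjunction. */
double conjunction_par_time(const Conjunct *conjs, size_t n) {
    double prod_time[MAX_VARS] = { 0.0 };
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        double t = conjunct_par_time(&conjs[i], overhead, prod_time);
        if (t > total)
            total = t;
    }
    return total + overhead;    /* final barrier */
}
\end{verbatim}
\normalsize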
SLIDE Results
\begin{center}
{\small
\begin{tabular}{|l|r|r|r|}
\hline
& \multicolumn{1}{|c|}{\textbf{mandelbrot}}
& \multicolumn{1}{|c|}{\textbf{raytracer}}
& \multicolumn{1}{|c|}{\textbf{matrixmult}}
\\
\hline
\textbf{~~~~~~~~seq} & 19.6~~~(0.95) & 17.8~~~(1.26) & 7.7~~~(1.43)\\
\hline
\textbf{thread-safe seq} & 18.6~~~(1.00) & 22.4~~~(1.00) & 11.0~~~(1.00)\\
\hline
\multicolumn{4}{|c|}{\textbf{thread-safe p1}} \\
\hline
\textbf{Indep only} & 18.6~~~(1.00) & 22.3~~~(1.00) & 11.0~~~(1.00)\\
\textbf{Naive} & 18.6~~~(1.00) & 22.4~~~(1.00) & 11.0~~~(1.00)\\
\textbf{Overlap} & 18.5~~~(1.01) & 22.4~~~(1.00) & 11.0~~~(1.00)\\
\hline
\multicolumn{4}{|c|}{\textbf{thread-safe p2}} \\
\hline
\textbf{Indep only} & 18.6~~~(1.00) & 22.4~~~(1.00) & 5.5~~~(1.99) \\
\textbf{Naive} & 9.4~~~(1.98) & 13.0~~~(1.72) & 11.0~~~(1.00) \\
\textbf{Overlap} & 9.5~~~(1.96) & 12.8~~~(1.75) & 5.5~~~(1.99) \\
\hline
\multicolumn{4}{|c|}{\textbf{thread-safe p4}} \\
\hline
\textbf{Indep only} & 18.7~~~(0.99) & 22.6~~~(0.99) & 2.9~~~(3.85)\\
\textbf{Naive} & 4.8~~~(3.88) & 7.9~~~(2.84) & 11.0~~~(1.00)\\
\textbf{Overlap} & 4.8~~~(3.88) & 7.8~~~(2.87) & 2.9~~~(3.85)\\
\hline
\end{tabular}
}
\end{center}
SUBSECTION Loop control
SLIDE Execution of right-recursive parallel code
Recall the problem with parallel right recursion.
\vspace{-2mm}
\begin{center}
\epsfbox{linear_context_usage.eps}
\end{center}
\vspace{-4mm}
If each context contains 4MB of stack space,
a loop of only 256 iterations will consume 1GB!
SLIDE Loop control structure
Our solution to this problem
associates a \emph{loop control structure} with each loop.
This structure contains a fixed number of slots,
each of which has a pointer to a single context.
Once a context is allocated to a slot,
the context is not released until the loop has finished.
Instead, it is reused for later iterations.
% The structure also has bookkeeping fields,
% such as a mutex to manage concurrent access
% and per-slot booleans indicating whether a slot is currently in-use.
We replace the original looping procedure
with code that creates the loop control structure,
before calling a renamed and transformed version of its old self.
\code{map\_foldl(M, F, Xs, Acc0, Acc) :-} \\
\code{~~~~\red{create\_loop\_control}(LC),} \\
\code{~~~~map\_foldl\red{\_lc}(LC, M, F, Xs, Acc0, Acc).} \\
SLIDE Loop control transformation
\code{map\_foldl\red{\_lc}(LC, M, F, [X $|$ Xs], Acc0, Acc) :-} \\
\code{~~~~LCS = \red{lc\_wait\_for\_free\_slot}(LC),} \\
\code{~~~~\red{lc\_spawn\_off}(LC, LCS, spawn\_off\_label),} \\
\code{~~~~map\_foldl\red{\_lc}(LC, M, F, Xs, Acc1, Acc). \% Tail call} \\
\code{spawn\_off\_label:} \\
\code{~~~~M(X, Y);} \\
\code{~~~~F(Y, Acc0, Acc1);} \\
\code{~~~~\red{lc\_free\_slot}(LC, LCS);} \\
\vspace{4mm}
\code{map\_foldl\red{\_lc}(LC, \_, \_, [], Acc, Acc) :-} \\
\code{~~~~\red{lc\_finish}(LC).} \\
\vspace{\baselineskip}
Only as many iterations of the loop can be active as there are slots in
the loop control structure.
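The loop control primitives might be sketched as follows in C. This is a
simplification with invented fields: the slot identity (\code{LCS}) is
elided, and the real structure also keeps the per-slot contexts so they can
be reused rather than freed.
\small
\begin{verbatim}
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  changed;
    int             free_slots;  /* initialised to num_slots    */
    int             num_slots;   /* fixed when the loop starts  */
} LoopControl;

/* lc_wait_for_free_slot: block until some iteration finishes. */
void lc_wait_for_free_slot(LoopControl *lc) {
    pthread_mutex_lock(&lc->lock);
    while (lc->free_slots == 0)
        pthread_cond_wait(&lc->changed, &lc->lock);
    lc->free_slots--;            /* claim a slot */
    pthread_mutex_unlock(&lc->lock);
}

/* lc_free_slot: called by a spawned-off iteration when it finishes. */
void lc_free_slot(LoopControl *lc) {
    pthread_mutex_lock(&lc->lock);
    lc->free_slots++;
    pthread_cond_broadcast(&lc->changed);
    pthread_mutex_unlock(&lc->lock);
}

/* lc_finish: the base case waits for all outstanding iterations. */
void lc_finish(LoopControl *lc) {
    pthread_mutex_lock(&lc->lock);
    while (lc->free_slots < lc->num_slots)
        pthread_cond_wait(&lc->changed, &lc->lock);
    pthread_mutex_unlock(&lc->lock);
}
\end{verbatim}
\normalsize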
SLIDE Execution of loop controlled code
The first time each slot is used, we create a context for that slot.
After the initial rampup period,
the loop always uses the configured number of contexts, never more.
After the loop terminates, we free the contexts.
\vspace{-2mm}
\begin{center}
\epsfbox{lc_context_usage.eps}
\end{center}
SLIDE Memory usage results: contexts and megabytes
\vspace{-1\baselineskip}
\begin{center}
\scalebox{0.93}{
\input{../mem_table_talk}}
\end{center}
SLIDE Time results: seconds and speedups
\begin{center}
\scalebox{0.97}{
\input{../times_table_talk}}
\end{center}
SUBSECTION Conclusion
SLIDE Conclusion
\vspace{-0.5em}
\begin{itemize}
\item We're able to find and exploit profitable parallelism in small
programs.
\item The analysis explores only the parts of the call graph that
might be profitably parallelised.
\item Our novel overlap analysis allows us to estimate how dependencies
affect parallel execution.
\item Our loop control transformation eliminates excessive memory usage
and maintains tail recursion.
\item Several modifications to the runtime system have improved
efficiency.
\item We have adapted the ThreadScope parallel profile visualisation
system for use with Mercury (not shown).
\end{itemize}
\Large
\begin{center}
Thank you
\end{center}
\normalsize
SUBSECTION Spare slides
SLIDE Old engine wakeup code
When a new context is created or an existing one becomes runnable and an engine
is sleeping,
the engine is woken up.
If the context must be executed on a particular engine
(because foreign code needs to use the C-stack owned by that engine)
then all the engines are woken up so that the correct one can execute the
context.
Engines sleep using a POSIX condition variable associated with the runqueue lock.
Sleeping engines wake up periodically to attempt to steal sparks.
SLIDE New engine wakeup code
In the new system each engine has a semaphore;
engines wait on their semaphore when they are idle.
When a context becomes runnable and an engine is sleeping, the runtime system
wakes the engine by posting to its semaphore.
This prevents races that could occur in the previous system.
Individual engines can be targeted specifically,
so if a context has only one valid engine,
then only that engine will be woken.
We can also pass contexts directly to engines,
avoiding the runqueue and synchronisation in many cases.
When a spark is created, a sleeping engine is woken and told which spark queue
contains a spark it can execute.
The new code is much more responsive.
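A sketch of the targeted wakeup in C, with invented names; the real code
must also handle the race where the target engine is not actually asleep.
\begin{verbatim}
#include <semaphore.h>

enum { RUN_CONTEXT, CHECK_SPARK_QUEUE };

typedef struct {
    sem_t  wake;        /* idle engines sleep on this semaphore     */
    int    action;      /* what the engine should do once woken     */
    void  *payload;     /* the context, or the spark queue to check */
} Engine;

/* An idle engine sleeps here instead of polling a global run queue. */
void engine_sleep(Engine *e) {
    sem_wait(&e->wake);
    /* ... act on e->action and e->payload ... */
}

/* Targeted wakeup: hand one specific engine a runnable context,
   bypassing the global run queue entirely. */
void wake_engine_with_context(Engine *e, void *context) {
    e->payload = context;    /* published by the sem_post below */
    e->action  = RUN_CONTEXT;
    sem_post(&e->wake);
}
\end{verbatim}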
SUBSECTION Choosing how to parallelise
SLIDE Choosing how to parallelise
\begin{verbatim}
g1, g2, g3
g1 & (g2, g3)
(g1, g2) & g3
g1 & g2 & g3
\end{verbatim}
Each of these is a parallel conjunction of sequential conjunctions,
with some of the conjunctions having only one conjunct.
If there is a \code{g4}, you can
(a) execute it in parallel with all the other parallel conjuncts, or
(b) execute it in sequence with the goals in the last sequential
conjunction.
There are thus $2^{N-1}$ ways
to parallelise a conjunction of $N$ goals.
If you allow goals to be reordered,
the search space would become larger still.
SLIDE Even simple code can have many conjuncts.
\begin{verbatim}
X = (-B + sqrt(pow(B, 2) - 4*A*C)) / (2 * A)
\end{verbatim}
Flattening the above expression gives 12 small goals,
each executing one primitive operation:
\begin{verbatim}
V1 = 0            V5 = 4            V9  = sqrt(V8)
V2 = V1 - B       V6 = V5 * A       V10 = V2 + V9
V3 = 2            V7 = V6 * C       V11 = V3 * A
V4 = pow(B, V3)   V8 = V4 - V7      X   = V10 / V11
\end{verbatim}
Primitive goals are not worth spawning off.
Nonetheless, they can appear between goals
that should be parallelised against one another,
greatly increasing the value of $N$.
SLIDE Reducing the search space.
Currently we do three things to reduce the size of the search space
from $2^{N - 1}$:
\begin{itemize}
\item Remove whole subtrees of the search tree that are worse than
the current best solution (a variant of ``branch and bound'').
\item During search we always follow the most promising-looking
branch before backtracking to the alternative branch.
\item If the search is still taking too long,
we switch to a greedy search that is approximately linear.
\end{itemize}
This allows us to fully explore the search space when it is small,
while saving time by exploring only part of the search space when it is large.
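A skeleton of the search in C. The cost model here is a deliberately crude
stand-in (the real system uses the overlap analysis), but it shows the
branch-and-bound structure: an optimistic estimate prunes whole subtrees
that cannot beat the best complete parallelisation found so far.
\small
\begin{verbatim}
#define MAX_GOALS 16

typedef struct {
    int    n;                    /* number of goals              */
    double cost[MAX_GOALS];      /* sequential cost of each goal */
} Conjunction;

/* Stand-in cost model: goals in one sequential group add up,
   parallel groups overlap perfectly, undecided goals cost nothing.
   This is an optimistic lower bound, so pruning on it is safe. */
static double estimate_par_time(const Conjunction *c,
                                const int *split, int decided) {
    double max_group = 0.0, group = c->cost[0];
    for (int i = 0; i < decided; i++) {
        if (split[i]) {          /* boundary i is a parallel "&" */
            if (group > max_group)
                max_group = group;
            group = 0.0;
        }
        group += c->cost[i + 1];
    }
    return group > max_group ? group : max_group;
}

static double best_time;         /* start at the sequential time */

/* split[i] says whether the boundary between goal i and goal i+1
   is parallel (1) or sequential (0); decided counts fixed ones. */
static void search(const Conjunction *c, int *split, int decided) {
    double bound = estimate_par_time(c, split, decided);
    if (bound >= best_time)
        return;                  /* prune this whole subtree     */
    if (decided == c->n - 1) {
        best_time = bound;       /* complete, and a new best     */
        return;
    }
    /* Most promising-looking branch first: here, naively, parallel. */
    split[decided] = 1; search(c, split, decided + 1);
    split[decided] = 0; search(c, split, decided + 1);
}
\end{verbatim}
\normalsize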
SUBSECTION Push-into-goal transformation
SLIDE Expensive goals in different conjunctions
The call to \code{typecheck} and the call to \code{typecheck\_preds} are
expensive enough to be worth parallelising.
But the if-then-else that contains the call to \code{typecheck}
typically costs only a tenth as much as \code{typecheck} itself,
because the then-branch is executed only 10\% of the time
(the numbers in the left margin below are execution counts).
It is not worth parallelising the if-then-else against \code{typecheck\_preds}.
\small
\begin{verbatim}
    typecheck_preds([], [], ...).
    typecheck_preds([Pred0 | Preds0], [Pred | Preds], ...) :-
        ( if should_typecheck(Pred0) then
 10         typecheck(Pred0, Pred, ...)
        else
 90         Pred = Pred0
        ),
100     typecheck_preds(Preds0, Preds, ...).
\end{verbatim}
\normalsize
SLIDE Push later goals into earlier compound goals
We can push the call to \code{typecheck\_preds} into the if-then-else
and parallelise only the then-part:
\small
\begin{verbatim}
typecheck_preds([], [], ...).
typecheck_preds([Pred0 | Preds0], [Pred | Preds], ...) :-
    ( if should_typecheck(Pred0) then
        typecheck(Pred0, Pred, ...) &
        typecheck_preds(Preds0, Preds, ...)
    else
        Pred = Pred0,
        typecheck_preds(Preds0, Preds, ...)
    ).
\end{verbatim}
\normalsize
Our analysis can perform this transformation
as part of deciding whether this parallelisation is worthwhile.