forked from OpenMP/Examples
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathExamples_SIMD.tex
126 lines (85 loc) · 5.34 KB
/
Examples_SIMD.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
%\pagebreak
\section{\code{simd} and \code{declare} \code{simd} Constructs}
\label{sec:SIMD}
The following example illustrates the basic use of the \code{simd} construct
to assure the compiler that the loop can be vectorized.
\cexample{SIMD}{1}
\ffreeexample{SIMD}{1}
When a function can be inlined within a loop the compiler has an opportunity to
vectorize the loop. By guaranteeing SIMD behavior of a function's operations,
characterizing the arguments of the function and privatizing temporary
variables of the loop, the compiler can often create faster, vector code for
the loop. In the examples below the \code{declare} \code{simd} construct is
used on the \plc{add1} and \plc{add2} functions to enable creation of their
corresponding SIMD function versions for execution within the associated SIMD
loop. The functions characterize two different approaches of accessing data
within the function: by a single variable and as an element in a data array,
respectively. The \plc{add3} C function uses dereferencing.
The \code{declare} \code{simd} constructs also illustrate the use of
\code{uniform} and \code{linear} clauses. The \code{uniform(fact)} clause
indicates that the variable \plc{fact} is invariant across the SIMD lanes. In
the \plc{add2} function \plc{a} and \plc{b} are included in the \code{unform}
list because the C pointer and the Fortran array references are constant. The
\plc{i} index used in the \plc{add2} function is included in a \code{linear}
clause with a constant-linear-step of 1, to guarantee a unity increment of the
associated loop. In the \code{declare} \code{simd} construct for the \plc{add3}
C function the \code{linear(a,b:1)} clause instructs the compiler to generate
unit-stride loads across the SIMD lanes; otherwise, costly \emph{gather}
instructions would be generated for the unknown sequence of access of the
pointer dereferences.
In the \code{simd} constructs for the loops the \code{private(tmp)} clause is
necessary to assure that the each vector operation has its own \plc{tmp}
variable.
\cexample{SIMD}{2}
\ffreeexample{SIMD}{2}
A thread that encounters a SIMD construct executes a vectorized code of the
iterations. Similar to the concerns of a worksharing loop a loop vectorized
with a SIMD construct must assure that temporary and reduction variables are
privatized and declared as reductions with clauses. The example below
illustrates the use of \code{private} and \code{reduction} clauses in a SIMD
construct.
\cexample{SIMD}{3}
\ffreeexample{SIMD}{3}
A \code{safelen(N)} clause in a \code{simd} construct assures the compiler that
there are no loop-carried dependencies for vectors of size \plc{N} or below. If
the \code{safelen} clause is not specified, then the default safelen value is
the number of loop iterations.
The \code{safelen(16)} clause in the example below guarantees that the vector
code is safe for vectors up to and including size 16. In the loop, \plc{m} can
be 16 or greater, for correct code execution. If the value of \plc{m} is less
than 16, the behavior is undefined.
\cexample{SIMD}{4}
\ffreeexample{SIMD}{4}
The following SIMD construct instructs the compiler to collapse the \plc{i} and
\plc{j} loops into a single SIMD loop in which SIMD chunks are executed by
threads of the team. Within the workshared loop chunks of a thread, the SIMD
chunks are executed in the lanes of the vector units.
\cexample{SIMD}{5}
\ffreeexample{SIMD}{5}
%%% section
\section{\code{inbranch} and \code{notinbranch} Clauses}
\label{sec:SIMD_branch}
The following examples illustrate the use of the \code{declare} \code{simd}
construct with the \code{inbranch} and \code{notinbranch} clauses. The
\code{notinbranch} clause informs the compiler that the function \plc{foo} is
never called conditionally in the SIMD loop of the function \plc{myaddint}. On
the other hand, the \code{inbranch} clause for the function goo indicates that
the function is always called conditionally in the SIMD loop inside
the function \plc{myaddfloat}.
\cexample{SIMD}{6}
\ffreeexample{SIMD}{6}
In the code below, the function \plc{fib()} is called in the main program and
also recursively called in the function \plc{fib()} within an \code{if}
condition. The compiler creates a masked vector version and a non-masked vector
version for the function \plc{fib()} while retaining the original scalar
version of the \plc{fib()} function.
\cexample{SIMD}{7}
\ffreeexample{SIMD}{7}
%%% section
\section{Loop-Carried Lexical Forward Dependence}
\label{sec:SIMD_forward_dep}
The following example tests the restriction on an SIMD loop with the loop-carried lexical forward-dependence. This dependence must be preserved for the correct execution of SIMD loops.
A loop can be vectorized even though the iterations are not completely independent when it has loop-carried dependences that are forward lexical dependences, indicated in the code below by the read of \plc{A[j+1]} and the write to \plc{A[j]} in C/C++ code (or \plc{A(j+1)} and \plc{A(j)} in Fortran). That is, the read of \plc{A[j+1]} (or \plc{A(j+1)} in Fortran) before the write to \plc{A[j]} (or \plc{A(j)} in Fortran) ordering must be preserved for each iteration in \plc{j} for valid SIMD code generation.
This test assures that the compiler preserves the loop carried lexical forward-dependence for generating a correct SIMD code.
\cexample{SIMD}{8}
\ffreeexample{SIMD}{8}