Paper Content:
Page 1:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based
Pointers
ANDREA LEPORI, ETH Zurich, Switzerland
ALEXANDRU CALOTOIU, ETH Zurich, Switzerland
TORSTEN HOEFLER, ETH Zurich, Switzerland
Pointers are an integral part of C and other programming languages. They enable substantial flexibility from
the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However,
accesses done through pointers are often hard to track, and challenging to understand for optimizers, compilers,
and sometimes, even for the developers themselves because of the direct memory access they provide. We
alleviate this problem by exposing additional information to analyzers and compilers. By separating the
concept of a pointer into a data container and an offset, we can optimize C programs beyond what other
state-of-the-art approaches are capable of, in some cases even enabling auto-parallelization. Using this process,
we are able to successfully analyze and optimize code from OpenSSL, the Mantevo benchmark suite, and
the Lempel–Ziv–Oberhumer compression algorithm. We provide the only automatic approach able to find
all parallelization opportunities in the HPCCG benchmark from the Mantevo suite the developers identified
and even outperform the reference implementation by up to 18%, as well as speed up the PBKDF2 algorithm
implementation from OpenSSL by up to 11x.
CCS Concepts: •Software and its engineering →Source code generation ;•Computing methodologies
→Parallel computing methodologies .
Additional Key Words and Phrases: pointer analysis, automatic parallelization
1 INTRODUCTION
C is a widely used programming language, one that is heavily relied upon in many codebases.
In 2022 on GitHub C and C++ accounted together for 4.6M repositories [ 28], this being a 23.5%
growth compared to the previous year [ 29]. According to the TIOBE index [ 52], C is the second
most popular programming language.
The C language was developed in the 1970s at about the same time as the first microprocessor [ 20,
49]. Parallel computing only became widespread decades later, driven by the end of Dennard
scaling [ 12,45]. Hardware efforts to continue growth following Moore’s Law past the power wall
lead to a renewed focus on parallel computing as a way to continue improving performance [ 11,27].
C was not designed with parallel programming in mind as its inception was much earlier than
the explosion of parallel computing. Libraries provide support for parallel paradigms - such as
OpenMP [ 22] or MPI [ 44] - but add a difficult-to-manage layer of complexity. The flexibility and
low-level nature of C code make debugging challenging, even more so when the code is executed
in parallel [15, 21]. This makes parallelizing code a long and complex process.
New Paper, Not an Extension of a Conference Paper
Authors’ addresses: Andrea Lepori, ETH Zurich, Switzerland, andrea.lepori@inf.ethz.ch; Alexandru Calotoiu, ETH Zurich,
Switzerland, acalotoiu@inf.ethz.ch; Torsten Hoefler, ETH Zurich, Switzerland, torsten.hoefler@inf.ethz.ch.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses,
contact the owner/author(s).
©2024 Copyright held by the owner/author(s).
1544-3566/2024/3-ART
https://doi.org/10.1145/3701993
ACM Trans. Arch. Code Optim.arXiv:2503.03359v1 [cs.PL] 5 Mar 2025
Page 2:
PREPRINT2 Lepori et al.
Frameworks such as Polly [ 30] and Pluto [ 17] promise automatic parallelism — these are however
limited to static control parts (SCoPs). The Intel compiler offers the icc parallel mode that can
identify and auto-parallelize several predefined programming patterns. Data-centric methods [ 19]
try to push the boundary even further by leveraging complex dataflow analysis techniques to better
understand data movement and dependencies. On modern systems, data movement is the most
expensive operation in most programs, concerning both time and energy consumption [ 35], and
tends to be the biggest bottleneck in computations [ 47]. Many frameworks and compilers such as
HPVM [ 39], Halide [ 48], Jax [ 26], and DaCe [ 14] leverage dataflow analysis extensively in their
pipeline.
The major obstacle in the static analysis of data accesses is unfortunately the use of pointers.
Though they are an integral part of the C language, they often become a barrier on the road to
performance. This can be seen in the two examples in Figure 1. On the right side of the figure
we can see that the inner loop only has independent data accesses, and modern compilers are
able to detect that. But no compiler we tested was able to detect that the outer loop has no data
dependency. Every access done with p[k] is different because of the pointer movement in the outer
loop. Current analysis methods do not consider this type of access as it would require complex
and expensive pointer tracking, and even then it would require a correlation between the pointer
movement and loop iterations.
Accessing the same pointer at a different in-
dex does not guarantee that different data is ac-
cessed — pointers can move during execution.Accessing the same pointer at the same location
with access to different data. Creation of false
positives in data dependency analysis.
p[i+1] = 0; // array-style access
p++; // pointer moves
p[i] = 42; // array-style access to same memoryfor (int i=0; i<cplen *n;i+=cplen ) {
for (int k=0; k<cplen ;k++) {
p[k] = p[k] ^ data [k];// no loop dependence
}
p+=cplen ;// outer iteration on different data
}
Snippet of the PBKDF2 implementation from OpenSSL
Fig. 1. Two example codes where we can have false positives and negatives with pointer accesses
Our solution aims to address this challenge by tracking data containers (as opposed to pointers)
across the entire program. We then split pointers into the data container and an integer index to it.
This effectively makes data accesses explicit while preserving execution semantics, enabling better
analyzability of the code. A state-of-the-art data-centric compiler can then attempt to parallelize and
improve the performance of the code. The implementation of our methods is explained in Section 2.
Afterwards in Section 3 we discuss the limitations and motivation behind them. In Section 4 we
will discuss possible solutions to mitigate the limitations showing some specific transformations
handling special cases.
ACM Trans. Arch. Code Optim.
Page 3:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 3
Contributions
•We introduce a novel method to statically split pointers into a data container and an index to
it, creating explicit data accesses to aid further analysis.
•We generate a parallel version of the Mantevo HPCCG benchmark automatically finding all
parallelization opportunities the benchmark developers envisioned manually. Our automatic
workflow outperforms the manually tuned version by up to 6%.
•We automatically parallelize the previously serial PBKDF2 algorithm implemented in OpenSSL
and obtain up to an 11x speedup.
2 POINTER DISAGGREGATION
We propose a static method to improve the analyzability of pointers that are used as iterators over
data containers. We first provide a high level overview explaining our method, we follow with
some notes toward a proof and the description of the technical implementation. Then we compare
our method with previous work. We also provide some examples and results to show the value of
our transformation.
2.1 High level overview
Original code with pointer movements
p=a;
p+=i;
x=p[k];ap
ielements
ap p[k] = x
kelements
container containeroffset offsetTransformed code with adjunct instead of pointer
movements
p=a;
p_adj = 0;
p_adj +=i;
x=p[p_adj +k];ap p[p_adj]
ielements
ap p[p_adj]p[p_adj + k] = x
p_adj elements kelements
Fig. 2. Representation of data access patterns with pointer movement (left) and static accesses with adjunct
(right). Annotated with the accessed memory locations and pointer values. In the case of pointer movements
the pointer is both the container and the offset while with the adjunct the pointer is only the container.
Pointers are used both as handles for data containers and to iterate over them. By splitting these
two semantically different use cases we intend to improve element-sensitive analysis of pointer
accesses. For each pointer, we create an adjunct intvariable that represents the offset from the
start of the data container. Note that the type of the adjunct should be big enough to address every
element plus one in the connected data-container. The goal of the adjunct is to allow iteration over
the data container without modifying the handle used to access said data container. We need the
adjunct type to be plus one larger than the size of the data container as with loops containing *p++
we could have the need to get the address after the last element.
Figure 2 shows a high-level example of how the adjunct is used. We can see from the figure that
the value of xafter both executions is the same. For general code, it is important to note that p[i]
ACM Trans. Arch. Code Optim.
Page 4:
PREPRINT4 Lepori et al.
is semantically equivalent to *(p + i) as per C99 standard [ 34]. In Figure 3 we present a minimal
example to show the value of introducing the adjunct variable. We provide the resulting assembly
to show that when using the adjunct , the compiler is able to correctly identify that 𝑝𝑎𝑑𝑗is only
used as an iterator similar to 𝑖and merges the two, resulting in more compact, efficient code.
C representation
for (int i=0; i<n;i++) {
p[0] = i+ 1;
p++;
}ASM representation
.L4:
add eax , 1 ; i++
add rdx , 4 ; p++
mov DWORD PTR [rdx-4 ],eax
cmp eax ,r14d ; i < n
jne .L4
for (int i=0; i<n;i++) {
p[p_adj + 0] = i+ 1;
p_adj ++;
}.L4:
mov DWORD PTR [r12-4+rax *4], eax
add rax , 1 ; i++
cmp rdx ,rax ; i < n
jne .L4No pointer increment
⇒one less instructionadjunct transformationloop iterator→eax
pointer iterator→rdx
𝑛value→r14d
loop iterator→rax
pointer iterator↰
container base address →r12
𝑛value→rdx
Fig. 3. Difference in the assembly representation (compiled with gcc12.2 and -O2) between a code using
pointer movement and one with static access after the adjunct is introduced.
The adjunct transformation enhances code analyzability and exposes additional parallelization
opportunities. Such transformed pointers are equivalent to arrays, in that the symbol represents a
data container that is statically known. After the transformation the adjunct variable dictates how
the pointer is accessed. Because it is an intvariable there are more compiler analysis methods
available to track the value. Inside loops a compiler could apply Loop Strength-Reduction (LSR) to
simplify addressing computations. Note that this can only be done after the adjunct transformation
is applied as LSR only acts on integer values and not addresses (pointers). If instead the adjunct
variable is never used - if a pointer is never moved for example - then the compiler will be able
to optimize away the adjunct with constant propagation resulting in no performance losses. A
summary of the transformation is described in Figure 4. To note is that the expression p = q⋄
xis a combination of p = q; q = q⋄x. Because of this fact, we do not detail this case in the
following sections as it follows from the other statements.
The adjunct type is deliberately a signed integer to replicate the behavior of actual pointer
movement. This is because of the non-wrapping behaviour when the adjunct becomes negative. It
is worth noting that accessing a pointer with a negative offset might be considered valid behavior
under certain circumstances. For instance, if a programmer is confident that the memory is initialized
within the same program, such as when an opaque library provides a pointer within an array, and
the programmer is aware of the underlying structure.
Function arguments and calls undergo minimal modifications. When a function is called, it is
treated as a dereference, and we pass the pointer with the adjunct offset applied. The function
argument declarations remain unchanged, meaning that the adjunct information is lost after the
function call. This is generally not an issue, as subroutines often do not require the additional
information. Moreover, this problem is mitigated by the compiler’s automatic inlining or by manually
annotating functions to force inlining, thereby eliminating the function call.
ACM Trans. Arch. Code Optim.
Page 5:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 5
Original code Modified code
int* p; int* p;
int p_adj = 0;
*p p[p_adj]
p[x] p[x + p_adj]
p = p⋄x;
orp⋄= x; p_adj = p_adj ⋄x;
p = q; p = q;
p_adj = q_adj;
p = q⋄x; p = q;
p_adj = q_adj⋄x;Original code Modified code
f(p); f(p + p_adj);
void f(int* p) { void f(int* p) {
int p_adj = 0;
// body // body
int k = *p; int k = p[p_adj];
} }
Fig. 4. Transformation summary to improve pointer analyzability. Both pandqareintpointers ( int* p, q ),
xis an intexpression and⋄is any integer binary operator. Note that the data-type intis only for explanation
purposes and can be substituted with any other type. The adjunct -type intcan be interchanged with any
integer type that fits the size of the data-container.
2.2 Notes toward a proof
We provide a proof sketch that the transformation showed in Figure 4 does not modify the accesses
done with pointers. First for initialization of a pointer let 𝑝be a pointer to some data. Then let 𝑝𝑎𝑑𝑗
be a new unique integer variable not already present in the program and initialized to zero when
the memory of the pointer is first initialized. I.e. we have 𝑝=𝑚𝑎𝑙𝑙𝑜𝑐(...);𝑝𝑎𝑑𝑗=0where𝑝𝑎𝑑𝑗was
not previously present in the variable definitions. This is a new variable definition without any
name conflict hence it will not influence other parts of the program.
Let𝑆be an arbitrary 𝑛length sequence of statements s.t.
𝑆=(𝑝:=𝑝+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛)
Then we define the syntax 𝑆;𝑝by using the sequential operator ;with a variable name at the end
as the value of 𝑝after execution of 𝑆. By our definition we have that:
(𝑆;𝑝)=𝑝+𝑛∑︁
𝑖=0𝑥𝑖
Let𝑆′be an𝑛length sequence of statements (note that 𝑥𝑖are the same values as in 𝑆and in the
same order).
𝑆′= 𝑝𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛
As before by definition we have that
(𝑆′;𝑝𝑎𝑑𝑗)=𝑝𝑎𝑑𝑗+𝑛∑︁
𝑖=0𝑥𝑖
We can see that
(𝑆;𝑝)=𝑝+𝑛∑︁
𝑖=0𝑥𝑖=𝑝+𝑝𝑎𝑑𝑗+𝑛∑︁
𝑖=0𝑥𝑖−𝑝𝑎𝑑𝑗=𝑝+(𝑆′;𝑝𝑎𝑑𝑗)−𝑝𝑎𝑑𝑗
ACM Trans. Arch. Code Optim.
Page 6:
PREPRINT6 Lepori et al.
By using the sequential operator ;we define a program sequence that defines the pointer 𝑝, integer
𝑝𝑎𝑑𝑗and executes all statements in 𝑆. This represents an arbitrary program from the definition to
the access of a pointer 𝑝with arbitrary pointer movements in between.
(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆;𝑝)))=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑝+(𝑆′;𝑝𝑎𝑑𝑗)−𝑝𝑎𝑑𝑗)))
=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑝+(𝑆′;𝑝𝑎𝑑𝑗))))
We note that(𝑆′;𝑝)=𝑝and(𝑆;𝑝𝑎𝑑𝑗)=𝑝𝑎𝑑𝑗as the statements does not affect the evaluated
variable. Then we have that if we substitute any execution 𝑝=𝑝+𝑥𝑖with𝑝𝑎𝑑𝑗=𝑝𝑎𝑑𝑗+𝑥𝑖i.e.
executing𝑆′instead of𝑆then
(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆;𝑝)))=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑝+(𝑆′;𝑝𝑎𝑑𝑗))))
=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆′;𝑝)+(𝑆′;𝑝𝑎𝑑𝑗)))
=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆′;𝑝+𝑝𝑎𝑑𝑗)))
This means that by substituting 𝑆with𝑆′and accessing every pointer using 𝑝+𝑝𝑎𝑑𝑗as their
address we access the same memory location. Hence the program will be semantically equivalent. 𝑆
and𝑆′only contain statements referring to 𝑝or𝑝𝑎𝑑𝑗for simplicity of the proof. Any other statement
could be interleaved including control flow statements. By getting all possible execution paths and
removing all statements that don’t act directly on 𝑝then we have the same case as 𝑆that can be
transformed in 𝑆′.
The proof sketch for the correctness of the pointer assignment transformation can be found in
Appendix A.
2.3 Technical implementation
We will now discuss the algorithm that governs our approach. The algorithms are applied to the
AST (abstract syntax tree) to produce the modified code including the adjunct transformation. It is
important to note that the whole transformation is source-to-source. The implementation leverages
libclang to parse the source code. We keep a 𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 map that maps pointers to its
adjunct . As previously mentioned the type of the adjunct variable should be an integer type big
enough to address each element of the data-container. We mostly use intas it is big enough for
most practical purposes but it could be easily interchanged with bigger or smaller integer types.
For every variable declaration, we check if it is a pointer type. In that case, we append an
additional variable declaration for the adjunct variable with type intand initialize it to 0. We keep
a mapping between the pointer variables and their adjunct . To avoid name collisions for the adjunct
variable name a simple counter and checks are used.
for all variable declaration of the form 𝑡𝑦𝑝𝑒𝑛𝑎𝑚𝑒 =𝑖𝑛𝑖𝑡do
if𝑡𝑦𝑝𝑒 is a pointer type then
𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑁𝑎𝑚𝑒←𝑔𝑒𝑡𝑈𝑛𝑖𝑞𝑢𝑒𝑁𝑎𝑚𝑒(𝑛𝑎𝑚𝑒+“_𝑎𝑑𝑗”)
𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑛𝑎𝑚𝑒]←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑁𝑎𝑚𝑒
𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑁𝑎𝑚𝑒 init to 0
We analyze every array subscript expression ( a[i] , where ais the array expression and iis the
index expression) and verify if the array expression is in the mapping. In case we have a match
we modify the index expression to sum the additional adjunct variable. Note that we can handle
unary expressions of the type *peasily by transforming them to p[0] with a simple transformation
(which does not change the semantics as per C standard E1[E2] is equivalent to *(E1 + E2) ).
ACM Trans. Arch. Code Optim.
Page 7:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 7
for all array subscript expression of the form 𝑎𝑟𝑟𝑎𝑦[𝑖𝑛𝑑𝑒𝑥]do
if𝑎𝑟𝑟𝑎𝑦∈𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 then
𝑎𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑎𝑟𝑟𝑎𝑦]
𝑖𝑛𝑑𝑒𝑥 =𝑖𝑛𝑑𝑒𝑥+𝑎𝑑𝑗𝑢𝑛𝑐𝑡
Finally, we examine every assignment. Here we have two cases: p = p⋄xorp = q . For both,
the left expression must be in the mapping. In the first case ( p = p⋄x) the right side must be a
binary operation for which the first operator is the same as the left side of the assignment ( p = q
⋄x,pandqmust be the same pointer). We then substitute the pointer pwith its adjunct (p_adj
= p_adj⋄x). In the second case ( p = q ) we check that the right side is also a pointer in the
mapping. Then we append an additional statement that overwrites the adjunct of the left side with
theadjunct of the right ( p_adj = q_adj ).
for all assignment of the form 𝑙ℎ𝑠=𝑟ℎ𝑠do
if𝑙ℎ𝑠∈𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 then
if𝑟ℎ𝑠is a pointer∧𝑟ℎ𝑠∈𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 then
𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑙ℎ𝑠]
𝑟ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑟ℎ𝑠]
𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡 =𝑟ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡
else if𝑟ℎ𝑠==𝑝𝑡𝑟⋄𝑒𝑥𝑝𝑟 then
if𝑙ℎ𝑠==𝑝𝑡𝑟then
𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑙ℎ𝑠]
𝑙ℎ𝑠=𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡
𝑟ℎ𝑠=𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡⋄𝑒𝑥𝑝𝑟
2.4 Previous approaches
A similar approach was already done specifically for digital signal processing (DSP) applications [ 33].
Their approach could also be extended to general programs and modifies the source code to
remove pointer movement. Our biggest innovation is the use of the adjunct that enables run-time
determined pointer movements. Previous methods only handled static or compile-time decidable
pointer increments.
For example the expression:
if ( exp) {
x=ptr++;
} else {
y=ptr + 2;
}
cannot be handled by previous approaches as it cannot decide at compile time if the pointer must
be incremented by 1 or 2. With our adjunct approach it can be handled as the same branch just by
modifying the adjunct instead of the pointer.
While points-to analysis represents a cutting-edge method widely integrated into modern com-
pilers, its effectiveness in element-sensitive analysis, particularly concerning pointer movements,
remains limited. This constraint poses challenges for loop optimization strategies like vectorization
or parallelization, as points-to analysis struggles to grasp element-sensitive aliasing patterns. As
illustrated in Figure 5, conventional points-to analysis, as implemented in GCC 13, fails to statically
ACM Trans. Arch. Code Optim.
Page 8:
PREPRINT8 Lepori et al.
discern element-sensitive aliasing in code involving pointer movements. However, following our
transformation, GCC successfully identifies the absence of access collisions. It is worth noting that
even after the adjunct transformation, GCC continues to leverage points-to analysis to recognize
the independence between accessed elements. The shown example in Figure 5 includes only a
5-iteration loop to more obviously present the missing vectorization check branch.
Figure 5 illustrates that the transformation is also beneficial for dynamic allocations. In such cases,
the base pointer must be stored somewhere to ensure proper memory deallocation. However, even
with the base pointer known, the compiler often struggles to fully resolve all aliasing information.
int* arr =malloc (sizeof(int) * 10);
// random init array
int* p=arr;
int* q=arr;
if ( argc == 42) {
p+= 5;
} else {
q+= 5;
}
for (int i=0; i<5; i++) {
// GCC adds runtime vectorization checks
p[i] += q[i];
}Unrolled loop
no vectorizationUnrolled
vectorized
loop
main:
push rbp
mov edx, 10
mov rbp, rsp
push rbx
mov rdi, QWORD PTR [rsi+8]
xor esi, esi
and rsp, -32
call strtol
mov edi, 40
mov rbx, rax
call malloc
vmovd xmm0 , ebx
vmovd xmm3 , ebx
vmovq xmm1 , QWORD PTR .LC1 [rip]
vpbroadcastd ymm0 , xmm0
vpaddd ymm0 , ymm0 , YMMWORD PTR .LC0 [rip]
lea rdx, [rax+20]
vmovdqu YMMWORD PTR [rax], ymm0
vpshufd xmm0 , xmm3 , 0xe0
vpaddd xmm0 , xmm0 , xmm1
vmovq QWORD PTR [rax+32], xmm0
cmp ebx, 42
je .L2
main: @24
mov rcx, rdx
mov rdx, rax
mov rax, rcx
.L2:
lea rsi, [rax+4]
mov rcx, rdx
sub rcx, rsi
cmp rcx, 8
jbe .L3
.L2: @33
vmovdqu xmm2 , XMMWORD PTR [rdx]
vpaddd xmm0 , xmm2 , XMMWORD PTR [rax]
vmovdqu XMMWORD PTR [rdx], xmm0
mov ecx, DWORD PTR [rax+16]
add ecx, DWORD PTR [rdx+16]
.L4:
vpsrldq xmm1 , xmm0 , 8
mov DWORD PTR [rdx+16], ecx
vpaddd xmm0 , xmm0 , xmm1
vpsrldq xmm1 , xmm0 , 4
vpaddd xmm0 , xmm0 , xmm1
vmovd eax, xmm0
add eax, ecx
vzeroupper
mov rbx, QWORD PTR [rbp-8]
leave
ret.L3:
mov ecx, DWORD PTR [rax]
add DWORD PTR [rdx], ecx
mov ecx, DWORD PTR [rax+4]
add DWORD PTR [rdx+4], ecx
mov ecx, DWORD PTR [rax+8]
add DWORD PTR [rdx+8], ecx
mov ecx, DWORD PTR [rax+12]
add DWORD PTR [rdx+12], ecx
mov ecx, DWORD PTR [rax+16]
vmovdqu xmm0 , XMMWORD PTR [rdx]
add ecx, DWORD PTR [rdx+16]
jmp .L4int* arr =malloc (sizeof(int) * 10);
// random init array
int* p=arr;
int p_adj = 0;
int* q=arr;
int q_adj = 0;
if ( argc == 42) {
p_adj += 5;
} else {
q_adj += 5;
}
for (int i=0; i<5; i++) {
// GCC statically decided vectorization
p[i+p_adj ] += q[i+q_adj ];
}Unrolled
vectorized
loop
main:
push rbp
mov edx, 10
mov rbp, rsp
push rbx
mov rdi, QWORD PTR [rsi+8]
xor esi, esi
and rsp, -32
call strtol
mov edi, 40
mov rbx, rax
call malloc
vmovd xmm0 , ebx
cmp ebx, 42
vmovd xmm2 , ebx
vpbroadcastd ymm0 , xmm0
mov edx, 24
mov ecx, 4
vmovq xmm1 , QWORD PTR .LC1 [rip]
vpaddd ymm0 , ymm0 , YMMWORD PTR .LC0 [rip]
cmovne rcx, rdx
mov esi, 0
mov edx, 20
cmovne rdx, rsi
vmovdqu YMMWORD PTR [rax], ymm0
vpshufd xmm0 , xmm2 , 0xe0
vpaddd xmm0 , xmm0 , xmm1
lea rsi, [rax+rdx]
lea rdi, [rax+16+rdx]
vmovq QWORD PTR [rax+32], xmm0
vmovdqu xmm3 , XMMWORD PTR [rax-4+rcx]
vpaddd xmm0 , xmm3 , XMMWORD PTR [rsi]
vmovdqu XMMWORD PTR [rsi], xmm0
mov edx, DWORD PTR [rdi]
add edx, DWORD PTR [rax+12+rcx]
mov DWORD PTR [rdi], edx
vmovdqu xmm0 , XMMWORD PTR [rsi]
vpsrldq xmm1 , xmm0 , 8
vpaddd xmm0 , xmm0 , xmm1
vpsrldq xmm1 , xmm0 , 4
vpaddd xmm0 , xmm0 , xmm1
vmovd eax, xmm0
add eax, edx
vzeroupper
mov rbx, QWORD PTR [rbp-8]
leave
ret
Fig. 5. Left side: example code using pointer movements where the points-to analysis inside GCC is unable
to ensure non-aliasing accesses. Right side: code after adjunct transformation where GCC is able statically
decide non-aliasing using points-to analysis and integer analysis. Note the difference in the CFG of the
resulting assembly code.
Sui et al. [ 51] proposed additional analysis techniques for pointer loops, enhancing field-sensitive
and element-sensitive methods. However, their work does not address pointer movements, which
remain a challenge for compilers, even advanced ones. Our transformation effectively eliminates
pointer movements, allowing existing points-to analysis techniques, including advanced element-
sensitive approaches, to be applied with greater efficacy. By incorporating points-to analysis with
our approach, we enhance the depth of insights and enable advanced code optimization. Our
methodology involves modifying the source code, rendering it compatible with various analyzers
and optimizers. While our primary focus lies in data-centric frameworks, which often struggle
with handling pointers, it is essential to note that our approach is not limited to this domain. The
modified source code generated through our method seamlessly integrates with any existing or
future frameworks, offering versatility and long-term applicability.
Scalar Evolution (SCEV) [ 9] and other loop analysis methods aim to scrutinize loops for potential
optimizations and performance enhancements, including vectorization. However, as demonstrated
in numerous examples, current implementations often struggle to identify such opportunities,
particularly concerning pointers. Although there have been proposals for improving analysis
techniques [ 13,32,38,51], finding techniques that are both safe and general enough remains
challenging. Our transformation addresses this challenge by providing additional information to
ACM Trans. Arch. Code Optim.
Page 9:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 9
the compiler, enabling it to leverage existing SCEV techniques for performance improvements.
With better analysis techniques, more patterns can be matched. Importantly, our transformation
does not hinder the adoption of new techniques, as the array access pattern is well-established and
commonly utilized, making it compatible with existing analyzers. While advanced analyzers may
eventually incorporate pointer movements into their analyses, currently, this remains a significant
challenge, and most analyzers do not account for it. Hence, we contend that pointer movement
continues to pose a problem, and our transformation provides a valuable solution in the current
landscape.
2.5 Practical results
PBKDF2. To showcase the transformation in practice we test it on the code snippet from PBKDF2
first presented in Section 1. We expect the inner loop 2to be vectorized by every compiler. What
we want to test is whether compilers are able to discover the independence between the iterations
of the outer loop 1and the effects that the adjunct transformation has on it.
// snippet of the PBKDF2 implementation in OpenSSL
// int cplen, n
for (int i=0; i<cplen *n;i+=cplen ) {
for (int k=0; k<cplen ;k++) {
// no loop carried dependence
p[k] = p[k] ^ data [k];
}
// each outer loop iteration works on different
data ↩→
p+=cplen ;
}// adjunct transformed version
// int cplen, n
for (int i=0; i<cplen *n;i+=cplen ) { 1
for (int k=0; k<cplen ;k++) { 2
// data adjunct removed for readability
p[k+p_adj ] = p[k+p_adj ] ^ data [k];
}
// no pointer movement, the adjunct variable
is used instead ↩→
p_adj +=cplen ;
}
The code was run both with normal pointer movements (“no adjunct ”) and with the transforma-
tion (“with adjunct ” variables). It was compiled with different compilers: GCC 12.2.2, Clang 15.0.6,
and Polly (same Clang version). We also used the DaCe 0.14.1 data-centric framework to analyze
and optimize the code with the resulting file being compiled with GCC. The code was instrumented
with PAPI [ 23] to measure the execution time and the number of instructions. For each test, 10 runs
were executed and the median is reported, with the 95% confidence interval being reported. For the
number of instructions, the error range is negligible ( ±1000 instructions) and not reported. Runs
were done with n = 100 andcplen = 100·106. As DaCe and Polly produce OpenMP code we
report different values for different thread counts. We compare results with Polly as it is another
compiler that offers automatic parallelization. Note that after the adjunct transformation the code
only has array accesses which is a pattern supported by polyhedral compilers.
Figure 6 shows that the adjunct transformation gets a minimal improvement in single-threaded
performance. While with multiple threads the data-centric compiler DaCe was able to identify
the parallelization opportunity on loop 1. Polly however isn’t able to identify the parallel loop
either with or without the adjunct transformation. This is the case because the array-style access
ofpis non-affine as it uses the p_adj variable which is non-affine w.r.t. loop iterators. Which is a
requirement for the loop to be a SCoP (static control part). Polly only acts on SCoPs of programs.
Note that as DaCe uses a data-centric IR which utilizes static and scope defined data containers
(for arrays), pointer movements are not supported directly. The adjunct transformation is a way to
support pointer movements in the data-centric IR without the need to change its inner workings.
The number of instructions neither increase nor decrease with the adjunct . As Polly is a collection
of transformations on the LLVM-IR that is compiled using Clang the number of instructions is
identical (Polly wasn’t able to find any SCoPs hence no transformations were applied). Surprisingly
ACM Trans. Arch. Code Optim.
Page 10:
PREPRINT10 Lepori et al.
GCC DaCe Clang Polly2345
CompilerTime [s]Serial execution comparison
with and without adjunct
AMD Ryzen 5 2600X (6 cores)
With adjunct
Noadjunct / original
2 4 6 8 122345
DaCe (with adjunct )Polly Polly with adjunct
ThreadsParallel execution comparison
between DaCe and Polly
AMD Ryzen 5 2600X (6 cores)
Fig. 6. Serial and parallel runtimes for a simplified code snippet extracted from the PBKDF2 implementation
of OpenSSL. Note that DaCe can only analyze the code after using the adjunct transformation.
DaCe has the same number of instructions as GCC. Even if it uses GCC to compile to machine
instructions this is unexpected as DaCe applies multiple transformations to the code. This can
be explained as even after the transformations the computation is mostly the same only with
additional annotations for parallel execution.
Instructions [ 106]Noadjunct With adjunct
GCC 4 380 4 380
Clang 2 974 2 974
Polly 2 974 2 974
DaCe not supported 4 380
Table 1. Number of instructions generated by different compilers of the code snippet taken from PBKDF2.
This proves the value of the transformation, but also that the transformation benefits compilers
with data-centric IR the most: they can leverage the improved analyzability of pointer movements.
Data centric-paradigms otherwise represent pointers as indirection that is challenging - or even
impossible - to handle. All compilers see a minimal improvement but they are less able to capitalize
on the improved analyzability.
LZO (compression algorithms). Another instance where pointer movements within loops are
prevalent is in compression algorithms. As an illustration, we applied our transformation to the
compression function of the Lempel–Ziv–Oberhumer (LZO) algorithm [ 46]. This algorithm is
commonly utilized within the Linux Kernel for file-systems and memory, supporting live com-
pression/decompression operations. Notably, it finds applications in BTRFS [ 1], SquashFS [ 5],
initramfs [2], zram [7], and zswap [8].
Following a similar approach to previous benchmark studies of the LZO algorithm [ 36], we
executed multiple iterations of the algorithm, ensuring each run involved significant computation
without increasing memory usage. The compression was performed on a 1 MB file with 2000
iterations and a block size of 256KB. As before, we report the median with a 95% confidence interval
using the same compiler versions. We compared the original implementation compiled with GCC
ACM Trans. Arch. Code Optim.
Page 11:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 11
against DaCe with the adjunct transformation. It is noteworthy that the DaCe code was executed
with a single thread, as no significant parallelization opportunities were identified.
Our results indicate a slight improvement, thanks to the additional vectorization opportuni-
ties provided by the transformation. However, achieving more substantial improvements, along
with potential parallelization opportunities, proves challenging due to loop carried dependencies
which would require an algorithm redesign. Nevertheless, in compression algorithms, the adjunct
transformation can facilitate modest yet discernible enhancements in runtime performance.
DaCe+adjunct GCC2.833.23.4
CompilerTime [s]i7-8550U (16GB)
Fig. 7. Serial runtime (single thread) of LZO compression algorithm. Comparison of original implementation
compiled with GCC 12.2.2 and DaCe transformed version (including adjunct transformation)
3 LIMITATIONS
The method we propose can provide compilers with better insights into pointer behavior, but only
in certain scenarios. In this section, we go into detail as to the limitations of our approach. The
main limitation is about the static decidability of which data container a pointer is pointing to.
But we argue that in those cases an algorithm redesign would be needed for any performance
improvement.
The necessity of static decidability arises from the utilization of DaCe as the compiler/framework
for optimizations. Although DaCe achieves significant performance improvements, particularly in
automatic parallelization, it relies on the ability to statically determine data containers. However,
theadjunct transformation itself encounters limitations primarily associated with higher-order
pointers, as elaborated in Subsection 3.2.
Another important aspect to highlight is the utility of our adjunct transformation. Simply expos-
ing additional information to the compiler does not guarantee a runtime performance improvement.
In earlier examples, we demonstrated how the compiler generates fewer instructions or eliminates
branches entirely. Additionally, using DaCe it created further parallelization opportunities. We
discovered that these improvements are predominantly seen in codes where pointers are used
as iterators. While this pattern is common in many codebases as we will discuss in Table 2, no
immediate improvements will be offered to codes that do not exhibit this pattern.
3.1 Static decidability
A first assumption is that a pointer has to always point to a statically decidable data container in
any given section of the code. This is required to be able to deterministically identify data accesses.
Conditional reassignment introduces uncertainty as to which data container a pointer is pointing
to. In such cases, our analysis backs off and does not generate an adjunct. However, unconditional
or static pointer re-assignments are supported as they aren’t introducing uncertainty as to which
data container they point to. We can see two code snippets that show this difference in Figure 8.
ACM Trans. Arch. Code Optim.
Page 12:
PREPRINT12 Lepori et al.
With this assumption, we achieve flawless aliasing detection, as the compiler possesses static
knowledge of the data container associated with each pointer. Detecting accesses to the same data
simply involves verifying if the data container and indices are identical. Leveraging the adjunct
transformation streamlines index comparison to a straightforward integer check, often allowing
for static analysis if the abstract model is sufficiently precise.
It would be possible to expand the transformation to also support conditional pointers reas-
signment by creating adjunct versions for each container and replicating the branch decision for
each subsequent access. However, this would create unsustainable code and memory replication
requirements.
Decidable access - supported
int* p;
p=a;
// now p always points to a
p[i] = 5; // accessing a with an offset of i
p=b;
// now p always points to b
p[i] = 2; // accessing b with an offset of iUndecidable access - not supported
int* p;
if (...) { // unknown branch
p=a;
} else {
p=b;
}
// p could point to either a or b
p[i] = 5; // undecidable data container
Fig. 8. Undecidable vs decidable pointer accesses
3.2 Pointing to pointers
Double pointers - or higher order pointers - are supported by applying the transformation recur-
sively, as can be seen in Figure 9. Note that the data container to which pis pointing to is statically
known. As with simple pointers, we can trace the accesses 2and 3to container a(1) and get the
correct adjunct in this case. But the decidability issue remains, as we can see in Figure 10 (left side)
where the data container that is accessed is determined by the higher-order pointer location.
int* a=new int[10];
int** p= &a;
int x= (* p)[i]// -> a[i]
a++;
int y= (* p)[i]// -> a[i + 1]int* a=new int[10];
int a_adj = 0;
int** p= &a;1
int p_adj = 0;
int x= (p[p_adj ])[i+a_adj ]2
a_adj ++;
int y= (p[p_adj ])[i+a_adj ]3
Fig. 9. Example of adjunct transformation with double pointers
The general case could be handled with arrays of adjunct , as shown in Figure 10 (right side).
However, trying to leverage this approach on general codes would substantially increase the
complexity of the code and consume a significant amount of memory. Note that this approach
wasn’t implemented because of the unsustainability of the transformation. Therefore, if additional
assumptions can be made about the shape of the individual data containers pointed at, more
targeted, efficient approaches can be used as explained in Section 4.
ACM Trans. Arch. Code Optim.
Page 13:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 13
int* a=new int[10];
int* b=new int[10];
int** p=new int*[2];
p[0] = a;
p[1] = b;
if (...) {
p++;
}
int x= (* p)[i]// -> either a[i] or b[i]adjunct for new pointers
Array of adjunct for new pointer array
Copy current adjunct position for
future accesses through pint* a=new int[10];
int a_adj = 0;
int* b=new int[10];
int b_adj = 0;
int** p=new int*[2];
int* p_adj_array =new int[2];
int p_adj = 0;
p[0] = a;
p_adj_array [0] = a_adj ;
p[1] = b;
p_adj_array [1] = b_adj ;
if (...) {
p_adj ++;
}
int x= (p[p_adj ])[i+p_adj_array [p_adj ]];Offset of ppointer
Offset of array element Requested indexBased on which element
we are accessing with p
Fig. 10. Example of adjunct transformation with arrays of pointers
Although a multidimensional array shares the same memory representation as higher-order
pointers, the analysis differs. Our aim is to eliminate pointer movements and replace them with
more predictable array accesses. With multidimensional arrays, this pattern is inherent, and our
transformation results in code similar to what is typically observed, as pointers are generally not
relocated from multidimensional arrays. Prior proposals have suggested transformations to flatten
multidimensional arrays into a single dimension [ 31]. This approach was also included in GCC
between version 4.3 and 4.8. The method is orthogonal to ours and can be applied before or after
our transformation as it will not introduce additional pointer movements.
3.3 Evaluating applicability
In our investigation, we have identified the primary constraint of the adjunct transformation: higher-
order pointers. The requirement for statically decidable data containers is only a prerequisite of the
DaCe framework and not of the adjunct transformation itself. To gauge the broader viability of the
transformation, we conducted an assessment of its utilization in extensive codebases, specifically
focusing on the handling of higher-order pointers.
It is important to clarify our criteria for identifying non-applicable instances. We categorize
higher-order pointer usage as constraining only when it potentially involves pointer iteration.
Notably, we exempt instances where double pointers are employed to delegate memory allocation
responsibilities to another function (refer to Figure 11). Moreover, we exclude considerations of the
argv parameter in main functions due to its typically diminutive size and minimal involvement
in computationally intensive tasks. Our analysis considers application source code, omitting unit
tests and examples. Furthermore, our scope is confined to C and C++ files. In the same way, we
count single pointers as possible candidates for the adjunct transformation.
The outcomes of our investigation are presented in Table 2. Our examination encompassed
OpenSSL [ 4], notable for its cryptographic algorithm implementations, TurboBench [ 6], which
hosts a plethora of compression algorithm implementations, and the Linux kernel [3].
ACM Trans. Arch. Code Optim.
Page 14:
PREPRINT14 Lepori et al.
void init_pointer (char** p, int n) {
*p=malloc (n);
memset (*p, 0, n);
}
int main (int argc , char** argv ) {
char* buf;
init_pointer (&buf,atoi (argv [1]));
return 0;
}
Fig. 11. Example usage of double pointers to delegate memory allocation responsibility. This usage is fully
supported by the adjunct transformation.
Code-base LoC #applicable #non-applicable
OpenSSL 704,308 26,925 1,362
TurboBench 854,548 32,680 756
Linux kernel 24,940,479 1,656,065 4,972
Table 2. Number of lines of code (LoC), number of applicable instances of the adjunct transformation
(#applicable) and number of non applicable instances (#non-applicable) in different code-bases (only C and
C++).
Higher order pointers are indeed common in these codebases, as expected. However, their
frequency is generally several orders of magnitude lower than the total code volume and notably
even lower than single order pointer instances. It’s essential to note that the provided statistics
encompass pointers not exclusively used as iterators. This assessment offers a preliminary insight
into the method’s broader applicability, highlighting the prevalence of single pointers compared to
their higher order counterparts in programs.
4 FROM GENERIC TO SPECIFIC - REFINEMENTS FOR PRACTICAL C CODES
We will show how our approach towards improving the analyzability of pointers used as iterators
benefits real codes. However, we discovered that for such codes further refinements are needed
such as additional transformations and support for pointers to external calls. All transformation in
the following subsections were implemented and are done automatically by the compiler.
4.1 Improving analyzability by restructuring data
As discussed in Section 3, pointing to pointers is a more difficult to analyze scenario, as conditional
pointer movements can lead to static undecidability regarding which data container is accessed.
However, if additional information is known, the data structures themselves can be changed to
make analysis easier. As this kind of transformation targets a general structure they do not only
target a specific program but the entire set of programs that use that structure.
An example of a structure transformation that we implemented is inside the Mantevo HPCCG
benchmark. It uses sparse matrix storage, specifically the List of Lists (LIL) format. The matrix is
stored using two lists for each row containing the column indices and the non-zero values. The
row lists are stored as continuous data, and pointers to them are used to access the correct location
ACM Trans. Arch. Code Optim.
Page 15:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 15
for (int i=0; i<nrow ;i++) {
double sum = 0.0; // accumulate sum for each row
int cur_nnz =A->nnz_in_row [i];// number of non-zeros per row
for (int j=0; j<cur_nnz ;j++) {
sum +=A->ptr_to_vals [i][j]3// matrix element
*x[A->ptr_to_inds [i][j]]; // vector element
}
y[i] = sum;
}
Fig. 12. Matrix-vector multiplication with a sparse matrix in LIL format taken from the HPCCG benchmark
based on the current row. In Figure 13, we show the initialization code for a data structure from
the HPCCG benchmark (modified for simplicity).
The matrix is accessed using those two lists, as it can be seen in Figure 12 for a matrix-vector
multiplication from the same benchmark. With the knowledge that there exists a statically known
maximum number of non-zeroes per row, we can transform the list of lists into a matrix, as
when the data is accessed (like in 3) we see that the access resembles a 2D array. We transform
the initialization of the sparse matrix to utilize a 2D-like structure instead of a contiguous list.
The transformation is shown in Figure 13. Thanks to this change we now have data stored in a
more regular pattern (2D array). This enables compilers to apply standard transformations and
optimizations on 2D containers.
double * curvalptr =new double[ max_nnz *nrow ];
int * curindptr =new int[ max_nnz *nrow ];
for (...) { // for every row
A->ptr_to_vals [currow ] = curvalptr ;1
A->ptr_to_inds [currow ] = curindptr ;2
for (...) { // for every value
*curvalptr = 27.0;
*curindptr =curcol ;
curvalptr ++;
curindptr ++;
}
}// do not allocate curvalptr or curindptr
for (...) { // for every row
A->ptr_to_vals [currow ] = new double[ max_nnz ];
int vals_adj = 0;
A->ptr_to_inds [currow ] = new int[ max_nnz ];
int inds_adj = 0;
for (...) { // for every value
A->ptr_to_vals [currow ][vals_adj ] = 27.0;
A->ptr_to_inds [currow ][inds_adj ] = curcol ;
vals_adj ++;
inds_adj ++;
}
}
Fig. 13. Initialization of a sparse matrix in LIL format extracted from the Mantevo HPCCG benchmark. The
original code is on the left, while the transformed code is on the right.
The transformation is done first by finding two pointer assignments working on the same struct
inside the same scope ( 1and 2). We use a map to relate the right side of the assignment with the
left (curvalptr toA->ptr_to_vals[currow] ). We then create an adjunct and transform the code
in a similar way to the general pointer transformation. Using the previously created map we find all
pointer accesses that must be transformed and substitute them with array access using the adjunct .
In the same way, we find pointer arithmetic operations and substitute the pointer with the adjunct .
The benefit of using such specific pattern-based transformations is that they can be used in
conjunction with our general algorithm, and therefore be integrated into the same workflow. This
ACM Trans. Arch. Code Optim.
Page 16:
PREPRINT16 Lepori et al.
creates the possibility to efficiently handle a wide range of other access patterns without the need
for separate workflows or manual code modification.
It’s important to note that for multidimensional arrays, allocating a continuous buffer and
then splitting it by storing pointers introduces uncertainty for static analysis, as explained in
Section 3.2. Such patterns are unsupported by our current method. However, our LIL transformation
automatically converts this pattern into a 2D array. In other scenarios, implementing additional
transformations may be necessary to align with the limitations of our method.
4.2 Handling external calls with pointer arguments
Some parts of a program, such as external library calls are outside the scope of static compiler
analysis. These must therefore be considered black boxes. A common pattern in C when calling
external functions is to use pointers as arguments for inputs and outputs. One such example is the
memcpy function from the C standard library. The function with the following signature:
void* memcpy (void* destination , const void* source , size_t num);
takes two pointers and an integer numand copies numbytes from the source pointer to the
destination .
In the previous section, we discussed the possible modifications and movement of pointers and
their impact on decidability. In the example of memcpy pointers are not modified. In C arguments
are passed by copy and not by reference therefore callee modifications of arguments are not
propagated to the caller. The same argument applies to pointers. What is modified in those cases is
the content of the pointer, hence the elements of the data container that the pointer is pointing to.
A function taking a pointer as an argument can modify its contents. This poses a challenge as data
dependencies can no longer be tracked. In Figure 14, we use memcpy instead of a simple assignment.
char* p=new char[1];
*p= 0;
for (int i=0; i<n;i++) {
char q= *p+i; // use *p from previous
iteration ↩→
memcpy (p, &q, 1); // equal to *p = q
}
With memcpy dependency
q = *p + i
memcpy(p, &q, 1)qp i
for loopint* q=new char[1];
*q= 42;
for (int i=0; i<n;i++) {
char p= 0;
memcpy (&p,q, 1); // equal to p = *q
data [i] = p+i;// data[i] = 42 + i;
}
Without memcpy dependency
q = *p + i
memcpy(p, &q, 1)qp i
for loop
Fig. 14. Left side: memcpy creates a dependency between loop iterations. Right side: pis a loop local variable
andmemcpy is used to assign values to it.
On the left side of the figure the data in pdepends on the value previously written to it by memcpy
(dependency shown by the dashed arrow). This enforces that the loop must be executed in order. If
ACM Trans. Arch. Code Optim.
Page 17:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 17
we remove this dependency (dashed arrow) then loop iterations would be independent. This would
permit the compiler to vectorize the loop producing the incorrect result.
We previously assumed that the second argument of memcpy is only read, but this is an insight
gained by looking at the implementation of memcpy and not black box analysis. Without outside
information, no compiler framework can overcome this issue. In the right side of Figure 14 qis only
read from but the compiler cannot infer this. This creates a spurious data dependency between
iterations and prevents the loop from being parallelized.
While it is good practice to make pointer arguments that are read-only const , not every function
or library does so. Our solution is to use whitelists for often-used library calls - manually annotating
library functions to specify if a pointer is write-only, read-write, or read-only. This whitelist could
become a database of known functions that could even be automatically populated through compiler
analysis, as it only needs to be performed once for any given library implementation. Annotations
only provide information about which data container was read or written. We, therefore, assume
conservatively that the entire data container could be read or written.
4.3 Stateful external calls
Some libraries store an internal state that is opaque to the program calling functions provided by
the library — great examples being MPI or OpenGL. One such library is a part of OpenSSL, namely
HMAC (Hash Message Authentication Code) — it uses an internal state to store hashes and keys
between calls. This data is saved in an internal data structure and a pointer to it is provided to
calling functions. This pointer (called context in this implementation) is never modified by the user
code, but only by library calls.
HMAC_CTX *hctx =HMAC_CTX_new (); 1
for (...) {
HMAC_CTX_copy (hctx ,base_hctx );2
// do operations with hctx
}
HMAC_CTX_free (hctx );
In the example above, without any additional information, it appears that the hctx pointer is ini-
tialized outside the loop ( 1), and that the call to HMAC_CTX_copy within the loop ( 2) could read and
write to hctx . This creates a data dependency between loop iterations. In reality, HMAC_CTX_copy
does not read the data of the first argument ( hctx ) — that value is only written to and effectively
used as a local variable within the scope of the loop. Therefore, by marking the dependency on
(hctx ) inHMAC_CTX_copy as write-only and noticing that it is never used afterwards, the loop can
be parallelized. This approach essentially performs an escape analysis [ 40], identifying that the
variable within the loop does not escape the thread executing each iteration.
5 EVALUATION
To evaluate our transformation we extended the DaCe [ 14,19] pipeline by adding the pointer
disaggregation transformation explained in Section 2 and the specific transformations explained
in Section 4. The approach described in Section 4.2 and Section 4.3 is implemented by adding a
whitelist of additional dependencies in the data-centric IR (intermediate representation) used by
DaCe.
As detailed and demonstrated in Section 2.5, while incremental performance enhancements are
attainable across various compilers, the constrained scope of vectorization inhibits substantial
performance gains. Consequently, we prioritize automatic parallelization compilers like Polly and
DaCe, which hold promise in uncovering novel opportunities facilitated by the transformations
ACM Trans. Arch. Code Optim.
Page 18:
PREPRINT18 Lepori et al.
implemented. Specifically, we direct our attention to DaCe due to its inherent data-centric approach,
which offers superior analysis of data dependencies.
We measured performance on a dual-socket 2x6 core (2x12 threads) Intel Xeon X5670 @ 2.93GHz
with 48GB of RAM. We compare the DaCe data-centric framework (using GCC 12.1.1 as a backend
compiler) against GCC and Polly (using Clang 15.0.6). We report the median of 10 runs with a
confidence interval of 95%. OpenMP is used for parallel execution. Integrating the adjunct trans-
formation into Polly did not reveal any additional Static Control Parts (SCoPs), and consequently,
no further parallelization opportunities were identified. Hence, we provide the runtime results
of Polly without the adjunct transformation. We specifically chose to compare with Polly due
to its widespread availability as part of the LLVM/Clang suite, offering some level of automatic
parallelization capabilities. Our evaluation was conducted on the Mantevo HPCCG benchmark and
the OpenSSL PBKDF2 implementation, with the results presented below.
C transformation Data centric Hardware specificInput C file
C AST
Data flow graph
Parallel C++ code
with OpenMPCPU
code generatorFPGA
code generatorCUDA
code generator
Machine
instructionsC parsingAST transformations
including adjunct
Transformation to
data centric IRData-centric
transformations
C++
compilation
Fig. 15. Workflow for C program optimization including adjunct transformation and leveraging the DaCe
data-centric framework.
5.1 OpenSSL PBKDF2
The Password Based Key Derivation Function 2 (PBKDF2) is a derivation function that derives a
cryptographically secure key from a password. PBKDF2 applies a Hash-based Message Authentica-
tion Code (HMAC) function multiple times to a provided password and salt. Based on the required
length of the key this process is repeated multiple times. A summary of the process is provided in
Figure 16. We can notice that each single 𝐵𝑖blocks can be computed independently of any other.
ACM Trans. Arch. Code Optim.
Page 19:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 19
𝑘𝑒𝑦=𝑃𝐵𝐾𝐷𝐹 2(𝐻𝑀𝐴𝐶,𝑝𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑠𝑎𝑙𝑡,𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠,𝑘𝑒𝑦𝐿𝑒𝑛 )
=𝐵0++𝐵1++...++𝐵𝑘𝑒𝑦𝐿𝑒𝑛/ℎ𝑙𝑒𝑛
𝐵𝑖=𝑈𝑖
0⊕𝑈𝑖
1⊕...⊕𝑈𝑖
𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠
𝑈𝑖
0=𝐻𝑀𝐴𝐶(𝑝𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑠𝑎𝑙𝑡+𝑖)
𝑈𝑖
1=𝐻𝑀𝐴𝐶(𝑝𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑈𝑖
0)
...
𝑈𝑖
𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠 =𝐻𝑀𝐴𝐶(𝑃𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑈𝑖
𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠−1)
Fig. 16. PBKDF2 scheme. In this representation, ++is the string concatenation, ⊕is the XOR operator and ℎ𝑙𝑒𝑛
is the output length of the HMAC function.
124 8 12 24051015
DaCePolly
Manual OpenMP
FastPBKDF23.7x3.5xTime [s]#blks=4
124 8 12 240102030
DaCePolly
Manual OpenMP
FastPBKDF26.2x
7.2x#blks=8
124 8 12 240204060
DaCePolly
Manual OpenMP
FastPBKDF28.9x
8.9x
ThreadsTime [s]#blks=16
124 12 24020406080
DaCePolly
Manual OpenMP
FastPBKDF29.7x10.7x
Threads#blks=24
PBKDF2 with various block sizes
Fig. 17. Runtime of various parallelized PBKDF2 implementations using different block counts (block count
determines the problem size) and using multiple thread counts. Increasing the thread count beyond the
number of blocks would be ineffective as the work could not be meaningfully divided among threads.
The implementation inside OpenSSL does not include parallel options. We applied our pipeline
to the OpenSSL code to auto-parallelize the code. We compare the runtime against Polly and a
manually parallel version that was written using the OpenSSL implementation as a starting point.
Only the PBKDF2 algorithm was analyzed, all HMAC calls are treated as external calls and use the
ACM Trans. Arch. Code Optim.
Page 20:
PREPRINT20 Lepori et al.
standard OpenSSL implementation. For completeness, we compare it with FastPBKDF2 [ 16], an
implementation of the same algorithm that has been developed for parallelism from the ground up.
We used SHA1 as the HMAC function, 5·106iterations, and an output key size of 480 bytes (or 24
blocks), and summarize our results in Figure 17.
We then varied the problem size, represented by the number of blocks. The results (Figure 17)
show a consistent trend across all experiments. Our approach was able to obtain comparable results
to the manually parallelized version, providing at most a 10.7x improvement with 24 threads. Polly
wasn’t able to identify any parallel opportunities due to the challenges of analyzing pointers. The
FastPBKDF2 implementation is considerably faster than any OpenSSL equivalent, mainly due to
the serial runtime of FastPBKDF2 version being 3 times faster, 27.7s versus 84.5s for OpenSSL. This
reinforces the idea that tools can provide significant performance increases and find opportunities
for parallelism, but nothing surpasses finding a better algorithm altogether.
5.2 Mantevo HPCCG
The Mantevo HPCCG benchmark computes the conjugate gradient on a sparse matrix. The sparse
matrix is stored using the List of Lists (LIL) format. We leverage the specific transformation
explained in Section 4.1 to improve analyzability. The LIL transformation is applied once only to
the input file for DaCe. We compare our pipeline against Polly with the same preprocessed input.
The baseline is obtained using the original benchmark with OpenMP enabled.
124 8 12 2402468
DaCe
Hand-tunedPolly
2.4xSerialTime [s]Size1003
124 8 12 240102030
DaCeHand-tunedPolly
2.7xSerialSize1503
124 8 12 24020406080
DaCe Hand-tunedPolly
2.9xSerial
ThreadsTime [s]Size2003
124 8 12 24050100150
DaCeHand-tunedPolly
2.9xSerial
ThreadsSize2503
HPCCG with different problem sizes
Fig. 18. Runtime of Mantevo HPCCG benchmark with different compilers.
We can see that Polly was not able to find parallel opportunities as pointers are used to access
the sparse matrix. Our approach was able to parallelize all five loops. The performance of our
automatically parallelized code matches the hand-tuned version developers created across all
ACM Trans. Arch. Code Optim.
Page 21:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 21
experiments. At larger problem sizes, our approach even outperforms the reference implementation
by up to 18%.
5.3 Lempel–Ziv–Oberhumer compression algorithm
In Subsection 2.5, we presented results from a benchmark on the Lempel–Ziv–Oberhumer (LZO)
compression algorithm. Although automatic parallelization wasn’t achieved, there was still a modest
enhancement in single-threaded performance. Achieving effective automatic parallelization would
necessitate an algorithm redesign due to the presence of loop-carried dependencies. However, our
method managed to achieve a 4% runtime enhancement without requiring any algorithm rewriting.
Given the analogous pointer movement patterns observed in most compression algorithms, we
anticipate similar, if not superior, outcomes by extending our pipeline to encompass them as well.
6 RELATED WORK
Pointer analysis in C. Pointer analysis has been a topic of research for decades. One of the most
impactful approaches, the "points-to" analysis, was proposed by Andersen et al. [ 10] and developed
and improved over the years [ 18,24,37]. Our approach is orthogonal to this method, and while more
constrained in the patterns it detects, it provides a powerful benefit towards automatic parallelism
detection.
Parallelization of C programs. A large and increasing number of approaches exist that are geared
toward finding parallelism in C codes, with different degrees of automation. Tools like DiscoPoP [ 43]
focus on identifying loops that could be parallelized using OpenMP constructs, while frameworks
like Polly [ 30] and Pluto [ 17] leverage the polyhedral representation of loops to automatically
generate parallel code. Neither of these approaches can currently tackle codes using pointers as
iterators, but we believe it should be possible for our approach to be leveraged in these workflows
to increase their reach, as we have shown in this work with DaCe.
Parallelization of programs with pointers. Utilizing the Distributed Dynamic Pascal (DDP) lan-
guage [ 33] facilitates the development of parallel programs involving pointers. The DDP language
introduces explicit pointer access mechanisms to enable the parallelization of pointer-related oper-
ations. In contrast, our approach seamlessly integrates with existing C programs, obviating the
necessity for transpilation processes.
Intermediate representations. The LLVM IR [ 41] allows for many transformations and code
optimizations. However, tracking pointer dependencies across scopes remains difficult. A new
opportunity has appeared with the recent rise of MLIR [ 42], a framework that allows multiple
IR dialects to co-exist and their different strengths to all contribute to overall code analyzability
and performance. The MemRefType built-in dialect already enables to specify the shape of the
underlying data. This includes the option to save the start of a data container even if the pointer is
moved. MLIR would permit our approach to be implemented as a dialect, bringing it to the wider
LLVM ecosystem.
Source code modification. The alteration of source code within digital signal processing (DSP)
programs facilitates the elimination of pointer references as outlined in the work by Franke et al.
[25]. While the approach bears some resemblance to our proposed method, it exhibits comparatively
reduced flexibility. It exclusively encompasses static code modifications, devoid of the incorporation
of the adjunct concept, which enables the utilization of compile-time determined values exclusively.
The primary aim remains the enhancement of computational efficiency, albeit without resorting to
parallel processing techniques.
Pointers metadata. Incorporating metadata into pointers is not a novel concept, as evidenced
by previous studies [ 50,53]. In other scenarios, where the emphasis is on safety rather than
ACM Trans. Arch. Code Optim.
Page 22:
PREPRINT22 Lepori et al.
performance, statically known mappings to pointers have been established. This facilitates the
addition of runtime bounds checking, analogous to how we introduce runtime pointer movements
through source code modifications.
7 CONCLUSION
In this work, we introduce a static transformation that separates the use of pointers as handles
to data containers from their use as iterators. This improves code analyzability and data-centric
compilers are shown to benefit from this transformation as eliminating indirection exposes ad-
ditional parallelization opportunities. Using our approach on the Mantevo HPCCG benchmark
we were able to match the developer-optimized version and even surpass it by up to 6%. On the
OpenSSL PBKDF2 implementation, we were able to automatically parallelize it and obtain up to an
11x speedup.
ACKNOWLEDGMENTS
This project has received funding from EuroHPC-JU under grant agreement DEEP-SEA No 95560
and by the European Research Council (ERC) under the European Union’s Horizon 2020 program
(grant agreement PSAP, No. 101002047). This work was partially supported by the ETH Future
Computing Laboratory (EFCL), financed by a donation from Huawei Technologies.
A PROOF SKETCH OF POINTER ASSIGNMENT
We want to show that the assignment of pointers gives equivalent results after the application of
theadjunct transformation described in Figure 4. To do so we introduce a new pointer 𝑞with its
𝑞𝑎𝑑𝑗. Let𝑛,𝑚 be arbitrary integers then we define 𝑋,𝑌 and𝑋′,𝑌′as
𝑋=(𝑝:=𝑝+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛)
𝑌= 𝑣𝑖:=𝑣𝑖+𝑥𝑖|𝑥𝑖∈Z&𝑣𝑖∈{𝑝,𝑞},𝑛≤𝑖<𝑚
𝑋′= 𝑝𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛
𝑌′=
𝑣𝑖
𝑎𝑑𝑗:=𝑣𝑖
𝑎𝑑𝑗+𝑥𝑖|𝑥𝑖∈Z&𝑣𝑖
𝑎𝑑𝑗∈{𝑝𝑎𝑑𝑗,𝑞𝑎𝑑𝑗},𝑛≤𝑖<𝑚
Using ;to concatenate commands we define 𝑆with
𝑆=(𝑋;(𝑞:=𝑝;𝑌))
Where it represents the execution of an arbitrary number of pointer movement statements followed
by a pointer assignment and another arbitrary number of pointer movement statements. In the
same way we define 𝑆′as the adjunct transformation of 𝑆.
𝑆′=(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;𝑌′)))
Note that the ;operator is right associative. Then we define 𝑌′
𝑞as
𝑌′
𝑞=𝑚∑︁
𝑖=𝑛|𝑣𝑖
𝑎𝑑 𝑗=𝑞𝑎𝑑 𝑗𝑥𝑖
i.e. the difference of value after executing 𝑌′on𝑞𝑎𝑑𝑗. Then(𝑌′;𝑞𝑎𝑑𝑗)=𝑞𝑎𝑑𝑗+𝑌′
𝑞. Note that𝑋and
𝑋′only act on 𝑝and𝑝𝑎𝑑𝑗respectively. This is to have a simpler proof but it can be easily shown
that by defining 𝑋and𝑋′in a similar way as 𝑌and𝑌′the same result will be obtained.
For sake of a shorter syntax and more readable proof we omit the declaration of the pointers and
adjunct integer variables. I.e.
(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑞:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑞𝑎𝑑𝑗:=0;(𝑆;𝑞)))))
ACM Trans. Arch. Code Optim.
Page 23:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 23
as it will be carried without modifications through all the proof steps. Then we can derive
𝑆;𝑞=𝑋;(𝑞:=𝑝;(𝑌;𝑞))
(adjunct transformation) =𝑋;(𝑞:=𝑝;(𝑞+(𝑌′;𝑞𝑎𝑑𝑗)))
(distributive) =𝑋;((𝑞:=𝑝;𝑞)+(𝑞:=𝑝;(𝑌′;𝑞𝑎𝑑𝑗)))
(no effect statement) =𝑋;((𝑞:=𝑝;𝑞)+(𝑌′;𝑞𝑎𝑑𝑗))
(assignment) =𝑋;(𝑝+(𝑌′;𝑞𝑎𝑑𝑗))
(distributive) =(𝑋;𝑝)+(𝑋;(𝑌′;𝑞𝑎𝑑𝑗))
(adjunct transformation) =((𝑋′;𝑝)+(𝑋′;𝑝𝑎𝑑𝑗))+(𝑋;(𝑌′;𝑞𝑎𝑑𝑗))
(no effect statement) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+(𝑋;(𝑌′;𝑞𝑎𝑑𝑗))
(no effect statement) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+(𝑌′;𝑞𝑎𝑑𝑗)
(sum expantion of adjunct) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+𝑞𝑎𝑑𝑗+𝑌′
𝑞
(adjunct value of 0) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+𝑌′
𝑞
(no effect statement) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+(𝑋′;𝑌′
𝑞)
(distributive) =𝑝+(𝑋′;(𝑝𝑎𝑑𝑗+𝑌′
𝑞))
(no effect statement) =𝑝+(𝑋′;(𝑞:=𝑝;(𝑝𝑎𝑑𝑗+𝑌′
𝑞)))
(assignment) =𝑝+(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;𝑞𝑎𝑑𝑗+𝑌′
𝑞))))
(𝑌′
𝑞definition) =𝑝+(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;(𝑌′;𝑞𝑎𝑑𝑗))))
(S’ definition) =𝑝+(𝑆′;𝑞𝑎𝑑𝑗)
(no effect statement) =(𝑋′;𝑝)+(𝑆′;𝑞𝑎𝑑𝑗)
(assignment) =(𝑋′;(𝑞:=𝑝;𝑞))+(𝑆′;𝑞𝑎𝑑𝑗)
(no effect statement) =(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;𝑞)))+(𝑆′;𝑞𝑎𝑑𝑗)
(no effect statement) =(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;(𝑌′;𝑞))))+(𝑆′;𝑞𝑎𝑑𝑗)
(S’ definition) =(𝑆′;𝑞)+(𝑆′;𝑞𝑎𝑑𝑗)=𝑞+(𝑆′;𝑞𝑎𝑑𝑗)
Now we have proven the equivalence of the pointer assignment transformation stated in Figure 4.
REFERENCES
[1][n. d.]. BTRFS compression documentation. https://btrfs.readthedocs.io/en/latest/Compression.html. Accessed:
2024-09-05.
[2] [n. d.]. lib, initramfs: Add initramfs LZO compression. https://lore.kernel.org/all/1238593252-3435-1-git-send-email-
andr345@gmail.com/. Accessed: 2024-09-05.
[3] [n. d.]. Linux kernel repository. https://github.com/torvalds/linux. Accessed: 2024-14-05.
[4] [n. d.]. OpenSSL repository. https://github.com/openssl/openssl. Accessed: 2024-14-05.
[5] [n. d.]. Squashfs documentation. https://dr-emann.github.io/squashfs/squashfs.html. Accessed: 2024-09-05.
[6] [n. d.]. TurboBench repository. https://github.com/powturbo/TurboBench. Accessed: 2024-14-05.
[7][n. d.]. zram documentation. https://www.kernel.org/doc/html/next/admin-guide/blockdev/zram.html. Accessed:
2024-09-05.
[8] [n. d.]. zswap documentation. https://www.kernel.org/doc/html/v4.18/vm/zswap.html. Accessed: 2024-09-05.
[9] Javed Absar. 2018. Scalar evolution-demystified. In European LLVM Developers Meeting .
[10] Lars Ole Andersen and Peter Lee. 2005. Program Analysis and Specialization for the C Programming Language.
[11] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A
Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al .2006. The landscape of parallel computing
research: A view from berkeley. (2006).
[12] Gordon Bell. 1994. Scalable, parallel computers: alternatives, issues, and challenges. International Journal of Parallel
Programming 22, 1 (1994), 3–46.
ACM Trans. Arch. Code Optim.
Page 24:
PREPRINT24 Lepori et al.
[13] MA Belyaev, M Akhin, and VM Itsykson. 2014. Improving static analysis by loop unrolling on an arbitrary iteration.
University Scientific Journal 8 (2014), 154–168.
[14] Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, and Torsten Hoefler. 2019. State-
ful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures. In
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19) .
[15] M.D. Beynon, H. Andrade, and J. Saltz. 2002. Low-cost non-intrusive debugging strategies for distributed parallel
programs. In Proceedings. IEEE International Conference on Cluster Computing . 439–442. https://doi.org/10.1109/
CLUSTR.2002.1137778
[16] Joseph Birr-Pixton. 2015. PBKDF2: performance matters. https://jbp.io/2015/08/11/pbkdf2-performance-matters.html.
Accessed: 2023-01-31.
[17] Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan.
2008. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the
Polyhedral Model. In Compiler Construction , Laurie Hendren (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg,
132–146.
[18] Martin Bravenboer and Yannis Smaragdakis. 2009. Strictly Declarative Specification of Sophisticated Points-to
Analyses. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages
and Applications (Orlando, Florida, USA) (OOPSLA ’09) . Association for Computing Machinery, New York, NY, USA,
243–262. https://doi.org/10.1145/1640089.1640108
[19] Alexandru Calotoiu, Tal Ben-Nun, Grzegorz Kwasniewski, Johannes de Fine Licht, Timo Schneider, Philipp Schaad,
and Torsten Hoefler. 2022. Lifting C Semantics for Dataflow Optimization. In Proceedings of the 36th ACM International
Conference on Supercomputing (Virtual Event) (ICS ’22) . Association for Computing Machinery, New York, NY, USA,
Article 17, 13 pages. https://doi.org/10.1145/3524059.3532389
[20] Stephen Cass. 2018. Chip Hall of Fame: Intel 4004 Microprocessor. https://spectrum.ieee.org/chip-hall-of-fame-intel-
4004-microprocessor. Accessed: 2023-01-05.
[21] Jong-Deok Choi, Barton P. Miller, and Robert H. B. Netzer. 1991. Techniques for Debugging Parallel Programs with
Flowback Analysis. ACM Trans. Program. Lang. Syst. 13, 4 (oct 1991), 491–530. https://doi.org/10.1145/115372.115324
[22] L. Dagum and R. Menon. 1998. OpenMP: an industry standard API for shared-memory programming. IEEE Computa-
tional Science and Engineering 5, 1 (1998), 46–55. https://doi.org/10.1109/99.660313
[23] Anthony Danalis, Heike Jagode, and Jack Dongarra. 2018. PAPI: Counting outside the Box. (2018-04 2018).
[24] Maryam Emami, Rakesh Ghiya, and Laurie J. Hendren. 1994. Context-Sensitive Interprocedural Points-to Analysis in
the Presence of Function Pointers. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language
Design and Implementation (Orlando, Florida, USA) (PLDI ’94) . Association for Computing Machinery, New York, NY,
USA, 242–256. https://doi.org/10.1145/178243.178264
[25] Björn Franke and Michael O’Boyle. 2001. Compiler Transformation of Pointers to Explicit Array Accesses in DSP
Applications. In Compiler Construction , Reinhard Wilhelm (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 69–85.
[26] Roy Frostig, Matthew Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing.
https://mlsys.org/Conferences/doc/2018/146.pdf
[27] Fabrizio Gagliardi, Miquel Moreto, Mauro Olivieri, and Mateo Valero. 2019. The international race towards Exascale in
Europe. CCF Transactions on High Performance Computing 1, 1 (2019), 3–13.
[28] GitHub. 2022. GitHub API, repository count. https://api.github.com/search/repositories?q=language:Ccreated:<=2022-
12-31. Accessed: 2023-01-02.
[29] GitHub. 2022. GitHub Top Programming Languages. https://octoverse.github.com/2022/top-programming-languages.
Accessed: 2023-01-02.
[30] TOBIAS GROSSER, ARMIN GROESSLINGER, and CHRISTIAN LENGAUER. 2012. POLLY — PERFORMING POLYHE-
DRAL OPTIMIZATIONS ON A LOW-LEVEL INTERMEDIATE REPRESENTATION. Parallel Processing Letters 22, 04
(2012), 1250010. https://doi.org/10.1142/S0129626412500107 arXiv:https://doi.org/10.1142/S0129626412500107
[31] Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop. 2015. Optimistic Delinearization
of Parametrically Sized Arrays. In Proceedings of the 29th ACM on International Conference on Supercomputing (Newport
Beach, California, USA) (ICS ’15) . Association for Computing Machinery, New York, NY, USA, 351–360. https:
//doi.org/10.1145/2751205.2751248
[32] Richard Günther. 2006. Profile driven loop transformations. In GCC Developers’ Summit . Citeseer, 49.
[33] R. Gupta. 1992. SPMD execution of programs with dynamic data structures on distributed memory machines. In
Proceedings of the 1992 International Conference on Computer Languages . 232–241. https://doi.org/10.1109/ICCL.1992.
185487
[34] ISO. 1999. ISO C Standard 1999 . Technical Report. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
Subsection 6.5.2 Postfix operators.
ACM Trans. Arch. Code Optim.
Page 25:
PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 25
[35] Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. Data Movement Is All You Need: A
Case Study on Optimizing Transformers. https://doi.org/10.48550/ARXIV.2007.00072
[36] Jason Kane and Qing Yang. 2012. Compression Speed Enhancements to LZO for Multi-core Systems. In 2012 IEEE 24th
International Symposium on Computer Architecture and High Performance Computing . 108–115. https://doi.org/10.1109/
SBAC-PAD.2012.29
[37] George Kastrinis and Yannis Smaragdakis. 2013. Hybrid Context-Sensitivity for Points-to Analysis. In Proceedings of the
34th ACM SIGPLAN Conference on Programming Language Design and Implementation (Seattle, Washington, USA) (PLDI
’13). Association for Computing Machinery, New York, NY, USA, 423–434. https://doi.org/10.1145/2491956.2462191
[38] Yeonsoo Kim, Shinhyung Yang, Shin-Dug Kim, and Bernd Burgstaller. 2019. Code Optimizations for In-network
Processing: The State of Scalar Evolution in LLVM: The State of Scalar Evolution in LLVM. Proceedings of the Korea
Telecommunications Society (2019), 434–437.
[39] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018.
HPVM: Heterogeneous Parallel Virtual Machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming (Vienna, Austria) (PPoPP ’18) . Association for Computing Machinery, New York,
NY, USA, 68–80. https://doi.org/10.1145/3178487.3178493
[40] Thomas Kotzmann and Hanspeter Mössenböck. 2005. Escape analysis in the context of dynamic compilation and
deoptimization. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments
(Chicago, IL, USA) (VEE ’05) . Association for Computing Machinery, New York, NY, USA, 111–120. https://doi.org/10.
1145/1064979.1064996
[41] C. Lattner and V. Adve. 2004. LLVM: a compilation framework for lifelong program analysis & transformation. In
International Symposium on Code Generation and Optimization, 2004. CGO 2004. 75–86. https://doi.org/10.1109/CGO.
2004.1281665
[42] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana
Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A Compiler Infrastructure for the End of Moore’s
Law. arXiv:2002.11054 [cs.PL]
[43] Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf. 2015. DiscoPoP: A profiling tool to identify parallelization
opportunities. In Tools for High Performance Computing 2014: Proceedings of the 8th International Workshop on Parallel
Tools for High Performance Computing, October 2014, HLRS, Stuttgart, Germany . Springer, 37–54.
[44] Message Passing Interface Forum. 2021. MPI: A Message-Passing Interface Standard Version 4.0 . https://www.mpi-
forum.org/docs/mpi-4.0/mpi40-report.pdf
[45] Gordon E. Moore. 2006. Cramming more components onto integrated circuits, Reprinted from Electronics, volume
38, number 8, April 19, 1965, pp.114 ff. IEEE Solid-State Circuits Society Newsletter 11, 3 (2006), 33–35. https:
//doi.org/10.1109/N-SSC.2006.4785860
[46] Markus F.X.J. Oberhumer. 2017. Lempel–Ziv–Oberhumer. https://www.oberhumer.com/opensource/lzo/. Accessed:
2024-09-05.
[47] Geraldo F. Oliveira, Juan Gómez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad
Sadrosadati, and Onur Mutlu. 2021. DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement
Bottlenecks. IEEE Access 9 (2021), 134457–134502. https://doi.org/10.1109/ACCESS.2021.3110993
[48] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Ama-
rasinghe, and Frédo Durand. 2017. Halide: Decoupling Algorithms from Schedules for High-Performance Image
Processing. Commun. ACM 61, 1 (dec 2017), 106–115. https://doi.org/10.1145/3150211
[49] Dennis M. Ritchie. 1996. The Development of the C Programming Language . Association for Computing Machinery,
New York, NY, USA, 671–698. https://doi.org/10.1145/234286.1057834
[50] Matthew S. Simpson and Rajeev K. Barua. 2010. MemSafe: Ensuring the Spatial and Temporal Memory Safety of
C at Runtime. In 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation . 199–208. https:
//doi.org/10.1109/SCAM.2010.15
[51] Yulei Sui, Xiaokang Fan, Hao Zhou, and Jingling Xue. 2018. Loop-Oriented Pointer Analysis for Automatic SIMD
Vectorization. ACM Trans. Embed. Comput. Syst. 17, 2, Article 56 (jan 2018), 31 pages. https://doi.org/10.1145/3168364
[52] TIOBE. 2022. TIOBE index. https://www.tiobe.com/tiobe-index/. Accessed: 2023-01-02.
[53] Jie Zhou, John Criswell, and Michael Hicks. 2023. Fat Pointers for Temporal Memory Safety of C. Proc. ACM Program.
Lang. 7, OOPSLA1, Article 86 (apr 2023), 32 pages. https://doi.org/10.1145/3586038
ACM Trans. Arch. Code Optim.