loader
Generating audio...

arxiv

Paper 2503.03359

Iterating Pointers: Enabling Static Analysis for Loop-based Pointers

Authors: Andrea Lepori, Alexandru Calotoiu, Torsten Hoefler

Published: 2025-03-05

Abstract:

Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer's standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are often hard to track, and challenging to understand for optimizers, compilers, and sometimes, even for the developers themselves because of the direct memory access they provide. We alleviate this problem by exposing additional information to analyzers and compilers. By separating the concept of a pointer into a data container and an offset, we can optimize C programs beyond what other state-of-the-art approaches are capable of, in some cases even enabling auto-parallelization. Using this process, we are able to successfully analyze and optimize code from OpenSSL, the Mantevo benchmark suite, and the Lempel-Ziv-Oberhumer compression algorithm. We provide the only automatic approach able to find all parallelization opportunities in the HPCCG benchmark from the Mantevo suite the developers identified and even outperform the reference implementation by up to 18%, as well as speed up the PBKDF2 algorithm implementation from OpenSSL by up to 11x.

Paper Content:
Page 1: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers ANDREA LEPORI, ETH Zurich, Switzerland ALEXANDRU CALOTOIU, ETH Zurich, Switzerland TORSTEN HOEFLER, ETH Zurich, Switzerland Pointers are an integral part of C and other programming languages. They enable substantial flexibility from the programmer’s standpoint, allowing the user fine, unmediated control over data access patterns. However, accesses done through pointers are often hard to track, and challenging to understand for optimizers, compilers, and sometimes, even for the developers themselves because of the direct memory access they provide. We alleviate this problem by exposing additional information to analyzers and compilers. By separating the concept of a pointer into a data container and an offset, we can optimize C programs beyond what other state-of-the-art approaches are capable of, in some cases even enabling auto-parallelization. Using this process, we are able to successfully analyze and optimize code from OpenSSL, the Mantevo benchmark suite, and the Lempel–Ziv–Oberhumer compression algorithm. We provide the only automatic approach able to find all parallelization opportunities in the HPCCG benchmark from the Mantevo suite the developers identified and even outperform the reference implementation by up to 18%, as well as speed up the PBKDF2 algorithm implementation from OpenSSL by up to 11x. CCS Concepts: •Software and its engineering →Source code generation ;•Computing methodologies →Parallel computing methodologies . Additional Key Words and Phrases: pointer analysis, automatic parallelization 1 INTRODUCTION C is a widely used programming language, one that is heavily relied upon in many codebases. In 2022 on GitHub C and C++ accounted together for 4.6M repositories [ 28], this being a 23.5% growth compared to the previous year [ 29]. According to the TIOBE index [ 52], C is the second most popular programming language. The C language was developed in the 1970s at about the same time as the first microprocessor [ 20, 49]. Parallel computing only became widespread decades later, driven by the end of Dennard scaling [ 12,45]. Hardware efforts to continue growth following Moore’s Law past the power wall lead to a renewed focus on parallel computing as a way to continue improving performance [ 11,27]. C was not designed with parallel programming in mind as its inception was much earlier than the explosion of parallel computing. Libraries provide support for parallel paradigms - such as OpenMP [ 22] or MPI [ 44] - but add a difficult-to-manage layer of complexity. The flexibility and low-level nature of C code make debugging challenging, even more so when the code is executed in parallel [15, 21]. This makes parallelizing code a long and complex process. New Paper, Not an Extension of a Conference Paper Authors’ addresses: Andrea Lepori, ETH Zurich, Switzerland, andrea.lepori@inf.ethz.ch; Alexandru Calotoiu, ETH Zurich, Switzerland, acalotoiu@inf.ethz.ch; Torsten Hoefler, ETH Zurich, Switzerland, torsten.hoefler@inf.ethz.ch. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). ©2024 Copyright held by the owner/author(s). 1544-3566/2024/3-ART https://doi.org/10.1145/3701993 ACM Trans. Arch. Code Optim.arXiv:2503.03359v1 [cs.PL] 5 Mar 2025 Page 2: PREPRINT2 Lepori et al. Frameworks such as Polly [ 30] and Pluto [ 17] promise automatic parallelism — these are however limited to static control parts (SCoPs). The Intel compiler offers the icc parallel mode that can identify and auto-parallelize several predefined programming patterns. Data-centric methods [ 19] try to push the boundary even further by leveraging complex dataflow analysis techniques to better understand data movement and dependencies. On modern systems, data movement is the most expensive operation in most programs, concerning both time and energy consumption [ 35], and tends to be the biggest bottleneck in computations [ 47]. Many frameworks and compilers such as HPVM [ 39], Halide [ 48], Jax [ 26], and DaCe [ 14] leverage dataflow analysis extensively in their pipeline. The major obstacle in the static analysis of data accesses is unfortunately the use of pointers. Though they are an integral part of the C language, they often become a barrier on the road to performance. This can be seen in the two examples in Figure 1. On the right side of the figure we can see that the inner loop only has independent data accesses, and modern compilers are able to detect that. But no compiler we tested was able to detect that the outer loop has no data dependency. Every access done with p[k] is different because of the pointer movement in the outer loop. Current analysis methods do not consider this type of access as it would require complex and expensive pointer tracking, and even then it would require a correlation between the pointer movement and loop iterations. Accessing the same pointer at a different in- dex does not guarantee that different data is ac- cessed — pointers can move during execution.Accessing the same pointer at the same location with access to different data. Creation of false positives in data dependency analysis. p[i+1] = 0; // array-style access p++; // pointer moves p[i] = 42; // array-style access to same memoryfor (int i=0; i<cplen *n;i+=cplen ) { for (int k=0; k<cplen ;k++) { p[k] = p[k] ^ data [k];// no loop dependence } p+=cplen ;// outer iteration on different data } Snippet of the PBKDF2 implementation from OpenSSL Fig. 1. Two example codes where we can have false positives and negatives with pointer accesses Our solution aims to address this challenge by tracking data containers (as opposed to pointers) across the entire program. We then split pointers into the data container and an integer index to it. This effectively makes data accesses explicit while preserving execution semantics, enabling better analyzability of the code. A state-of-the-art data-centric compiler can then attempt to parallelize and improve the performance of the code. The implementation of our methods is explained in Section 2. Afterwards in Section 3 we discuss the limitations and motivation behind them. In Section 4 we will discuss possible solutions to mitigate the limitations showing some specific transformations handling special cases. ACM Trans. Arch. Code Optim. Page 3: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 3 Contributions •We introduce a novel method to statically split pointers into a data container and an index to it, creating explicit data accesses to aid further analysis. •We generate a parallel version of the Mantevo HPCCG benchmark automatically finding all parallelization opportunities the benchmark developers envisioned manually. Our automatic workflow outperforms the manually tuned version by up to 6%. •We automatically parallelize the previously serial PBKDF2 algorithm implemented in OpenSSL and obtain up to an 11x speedup. 2 POINTER DISAGGREGATION We propose a static method to improve the analyzability of pointers that are used as iterators over data containers. We first provide a high level overview explaining our method, we follow with some notes toward a proof and the description of the technical implementation. Then we compare our method with previous work. We also provide some examples and results to show the value of our transformation. 2.1 High level overview Original code with pointer movements p=a; p+=i; x=p[k];ap ielements ap p[k] = x kelements container containeroffset offsetTransformed code with adjunct instead of pointer movements p=a; p_adj = 0; p_adj +=i; x=p[p_adj +k];ap p[p_adj] ielements ap p[p_adj]p[p_adj + k] = x p_adj elements kelements Fig. 2. Representation of data access patterns with pointer movement (left) and static accesses with adjunct (right). Annotated with the accessed memory locations and pointer values. In the case of pointer movements the pointer is both the container and the offset while with the adjunct the pointer is only the container. Pointers are used both as handles for data containers and to iterate over them. By splitting these two semantically different use cases we intend to improve element-sensitive analysis of pointer accesses. For each pointer, we create an adjunct intvariable that represents the offset from the start of the data container. Note that the type of the adjunct should be big enough to address every element plus one in the connected data-container. The goal of the adjunct is to allow iteration over the data container without modifying the handle used to access said data container. We need the adjunct type to be plus one larger than the size of the data container as with loops containing *p++ we could have the need to get the address after the last element. Figure 2 shows a high-level example of how the adjunct is used. We can see from the figure that the value of xafter both executions is the same. For general code, it is important to note that p[i] ACM Trans. Arch. Code Optim. Page 4: PREPRINT4 Lepori et al. is semantically equivalent to *(p + i) as per C99 standard [ 34]. In Figure 3 we present a minimal example to show the value of introducing the adjunct variable. We provide the resulting assembly to show that when using the adjunct , the compiler is able to correctly identify that 𝑝𝑎𝑑𝑗is only used as an iterator similar to 𝑖and merges the two, resulting in more compact, efficient code. C representation for (int i=0; i<n;i++) { p[0] = i+ 1; p++; }ASM representation .L4: add eax , 1 ; i++ add rdx , 4 ; p++ mov DWORD PTR [rdx-4 ],eax cmp eax ,r14d ; i < n jne .L4 for (int i=0; i<n;i++) { p[p_adj + 0] = i+ 1; p_adj ++; }.L4: mov DWORD PTR [r12-4+rax *4], eax add rax , 1 ; i++ cmp rdx ,rax ; i < n jne .L4No pointer increment ⇒one less instructionadjunct transformationloop iterator→eax pointer iterator→rdx 𝑛value→r14d loop iterator→rax pointer iterator↰ container base address →r12 𝑛value→rdx Fig. 3. Difference in the assembly representation (compiled with gcc12.2 and -O2) between a code using pointer movement and one with static access after the adjunct is introduced. The adjunct transformation enhances code analyzability and exposes additional parallelization opportunities. Such transformed pointers are equivalent to arrays, in that the symbol represents a data container that is statically known. After the transformation the adjunct variable dictates how the pointer is accessed. Because it is an intvariable there are more compiler analysis methods available to track the value. Inside loops a compiler could apply Loop Strength-Reduction (LSR) to simplify addressing computations. Note that this can only be done after the adjunct transformation is applied as LSR only acts on integer values and not addresses (pointers). If instead the adjunct variable is never used - if a pointer is never moved for example - then the compiler will be able to optimize away the adjunct with constant propagation resulting in no performance losses. A summary of the transformation is described in Figure 4. To note is that the expression p = q⋄ xis a combination of p = q; q = q⋄x. Because of this fact, we do not detail this case in the following sections as it follows from the other statements. The adjunct type is deliberately a signed integer to replicate the behavior of actual pointer movement. This is because of the non-wrapping behaviour when the adjunct becomes negative. It is worth noting that accessing a pointer with a negative offset might be considered valid behavior under certain circumstances. For instance, if a programmer is confident that the memory is initialized within the same program, such as when an opaque library provides a pointer within an array, and the programmer is aware of the underlying structure. Function arguments and calls undergo minimal modifications. When a function is called, it is treated as a dereference, and we pass the pointer with the adjunct offset applied. The function argument declarations remain unchanged, meaning that the adjunct information is lost after the function call. This is generally not an issue, as subroutines often do not require the additional information. Moreover, this problem is mitigated by the compiler’s automatic inlining or by manually annotating functions to force inlining, thereby eliminating the function call. ACM Trans. Arch. Code Optim. Page 5: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 5 Original code Modified code int* p; int* p; int p_adj = 0; *p p[p_adj] p[x] p[x + p_adj] p = p⋄x; orp⋄= x; p_adj = p_adj ⋄x; p = q; p = q; p_adj = q_adj; p = q⋄x; p = q; p_adj = q_adj⋄x;Original code Modified code f(p); f(p + p_adj); void f(int* p) { void f(int* p) { int p_adj = 0; // body // body int k = *p; int k = p[p_adj]; } } Fig. 4. Transformation summary to improve pointer analyzability. Both pandqareintpointers ( int* p, q ), xis an intexpression and⋄is any integer binary operator. Note that the data-type intis only for explanation purposes and can be substituted with any other type. The adjunct -type intcan be interchanged with any integer type that fits the size of the data-container. 2.2 Notes toward a proof We provide a proof sketch that the transformation showed in Figure 4 does not modify the accesses done with pointers. First for initialization of a pointer let 𝑝be a pointer to some data. Then let 𝑝𝑎𝑑𝑗 be a new unique integer variable not already present in the program and initialized to zero when the memory of the pointer is first initialized. I.e. we have 𝑝=𝑚𝑎𝑙𝑙𝑜𝑐(...);𝑝𝑎𝑑𝑗=0where𝑝𝑎𝑑𝑗was not previously present in the variable definitions. This is a new variable definition without any name conflict hence it will not influence other parts of the program. Let𝑆be an arbitrary 𝑛length sequence of statements s.t. 𝑆=(𝑝:=𝑝+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛) Then we define the syntax 𝑆;𝑝by using the sequential operator ;with a variable name at the end as the value of 𝑝after execution of 𝑆. By our definition we have that: (𝑆;𝑝)=𝑝+𝑛∑︁ 𝑖=0𝑥𝑖 Let𝑆′be an𝑛length sequence of statements (note that 𝑥𝑖are the same values as in 𝑆and in the same order). 𝑆′=𝑝𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛 As before by definition we have that (𝑆′;𝑝𝑎𝑑𝑗)=𝑝𝑎𝑑𝑗+𝑛∑︁ 𝑖=0𝑥𝑖 We can see that (𝑆;𝑝)=𝑝+𝑛∑︁ 𝑖=0𝑥𝑖=𝑝+𝑝𝑎𝑑𝑗+𝑛∑︁ 𝑖=0𝑥𝑖−𝑝𝑎𝑑𝑗=𝑝+(𝑆′;𝑝𝑎𝑑𝑗)−𝑝𝑎𝑑𝑗 ACM Trans. Arch. Code Optim. Page 6: PREPRINT6 Lepori et al. By using the sequential operator ;we define a program sequence that defines the pointer 𝑝, integer 𝑝𝑎𝑑𝑗and executes all statements in 𝑆. This represents an arbitrary program from the definition to the access of a pointer 𝑝with arbitrary pointer movements in between. (𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆;𝑝)))=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑝+(𝑆′;𝑝𝑎𝑑𝑗)−𝑝𝑎𝑑𝑗))) =(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑝+(𝑆′;𝑝𝑎𝑑𝑗)))) We note that(𝑆′;𝑝)=𝑝and(𝑆;𝑝𝑎𝑑𝑗)=𝑝𝑎𝑑𝑗as the statements does not affect the evaluated variable. Then we have that if we substitute any execution 𝑝=𝑝+𝑥𝑖with𝑝𝑎𝑑𝑗=𝑝𝑎𝑑𝑗+𝑥𝑖i.e. executing𝑆′instead of𝑆then (𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆;𝑝)))=(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑝+(𝑆′;𝑝𝑎𝑑𝑗)))) =(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆′;𝑝)+(𝑆′;𝑝𝑎𝑑𝑗))) =(𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑆′;𝑝+𝑝𝑎𝑑𝑗))) This means that by substituting 𝑆with𝑆′and accessing every pointer using 𝑝+𝑝𝑎𝑑𝑗as their address we access the same memory location. Hence the program will be semantically equivalent. 𝑆 and𝑆′only contain statements referring to 𝑝or𝑝𝑎𝑑𝑗for simplicity of the proof. Any other statement could be interleaved including control flow statements. By getting all possible execution paths and removing all statements that don’t act directly on 𝑝then we have the same case as 𝑆that can be transformed in 𝑆′. The proof sketch for the correctness of the pointer assignment transformation can be found in Appendix A. 2.3 Technical implementation We will now discuss the algorithm that governs our approach. The algorithms are applied to the AST (abstract syntax tree) to produce the modified code including the adjunct transformation. It is important to note that the whole transformation is source-to-source. The implementation leverages libclang to parse the source code. We keep a 𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 map that maps pointers to its adjunct . As previously mentioned the type of the adjunct variable should be an integer type big enough to address each element of the data-container. We mostly use intas it is big enough for most practical purposes but it could be easily interchanged with bigger or smaller integer types. For every variable declaration, we check if it is a pointer type. In that case, we append an additional variable declaration for the adjunct variable with type intand initialize it to 0. We keep a mapping between the pointer variables and their adjunct . To avoid name collisions for the adjunct variable name a simple counter and checks are used. for all variable declaration of the form 𝑡𝑦𝑝𝑒𝑛𝑎𝑚𝑒 =𝑖𝑛𝑖𝑡do if𝑡𝑦𝑝𝑒 is a pointer type then 𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑁𝑎𝑚𝑒←𝑔𝑒𝑡𝑈𝑛𝑖𝑞𝑢𝑒𝑁𝑎𝑚𝑒(𝑛𝑎𝑚𝑒+“_𝑎𝑑𝑗”) 𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑛𝑎𝑚𝑒]←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑁𝑎𝑚𝑒 𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑁𝑎𝑚𝑒 init to 0 We analyze every array subscript expression ( a[i] , where ais the array expression and iis the index expression) and verify if the array expression is in the mapping. In case we have a match we modify the index expression to sum the additional adjunct variable. Note that we can handle unary expressions of the type *peasily by transforming them to p[0] with a simple transformation (which does not change the semantics as per C standard E1[E2] is equivalent to *(E1 + E2) ). ACM Trans. Arch. Code Optim. Page 7: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 7 for all array subscript expression of the form 𝑎𝑟𝑟𝑎𝑦[𝑖𝑛𝑑𝑒𝑥]do if𝑎𝑟𝑟𝑎𝑦∈𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 then 𝑎𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑎𝑟𝑟𝑎𝑦] 𝑖𝑛𝑑𝑒𝑥 =𝑖𝑛𝑑𝑒𝑥+𝑎𝑑𝑗𝑢𝑛𝑐𝑡 Finally, we examine every assignment. Here we have two cases: p = p⋄xorp = q . For both, the left expression must be in the mapping. In the first case ( p = p⋄x) the right side must be a binary operation for which the first operator is the same as the left side of the assignment ( p = q ⋄x,pandqmust be the same pointer). We then substitute the pointer pwith its adjunct (p_adj = p_adj⋄x). In the second case ( p = q ) we check that the right side is also a pointer in the mapping. Then we append an additional statement that overwrites the adjunct of the left side with theadjunct of the right ( p_adj = q_adj ). for all assignment of the form 𝑙ℎ𝑠=𝑟ℎ𝑠do if𝑙ℎ𝑠∈𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 then if𝑟ℎ𝑠is a pointer∧𝑟ℎ𝑠∈𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔 then 𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑙ℎ𝑠] 𝑟ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑟ℎ𝑠] 𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡 =𝑟ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡 else if𝑟ℎ𝑠==𝑝𝑡𝑟⋄𝑒𝑥𝑝𝑟 then if𝑙ℎ𝑠==𝑝𝑡𝑟then 𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡←𝑎𝑑𝑗𝑢𝑛𝑐𝑡𝑀𝑎𝑝𝑝𝑖𝑛𝑔[𝑙ℎ𝑠] 𝑙ℎ𝑠=𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡 𝑟ℎ𝑠=𝑙ℎ𝑠𝐴𝑑𝑗𝑢𝑛𝑐𝑡⋄𝑒𝑥𝑝𝑟 2.4 Previous approaches A similar approach was already done specifically for digital signal processing (DSP) applications [ 33]. Their approach could also be extended to general programs and modifies the source code to remove pointer movement. Our biggest innovation is the use of the adjunct that enables run-time determined pointer movements. Previous methods only handled static or compile-time decidable pointer increments. For example the expression: if ( exp) { x=ptr++; } else { y=ptr + 2; } cannot be handled by previous approaches as it cannot decide at compile time if the pointer must be incremented by 1 or 2. With our adjunct approach it can be handled as the same branch just by modifying the adjunct instead of the pointer. While points-to analysis represents a cutting-edge method widely integrated into modern com- pilers, its effectiveness in element-sensitive analysis, particularly concerning pointer movements, remains limited. This constraint poses challenges for loop optimization strategies like vectorization or parallelization, as points-to analysis struggles to grasp element-sensitive aliasing patterns. As illustrated in Figure 5, conventional points-to analysis, as implemented in GCC 13, fails to statically ACM Trans. Arch. Code Optim. Page 8: PREPRINT8 Lepori et al. discern element-sensitive aliasing in code involving pointer movements. However, following our transformation, GCC successfully identifies the absence of access collisions. It is worth noting that even after the adjunct transformation, GCC continues to leverage points-to analysis to recognize the independence between accessed elements. The shown example in Figure 5 includes only a 5-iteration loop to more obviously present the missing vectorization check branch. Figure 5 illustrates that the transformation is also beneficial for dynamic allocations. In such cases, the base pointer must be stored somewhere to ensure proper memory deallocation. However, even with the base pointer known, the compiler often struggles to fully resolve all aliasing information. int* arr =malloc (sizeof(int) * 10); // random init array int* p=arr; int* q=arr; if ( argc == 42) { p+= 5; } else { q+= 5; } for (int i=0; i<5; i++) { // GCC adds runtime vectorization checks p[i] += q[i]; }Unrolled loop no vectorizationUnrolled vectorized loop main: push rbp mov edx, 10 mov rbp, rsp push rbx mov rdi, QWORD PTR [rsi+8] xor esi, esi and rsp, -32 call strtol mov edi, 40 mov rbx, rax call malloc vmovd xmm0 , ebx vmovd xmm3 , ebx vmovq xmm1 , QWORD PTR .LC1 [rip] vpbroadcastd ymm0 , xmm0 vpaddd ymm0 , ymm0 , YMMWORD PTR .LC0 [rip] lea rdx, [rax+20] vmovdqu YMMWORD PTR [rax], ymm0 vpshufd xmm0 , xmm3 , 0xe0 vpaddd xmm0 , xmm0 , xmm1 vmovq QWORD PTR [rax+32], xmm0 cmp ebx, 42 je .L2 main: @24 mov rcx, rdx mov rdx, rax mov rax, rcx .L2: lea rsi, [rax+4] mov rcx, rdx sub rcx, rsi cmp rcx, 8 jbe .L3 .L2: @33 vmovdqu xmm2 , XMMWORD PTR [rdx] vpaddd xmm0 , xmm2 , XMMWORD PTR [rax] vmovdqu XMMWORD PTR [rdx], xmm0 mov ecx, DWORD PTR [rax+16] add ecx, DWORD PTR [rdx+16] .L4: vpsrldq xmm1 , xmm0 , 8 mov DWORD PTR [rdx+16], ecx vpaddd xmm0 , xmm0 , xmm1 vpsrldq xmm1 , xmm0 , 4 vpaddd xmm0 , xmm0 , xmm1 vmovd eax, xmm0 add eax, ecx vzeroupper mov rbx, QWORD PTR [rbp-8] leave ret.L3: mov ecx, DWORD PTR [rax] add DWORD PTR [rdx], ecx mov ecx, DWORD PTR [rax+4] add DWORD PTR [rdx+4], ecx mov ecx, DWORD PTR [rax+8] add DWORD PTR [rdx+8], ecx mov ecx, DWORD PTR [rax+12] add DWORD PTR [rdx+12], ecx mov ecx, DWORD PTR [rax+16] vmovdqu xmm0 , XMMWORD PTR [rdx] add ecx, DWORD PTR [rdx+16] jmp .L4int* arr =malloc (sizeof(int) * 10); // random init array int* p=arr; int p_adj = 0; int* q=arr; int q_adj = 0; if ( argc == 42) { p_adj += 5; } else { q_adj += 5; } for (int i=0; i<5; i++) { // GCC statically decided vectorization p[i+p_adj ] += q[i+q_adj ]; }Unrolled vectorized loop main: push rbp mov edx, 10 mov rbp, rsp push rbx mov rdi, QWORD PTR [rsi+8] xor esi, esi and rsp, -32 call strtol mov edi, 40 mov rbx, rax call malloc vmovd xmm0 , ebx cmp ebx, 42 vmovd xmm2 , ebx vpbroadcastd ymm0 , xmm0 mov edx, 24 mov ecx, 4 vmovq xmm1 , QWORD PTR .LC1 [rip] vpaddd ymm0 , ymm0 , YMMWORD PTR .LC0 [rip] cmovne rcx, rdx mov esi, 0 mov edx, 20 cmovne rdx, rsi vmovdqu YMMWORD PTR [rax], ymm0 vpshufd xmm0 , xmm2 , 0xe0 vpaddd xmm0 , xmm0 , xmm1 lea rsi, [rax+rdx] lea rdi, [rax+16+rdx] vmovq QWORD PTR [rax+32], xmm0 vmovdqu xmm3 , XMMWORD PTR [rax-4+rcx] vpaddd xmm0 , xmm3 , XMMWORD PTR [rsi] vmovdqu XMMWORD PTR [rsi], xmm0 mov edx, DWORD PTR [rdi] add edx, DWORD PTR [rax+12+rcx] mov DWORD PTR [rdi], edx vmovdqu xmm0 , XMMWORD PTR [rsi] vpsrldq xmm1 , xmm0 , 8 vpaddd xmm0 , xmm0 , xmm1 vpsrldq xmm1 , xmm0 , 4 vpaddd xmm0 , xmm0 , xmm1 vmovd eax, xmm0 add eax, edx vzeroupper mov rbx, QWORD PTR [rbp-8] leave ret Fig. 5. Left side: example code using pointer movements where the points-to analysis inside GCC is unable to ensure non-aliasing accesses. Right side: code after adjunct transformation where GCC is able statically decide non-aliasing using points-to analysis and integer analysis. Note the difference in the CFG of the resulting assembly code. Sui et al. [ 51] proposed additional analysis techniques for pointer loops, enhancing field-sensitive and element-sensitive methods. However, their work does not address pointer movements, which remain a challenge for compilers, even advanced ones. Our transformation effectively eliminates pointer movements, allowing existing points-to analysis techniques, including advanced element- sensitive approaches, to be applied with greater efficacy. By incorporating points-to analysis with our approach, we enhance the depth of insights and enable advanced code optimization. Our methodology involves modifying the source code, rendering it compatible with various analyzers and optimizers. While our primary focus lies in data-centric frameworks, which often struggle with handling pointers, it is essential to note that our approach is not limited to this domain. The modified source code generated through our method seamlessly integrates with any existing or future frameworks, offering versatility and long-term applicability. Scalar Evolution (SCEV) [ 9] and other loop analysis methods aim to scrutinize loops for potential optimizations and performance enhancements, including vectorization. However, as demonstrated in numerous examples, current implementations often struggle to identify such opportunities, particularly concerning pointers. Although there have been proposals for improving analysis techniques [ 13,32,38,51], finding techniques that are both safe and general enough remains challenging. Our transformation addresses this challenge by providing additional information to ACM Trans. Arch. Code Optim. Page 9: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 9 the compiler, enabling it to leverage existing SCEV techniques for performance improvements. With better analysis techniques, more patterns can be matched. Importantly, our transformation does not hinder the adoption of new techniques, as the array access pattern is well-established and commonly utilized, making it compatible with existing analyzers. While advanced analyzers may eventually incorporate pointer movements into their analyses, currently, this remains a significant challenge, and most analyzers do not account for it. Hence, we contend that pointer movement continues to pose a problem, and our transformation provides a valuable solution in the current landscape. 2.5 Practical results PBKDF2. To showcase the transformation in practice we test it on the code snippet from PBKDF2 first presented in Section 1. We expect the inner loop 2to be vectorized by every compiler. What we want to test is whether compilers are able to discover the independence between the iterations of the outer loop 1and the effects that the adjunct transformation has on it. // snippet of the PBKDF2 implementation in OpenSSL // int cplen, n for (int i=0; i<cplen *n;i+=cplen ) { for (int k=0; k<cplen ;k++) { // no loop carried dependence p[k] = p[k] ^ data [k]; } // each outer loop iteration works on different data ↩→ p+=cplen ; }// adjunct transformed version // int cplen, n for (int i=0; i<cplen *n;i+=cplen ) { 1 for (int k=0; k<cplen ;k++) { 2 // data adjunct removed for readability p[k+p_adj ] = p[k+p_adj ] ^ data [k]; } // no pointer movement, the adjunct variable is used instead ↩→ p_adj +=cplen ; } The code was run both with normal pointer movements (“no adjunct ”) and with the transforma- tion (“with adjunct ” variables). It was compiled with different compilers: GCC 12.2.2, Clang 15.0.6, and Polly (same Clang version). We also used the DaCe 0.14.1 data-centric framework to analyze and optimize the code with the resulting file being compiled with GCC. The code was instrumented with PAPI [ 23] to measure the execution time and the number of instructions. For each test, 10 runs were executed and the median is reported, with the 95% confidence interval being reported. For the number of instructions, the error range is negligible ( ±1000 instructions) and not reported. Runs were done with n = 100 andcplen = 100·106. As DaCe and Polly produce OpenMP code we report different values for different thread counts. We compare results with Polly as it is another compiler that offers automatic parallelization. Note that after the adjunct transformation the code only has array accesses which is a pattern supported by polyhedral compilers. Figure 6 shows that the adjunct transformation gets a minimal improvement in single-threaded performance. While with multiple threads the data-centric compiler DaCe was able to identify the parallelization opportunity on loop 1. Polly however isn’t able to identify the parallel loop either with or without the adjunct transformation. This is the case because the array-style access ofpis non-affine as it uses the p_adj variable which is non-affine w.r.t. loop iterators. Which is a requirement for the loop to be a SCoP (static control part). Polly only acts on SCoPs of programs. Note that as DaCe uses a data-centric IR which utilizes static and scope defined data containers (for arrays), pointer movements are not supported directly. The adjunct transformation is a way to support pointer movements in the data-centric IR without the need to change its inner workings. The number of instructions neither increase nor decrease with the adjunct . As Polly is a collection of transformations on the LLVM-IR that is compiled using Clang the number of instructions is identical (Polly wasn’t able to find any SCoPs hence no transformations were applied). Surprisingly ACM Trans. Arch. Code Optim. Page 10: PREPRINT10 Lepori et al. GCC DaCe Clang Polly2345 CompilerTime [s]Serial execution comparison with and without adjunct AMD Ryzen 5 2600X (6 cores) With adjunct Noadjunct / original 2 4 6 8 122345 DaCe (with adjunct )Polly Polly with adjunct ThreadsParallel execution comparison between DaCe and Polly AMD Ryzen 5 2600X (6 cores) Fig. 6. Serial and parallel runtimes for a simplified code snippet extracted from the PBKDF2 implementation of OpenSSL. Note that DaCe can only analyze the code after using the adjunct transformation. DaCe has the same number of instructions as GCC. Even if it uses GCC to compile to machine instructions this is unexpected as DaCe applies multiple transformations to the code. This can be explained as even after the transformations the computation is mostly the same only with additional annotations for parallel execution. Instructions [ 106]Noadjunct With adjunct GCC 4 380 4 380 Clang 2 974 2 974 Polly 2 974 2 974 DaCe not supported 4 380 Table 1. Number of instructions generated by different compilers of the code snippet taken from PBKDF2. This proves the value of the transformation, but also that the transformation benefits compilers with data-centric IR the most: they can leverage the improved analyzability of pointer movements. Data centric-paradigms otherwise represent pointers as indirection that is challenging - or even impossible - to handle. All compilers see a minimal improvement but they are less able to capitalize on the improved analyzability. LZO (compression algorithms). Another instance where pointer movements within loops are prevalent is in compression algorithms. As an illustration, we applied our transformation to the compression function of the Lempel–Ziv–Oberhumer (LZO) algorithm [ 46]. This algorithm is commonly utilized within the Linux Kernel for file-systems and memory, supporting live com- pression/decompression operations. Notably, it finds applications in BTRFS [ 1], SquashFS [ 5], initramfs [2], zram [7], and zswap [8]. Following a similar approach to previous benchmark studies of the LZO algorithm [ 36], we executed multiple iterations of the algorithm, ensuring each run involved significant computation without increasing memory usage. The compression was performed on a 1 MB file with 2000 iterations and a block size of 256KB. As before, we report the median with a 95% confidence interval using the same compiler versions. We compared the original implementation compiled with GCC ACM Trans. Arch. Code Optim. Page 11: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 11 against DaCe with the adjunct transformation. It is noteworthy that the DaCe code was executed with a single thread, as no significant parallelization opportunities were identified. Our results indicate a slight improvement, thanks to the additional vectorization opportuni- ties provided by the transformation. However, achieving more substantial improvements, along with potential parallelization opportunities, proves challenging due to loop carried dependencies which would require an algorithm redesign. Nevertheless, in compression algorithms, the adjunct transformation can facilitate modest yet discernible enhancements in runtime performance. DaCe+adjunct GCC2.833.23.4 CompilerTime [s]i7-8550U (16GB) Fig. 7. Serial runtime (single thread) of LZO compression algorithm. Comparison of original implementation compiled with GCC 12.2.2 and DaCe transformed version (including adjunct transformation) 3 LIMITATIONS The method we propose can provide compilers with better insights into pointer behavior, but only in certain scenarios. In this section, we go into detail as to the limitations of our approach. The main limitation is about the static decidability of which data container a pointer is pointing to. But we argue that in those cases an algorithm redesign would be needed for any performance improvement. The necessity of static decidability arises from the utilization of DaCe as the compiler/framework for optimizations. Although DaCe achieves significant performance improvements, particularly in automatic parallelization, it relies on the ability to statically determine data containers. However, theadjunct transformation itself encounters limitations primarily associated with higher-order pointers, as elaborated in Subsection 3.2. Another important aspect to highlight is the utility of our adjunct transformation. Simply expos- ing additional information to the compiler does not guarantee a runtime performance improvement. In earlier examples, we demonstrated how the compiler generates fewer instructions or eliminates branches entirely. Additionally, using DaCe it created further parallelization opportunities. We discovered that these improvements are predominantly seen in codes where pointers are used as iterators. While this pattern is common in many codebases as we will discuss in Table 2, no immediate improvements will be offered to codes that do not exhibit this pattern. 3.1 Static decidability A first assumption is that a pointer has to always point to a statically decidable data container in any given section of the code. This is required to be able to deterministically identify data accesses. Conditional reassignment introduces uncertainty as to which data container a pointer is pointing to. In such cases, our analysis backs off and does not generate an adjunct. However, unconditional or static pointer re-assignments are supported as they aren’t introducing uncertainty as to which data container they point to. We can see two code snippets that show this difference in Figure 8. ACM Trans. Arch. Code Optim. Page 12: PREPRINT12 Lepori et al. With this assumption, we achieve flawless aliasing detection, as the compiler possesses static knowledge of the data container associated with each pointer. Detecting accesses to the same data simply involves verifying if the data container and indices are identical. Leveraging the adjunct transformation streamlines index comparison to a straightforward integer check, often allowing for static analysis if the abstract model is sufficiently precise. It would be possible to expand the transformation to also support conditional pointers reas- signment by creating adjunct versions for each container and replicating the branch decision for each subsequent access. However, this would create unsustainable code and memory replication requirements. Decidable access - supported int* p; p=a; // now p always points to a p[i] = 5; // accessing a with an offset of i p=b; // now p always points to b p[i] = 2; // accessing b with an offset of iUndecidable access - not supported int* p; if (...) { // unknown branch p=a; } else { p=b; } // p could point to either a or b p[i] = 5; // undecidable data container Fig. 8. Undecidable vs decidable pointer accesses 3.2 Pointing to pointers Double pointers - or higher order pointers - are supported by applying the transformation recur- sively, as can be seen in Figure 9. Note that the data container to which pis pointing to is statically known. As with simple pointers, we can trace the accesses 2and 3to container a(1) and get the correct adjunct in this case. But the decidability issue remains, as we can see in Figure 10 (left side) where the data container that is accessed is determined by the higher-order pointer location. int* a=new int[10]; int** p= &a; int x= (* p)[i]// -> a[i] a++; int y= (* p)[i]// -> a[i + 1]int* a=new int[10]; int a_adj = 0; int** p= &a;1 int p_adj = 0; int x= (p[p_adj ])[i+a_adj ]2 a_adj ++; int y= (p[p_adj ])[i+a_adj ]3 Fig. 9. Example of adjunct transformation with double pointers The general case could be handled with arrays of adjunct , as shown in Figure 10 (right side). However, trying to leverage this approach on general codes would substantially increase the complexity of the code and consume a significant amount of memory. Note that this approach wasn’t implemented because of the unsustainability of the transformation. Therefore, if additional assumptions can be made about the shape of the individual data containers pointed at, more targeted, efficient approaches can be used as explained in Section 4. ACM Trans. Arch. Code Optim. Page 13: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 13 int* a=new int[10]; int* b=new int[10]; int** p=new int*[2]; p[0] = a; p[1] = b; if (...) { p++; } int x= (* p)[i]// -> either a[i] or b[i]adjunct for new pointers Array of adjunct for new pointer array Copy current adjunct position for future accesses through pint* a=new int[10]; int a_adj = 0; int* b=new int[10]; int b_adj = 0; int** p=new int*[2]; int* p_adj_array =new int[2]; int p_adj = 0; p[0] = a; p_adj_array [0] = a_adj ; p[1] = b; p_adj_array [1] = b_adj ; if (...) { p_adj ++; } int x= (p[p_adj ])[i+p_adj_array [p_adj ]];Offset of ppointer Offset of array element Requested indexBased on which element we are accessing with p Fig. 10. Example of adjunct transformation with arrays of pointers Although a multidimensional array shares the same memory representation as higher-order pointers, the analysis differs. Our aim is to eliminate pointer movements and replace them with more predictable array accesses. With multidimensional arrays, this pattern is inherent, and our transformation results in code similar to what is typically observed, as pointers are generally not relocated from multidimensional arrays. Prior proposals have suggested transformations to flatten multidimensional arrays into a single dimension [ 31]. This approach was also included in GCC between version 4.3 and 4.8. The method is orthogonal to ours and can be applied before or after our transformation as it will not introduce additional pointer movements. 3.3 Evaluating applicability In our investigation, we have identified the primary constraint of the adjunct transformation: higher- order pointers. The requirement for statically decidable data containers is only a prerequisite of the DaCe framework and not of the adjunct transformation itself. To gauge the broader viability of the transformation, we conducted an assessment of its utilization in extensive codebases, specifically focusing on the handling of higher-order pointers. It is important to clarify our criteria for identifying non-applicable instances. We categorize higher-order pointer usage as constraining only when it potentially involves pointer iteration. Notably, we exempt instances where double pointers are employed to delegate memory allocation responsibilities to another function (refer to Figure 11). Moreover, we exclude considerations of the argv parameter in main functions due to its typically diminutive size and minimal involvement in computationally intensive tasks. Our analysis considers application source code, omitting unit tests and examples. Furthermore, our scope is confined to C and C++ files. In the same way, we count single pointers as possible candidates for the adjunct transformation. The outcomes of our investigation are presented in Table 2. Our examination encompassed OpenSSL [ 4], notable for its cryptographic algorithm implementations, TurboBench [ 6], which hosts a plethora of compression algorithm implementations, and the Linux kernel [3]. ACM Trans. Arch. Code Optim. Page 14: PREPRINT14 Lepori et al. void init_pointer (char** p, int n) { *p=malloc (n); memset (*p, 0, n); } int main (int argc , char** argv ) { char* buf; init_pointer (&buf,atoi (argv [1])); return 0; } Fig. 11. Example usage of double pointers to delegate memory allocation responsibility. This usage is fully supported by the adjunct transformation. Code-base LoC #applicable #non-applicable OpenSSL 704,308 26,925 1,362 TurboBench 854,548 32,680 756 Linux kernel 24,940,479 1,656,065 4,972 Table 2. Number of lines of code (LoC), number of applicable instances of the adjunct transformation (#applicable) and number of non applicable instances (#non-applicable) in different code-bases (only C and C++). Higher order pointers are indeed common in these codebases, as expected. However, their frequency is generally several orders of magnitude lower than the total code volume and notably even lower than single order pointer instances. It’s essential to note that the provided statistics encompass pointers not exclusively used as iterators. This assessment offers a preliminary insight into the method’s broader applicability, highlighting the prevalence of single pointers compared to their higher order counterparts in programs. 4 FROM GENERIC TO SPECIFIC - REFINEMENTS FOR PRACTICAL C CODES We will show how our approach towards improving the analyzability of pointers used as iterators benefits real codes. However, we discovered that for such codes further refinements are needed such as additional transformations and support for pointers to external calls. All transformation in the following subsections were implemented and are done automatically by the compiler. 4.1 Improving analyzability by restructuring data As discussed in Section 3, pointing to pointers is a more difficult to analyze scenario, as conditional pointer movements can lead to static undecidability regarding which data container is accessed. However, if additional information is known, the data structures themselves can be changed to make analysis easier. As this kind of transformation targets a general structure they do not only target a specific program but the entire set of programs that use that structure. An example of a structure transformation that we implemented is inside the Mantevo HPCCG benchmark. It uses sparse matrix storage, specifically the List of Lists (LIL) format. The matrix is stored using two lists for each row containing the column indices and the non-zero values. The row lists are stored as continuous data, and pointers to them are used to access the correct location ACM Trans. Arch. Code Optim. Page 15: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 15 for (int i=0; i<nrow ;i++) { double sum = 0.0; // accumulate sum for each row int cur_nnz =A->nnz_in_row [i];// number of non-zeros per row for (int j=0; j<cur_nnz ;j++) { sum +=A->ptr_to_vals [i][j]3// matrix element *x[A->ptr_to_inds [i][j]]; // vector element } y[i] = sum; } Fig. 12. Matrix-vector multiplication with a sparse matrix in LIL format taken from the HPCCG benchmark based on the current row. In Figure 13, we show the initialization code for a data structure from the HPCCG benchmark (modified for simplicity). The matrix is accessed using those two lists, as it can be seen in Figure 12 for a matrix-vector multiplication from the same benchmark. With the knowledge that there exists a statically known maximum number of non-zeroes per row, we can transform the list of lists into a matrix, as when the data is accessed (like in 3) we see that the access resembles a 2D array. We transform the initialization of the sparse matrix to utilize a 2D-like structure instead of a contiguous list. The transformation is shown in Figure 13. Thanks to this change we now have data stored in a more regular pattern (2D array). This enables compilers to apply standard transformations and optimizations on 2D containers. double * curvalptr =new double[ max_nnz *nrow ]; int * curindptr =new int[ max_nnz *nrow ]; for (...) { // for every row A->ptr_to_vals [currow ] = curvalptr ;1 A->ptr_to_inds [currow ] = curindptr ;2 for (...) { // for every value *curvalptr = 27.0; *curindptr =curcol ; curvalptr ++; curindptr ++; } }// do not allocate curvalptr or curindptr for (...) { // for every row A->ptr_to_vals [currow ] = new double[ max_nnz ]; int vals_adj = 0; A->ptr_to_inds [currow ] = new int[ max_nnz ]; int inds_adj = 0; for (...) { // for every value A->ptr_to_vals [currow ][vals_adj ] = 27.0; A->ptr_to_inds [currow ][inds_adj ] = curcol ; vals_adj ++; inds_adj ++; } } Fig. 13. Initialization of a sparse matrix in LIL format extracted from the Mantevo HPCCG benchmark. The original code is on the left, while the transformed code is on the right. The transformation is done first by finding two pointer assignments working on the same struct inside the same scope ( 1and 2). We use a map to relate the right side of the assignment with the left (curvalptr toA->ptr_to_vals[currow] ). We then create an adjunct and transform the code in a similar way to the general pointer transformation. Using the previously created map we find all pointer accesses that must be transformed and substitute them with array access using the adjunct . In the same way, we find pointer arithmetic operations and substitute the pointer with the adjunct . The benefit of using such specific pattern-based transformations is that they can be used in conjunction with our general algorithm, and therefore be integrated into the same workflow. This ACM Trans. Arch. Code Optim. Page 16: PREPRINT16 Lepori et al. creates the possibility to efficiently handle a wide range of other access patterns without the need for separate workflows or manual code modification. It’s important to note that for multidimensional arrays, allocating a continuous buffer and then splitting it by storing pointers introduces uncertainty for static analysis, as explained in Section 3.2. Such patterns are unsupported by our current method. However, our LIL transformation automatically converts this pattern into a 2D array. In other scenarios, implementing additional transformations may be necessary to align with the limitations of our method. 4.2 Handling external calls with pointer arguments Some parts of a program, such as external library calls are outside the scope of static compiler analysis. These must therefore be considered black boxes. A common pattern in C when calling external functions is to use pointers as arguments for inputs and outputs. One such example is the memcpy function from the C standard library. The function with the following signature: void* memcpy (void* destination , const void* source , size_t num); takes two pointers and an integer numand copies numbytes from the source pointer to the destination . In the previous section, we discussed the possible modifications and movement of pointers and their impact on decidability. In the example of memcpy pointers are not modified. In C arguments are passed by copy and not by reference therefore callee modifications of arguments are not propagated to the caller. The same argument applies to pointers. What is modified in those cases is the content of the pointer, hence the elements of the data container that the pointer is pointing to. A function taking a pointer as an argument can modify its contents. This poses a challenge as data dependencies can no longer be tracked. In Figure 14, we use memcpy instead of a simple assignment. char* p=new char[1]; *p= 0; for (int i=0; i<n;i++) { char q= *p+i; // use *p from previous iteration ↩→ memcpy (p, &q, 1); // equal to *p = q } With memcpy dependency q = *p + i memcpy(p, &q, 1)qp i for loopint* q=new char[1]; *q= 42; for (int i=0; i<n;i++) { char p= 0; memcpy (&p,q, 1); // equal to p = *q data [i] = p+i;// data[i] = 42 + i; } Without memcpy dependency q = *p + i memcpy(p, &q, 1)qp i for loop Fig. 14. Left side: memcpy creates a dependency between loop iterations. Right side: pis a loop local variable andmemcpy is used to assign values to it. On the left side of the figure the data in pdepends on the value previously written to it by memcpy (dependency shown by the dashed arrow). This enforces that the loop must be executed in order. If ACM Trans. Arch. Code Optim. Page 17: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 17 we remove this dependency (dashed arrow) then loop iterations would be independent. This would permit the compiler to vectorize the loop producing the incorrect result. We previously assumed that the second argument of memcpy is only read, but this is an insight gained by looking at the implementation of memcpy and not black box analysis. Without outside information, no compiler framework can overcome this issue. In the right side of Figure 14 qis only read from but the compiler cannot infer this. This creates a spurious data dependency between iterations and prevents the loop from being parallelized. While it is good practice to make pointer arguments that are read-only const , not every function or library does so. Our solution is to use whitelists for often-used library calls - manually annotating library functions to specify if a pointer is write-only, read-write, or read-only. This whitelist could become a database of known functions that could even be automatically populated through compiler analysis, as it only needs to be performed once for any given library implementation. Annotations only provide information about which data container was read or written. We, therefore, assume conservatively that the entire data container could be read or written. 4.3 Stateful external calls Some libraries store an internal state that is opaque to the program calling functions provided by the library — great examples being MPI or OpenGL. One such library is a part of OpenSSL, namely HMAC (Hash Message Authentication Code) — it uses an internal state to store hashes and keys between calls. This data is saved in an internal data structure and a pointer to it is provided to calling functions. This pointer (called context in this implementation) is never modified by the user code, but only by library calls. HMAC_CTX *hctx =HMAC_CTX_new (); 1 for (...) { HMAC_CTX_copy (hctx ,base_hctx );2 // do operations with hctx } HMAC_CTX_free (hctx ); In the example above, without any additional information, it appears that the hctx pointer is ini- tialized outside the loop ( 1), and that the call to HMAC_CTX_copy within the loop ( 2) could read and write to hctx . This creates a data dependency between loop iterations. In reality, HMAC_CTX_copy does not read the data of the first argument ( hctx ) — that value is only written to and effectively used as a local variable within the scope of the loop. Therefore, by marking the dependency on (hctx ) inHMAC_CTX_copy as write-only and noticing that it is never used afterwards, the loop can be parallelized. This approach essentially performs an escape analysis [ 40], identifying that the variable within the loop does not escape the thread executing each iteration. 5 EVALUATION To evaluate our transformation we extended the DaCe [ 14,19] pipeline by adding the pointer disaggregation transformation explained in Section 2 and the specific transformations explained in Section 4. The approach described in Section 4.2 and Section 4.3 is implemented by adding a whitelist of additional dependencies in the data-centric IR (intermediate representation) used by DaCe. As detailed and demonstrated in Section 2.5, while incremental performance enhancements are attainable across various compilers, the constrained scope of vectorization inhibits substantial performance gains. Consequently, we prioritize automatic parallelization compilers like Polly and DaCe, which hold promise in uncovering novel opportunities facilitated by the transformations ACM Trans. Arch. Code Optim. Page 18: PREPRINT18 Lepori et al. implemented. Specifically, we direct our attention to DaCe due to its inherent data-centric approach, which offers superior analysis of data dependencies. We measured performance on a dual-socket 2x6 core (2x12 threads) Intel Xeon X5670 @ 2.93GHz with 48GB of RAM. We compare the DaCe data-centric framework (using GCC 12.1.1 as a backend compiler) against GCC and Polly (using Clang 15.0.6). We report the median of 10 runs with a confidence interval of 95%. OpenMP is used for parallel execution. Integrating the adjunct trans- formation into Polly did not reveal any additional Static Control Parts (SCoPs), and consequently, no further parallelization opportunities were identified. Hence, we provide the runtime results of Polly without the adjunct transformation. We specifically chose to compare with Polly due to its widespread availability as part of the LLVM/Clang suite, offering some level of automatic parallelization capabilities. Our evaluation was conducted on the Mantevo HPCCG benchmark and the OpenSSL PBKDF2 implementation, with the results presented below. C transformation Data centric Hardware specificInput C file C AST Data flow graph Parallel C++ code with OpenMPCPU code generatorFPGA code generatorCUDA code generator Machine instructionsC parsingAST transformations including adjunct Transformation to data centric IRData-centric transformations C++ compilation Fig. 15. Workflow for C program optimization including adjunct transformation and leveraging the DaCe data-centric framework. 5.1 OpenSSL PBKDF2 The Password Based Key Derivation Function 2 (PBKDF2) is a derivation function that derives a cryptographically secure key from a password. PBKDF2 applies a Hash-based Message Authentica- tion Code (HMAC) function multiple times to a provided password and salt. Based on the required length of the key this process is repeated multiple times. A summary of the process is provided in Figure 16. We can notice that each single 𝐵𝑖blocks can be computed independently of any other. ACM Trans. Arch. Code Optim. Page 19: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 19 𝑘𝑒𝑦=𝑃𝐵𝐾𝐷𝐹 2(𝐻𝑀𝐴𝐶,𝑝𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑠𝑎𝑙𝑡,𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠,𝑘𝑒𝑦𝐿𝑒𝑛 ) =𝐵0++𝐵1++...++𝐵𝑘𝑒𝑦𝐿𝑒𝑛/ℎ𝑙𝑒𝑛 𝐵𝑖=𝑈𝑖 0⊕𝑈𝑖 1⊕...⊕𝑈𝑖 𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠 𝑈𝑖 0=𝐻𝑀𝐴𝐶(𝑝𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑠𝑎𝑙𝑡+𝑖) 𝑈𝑖 1=𝐻𝑀𝐴𝐶(𝑝𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑈𝑖 0) ... 𝑈𝑖 𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠 =𝐻𝑀𝐴𝐶(𝑃𝑎𝑠𝑠𝑤𝑜𝑟𝑑,𝑈𝑖 𝑖𝑡𝑒𝑟𝑎𝑡𝑖𝑜𝑛𝑠−1) Fig. 16. PBKDF2 scheme. In this representation, ++is the string concatenation, ⊕is the XOR operator and ℎ𝑙𝑒𝑛 is the output length of the HMAC function. 124 8 12 24051015 DaCePolly Manual OpenMP FastPBKDF23.7x3.5xTime [s]#blks=4 124 8 12 240102030 DaCePolly Manual OpenMP FastPBKDF26.2x 7.2x#blks=8 124 8 12 240204060 DaCePolly Manual OpenMP FastPBKDF28.9x 8.9x ThreadsTime [s]#blks=16 124 12 24020406080 DaCePolly Manual OpenMP FastPBKDF29.7x10.7x Threads#blks=24 PBKDF2 with various block sizes Fig. 17. Runtime of various parallelized PBKDF2 implementations using different block counts (block count determines the problem size) and using multiple thread counts. Increasing the thread count beyond the number of blocks would be ineffective as the work could not be meaningfully divided among threads. The implementation inside OpenSSL does not include parallel options. We applied our pipeline to the OpenSSL code to auto-parallelize the code. We compare the runtime against Polly and a manually parallel version that was written using the OpenSSL implementation as a starting point. Only the PBKDF2 algorithm was analyzed, all HMAC calls are treated as external calls and use the ACM Trans. Arch. Code Optim. Page 20: PREPRINT20 Lepori et al. standard OpenSSL implementation. For completeness, we compare it with FastPBKDF2 [ 16], an implementation of the same algorithm that has been developed for parallelism from the ground up. We used SHA1 as the HMAC function, 5·106iterations, and an output key size of 480 bytes (or 24 blocks), and summarize our results in Figure 17. We then varied the problem size, represented by the number of blocks. The results (Figure 17) show a consistent trend across all experiments. Our approach was able to obtain comparable results to the manually parallelized version, providing at most a 10.7x improvement with 24 threads. Polly wasn’t able to identify any parallel opportunities due to the challenges of analyzing pointers. The FastPBKDF2 implementation is considerably faster than any OpenSSL equivalent, mainly due to the serial runtime of FastPBKDF2 version being 3 times faster, 27.7s versus 84.5s for OpenSSL. This reinforces the idea that tools can provide significant performance increases and find opportunities for parallelism, but nothing surpasses finding a better algorithm altogether. 5.2 Mantevo HPCCG The Mantevo HPCCG benchmark computes the conjugate gradient on a sparse matrix. The sparse matrix is stored using the List of Lists (LIL) format. We leverage the specific transformation explained in Section 4.1 to improve analyzability. The LIL transformation is applied once only to the input file for DaCe. We compare our pipeline against Polly with the same preprocessed input. The baseline is obtained using the original benchmark with OpenMP enabled. 124 8 12 2402468 DaCe Hand-tunedPolly 2.4xSerialTime [s]Size1003 124 8 12 240102030 DaCeHand-tunedPolly 2.7xSerialSize1503 124 8 12 24020406080 DaCe Hand-tunedPolly 2.9xSerial ThreadsTime [s]Size2003 124 8 12 24050100150 DaCeHand-tunedPolly 2.9xSerial ThreadsSize2503 HPCCG with different problem sizes Fig. 18. Runtime of Mantevo HPCCG benchmark with different compilers. We can see that Polly was not able to find parallel opportunities as pointers are used to access the sparse matrix. Our approach was able to parallelize all five loops. The performance of our automatically parallelized code matches the hand-tuned version developers created across all ACM Trans. Arch. Code Optim. Page 21: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 21 experiments. At larger problem sizes, our approach even outperforms the reference implementation by up to 18%. 5.3 Lempel–Ziv–Oberhumer compression algorithm In Subsection 2.5, we presented results from a benchmark on the Lempel–Ziv–Oberhumer (LZO) compression algorithm. Although automatic parallelization wasn’t achieved, there was still a modest enhancement in single-threaded performance. Achieving effective automatic parallelization would necessitate an algorithm redesign due to the presence of loop-carried dependencies. However, our method managed to achieve a 4% runtime enhancement without requiring any algorithm rewriting. Given the analogous pointer movement patterns observed in most compression algorithms, we anticipate similar, if not superior, outcomes by extending our pipeline to encompass them as well. 6 RELATED WORK Pointer analysis in C. Pointer analysis has been a topic of research for decades. One of the most impactful approaches, the "points-to" analysis, was proposed by Andersen et al. [ 10] and developed and improved over the years [ 18,24,37]. Our approach is orthogonal to this method, and while more constrained in the patterns it detects, it provides a powerful benefit towards automatic parallelism detection. Parallelization of C programs. A large and increasing number of approaches exist that are geared toward finding parallelism in C codes, with different degrees of automation. Tools like DiscoPoP [ 43] focus on identifying loops that could be parallelized using OpenMP constructs, while frameworks like Polly [ 30] and Pluto [ 17] leverage the polyhedral representation of loops to automatically generate parallel code. Neither of these approaches can currently tackle codes using pointers as iterators, but we believe it should be possible for our approach to be leveraged in these workflows to increase their reach, as we have shown in this work with DaCe. Parallelization of programs with pointers. Utilizing the Distributed Dynamic Pascal (DDP) lan- guage [ 33] facilitates the development of parallel programs involving pointers. The DDP language introduces explicit pointer access mechanisms to enable the parallelization of pointer-related oper- ations. In contrast, our approach seamlessly integrates with existing C programs, obviating the necessity for transpilation processes. Intermediate representations. The LLVM IR [ 41] allows for many transformations and code optimizations. However, tracking pointer dependencies across scopes remains difficult. A new opportunity has appeared with the recent rise of MLIR [ 42], a framework that allows multiple IR dialects to co-exist and their different strengths to all contribute to overall code analyzability and performance. The MemRefType built-in dialect already enables to specify the shape of the underlying data. This includes the option to save the start of a data container even if the pointer is moved. MLIR would permit our approach to be implemented as a dialect, bringing it to the wider LLVM ecosystem. Source code modification. The alteration of source code within digital signal processing (DSP) programs facilitates the elimination of pointer references as outlined in the work by Franke et al. [25]. While the approach bears some resemblance to our proposed method, it exhibits comparatively reduced flexibility. It exclusively encompasses static code modifications, devoid of the incorporation of the adjunct concept, which enables the utilization of compile-time determined values exclusively. The primary aim remains the enhancement of computational efficiency, albeit without resorting to parallel processing techniques. Pointers metadata. Incorporating metadata into pointers is not a novel concept, as evidenced by previous studies [ 50,53]. In other scenarios, where the emphasis is on safety rather than ACM Trans. Arch. Code Optim. Page 22: PREPRINT22 Lepori et al. performance, statically known mappings to pointers have been established. This facilitates the addition of runtime bounds checking, analogous to how we introduce runtime pointer movements through source code modifications. 7 CONCLUSION In this work, we introduce a static transformation that separates the use of pointers as handles to data containers from their use as iterators. This improves code analyzability and data-centric compilers are shown to benefit from this transformation as eliminating indirection exposes ad- ditional parallelization opportunities. Using our approach on the Mantevo HPCCG benchmark we were able to match the developer-optimized version and even surpass it by up to 6%. On the OpenSSL PBKDF2 implementation, we were able to automatically parallelize it and obtain up to an 11x speedup. ACKNOWLEDGMENTS This project has received funding from EuroHPC-JU under grant agreement DEEP-SEA No 95560 and by the European Research Council (ERC) under the European Union’s Horizon 2020 program (grant agreement PSAP, No. 101002047). This work was partially supported by the ETH Future Computing Laboratory (EFCL), financed by a donation from Huawei Technologies. A PROOF SKETCH OF POINTER ASSIGNMENT We want to show that the assignment of pointers gives equivalent results after the application of theadjunct transformation described in Figure 4. To do so we introduce a new pointer 𝑞with its 𝑞𝑎𝑑𝑗. Let𝑛,𝑚 be arbitrary integers then we define 𝑋,𝑌 and𝑋′,𝑌′as 𝑋=(𝑝:=𝑝+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛) 𝑌=𝑣𝑖:=𝑣𝑖+𝑥𝑖|𝑥𝑖∈Z&𝑣𝑖∈{𝑝,𝑞},𝑛≤𝑖<𝑚 𝑋′=𝑝𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗+𝑥𝑖|𝑥𝑖∈Z,0≤𝑖<𝑛 𝑌′= 𝑣𝑖 𝑎𝑑𝑗:=𝑣𝑖 𝑎𝑑𝑗+𝑥𝑖|𝑥𝑖∈Z&𝑣𝑖 𝑎𝑑𝑗∈{𝑝𝑎𝑑𝑗,𝑞𝑎𝑑𝑗},𝑛≤𝑖<𝑚 Using ;to concatenate commands we define 𝑆with 𝑆=(𝑋;(𝑞:=𝑝;𝑌)) Where it represents the execution of an arbitrary number of pointer movement statements followed by a pointer assignment and another arbitrary number of pointer movement statements. In the same way we define 𝑆′as the adjunct transformation of 𝑆. 𝑆′=(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;𝑌′))) Note that the ;operator is right associative. Then we define 𝑌′ 𝑞as 𝑌′ 𝑞=𝑚∑︁ 𝑖=𝑛|𝑣𝑖 𝑎𝑑 𝑗=𝑞𝑎𝑑 𝑗𝑥𝑖 i.e. the difference of value after executing 𝑌′on𝑞𝑎𝑑𝑗. Then(𝑌′;𝑞𝑎𝑑𝑗)=𝑞𝑎𝑑𝑗+𝑌′ 𝑞. Note that𝑋and 𝑋′only act on 𝑝and𝑝𝑎𝑑𝑗respectively. This is to have a simpler proof but it can be easily shown that by defining 𝑋and𝑋′in a similar way as 𝑌and𝑌′the same result will be obtained. For sake of a shorter syntax and more readable proof we omit the declaration of the pointers and adjunct integer variables. I.e. (𝑝:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑝𝑎𝑑𝑗:=0;(𝑞:=𝑚𝑎𝑙𝑙𝑜𝑐(...);(𝑞𝑎𝑑𝑗:=0;(𝑆;𝑞))))) ACM Trans. Arch. Code Optim. Page 23: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 23 as it will be carried without modifications through all the proof steps. Then we can derive 𝑆;𝑞=𝑋;(𝑞:=𝑝;(𝑌;𝑞)) (adjunct transformation) =𝑋;(𝑞:=𝑝;(𝑞+(𝑌′;𝑞𝑎𝑑𝑗))) (distributive) =𝑋;((𝑞:=𝑝;𝑞)+(𝑞:=𝑝;(𝑌′;𝑞𝑎𝑑𝑗))) (no effect statement) =𝑋;((𝑞:=𝑝;𝑞)+(𝑌′;𝑞𝑎𝑑𝑗)) (assignment) =𝑋;(𝑝+(𝑌′;𝑞𝑎𝑑𝑗)) (distributive) =(𝑋;𝑝)+(𝑋;(𝑌′;𝑞𝑎𝑑𝑗)) (adjunct transformation) =((𝑋′;𝑝)+(𝑋′;𝑝𝑎𝑑𝑗))+(𝑋;(𝑌′;𝑞𝑎𝑑𝑗)) (no effect statement) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+(𝑋;(𝑌′;𝑞𝑎𝑑𝑗)) (no effect statement) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+(𝑌′;𝑞𝑎𝑑𝑗) (sum expantion of adjunct) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+𝑞𝑎𝑑𝑗+𝑌′ 𝑞 (adjunct value of 0) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+𝑌′ 𝑞 (no effect statement) =𝑝+(𝑋′;𝑝𝑎𝑑𝑗)+(𝑋′;𝑌′ 𝑞) (distributive) =𝑝+(𝑋′;(𝑝𝑎𝑑𝑗+𝑌′ 𝑞)) (no effect statement) =𝑝+(𝑋′;(𝑞:=𝑝;(𝑝𝑎𝑑𝑗+𝑌′ 𝑞))) (assignment) =𝑝+(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;𝑞𝑎𝑑𝑗+𝑌′ 𝑞)))) (𝑌′ 𝑞definition) =𝑝+(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;(𝑌′;𝑞𝑎𝑑𝑗)))) (S’ definition) =𝑝+(𝑆′;𝑞𝑎𝑑𝑗) (no effect statement) =(𝑋′;𝑝)+(𝑆′;𝑞𝑎𝑑𝑗) (assignment) =(𝑋′;(𝑞:=𝑝;𝑞))+(𝑆′;𝑞𝑎𝑑𝑗) (no effect statement) =(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;𝑞)))+(𝑆′;𝑞𝑎𝑑𝑗) (no effect statement) =(𝑋′;(𝑞:=𝑝;(𝑞𝑎𝑑𝑗:=𝑝𝑎𝑑𝑗;(𝑌′;𝑞))))+(𝑆′;𝑞𝑎𝑑𝑗) (S’ definition) =(𝑆′;𝑞)+(𝑆′;𝑞𝑎𝑑𝑗)=𝑞+(𝑆′;𝑞𝑎𝑑𝑗) Now we have proven the equivalence of the pointer assignment transformation stated in Figure 4. REFERENCES [1][n. d.]. BTRFS compression documentation. https://btrfs.readthedocs.io/en/latest/Compression.html. Accessed: 2024-09-05. [2] [n. d.]. lib, initramfs: Add initramfs LZO compression. https://lore.kernel.org/all/1238593252-3435-1-git-send-email- andr345@gmail.com/. Accessed: 2024-09-05. [3] [n. d.]. Linux kernel repository. https://github.com/torvalds/linux. Accessed: 2024-14-05. [4] [n. d.]. OpenSSL repository. https://github.com/openssl/openssl. Accessed: 2024-14-05. [5] [n. d.]. Squashfs documentation. https://dr-emann.github.io/squashfs/squashfs.html. Accessed: 2024-09-05. [6] [n. d.]. TurboBench repository. https://github.com/powturbo/TurboBench. Accessed: 2024-14-05. [7][n. d.]. zram documentation. https://www.kernel.org/doc/html/next/admin-guide/blockdev/zram.html. Accessed: 2024-09-05. [8] [n. d.]. zswap documentation. https://www.kernel.org/doc/html/v4.18/vm/zswap.html. Accessed: 2024-09-05. [9] Javed Absar. 2018. Scalar evolution-demystified. In European LLVM Developers Meeting . [10] Lars Ole Andersen and Peter Lee. 2005. Program Analysis and Specialization for the C Programming Language. [11] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, et al .2006. The landscape of parallel computing research: A view from berkeley. (2006). [12] Gordon Bell. 1994. Scalable, parallel computers: alternatives, issues, and challenges. International Journal of Parallel Programming 22, 1 (1994), 3–46. ACM Trans. Arch. Code Optim. Page 24: PREPRINT24 Lepori et al. [13] MA Belyaev, M Akhin, and VM Itsykson. 2014. Improving static analysis by loop unrolling on an arbitrary iteration. University Scientific Journal 8 (2014), 154–168. [14] Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, and Torsten Hoefler. 2019. State- ful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19) . [15] M.D. Beynon, H. Andrade, and J. Saltz. 2002. Low-cost non-intrusive debugging strategies for distributed parallel programs. In Proceedings. IEEE International Conference on Cluster Computing . 439–442. https://doi.org/10.1109/ CLUSTR.2002.1137778 [16] Joseph Birr-Pixton. 2015. PBKDF2: performance matters. https://jbp.io/2015/08/11/pbkdf2-performance-matters.html. Accessed: 2023-01-31. [17] Uday Bondhugula, Muthu Baskaran, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. 2008. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model. In Compiler Construction , Laurie Hendren (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 132–146. [18] Martin Bravenboer and Yannis Smaragdakis. 2009. Strictly Declarative Specification of Sophisticated Points-to Analyses. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (Orlando, Florida, USA) (OOPSLA ’09) . Association for Computing Machinery, New York, NY, USA, 243–262. https://doi.org/10.1145/1640089.1640108 [19] Alexandru Calotoiu, Tal Ben-Nun, Grzegorz Kwasniewski, Johannes de Fine Licht, Timo Schneider, Philipp Schaad, and Torsten Hoefler. 2022. Lifting C Semantics for Dataflow Optimization. In Proceedings of the 36th ACM International Conference on Supercomputing (Virtual Event) (ICS ’22) . Association for Computing Machinery, New York, NY, USA, Article 17, 13 pages. https://doi.org/10.1145/3524059.3532389 [20] Stephen Cass. 2018. Chip Hall of Fame: Intel 4004 Microprocessor. https://spectrum.ieee.org/chip-hall-of-fame-intel- 4004-microprocessor. Accessed: 2023-01-05. [21] Jong-Deok Choi, Barton P. Miller, and Robert H. B. Netzer. 1991. Techniques for Debugging Parallel Programs with Flowback Analysis. ACM Trans. Program. Lang. Syst. 13, 4 (oct 1991), 491–530. https://doi.org/10.1145/115372.115324 [22] L. Dagum and R. Menon. 1998. OpenMP: an industry standard API for shared-memory programming. IEEE Computa- tional Science and Engineering 5, 1 (1998), 46–55. https://doi.org/10.1109/99.660313 [23] Anthony Danalis, Heike Jagode, and Jack Dongarra. 2018. PAPI: Counting outside the Box. (2018-04 2018). [24] Maryam Emami, Rakesh Ghiya, and Laurie J. Hendren. 1994. Context-Sensitive Interprocedural Points-to Analysis in the Presence of Function Pointers. In Proceedings of the ACM SIGPLAN 1994 Conference on Programming Language Design and Implementation (Orlando, Florida, USA) (PLDI ’94) . Association for Computing Machinery, New York, NY, USA, 242–256. https://doi.org/10.1145/178243.178264 [25] Björn Franke and Michael O’Boyle. 2001. Compiler Transformation of Pointers to Explicit Array Accesses in DSP Applications. In Compiler Construction , Reinhard Wilhelm (Ed.). Springer Berlin Heidelberg, Berlin, Heidelberg, 69–85. [26] Roy Frostig, Matthew Johnson, and Chris Leary. 2018. Compiling machine learning programs via high-level tracing. https://mlsys.org/Conferences/doc/2018/146.pdf [27] Fabrizio Gagliardi, Miquel Moreto, Mauro Olivieri, and Mateo Valero. 2019. The international race towards Exascale in Europe. CCF Transactions on High Performance Computing 1, 1 (2019), 3–13. [28] GitHub. 2022. GitHub API, repository count. https://api.github.com/search/repositories?q=language:Ccreated:<=2022- 12-31. Accessed: 2023-01-02. [29] GitHub. 2022. GitHub Top Programming Languages. https://octoverse.github.com/2022/top-programming-languages. Accessed: 2023-01-02. [30] TOBIAS GROSSER, ARMIN GROESSLINGER, and CHRISTIAN LENGAUER. 2012. POLLY — PERFORMING POLYHE- DRAL OPTIMIZATIONS ON A LOW-LEVEL INTERMEDIATE REPRESENTATION. Parallel Processing Letters 22, 04 (2012), 1250010. https://doi.org/10.1142/S0129626412500107 arXiv:https://doi.org/10.1142/S0129626412500107 [31] Tobias Grosser, J. Ramanujam, Louis-Noel Pouchet, P. Sadayappan, and Sebastian Pop. 2015. Optimistic Delinearization of Parametrically Sized Arrays. In Proceedings of the 29th ACM on International Conference on Supercomputing (Newport Beach, California, USA) (ICS ’15) . Association for Computing Machinery, New York, NY, USA, 351–360. https: //doi.org/10.1145/2751205.2751248 [32] Richard Günther. 2006. Profile driven loop transformations. In GCC Developers’ Summit . Citeseer, 49. [33] R. Gupta. 1992. SPMD execution of programs with dynamic data structures on distributed memory machines. In Proceedings of the 1992 International Conference on Computer Languages . 232–241. https://doi.org/10.1109/ICCL.1992. 185487 [34] ISO. 1999. ISO C Standard 1999 . Technical Report. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf Subsection 6.5.2 Postfix operators. ACM Trans. Arch. Code Optim. Page 25: PREPRINTIterating Pointers: Enabling Static Analysis for Loop-based Pointers 25 [35] Andrei Ivanov, Nikoli Dryden, Tal Ben-Nun, Shigang Li, and Torsten Hoefler. 2020. Data Movement Is All You Need: A Case Study on Optimizing Transformers. https://doi.org/10.48550/ARXIV.2007.00072 [36] Jason Kane and Qing Yang. 2012. Compression Speed Enhancements to LZO for Multi-core Systems. In 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing . 108–115. https://doi.org/10.1109/ SBAC-PAD.2012.29 [37] George Kastrinis and Yannis Smaragdakis. 2013. Hybrid Context-Sensitivity for Points-to Analysis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (Seattle, Washington, USA) (PLDI ’13). Association for Computing Machinery, New York, NY, USA, 423–434. https://doi.org/10.1145/2491956.2462191 [38] Yeonsoo Kim, Shinhyung Yang, Shin-Dug Kim, and Bernd Burgstaller. 2019. Code Optimizations for In-network Processing: The State of Scalar Evolution in LLVM: The State of Scalar Evolution in LLVM. Proceedings of the Korea Telecommunications Society (2019), 434–437. [39] Maria Kotsifakou, Prakalp Srivastava, Matthew D. Sinclair, Rakesh Komuravelli, Vikram Adve, and Sarita Adve. 2018. HPVM: Heterogeneous Parallel Virtual Machine. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Vienna, Austria) (PPoPP ’18) . Association for Computing Machinery, New York, NY, USA, 68–80. https://doi.org/10.1145/3178487.3178493 [40] Thomas Kotzmann and Hanspeter Mössenböck. 2005. Escape analysis in the context of dynamic compilation and deoptimization. In Proceedings of the 1st ACM/USENIX International Conference on Virtual Execution Environments (Chicago, IL, USA) (VEE ’05) . Association for Computing Machinery, New York, NY, USA, 111–120. https://doi.org/10. 1145/1064979.1064996 [41] C. Lattner and V. Adve. 2004. LLVM: a compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004. 75–86. https://doi.org/10.1109/CGO. 2004.1281665 [42] Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A Compiler Infrastructure for the End of Moore’s Law. arXiv:2002.11054 [cs.PL] [43] Zhen Li, Rohit Atre, Zia Ul-Huda, Ali Jannesari, and Felix Wolf. 2015. DiscoPoP: A profiling tool to identify parallelization opportunities. In Tools for High Performance Computing 2014: Proceedings of the 8th International Workshop on Parallel Tools for High Performance Computing, October 2014, HLRS, Stuttgart, Germany . Springer, 37–54. [44] Message Passing Interface Forum. 2021. MPI: A Message-Passing Interface Standard Version 4.0 . https://www.mpi- forum.org/docs/mpi-4.0/mpi40-report.pdf [45] Gordon E. Moore. 2006. Cramming more components onto integrated circuits, Reprinted from Electronics, volume 38, number 8, April 19, 1965, pp.114 ff. IEEE Solid-State Circuits Society Newsletter 11, 3 (2006), 33–35. https: //doi.org/10.1109/N-SSC.2006.4785860 [46] Markus F.X.J. Oberhumer. 2017. Lempel–Ziv–Oberhumer. https://www.oberhumer.com/opensource/lzo/. Accessed: 2024-09-05. [47] Geraldo F. Oliveira, Juan Gómez-Luna, Lois Orosa, Saugata Ghose, Nandita Vijaykumar, Ivan Fernandez, Mohammad Sadrosadati, and Onur Mutlu. 2021. DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access 9 (2021), 134457–134502. https://doi.org/10.1109/ACCESS.2021.3110993 [48] Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Ama- rasinghe, and Frédo Durand. 2017. Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. Commun. ACM 61, 1 (dec 2017), 106–115. https://doi.org/10.1145/3150211 [49] Dennis M. Ritchie. 1996. The Development of the C Programming Language . Association for Computing Machinery, New York, NY, USA, 671–698. https://doi.org/10.1145/234286.1057834 [50] Matthew S. Simpson and Rajeev K. Barua. 2010. MemSafe: Ensuring the Spatial and Temporal Memory Safety of C at Runtime. In 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation . 199–208. https: //doi.org/10.1109/SCAM.2010.15 [51] Yulei Sui, Xiaokang Fan, Hao Zhou, and Jingling Xue. 2018. Loop-Oriented Pointer Analysis for Automatic SIMD Vectorization. ACM Trans. Embed. Comput. Syst. 17, 2, Article 56 (jan 2018), 31 pages. https://doi.org/10.1145/3168364 [52] TIOBE. 2022. TIOBE index. https://www.tiobe.com/tiobe-index/. Accessed: 2023-01-02. [53] Jie Zhou, John Criswell, and Michael Hicks. 2023. Fat Pointers for Temporal Memory Safety of C. Proc. ACM Program. Lang. 7, OOPSLA1, Article 86 (apr 2023), 32 pages. https://doi.org/10.1145/3586038 ACM Trans. Arch. Code Optim.

---