loader
Generating audio...

arxiv

Paper 2405.18512

Understanding Transformer Reasoning Capabilities via Graph Algorithms

Authors: Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, Vahab Mirrokni

Published: 2024-05-28

Abstract:

Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their algorithmic reasoning capabilities in realistic parameter regimes is lacking. We investigate this question in terms of the network's depth, width, and number of extra tokens for algorithm execution. Our novel representational hierarchy separates 9 algorithmic reasoning problems into classes solvable by transformers in different realistic parameter scaling regimes. We prove that logarithmic depth is necessary and sufficient for tasks like graph connectivity, while single-layer transformers with small embedding dimensions can solve contextual retrieval tasks. We also support our theoretical analysis with ample empirical evidence using the GraphQA benchmark. These results show that transformers excel at many graph reasoning tasks, even outperforming specialized graph neural networks.

Paper Content:
Page 1: arXiv:2405.18512v1 [cs.LG] 28 May 2024Understanding Transformer Reasoning Capabilities via Graph Algorithms Clayton Sanford1,2∗, Bahare Fatemi1, Ethan Hall3, Anton Tsitsulin1, Mehran Kazemi4, Jonathan Halcrow1, Bryan Perozzi1, and Vahab Mirrokni1 1Google Research,2Columbia University,3Google,4Google DeepMind Abstract Which transformer scaling regimes are able to perfectly sol ve different classes of algorithmic problems? While tremendous empirical advan ces have been at- tained by transformer-based neural networks, a theoretica l understanding of their algorithmic reasoning capabilities in realistic paramete r regimes is lacking. We investigate this question in terms of the network’s depth, w idth, and number of extra tokens for algorithm execution. Our novel representa tional hierarchy sepa- rates 9 algorithmic reasoning problems into classes solvab le by transformers in different realistic parameter scaling regimes. We prove th at logarithmic depth is necessary and sufficient for tasks like graph connectivity, while single-layer trans- formers with small embedding dimensions can solve contextu al retrieval tasks. We also support our theoretical analysis with ample empiric al evidence using the GraphQA benchmark. These results show that transformers ex cel at many graph reasoning tasks, even outperforming specialized graph neu ral networks. 1 Introduction The transformer neural network architecture, which was ini tially introduced for neural machine translation [ 5,75], quickly became the standard neural network architecture across many fields, powering recent breakthroughs in language modeling [ 18,64,11,74,2,73,65], computer vision [ 19, 47], and natural sciences [ 34,51]. Across fields, transformers have superseded other archit ectures, surpassing them in downstream performance while maintaini ng reasonable computational footprint. How can we analyze reasoning capabilities of neural network s? One approach is to study algorithms executable with their internal representations. Neural al gorithmic reasoning [ 28,87,32,35,76] is a field of research dedicated to exploring such capabilities. Algorithmic execution is desirable because models use it to generalize out-of-distribution [ 90,44] and scale to larger problem sizes [ 91]. In this work, we focus on classes of transformers solving alg orithmic reasoning problems on graphs. Why graph problems in particular? Recent research by Besta e t al. [ 9] suggests that graphs are an ideal abstraction of complex reasoning with dependencies, including chain-of-thought [ 80] and its generalizations [ 84,16,39,38]. Furthermore, we investigate graph reasoning in order to e valuate transformer capabilities compared to specialized models t hat explicitly capture the structure of data. Graph neural networks (GNNs) [ 70,42,26,6,14] offer strong baselines for algorithmic reason- ing [77,12] and comprehensive theoretical frameworks for their capab ilities and limitations [ 49,82]. The complexity of graph reasoning tasks varies greatly; som e are easily solved by non-graph archi- tectures and others require sophistication beyond standar d GNNs [ 88]. This makes graph tasks a compelling testbed for both theoretical and empirical eval uations. While transformer-based models are commonly optimized for downstream performance [ 36,30,62], theoretical investigations of transformer capabilities i n realistic parameter regimes have been limited. ∗Correspondence: clayton.h.sanford@gmail.com orbaharef@google.com . Preprint. Under review. Page 2: v1v2 v3v4 v5 v6 v7v8GraphG Task : Arev2andv4connected?v1 v2 v3 v4 v5 v6 v7 v8 v1,v2 v2,v3 v3,v4 v6,v7 v6,v8 v7,v8 v2,v4InputXVertex tokens Edge tokens Task tokensEmbedding Self-Attention MLP Self-Attention MLP Self-Attention . . .Transformer f ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ ∅ YesTargetf(X) Figure 1: The graph encoding scheme employed in our theoreti cal and empirical analysis that presents a graph reasoning task (e.g. connectivity) as a tok enized input to a standard transformer model. We analyze the capabilities of standard transformer models to solve graph reasoning tasks by em- ploying a graph tokenization scheme similar to [ 40]; see Figure 1for a graph encoding example. We generalize the insights of [ 69]—which established that graph connectivity can be compute d in a more depth-efficient manner with transformers than GNNs—t o broader families of graph reason- ing tasks. We also study the representational impacts of blank tokens , which can be thought of as a theoretically-tractable version of either chain-of-tho ught prompting [ 80], scratch space [ 59], or pause/filler tokens [ 27,63]. We conduct an extensive study of graph reasoning problems wi th transformers from both theoretical and empirical perspectives. We summarize our principal con tributions as follows: 1. We introduce a novel representational hierarchy of graph reasoning tasks that formalizes rea- soning capabilities of transformers in several realistic p arameter scaling regimes. This includes two graph reasoning task classes—which we refer to as the parallelizable andsearch tasks and include well-known algorithmic tasks like connectivity andshortest path respectively. 2. We prove that logarithmic-depth transformers are necess ary and sufficient to solve paralleliz- able tasks in a highly parameter-efficient manner; similar c onstructions for search tasks employ much larger networks. We distinguish both of these classes f rom an easier family of problems— retrieval tasks , such as node count andedge existence —by showing that retrieval tasks can be efficiently computed by single-layer transformers, while p arallelizable and search tasks cannot. 3. We empirically validate our representational contrast b y showing that transformers outperform GNNs on tasks that require the analysis of long-range depend encies, including parallelizable tasks like connectivity. Our experiments suggest that GNNs perform better on simpler tasks that require only local analysis of neighboring nodes in low samp le-complexity regimes, benefiting from their well-aligned inductive bias. 2 Related work Fundamental capabilities of transformers. Early representational research [ 86,60,79] estab- lished the universality of transformers by simulating Turi ng machines step-by-step, albeit in an unre- alistic polynomial-depth scaling regime. These universal ity results were extended to bounded-depth transformers with chain-of-thought tokens [ 54,50], but with a persistent gap between positive and negative results. A more fine-grained theoretical approach established the limitations of bounded- size transformers by relating them to threshold circuits, i mplying that L-complete tasks like graph connectivity are unsolvable by constant-depth transforme rs [53,52,55]. However, these circuit 2 Page 3: complexity reductions are not bidirectional and their posi tive results pertaining to classes of regular languages [ 29] reveal little about when tasks like connectivity canbe expressed. The closest analytical framework to ours characterizes tra nsformers as distributed computing pro- tocols by quantifying the informational bandwidth of their self-attention units. This line of work sharply separates transformers from other architectures a nd provides width and depth separa- tions [ 68,69]. The ability of transformers to simulate finite-state auto mata in logarithmic depth provides a further roadmap for how transformers efficiently leverage parallel computation [ 46]. Empirical analysis of transformer reasoning capabilities .Multiple empirical investigations ex- plore the algorithmic capabilities of transformer-based m odels [ 89,45,44]. Graph reasoning prob- lems have been used to evaluate capabilities of transformer -based LLMs [ 22,78,85,72,31] includ- ing works with graph-specific GNN adapters [ 13,61]. In the empirical part of our study, we use the GraphQA dataset [ 22]—which was evaluated initially on LLM prompting appraoche s and later on GNN-augmented prompts [ 60]—to validate our complexity hierarchy and compare transfo rmers with GNNs and prompt-based LLMs on a selection of graph reaso ning tasks. Transformers and graph neural networks. GNNs provide a useful benchmark for model expres- sivity on graph inputs. For instance, the inability of GNNs t o efficiently solve “global structure” tasks like subgraph connectivity made evident by a connecti on to the C ONGEST distributed com- puting model [ 49]. A further connection [ 57] to the Weisfeiler-Lehman (WL) isomorphism heuris- tic [81] captures the inability of feature-less GNNs to distinguis h certain graph instances [ 82]. See Appendix Dfor a further discussion of the representational limitatio ns of GNNs. Graph transformers [ 20,66,56] integrate the transformer architecture with graph-struc tured data. While GNNs can simulate the WL test [ 1], transformers can simulate GNNs and exceed their WL limitations [ 88]; they are strictly more expressive than 2-WL [ 40]. Despite our focus on standard transformers, our empirical results in Section 4include comparisons to a wide range of GNNs. 3 Hardness taxonomy of transformer graph reasoning tasks We provide our core result: a rigorous quantification of the h ardness of graph reasoning tasks for transformer-based models. While graph reasoning tasks, li ke other algorithmic problems, can be categorized into well-known computational and circuit com plexity classes (e.g. TC0,L,NL,NP, NP), the relationship between membership in these classes and the hardness of solving a task with a parameter-efficient neural network is not immediately obv ious. Our hierarchy bridges the gap between these worst-case computational classes and the rep resentational capabilities of bounded- size transformers of different parameter scaling regimes. These regimes include transformers whose depthLscales with the input sequence length N; this contrasts with most theoretical results, which study on the constant-depth regime. Our positive and negative theoretical results employ the tr ansformer model of [ 69], which is pre- sented in Appendix A.1and assumes that the embedding dimension mgrows less rapidly than than the sequence length Nand that the multi-layer perceptrons (MLPs) are arbitrary f unctions. The result is a model of a transformer as a bounded-capacity communication protocol . In this model, ar- bitrary functions of each individual embedding vector can b e computed, but the interactions between these vectors are restricted by the low-rank of the attentio n matrix. The relevance of this model is motivated by the rapid scaling in the context lengths of mode rn transformers in recent years and the high ratio of MLP parameter count to the embedding dimension each operates on. This model also permits the inclusion of blank pause token inputs of [ 27], which provide additional computational power to the transformer model by extending t he computational “tape” without intro- ducing new information about the input. These results divide the tasks that are we empirically inves tigate in Section 4into three difficulty families based on the hardness of solving the task with paral lel computation. 1.Retrieval tasks —including node count, edge count, edge existence, and node degree—are tasks that can intuitively be solved by a single lookup step or glob al aggregation. These are the easiest tasks in our framework, and we show in Section 3.3that these retrieval tasks can be computed by a single-layer transformer with small embedding dimensi on. (In contrast, all other examined tasks cannot be solved in that regime.) 3 Page 4: LDP LDW SearchParallelizable LD D1 Retrieval (a) The complexity hierarchyTask class Example tasks Complexity Retrieval (§ 3.3) Node count D1 L= 1 Edge count D1 m=O(logN) Edge existence D1 Node degree D1 Parallelizable (§ 3.1) Connectivity LD L=O(logN) Cycle check LDP∩LDW m=O(Nǫ) Bipartiteness LDP∩LDW Search (§ 3.2) Shortest path LDW L=O(logN) Diameter LDW m=O(N1/2+ǫ) (b) Example tasks and their complexity. Figure 2: A summary of the theoretical hierarchy of Section 3that visualizes which type of graph reasoning tasks can be solved in which transformer scaling r egime (Depth 1(D1),LogDepth (LD), LogDepthWide (LDW ) andLogDepthPause (LDP )). 2.Parallelizable tasks —including connectivity, connected nodes, and cycle check from the ex- perimental results and a host of other graph reasoning tasks , such as bipartiteness, planarity, and minimum spanning forest—are non-trivial tasks that can be s olved efficiently in a parallel compu- tation setting. Section 3.1establishes that these tasks can be solved by bounded-size t ransformers with logarithmic depth. 3.Search tasks —including shortest path and other tasks like diameter and d irected reachability— comprise a harder family of tasks that are less easily solved by parallel algorithms. In Section 3.2, we prove that these tasks belong in an equivalence class and e xhibit large-model scaling regimes where they can be computed. The representational hardness of these classes is quantifie d by several results that determine whether transformers that obey different parameter scaling rules c an compute them. We define the following scaling regimes for the depth L, embedding dimension m, and number of “pause” tokens N′of a family of transformers as a function of the size of the input g raph,N=O(|V|+|E|). •Depth 1(D1): Single-layer multi-headed transformers with small embe dding dimension m= O(logN)without any pause tokens. •LogDepth (LD): Transformers with depth L=O(logN), embedding dimension m=O(Nǫ)for any fixedǫ>0, and no pause tokens. •LogDepthPause (LDP ): Transformers with the same depth and width constraints as LogDepth , with at most N′= poly(N)blank “pause” tokens appended to the input sequence. •LogDepthWide (LDW ): Transformers with depth L=O(logN), embedding dimension m= O(N1/2+ǫ), and no pause tokens. Our positive and negative results that relate scaling regim es and graph reasoning tasks are displayed in Figure 2. We present high-level result summaries in the following se ctions with proofs in Ap- pendix. The most technically significant results concern LogDepth models and are provided in Appendix B. These bounds are consequences of Theorem 1, an improved analysis of the relationship between transformers and the massively parallel computation (MPC) distributed computing model of [ 37]2. This connection between MPC and transformers is a sharp impr ovement of a similar result by [ 69]. Theorem 1 (Simplified version of Theorem 8).For constant δ,ǫ >0, anyR-round MPC protocol withNmachines with O(Nδ)bits of local memory each can be simulated by a transformer of depth L=O(R)and embedding dimension m=O(Nδ+ǫ). 2MPC is a theoretical model of the MapReduce computational pa radigm that distributed computation among a large number of machines with restricted local memory size and communication bandwidth. We formally introduce the MPC model in Appendix B.1. 4 Page 5: Results pertaining to Depth 1are stated in detail and proved in Appendix C. In addition, we discuss the triangle counting task (and the more general clique coun ting task) in Section 3.4, where we show a distinct result for much smaller depth ( L=O(loglogN)) that includes pause tokens. 3.1 Parallelizable tasks and LogDepth transformers We define the family of parallelizable tasks to consist of graph reasoning tasks that are L-complete and are equivalent to graph connectivity in O(1)-rounds, as proved by [ 58]. This family includes (but is not limited to): graph connectivity, minimum spanni ng forest, cycle detection, st-connectivity, number of connected components, graph bipartiteness, plan arity testing, and one-cycle vs two-cycles testing. While graph connectivity and minimum spanning for est were shown to be computable by logarithmic depth transformers with small polynomial widt h (LogDepth ) by previous work [ 69], this poses broader questions: Are all parallelizable graph reas oning tasks computable by a transformer of logarithmic depth and small embedding dimension? And do s ub-logarithmic-depth transformers exist that solve any parallelizable tasks? We first show that all parallelizable tasks can be solved in tw o logarithmic-depth scaling settings. Theorem 2. For any parallelizable task, there exists transformers in LogDepthPause and LogDepthPause that solve the task. Theorem 2is stated formally as Theorem 18and proved in Appendix B.2.1 . Both components of the theorem are a consequence of a novel relationship between th e MPC model and transformers (Theo- rem1) and the analysis of MPC protocols for graph reasoning tasks by [58]. TheLogDepthPause re- sult is the direct implication of an O(1)-round MPC equivalence (Theorem 9) between all paralleliz- able tasks and an MPC protocol that solves the connectivity t ask (Theorem 11). TheLogDepthWide bound is a consequence of Theorem 15, which shows that all languages in NC2(including all lan- guages in LandNL) can be evaluated by an MPC protocol with O(logN)rounds and with local memoryO(N1/2+ǫ)[58]. We further prove the conditional optimality of logarithmic depth. Theorem 3. Conditional on Conjecture 13, any transformer that solves some parallelizable task with widthmH=O(Nǫ)and pause tokens N′= poly(N)must have depth L= Ω(logN). This result (stated formally as Theorem 19is a combination of the conditional depth lower bound on graph connectivity by [ 69] and theO(1)-round equivalence between all parallel tasks. 3.2 Search tasks Search tasks are similarly defined to be those that are NL-complete and equivalent to shortest path in O(1)-rounds of MPC and include shortest path, strong connectivi ty,st-reachability, radius, diameter, and median. Like before, the O(1)-round MPC equivalence translates to an O(1)-depth equivalence in transformers. We give a similar positive result for LogDepthWide transformers; whether these tasks can be solved by LogDepthPause transformers is unknown. Theorem 4. For any search task, there exists a transformer in LogDepthWide that sovles the task. This theorem, which is restated in Appendix B.2.2 as Theorem 21, is also an immediate consequence of Theorem 15. While the minimum depth of a transformer with small embeddin g dimension that solves a search task is not identified, we prove that the minimum depth needed to solve some all search task is approximately equivalent in Theorem 22. 3.3 Retrieval tasks and Depth 1transformers Graph tasks whose algorithms consist of a single look-up or a ggregation step can be efficiently solved by single-layer transformers. This result assumes t hat the graph G= (V,E)is encoded as some input sequence Xof lengthN=O(|V|+|E|)that expresses each edge and vertex a single token. This encoding scheme is detailed in Appendix C. Theorem 5. For any retrieval task (including node count, edge count, ed ge existence, and node degree) there exists a transformer in Depth 1that solves the task. 5 Page 6: We formalize this statement in Theorem 36and prove it their in Appendix C.1. These rely on proving the existence of a useful input MLP φthat precomputes embeddings with useful structure for all retrieval tasks. In contrast, we show that a collection of parallelizable and search tasks cannot be efficiently solved by transformers in Depth 1. Theorem 6. Any single-layer transformer that solves the graph connect ivity, shortest path, or cycle detection task has width satisfying mH=˜Ω(N). The proof of the formal counterpart of this statement (Theor em38) appears in Appendix C.2and is a consequence of a standard communication complexity argum ent. A more generalized result than Theorem 6was proved by [ 55], which establishes that all problems outside of NC1—which include allL-complete and NL-complete languages, and hence, all parallelizable and sea rch tasks—cannot be solved by constant-depth transformers with polynomial width because they cannot be c omputed byTC0(constant-depth threshold circuits). We nonetheless incl ude this theorem to demonstrate a clean lower bound that applies to very simple input graph ins tances. 3.4 Triangle counting We finally construct depth-efficient transformers for trian gle counting due to the MPC algorithms of [10]. Unlike previous positive results, which applied uniform ly across all graphs instances of bounded size, the complexity of the corresponding transfor mers for triangle counting is a function of the arboricity3of the input graph. When the arboricity grows sub-polynomia lly withN—as is the case for bounded-degree graphs—no pause tokens are nece ssary. Unlike the parallelizable and search classes of problems, strictly sub-logarithmic dept h is attainable with pause tokens, even for worst-case graphs. Theorem 7. There exists a transformer that computes the number of trian gles in any input graph of arboricity αand has embedding dimension m=O(Nǫ), depthL=O(loglogN), and pause tokens N′=/braceleftbiggO(αN1−ǫ)ifα= Ω(Nǫ) 0 otherwise. This result is a special case of Theorem 23, a more general result about clique counting that appears in Appendix B.2.3 . Theoretical conclusions. These results provide a tight characterization of the the re asoning capa- bilities of transformers whose depth, width, and input padd ing conform to different scaling regimes. They strengthen the established connection between transf ormers and massively parallel computa- tion (MPC) [ 69] and generalize the resulting representational bounds to b roader categories of graph tasks. We conclude that the logarithmic-depth regime is apt for for considering tasks in LandNL, which had previous illuminated the limitations of transfor mers with constant depth and a limited number of chain-of-thought tokens [ 54]. While expressivity does not imply learnability, these th e- oretical benchmarks sharply characterize the fundamental limitations of transformers and coincide with experimental results conveyed in the subsequent secti on. 4 Empirical graph reasoning capabilities We further illuminate the reasoning capabilities of transf ormers by conducting an empirical inves- tigation of the abilities of a variety of neural architectur e and training settings to learn graph al- gorithmic tasks. We use the GraphQA benchmark tasks [ 22] for our experiments. We evaluate standard autoregressive transformers—both small models ( at most 60M parameters) trained from scratch and fine-tuned (FT) T5-11B model (with 11B parameter s) [64]. For the fine-tuned models, we explore task-specific fine-tuning—and contrast those res ults with graph neural networks (GNNs) and prompting-based methods on pre-trained LLMs. These experimental results validate key tenets of our theor etical results and demonstrate the utility of transformers’ algorithmic reasoning capabilities. Our pr incipal empirical conclusions are as follows: 3Thearboricity is the minimum number of spanning forests needed to cover the graph, which grows at most linearly with the degree of the graph. 6 Page 7: # of training samples Model 1K 100K GCN [ 42] 83.8 83.8 MPNN [ 26] 94.0 94.4 GIN [ 82] 93.8 94.0 60M transformer 92.9 98.0 11B transformer (FT) 98.4 — (a) Connectivity classification accuracy for trained transformers and GNNs.1021031041050.90.951 Number of Training Samples (log scale)Accuracy Train Test (b) Connectivity classification accuracy of 60M transformers by training set size. Figure 3: Accuracy of a variety of trained transformers and G NNs on the connectivity task. 1.Transformers excel at global reasoning tasks. Transformers outperform GNNs on tasks that require efficiently aggregating information about distant nodes in a graph, such as connectivity and shortest path. 2.GNNs uncover local graph structure with few samples. While transformers are capable of efficiently expressing all graph learning tasks under inves tigating, the structural limitations of GNNs provide them with favorable inductive biases for intri nsically local tasks, such as cycle check and node degree, and permit them to outperform transfo rmers in a low-sample regime. 3.Trained transformers outperform LLM prompting. Transformers trained to explicitly solve graph reasoning tasks consistently attain greater accurac y across tasks than a variety of prompting strategies applied to more recent larger LMs. A comprehensive evaluation of each GraphQA task on every tra ining setting appears in Appendix E, in addition details about transformer training, the GraphQ A benchmark, and alternative GNN and prompting approaches. 4.1 Transformers excel at global reasoning tasks As indicated in Section 3, graph reasoning algorithms can be categorized based on the extent to which they entail aggregating “local” information about no des and their immediate neighbors or modeling “global” connections between nodes separated by a long distances. This section investi- gates the following question about transformers and long-d istance reasoning: When do transformers outperform GNNs on tasks that require g lobal reasoning? We consider two tasks that require reasoning across long dis tances in a graph instance: evaluating connectivity and computing the shortest path between a pair of nodes. Neither of these tasks can be solved by only investigating the neighbors of the source a nd sink node, which therefore implies that some analysis of global graph structure is necessary. # of training samples Model 1K 100K GCN [ 42] 50.2 55.0 MPNN [ 26] 66.8 72.6 GIN [ 82] 54.0 58.6 60M transformer 57.4 97.1 11B transformer (FT) 92.8 — Table 1: Transformers vs GNNs on shortest path: Fine-tuned large transformers outperform other transformers and GNNs, even the alternatives are trained on much larger training sets.Figure 3adisplays the accuracy of a vari- ety of trained transformers and GNNs on the connectivity task contrastsing the performance of all such models when trained on 1,000 and 100,000 graph connectivity instances. In the most restricted sample complexity regime, trained GNNs are consistently more accurate than the small transformer; however, increas- ing the number of training samples yields a far more substantial improvement in the per- formance of the small transformer, which out- performs all GNNs trained on 100,000 samples. Notably, the pre-trained transformer, fine-tuned on just 1000 training instances, nearly solves the connectivity task. This suggests significant enhancements due to the larger model size and the data-rich fi ne-tuning phase. Figure 3bplots the 7 Page 8: training and test error of ten small transformers trained on connectivity datasets of increasing size and reveals a sharp and continual improvement in accuracy. T he fine-tuned T5 transformer has similar accuracy to the most sample-rich small transformer and exceeds that of all GNNs. On the other hand, Table 1demonstrates that the MPNN GNN models outperform small tran sformers when trained to compute the shortest path, even on larger tra ining datasets. However, the fine-tuned T5 model has far higher accuracy than all alternatives, even when trained only 1000 samples. Theoretical interpretation: Because connectivity is the prototypical example of a task i n the parallelizable class and can thus be efficiently implemente d byLogDepth transformers with very small width (Theorem 2), the fact that small transformers succeed in solving nearl y all connectivity instances is illuminating but not surprising. In contrast, message-passing GNNs are unable to solve connectivity in a similarly depth-efficient manner due to fu ndamental capacity limitations.4 Shortest path belongs to the search class of graph reasoning tasks and is NL-complete. Theorem 4 implies that shortest path is computable by LogDepthWide transformers, which are likely to require very large embedding dimension to be learnable by finite samp les. This task can only be computed in a depth- and parameter-efficient manner if a variety of sea rch tasks, including all-pairs shortest- path and diameter, are as well (Theorem 22). The fact that only the pre-trained model has nearly optimal performance on shortest path reinforces the theore tical intuition that solving shortest path requires a very large number of model parameters. 4.2 GNNs uncover local graph structure with few samples While small transformers outperform GNNs on graph reasonin g algorithms that entail analysis of long-range graph structure, their empirical successes are not uniform. Here, we investigate the following question: When do GNNs outperform transformers on graph reasoning tas ks? Node Degree Cycle Check Model 1K 100K 1K 100K GCN [ 42] 9.8 9.4 83.2 83.2 MPNN [ 26] 99.4 99.8 99.0 100.0 GIN [ 82] 36.2 37.8 98.8 83.2 60M transformer 31.6 91.7 97.1 98.0 11B transformer (FT) 68.8 — 98.0 — Table 2: Transformers vs GNNs on cycle check and node degree: GNNs are favorably biased for local structure.Figure 3band Table 1demonstrate that GNNs outperform small trans- formers in the low-sample regime, despite the sufficient expressivity of transformers. This gap in perfor- mance, which is reinforced for the node degree andcycle check tasks in Table 2, suggests that GNNs have a beneficial inductive bias for learn- ing graph reasoning tasks that can be solved by attending exclusively to lo- cal heuristics. Just as the bounded kernel size of convolutional neural netw orks (CNNs) enables the sample- efficient learning of relevant textures and gradients for im age analysis, message-passing GNNs are unable to send messages instantaneously across multiple ed ges, which simplifies a search of the space of “one-hop” graph algorithms and represents a positi ve inductive bias. In contrast, the ability of transformers to send information between any pair of inpu t tokens—and the alternative inductive biases suggested by the input positional encoding—likely i nduces a steeper sample complexity to learn node degree. Theoretical interpretation: While model expressivity is necessary for learnability, it is not suffi- cient. The locality constraints of message-passing GNNs li kely provides a favorable inductive bias for learning tasks like node degree with an exclusively on lo cal structure that makes learning these tasks possible in a sample-efficient manner. While cycle che ck is more representationally difficult for GNNs than transformers in the worst case (see Appendix A.2), the random graphs sampled for the GraphQA benchmark have very small cycles (Figure 7) and do not resemble the large-diameter worst-case instance. 4See Appendix A.2for a discussion of prior work on theoretical relationships between GNNs and both the Weisfeiler-Leman graph isomorphism test [ 82] and the C ONGEST distributed computing model and their implication that GNNs cannot solve connectivity in a parame ter-efficient manner. 8 Page 9: Retrieval tasks Parallelizable Tasks Search Tasks Subgrap h Counting Method Node count Edge count Edge existence Node degree Conn ectivity Cycle check Shortest path Triangle countingPromptingZERO -SHOT [22] 21.7 12.4 44.5 14.0 84.9 76.0 11.5 1.5 ZERO -COT [22] 14.6 9.4 33.5 10.4 73.5 32.3 33.6 12.7 FEW-SHOT [22] 25.3 12.0 36.8 17.4 79.4 37.4 22.7 3.0 COT [22] 27.6 12.8 42.8 29.2 45.2 58.0 38.6 8.1 COT-BAG [22] 26.9 12.5 37.3 28.0 45.2 52.1 40.4 8.1Ours60M transformer-1K 100.0 100.0 67.6 31.5 92.9 97.1 57.4 33.4 60M transformer-100K 100.0 100.0 96.1 91.7 98.0 98.0 97.2 40.5 11B transformer (FT)-1K 100.0 45.0 100.0 68.8 98.4 98.0 92.8 26.0 Table 3: Comparison of GraphQA task accuracies of transform ers explicitly trained for graph rea- soning and LLMs with a variety of prompting strategies. 4.3 Trained transformers outperform LLM prompting Large language models (LLMs) are regularly evaluated by the ir reasoning abilities, and it remains an open research question to determine what kinds of trainin g data best teaches models to solve logical problems. We investigate the extent to which LLMs ca n already perform graph reasoning tasks without being trained explicitly to do so. Do transformers trained explicitly to solve graph reasonin g tasks outperform prompt-tuning approaches on much larger LLMs? In Table 3, we contrast the capabilities of trained transformer model s with several prompt- based approaches to querying LLMs. Task-specific transform ers—including the fine-tuned 11B transformers—consistently dominated the prompt-based ap proaches, despite the vast difference in parameter count and the almost certain presence of graph rea soning in the LLM’s corpus. Theoretical interpretation: While the representational capabilities of LLMs to solve re asoning tasks is much greater than that of small transformers, this p erformance gap suggests that their effec- tive reasoning capacity is much weaker and that it may be impr oved by a richer training corpus that includes synthetic tasks. Finally, we observe that the near-perfect performance of tr ained transformers on the node count, edge count, and edge existence is consistent with the representa tional easy of those tasks, as suggested by the existence of efficient Depth 1transformer implementations. 5 Conclusion This paper provides a comprehensive evaluation of transfor mer models’ graph reasoning capabilities, shedding light on their effectiveness across diverse graph reasoning tasks. By introducing a novel representational hierarchy, the study distinguishes betw een retrieval, parallelizable, and search rea- soning tasks and offers insights into the performance of tra nsformers at varying levels of granularity. The empirical investigation reveals that transformers exh ibit strong performance in graph-based rea- soning problems, often matching or surpassing specialized graph models. Furthermore, the study highlights transformers’ exceptional ability to capture g lobal graph patterns effectively, showcasing their capability in understanding long-range dependencie s, a critical factor in solving tasks involv- ing global graph structures. Overall, this work crystalliz es precise representational trade-offs that reflect the fundamental reasoning capabilities of transfor mers and demonstrates that the tasks used to quantify those capabilities are indeed learnable in a sampl e-efficient and parameter-efficient manner. While the hierarchy introduced by this work effectively sep arates graph algorithmic tasks into dis- tinct equivalence classes with significant implications fo r their computability by transformers, sev- eral questions remain for future research. We focused on gra ph reasoning tasks due to their relevance to the broader context of transformers, GNNs, and parallel a lgorithms. However, the complex- ity classes presented here could potentially be extended to a wider range of algorithmic problems. While our assumption of unbounded-size MLPs provides stron g lower bounds, further research into whether parallelizable tasks can be represented by transfo rmers with bounded-size MLP units would complement this existing work. Broader experimental resul ts that empirically evaluate the scaling laws would more directly assess the relevance of representa tional theoretical results to learnability. 9 Page 10: References [1] Anders Aamand, Justin Y . Chen, Piotr Indyk, Shyam Naraya nan, Ronitt Rubinfeld, Nicholas Schiefer, Sandeep Silwal, and Tal Wagner. Exponentially im proving the complexity of simu- lating the weisfeiler-lehman test with graph neural networ ks, 2022. Cited on page 3. [2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad , Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shy amal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 , 2023. Cited on page 1. [3] Alexandr Andoni, Zhao Song, Clifford Stein, Zhengyu Wan g, and Peilin Zhong. Parallel graph connectivity in log diameter rounds. In FOCS , October 2018. Cited on pages 17and18. [4] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dm itry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen , et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403 , 2023. Cited on page 41. [5] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neu ral machine translation by jointly learning to align and translate. In ICLR , 2014. Cited on page 1. [6] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alva ro Sanchez-Gonzalez, Vinicius Zam- baldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo , Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and grap h networks. arXiv preprint arXiv:1806.01261 , 2018. Cited on page 1. [7] Paul Beame, Paraschos Koutris, and Dan Suciu. Communica tion steps for parallel query pro- cessing. JACM , 64(6), oct 2017. Cited on page 19. [8] James Bergstra and Yoshua Bengio. Random search for hype r-parameter optimization. JMLR , 13(2), 2012. Cited on page 41. [9] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenb erger, Michal Podstawski, Lukas Gi- aninazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadoms ki, Piotr Nyczyk, and Torsten Hoefler. Graph of thoughts: Solving elaborate problems with large language models. AAAI , 38(16):17682–17690, March 2024. Cited on page 1. [10] Amartya Shankha Biswas, Talya Eden, Quanquan C. Liu, Sl obodan Mitrovi ´c, and Ronitt Ru- binfeld. Massively parallel algorithms for small subgraph counting, 2022. Cited on pages 6 and19. [11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari- wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Aman da Askell, et al. Language models are few-shot learners. NeurIPS , 2020. Cited on pages 1and41. [12] Quentin Cappart, Didier Chételat, Elias B Khalil, Andr ea Lodi, Christopher Morris, and Petar Veliˇckovi ´c. Combinatorial optimization and reasoning with graph neu ral networks. JMLR , 24 (130):1–61, 2023. Cited on page 1. [13] Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaoh ai Hu, Xuanwen Huang, and Yang Yang. GraphLLM: Boosting graph reasoning ability of large l anguage model, 2023. Cited on page 3. [14] Ines Chami, Sami Abu-El-Haija, Bryan Perozzi, Christo pher Ré, and Kevin Murphy. Machine learning on graphs: A model and comprehensive taxonomy. JMLR , 23(89):1–64, 2022. Cited on page 1. [15] Sam Coy and Artur Czumaj. Deterministic massively para llel connectivity. In STOC , STOC ’22, June 2022. Cited on page 19. [16] Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In ICLR , 2023. Cited on page 1. [17] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplifie d data processing on large clusters. InOSDI’04: Sixth Symposium on Operating System Design and Imp lementation , pages 137– 150, San Francisco, CA, 2004. Cited on page 18. 10 Page 11: [18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understandi ng.arXiv preprint arXiv:1810.04805 , 2018. Cited on page 1. [19] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesniko v, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image r ecognition at scale. In ICLR , 2021. Cited on page 1. [20] Vijay Prakash Dwivedi and Xavier Bresson. A generaliza tion of transformer networks to graphs. arXiv preprint arXiv:2012.09699 , 2020. Cited on page 3. [21] Paul Erd ˝os and Alfred Rényi. On random graphs. Publicationes Mathematicae Debrecen , 6: 290–297, 1959. Cited on page 40. [22] Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. Ta lk like a graph: Encoding graphs for large language models. In ICLR , 2024. Cited on pages 3,6,9,40,41, and 42. [23] Fabian Frei and Koichi Wada. Efficient circuit simulati on in mapreduce. ArXiv , abs/1907.01624, 2019. Cited on page 19. [24] Roy Frostig, Matthew James Johnson, and Chris Leary. Co mpiling machine learning programs via high-level tracing. Systems for Machine Learning , 4(9), 2018. Cited on page 41. [25] Mohsen Ghaffari, Fabian Kuhn, and Jara Uitto. Conditio nal hardness results for massively parallel computation from distributed lower bounds. In FOCS , 2019. Cited on page 19. [26] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Or iol Vinyals, and George E Dahl. Neu- ral message passing for quantum chemistry. In ICML , 2017. Cited on pages 1,7,8,41, and 42. [27] Sachin Goyal, Ziwei Ji, Ankit Singh Rawat, Aditya Krish na Menon, Sanjiv Kumar, and Vaish- navh Nagarajan. Think before you speak: Training language m odels with pause tokens. In ICLR , 2024. Cited on pages 2,3, and 15. [28] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turi ng machines. arXiv preprint arXiv:1410.5401 , 2014. Cited on page 1. [29] Yiding Hao, Dana Angluin, and Robert Frank. Formal lang uage recognition by hard attention transformers: Perspectives from circuit complexity. TACL , 10:800–810, 2022. Cited on page 3. [30] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, E lena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johann es Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 , 2022. Cited on page 1. [31] Bowen Jin, Gang Liu, Chi Han, Meng Jiang, Heng Ji, and Jia wei Han. Large language models on graphs: A comprehensive survey. arXiv preprint arXiv:2312.02783 , 2023. Cited on page 3. [32] Armand Joulin and Tomas Mikolov. Inferring algorithmi c patterns with stack-augmented re- current nets. NIPS , 2015. Cited on page 1. [33] Norman P. Jouppi, Cliff Young, Nishant Patil, David Pat terson, Gaurav Agrawal, Ramin- der Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borcher s, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Da ley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, Wil liam Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel K illebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chri s Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Mair e Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Tho mas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter , Dan Steinberg, Andy Swing, 11 Page 12: Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tu ttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. In-data center performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News , 2017. Cited on page 41. [34] John Jumper, Richard Evans, Alexander Pritzel, Tim Gre en, Michael Figurnov, Olaf Ron- neberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Ž ídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafo ld.Nature , 596(7873):583–589, 2021. Cited on page 1. [35] Lukasz Kaiser and Ilya Sutskever. Neural GPUs learn alg orithms. In ICLR , 2016. Cited on page 1. [36] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown , Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scal ing laws for neural language models. arXiv preprint arXiv:2001.08361 , 2020. Cited on page 1. [37] Howard Karloff, Siddharth Suri, and Sergei Vassilvits kii.A Model of Computation for MapRe- duce , pages 938–948. Cited on pages 4,17, and 18. [38] Nora Kassner, Oyvind Tafjord, Ashish Sabharwal, Kyle R ichardson, Hinrich Schuetze, and Peter Clark. Language models with rationality. In Houda Bou amor, Juan Pino, and Kalika Bali, editors, EMNLP , December 2023. Cited on page 1. [39] Mehran Kazemi, Najoung Kim, Deepti Bhatia, Xin Xu, and D eepak Ramachandran. LAM- BADA: Backward chaining for automated reasoning in natural language. In ACL, 2023. Cited on page 1. [40] Jinwoo Kim, Tien Dat Nguyen, Seonwoo Min, Sungjun Cho, M oontae Lee, Honglak Lee, and Seunghoon Hong. Pure transformers are powerful graph learn ers, 2022. Cited on pages 2,3, and31. [41] Diederik P Kingma and Jimmy Ba. Adam: A method for stocha stic optimization. arXiv preprint arXiv:1412.6980 , 2014. Cited on page 41. [42] Thomas N Kipf and Max Welling. Semi-supervised classifi cation with graph convolutional networks. arXiv preprint arXiv:1609.02907 , 2016. Cited on pages 1,7,8,41, and 42. [43] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. NeurIPS , 35:22199–22213, 2022. Cited on page 41. [44] Nayoung Lee, Kartik Sreenivasan, Jason D Lee, Kangwook Lee, and Dimitris Papailiopoulos. Teaching arithmetic to small transformers. In ICLR , 2024. Cited on pages 1and3. [45] Yuxuan Li and James L. McClelland. Systematic generali zation and emergent structures in transformers trained on structured tasks, 2022. Cited on pa ge3. [46] Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishna murthy, and Cyril Zhang. Trans- formers learn shortcuts to automata, 2022. Cited on page 3. [47] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zha ng, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer us ing shifted windows. In CVPR , 2021. Cited on page 1. [48] Ilya Loshchilov and Frank Hutter. Decoupled weight dec ay regularization. arXiv preprint arXiv:1711.05101 , 2017. Cited on page 41. [49] Andreas Loukas. What graph neural networks cannot lear n: depth vs width. In ICLR , 2020. Cited on pages 1,3, and 39. [50] Eran Malach. Auto-regressive next-token predictors a re universal learners, 2023. Cited on page 2. 12 Page 13: [51] Amil Merchant, Simon Batzner, Samuel S Schoenholz, Mur atahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials disco very. Nature , 624(7990):80–85, 2023. Cited on page 1. [52] William Merrill and Ashish Sabharwal. A logic for expre ssing log-precision transformers, 2022. Cited on page 2. [53] William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. TACL , 11:531–545, 2023. Cited on pages 2and32. [54] William Merrill and Ashish Sabharwal. The expressive p ower of transformers with chain of thought, 2023. Cited on pages 2and6. [55] William Merrill, Ashish Sabharwal, and Noah A. Smith. S aturated transformers are constant- depth threshold circuits, 2021. Cited on pages 2and6. [56] Erxue Min, Runfa Chen, Yatao Bian, Tingyang Xu, Kangfei Zhao, Wenbing Huang, Peilin Zhao, Junzhou Huang, Sophia Ananiadou, and Yu Rong. Transfo rmer for graphs: An overview from architecture perspective, 2022. Cited on page 3. [57] Christopher Morris, Yaron Lipman, Haggai Maron, Basti an Rieck, Nils M Kriege, Martin Grohe, Matthias Fey, and Karsten Borgwardt. Weisfeiler and Leman go machine learning: The story so far. JMLR , 24(1):15865–15923, 2023. Cited on pages 3and39. [58] Danupon Nanongkai and Michele Scquizzato. Equivalenc e classes and conditional hardness in massively parallel computations. Distributed Computing , 35(2):165–183, January 2022. Cited on pages 5,17,18,19, and 20. [59] Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Hen ryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with l anguage models. arXiv preprint arXiv:2112.00114 , 2021. Cited on page 2. [60] Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Att ention is turing complete. J. Mach. Learn. Res. , 22(1), jan 2021. ISSN 1532-4435. Cited on pages 2and3. [61] Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsit sulin, Mehran Kazemi, Rami Al-Rfou, and Jonathan Halcrow. Let your graph do the talking: Encodin g structured data for LLMs, 2024. Cited on pages 3,41, and 42. [62] Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta , Fei Sha, Dan Garrette, and Tal Linzen. The impact of depth and width on transformer language model g eneralization. NAACL , 2024. Cited on page 1. [63] Jacob Pfau, William Merrill, and Samuel R Bowman. Let’s think dot by dot: Hidden computa- tion in transformer language models. arXiv preprint arXiv:2404.15758 , 2024. Cited on pages 2and15. [64] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Le e, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of tr ansfer learning with a unified text-to-text transformer. JMLR , 21(1):5485–5551, 2020. Cited on pages 1,6, and 41. [65] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitr y Lepikhin, Timothy Lillicrap, Jean- baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across mil lions of tokens of context. arXiv preprint arXiv:2403.05530 , 2024. Cited on page 1. [66] Yu Rong, Yatao Bian, Tingyang Xu, Weiyang Xie, Ying Wei, Wenbing Huang, and Junzhou Huang. Self-supervised graph transformer on large-scale m olecular data. NeurIPS , 33, 2020. Cited on page 3. [67] Tim Roughgarden, Sergei Vassilvitskii, and Joshua R. W ang. Shuffles and circuits (on lower bounds for modern parallel computation). J. ACM , 65(6), nov 2018. Cited on page 19. 13 Page 14: [68] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Repr esentational strengths and limitations of transformers, 2023. Cited on pages 3,31,32,35, and 36. [69] Clayton Sanford, Daniel Hsu, and Matus Telgarsky. Tran sformers, parallel computation, and logarithmic depth, 2024. Cited on pages 2,3,4,5,6,15,18,20,21,22,30,31,32, and 39. [70] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hag enbuchner, and Gabriele Monfar- dini. The graph neural network model. IEEE transactions on neural networks , 20(1):61–80, 2008. Cited on page 1. [71] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202 , 2020. Cited on page 41. [72] Kaya Stechly, Matthew Marquez, and Subbarao Kambhampa ti. GPT-4 doesn’t know it’s wrong: An analysis of iterative prompting for reasoning pro blems, 2023. Cited on page 3. [73] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui W u, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 , 2023. Cited on page 1. [74] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shrut i Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023. Cited on page 1. [75] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko reit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you ne ed. In NeurIPS , 2017. Cited on pages 1,15, and 41. [76] Petar Veli ˇckovi ´c and Charles Blundell. Neural algorithmic reasoning. Patterns , 2(7), 2021. Cited on page 1. [77] Petar Veli ˇckovi ´c, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha Dashevskiy, Raia Hadsell, and Charles Blundell. The C LRS algorithmic reasoning benchmark. In ICML , 2022. Cited on page 1. [78] Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xi aochuang Han, and Yulia Tsvetkov. Can language models solve graph problems in natur al language? In NeurIPS , 2023. Cited on pages 3and41. [79] Colin Wei, Yining Chen, and Tengyu Ma. Statistically me aningful approximation: a case study on approximating turing machines with transformers, 2021. Cited on page 2. [80] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma , Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reas oning in large language models. NeurIPS , 2022. Cited on pages 1,2,15, and 41. [81] Boris Weisfeiler and Andrei Leman. The reduction of a gr aph to canonical form and the algebra which appears therein. nti, Series , 2(9):12–16, 1968. Cited on pages 3and39. [82] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegel ka. How powerful are graph neural networks?, 2018. Cited on pages 1,3,7,8,39,41, and 42. [83] Andrew Chi-Chih Yao. Some complexity questions relate d to distributive comput- ing(preliminary report). In STOC , page 209–213, New York, NY , USA, 1979. Cited on page 36. [84] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Gr iffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving w ith large language models. NeurIPS , 2024. Cited on page 1. [85] Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yo ngfeng Zhang. Language is all a graph needs, 2023. Cited on page 3. 14 Page 15: [86] Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to- sequence functions?, 2019. Cited on page 2. [87] Wojciech Zaremba and Ilya Sutskever. Learning to execu te. In ICLR , 2015. Cited on page 1. [88] Bohang Zhang, Shengjie Luo, Liwei Wang, and Di He. Rethi nking the expressive power of GNNs via graph biconnectivity. In ICLR , 2023. Cited on pages 1and3. [89] Yi Zhang, Arturs Backurs, Sébastien Bubeck, Ronen Elda n, Suriya Gunasekar, and Tal Wagner. Unveiling transformers with lego: a synthetic reasoning ta sk, 2022. Cited on page 3. [90] Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courvi lle, Behnam Neyshabur, and Hanie Sedghi. Teaching algorithmic reasoning via in-conte xt learning. arXiv preprint arXiv:2211.09066 , 2022. Cited on page 1. [91] Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, O mid Saremi, Josh Susskind, Samy Bengio, and Preetum Nakkiran. What algorithms can transfor mers learn? a study in length generalization. In ICLR , 2024. Cited on page 1. A Theoretical preliminaries For some vector v∈RN, let softmax(v) =1/summationtextN i=1exp(vi)(exp(v1),...,exp(vN))∈RN. Letei∈Rmdenote theith elementary vector, /vector1 = (1,...,1)∈Rmand/vector0 = (0,...,0)∈Rm. Let [n] ={1,...,n}. A.1 Transformer models We use a similar theoretical definition of a multi-layer tran sformer model to the one employed by [69]. We use a bounded communication theoretical model of the standard bidirectional transform er of [75], which assumes that the principal limitation of the transf ormer’s representational power is the low rank of the self-attention matrix. Several theoreti cal assumptions are necessary to model transformers in this regime: • the interleaved element-wise multi-layer perceptron (ML P) units compute arbitrary func- tions; • the embedding dimension mand number of heads per layer Hare much smaller than the sequence length N; and • all model parameters, inputs, and intermediate products c an be represented with O(logN)- bit numbers. These assumptions are justified in greater detail by [ 69]. Furthermore, we also model transformers as having N′additional “pause token” inputs, which are embeddings that contain no information relevant informati on about the input instance, but which provide a longer “work tape” that can be utilized for computa tional purposes [see e.g. 27,63]. While structurally similar to chain-of-thought reasoning [ 80], these pause tokens are not determined by multiple auto-regressive passes over the transformer and d o not encode useful information as input. Such pause tokens exist both in the theoretical results of Se ction 3(as blank tokens at the end of the sequence) and in the empirical results of Section 4(as placeholders between the graph representation and the task query for graphs that can be represented in fewer tokens than the maximum sequence length). Definition 1. LetTransformerN,N′ m,H,Ldenote the family of all bidirectional transformers with embed- ding dimension m, number of heads H, and depthL, which operate on input sequences on length NwithN′blank pause tokens appended to the end5. Any transformer f∈TransformerN,N′ m,H,Lis a 5IfN′= 0, we denote the family as TransformerN m,H,L. 15 Page 16: function of the form f:RN×d→RNfor some input dimension d, which parameterized by query, key, and value matrices Qℓ,h,Kℓ,h,Vℓ,h∈Rm×m,forℓ∈[L],h∈[H], and arbitrary element-wise MLPs with positional embedding sφ0,...,φLandψwith φ0:Rd×N→Rm, φℓ:Rm×N→Rm,forℓ∈ {1,...,L}, ψ:Rm×N→R, where for sequence Yof lengthN+N′, φℓ(Y) = (φℓ(y1,1),...,φℓ(yN+N′,N+N′)). To evaluate f(X)on some input sequence X= (x1,...,xN)∈RN×d, we define intermediate embeddings Y0,...,YL∈R(N+N′)×mas follows: • The initial embedding is computed by applying the first MLP φ0to all elements of the input X, along with N′“blank” inputs: Y0=φ0(X) = (φ0(x1,1),...,φ0(xN,N),φ0(0,N+1),...,φ0(0,N+N′)). • The intermediate embeddings Yℓforℓ∈ {1,...,L}are computed by applying a unit of multi-headed attention (parameterized by Qℓ,h,Kℓ,h,Vℓ,hfor allh∈[H]) to the previous embeddingYℓ−1, followed by an element-wise application of the MLP φℓ. That is, Yℓ=φℓ/parenleftBiggH/summationdisplay h=1softmax/parenleftbig φ(Yℓ−1)QhKT hφ(Yℓ−1)T/parenrightbig φ(Yℓ−1)Vh/parenrightBigg . • The output f(X)∈RNis computed to be the first Noutputs ofψ(YL). That is, f(X) = (ψ(YL 1,1),...,ψ(YL N,N)). While our theoretical results pertain to the bidirectional regime (where no masking is applied to the self-attention layer), all results in Appendix Cand all lower bounds in Appendix Bapply to autoregressive transformers as well. Our empirical result s utilize causal masking. Furthermore, an L-layer bidirectional transformer that operates on sequenc es of length NwithN′ pause tokens can be simulated by and L-layer autoregressive transformer with O(L·(N+N′)) pause tokens. In the regime where N′= poly(N)andL=O(logN), any positive results for bidirectional models likewise apply to autoregressive mod els. A.2 Graph reasoning tasks We exclusively consider graph reasoning tasks that apply to undirected and unweighted graphs G= (V,E)of bounded size, i.e. |V|+|E|=O(N)for some size parameter N. We define the tasks used for experimentation as follows: •Node count : Given graph G, compute |V|. •Edge count : Given graph G, compute |E|. •Edge existence : Given graph Gand vertices u,v∈V, return1{(u,v)∈E}. •Node degree : Given graph Gand vertexu∈V, returndeg(u) =|(u,v)∈E|. •Connectivity : Given graph Gand vertices u,v∈V, return 1 if there exists a path of edges betweenuandvand 0 otherwise. •Cycle check : Given graph G, return 1 if there exists a cycle of any length ≥3and 0 otherwise. •Shortest path : Given graph Gand vertices u,v∈V, return the smallest number of edges that forms a path between uandvif one exists and −1otherwise. 16 Page 17: •Triangle counting : Given graph G, return the number of triangles in the graph: |{(u,v,w) : (u,v),(v,w),(u,w)∈E}|. These tasks are foundational to computer science and are sol ved by famous polynomial-time algo- rithms, such as Dijkstra’s algorithm for shortest path and K ruskal’s for spanning trees. Several of these tasks reap significant algorithmic benefits from paral lel computing models, including graph connectivity, which can be evaluated in a logarithmic numbe r of rounds by employing an “iterative merging” strategy [ 3]. The amenability of some of these tasks to parallel computing is the core principle underlying our the- oretical results on transformer model capacity and the resu lting contrasts with GNNs. For instance, the iterative merging algorithm for connectivity can be sim ulated by a logarithmic-depth transformer, but message-passing GNNs cannot do without a substantial wi dth scaling (see Appendix D). Further- more, this parallel computing algorithm is widely believed to be optimal, and a crystallization of this as Conjecture 13is the bedrock of our transformer optimality conjectures. The multi-layer results of Appendix Bare representation-agnostic; the bounds apply to any fixed encoding of vertices and edges as a transformer input sequen ceX∈RN×d. The input representa- tion for Appendix Cmust satisfy the more specified node/edge encoding scheme , which represents each vertex and edge as a single embedding, followed by a quer y embedding (with optional blank embeddings as needed). This encoding scheme is reflected by F igure 1and closely resembles the graph instance encoding used for the experimental results6. Throughout the paper, we partition these and other graph rea soning tasks into three representational categories: retrieval, parallelizable, and search tasks. The retrieval category contains only four of the tasks used for experiments: •Retrieval tasks include node count, edge count, edge existence, and node deg ree. On the other hand, the other categories reflect two equivalen ce classes introduce by [ 58]: •Parallelizable tasks are defined to be L-complete and equivalent to connectivity in O(1) rounds of MPC. These tasks include connectivity, cycle chec k, planarity testing, minimum cut, bipartiteness testing, minimum spanning forest, conn ected components, one-cycle ver- sus two-cycles (see Conjecture 13), and # connected components. •Search tasks are defined to be NL-complete and equivalent to shortest path in O(1)rounds of MPC. These tasks include shortest path, strong connectiv ity (for directed graphs), all- pairs shortest path, median, diameter, radius, directed cy cle detection, and st-reachability. Triangle counting does not appear in any of these classes and is separately analyzed in Section 3.4. B Multi-layer transformers and parallelizable/search gra ph reasoning tasks In this appendix, we formalize and prove all results in Secti ons3.1and3.2that pertain to the paral- lelizable and search categories of graph reasoning tasks. The primary technical tool used to establish these results i s an improved relationship between the MPC model of distributed communication of [ 37] and our theoretical model of a transformers (Ap- pendix A.1). This result is presented informally as Theorem 1, and formally as Theorem 8. Because graph reasoning tasks are well studied in the MPC model of com putation [e.g. 3,58], this theorem enables the transfer of those positive results to transform er architectures. Similarly, negative results that pertain to the MPC model, including those conditioned o n the well-known one-cycle versus two-cycle conjecture (Conjecture 13), imply negative results for transformers. Theorem 8 (Formal version of Theorem 1; transformers simulate MPC) .For constants 0< δ < δ′<1andγ >0, anR-round deterministic (γ,δ)-MPC protocol with ninputs can be simu- lated by a transformer f∈TransformerN,N′ m,H,Lwith depthL=O(R), single heads H= 1 , embedding dimension m=O(nδ′), context length N=n, and blank chain-of-thought tokens N′= max(0,O(n1+γ−δ)−n). 6The empirical encoding scheme uses two tokens—not one—to en code each edge and has fixed size N. For all graphs whose encodings do not require all Ntokens, the remaining tokens are blank placeholders that precede the query tokens. 17 Page 18: We formally introduce the MPC distributed computing model i n Appendix B.1, along with the one- cycle versus two-cycle conjecture and a collection of posit ive results from [ 58]. In Appendix B.2, we formally present the task-specific graph reasoning resul ts introduced in Sections 3.1and3.2 and prove them as implications of Theorem 8. We finally contextualize and prove Theorem 8in Appendix B.3by modifying the proof strategy of a similar result [ 69] and by showing that MPC protocols can be simulated by a weaker model of distributed c omputation. B.1 MPC preliminaries The massively parallel computation (MPC) model of [ 37] formalizes distributed computing frame- works such as MapReduce [ 17] as theoretical models that are amenable to rigorous analys is. MPC pertains to a regime where an input that consists of n“words” is distributed across a very large num- ber of machines q(e.g.q≈n0.95), each of which contains a bounded local memory s(e.g.s≈n0.1) where computation on individual machines is inexpensive bu t communication between machines is costly. We use the definition of MPC of [ 3], which quantifies the complexity of a protocol by the local memory s=O(nδ), the global memory sq=O(n1+γ), and the number of communication roundsR. Definition 2. For global and local memory constants γ,δ >0and input size n, anR-round(γ,δ)- MPC protocol specifies a distributed computing protocol over q= Θ(n1+γ−δ)machines, each of which contains a local memory of size s=O(nδ). • An input to the protocol, which is encoded as a length- nsequence of p= Θ(logn)-bit words, is distributed across the local memories of the first ⌈n/s⌉machines. • In each of the Rrounds, each machine computes an arbitrary function its loc al memory at the time, which specifies at most swords to be transmitted to other machines. • Messages are simultaneously transmitted (subject to the c onstraint that each machine sends and receives at most swords of information), and each machine’s new local memory i s set equal to the messages received. • After the final round, the output of the protocol is a concate nation of the local memories of the first⌈n/s⌉machines. We say that an MPC protocol computes some function f:Zn 2p→Zn 2pif for anyX∈Zn 2pencoded as input to the protocol, the output of the protocol is f(X). B.1.1 MPC and graph reasoning tasks In this section, we introduce previously known results abou t the abilities of Massively Parallel Com- putation protocols to solve graph reasoning tasks. The nove l results presented in the subsequent section are the primarily the immediate implications of the se results and Theorem 8. MPC round equivalence for parallelizable and search tasks These results largely reflect the contributions of [ 58], which establish a hierarchy of graph reasoning tasks that is reflected in our distinction between parallelizable and search tasks. They show that all parallelizable tasks are equiv- alent to connectivity in O(1)rounds of MPC and that all search tasks are likewise equivale nt to shortest path. (See their Figure 1.) Concretely, they prove the following for all graphs G= (V,E) with|V|+|E|=O(n). Theorem 9 (Theorem 4 of [ 58]; round-equivalence of parallelizable tasks) .If there exists anR-round(γ,δ)-MPC protocol for any parallelizable task—including graph connectivity, st- connectivity, cycle detection, formula evaluation, plana rity testing, graph bipartiteness, connected components, minimum spanning forest, and minimum cut, one- cycle versus two-cycle testing—for someγ >0andδ∈(0,1) , then there exists an R′-round(γ′,δ)-MPC protocol for any other other parallelizable task where R′=R+O(1)andγ′= max(γ,3). Theorem 10 (Theorem 5 of [ 58]; round-equivalence of search tasks) .If there exists an R-round (γ,δ)-MPC protocol for any search task—including st-reachability (for directed graphs), strong connectivitity, directed cycle detection, shortest path, all-pairs shortest path, diameter, radius, and median—for some γ >0andδ∈(0,1) , then there exists an R′-round(γ′,δ)-MPC protocol for any other other search task where R′=R+O(1)andγ′= max(γ,2). The equivalance of Theorem 9has more immediate implications for the round complexity of paral- lelizable tasks than Theorem 10does for search tasks because the round complexity of graph c onnec- 18 Page 19: tivity is well-understood to be O(logn)and thought to be optimal. We first present a deterministic MPC positive result for graph connectivity. Theorem 11 (Theorem 6.2 of [ 15]; log-round connectivity MPC algorithm) .For anyγ >0and δ∈(0,1) , there exists a deterministic O(logn)-round(γ,δ)-MPC protocol that solves graph con- nectivity on any graph G= (V,E)of size|V|+|E|=O(n). Hence, all parallelizable tasks can be solved with a logarit hmic number of MPC rounds with arbi- trarily small polynomial local memory. Corollary 12 (Log-round parallelizable MPC algorithms) .For anyγ >0andδ∈(0,1) and any parallelizable task, there exists a deterministic O(logn)-round(γ′,δ)-MPC protocol that solves the task on any graph G= (V,E)of size|V|+|E|=O(n)forγ′= max(γ,3). The optimality of these logarithmic-round protocols is sug gested by a conjecture about the hardness of distinguishing between two graphs with nvertices and nedges, one of whose edges are arranged in a single cycle of length nand the other in two disjoint cycles of lengthn 2. This is the well-known “one-cycle versus two-cycle conjecture,” which is widely e mployed as a condition for distributed computing hardness [see e.g. 7,67,25]. Conjecture 13 (One-cycle versus two-cycle conjecture, see e.g. [ 25]).For anyγ >0andδ∈(0,1) , any(γ,δ)-MPC protocol that distinguishes a single length- ncycle from two disjoint length-n 2cycles usesR= Ω(logn)rounds. Under this conjecture, Theorem 9immediately implies the optimality of Corollary 12. Corollary 14 (Optimality of log-round parallelizable MPC algorithms) .Conditional on Conjec- ture13, for anyγ >0andδ∈(0,1) , any(γ,δ)-MPC protocol that solves a parallelizable graph task on all graphs G= (V,E)of size|V|+|E|=O(n)usesR= Ω(logn)rounds. The round complexity of search tasks is less understood, and it is unknown if search tasks can be solved byO(logn)-round(γ,δ)-MPC protocols if δ∈(0,1 2]. MPC protocols for problems with bounded circuit size More general MPC constructions are known for problems that solved by bounded-size boolean circ uits, which include both paralleliz- able and search tasks. The well-known NCiclasses of Boolean circuits that take Ninputs and havepoly(n)gates and depth O(login)have been shown to be computable by bounded-round MapReduce-like computational protocols [ 23] and by MPC protocols in particular [ 58]. Theorem 15 (Theorem 1 of [ 58]; log-round circuit MPC algorithms) .For any problem in NCi+1 and anyγ >0andδ∈(1 2,1), there exists a deterministic O(login)-round(γ,δ)-MPC protocol that solves the problem. SinceLandNLare known to belong to NC2, the following corollary is an immediate consequence. Corollary 16 (Log-round parallelizable and search MPC algorithms with h igh local memory) .For any parallelizable or search graph reasoning task and any γ >0andδ∈(1 2,1), there exists a deterministic O(logn)-round(γ,δ)-MPC protocol that solves the task on all graphs G= (V,E)of size|V|+|E|=O(N). Note that these results pertain exclusively to the “large lo cal memory regime,” where each machine has memory at s=ω(√n). Therefore, this does not guarantee the existence of a O(logn)-round MPC solution for any search task or for any parallelizable ta sk withγ <1. MPC protocol for triangle counting Finally, the triangle counting task can be solved in the MPC framework by utilizing a special case of a parallel algorith m for clique counting. These pertain to graphs with bounded arboricityα, a quantity that corresponds to the branching factor of a nod e that is bounded by the degree of the graph; these apply to arbitrar y graphs by noting that α≤ |V|. Theorem 17 (Theorem 1.5 of [ 10]; loglog-round triangle counting MPC algorithm) .For anyk≥3, δ∈(0,1) , andγ >0, there exists an O(loglogn)-round(γ,δ)-MPC protocol that computes the number of k-cliques in any graph G= (V,E)with|V|+|E|=O(n)and arboricity α= O(nγ/(k−2)). 19 Page 20: B.2 Positive and negative graph reasoning results for multi -layer transformers We state formalized versions of the statements of Sections 3.1,3.2and3.4and prove them by invok- ing Theorem 8jointly with the relationships between MPC and graph reason ing of Appendix B.1.1 . B.2.1 Parallelizable task results (Section 3.1) For the duration of this section, the class of parallelizabl e tasks includes all of those that are deemed equivalent to graph connectivity in O(1)rounds of MPC by [ 58], as stated in Theorem 9. We first present a statement that formalizes the existence of logarithmic-depth transformer construc- tions for solving parallelizable tasks. Theorem 18 (Formal version of Theorem 2; log-depth transformers compute parallelizable tasks) . For anyǫ∈(0,1) and any parallelizable task, there exists a transformer f∈TransformerN,N′ m,H,L such thatf(X)computes the task where X∈RNis some encoding of any graph G= (V,E)with |V|+|E|=O(N)andfhas depthL=O(logN)and headsH=O(1)and embedding dimension mand pause tokens N′satisfying either •LogDepthPause :m=O(Nǫ)andN′=O(N4−ǫ′), whereǫ′∈(0,ǫ); or •LogDepthWide :m=O(Nǫ)andN′= 0, ifǫ>1 2. Proof. The first bullet is an immediate implication of Corollary 12and Theorem 8withγ= 3, δ′=ǫ, andδ=ǫ′. The second bullet follows from Corollary 16and Theorem 8withγ=δ=ǫ 2 andδ′=ǫ. We then establish that sub-logarithmic-depth solutions to any parallelizable task are impossible with- out having linear embedding dimension mor super-polynomial number of pause tokens N′, under the assumption that the one-cycle versus two-cycle conject ure holds. Theorem 19 (Formal version of Theorem 3; log-depth optimality for parallelizable tasks) .Condi- tional on Conjecture 13, for anyǫ∈(0,1) andγ >0and any parallelizable task, if there exists a transformer f∈TransformerN,N′ m,H,Lthat solves the task and has width mH=O(Nǫ)and pause tokensN+N′=O(N1+γ), then its depth satisfies L= Ω(logN). Proof. The proof is a consequence of Corollary 14and a result of [ 69] that proves the to simulate transformers with MPC protocols, an inversion of Theorem 8. We restate that result as follows. Theorem 20 (Theorem 3.4 of [ 69]; MPC simulates transformers) .Fix any transformer f∈ TransformerN,N′ m,H,Lwith widthmH=O(Nδ)for someδ∈(0,1) and total sequence length N+N′=O(N1+γ)for someγ≥0. Then, for any δ′∈(δ,1)andγ′= 1 + 2γ+δ′, there exists anO(L(1+γ) δ′−δ)-round(γ′,δ′)-MPC protocol that simulates f. Consider a transformer f∈TransformerN,N′ m,L,Hthat solves the task with width mH=O(Nǫ)and total sequence length N+N′=O(N1+γfor some constant ǫ∈(0,1) andγ >1. Then, there exists anOǫ,γ(L)-round(1 + 2γ+√ǫ,√ǫ)-MPC protocol that solves the parallelizable task. If Conjecture 13is true, then L= Ω(logN). B.2.2 Search task results (Section 3.2) We present the main result of Section 3.2and show an equivalence between the minimum depth transformer needed to solve search tasks in a width-efficien t manner. As before, the class of search- able tasks includes tasks that are equivalent to shortest pa th inO(1)MPC rounds (Theorem 10). Theorem 21 (Formal version of Theorem 4;LogDepthWide computes search tasks) .For any ǫ∈(1 2,1)and any search task, there exists a transformer f∈TransformerN,N′ m,H,Lsuch that f(X)computes the task where X∈RNis some encoding of any graph G= (V,E)with |V|+|E|=O(N), andfhas depthL=O(logN), headsH=O(1), embedding dimension m=O(Nǫ, and no pause tokens ( N′= 0). 20 Page 21: Proof. As before, the proof is an immediate consequence of Corollar y16and Theorem 8. While both search and parallelizable tasks are computable l ogarithmic depth of considerable width, the minimum depth transformer of width O(Nǫ)for smallǫthat computes search tasks is unknown. Despite this deficit, we can still establish the representat ion similarity of all search tasks by showing that their minimum depths vary by at most a constant additive factor. Theorem 22 (Depth-equivalence of search tasks) .Suppose for some ǫ∈(0,1) andγ >0there exists a transformer f∈TransformerN,N′ m,H,Lwith embedding dimension m=O(Nǫ)and total sequence length N+N′=O(N1+γ)that computes the some search task on all graphs G= (V,E) of size|V|+|E|=O(N). Then, for any other search task, there exists some transfor merf′∈ TransformerN,¯N′ ¯m,H,¯Lwith embedding dimension ¯m=O(m), depth¯L=L+O(1), and pause tokens ¯N′=N′+O(N3)that computes that search task. Proof. This statement is an immediate consequence of Theorem 10and Theorem 8. B.2.3 Clique counting task results (Section 3.4) We prove the existing of doubly-logarithmic-depth transfo rmer that solves the triangle counting task giving a more general result that counts the number of k-cliques in any graph of bounded arboricity. Theorem 23 (Generalization of Theorem 7; loglog-depth computes clique counting) .For any fixed ǫ∈(0,1) andk≥3, there exists a transformer f∈TransformerN,N′ m,H,Lwith embedding dimension m=O(nǫ), headsH=O(1), depthL=O(loglogn), and chain-of-thought tokens N′=/braceleftbiggO(αk−2N1−ǫ)ifα= Ω(Nǫ/(k−2)), 0 otherwise. that counts the number of k-cliques in all graphs G= (V,E)of arboricity αand size|V|+|E|= O(N). Theorem 7follows immediately from taking k= 3. Proof. This is the immediate implication of Theorem 8and Theorem 23. B.3 Proof of Theorem 8 Our principal theoretical result establishes that any MPC p rotocol with sublinear local memory can be simulated by a transformer with sublinear embedding dime nsion. Theorem 8 (Formal version of Theorem 1; transformers simulate MPC) .For constants 0< δ < δ′<1andγ >0, anR-round deterministic (γ,δ)-MPC protocol with ninputs can be simu- lated by a transformer f∈TransformerN,N′ m,H,Lwith depthL=O(R), single heads H= 1 , embedding dimension m=O(nδ′), context length N=n, and blank chain-of-thought tokens N′= max(0,O(n1+γ−δ)−n). This offers an improvement on Theorem 3.1 of [ 69], which only guarantees that MPC protocols with local memory s=O(n1/4−ǫ)(or withδ<1 4) can be simulated by transformers with sublinear embedding dimension. This distinction is significant becau se several positive results for the MPC protocol (e.g. the ability to solve all problems in LandNLin a logarithmic number of MPC rounds) requires= Ω(N1/2)local memory, and hence could not be shown to be simulated by t ransformers of sublinear embedding dimension previously7. Including an allowance of N′blank pause tokens permits the simulation of MPC protocol wi th γ≥δ, i.e. where number of machines qgrows super-linearly with n. Ifγ < δ , then the protocol can be simulated without pause tokens (i.e. N′= 0) for sufficiently large input sizes n. 7For this theoretical model of the transformer to be non-vacu ous, the transformer must have embedding dimension m=o(Nd). Without this, any function over X∈RN×dcould be solved by a single-layer transformer that concatenates all inputs into a single vect or in an attention unit and passes that as input to an MLP that solves the problem. 21 Page 22: To prove Theorem 8, we define a restriction of the MPC computational model that d isallows each machine from communicating with more than kother machines in each round of the protocol. Definition 3. For constants γ,δ,ρ >0, anR-round(γ,δ,ρ)-MPC protocol on input sequences of lengthnis a(γ,δ)-MPC protocol (Definition 2) that obeys an additional constraint on the number of outgoing and incoming messages. Namely, for some capacit yk=O(nρ), in each round, every machine can send its local memory to and receive information from at most distinct kmachines. Theorem 8is an immediate consequence of Proposition 24—which establishes that any (γ,δ)-MPC protocol can be simulated by a (γ,δ,ρ)-MPC protocol—and Corollary 25—which shows that any (γ,δ,ρ)-MPC protocol can be simulated by a transformer. Proposition 24 ((γ,δ,ρ)-MPC simulates (γ,δ)-MPC) .For constants γ,δ >0andρ∈(0,δ 2), iffcan be computed by an R-round(γ,δ)-MPC protocol, then there exists a O(R(1+γ)2 ρ2)-round (γ,δ,ρ)-MPC protocol that computes fas well. Corollary 25 (Transformers simulate (γ,δ,ρ)-MPC) .For constants δ∈(0,1) andγ,ρ >0, an R-round deterministic (γ,δ,ρ)-MPC protocol with ninputs can be simulated by a transformer f∈ TransformerN,N′ m,H,Lwith depthL=R+1, headsH= 1, embedding dimension m=O(nδ+4ρlogn), context length N=n, and blank chain-of-thought tokens N′= max(0,O(n1+γ−δ)−n). Proof of Theorem 8.Letρ:= min(δ 2,δ′−δ 4)−ǫfor some small constant ǫ(e.g.ǫ:= min(δ 4,δ′−δ 8)). By Proposition 24, we can simulate the target R-round(γ,δ)-MPC protocol with an R′-round (γ,δ,ρ)-MPC protocol for R′=O/parenleftbiggR(1+γ2) min((δ′−δ)2,δ2)/parenrightbigg . We apply Corollary 25to conclude that this latter protocol can be simulated by a tr ansformer of depthL=R′+1and embedding dimension m=O(nδ+4ρlogn) =O(nδ′). We prove Proposition 24in Appendix B.3.1 by simulating a single round of a standard MPC with multiple rounds of restricted MPC, where messages are sent t o intermediate “neighborhoods”— which contain collections of machines with similar destina tions—of increasing fine granularity. We prove Corollary 25in Appendix B.3.2 by a minor adaptation of the proof of Theorem 3.1 of [ 69]. B.3.1 Proof of Proposition 24 Fix anR-round(γ,δ)-MPC protocol πwith input size n,q= Θ(n1+γ−δ)machines, and s= O(nδ)local memory. To prove Proposition 24, it suffices to show that a single round of MPC communication can be simulated an O((1+γ)2 ρ2)-round(γ,δ,ρ)-MPC protocol. To formalize the communication procedure to simulate, we le tOutbox= (Outbox 1,...,Outbox q) denote an encoding of the outgoing “message packets” from ea ch source machine that obeys π’s local memory constraints. Concretely, Outbox i={(Msg,Src,Dst) :Src=i,Dst∈[q]}s.t./summationdisplay Msg∈Outbox i|Msg| ≤s. LetInbox= (Inbox 1,...,Inbox q)denote the same collection of message packets, organized by their destination machines. Inbox i={(Msg,Src,Dst)∈Outbox Src:Dst=i}s.t./summationdisplay Msg∈Inboxi|Msg| ≤s. It suffices to prove Lemma 26. Lemma 26 ((γ,δ,ρ)-MPC simulates one communication round of (γ,δ)-MPC) .Ifρ <δ 2, there exists anO((1+γ)2 ρ2)-round(γ,δ,ρ)-MPC protocol that takes as input any Outbox and returns the corresponding Inbox . 22 Page 23: Proof. We definer=O(1+γ ρ)intermediate steps and prove that each of those can be simula ted. The intuition is that each an intermediate step routes each p acket(Msg,Src,Dst)to a machine that belongs to the same “neighborhood” as Dst. Each step maps each packet to a neighborhood of smaller radius than the step before, until all packets have b een transmitted to their proper destination location. We define a tree of neighborhoods as follows. Fix some branchi ng factorℓ= Θ(nρ/2)and number of intermediate steps r=/ceilingleftBig logq logℓ/ceilingrightBig =O(1+γ ρ). For any step t= 0,...,rand neighborhood index j= 1,...,bt:=/ceilingleftbigq ℓr−t/ceilingrightbig , we define Nbhdt j=/braceleftbig (j−1)ℓr−t+1,(j−1)ℓr−t+2,...,j·ℓr−t/bracerightbig ∩[q] and observe that it satisfies the size constraint |Nbhdt j| ≤ℓr−t. We let Nbhdt(Dst)denote the unique neighborhood Nbhdt jsatisfying Dst∈Nbhdt j. We say that Nbhdt j<Nbhdt j′ifj < j′(and equivalently, for all i∈Nbhdt jandi′∈Nbhdt j′,i<i′). Let Children (Nbhdt j) =/braceleftBig Nbhdt+1 (j−1)ℓ+1,...,Nbhdt+1 jℓ/bracerightBig , Descendantsτ(Nbhdt j) =/braceleftbig Nbhdτ(i) :i∈Nbhdt j/bracerightbig , Parent(Nbhdt j) =Nbhdt−1 ⌈j/ℓ⌉, Ancestort(Nbhdτ j) =Nbhdt(i)fori∈Nbhdτ j. for someτ≥t. Note that the following are true of neighborhoods. • The initial “step 0” neighborhood contains all machines (i .e.Nbhd0 1= [q]), and the final “stepr” neighborhoods contain a single machine (i.e. Nbhdr j={j}for allj∈[q]). • For anyt<r andj∈[bt],Nbhdt jis the disjoint union of all sets in Children (Nbhdt j)and |Children (Nbhdt j)| ≤ℓ. •/braceleftbig Nbhdt 1,...,Nbhdt bt/bracerightbig comprise a disjoint union of [q]. • For any Dst∈[q], there exist unique sets Nbhd0(Dst),...,Nbhdr(Dst)that satisfy Nbhdr(Dst)⊂ ··· ⊂Nbhd0(Dst) andNbhdt−1(Dst) =Parent(Nbhdt(Dst)). We define a collection of MPC machine states as “intermediate outboxes” Outboxt= (Outboxt 1,...,Outboxt q)for each step tto represent an assignment of message packets to machines in the sametth-level neighborhood as the packet destination. That is, w e require that each Outboxt i satisfy Outboxt i=/braceleftbig (Msg,Src,Dst) :i∈Nbhdt(Dst)/bracerightbig (1) We say that Outboxtarevalid intermediate outboxes forOutbox if: 1.Outboxtsatisfies Equation ( 1); 2. there is a one-to-one mapping between packets in Outbox andOutboxt; and 3. local memory constraints are maintained up to a factor of t wo8 /summationdisplay Msg∈Outboxt i|Msg| ≤2s. 8For the sake of simplicity, we assume that the encoding Msgalso contains all relevant metadata about the packetSrcandDst. That is, we only track the message size |Msg|. 23 Page 24: Step0 1Nbhd qmachines per Nbhd Nbhd0 1 Machine 1 Outbox0 1 Machine 2 Outbox0 2 Machineq Outbox0 qStep1 ℓNbhd s q/ℓ machines per Nbhd Nbhd1 1 Machine 1 Outbox1 1 Machineq/ℓ Outbox1 q/ℓ Nbhd1 ℓ Machineq Outbox1 qStep2 ℓ2Nbhd s q/ℓ2machines per Nbhd Nbhd2 1 Nbhd2 ℓ Nbhd2 ℓ2−ℓ+1 Nbhd2 ℓ2Stepr qNbhd s 1machine per Nbhd Nbhdr 1 Machine1 Outboxr 1 Nbhdr q Machineq Outboxr q Figure 4: The neighborhood routing structure. Note that Outbox satisfies the conditions for Outbox0, and the only sequence of embeddings that satisfies the conditions for OutboxrisInbox . An inductive argument that applies Lemma 27rtimes completes the proof. Lemma 27 ((γ,δ,ρ)-MPC simulates one intermediate step) .For anyt∈ {0,...,r−1}andρ<δ 2, there exists an O(1+γ ρ)-round(γ,δ,ρ)-MPC protocol that takes as input any satisfactory Outboxt and returns some satisfactory Outboxt+1. Proof. The protocol operates in two phases: 1. The pre-computation phase , where relevant message size metadata is computed using O(r−t)rounds of communication. 2. The message-passing phase , where all messages are propagated in r−trounds of commu- nication in order to convert OutboxttoOutboxt+1. Sincer=O(1+γ ρ), the bound on total number of rounds is satisfied. The algorithm maintains the invariant that all communicati on occurs within t-level neighborhoods Nbhdt jwithout any mixing between neighborhoods; this is possible because of the assumed valid- ity ofOutboxt, which implies that all messages whose packets appear in Nbhdt jofOutboxthave ultimate destinations in Nbhdt j. Concretely, if (Msg,Src,Dst)∈Outboxt iandi∈Nbhdt j, then Dst∈Nbhdt j. We first explain the routing strategy for the message-passin g phase by describing an itinerary of machines that each message packet will be routed to and provi ng that this itinerary meets the con- 24 Page 25: ditions needed to be executed by an (r−t)-round(γ,δ,ρ)-MPC protocol. We then describe how the metadata necessary for the itinerary computations can b e obtained during the pre-computation phase. Message-passing phase. We first introduce some packet notation to track all relevant metadata about any message in the tth intermediate step. Fix some particular input Outboxt. • LetP= (Msg,Src,Dst,Srct)denote a packet that contains a message Msg and metadata concerning its “global” source Src and destination Dst (i.e. (Msg,Src,Dst)∈Outbox Src,InboxDst) and its “local” source Srctfrom within Outboxt(i.e.(Msg,Src,Dst)∈Outboxt Srct). • We write P∈Outboxt iifSrct=iandP∈Nbhd ifP∈Outboxt ifor somei∈Nbhd . • LetSrcNbhdt,τ(P) =Nbhdτ(Srct)represent the neighborhood of size ℓr−τcontainsP (i.e.P∈SrcNbhdt,τ(P)) andDstNbhdt(P) =Nbhdt+1(Dst)denote the neighborhood of sizeℓr−t−1that contains the ultimate destination Dst. Because P∈Nbhdt jif and only if Dst∈Nbhdt j, it follows that DstNbhdt(P)⊂SrcNbhdt,t(P). • Let|P|=|Msg|be the total number of bits needed to encode the packet. • We establish a lexicographical ordering that depends first onSrctand next on Dst. That is, we say that P′< P for someP′= (Msg′,Src′,Dst′,Srct′)ifSrct′<Srct, or if Srct′=SrctandDst′<Dst. We develop a packet scoring function zt,τ(P), which in turn induces an itinerary function bt,τ(P)∈ [q]that assigns an intermediate machine to hold packet Pafterr−τcommunication steps. These functions are carefully chosen in order to ensure that the lo cal memories of each individual machine are bounded by O(s)and that the number of machine each sends messages to and rece ives messages from is bounded by O(ℓ2). Critically, we ensure that bt,τ(P)∈SrcNbhdt,τ(P); that is, we initially rearrange packets within solely within the smallest neighb orhoods of Outboxtand gradually prop- agate packets more widely. That way, each packet Pobeys a trajectory of the following form over r−tMPC rounds: bt,r(P) =Srct∈SrcNbhdt,r(P) =⇒bt,r−1(P)∈SrcNbhdt,r−1(P) =⇒... =⇒bt,t+1(P)∈SrcNbhdt,t+1(P) =⇒bt,t(P)∈DstNbhdt(P). To define these functions, we first define partial sums of messa ge sizes in order to later determine an itinerary of machines that Pwill be routed to in between Srctand some destination machine in DstNbhdt(P). The first term, EqualDstSumt,τ(P), sums the size of all “lesser” packets that share a destination neighborhood DstNbhdt(P)and aτth level input neighborhood SrcNbhdt,τ(P): EqualDstSumt,τ(P) =/summationdisplay P′∈SrcNbhdt,τ(P) DstNbhdt(P′)=DstNbhdt(P) P′<P|P′|. The second term, LessDstSumt,τ(P), sums the size of all packets that share an input neighborhoo d but have a “smaller” destination neighborhood than P: LessDstSumt,τ(P) =LessDstSumt(SrcNbhdt,τ(P),DstNbhdt(P)), where LessDstSumt(SrcNbhd,DstNbhd) =/summationdisplay P′∈SrcNbhd DstNbhdt(P′)<DstNbhd|P′|. We now define the packet scoring and itinerary functions for a nyτ∈ {t,...,r}: zt,τ(P) =/braceleftbigg2s·minSrcNbhdt,τ(P)+LessDstSumt,τ(P)+EqualDstSumt,τ(P)τ≥t+1, 2s·minDstNbhdt(P)+2·EqualDstSumt,τ(P) τ=t. 25 Page 26: bt,τ(P) =/floorleftbiggzt,τ(P) 2s/floorrightbigg . We prove a series of claims to establish that the packet scori ng and itinerary functions are properly defined. Claim 28 (Itinerary range) .For anyτ∈ {t,...,r}, the itinerary function satisfies bt,τ(P)∈/braceleftbiggSrcNbhdt,τ(P)τ≥t+1 DstNbhdt(P)τ=t. As an immediate consequence, bt,r(P) =SrctforP= (Msg,Src,Dst,Srct). Proof. We first bound the scoring function zt,τ. Note that LessDstSumt,τ(P)+EqualDstSumt,τ(P)≤/summationdisplay P′∈SrcNbhdt,τ(P)|P′|−|P| ≤ |SrcNbhdt,τ(P)|·2s−|P|. Therefore, for τ >t , zt,τ(P)∈/bracketleftbig 2s·minSrcNbhdt,τ(P),2s·(minSrcNbhdt,τ(P)+|SrcNbhdt,τ(P)|)−|P|/bracketrightbig ,(2) and bt,τ(P)∈[minSrcNbhdt,τ(P),minSrcNbhdt,τ(P)+|SrcNbhdt,τ(P)|), which proves the first case of the claim. The second case follo ws by observing that EqualDstSumt,t(P)≤s·|DstNbhdt(P)|−|P|. Were it not true, there would exist at least one machine i∈DstNbhdt(P)that must receive more an squantity of messages at the end of the entire round of the prot ocolπ(i.e./summationtext Msg∈Inboxi|Msg|>s), which contradicts the MPC assumption. Claim 29 (Gaps between scores) .IfP1/\⌉}atio\slash=P2andzt,τ(P1)≤zt,τ(P2), then zt,τ(P1)+|P1| ≤zt,τ(P2). Proof. First, letτ > t . Consider the case where SrcNbhdt,τ(P1)/\⌉}atio\slash=SrcNbhdt,τ(P2). By Claim 28and our assumption that zt,τ(P1)≤zt,τ(P2), it must be the case that SrcNbhdt,τ(P1)< SrcNbhdt,τ(P2). Hence, we have the following by applying Equation ( 2). zt,τ(P2)−zt,τ(P1) ≥2s·(minSrcNbhdt,τ(P2)−(minSrcNbhdt,τ(P1)+|SrcNbhdt,τ(P1)|−|P1|)) ≥ |P1|. Otherwise, if SrcNbhdt,τ(P1) =SrcNbhdt,τ(P2), then zt,τ(P2)−zt,τ(P1) = (LessDstSumt,τ(P2)+EqualDstSumt,τ(P2))−(LessDstSumt,τ(P1)+EqualDstSumt,τ(P1)) =/summationdisplay P′∈S2|P′|−/summationdisplay P′∈S1|P′|, for some packet subsets S1,S2⊂SrcNbhdt,τ(P1). By further inspecting the respective LessDstSumt,τandEqualDstSumt,τterms, we observe that S1⊂S2andP1∈S2\S1. The claim immediately follows. The argument for the case τ=tis nearly identical. Claim 30 (Local memory bound) .For anyb∈N, /summationdisplay P:bt,τ(P)=b|P| ≤/braceleftbigg2s τ∈ {t,r} 3s τ∈ {t+1,...,r}. 26 Page 27: Proof. The caseτ=ris an immediate consequence of the inductive assumption tha tOutboxt satisfies the desired intermediate properties. For all other cases, let {P1,...,Pn}denote all packets with bt,τ(Pi) =band letzt,τ(P1)≤ ··· ≤ zt,τ(Pn)without loss of generality. We use Claim 29, the assumption that all |Pi| ≤s, and the boundedness of zt,τ(Pi)from Claim 28to conclude the proof. n/summationdisplay i=1|Pi| ≤n−1/summationdisplay i=1(zt,τ(Pi+1)−zt,τ(Pi))+|Pn| ≤zt,τ(Pn)−zt,τ(P1)+s ≤/braceleftbigg2s τ=t, 3s τ >t. Claim 31 (Intra-class distance preservation) .IfP1andP2satisfySrcNbhdt,τ+1(P1) = SrcNbhdt,τ+1(P2)andDstNbhdt(P1) =DstNbhdt(P2), then zt,τ(P1)−zt,τ(P2) =zt,τ+1(P1)−zt,τ+1(P2). Proof. SinceSrcNbhdt,τ+1(P1) =SrcNbhdt,τ+1(P2), it follows that SrcNbhdt,τ(P1) = SrcNbhdt,τ(P2)and therefore, zt,τ(P1)−zt,τ(P2) =EqualDstSumt,τ(P1)−EqualDstSumt,τ(P2) =/summationdisplay P′∈SrcNbhdt,τ(P1) DstNbhdt(P′)=DstNbhdt(P1) P1≤P′<P2|P′| (3) zt,τ+1(P1)−zt,τ+1(P2) =EqualDstSumt,τ+1(P1)−EqualDstSumt,τ+1(P2) =/summationdisplay P′∈SrcNbhdt,τ+1(P1) DstNbhdt(P′)=DstNbhdt(P1) P1≤P′<P2|P′|. (4) BecauseP1,P2∈SrcNbhdt,τ+1(P1), the defined packet ordering implies that any P′∈[P1,P2)must satisfyP′∈SrcNbhdt,τ+1(P1). Therefore, Equations ( 3) and ( 4) are equal and the claim holds. Claim 32 (Distinct recipients bound) .For anyb, |/braceleftbig bt,τ(P) :bt,τ+1(P) =b/bracerightbig | ≤3ℓ. Proof. Within each root neighborhood Nbhdτ j, there exist at most ℓdestination neighborhoods in DstNbhdt(P)forP∈Nbhdτ j. Fix some such DstNbhd and letP1,...,Pndenote all packets with bt,τ+1(P) =band DstNbhdt(Pi) =DstNbhd . Without loss of generality, assume that zt,τ(P1)≤ ··· ≤zt,τ(Pn). Because all such packets belong to the same machine in step r−τ−1(i.e.bt,τ+1(Pi) =bfor all i∈[n]), they belong share the same source neighborhood of size ℓr−τ−1(i.e.SrcNbhdt,τ+1(Pi) = SrcNbhdt,τ+1(P1)). By Claim 31and the definition of bt,τ, bt,τ(Pi)−bt,τ(P1)≤1+1 2s(zt,τ(Pi)−zt,τ(P1)) = 1+1 2s(zt,τ+1(Pi)−zt,τ+1(P1)) ≤2+bt,τ+1(Pi)−bt,τ+1(P1) = 2. Therefore, there are at most three possible values of bt,τ(Pi). The claim follows by considering each of the ℓdestination neighborhoods separately: /vextendsingle/vextendsingle/braceleftbig bt,τ(P) :bt,τ+1(P) =b/bracerightbig/vextendsingle/vextendsingle=/summationdisplay DstNbhd/vextendsingle/vextendsingle/braceleftbig bt,τ(P) :bt,τ+1(P) =b,DstNbhdt(P) =DstNbhd/bracerightbig/vextendsingle/vextendsingle ≤3ℓ. 27 Page 28: Claim 33 (Distinct senders bound) .For anyb, |/braceleftbig bt,τ+1(P) :bt,τ(P) =b/bracerightbig | ≤3ℓ2. Proof. Within each Nbhdτ j, there exist at most ℓ2distinct pairs of destination neighborhoods DstNbhdt(P)and source neighborhoods Nbhdt,τ(P)forP∈Nbhdτ j. As before, we fix some DstNbhd andSrcNbhd and letP1,...,Pnall satisfybt,τ(Pi) =b, DstNbhdt(Pi) =DstNbhd , andSrcNbhdt,τ+1=SrcNbhd . Using the same argument, we show that /vextendsingle/vextendsingle/braceleftbig bt,τ+1(Pi) :i∈[n]/bracerightbig/vextendsingle/vextendsingle≤3. We conclude by considering all such pairs. As a result of Claims 30,32, and 33, we conclude that each packet P= (Msg,Src,Dst,Srct)can be equipped with some itinerary Srct=bt,r(P),bt,r−1(P),...,bt,t+1(P),bt,t(P)∈DstNbhdt(P) that properly translates an instances of OutboxttoOutboxt+1and does so without ever requiring local memory more than 3s=O(Nδ)on any intermediate step or any machine to send or receive messages from more than 3ℓ2=O(Nρ)other machines. This itinerary can be executed using an (r−t)-round(γ,δ,ρ)-MPC protocol. Pre-computation phase. It remains to show that each bt,τcan be computed for each packet. To do so, we prove that there exists an O(r−t)-round(γ,δ,ρ)-MPC protocol that ends with each machine i knowingEqualDstSumt,τ(P)andLessDstSumt,τ(P)for eachP∈Outboxt iandτ∈ {t,...,r}. We design an MPC protocol that uses the tree structure to propag ate information about individual child neighborhoods to their parents and vice versa. We describe r ecursive relationships that elucidate how to compute the two salient quantities. First, we introduce a useful intermediate term. Let EqualDstSumt,τ(i,DstNbhd) =/summationdisplay i′∈Nbhdτ(i) i′<i/summationdisplay P′∈Outboxt i′ DstNbhdt(P′)=DstNbhd|P′|, denote the sum of all packet that are contained by “lesser” ma chines that share source and destination neighborhoods. Note that EqualDstSumt,τ(P)can be computed for any P∈Outboxt ilocally by machineigiven prior knowledge of EqualDstSumt,τ(i,DstNbhdt(P)). Thus, the pre-computation phase need only compute the latter term. We also introduce a term that represents sum of the sizes of al l packets that share source and desti- nation neighborhoods: NbhdSumt(SrcNbhd,DstNbhd) =/summationdisplay P′∈SrcNbhd DstNbhdt(P′)=DstNbhd|P′|. Now, we provide the recurrences for any τ <r (or for any SrcNbhd satisfying |SrcNbhd|>1: EqualDstSumt,τ(i,DstNbhd) =EqualDstSumt,τ+1(i,DstNbhd)+EqualDstSumt,τ(minNbhdτ+1(i),DstNbhd) =EqualDstSumt,τ+1(i,DstNbhd)+/summationdisplay SrcNbhd∈Children (Nbhdτ(i)) SrcNbhd<Nbhdτ+1(i)NbhdSumt(SrcNbhd,DstNbhd), NbhdSumt(SrcNbhd,DstNbhd) =/summationdisplay SrcNbhd′∈Children (SrcNbhd)NbhdSumt(SrcNbhd,DstNbhd), (5) LessDstSumt(SrcNbhd,DstNbhd) =/summationdisplay SrcNbhd′∈Children (SrcNbhd)LessDstSumt(SrcNbhd,DstNbhd). (6) 28 Page 29: Whenτ=r, the terms EqualDstSumt,τ(i,DstNbhd),NbhdSumt(Nbhdr(i),DstNbhd), and LessDstSumt(Nbhdr(i),DstNbhd)can be computed locally within machine i. We follow a tree-like communication pattern to compute all r elevant sums. Each machine i∈[q] computes /braceleftbig EqualDstSumt,τ(i,DstNbhd) :τ≥t,DstNbhd ∈Children (Nbhdt(i))/bracerightbig and/braceleftbig LessDstSumt(Nbhdτ(i),DstNbhd) :τ≥t,DstNbhd ∈Children (Nbhdt(i))/bracerightbig by completing r−tpropagate-up rounds,r−taggregation rounds, and r−tpropagate-down rounds. • The propagate-up rounds compute the neighborhood-wide me ssage-size summations NbhdSumt(Nbhdτ(i),DstNbhd)andLessDstSumt(Nbhdτ(i),DstNbhd)in each machine isatisfyingi= minNbhdτ(i)for eachτandDstNbhd . • The aggregation rounds accumulate NbhdSum terms of the same level into specific EqualDstSum terms. • The propagate-down rounds iteratively compute and share t heEqualDstSum and LessDstSum terms with all relevant machines. Propagate-up rounds: Fix some neighborhood Nbhdt j. Afterr−τpropagate-up rounds, the goal is to compute the terms NbhdSumt(SrcNbhd,DstNbhd)andLessDstSumt(SrcNbhd,DstNbhd) for each relevant destination neighborhood DstNbhd ∈Children (Nbhdt j)and source neighbor- hoodSrcNbhd ∈Descendantsτ(Nbhdt j)within a single machine in each source neighborhood, minSrcNbhd . We argue that this is possible inductively. Before performing any computation (that is, after “round 0” ), each machine iindividually com- putesNbhdSumt(Nbhdr(i),DstNbhd)andLessDstSumt(Nbhdr(i),DstNbhd)by aggregating the messages encoded in its own representation of Outboxt i. We assume that the procedure works as specified for r−τrounds of communication. Fix some SrcNbhdτ−1∈Descendantsτ−1(Nbhdt j). Then, for every SrcNbhdτ∈ChildrenSrcNbhdτ−1, the quantities {NbhdSumt(SrcNbhdτ,DstNbhd) :DstNbhd ∈Children (Nbhdt j)} ∪{LessDstSumt(SrcNbhdτ,DstNbhd) :DstNbhd ∈Children (Nbhdt j)} have already been computed and stored in minSrcNbhdτ. By the recurrence relations of Equa- tions ( 5) and ( 6),NbhdSumt(SrcNbhdτ−1,DstNbhd)andLessDstSumt(SrcNbhdτ−1,DstNbhd) are functions of those quantities. Thus, it suffices to have e ach machine minSrcNbhdτmachine transmit its relevant terms to minSrcNbhdτ−1. A round of MPC communication that transmits such messages involves each machine sending most one messag e of sizeℓand receiving at most ℓ messages, each of size O(ℓ). Inductively, we ensure that all neighborhood sums are compu ted afterr−tpropagate-up rounds. Because each machine handles at most ℓdistinct messages having total size O(ℓ2)per MPC round, this protocol does not violate the bounded message size and b ounded distinct message constraints (so long asℓ2≪s), which can be guaranteed for sufficiently large n, so long asℓ=O(nρ/2). Aggregation rounds: After the completion of the aggregation rounds, each machin eicomputes terms of the form /braceleftbig EqualDstSumt,τ(minNbhdτ+1(i),DstNbhd) :DstNbhd ∈Children (Nbhdt j)/bracerightbig from relevant NbhdSumtterms ifi= minNbhdτ+1(i). By the recurrence, it is sufficient for machine ito following ℓ2distinct terms /braceleftbig NbhdSumt(SrcNbhd,DstNbhd) :SrcNbhd ∈Children (Nbhdτ(i)),DstNbhd ∈Children (Nbhdt j)/bracerightbig . Since all machines ialready knows such terms for SrcNbhd =Nbhdτ+1(i), it can obtain the remaining NbhdSumtterms by simultaneously sharing information with its “cous in machines:” 29 Page 30: minSrcNbhd , for allSrcNbhd ∈Children (Nbhdτ(i)). This can be handled by a single round of communication where each “ (τ+1) th-level neighborhood representative” machine forwards i ts sums sums to up to ℓother first representatives, for a total messaging cost of O(ℓ2). We user−tseparate rounds to repeat this process for each τ≥t. Propagate-down rounds: It remains to compute each EqualDstSumt,τ(i,·)and LessDstSumt(Nbhdτ(i),·)term at each machine i. The relevant LessDstSumtterms have already been computed by each respective minNbhdτ(i)machine and can be propagated to machineiby usingr−trounds of tree-propagation through intermediate minNbhdτ′(i)machines. The same is possible for EqualDstSumt,τterms, although the individual terms need to be added in order to follow the recurrence. We propagate the terms in the same way as the LessDstSumtterms, but we take special care to carry out the extra additions. Thi s can be accomplished simultaneously to the other propagate-down rounds. This protocol involves each first-child node sharing at most ℓ distinct messages, each of size at most O((r−t)ℓ). As before, every node sends and receives at mostO((r−t)ℓ2)≪swords. After these 3(r−t)rounds have elapsed, each machine icomputesbt,τ(P)for eachP∈Outboxt i. Using this itinerary, the machines routes tuples of the form (P,bt,r(P),...,bt,t(P)) to the respective machine bt,τ(P)in roundr−τ. Due to the arguments presented at the start of the section, this procedure terminates with each (Msg,Src,Dst)tuple being held by some machine i such that the resulting Outboxt+1 iis valid, and the procedure involves at most O(r−t) =O(1+γ ρ) rounds of (γ,δ,ρ)-MPC computation. B.3.2 Proof outline of Corollary 25 Corollary 25 (Transformers simulate (γ,δ,ρ)-MPC) .For constants δ∈(0,1) andγ,ρ >0, an R-round deterministic (γ,δ,ρ)-MPC protocol with ninputs can be simulated by a transformer f∈ TransformerN,N′ m,H,Lwith depthL=R+1, headsH= 1, embedding dimension m=O(nδ+4ρlogn), context length N=n, and blank chain-of-thought tokens N′= max(0,O(n1+γ−δ)−n). The proof of Corollary 25involves an adaptation to the proof of Theorem 3.1 of [ 69]. To avoid restating the proof in its entirety, we provide a brief outli ne of the proof of Theorem 3.1 and explain which modification is necessary. Theorem 3.1 is a consequence of Lemmas B.4, B.5, and B.6 of [ 69], which establish that there exist single-layer transformers that simulate the initializati on, a round of computation and communication, and the output formatting of any fixed (γ,δ)-MPC protocol. The input and output steps of (γ,δ), and(γ,δ,ρ)-MPC protocols are identical, only Lemma B.5 needs to be exam ined. To simulate a single round of an MPC protocol with a transform er, all local computations are sim- ulated in the element-wise multi-layer perceptron (MLP) un its, and all communication is handled in a single multi-headed self-attention layer (Lemma B.7) S ince(γ,δ,ρ)-MPC protocols add no re- strictions related to local computation, the former can be s imulated in exactly the same manner with identical MLPs, and it remains to analyze the construction o f Lemma B.7. We restate Lemma B.7 and provide an replacement lemma that suffices to prove Corol lary25. Lemma 34 (Lemma B.7 of [ 69]; multi-headed attention simulates MPC communication) .For any R-round MPC protocol with local memory sandqmachines and any round r∈[R−1], there exists a single-layer transformer f∈Transformerq,0 m,H,1withH=O(loglogq)andm=O(s4logq), which, given as input a length- qencoding of each machine’s outgoing messages in round r, returns an encoding of each machine’s incoming messages in round r+1. Lemma 35 (Single-headed attention simulates bounded-message MPC c ommunication) .For anyR- round MPC protocol with local memory s,qmachines, and a k-machine communication limit and any roundr∈[R−1], there exists a single-layer single-headed transformer f∈Transformerq,0 m,H,1 withH= 1 andm=O(k4slogq), which, given as input a length- qencoding of each machine’s outgoing messages in round r, returns an encoding of each machine’s incoming messages in round r+1. 30 Page 31: Lemma 35is an immediate consequence of Lemma 3.2 of [ 69], their main technical result, which already applies to the regime with limits on the number of mac hines in communication. By replacing Lemma 34with Lemma 35, applying the remainder of the proof of Theorem 3.1 of [69], and letting k=O(nρ), the proof of Corollary 25is complete. C Single-layer transformers and graph reasoning tasks This appendix presents the results of Section 3.3, which separates the collection of graph reason- ing tasks into those retrieval tasks that can be efficiently solved by single-layer parameter-ef ficient transformers—including node count, edge count, edge exist ence, and node degree—and those par- allelizable orsearch tasks that require deeper constructions—including connectivit y, shortest path, cycle check, and triangle count. Taken together, these resu lts establish that the single-layer trans- formers of the Depth 1regime are capable of solving simple aggregation-based tas ks, but that their known limitations in capacity as communication protocols o f [68] apply to non-trivial graph reason- ing tasks. We specific a particular node/edge encoding of an input graph G= (V,E)and a graph reasoning task using a consistent encoding scheme that closely resemb les the encoding used in our graph reasoning experiments and those of [ 40]. This encoding is distinguished by the fact that each node and vertex of the graph Gis represented by exactly one token, rather than the pair of t okens utilized in our experiments. This choice ensures that any trivial pre -processing of graph inputs (e.g. using a positional embedding to associate each edge token pair) ne ed not count towards the single-layer transformer model. Definition 4. Thenode/edge encoding of a graphG= (V,E)withV⊆[n]and|V|+|E| ≤N−1 and a graph reasoning task Pis a sequence X=X(G,P) = (x1,...,xN)∈RN×d whered= 5and each xi= (x1 i,x2 i,isVertex i,isEdgei,isTask i)∈ {0,...,n}2×{0,1}3 satisfies the following conditions: • For eachv∈V, there exists exactly one i∈[N−1]withxi= (v,0,1,0,0) . • For each (u,v)∈E, there exists exactly i∈[N−1]withxi= (u,v,0,1,0) . • The token xNencodes a particular instance of the task P, by encoding isTask N= 1with an optional edge or node encoding. That is, for tasks without arguments (such as triangle count),xN= (0,0,0,0,1) . For tasks with a single node argument v∈[n](such as node degree),xN= (v,0,1,0,1) . For tasks with a pair of node arguments u,v∈[n](such as shortest path and connectivity), xN= (u,v,0,1,1) . • All other tokens satisfy xi= (0,0,0,0,0) . We say that a single-layer transformer f∈TransformerN m,H,1solves taskPon graphGif the output corresponding to the task description f(X(G,P))Nencodes the output of the task. Since fis a single-layer transformer, we can write this output as f(X)N=ψ/parenleftBiggH/summationdisplay h=1softmax/parenleftbig φ(xN)TQhKT hφ(X)T/parenrightbig φ(X)Vh/parenrightBigg (7) for element-wise multi-layer perceptrons φ:Rd→Rm,ψ:Rm→R broadcasted across each input (i.e. φ(X) = (φ(x1),...,φ(xN))∈RN×m) and weight matrices Q1,...,QH,K1,...,KH,V1,...,VH∈Rm×m. 31 Page 32: Throughout, we assume that all parameters in the transforme r model and intermediate numerical quantities can be written using O(logN)-bit floating point numbers. This assumption can be sat- isfied for the positive results of Appendix C.1and is necessary to obtain the negative results of Appendix C.2. We permit the element-wise MLPs φandψto be arbitrary functions for the negative results, while re - stricting them to be MLPs that can be approximated using boun ded-size multi-layer ReLU networks for the positive results. While we do not make these ReLU netw orks explicit, we restrict ourselves to simple operations that can be computed using linear transfo rmations and the application of smooth univariate functions. Finally, we acknowledge that the negative results in are lar gely superseded by those of [ 53], which establishes that L-complete languages (including graph connectivity and cyc le check) cannot be effi- ciently solved by constant-depth transformers, let alone s ingle-layer transformers. We include these bounds anyway to mark a contrast with our positive results; d raw further connections to the com- munication complexity lens on transformers; and exhibit si mple task instances (G,P)that require greater depth, including constant diameter and constant de gree graphs. C.1 Positive results for single-layer transformers Theorem 36 (Formal version of Theorem 5;Depth 1computes retrieval tasks) .Fix any graph rea- soning task among node count, edge count, edge existence, an d node degree and any graph size N. Then, there exists a single-layer single-headed transform erf∈TransformerN m,1,1with embedding dimensionm=O(logN)that solves the task on all graphs G= (V,E)of size|V|+|E| ≤N−1 formatted as node/edge embedding sequences. Proof. We prove that a single head of self-attention with input and o utput MLPs can solve these re- trieval and aggregation tasks in a parameter-efficient mann er by first carefully designing a universal input MLPφ:Rd→Rmfor somem=O(logN)to produce embeddings that encode useful graph properties. Then, we define task-specific query, key, and val ue matrices Q,K,V:∈Rm×mand output MLPs ψ:Rm→Rthat produce the correct answers. While we do not explicitly account for finite-precision computations, all of the operations utili zed can be carefully implemented to respect O(logN)-bit floating-point numerical representations using the te chnical approaches of [ 68,69]. Shared sinusoidal embedding MLP. We denote the embedding output of the input MLP as φ(xi) = (isTask i,isVertex i,isEdgei,φ′(xi))∈Rm for someφ′(xi) :Rd→R2m′for somem′=O(logN)andm= 2m′+ 3. For some fixed a1,...,am′∈[0,1] to be determined, let φ′(xi) =  η(x1 i) ifisVertex i= 1, ξ(x1 i,x2 i)ifisEdgei= 1, /vector0 otherwise, whereηis a sinusoidal embedding MLP for nodes with η(v) = (sin(2πa1v),cos(2πa1v),...,sin(2πam′v),cos(2πam′v))∈R2m′, (8) andξis an edge embedding satisfying ξ(u,v) =m′ m′+η(u)Tη(v)(η(u)+η(v)). We first show that the node embeddings are approximately orth ogonal. By employing standard trigonometric identities, we obtain η(u)Tη(v) =m′/summationdisplay j=1cos(2πaj(u−v)), and note that /bar⌈blη(v)/bar⌈bl2 2=m′. We use a standard concentration argument to prove that |η(u)Tη(v)| ≪ m′ifu/\⌉}atio\slash=vwith high probability. 32 Page 33: Claim 37 (Near-orthogonality of embeddings η).There exist coefficients a1,...,am′∈[0,1] that comprise the embedding η: [n]→Rmof Equation (8)such that /vextendsingle/vextendsingleη(u)Tη(v)/vextendsingle/vextendsingle≤2/radicalbig m′logn,ifu/\⌉}atio\slash=v. Proof. We employ the probabilistic method. Consider m′iid random variables a1,...,am′∼ Unif([0,1]) and letηrepresent the respective node embedding. Fix some arbitrar yu,v∈[n] withu/\⌉}atio\slash=vand note that η(u)Tη(v) =m′/summationdisplay j=1cos(2πaj(u−v)). For anyj∈[m′], the integrality of u−vimplies that E aj[cos(2πaj(u−v))] = 0. Hoeffding’s inequality provides the following bound: Pr/bracketleftBig/vextendsingle/vextendsingleη(u)Tη(v)/vextendsingle/vextendsingle≥2/radicalbig m′logn/bracketrightBig ≤exp/parenleftbigg −4m′logn 2m′/parenrightbigg ≤1 n2. By applying a union bound to all/parenleftbign 2/parenrightbig choices ofuandv, we have the following: Pr/bracketleftBig ∀u,v∈[n],u/\⌉}atio\slash=v:/vextendsingle/vextendsingleη(u)Tη(v)/vextendsingle/vextendsingle≥2/radicalbig m′logn/bracketrightBig ≤n(n−1) 2n2<1. Hence, there exists a satisfactory set of coefficients a1,...,am′. For someηsatisfying Claim 37, letχ= max u,v∈[n],u/\⌉gatio\slash=v|η(u)Tη(v)| ≤2√m′logn. By taking m′=O(logn)be sufficiently large, we guarantee that χ≤m′ cfor any constant c≥2. We then bound all relevant inner products between vertex and edge embeddings for sufficently large m′(and sufficiently small χ). We assume throughout that u,u′,v,v′are distinct and note that ξ(u,v) =ξ(v,u)(and omit symmetric inequalities). /bar⌈blη(v)/bar⌈bl2 2=m′. (9) /vextendsingle/vextendsingleη(u)Tη(v)/vextendsingle/vextendsingle≤χ≤m′ 4. (10) /vextendsingle/vextendsingleη(u)Tξ(u,v)−m′/vextendsingle/vextendsingle=/vextendsingle/vextendsingle/vextendsingle/vextendsinglem′(m′+η(u)Tη(v)) m′+η(u)Tη(v)−m′/vextendsingle/vextendsingle/vextendsingle/vextendsingle= 0. (11) /vextendsingle/vextendsingleη(u)Tξ(u′,v′)/vextendsingle/vextendsingle=/vextendsingle/vextendsingle/vextendsingle/vextendsinglem′(η(u)Tη(u′)+η(u)Tη(v′)) m′+η(u′)Tη(v′)/vextendsingle/vextendsingle/vextendsingle/vextendsingle≤2m′χ m′−χ≤4χ≤m′ 4. (12) /vextendsingle/vextendsingle/vextendsingle/bar⌈blξ(u,v)/bar⌈bl2 2−2m′/vextendsingle/vextendsingle/vextendsingle=/vextendsingle/vextendsingle/vextendsingle/vextendsinglem′2·(2m′+2η(u)Tη(v))−2m′(m′+η(u)Tη(v))2 (m′+η(u)Tη(v))2/vextendsingle/vextendsingle/vextendsingle/vextendsingle ≤/vextendsingle/vextendsingle/vextendsingle/vextendsingle2m′η(u)Tη(v) m′+η(u)Tη(v)/vextendsingle/vextendsingle/vextendsingle/vextendsingle≤2m′χ m′−χ≤4χ≤m′ 4. (13) /vextendsingle/vextendsingleξ(u,v)Tξ(u,v′)−m′/vextendsingle/vextendsingle=/vextendsingle/vextendsingle/vextendsingle/vextendsinglem′2(m′+η(u)Tη(v′)+η(u)Tη(v)+η(v)Tη(v′)) (m′+η(u)Tη(v))(m′+η(u)Tη(v′))−m′/vextendsingle/vextendsingle/vextendsingle/vextendsingle =/vextendsingle/vextendsingle/vextendsingle/vextendsinglem′2η(v)Tη(v′)−m′η(u)Tη(v)η(u)Tη(v′) (m′+η(u)Tη(v))(m′+η(u)Tη(v′))/vextendsingle/vextendsingle/vextendsingle/vextendsingle ≤m′2χ+m′χ2 (m′−χ)2≤4χ+4χ2 m′≤m′ 4. (14) |ξ(u,v)Tξ(u′,v′)|=/vextendsingle/vextendsingle/vextendsingle/vextendsinglem′2(η(u)Tη(u′)+η(u)Tη(v′)+η(v)Tη(u′)+η(v)Tη(v′) (m′+η(u)Tη(v))(m′+η(u′)Tη(v′))/vextendsingle/vextendsingle/vextendsingle/vextendsingle ≤4m′2χ (m′−χ)2≤16χ≤m′ 4. (15) 33 Page 34: Therefore, we conclude the following bounds: η(u)Tη(v),η(u)Tξ(u′,v′),ξ(u,v)Tξ(u′,v′)∈/bracketleftbigg −m′ 4,m′ 4/bracketrightbigg , /bar⌈blη(v)/bar⌈bl2 2,η(u)Tξ(u,v),ξ(u,v)Tξ(u,v′)∈/bracketleftbigg3m′ 4,5m′ 4/bracketrightbigg , /bar⌈blξ(u,v)/bar⌈bl2 2∈/bracketleftbigg7m′ 4,9m′ 4/bracketrightbigg . We now apply the above bounds to prove that each task can be sol ved by a single-headed transformer with weights Q,K,V∈Rm×m, output MLP ψ:Rm→R, and shared input MLP φdiscussed previously. Node count task. IfPencodes the node count task, we specify weights matrices Q,K,Vin order to ensure that QTφ(xN) =m′e19;KTφ(xi) = (isVertex i+isTask i)·e1; andVTφ(xi) = isTask i·e110. Then, the following is true of exponentiated key/query inn er products and scalings products of value vectors: exp(φ(xN)TQKTφ(xi)) =/braceleftbiggem′ifisVertex i= 1ori=N, 1 otherwise. exp(φ(xN)TQKTφ(xi))Vφ(xi) =/braceleftbiggem′·e1ifi=N, /vector0 otherwise. The output of the softmax is then softmax/parenleftbig φ(xN)TQKTφ(X)T/parenrightbig φ(X)V=/summationtextN i=1exp(φ(xN)TQKTφ(xi))Vφ(xi) /summationtextN i=1exp(φ(xN)TQKTφ(xi)) =em′ em′·(1+|V|)+1·(N−|V|−1)·e1. Let yN:= (softmax/parenleftbig φ(xN)TQKTφ(X)/parenrightbig φ(X)V)1∈R denote the first coordinate of the softmax output. By taking m′≥log(4N), we guarantee that yN∈/bracketleftbigg1 1+|V|+N/em′,1 1+|V|/bracketrightbigg ⊆/bracketleftbigg1 5 4+|V|,1 1+|V|/bracketrightbigg . By letting the output MLP ψapproximate the function ψ(z) =/floorleftBig 1 z1−1/floorrightBig forz1∈[1,N+ 2], we can ensure that f(X)N=|V|. Edge count task. We similarly let QTφ(xN) =m′e1;KTφ(xi) = (isEdgei+isTask i)·e1; andVTφ(xi) =isTask i·e1. By an identical analysis of the attention matrix and ψconstruction, we ensure that f(X)N=|E|. Edge existence task. Under the node/edge encoding, we assume that the edge existe nce task is encoded as xN= (x1 N,x2 N,0,1,1) for somex1 N,x2 N∈Vand should return f(X)N= 1/braceleftbig (x1 N,x2 N)∈E/bracerightbig . We choose our weight matrices to ensure that QTφ(xN) =φ′(xN) = ξ(x1 N,x2 N);KTφ(xi) =φ′(xi), andVTφ(xi) = 2(1−isTask i)e1. By applying Claim 37and lettingm′=O(logN)to be sufficiently large, the following is true the query/key inner products: exp(φ(xN)TQKTφ(xi)) = exp(/vextenddouble/vextenddoubleξ(x1 N,x1 N)/vextenddouble/vextenddouble2 2)≥e7m′/4if/braceleftbig x1 i,x2 i/bracerightbig =/braceleftbig x1 N,x2 N/bracerightbig , exp(φ(xN)TQKTφ(xi))≤e5m′/4≤1 8Ne7m′/4otherwise. 9Note that QTφ(xN)is the only relevant query embedding to the output f(X)N; see Equation ( 7). 10While we do not specify the entries of Q,K,V, they are clearly linear transformations. 34 Page 35: We can therefore analyze the softmax output yNto obtain the following bound: yN≤2exp(/vextenddouble/vextenddoubleξ(x1 N,x1 N)/vextenddouble/vextenddouble2 2)1/braceleftbig (x1 N,x2 N)∈E/bracerightbig +2N·1 8Ne7m′/4 exp(/bar⌈blξ(x1 N,x1 N)/bar⌈bl2 2)(1+1{(x1 N,x2 N)∈E}) ≤21/braceleftbig (x1 N,x2 N)∈E/bracerightbig +1 4 1+1{(x1 N,x2 N)∈E}≤1/braceleftbig (x1 N,x2 N)∈E/bracerightbig +1 4, yN≥2exp(/vextenddouble/vextenddoubleξ(x1 N,x1 N)/vextenddouble/vextenddouble2 2)1/braceleftbig (x1 N,x2 N)∈E/bracerightbig exp(/bar⌈blξ(x1 N,x1 N)/bar⌈bl2 2)(1+1{(x1 N,x2 N)∈E})+N·1 8Ne7m′/4 ≥21/braceleftbig (x1 N,x2 N)∈E/bracerightbig 1+1{(x1 N,x2 N)∈E}+1 8≥1/braceleftbig (x1 N,x2 N)∈E/bracerightbig −1 4. Hence, yN∈/braceleftbigg/bracketleftbig −1 4,1 4/bracketrightbig if(x1 N,x2 N)/\⌉}atio\slash∈E,/bracketleftbig3 4,5 4/bracketrightbig if(x1 N,x2 N)∈E. Therefore, it suffices to design a threshold output MLP ψthat satisfies ψ(z) =/braceleftbigg1z1≥3 4, 0z1≤1 4, in order to distinguish between the two possible output rang es. This can be easily constructed by taking a linear combination of two ReLU neurons. Node degree task. We assume that the task is encoded as xN= (x1 N,0,1,0,1) and should return f(X)N= deg(x1 N) :=/vextendsingle/vextendsingle/braceleftbig (x1 N,v)∈E/bracerightbig/vextendsingle/vextendsingle. We use weight matrices with QTφ(xN) =φ′(xN) = η(xN);KTφ(xi) =φ′(xi); andVTφ(xi) =isTask i. This ensure that the following is true about the query/key inner products for sufficiently large m′: exp(φ(xN)TQKTφ(xi)) =em′ifi=Norx1 N∈/braceleftbig x1 i,x2 i/bracerightbig , exp(φ(xN)TQKTφ(xi))≤em′/4≤1 4Nem′otherwise. We similarly ensure that the following holds about the softm ax outputyN. yN≤em′ em′·(deg(x1 N)+2)=1 deg(x1 N)+2, yN≥em′ (deg(x1 N)+2)em′+N·1 4Nem′ ≥1 deg(x1 N)+9 4. We then use the similar approach employed in the node count ta sk to choose an output MLP that approximately computes φ(z) =/floorleftBig 1 z1−2/floorrightBig . We thus conclude that there exist single-layer transformer s that solve all of the described tasks. C.2 Negative results for single-layer transformers Theorem 38 (Formal version of Theorem 6;Depth 1cannot compute search or retrieval tasks) .Fix any graph reasoning task among graph connectivity, shortes t path, and cycle detection. Any single- layer transformer f∈TransformerN m,H,1withO(logN)-bit precision that solves the task on all graphsG= (V,E)of size|V|+|E| ≤N−1formatted as node/edge embedding sequences has width satisfying mH= Ω(N/logN). Our negative results generalizes and applies the approach o f [68] to prove negative results for single- layer transformers by communication complexity reduction s. The bounds hinge on the following fundamental fact about the hardness of a two-player game whe re two agents jointly attempt to com- pute a set disjointness quantity that depends on both of thei r inputs with bounded communication. 35 Page 36: Fact 1 (Disjointness communication complexity lower bound [ 83]).Suppose Alice and Bob are given inputs a,b∈ {0,1}rand wish to jointly compute DISJ(a,b) = max iaibiby alternately sending single-bit messages to one another. Any determinis tic protocol that computes DISJ(a,b) requires at least rrounds of communication, or rbits of information. We first generalize Theorem 7 of [ 68] to show that no transformer can efficiently solve an embedde d set disjointness problem without having a width that scales linearly inr. We prove the theorem and later apply to it prove negative results about relevant grap h reasoning tasks. To demonstrate that the graph reasoning tasks do not require pathological input gra phsGto be hard, we exhibit particular input graph instances with constant graph diameter or const ant degree where the task cannot be efficiently solved. Lemma 39 (Generic Depth 1communication complexity negative result) .For some sequence length, fix two disjoint subsets A,B⊂[N−1], and consider a single-layer transformer f∈ TransformerN m,H,1withO(logN)-bit precision that solves set disjointness, i.e. f(X)N=DISJ(a,b) for any input XwhereXAis a function of Alice’s input a∈ {0,1}r,XBis a function of Bob’s in- putb∈ {0,1}r, andX[N]\(A∪B)is fixed regardless of aandb. Then,fhas width satisfying mH= Ω(r/logN). Proof. Consider any transformer fthat solves D ISJas specified in the theorem statement. We show that this implies the existence of a O(mHlogN)-round communication protocol that solves DISJ(a,b)for any inputs a,b∈ {0,1}r. An application of Fact 1immediately proves the theorem statement. If such a transformer fexists, then the following is true for some φ,ψ, andQ,K,V: DISJ(a,b) =ψ/parenleftBiggH/summationdisplay h=1softmax/parenleftbig φ(xN)TQhKT hφ(X)/parenrightbig φ(X)Vh/parenrightBigg =ψ/parenleftBiggH/summationdisplay h=1Zh,Aexp(Lh,A)+Zh,Bexp(Lh,B)+Zh,[N]\(A∪B)exp(Lh,[N]\(A∪B)) exp(Lh,A)+exp(Lh,B)+exp(Lh,[N]\(A∪B))/parenrightBigg , for partial softmax and normalization terms11defined as follows for S⊂[N]andh∈[H]: Zh,S=/summationtext i∈Sexp(φ(xN)TQhKT hφ(xi))Vhφ(xi)/summationtext i∈Sexp(φ(xN)TQhKT hφ(xi))∈Rm, Lh,S= log/parenleftBigg/summationdisplay i∈Sexp(φ(xN)TQhKT hφ(xi))/parenrightBigg ∈R+. The definition of input instances Ximplies that Alice can compute Zh,AandLh,Aas a function of her inputa; Bob can similarly compute Zh,BandLh,Bfromb; andZh,[N]\(A∪B),Lh,[N]\(A∪B), and all implementation details of fare known by all players. Therefore, Bob can compute D ISJ(a,b)by anO(mHlogN)-round protocol where Alice sends him (Zh,A,Lh,A)bit-by-bit. It remains to apply Lemma 39to each graph reasoning task by defining graph instance encod ingsX that encode D ISJ(a,b)in the solution to the task for some r= Θ(N). Proof of Theorem 38.For each task, we provide a pair of “hard instances:” one with constant degree and one with constant diameter, in order to obey different no tions of graph simplicity. 11These terms are designed in order to facilitate computation withlog(N)-bit floating-point numbers. Note that each Zh,Sis a convex combination of value vectors VT hφ(xi), which means that Zh,Srequires no more bits of precision to accurately approximate than VT hφ(xi). Similarly, each Lh,Sis within an O(logN)additive factor ofmaxi∈Sφ(xN)TQhKT hφ(xi). Finally, the output of the softmax term for head his a convex combi- nation of Zh,A,Zh,B, andZh,[N]\(A∪B), which makes accurate computation of each output possible w ithout requiring greater precision. 36 Page 37: Graph connectivity task. For both instances, on any disjointness input a,b∈ {0,1}r, we define a graphG= (V,E)with|V|+|E|=O(r)whose edges encode the input, i.e. E=E(a,b). We define three fixed disjoint sets of potential edges ¯EA,¯EB,¯E∗with|¯EA|+|¯EB|+|¯E∗|=O(r) such thatE=EA(a)∪EB(b)∪¯E∗, whereEA(a)⊂¯EAis a function of only Alice’s input aand EB(b)⊂¯EBis a function of Bob’s b. We define an enumeration of all vertices in Vand potential edges in ¯EA,¯EB,¯E∗. We first fix their sizes|V|,|¯EA|,|¯EB|,|¯E∗|and then index the vertices and edges in the order V,¯EA,¯EB,¯E∗. That is, V={1,...,|V|}, ¯E∗={(ui,vi) :i∈S∗},forS∗=/braceleftbig |V|+1,...,|V|+|¯E∗|/bracerightbig , ¯EA={(ui,vi) :i∈A},forA=/braceleftbig |V|+|¯E∗|+1,...,|V|+|¯E∗|+|¯EA|/bracerightbig , ¯EB={(ui,vi) :i∈B},forB=/braceleftbig |V|+|¯E∗|+|¯EA|+1,...,|V|+|¯E∗|+|¯EA|+|¯EB|/bracerightbig . This implies that the following transformer input X∈RN×dforN=|V|+|S∗|+|A|+|B|+1 is a valid node/edge encoding such that inputs XAis a function of a,XBis a function of b, and X[N]\(A∪B)is constant: • Fori∈V,xi= (i,0,1,0,0) encodes a fixed vertex. • Fori∈S∗,xi= (ui,vi,0,1,0) encodes a fixed edge. • Fori∈A,xirepresents edge (ui,vi)if it exists: xi=/braceleftbigg(ui,vi,0,1,0) if(ui,vi)∈EA(a), (0,0,0,0,0) otherwise. • Fori∈B,xirepresents edge (ui,vi)if it exists: xi=/braceleftbigg(ui,vi,0,1,0) if(ui,vi)∈EB(b), (0,0,0,0,0) otherwise. •xN= (u,v,0,1,1) encodes the task token for some fixed (u,v)∈V2. Constant-diameter instance: We define a graph Gof diameter at most 8 that encodes an disjointness instancea,b∈ {0,1}rin an instance of connectivity. • Let|V|= 3r+2, and let nodes 1 and 3r+2denote the “source” and “sink” node respec- tively. That is, xN= (1,3r+2,0,1,1) . • Edges between the source node 1 and nodes {2,...,r+1}and between the end node 3r+2and{2r+2,...,3r+1}always are included. That is, E∗={(1,i) :i∈ {2,...,r+1}}∪{(3r+2,i) :i∈ {2r+2,...,3r+1}}. • Alice’s input is encoded as edges (i+1,i+r+1) fori∈[r]. That is, EA(a) ={(i+1,i+r+1) :ai= 1} ⊂¯EA={(i+1,i+r+1) :i∈[r]}. • Bob’s inputs are similarly encoded as (i+r+1,i+2r): EB(b) ={(i+r+1,i+2r+1) :bi= 1} ⊂¯EB={(i+r+1,i+2r+1) :i∈[r]}. We visualize this construction of Gin Figure 5. There exists a path between node 1 and node 3r+2if and only if there exists some i∈[r]such that (i+1,i+r+1),(i+r+1,i+2r+1)∈E, which corresponds to ai=bi= 1. Thus, the graph Gis connected if and only if D ISJ(a,b) = 1 . Any transformer fthat computes f(X)N= 1{1and3r+2are connected in G} 37 Page 38: 12 3 45 6 78 9 1011a1 a3b1 b2 Figure 5: The constant diameter graph construction for r= 3,A= (1,0,1) andB= (1,1,0) . The source1and sink11are connected which is equivalent to D ISJ(A,B) = 1 by construction. also solves disjointness. Since N=|V|+|S∗|+|A|+|B|+1 = Θ(r), we conclude the proof of hardness of solving graph connectivity on this instance by a pplying Lemma 39. Constant-degree instance: We can define a degree-3 graph Gthat similarly encodes a disjointness instancea,b. To do so, we modify the previous construction by replacing t he fixed edges E∗with two binary trees, each of depth O(logr), between the source and end nodes as roots and Alice and Bob’s nodes incident to ¯EAand¯EBrespectively as leaves. The remainder of the construction— including the encoding of EA(a)andEB(b)and the connectivity analysis—is identical. Since a binary tree with rleaves hasO(r)nodes and edges, the graph Gcan be similarly encoded as a length-Ntransformer input for N=O(r). See Figure 6for a visualization. S X a4a2b1 b2 Figure 6: The degree 3 graph construction for r= 4,A= (0,1,0,1) ,B= (1,1,0,0) . The source nodeSand sink node Xare connected, which is equivalent to D ISJ(A,B) = 1 by construction. Shortest path task. We can modify the graph connectivity constructions Gto create a decision variant of shortest path, rather than graph connectivity. L etD(G)be the length of a path between the source and sink node in each connectivity construction, and note that D(G) = 4 for the constant- diameter instance and D(G) =O(logr)for constant-degree. We design a new graph G′of size O(r)then appends a path of D(G) + 1 vertices toG. Then, the shortest path between the source and sink node is D(G)if they are connected and D(G)+1 if not. If there exists a transformer f′solves shortest path on graphs of the form G′, then there exists a minor modification fthat solves graph connectivity on graph G. Cycle check task. We again modify the graph connectivity constructions. We pr oduceG′by adding a single edge to Gbetween the source node 1 and the sink node 3r+2, which ensures that G′ has a cycle if and only if Gis connected between the source and sink nodes. Therefore, a transformer f′can only solve check check if there exists a transformer fthat solves connectivity. D Representational gaps between transformers and GNNs Our experiments in Section 4highlight the differences in graph reasoning abilities bet ween vanilla transformers with a naive input graph tokenization and grap h neural networks (GNNs). These dis- tinctions can be understood as the consequences of model cap acity or inductive bias between trans- 38 Page 39: formers and GNNs. This appendix contrasts the novel analysi s of the capacities of vanilla transform- ers in Section 3with previously established limitations on GNNs. We presen t two theoretical tools for deriving negative results on GNN capabilities: C ONGEST distributed computing model and the Weisfeiler-Leman isomorphism test (WL-test). We discuss t he results implied by these frameworks and contrast them with transformer capabilities and limita tions. D.1 Limitations of GNNs via C ONGEST analogy The bidirectional relationship between transformers and t he massively parallel computation (MPC) distributed computing model of [ 69] was partially inspired by a similar analogy between GNNs an d the C ONGEST model by [ 49]. The MPC and C ONGEST distributed computing protocols differ in their models of h ow messages are passed between machines. While MPC protocols permit any machine to send a message to any other machine subject to capacity constraints, C ONGEST protocols operate on a graph topology that restricts machines to send messages exclusively to and from neighbors. As a consequence, two nodes in a C ONGEST graph that are separated by a path of length pcannot share information with one another without Ω(p)rounds of communication; no similar notion of “long-range c ommunication” exists for MPC. All message-passing GNNs where each node initially knows on ly its own features can be simulated by a C ONGEST protocol whose rounds scales linearly in the depth of the GNN [49]. This reduction implies a lower bound on solving several parallelizable and search tasks discussed in Appendix A.2. Theorem 40. Any graph neural network fwith width (message size) mand depthLthat computes any of the following tasks on n-node graphs has the following size lower bounds: •Subgraph connectivity, minimum spanning forest, minumum c ut, shortest path: L√m= ˜Ω(√n). •Cycle detection, diameter computation: Lm=˜Ω(n). IfL=O(logN), then for all tasks, m=˜Ω(n). In contrast, the quantitative bounds in Section 3establish sharp trade-offs between transformers and GNNs for parallelizable tasks and suggest a possible equiva lence for search tasks. All parallelizable tasks—including (subgraph) connectiv ity, minimum spanning forest, minimum cut, and cycle detection—can be solved by transformers of de pthL=O(logn)and widthm= O(nǫ)for any constant ǫ∈(0,1) due to Theorem 2. In contrast, Theorem 40requires that a similar- depth GNNs have width m=˜Ω(n), and a GNN of comparable width requires depth L=˜Ω(n1−ǫ). On the other hand, search tasks, such as shortest path and dia mter, are only guaranteed to be solvable by transformers of depth O(logn)and widthO(n1+ǫ)(for graphs with |E|=n2) by Theorem 4. This statement compares to the GNN negative results of Theor em40. D.2 Limitations of GNNs via the Weisfeiler-Leman test A relationship between GNNs and the Weisfeiler-Leman heuri stic graph isomorphism test [ 81] (WL test) further establishes representational limitations o f message-passing GNNs. This connection and the rich literature surrounding it is presented in greater d etail by [ 57]. The 1-WL test is a permutation-invariant test for predictin g whether two candidate graphs are iso- morphic that works by first labeling each node with the empty s et∅and then repeatedly replacing each label with a multiset of its neighbors’ labels. A hierar chy of WL test variants exists where thek-WL test maintains a label for each k-tuple of vertices. The failure models of these heuristic solutions are well-understood; critically, the 1-WL test c annot distinguish between connected and disconnected graphs. The abilities of message-passing GNNs without unique node i dentifiers to determine whether two graphs are isomorphic are limited by the graphs distinguish able by the 1-WL test [ 82,57]. As an immediate consequence, such GNNs cannot solve graph connec tivity at all , unlike transformers, which can do so with a logarithmic-depth parameter-efficien t representation. The relationship is bidirectional; message-passing GNNs (and transformers as well) admit an efficient approximation of the 1-WL test. 39 Page 40: The effectiveness of these bounds is tempered by the assumpt ion that no node includes identifying features, which is easily overcome by standard GNN models. T he analogy is further limited by the fact that graph embeddings of transformers must have som e (possibly arbitrary) node identifier as input in order to tokenize a graph using the node/edge enco ding without losing the ability to associate each node with its incident edges. However, the ju xtaposition of the 1-WL and C ONGEST - based limitations on the abilities of GNNs to solve connecti vity-based tasks suggests a fundamental gap in capabilities between models that is apparent in multi ple theoretical lenses. E Experimental Details E.1 Datasets We evaluate our model on the diverse graph reasoning tasks pr esented in GraphQA [ 22]. We used the public code of the dataset available at https://github.com/google-research/google-research/ tree/master/graphqa . The code to generate the datasets is licensed under the Apach e License, Version 2.0. The tasks in the dataset range in difficulty and encompass the following c ategories: •Graph-level: node counting (counting the number of nodes in a graph), edge counting (counting the number of edges in a graph), cycle check (determining whe ther a graph contains a cycle), and triangle counting (counting the number of triangles in a gra ph). •Node-level: node degree (calculating the degree of a given node in a graph ). •Edge-level: connectivity (finding if there is a path from one node to anoth er), edge exis- tence (whether a given edge exists in a graph, and shortest pa th (finding the length of the shortest path from one node to another). The graphs used in the experiments in this paper and the corre sponding graph reasoning tasks are taken from [ 22]. There are 1,000 graphs in the original train set, 500graphs in the dev set, and 500 graphs in the test set. The graphs are generated randomly usi ng Erd ˝os-Rényi (ER) random graph model [ 21]. Graph size ranges from 5 to 20 nodes. Train set statistics. Average number of nodes: 11.90; average number of edges: 37. 01; average node degree: 5.43. Test set statistics. Average number of nodes: 12.37; average number of edges: 39. 79; average node degree: 5.70. no cycle 3 4 5 60200400600800 Figure 7: Histogram of minimum cycle lengths for cycle check instances.While random instances of graph reasoning tasks provide a valuable assessment of the task complexity on realistic graphs, they do not nec- essarily reflect the “worst case” graph inputs that convey negative results like Theorem 6and Theorem 3. For example, the reduction that es- tablishes that cycle check is “as hard as” graph connectivity and the consequential logarithmic- depth hardness results hinge on the considera- tion of graph instances with nnodes and poly- nomial cycle length. However, as witnessed by Figure 7, the shortest cycles observed in 1000 instances of GraphQA cycle check is almost always of length three, and only 3.2% of in- stances are larger. As a consequence, identify- ing the existence of a cycle on the GraphQA dataset is inherently local, which is reflected by a strong performance by heuristic-based GNN solutions (Table 2)—despite the fact that efficient GNNs for worst-case cycle check do not exist (Theorem 40). For our experiments on the effect of the scale of the number of training data points in the final results we obtain, we use the open-source code of GraphQA available t o generate a larger training dataset of 100K examples. We follow the original instructions and pa rameters to create this larger training dataset. 40 Page 41: E.2 Implementation Details Model Hyperparameters. We fixed the number of iterations as 1,000,000 and train stand ard decoder-only transformers with L= 12 layers,m= 768 embedding dimension, H= 12 heads, learning rate 5·10−4, and dropout 0.1. These models have an approximate parameter count of 60,000,000. We used random search [ 8] over the following set of hyperparameters to select a unive rsal archi- tecture for all tasks: The range provided for the learning ra te and dropout rate are [10−4,10−1] and[0,0.5] . The number of layers Land embedding dimension mis selected from L∈ {4,6,8,10,12,14,16 }andm∈ {192,384,576,768,960,1152,1344,1536 }. We employed the GLU [ 71] activation as a non-linearity. Model Selection. We implemented our model in JAX [ 24] and used AdamW [ 41,48] as the op- timizer. Optimal hyperparameters for each task and model we re determined by training on the GraphQA Train dataset and evaluating performance on the GraphQA Devdataset. The results pre- sented in the paper are based on the held-out GraphQA Test dataset. Hardware Acceleration. All experiments were conducted using Google’s TPUv3 and TPU v5e accelerators [ 33]. E.3 Baseline Results To rigorously evaluate the performance of transformers on g raph reasoning tasks, we compare them against three established categories of baselines: 1.Prompting-based methods. These methods provide the LLM with a textual descriptions of the graph and question within the prompt. We consider the follow ing variations and copy the results from the original papers: •ZERO -SHOT . In this approach, the model is given a task description and i mmediately asked to produce the desired output. No additional examples or demon strations are provided. •FEW-SHOT . This approach provides the model with a few examples of the t ask and their desired outputs [ 11]. Unlike traditional training, these examples are include d directly in the prompt, allowing the model to learn and adapt during the inference. • C OT. Chain-of-thought (CoT) prompting [ 80] provides examples each showing step-by-step reasoning, teaching the LLM to generate its own thought proc esses for tackling new tasks. •ZERO -COT. Zero-shot CoT [ 43] builds upon Chain-of-Thought (CoT) prompting by eliminat - ing the need for training examples. The LLM generates its own step-by-step reasoning process using a simple trigger phrase like “Let’s think step by step” . •COT-BAG. BAG prompting [ 78] extends COT to improve the performance of LLMs on graph- related tasks by appending “Let’s construct a graph with the nodes and edges first” to the prompt. 2.Graph-based methods. These models are specifically designed to process graphs as i nput and are trained task-specific. They leverage the connections be tween nodes to learn patterns and make predictions, making them ideal for tasks where a graph i s involved. We use GCN [ 42], MPNN [ 26], and GIN [ 82] from this category. GraphToken [ 61] is a GNN-based model that processes the graph and feed the output of the GNN as soft-tok ens to an LLM. 3.Transformer models (Ours). The last class of model are task-specific vanilla transforme r mod- els [75]. The 60M transformer-1K model is the one described above trained on 1,000 training examples from the GraphQA training set. To investigate the i mpact of training data scale, we generated a larger dataset containing 100,000 examples, ensuring the same distribution as the original training set by using the official GraphQA code and t rained 60M transformer-100K on that. The 11B transformer (FT)-1K is a vanilla transformer model that is started with a pre- trained checkpoint of T5 [ 64] and is fine-tuned on the 1K training dataset. We also include two fine-tuned PaLM 2 [ 4] transformers of size XXS and XS. Similar to prompting basel ines, this model receives a textual description of the graph as inp ut to leverage its textual reasoning capabilities. The results for ZERO -SHOT ,ZERO -COT,FEW-SHOT ,COT, and COT-BAG are taken from Fatemi et al. [22]. Results for SOFT -PROMPT and GraphToken are sourced from Perozzi et al. [ 61]. 41 Page 42: Retrieval tasks Parallelizable Tasks Search Tasks Subgrap h Counting Method Node count Edge count Edge existence Node degree Conn ectivity Cycle check Shortest path Triangle countingPromptingZERO -SHOT [22] 21.7 12.4 44.5 14.0 84.9 76.0 11.5 1.5 ZERO -COT [22] 14.6 9.4 33.5 10.4 73.5 32.3 33.6 12.7 FEW-SHOT [22] 25.3 12.0 36.8 17.4 79.4 37.4 22.7 3.0 COT [22] 27.6 12.8 42.8 29.2 45.2 58.0 38.6 8.1 COT-BAG [22] 26.9 12.5 37.3 28.0 45.2 52.1 40.4 8.1Graph-basedGCN [ 42] 6.4 1.2 47.0 9.8 83.8 83.2 50.2 4.0 MPNN [ 26] 19.4 16.2 69.2 99.4 94.0 99.0 66.8 30.6 GIN [ 82] 71.2 4.4 71.2 36.2 93.8 98.8 54.0 30.4 GraphToken [ 61] 99.6 42.6 73.8 96.2 93.2 95.6 63.8 34.8Ours60M transformer-1K 100.0 100.0 67.6 31.5 92.9 97.1 57.4 33.4 60M transformer-100K 100.0 100.0 96.1 91.7 98.0 98.0 97.2 40.5 XXS transformer (FT)-1K 100.0 70.6 73.0 31.0 93.6 98.0 60.4 29.0 XS transformer (FT)-1K 100.0 73.2 98.6 50.6 96.6 96.8 60.0 28.6 12B transformer (FT)-1K 100.0 45.0 100.0 68.8 98.4 98.0 92.8 26.0 Table 4: Comparison of various methods in different categor ies on graph reasoning tasks of GraphQA. Here, we categorize the tasks using the taxonomy pr oposed in Section 3. 10210310410500.20.40.60.81AccuracyNode count 10210310410500.20.40.60.81Edge count 1021031041050.60.70.80.91Edge existence 1021031041050.20.40.60.81Node degreeTrain Test 10210310410500.20.40.60.81 Training SamplesAccuracyConnectivity 1021031041050.91 Training SamplesCycle check 1021031041050.60.81 Training SamplesShortest path 1021031041050.20.40.60.81 Training SamplesTriangle counting Figure 8: Comparison of train and test scaling on all tasks. We independently evaluated GCN, MPNN, and GIN models on thes e tasks. We used the origi- nal architectures proposed in their respective papers and p erformed hyperparameter tuning on the GraphQA Devdataset. E.4 Further Experimental results Table 4presents a comprehensive comparison of the graph reasoning capabilities across various base- line models and our proposed transformer architectures. Th e results highlight several key findings, which we summarize below: Transformers Exhibit Strong Performance on Graph-based Re asoning Problems. While trans- formers are not explicitly designed for graph reasoning tas ks like graph-based models, they demon- strate surprisingly strong performance in this domain. The results of this study indicate that trans- formers, despite their versatility as a general architectu re, can often match or even surpass special- ized graph models on a variety of graph reasoning benchmarks . 42 Page 43: Transformers Excel at Retrieval Tasks. As proved in Theorem 36, retrieval tasks can be solved by transformers. The obtained results confirm that such task s are relatively easy for transformers as they obtained the full accuracy on most of such tasks. One exc eption here is the node degree task that GNNs outperform transformers but still transformers p erform relatively well. We discuss why GNNs outperform transformers well for this task. Larger Transformers Excel at Solving Search Tasks. As discussed in Theorem 4, transformers are effective for search tasks, albeit requiring a larger nu mber of parameters compared to retrieval tasks. This is empirically evident in the comparison betwee n Transformer-1K and Transformer- 1K (pretrained). It’s worth noting that the pretrained tran sformer used here has 11 billion parameters, a significant increase from the 1 million parameters in Trans former-1K. Transformers Excel at Capturing Global Patterns. An interesting observation here is the perfor- mance gap between Transformers and GNNs across tasks with va rying emphasis on local versus global graph structure. Notably: 1.Local Structure: The node degree task, which relies heavily on local node info rmation, is best handled by MPNN, a GNN-based model. 2.Global Structure: In contrast, tasks like connectivity, triangle counting an d shortest path, which require understanding global graph patterns, are dominate d by Transformer models. Notably, vanilla Transformers achieve a remarkable 45% relative imp rovement over even much larger LLMs augmented with GNN-generated soft prompts (GraphToke n) on the shortest path task. This showcases the exceptional ability of Transformers to c apture long-range dependencies, a critical factor in understanding global graph structure. E.4.1 Sample complexity ablations To develop a more comprehensive understanding of graph reas oning tasks learnability by small transformers, we train a variety of transformers for 1,000, 000 steps on each task on a range of sample sizes. In Figure 8, we demonstrate how the model performance improves as a func tion of sample complexity. By doing so, we witness the relative ha rdness of each task in terms of the marginal benefit of new samples. • Edge count and node count are “easy” retrieval tasks that ca n be solved perfectly with as few as 100 samples. • Edge existence attains near-perfect train and test classi fication accuracy as the number of training samples approaches 100,000. • Connectivity, shortest path, and node degree demonstrate a sharp improvement in evalua- tion error as a function of the sample size. These models perf ectly fit the training set in most sample size regimes, but yield a closer correspondence between training and testing error when trained on 100,000 samples. • Cycle check and triangle count have persistent gaps betwee n training and testing error and overfit even in the large sample setting. 43 Page 44: This figure "cpc_adj.jpeg" is available in "jpeg" format from: http://arxiv.org/ps/2405.18512v1 Page 45: This figure "cpc_int.jpeg" is available in "jpeg" format from: http://arxiv.org/ps/2405.18512v1 Page 46: This figure "ic.jpeg" is available in "jpeg" format from: http://arxiv.org/ps/2405.18512v1

---