arXiv paper 2412.11711v2

MiMoTable: A Multi-scale Spreadsheet Benchmark with Meta Operations for Table Reasoning

Authors: Zheng Li, Yang Du, Mao Zheng, Mingyang Song

Published: 2024-12-16

Abstract:

Extensive research has explored the capability of Large Language Models (LLMs) for table reasoning and has significantly improved performance on existing benchmarks. However, tables and user questions in real-world applications are more complex and diverse, leaving a considerable gap between real-world use and the existing benchmarks. To fill the gap, we propose a Multi-scale spreadsheet benchmark with Meta operations for Table reasoning, named MiMoTable. Specifically, MiMoTable incorporates two key features. First, the tables in MiMoTable are all spreadsheets used in real-world scenarios, which cover seven domains and contain different types. Second, we define a new criterion with six categories of meta operations for measuring the difficulty of each question in MiMoTable, which also serves as a new perspective for measuring the difficulty of existing benchmarks. Experimental results show that Claude-3.5-Sonnet achieves the best performance with 77.4% accuracy, indicating that there is still significant room for LLMs to improve on MiMoTable. Furthermore, we grade the difficulty of existing benchmarks according to our new criterion. Experiments show that the performance of LLMs decreases as the difficulty of benchmarks increases, thereby demonstrating the effectiveness of our proposed criterion.

Paper Content:
Zheng Li*, Yang Du*, Mao Zheng*, Mingyang Song*
Tencent Hunyuan
jasonzli@tencent.com

All data and code are open-sourced at https://github.com/jasonNLP/MiMoTable.

1 Introduction

Tabular data plays a crucial role across diverse domains, including education, finance, and others. Table reasoning involves deriving meaningful insights and answers from structured tabular data to address specific user queries (Zhang et al., 2024).
This process significantly improves the efficiency of information retrieval and interpretation for users.

*Equal contribution.

[Figure 1: Examples of MiMoTable benchmark. An example spreadsheet (table type: single file, multiple sheets, complex header; table difficulty: hard) with three questions:
(a) Question: What is the product code of product 1? Meta Operations: Lookup. Difficulty: 1. Answer: The product code of product 1 is CP-001.
(b) Question: Highlight the "Product Name" column in red. Meta Operations: Lookup, Edit. Difficulty: 2. Answer: not recoverable from the extracted text.
(c) Question: Draw a line chart to visualize the inbound quantity of different product names. Meta Operations: Lookup, Visualize. Difficulty: 2.16. Answer: not recoverable from the extracted text.]

To foster a comprehensive understanding of this field, researchers have proposed and developed numerous table reasoning tasks, such as TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis (Lu et al., 2024). Various methods have been proposed to tackle these tasks, and large language models (LLMs) have achieved promising results (Liu et al., 2022; Cheng et al., 2023). To evaluate performance, several table reasoning benchmarks have been introduced, including WikiTableQuestions (Pasupat and Liang, 2015), ToTTo (Parikh et al., 2020), SheetCopilot (Li et al., 2023a), and Text2Analysis (He et al., 2024).

In the realm of table reasoning, benchmark development has not kept pace with the rapid advancements in methodological approaches. While LLMs have exhibited remarkable performance on existing benchmarks, recent studies have brought to light persistent limitations in their capacity for nuanced table comprehension (Sui et al., 2024). Upon critical analysis, we have identified shortcomings in existing benchmarks across two key aspects.

arXiv:2412.11711v2 [cs.CL] 24 Dec 2024

First, current benchmarks exhibit significant limitations in their representation of real-world tabular data complexity.
Most tables in these benchmarks have simple headers (single-row/column) and fail to cover all four task types comprehensively. However, real-world tables are diverse and vary along three dimensions: 1) headers ranging from single-row/column to complex hierarchical forms; 2) a variable number of sheets in Excel files; 3) multiple tables within a single sheet. These complexities are often overlooked in existing benchmarks, limiting their ability to accurately assess table reasoning capabilities in practical applications.

Second, although existing benchmarks are divided according to task granularity, the difficulty of different benchmark datasets within the same task can vary. For example, the WikiSQL (Zhong et al., 2017) dataset is simpler than the unrestricted WikiTableQuestions dataset because it limits questions to those that can be answered using a subset of SQL queries. The current task divisions cannot reflect this difference in difficulty.

To address the above issues, we propose MiMoTable, a table reasoning benchmark with diverse spreadsheets and meta operations. Our dataset comprises 428 spreadsheets from real-world scenarios, spanning seven domains: architecture, finance, office, education, accounting, e-commerce, and manufacturing. Our table data is comprehensive, featuring both simple and complex headers, and varying in the number of sheets from single to multiple. Some spreadsheets even contain multiple tables within a single sheet. We have constructed 1,719 question-answer pairs based on these spreadsheets, forming (spreadsheet, question, answer) triplets. Examples of MiMoTable are shown in Figure 1.

To better reflect the differences among the questions in a dataset, we also propose a new criterion for categorizing problems based on meta operations. There are six types of meta operations: Lookup, Edit, Compare, Calculate, Visualize, and Reasoning. Each type of meta operation corresponds to a difficulty score.
With the new criterion, we can associate each problem with one or more meta operations, thereby assigning a difficulty score to each problem. In this way, different benchmarks can be graded with difficulty scores, facilitating better analysis and comparison.

Our main contributions are as follows:

• We propose a new benchmark comprising 428 multi-scale spreadsheets in both Chinese and English, featuring simple and complex headers, single and multiple sheets, single and multiple files, and multiple tables within a single sheet. Based on these characteristics, we classify the tables into three difficulty levels: simple, medium, and hard. We construct 1,719 (spreadsheet, question, answer) triplets covering a wide range of tasks.
• We introduce a novel criterion for categorizing table reasoning problems using meta operations, each assigned a difficulty score. These non-overlapping meta operations can be combined to represent existing tasks, allowing for a more precise evaluation of a model's capabilities in table-related tasks.
• We conducted extensive experiments, demonstrating that the proposed benchmark is challenging for existing LLMs and showing the effectiveness of the proposed meta operations.

Table 1: Categories of table difficulty.

| | Simple Header | Complex Header |
|---|---|---|
| Single Sheet | simple table | medium table |
| Multiple Sheets | medium table | hard table |
| Multiple Files | medium table | hard table |
| Multiple Tables | hard table | hard table |

2 MiMoTable Benchmark

In this section, we introduce how we prepare our new MiMoTable benchmark and ensure its quality.

2.1 Types and Difficulty of Tables

Most existing table reasoning benchmarks utilize single tables with simple headers, contrasting with the diverse tables encountered in real-world scenarios, particularly in spreadsheets like Excel files.
After analyzing real-world spreadsheets, we categorize them along four dimensions: header types, the number of sheets per file, the number of tables per sheet, and the file count.

[Figure 2: Illustrations of different table types, including simple header, complex header, single sheet, multiple sheets, multiple files, and multiple tables in one sheet.]

Table 2: The description and grade of meta operations.

| Meta Operation | Description | Grade | Example |
|---|---|---|---|
| Lookup | Locate the position of a specific target | 1 | What is the product code of product 1? |
| Edit | Modify, delete, or add content in a table | 1 | Highlight the "Product Name" column in red |
| Calculate | Numerical computation (sum, avg, max, etc.) | 2 | How many students are in the table? |
| Compare | Compare two or more targets in a table | 2 | Who has the highest score? |
| Visualize | Show in a chart | 2 | Draw a chart to show the distribution of scores. |
| Reasoning | Infer information from the table content that is not explicitly included | 3 | Analyze the relationship between the loan term, monthly interest, and interest. |

[Figure 3: The relationships between tasks in the existing benchmarks and our proposed meta operations. The figure links the existing tasks (TableQA, Table2Text, Table Manipulation, Advanced Data Analysis) to the meta operations, grouped by grade: Grade 1 (Lookup, Edit), Grade 2 (Compare, Calculate, Visualize), Grade 3 (Reasoning).]

As shown in Figure 2, header types can be divided into simple headers and complex headers. A simple header refers to a single-row or single-column header, while all others are considered complex headers. For example, hierarchical headers are classified as complex headers. Understanding complex headers is more challenging than understanding simple ones, and multiple sheets generally contain more information than a single sheet. Therefore, based on the aforementioned dimensions, we classify spreadsheets into three difficulty levels: simple, medium, and hard.
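The classification rules of Table 1 can be read as a small decision function. The sketch below is only an illustration: the names `SpreadsheetInput` and `table_difficulty` are hypothetical, and taking the hardest applicable level when several dimensions apply at once is an assumption that Table 1 does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpreadsheetInput:
    """Hypothetical description of one benchmark input (not from the paper's code)."""
    complex_header: bool            # anything other than a single-row/column header
    multiple_sheets: bool = False   # more than one sheet in a file
    multiple_files: bool = False    # several files given together
    multiple_tables: bool = False   # several tables inside one sheet


def table_difficulty(x: SpreadsheetInput) -> str:
    """Map the Table 1 dimensions to 'simple', 'medium', or 'hard'."""
    # Multiple tables in one sheet are hard regardless of header type.
    if x.multiple_tables:
        return "hard"
    # Multiple sheets or files: medium with simple headers, hard with complex ones.
    if x.multiple_sheets or x.multiple_files:
        return "hard" if x.complex_header else "medium"
    # Single file, single sheet, single table.
    return "medium" if x.complex_header else "simple"
```

For example, a single-sheet file with a hierarchical header comes out as "medium", while the same header spread over several sheets comes out as "hard".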
The specific classification rules are illustrated in Table 1.

2.2 Meta Operations

Current table reasoning benchmarks are classified by tasks, mainly including TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis. This task-based categorization evaluates model performance across different tasks but fails to measure differences between benchmarks within the same task or compare benchmarks from different tasks along the same dimension.

To enhance the analysis of table reasoning benchmarks, we propose a novel criterion categorizing questions by meta operations: Lookup, Edit, Calculate, Compare, Visualize, and Reasoning. Table 2 defines each operation. These operations reflect specific LLM capabilities in handling table-related problems, and a question may involve multiple operations. For instance, "How many students are in the table" requires both Lookup (locating student names) and Calculate (counting them) operations.

The combination of six meta operations can encompass tasks in current table benchmarks. Figure 3 shows the mapping relationship between existing tasks and meta operations. For instance, TableQA questions may involve combinations of Lookup, Compare, and Calculate operations.

Additionally, to assess problem complexity, we categorize the six meta operations into three difficulty grades (1, 2, 3) based on common criteria, as shown in Table 2. Lookup and Edit, involving simple content location or modification, are grade 1. Compare, Calculate, and Visualize, which require logical operations, are grade 2. Reasoning, necessitating inference beyond explicit table content, is the most complex at grade 3.

With the difficulty score of meta operations, we can calculate the difficulty score of each question and of the entire dataset. First, assume there are $N$ questions in the dataset. The $i$-th question $q_i$ can be associated with $K_i$ meta operations.
Suppose the $k$-th meta operation is denoted as $op_k$; then the sequence of meta operations for the $i$-th question can be represented as:

$$OP_{q_i} = [op_1, op_2, \ldots, op_{K_i}] \tag{1}$$

The difficulty sequence corresponding to the meta operations can be represented as:

$$S_{q_i} = [s_1, s_2, \ldots, s_{K_i}], \quad s_k \in \{1, 2, 3\} \tag{2}$$

where $s_k$ indicates the difficulty score corresponding to meta operation $op_k$.

[Figure 4: The data construction pipeline of MiMoTable benchmark. Stages recoverable from the figure:
(1) Extract the table content of a spreadsheet (example: Exam.xlsx) as <file_name>Exam.xlsx</file_name><sheet_name>Sheet</sheet_name><sheet_content>Student ID | Student Name | Unit 1 | Rank | Unit 2 ... (content in markdown)</sheet_content>.
(2) Build the question generation prompt: "According to below table content, give me five questions using one or more meta operations. The questions should be diverse, different questions should contain different combinations of meta operations. ......", followed by the table content and the meta operation definitions (e.g., "Lookup: Locate the position of specific target ...").
(3) Generate questions with meta operations, e.g., "Question1: What is the student id of student 1? Related meta operations: Lookup".
(4) Check the questions with GPT-4o, then double check with humans.
(5) Run GPT-4o with the code interpreter plugin several times to get multiple answers, e.g., "The student id of student 1 is 5123001.", "The student id of student 1 is 5123002.", "The student id of student 1 is 5123001."
(6) Check the answers: code debug, vote between the multiple answers, and double check with humans, giving "The student id of student 1 is 5123001."]

To ensure that questions involving more difficult meta operations are assigned a higher difficulty score, we define the difficulty score of a question $q_i$ as follows:

$$qs_i = ms_{q_i} + \frac{\sum_{k=1}^{K_i} s_k - ms_{q_i}}{M_{ms_{q_i}}} \tag{3}$$

$$ms_{q_i} = \max(S_{q_i}) \tag{4}$$

The term $M_{ms_{q_i}}$ refers to the maximum value that $\sum_{k=1}^{K_i} s_k - ms_{q_i}$ can achieve when the hardest meta operation of the question has difficulty $ms_{q_i}$. Every meta operation can appear only once in each question. So when $ms_{q_i} = 3$, which means the corresponding meta operation is Reasoning, the most complex combination of the remaining meta operations is Compare, Calculate, Visualize, Lookup, and Edit. The difficulty scores of those meta operations sum to 2 + 2 + 2 + 1 + 1 = 8, so $M_3 = 8$. In the same manner, we can get all the values of $M_{ms_{q_i}}$:

$$M_{ms_{q_i}} = \begin{cases} 1 & ms_{q_i} = 1 \\ 6 & ms_{q_i} = 2 \\ 8 & ms_{q_i} = 3 \end{cases} \tag{5}$$

Because the fraction in Equation 3 lies in [0, 1] and $ms_{q_i} \in \{1, 2, 3\}$, the range of $qs_i$ is [1, 4]. With the difficulty score $qs_i$ of a single question $q_i$, we define the difficulty score of the entire dataset, $ds$, as the average of the difficulty scores of all questions:

$$ds = \frac{\sum_{i=1}^{N} qs_i}{N} \tag{6}$$

2.3 Dataset Construction

We introduce how the dataset is constructed from three aspects: table collection, question generation, and answer generation. Figure 4 illustrates the whole construction process.

Table Collection. Since Excel files are the most popular spreadsheet format in real-world scenarios, we choose .xlsx as the file format to be collected. The spreadsheets of our dataset are collected from publicly available sources on the internet. The Chinese tables primarily come from Baidu Wenku, while the English tables are mainly sourced from Google searches. These spreadsheets cover seven common domains: architecture, finance, office, education, accounting, e-commerce, and manufacturing. To ensure that the types of spreadsheets encompass as many real-world scenarios as possible, following the table type classification described above, we collected spreadsheets with both simple and complex headers, as well as those with single and multiple sheets.
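Returning to the difficulty score of Section 2.2, Equations 3 to 6 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' released code: the function names are hypothetical, the grades are copied from Table 2, and the normaliser of Equation 5 is derived from the grades instead of being hard-coded.

```python
# Grades of the six meta operations, as listed in Table 2 of the paper.
GRADE = {
    "Lookup": 1, "Edit": 1,
    "Calculate": 2, "Compare": 2, "Visualize": 2,
    "Reasoning": 3,
}


def max_rest(ms):
    """Equation 5: largest possible sum of the other grades when the hardest grade is ms."""
    # All operations with grade <= ms may co-occur once; subtract the hardest one itself.
    return sum(g for g in GRADE.values() if g <= ms) - ms


def question_difficulty(ops):
    """Equations 3 and 4 for one question, given the names of its meta operations."""
    unique_ops = set(ops)  # each meta operation counts once per question
    if not unique_ops:
        raise ValueError("a question needs at least one meta operation")
    scores = [GRADE[op] for op in unique_ops]
    ms = max(scores)
    return ms + (sum(scores) - ms) / max_rest(ms)


def dataset_difficulty(questions):
    """Equation 6: mean question difficulty over a non-empty list of questions."""
    return sum(question_difficulty(ops) for ops in questions) / len(questions)
```

On the examples of Figure 1 this gives 1 for Lookup alone, 2 for Lookup plus Edit, and about 2.17 for Lookup plus Visualize (shown as 2.16 in the figure); a question using all six operations reaches the maximum of 4.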
Even within a single sheet, our data may contain multiple tables. Additionally, we randomly sampled some individual spreadsheet files and combined them into groups of 2-5 files; the subsequent questions and answers are generated with those multiple files as input. To maintain the quality of the collected spreadsheets, we manually checked the content, removing files with significant noise, garbled text, or non-tabular formats. Furthermore, we reviewed each spreadsheet to anonymize any potential private information. Specifically: (1) personal names, contact information, addresses, etc., are masked and randomly regenerated by GPT, while headers are retained due to their importance and generality; (2) we double-checked the final spreadsheets with legal professionals.

Table 3: Comparison in table types and tasks between existing benchmarks and MiMoTable. The four task columns are TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis; the extracted text preserves the ✓ marks of each row but not the columns they belong to.

| Benchmark | Header Type | Sheet Num | File Num | Table Num | Tasks |
|---|---|---|---|---|---|
| WikiTableQuestion | simple | single | single | single | ✓ |
| WikiSQL | simple | single | single | single | ✓ |
| FetaQA | simple | single | single | single | ✓ |
| HiTAB | complex | single | single | single | ✓ ✓ |
| ToTTo | simple | single | single | single | ✓ |
| DAEval | simple | single | single | single | ✓ |
| WikiTableEdit | simple | single | single | single | ✓ |
| Text2Analysis | simple | single | single | single | ✓ ✓ ✓ |
| MiMoTable (ours) | simple & complex | single & multiple | single & multiple | single & multiple | ✓ ✓ ✓ ✓ |

Ultimately, we obtained 428 high-quality spreadsheets of various types in both Chinese and English. As shown in Table 3, our collected spreadsheets are far more diverse in type than those of current benchmarks, better reflecting various real-world scenarios.

Question Generation. As shown in Figure 4, we use GPT-4o to generate relevant questions and double-check them with models and humans. First, we extract the table content from a spreadsheet in markdown format.
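The extraction step can be pictured with a small serializer that produces the tagged layout visible in Figure 4 (`<file_name>`, `<sheet_name>`, `<sheet_content>`). The sketch below is an assumption-laden illustration: `rows_to_markdown` and `sheet_block` are hypothetical names, the rows are passed in as plain Python lists rather than read from an .xlsx file, and the exact markdown dialect used by the authors is not specified in the paper.

```python
def rows_to_markdown(rows):
    """Render a list of rows as a pipe-delimited markdown table.

    Empty cells (None), such as those left behind by merged header cells,
    become empty strings, so hierarchical headers flatten into runs of '|'.
    """
    if not rows:
        raise ValueError("at least one row is required")
    width = max(len(row) for row in rows)

    def fmt(row):
        cells = ["" if c is None else str(c) for c in row]
        cells += [""] * (width - len(cells))  # pad short rows to the table width
        return "|" + "|".join(cells) + "|"

    separator = "|" + "|".join([":--"] * width) + "|"
    return "\n".join([fmt(rows[0]), separator] + [fmt(r) for r in rows[1:]])


def sheet_block(file_name, sheet_name, rows):
    """Wrap one sheet in the tag layout shown in Figure 4 (format details assumed)."""
    return (
        f"<file_name>{file_name}</file_name>\n"
        f"<sheet_name>{sheet_name}</sheet_name>\n"
        f"<sheet_content>\n{rows_to_markdown(rows)}\n</sheet_content>"
    )
```

Calling `sheet_block("Exam.xlsx", "Sheet", [["Student ID", "Student Name"], [5123001, "student 1"]])` yields a block that can be pasted into a prompt. The sketch does not escape pipe characters inside cells, which a real pipeline would need to handle.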
Then, according to the extracted table content and our meta operations, GPT-4o is prompted to generate related questions. We instructed the model to generate multiple questions at once for each spreadsheet. To ensure the diversity of questions, the multiple questions should contain different combinations of meta operations. For multi-sheet or multi-file spreadsheets, we prompt the model to generate questions requiring cross-sheet or cross-file analysis. To ensure prompt effectiveness, we initially generate 50 samples, conduct a human evaluation to identify issues, and iteratively refine the prompt until most generated questions meet our criteria.

After generating initial questions with GPT-4o, we prompt it to verify whether they meet two requirements: relevance to the table content and correct meta operations. We then manually review and filter out unsuitable questions, yielding 1,719 high-quality, comprehensive questions.

We classify the questions by existing tasks, covering TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis. As Table 3 shows, our dataset exceeds all current table benchmarks in task comprehensiveness.

[Figure 5: Domain distribution of all spreadsheets. Architecture 23%, Finance 10%, Office 15%, Education 16%, Accounting 9%, E-commerce 11%, Manufacturing 16%.]

Answer Generation. After we collected the tables and generated the related questions based on meta operations, the final step is to obtain the corresponding answers. Since some questions involve editing the original spreadsheet or drawing charts, we leverage GPT-4o with the code interpreter plugin to get initial answers. The spreadsheet files can be directly used as inputs, and the model can generate Python code to run in a code interpreter to generate the modified files or visual charts.

As shown in Figure 4, we ensure GPT-4o answer quality by first debugging its code. If the code cannot be executed without errors, the answer is considered incorrect.
Second, we perform multiple inferences on each question-table pair, selecting the most frequent answer as a candidate. If all answers are different, the sample is considered invalid because of the inconsistency. Last, we have table analysis experts manually annotate the candidate answers, retaining the correct ones and correcting the wrong ones. We invited 10 experts with data analysis experience to annotate the dataset. Among them, native Chinese and English speakers each accounted for half of the group. Each answer is annotated twice, and Cohen's Kappa is 0.83, which indicates high inter-annotator agreement. The answers in our dataset contain not only text but also Excel files and charts.

2.4 Dataset Statistics

Our MiMoTable benchmark consists of 1,719 (spreadsheet, question, answer) triplets originating from 428 different spreadsheets. In this subsection, we provide statistics from different dimensions to give a more comprehensive understanding of our dataset.

Domains of Spreadsheets. As illustrated in Figure 5, our spreadsheets encompass seven domains in real-world applications.

Type and Difficulty of Tables. From Table 4, we can see that the table types of the collected spreadsheets are diverse, covering simple, medium, and hard difficulty.

Table 4: Distribution of table difficulty.

| Difficulty | Ratio | Table Type | Num |
|---|---|---|---|
| Simple | 33.6% | single file + single sheet + simple header | 144 |
| Medium | 32.5% | single file + multiple sheets + simple header | 30 |
| | | multiple files + simple header | 37 |
| | | single file + single sheet + complex header | 72 |
| Hard | 33.9% | single file + multiple sheets + complex header | 63 |
| | | multiple files + complex header | 32 |
| | | multiple tables | 50 |

Meta Operations of Questions. Figure 6 shows how often each of the six meta operations occurs in our benchmark questions. As the most basic operation on a table, Lookup is the most frequently occurring meta operation.
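The voting step of Answer Generation (Section 2.3), in which the most frequent of several sampled answers becomes the candidate and a sample with no repeated answer is discarded, can be sketched as follows. The function name `vote_answer` is hypothetical, and two simplifications are assumptions of this sketch: answers are compared as exact strings after trimming whitespace, and ties between equally frequent answers fall to the one seen first, a case the paper does not discuss.

```python
from collections import Counter


def vote_answer(answers):
    """Return the most frequent answer, or None if the sample is invalid.

    A sample is treated as invalid when there are no answers or when
    several answers were sampled and every one of them is different.
    """
    cleaned = [a.strip() for a in answers]
    if not cleaned:
        return None
    # most_common keeps first-seen order among equally frequent answers.
    answer, freq = Counter(cleaned).most_common(1)[0]
    if len(cleaned) > 1 and freq == 1:
        return None  # all sampled answers disagree
    return answer
```

With the three sampled answers shown in Figure 4 (two ending in 5123001 and one in 5123002), the function returns the 5123001 answer, which then goes to human annotation.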
More difficult meta operations such as Calculate and Reasoning also account for a relatively large proportion of our dataset, indicating that the questions of MiMoTable are diverse and comprehensive.

[Figure 6: Distribution of meta operations. Bar chart of question counts per meta operation (axis from 0 to 1800); bar labels as extracted: Lookup 1609, Edit 78, Calculate 926, Compare 613, Visualize 104, Reasoning 258.]

Difficulty of Questions. To investigate the difficulty of questions, we calculate the difficulty score of each question according to Equation 3. The score is in the range [1, 4], so we divide the distribution of scores into three intervals: [1, 2), [2, 3), and [3, 4]. The number and ratio of questions in each interval are given in Table 5.

Table 5: Distribution of question difficulty.

| Question Difficulty | Num | Ratio |
|---|---|---|
| [1, 2) | 311 | 18.1% |
| [2, 3) | 1150 | 66.9% |
| [3, 4] | 258 | 15.0% |

3 Experiments and Results

Our experiments have two main goals: (1) to evaluate representative LLMs' performance on our dataset; and (2) to demonstrate the effectiveness of the proposed meta operations. This section presents the relevant experiments and findings.

3.1 Experimental Setup

Models. We conducted experiments on 16 selected LLMs, comprising open-source LLMs, closed-source LLMs, and tabular LLMs. The open-source LLMs we evaluated include Llama3.1 [1], Llama3 (Dubey et al., 2024), Qwen2 (Yang et al., 2024), Qwen1.5 (Bai et al., 2023), Mistral (Jiang et al., 2023), DeepseekCoder (Guo et al., 2024), and Gemma (Mesnard et al., 2024). The closed-source LLMs are GPT-4o (OpenAI, 2023), Claude-3.5-Sonnet [2], and Gemini-1.5-Pro (Reid et al., 2024). We also evaluated Tablellama (Zhang et al., 2023), a tabular model fine-tuned specifically for various table tasks. However, most tabular models, such as Binder (Cheng et al., 2023), require inputs to be tables with known headers, which is not suitable for our benchmark.

Datasets.
To demonstrate the generality and effectiveness of our meta operations, we conducted experiments on the newly proposed benchmark as well as two existing open-source benchmarks: WikiTableQuestion and WikiSQL. These widely used TableQA benchmarks feature tables sourced from Wikipedia with simple headers.

Metrics. We used accuracy as the evaluation metric. Except for Tablellama, the predicted answers of the models in our experiments are all free-form. We prompted GPT-4o to judge the correctness of the predicted answer based on the question and the human-verified reference answer. Because a small portion of questions in MiMoTable are open-ended, we also instructed GPT-4o to give a score between 0 and 1 when it judges the question to be open-ended.

Implementation Details. For all LLMs except Tablellama, we input table contents in markdown format. For GPT-4o, we also tested another popular approach in table reasoning, which is denoted as GPT-4o-CI in Table 6. This method uploads spreadsheets and generates Python code, executed via a code interpreter plugin, with results fed back for analysis. MiMoTable and WikiTableQuestion are provided as spreadsheet files, which can be used directly as inputs to GPT-4o-CI. For WikiSQL, we first saved the non-file table content as Excel files and fed them to GPT-4o-CI. For Tablellama, we followed the prompt format specified in the original paper. For the existing benchmarks, we used GPT-4o to label the questions with meta operations and then calculated the difficulty scores of the datasets based on Equation 6. We use the official default parameters for all models. More details can be found in the supplementary material.

[1] https://ai.meta.com/blog/meta-llama-3-1/
[2] https://www.anthropic.com/news/claude-3-5-sonnet

Table 6: Performance of LLMs on MiMoTable. GPT-4o-CI refers to the GPT-4o model with a code interpreter plugin. Column groups: Language (English, Chinese), Table Difficulty (Simple, Medium, Hard), Question Difficulty ([1, 2), [2, 3), [3, 4]), Meta Operations (Lookup, Compare, Calculate, Reasoning).

| Model | Overall | English | Chinese | Simple | Medium | Hard | [1, 2) | [2, 3) | [3, 4] | Lookup | Compare | Calculate | Reasoning |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude-3.5-Sonnet | 77.4% | 79.0% | 76.2% | 81.3% | 75.5% | 72.1% | 89.0% | 77.1% | 63.3% | 89.0% | 79.7% | 76.1% | 63.3% |
| GPT-4o-CI | 69.2% | 70.8% | 68.1% | 81.7% | 67.1% | 50.8% | 81.0% | 71.1% | 45.8% | 81.0% | 73.1% | 70.6% | 45.8% |
| GPT-4o-TXT | 69.0% | 69.3% | 68.8% | 73.8% | 66.2% | 62.1% | 85.1% | 67.6% | 53.9% | 85.1% | 73.1% | 64.5% | 53.9% |
| Gemini-1.5-Pro | 60.2% | 61.6% | 59.1% | 64.9% | 57.4% | 55.3% | 86.1% | 55.0% | 47.6% | 86.1% | 60.3% | 50.2% | 47.6% |
| Llama-3.1-70B-Instruct | 57.0% | 56.6% | 57.3% | 64.0% | 51.6% | 51.3% | 82.0% | 52.1% | 45.1% | 82.0% | 57.7% | 48.8% | 45.1% |
| Qwen2-72B-Instruct | 55.7% | 51.5% | 58.8% | 61.4% | 52.6% | 49.2% | 80.4% | 50.1% | 46.9% | 80.4% | 56.3% | 45.5% | 46.9% |
| Llama-3-70B-Instruct | 53.7% | 52.3% | 54.8% | 60.1% | 48.6% | 48.5% | 78.9% | 47.8% | 46.1% | 78.9% | 51.5% | 44.4% | 46.1% |
| Qwen1.5-72B-Chat | 47.5% | 46.1% | 48.5% | 51.6% | 45.1% | 43.2% | 75.2% | 41.2% | 37.5% | 75.2% | 42.7% | 40.1% | 37.5% |
| Llama-3.1-8B-Instruct | 44.1% | 44.0% | 44.2% | 49.0% | 43.3% | 36.4% | 70.0% | 38.3% | 34.9% | 70.0% | 41.0% | 35.6% | 34.9% |
| Qwen2-7B-Instruct | 41.6% | 40.5% | 42.4% | 45.6% | 40.7% | 35.5% | 70.9% | 34.1% | 35.1% | 70.9% | 35.8% | 32.4% | 35.1% |
| Qwen1.5-14B-Chat | 40.2% | 38.6% | 41.3% | 44.1% | 38.4% | 35.3% | 73.8% | 32.4% | 28.9% | 73.8% | 32.2% | 30.9% | 28.9% |
| Llama-3-8B-Instruct | 39.9% | 39.8% | 40.0% | 45.0% | 37.0% | 34.4% | 68.4% | 34.0% | 27.5% | 68.4% | 36.1% | 31.6% | 27.5% |
| Mistral-7B-Instruct-v0.3 | 35.2% | 35.2% | 35.1% | 40.5% | 31.6% | 30.1% | 71.7% | 25.8% | 27.2% | 71.7% | 25.0% | 23.4% | 27.2% |
| Qwen1.5-7B-Chat | 34.4% | 33.6% | 35.0% | 40.1% | 29.9% | 29.7% | 69.3% | 25.1% | 28.3% | 69.3% | 24.8% | 22.9% | 28.3% |
| Deepseek-Coder-7B-Instruct-v1.5 | 34.1% | 33.9% | 34.2% | 38.4% | 33.0% | 27.6% | 68.4% | 25.1% | 26.5% | 68.4% | 22.7% | 24.2% | 26.5% |
| Gemma-7B-Instruct | 23.3% | 20.6% | 25.3% | 28.7% | 18.2% | 19.9% | 48.2% | 15.7% | 22.9% | 48.2% | 13.5% | 14.9% | 22.9% |
| Tablellama | 21.1% | 23.9% | 19.1% | 25.4% | 20.0% | 14.9% | 45.4% | 16.3% | 9.9% | 45.4% | 15.5% | 14.4% | 9.9% |

[Figure 7: The performance of LLMs on MiMoTable with respect to different meta operations, domains, and table types.]

3.2 Results and Analysis

Overall Performance. As shown in Table 6, we evaluate our proposed benchmark dataset using different LLMs.
Because most of the evaluated LLMs cannot generate edited files and charts, we evaluate only the questions that do not involve the Edit and Visualize meta operations, for a fair comparison with GPT-4o-CI. As we can see, the best-performing model, Claude-3.5-Sonnet, achieved an overall accuracy of only 77.4% on our benchmark, highlighting that MiMoTable poses significant challenges for current LLMs. This underscores the need for further exploration to improve model performance on more realistic table data.

Analysis of different approaches. There are mainly two approaches to solving table reasoning problems on spreadsheets. One approach is to represent the spreadsheet content in text form and input it into the model to directly generate answers. The other uses the spreadsheet file directly as input: the model writes code, runs it in a sandbox, and draws conclusions from the results in a ReAct (Yao et al., 2023) style. As shown in Figure 8, we compare the performance of those two methods based on the GPT-4o model, where GPT-4o-TXT is the text-based approach and GPT-4o-CI is the code-based approach. GPT-4o-CI performs better than GPT-4o-TXT in Calculate and Compare when the table difficulty is simple or medium, while GPT-4o-TXT performs better on hard tables and in the Lookup and Reasoning meta operations. This reveals that the code-based approach has an advantage in calculation only when the tables are not too hard, as hard tables can leave the model unable to write correct code to locate the required data. The text-based approach is good at Lookup and Reasoning because the model can see the entire content of the table as long as the context window is large enough. The first radar chart in Figure 7 shows the results of more LLMs with respect to the different combinations of table difficulty and meta operations.

[Figure 8: Performance on different data types.]

Capability of LLMs for table reasoning.
According to the results by table difficulty in Table 6 and by table type in Figure 7, most LLMs struggle on medium and hard tables. We attribute this to two factors: hierarchical headers and multiple similar tables. As illustrated in Figure 9, although the hierarchical relations between table cells appear very clear in the original spreadsheets, they become much less intuitive when converted into text, which makes them hard for LLMs to understand. Additionally, the tables in spreadsheets with multiple sheets are usually very similar, and LLMs need to consider multiple similar tables together to answer questions. In conclusion, current LLMs need to improve their ability to understand complex table structures and multiple similar tables in table reasoning. We also observe differences in the performance of different models across languages. For example, Claude-3.5-Sonnet performs better in English than in Chinese, while Qwen2-72B-Instruct shows the opposite pattern. We believe this is due to the varying proportions of different languages used during the pretraining and SFT stages for each model.

Effectiveness of Meta Operations. Although WikiTableQuestion and WikiSQL are both datasets for TableQA tasks with tables that have simple headers, we found that the performance of the same LLMs on these two datasets varies significantly. For example, Llama3-70B achieves 82.0% accuracy on WikiSQL but only 66.7% accuracy on WikiTableQuestion. No objective metric exists to explain this discrepancy. By scoring the datasets according to our meta operations, we found that WikiTableQuestion is significantly more difficult than WikiSQL: the difficulty of WikiSQL is 1.5, while the difficulty of WikiTableQuestion is 2.0. The questions involving simple tables in MiMoTable, denoted as MiMoTable-Simple, have a difficulty of 2.2.
We evaluated the performance of different LLMs on these three datasets (WikiSQL, WikiTableQuestions, and MiMoTable-Simple), and the results are shown in Figure 10. The x-axis represents the difficulty of the benchmarks graded by meta operations, and the y-axis shows the accuracy of the LLMs on these benchmarks. We found that as the difficulty score of the dataset increases, the model performance declines. This indicates that our proposed meta operations and difficulty scores are both generalizable and effective across different benchmarks.

to markdown:
Product Sales Analysis Table||||||||||||
:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
Category|Product1||||Product2||||Product3||||
|Weight|StandardPieces|Pieces|Proportion|Weight|StandardPieces|Pieces|Proportion|Weight|StandardPieces|Pieces|Proportion|
NecklaceSet|Below8g|32|1|0.03125|Below3g|88|3|0.0340909090909091|Below3g|232|4|0.0172413793103448|
|8-12g|||0|3-5g||12|0.136363636363636|3-5g|||0|
|12-15g|||0|Above5g|||0|Above5g|||0|
|Above15g|||0||||0||||0||
Pendant|Below8g|314|12|0.0382165605095541|Below3g|219|15|0.0684931506849315|Below3g|960||0|
|8-12g||3|0.00955414012738853|3-5g|||0|3-5g|||0|
|12-15g||3|0.00955414012738853|Above5g|||0|Above5g|||0|
|15-20g|||0||||0||||0|
|Above20g||1|0.00318471337579618||||0||||0|
Figure 9: A spreadsheet converted to text in markdown format.

Figure 10: The relation between performance and the difficulty of benchmarks. The x-axis is the difficulty of benchmarks graded by meta operations (1.5 to 2.2). The y-axis is the accuracy of tested LLMs (0% to 100%). Models shown: Claude-3.5-Sonnet, GPT-4o-CI, Gemini-1.5-Pro, Llama3-70B, Tablellama.

4 Related Work

The main tasks for table reasoning fall into four categories: TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis (Lu et al., 2024). Researchers have proposed various table benchmarks for these tasks.
TableQA is the most popular task, with benchmarks including WikiTableQuestions (Pasupat and Liang, 2015), WikiSQL (Zhong et al., 2017), FeTaQA (Nan et al., 2022), HybridQA (Chen et al., 2020), TAT-QA (Zhu et al., 2021), NQ-TABLES (Kwiatkowski et al., 2019), HybriDialogue (Nakamura et al., 2022), BIRD (Li et al., 2023b), and Spider (Yu et al., 2018). The primary benchmark for Table2Text is ToTTo (Parikh et al., 2020). Li et al. (2024) introduce WikiTableEdit, a high-quality Table Manipulation benchmark. SPREADSHEETBENCH (Ma et al., 2024) is a challenging spreadsheet manipulation benchmark. For the Advanced Data Analysis task, the benchmarks DAEval (Hu et al., 2024) and DS-1000 (Lai et al., 2023) have been proposed. Text2Analysis (He et al., 2024) is a recently introduced benchmark that includes both TableQA and Advanced Data Analysis tasks. Most existing table benchmarks feature simple table headers, but HiTab (Cheng et al., 2022) is a TableQA and Table2Text dataset based on hierarchical headers, and AIT-QA (Katsis et al., 2022) is a TableQA dataset with hierarchical headers specific to the airline industry. Unlike the existing benchmarks, we propose a new benchmark, MiMoTable, the first benchmark with multi-scale spreadsheets that simultaneously covers four tasks: TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis.

5 Conclusion

We propose MiMoTable, a multi-scale spreadsheet benchmark covering four tasks: TableQA, Table2Text, Table Manipulation, and Advanced Data Analysis. Experiments show that even the best existing LLM reaches only 77.4% accuracy on this benchmark, indicating that there is still significant room for improvement in more realistic scenarios. For table reasoning, we also propose a new criterion for categorizing problems based on meta operations. Compared to task-based categorization, this criterion allows for a deeper and more accurate analysis of problems in table datasets.
Our experiments demonstrate that the meta operations are general and effective.

6 Limitations

When validating the effectiveness of meta operations, we did not perform Supervised Fine-Tuning (SFT) on the models. Future work could examine the role and effect of each type of operation through SFT. Meanwhile, we used the same prompt for evaluating all models except Tablellama, without optimizing or adapting prompts for different models. For hyperparameters, we used the officially recommended defaults and did not tune them per model. Additionally, inspired by work in other fields (Song et al., 2024a,b), developing long-context table reasoning benchmarks and studying in-context learning for table reasoning are valuable directions for further exploration.

Acknowledgments

We thank the three anonymous reviewers for carefully reading our paper and for their insightful comments and suggestions.

References

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. CoRR, abs/2309.16609.

Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong, Hong Wang, and William Yang Wang. 2020. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1026–1036. Association for Computational Linguistics.

Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia, Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and Dongmei Zhang. 2022. Hitab: A hierarchical table dataset for question answering and natural language generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1094–1110. Association for Computational Linguistics.

Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong, Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Tao Yu. 2023. Binding language models in symbolic languages. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, et al. 2024. The llama 3 herd of models. CoRR, abs/2407.21783.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. 2024. Deepseek-coder: When the large language model meets programming - the rise of code intelligence. CoRR, abs/2401.14196.

Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma, Rui Ding, Lun Du, Yan Gao, Ran Jia, Xu Chen, Shi Han, Zejian Yuan, and Dongmei Zhang. 2024. Text2analysis: A benchmark of table question answering with advanced data analysis and unclear queries. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, pages 18206–18215. AAAI Press.

Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming Zhu, Yao Cheng, Jianbo Yuan, Kun Kuang, Yang Yang, Hongxia Yang, and Fei Wu. 2024. Infiagent-dabench: Evaluating agents on data analysis tasks. CoRR, abs/2401.05507.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. CoRR, abs/2310.06825.

Yannis Katsis, Saneem A. Chemmengath, Vishwajeet Kumar, Samarth Bharadwaj, Mustafa Canim, Michael R. Glass, Alfio Gliozzo, Feifei Pan, Jaydeep Sen, Karthik Sankaranarayanan, and Soumen Chakrabarti. 2022. AIT-QA: question answering dataset over complex tables in the airline industry. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, NAACL 2022, Hybrid: Seattle, Washington, USA + Online, July 10-15, 2022, pages 305–314. Association for Computational Linguistics.

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur P. Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics, 7:452–466.

Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih, Daniel Fried, Sida I. Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 18319–18345. PMLR.

Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhaoxiang Zhang. 2023a. Sheetcopilot: Bringing software productivity to the next level through large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold Cheng, and Yongbin Li. 2023b. Can LLM already serve as A database interface? A big bench for large-scale database grounded text-to-sqls. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.

Zheng Li, Xiang Chen, and Xiaojun Wan. 2024. Wikitableedit: A benchmark for table editing by natural language instruction. CoRR, abs/2403.02962.

Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi Lin, Weizhu Chen, and Jian-Guang Lou. 2022. TAPEX: table pre-training via learning a neural SQL executor. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net.

Weizheng Lu, Jiaming Zhang, Jing Zhang, and Yueguo Chen. 2024. Large language model for table processing: A survey. CoRR, abs/2402.05121.

Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xiaokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang, and Jie Tang. 2024. Spreadsheetbench: Towards challenging real world spreadsheet manipulation. CoRR, abs/2406.14991.

Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Cristian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, et al. 2024. Gemma: Open models based on gemini research and technology. CoRR, abs/2403.08295.

Kai Nakamura, Sharon Levy, Yi-Lin Tuan, Wenhu Chen, and William Yang Wang. 2022. Hybridialogue: An information-seeking dialogue dataset grounded on tabular and textual data. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 481–492. Association for Computational Linguistics.

Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victoria Lin, Neha Verma, Rui Zhang, Wojciech Kryscinski, Hailey Schoelkopf, Riley Kong, Xiangru Tang, Mutethia Mutuma, Ben Rosand, Isabel Trindade, Renusree Bandaru, Jacob Cunningham, Caiming Xiong, and Dragomir R. Radev. 2022. Fetaqa: Free-form table question answering. Trans. Assoc. Comput. Linguistics, 10:35–49.

OpenAI. 2023. Gpt-4 technical report. Preprint, arXiv:2303.08774.

Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. 2020. Totto: A controlled table-to-text generation dataset. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 1173–1186. Association for Computational Linguistics.

Panupong Pasupat and Percy Liang. 2015. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1470–1480. The Association for Computer Linguistics.

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, Ioannis Antonoglou, Rohan Anil, Sebastian Borgeaud, Andrew M. Dai, Katie Millican, Ethan Dyer, Mia Glaese, Thibault Sottiaux, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, James Molloy, Jilin Chen, Michael Isard, Paul Barham, Tom Hennigan, Ross McIlroy, Melvin Johnson, Johan Schalkwyk, Eli Collins, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, Clemens Meyer, Gregory Thornton, Zhen Yang, Henryk Michalewski, Zaheer Abbas, Nathan Schucher, Ankesh Anand, Richard Ives, James Keeling, Karel Lenc, Salem Haykal, Siamak Shakeri, Pranav Shyam, Aakanksha Chowdhery, Roman Ring, Stephen Spencer, Eren Sezener, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. CoRR, abs/2403.05530.

Mingyang Song, Mao Zheng, and Xuan Luo. 2024a. Can many-shot in-context learning help llms as evaluators? a preliminary empirical study. Preprint, arXiv:2406.11629.

Mingyang Song, Mao Zheng, and Xuan Luo. 2024b. Counting-stars: A multi-evidence, position-aware, and scalable benchmark for evaluating long-context large language models. Preprint, arXiv:2403.11802.

Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and Dongmei Zhang. 2024. Table meets LLM: can large language models understand structured table data? A benchmark and empirical study. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM 2024, Merida, Mexico, March 4-8, 2024, pages 645–654. ACM.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. 2024. Qwen2 technical report. CoRR, abs/2407.10671.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir R. Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 3911–3921. Association for Computational Linguistics.

Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun. 2023. Tablellama: Towards open large generalist models for tables. CoRR, abs/2311.09206.

Xuanliang Zhang, Dingzirui Wang, Longxu Dou, Qingfu Zhu, and Wanxiang Che. 2024. A survey of table reasoning with large language models. CoRR, abs/2402.08259.

Victor Zhong, Caiming Xiong, and Richard Socher. 2017. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103.

Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 3277–3287. Association for Computational Linguistics.

A Appendix

A.1 Data Statistics by Language

Table 7 shows the data statistics of the proposed benchmark for each language, including the number of tables, the number of questions, and the average question difficulty.

          Table Number   Question Number   Question Difficulty
Overall   428            1719              2.2
English   182            671               2.2
Chinese   246            1048              2.2

Table 7: Data Statistics of Different Languages

A.2 Used Prompts

Table 8, Table 9, and Table 10 show the prompts designed for meta operation classification, model inference, and performance evaluation in this paper.

You are a spreadsheet question classification expert. Given a user’s question about an Excel spreadsheet, classify the question according to the requirements and output it in the specified format.
<Requirements>
The following operation classification already exists, presented in the format of operation name: operation description. If the user’s question can be classified as some of the operations, output the operation names. One question can be classified into multiple operations.
Lookup: Locate the position of the specific target.
Edit: Modify, delete, or add to a table.
Calculate: The numerical computation, sum, avg, max, etc.
Compare: Compare two or more targets in a table.
Visualize: Show in chart.
Reasoning: Inferring information from the table content that is not explicitly included.
</Requirements>
<Output Format>
operation name 1, operation name 2, ...
</Output Format>
<Question>
what country hosted the most tournaments?
</Question>
Lookup, Calculate, Compare
<Question>
QUESTION TO BE CLASSIFIED
</Question>

Table 8: Prompt for Meta Operation Classification

Below is the table content in markdown, please answer the question according to the table content.
<Table>
<Table Name> SPREADSHEET FILE NAME </Table Name>
<Table Content>
<Sheet>
<Sheet Name> SHEET NAME </Sheet Name>
<Sheet Content>
SHEET CONTENT IN MARKDOWN
</Sheet Content>
</Sheet>
</Table Content>
</Table>
<Question>
THE QUESTION TO BE ASKED
</Question>

Table 9: Prompt for Model Inference

For the following questions, given the correct answer, determine whether the candidate’s answer is correct. If it is correct, output "Correct"; if it is incorrect, output "Incorrect"; if it is uncertain whether it is correct, output "Uncertain". As long as the candidate’s answer contains the key information that can correctly answer the question, it is considered correct. If the question is open-ended, give a score between 0-1 according to the correct answer. Do not output any other content.
<Question>
THE QUESTION
</Question>
<Correct Answer>
THE CORRECT ANSWER
</Correct Answer>
<Candidate answer>
THE CANDIDATE ANSWER
</Candidate answer>

Table 10: Prompt for Performance Evaluation
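
As a rough illustration of how the Table 9 template might be filled in, the sketch below assembles the inference prompt from sheet contents that have already been loaded into memory. The helper names (rows_to_markdown, build_inference_prompt), the exact pipe layout of the markdown rows, and the choice to emit one <Sheet> block per sheet are assumptions made for this example; they are not taken from the released MiMoTable code.

```python
from typing import Dict, List, Optional

Cell = Optional[object]  # None stands for an empty spreadsheet cell


def rows_to_markdown(rows: List[List[Cell]]) -> str:
    """Serialize a grid of cells as pipe-delimited markdown (hypothetical layout).

    The first row is treated as the header; empty cells become empty strings.
    """
    def fmt(row: List[Cell]) -> str:
        return "|" + "|".join("" if c is None else str(c) for c in row) + "|"

    header, *body = rows
    separator = "|" + "|".join(":--" for _ in header) + "|"
    return "\n".join([fmt(header), separator] + [fmt(r) for r in body])


def build_inference_prompt(
    file_name: str,
    sheets: Dict[str, List[List[Cell]]],
    question: str,
) -> str:
    """Fill the Table 9 template; one <Sheet> block per sheet is an assumption."""
    parts = [
        "Below is the table content in markdown, please answer the question "
        "according to the table content.",
        "<Table>",
        f"<Table Name> {file_name} </Table Name>",
        "<Table Content>",
    ]
    for sheet_name, rows in sheets.items():
        parts += [
            "<Sheet>",
            f"<Sheet Name> {sheet_name} </Sheet Name>",
            "<Sheet Content>",
            rows_to_markdown(rows),
            "</Sheet Content>",
            "</Sheet>",
        ]
    parts += [
        "</Table Content>",
        "</Table>",
        f"<Question> {question} </Question>",
    ]
    return "\n".join(parts)


if __name__ == "__main__":
    demo = build_inference_prompt(
        "sales.xlsx",  # hypothetical file name
        {"Q1": [["Category", "Pieces"], ["Pendant", 12], ["Necklace", None]]},
        "How many pendant pieces were sold?",
    )
    print(demo)
```

For a workbook with several sheets, `sheets` would hold one entry per sheet, so every similar table ends up in the same prompt. That is the situation in which the text-based approach depends on the context window being large enough.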

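The classifier prompt in Table 8 returns a comma-separated list of operation names. A minimal sketch of post-processing that reply is given below, assuming the six operation names listed in Table 8 and the evaluation setup described in the experiments, where questions involving Edit or Visualize are left out. The function names and the decision to reject unknown labels with an error are illustrative choices, not the authors' implementation.

```python
from typing import List

# The six meta operations named in the Table 8 prompt.
META_OPERATIONS = ("Lookup", "Edit", "Calculate", "Compare", "Visualize", "Reasoning")

# Operations whose questions are skipped during inference in the paper's
# comparison, because most evaluated LLMs cannot produce edited files or charts.
EXCLUDED_FROM_INFERENCE = {"Edit", "Visualize"}


def parse_operations(reply: str) -> List[str]:
    """Turn a reply such as 'Lookup, Calculate, Compare' into a clean list.

    Matching is case-insensitive, duplicates are dropped, and order is kept.
    Unknown labels raise ValueError (an assumption; one could also ignore them).
    """
    canonical = {op.lower(): op for op in META_OPERATIONS}
    result: List[str] = []
    for token in reply.split(","):
        key = token.strip().lower()
        if not key:
            continue  # tolerate trailing commas or blank pieces
        if key not in canonical:
            raise ValueError(f"unknown meta operation: {token.strip()!r}")
        op = canonical[key]
        if op not in result:
            result.append(op)
    return result


def is_evaluated(operations: List[str]) -> bool:
    """Return True if the question involves neither Edit nor Visualize."""
    return not (set(operations) & EXCLUDED_FROM_INFERENCE)


if __name__ == "__main__":
    ops = parse_operations("Lookup, Calculate, Compare")
    print(ops, is_evaluated(ops))
```

Keeping the operation list in one tuple means the parser and the filter cannot drift apart if the set of meta operations is ever extended.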
---