MiMoTable: A Multi-scale Spreadsheet Benchmark with
Meta Operations for Table Reasoning
Zheng Li*, Yang Du*, Mao Zheng*, Mingyang Song*
Tencent Hunyuan
jasonzli@tencent.com
Abstract
Extensive research has been conducted to ex-
plore the capability of Large Language Mod-
els (LLMs) for table reasoning and has signif-
icantly improved the performance on existing
benchmarks. However, tables and user questions in real-world applications are more complex and diverse, presenting a non-negligible gap compared to the existing benchmarks. To fill
the gap, we propose a Multi-scale spreadsheet
benchmark with Meta operations for Table reasoning, named MiMoTable. Specifically, MiMoTable incorporates two key features. First,
the tables in MiMoTable are all spreadsheets
used in real-world scenarios, which cover seven
domains and contain different types. Second,
we define a new criterion with six categories of
meta operations for measuring the difficulty
of each question in MiMoTable, which also serves as a new perspective for measuring the difficulty of the existing benchmarks. Experimental results show that Claude-3.5-Sonnet
achieves the best performance with 77.4% ac-
curacy, indicating that there is still significant
room to improve for LLMs on MiMoTable.
Furthermore, we grade the difficulty of ex-
isting benchmarks according to our new cri-
teria. Experiments have shown that the per-
formance of LLMs decreases as the difficulty
of benchmarks increases, thereby proving the
effectiveness of our proposed new criterion.
All data and code are open-sourced at https://github.com/jasonNLP/MiMoTable.
1 Introduction
Tabular data plays a crucial role across diverse do-
mains, including education, finance, and others.
Table reasoning involves deriving meaningful in-
sights and answers from structured tabular data to
address specific user queries (Zhang et al., 2024).
This process significantly improves the efficiency
of information retrieval and interpretation for users.
*Equal contribution.
Figure 1: Examples of MiMoTable benchmark. The example spreadsheet has table type "single file, multiple sheets, complex header" and table difficulty "hard". Three example questions are shown:
Question: What is the product code of product 1? Meta Operations: Lookup. Difficulty: 1. Answer: The product code of product 1 is CP-001.
Question: Highlight the "Product Name" column in red. Meta Operations: Lookup, Edit. Difficulty: 2.
Question: Draw a line chart to visualize the inbound quantity of different product names. Meta Operations: Lookup, Visualize. Difficulty: 2.16.
To foster a comprehensive understanding of this
field, researchers have proposed and developed nu-
merous table reasoning tasks, such as TableQA,
Table2Text, Table Manipulation, and Advanced
Data Analysis (Lu et al., 2024). Various methods
have been proposed to tackle these tasks, and large
language models (LLMs) have achieved promis-
ing results (Liu et al., 2022; Cheng et al., 2023).
To evaluate performance, several table reasoning
benchmarks have been introduced, including Wik-
iTableQuestions (Pasupat and Liang, 2015), ToTTo
(Parikh et al., 2020), SheetCopilot (Li et al., 2023a),
Text2Analysis (He et al., 2024) and so on.
In the realm of table reasoning, benchmark de-
velopment has not kept pace with the rapid advance-
ments in methodological approaches. While LLMs
have exhibited remarkable performance on existing
benchmarks, recent studies have brought to light
persistent limitations in their capacity for nuanced
table comprehension (Sui et al., 2024). Upon crit-
ical analysis, we have identified shortcomings in
existing benchmarks across two key aspects.
arXiv:2412.11711v2 [cs.CL] 24 Dec 2024
First, current benchmarks exhibit significant lim-
itations in their representation of real-world tabular
data complexity. Most tables in these benchmarks
have simple headers (single-row/column) and fail
to cover all four task types comprehensively. However, real-world tables are diverse and vary along three dimensions: 1) Headers ranging from
single-row/column to complex hierarchical forms.
2) Variable number of sheets in Excel files. 3) Mul-
tiple tables within a single sheet. These complexi-
ties are often overlooked in existing benchmarks,
limiting their ability to accurately assess table rea-
soning capabilities in practical applications.
Second, although existing benchmarks
are divided according to task granularity, the dif-
ficulty of different benchmark datasets within the
same task can vary. For example, the WikiSQL
(Zhong et al., 2017) dataset is simpler than the un-
restricted WikiTableQuestions dataset because it
limits questions to those that can be answered using
a subset of SQL queries. The current task divisions
cannot reflect this difference in difficulty.
To address the above issues, we propose Mi-
MoTable, a table reasoning benchmark with diverse
spreadsheets and meta operations. Our dataset com-
prises 428 spreadsheets from real-world scenarios,
spanning seven domains: architecture, finance, of-
fice, education, accounting, e-commerce, and man-
ufacturing. Our table data is comprehensive, featur-
ing both simple and complex headers, and varying
in the number of sheets from single to multiple.
Some spreadsheets even contain multiple tables
within a single sheet. We have constructed 1,719
question-answer pairs based on these spreadsheets,
forming (spreadsheet, question, answer) triplets.
Examples of MiMoTable are shown in Figure 1.
Simultaneously, to better capture the differences among the questions of different datasets, we propose a new
criterion for categorizing problems based on meta
operations. There are six types of meta operations:
Lookup, Edit, Compare, Calculate, Visualize, and
Reasoning. Each type of meta operation corre-
sponds to a difficulty score. With the new crite-
rion, we can associate each problem with one or
more meta operations, thereby assigning a diffi-
culty score to each problem. In this way, different
benchmarks can be graded into different difficulty
scores, facilitating better analysis and comparison.
 | Simple Header | Complex Header
Single Sheet | simple table | medium table
Multiple Sheets | medium table | hard table
Multiple Files | medium table | hard table
Multiple Tables | hard table | hard table
Table 1: Categories of table difficulty.
Our main contributions are as follows:
• We propose a new benchmark comprising 428 multi-scale spreadsheets in both Chinese and English, featuring simple and complex headers, single and multiple sheets, single and multiple files, and multiple tables within a single sheet. Based on these characteristics, we classify the tables into three difficulty levels: simple, medium, and hard. We construct 1,719 (spreadsheet, question, answer) triplets covering a wide range of tasks.
• We introduce a novel criterion for categorizing table reasoning problems using meta oper-
ations, each assigned a difficulty score. These
non-overlapping meta operations can be com-
bined to represent existing tasks, allowing for
a more precise evaluation of a model’s capa-
bilities in table-related tasks.
• We conducted extensive experiments, demonstrating that the proposed benchmark is challenging for existing LLMs and proving the effectiveness of the proposed meta operations.
2 MiMoTable Benchmark
In this section, we introduce how to prepare our
new MiMoTable benchmark and ensure its quality.
2.1 Types and Difficulty of Tables
Most existing table reasoning benchmarks utilize
single tables with simple headers, contrasting with
the diverse tables encountered in real-world sce-
narios, particularly in spreadsheets like Excel files.
After analyzing real-world spreadsheets, we cate-
gorize them along four dimensions: header types,
the number of sheets per file, the number of tables
per sheet, and the file count.
As shown in Figure 2, header types can be di-
vided into simple headers and complex headers.
A simple header refers to a single-row or single-
column header, while all others are considered com-
plex headers. For example, hierarchical headers are
classified as complex headers. Understanding complex headers is more challenging than understanding simple ones, and multiple sheets generally contain more information than a single sheet. Therefore, based on the aforementioned dimensions, we classify spreadsheets into three difficulty levels: simple, medium, and hard. The specific classification rules are listed in Table 1.
Figure 2: Illustrations of different table types, including simple header, complex header, single sheet, multiple sheets, multiple files, and multiple tables in one sheet.
Meta Operation | Description | Grade | Example
Lookup | Locate the position of specific target | 1 | What is the product code of product 1?
Edit | Modify, delete or add in a table | 1 | Highlight the "Product Name" column in red
Calculate | The numerical computation, sum, avg, max, etc | 2 | How many students are in the table?
Compare | Compare two or more targets in a table | 2 | Who has the highest score?
Visualize | Show in chart | 2 | Draw a chart to show the distribution of scores.
Reasoning | Inferring information from the table content that is not explicitly included | 3 | Analyze the relationship between the loan term, monthly interest, and interest.
Table 2: The description and grade of meta operations.
Figure 3: The relationships between tasks in the existing benchmarks (TableQA, Table2Text, Table Manipulation, Advanced Data Analysis) and our proposed meta operations (Grade 1: Lookup, Edit; Grade 2: Compare, Calculate, Visualize; Grade 3: Reasoning).
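The classification rules of Table 1 can be sketched in a few lines of Python. This is only an illustration: the `SpreadsheetInput` class and its field names are hypothetical, and the precedence used when several properties hold at once (for example, multiple files that also contain multiple sheets) is an assumption, since Table 1 lists each property separately.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpreadsheetInput:
    """Structural facts about one benchmark input (hypothetical field names)."""
    complex_header: bool            # anything other than a single-row/column header
    multiple_sheets: bool = False
    multiple_files: bool = False
    multiple_tables: bool = False   # several tables inside one sheet


def table_difficulty(item: SpreadsheetInput) -> str:
    """Return 'simple', 'medium', or 'hard' following the rows of Table 1."""
    # Row "Multiple Tables": hard for both header types.
    if item.multiple_tables:
        return "hard"
    # Rows "Multiple Sheets" and "Multiple Files" give the same outcome.
    # Assumption: when both hold, the result is still decided by the header type.
    if item.multiple_sheets or item.multiple_files:
        return "hard" if item.complex_header else "medium"
    # Row "Single Sheet".
    return "medium" if item.complex_header else "simple"
```

For instance, a single-sheet file with a hierarchical header would be rated medium, while the same header spread over several sheets would be rated hard.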
2.2 Meta Operations
Current table reasoning benchmarks are classified
by tasks, mainly including TableQA, Table2Text,
Table Manipulation, and Advanced Data Analysis.
This task-based categorization evaluates model per-
formance across different tasks but fails to measure
differences between benchmarks within the same
task or compare benchmarks from different tasks
along the same dimension.
To enhance the analysis of table reasoning bench-
marks, we propose a novel criterion categorizing
questions by meta operations: Lookup, Edit, Cal-
culate, Compare, Visualize, and Reasoning. Table
2 defines each operation. These operations reflect
specific LLM capabilities in handling table-related
problems, with questions potentially involving mul-
tiple operations. For instance, "How many students
are in the table" requires both Lookup (locating
student names) and Calculate (counting them) operations.
The combination of six meta operations can encompass tasks in current table benchmarks. Figure
3 shows the mapping relationship between existing
tasks and meta operations. For instance, TableQA
questions may involve combinations of Lookup,
Compare, and Calculate operations.
Additionally, to assess problem complexity, we
categorize the six meta operations into three diffi-
culty grades (1, 2, 3) based on common criteria, as
shown in Table 2. Lookup and Edit, involving sim-
ple content location or modification, are grade 1.
Compare, Calculate, and Visualize, which require
logical operations, are grade 2. Reasoning, neces-
sitating inference beyond explicit table content, is
the most complex at grade 3.
With the difficulty score of meta operations, we
can calculate the difficulty score of each question
and the entire dataset. First, let’s assume there are
N questions in the dataset. The i-th problem qican
be associated with Kimeta operations. Suppose
thek-th meta operation is denoted as opk, then the
sequence of meta operations for the i-th questions
can be represented as:
OPqi= [op1, op2, ..., op Ki] (1)
The difficulty sequence corresponding to the meta
operations can be represented as:
Sqi= [s1, s2, ..., s Ki], sk∈1,2,3 (2)
where sKiindicates the difficulty score correspond-
ing to meta operations opk.
Figure 4: The data construction pipeline of MiMoTable benchmark. The stages shown are: (1) extract the table content of a spreadsheet (e.g., Exam.xlsx) as markdown wrapped in <file_name>, <sheet_name>, and <sheet_content> tags; (2) build the question generation prompt from the table content and the definitions of the six meta operations (the prompt begins: "According to below table content, give me five questions using one or more meta operations. The questions should be diverse, different questions should contain different combinations of meta operations. ......"); (3) generate questions together with their related meta operations; (4) check the questions with GPT-4o and double check with humans; (5) run GPT-4o with the code interpreter plugin multiple times to obtain multiple answers; (6) check the answers by code debugging, voting between the multiple answers, and a human double check.
To ensure that questions involving more difficult meta operations are assigned a higher difficulty
score, we define the difficulty score of a question $q_i$ as follows:
$$qs_i = ms_{q_i} + \frac{\sum_{k=1}^{K_i} s_k - ms_{q_i}}{M_{ms_{q_i}}} \qquad (3)$$
$$ms_{q_i} = \max(S_{q_i}) \qquad (4)$$
The term $M_{ms_{q_i}}$ refers to the maximum value that $\sum_{k=1}^{K_i} s_k - ms_{q_i}$ can reach under the condition that the highest difficulty level among the meta operations of the question is $ms_{q_i}$. Every meta operation can appear only once in each question. So when $ms_{q_i} = 3$, which means the corresponding meta operation is Reasoning, the most complex combination of the rest of the meta operations is Compare, Calculate, Visualize, Lookup, and Edit. The sum of the difficulty scores of those meta operations is 2 + 2 + 2 + 1 + 1 = 8, so $M_3 = 8$. In the same manner, we can get all the values of $M_{ms_{q_i}}$ as follows:
$$M_{ms_{q_i}} = \begin{cases} 1, & ms_{q_i} = 1 \\ 6, & ms_{q_i} = 2 \\ 8, & ms_{q_i} = 3 \end{cases} \qquad (5)$$
So the range of $qs_i$ is [1, 4]. With the difficulty score $qs_i$ of a single question $q_i$, we define the difficulty score of the entire dataset, $ds$, as the average of the difficulty scores of all questions:
$$ds = \frac{\sum_{i=1}^{N} qs_i}{N} \qquad (6)$$
2.3 Dataset Construction
We introduce how the dataset is constructed from
three aspects: table collection, question generation,
and answer generation. Figure 4 illustrates the
whole construction process.
Table Collection. Since Excel files are the most popular spreadsheet format in real-world scenarios, we collect files in the .xlsx format. The
spreadsheets of our dataset are collected from pub-
licly available sources on the internet. The Chinese
tables primarily come from Baidu Wenku, while
the English tables are mainly sourced from Google
searches. These spreadsheets cover seven common
domains: architecture, finance, office, education,
accounting, e-commerce, and manufacturing. To
ensure that the types of spreadsheets encompass
as many real-world scenarios as possible, accord-
ing to the classifications of table type mentioned
before, we collected spreadsheets with both sim-
ple and complex headers, as well as those with
single and multiple sheets. Even within a single
sheet, our data may contain multiple tables. Ad-
ditionally, we randomly sampled some individual
spreadsheet files and combined them into groups
of 2-5 files, and the subsequent questions and an-
swers are generated with those multiple files as
input. To maintain the quality of the collected
spreadsheets, we manually checked the content,
removing files with significant noise, garbled text,
or non-tabular formats. Furthermore, we reviewed
each spreadsheet to anonymize any potential pri-
vate information. Specifically: (1) personal names,
contact information, addresses, etc., are masked and randomly regenerated by GPT, while headers are retained due to their importance and generality; (2) we double-checked the final spreadsheets with legal professionals.
Benchmarks | Header Type | Sheet Num | File Num | Table Num | Tasks covered (among TableQA, Table2Text, Table Manipulation, Advanced Data Analysis)
WikiTableQuestion | simple | single | single | single | ✓
WikiSQL | simple | single | single | single | ✓
FeTaQA | simple | single | single | single | ✓
HiTab | complex | single | single | single | ✓ ✓
ToTTo | simple | single | single | single | ✓
DAEval | simple | single | single | single | ✓
WikiTableEdit | simple | single | single | single | ✓
Text2Analysis | simple | single | single | single | ✓ ✓ ✓
MiMoTable (ours) | simple & complex | single & multiple | single & multiple | single & multiple | ✓ ✓ ✓ ✓
Table 3: Comparison in table types and tasks between existing benchmarks and MiMoTable
Ultimately, we obtained 428 high-quality spread-
sheets containing both Chinese and English lan-
guages and various types. As shown in Table 3,
compared to the current benchmarks, our collected
spreadsheets are far more diverse in type, better
reflecting various real-world scenarios.
Question Generation. As shown in Figure 4,
we use GPT-4o to generate relevant questions and
double-check with models and humans. First, we
extract the table content from a spreadsheet in
markdown format. Then, according to the extracted
table content and our meta operations, GPT-4o is
prompted to generate related questions. We in-
structed the model to generate multiple questions at
once for each spreadsheet. To ensure the diversity
of questions, the multiple questions should con-
tain different combinations of meta operations. For
multi-sheet or multi-file spreadsheets, we prompt
the model to generate questions requiring cross-
sheet or cross-file analysis. To ensure prompt ef-
fectiveness, we initially generate 50 samples, con-
duct a human evaluation to identify issues, and
iteratively refine the prompt until most generated
questions meet our criteria.
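The table-content extraction step can be illustrated with a small Python sketch that wraps one sheet in the tags shown in Figure 4. The function names are hypothetical and the exact markdown layout used by the authors is not specified in the text, so the pipe-delimited format below is an assumption.

```python
from typing import Optional, Sequence

Cell = Optional[object]


def rows_to_markdown(rows: Sequence[Sequence[Cell]]) -> str:
    """Render rows as a pipe-delimited markdown table; the first row is the header."""
    if not rows:
        return ""

    def fmt(cell: Cell) -> str:
        # Empty cells (e.g., the remainder of a merged header) become empty strings.
        if cell is None:
            return ""
        return str(cell).replace("|", "\\|").replace("\n", " ")

    width = max(len(row) for row in rows)
    padded = [list(row) + [None] * (width - len(row)) for row in rows]
    lines = ["| " + " | ".join(fmt(c) for c in padded[0]) + " |"]
    lines.append("| " + " | ".join("---" for _ in range(width)) + " |")
    lines += ["| " + " | ".join(fmt(c) for c in row) + " |" for row in padded[1:]]
    return "\n".join(lines)


def serialize_sheet(file_name: str, sheet_name: str,
                    rows: Sequence[Sequence[Cell]]) -> str:
    """Wrap one sheet in the <file_name>/<sheet_name>/<sheet_content> tags."""
    return (
        f"<file_name>{file_name}</file_name>\n"
        f"<sheet_name>{sheet_name}</sheet_name>\n"
        f"<sheet_content>\n{rows_to_markdown(rows)}\n</sheet_content>"
    )
```

The `rows` argument is assumed to hold the cell values of one sheet, which could be read from an .xlsx file with a library such as openpyxl (for example via `iter_rows(values_only=True)`); file reading is left out to keep the sketch self-contained.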
After generating initial questions with GPT-4o,
we prompt it to verify if they meet requirements:
relevance to table content and correct meta oper-
ations. We then manually review and filter out
unsuitable questions, yielding 1,719 high-quality,
comprehensive questions.
We classify the questions by existing tasks, cov-
ering TableQA, Table2Text, Table Manipulation,
and Advanced Data Analysis. As Table 3 shows,
our dataset exceeds all current table benchmarks in
task comprehensiveness.
Figure 5: Domain distribution of all spreadsheets (Architecture 23%, Finance 10%, Office 15%, Education 16%, Accounting 9%, E-commerce 11%, Manufacturing 16%).
Answer Generation. After we collected the tables and generated the related questions based on meta
operations, the final step is to obtain the correspond-
ing answers. Since some questions are related to
editing the original spreadsheet or drawing charts,
we leverage GPT-4o with the code interpreter plu-
gin to get initial answers. The spreadsheet files
can be directly used as inputs, and the model can
generate Python code to run in a code interpreter
to generate the modified files or visual charts.
As shown in Figure 4, we ensure GPT-4o answer
quality by first debugging its code. If the code
cannot be executed without errors, the answer is
considered incorrect. Second, we perform multiple
inferences on each question-table pair, selecting
the most frequent answer as a candidate. If all
answers are different, the sample is viewed as in-
valid due to the inconsistency. Last, we have table
analysis experts manually annotate the candidate
answers, retaining the correct ones and correcting
the wrong ones. We invited 10 experts with data
analysis experience to annotate the dataset. Among
them, native Chinese and English speakers each
accounted for half of the group. Each answer is
annotated twice, and Cohen’s Kappa is 0.83, which
indicates a high inter-annotator agreement. The
answers of our dataset not only contain text but
also contain Excel files and charts.
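The voting step described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation: answers are compared after whitespace and case normalization, which is an assumption (free-form answers may need a looser notion of equivalence), and ties are resolved in favor of the answer seen first, which is also an assumption.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_candidate(answers: Sequence[str]) -> Optional[str]:
    """Pick the most frequent answer among several inference runs.

    Returns None when every run produced a different answer, in which case
    the sample is treated as invalid.
    """
    if not answers:
        return None
    # Assumption: answers are considered equal after trimming/collapsing
    # whitespace and lowercasing.
    normalized = [" ".join(a.split()).lower() for a in answers]
    top, freq = Counter(normalized).most_common(1)[0]
    if freq == 1 and len(normalized) > 1:
        return None  # all answers differ: inconsistent sample
    # Return the first original answer that matches the winning form.
    return answers[normalized.index(top)]
```

With the three answers shown in Figure 4, two runs agree on the student id 5123001, so that answer becomes the candidate passed on to the human annotators.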
2.4 Dataset Statistics
Our MiMoTable benchmark consists of 1,719
(spreadsheet, question, answer) triplets originating
from 428 different spreadsheets. In this subsection,
we provide statistics from different dimensions to
provide a more comprehensive understanding of
our dataset.
Domains of Spreadsheets. As illustrated in Figure
5, our spreadsheets encompass seven domains in
real-world applications.
Type and Difficulty of Tables. From Table 4, we can see that the table types of the collected spreadsheets are diverse, covering simple, medium, and hard difficulty levels.
Difficulty | Ratio | Table Type | Num
Simple | 33.6% | single file + single sheet + simple header | 144
Medium | 32.5% | single file + multiple sheets + simple header | 30
 | | multiple files + simple header | 37
 | | single file + single sheet + complex header | 72
Hard | 33.9% | single file + multiple sheets + complex header | 63
 | | multiple files + complex header | 32
 | | multiple tables | 50
Table 4: Distribution of table difficulty.
Meta Operations of Questions. Figure 6 shows
the number of six meta operations in our bench-
mark questions. As the most basic operation of
a table, Lookup is the most frequently occurring
meta operation. More difficult meta operations
such as Calculate and Reasoning also account for a
relatively large proportion of our dataset, indicating
that the questions of MiMoTable are diverse and
comprehensive.
Figure 6: Distribution of meta operations (number of questions involving Lookup, Edit, Calculate, Compare, Visualize, and Reasoning).
Difficulty of Questions. To investigate the diffi-
culty of questions, we calculate the difficulty score
of each question according to Equation 3. The
score is in the range of [1, 4], so we divided the
distribution of scores into three intervals, [1, 2),
[2, 3) and [3, 4]. The specific values of question
number and ratio are in Table 5.
Question Difficulty | Num | Ratio
[1, 2) | 311 | 18.1%
[2, 3) | 1150 | 66.9%
[3, 4] | 258 | 15.0%
Table 5: Distribution of question difficulty.
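The scoring of Equations 3 to 6 and the three intervals of Table 5 can be reproduced with a short Python sketch. The grades come from Table 2 and the normalizers from Equation 5; the function names are hypothetical, and treating a repeated meta operation as a single occurrence follows the statement that each operation appears at most once per question.

```python
from typing import Iterable, Sequence

# Grades of the six meta operations (Table 2).
GRADE = {"Lookup": 1, "Edit": 1, "Calculate": 2,
         "Compare": 2, "Visualize": 2, "Reasoning": 3}
# Normalizer M indexed by the highest grade in the question (Equation 5).
M = {1: 1, 2: 6, 3: 8}


def question_difficulty(ops: Iterable[str]) -> float:
    """Difficulty score qs_i of one question (Equations 3 and 4)."""
    unique_ops = set(ops)  # each meta operation counts once
    if not unique_ops or unique_ops.difference(GRADE):
        raise ValueError(f"unknown or empty meta operations: {sorted(unique_ops)}")
    scores = [GRADE[op] for op in unique_ops]
    ms = max(scores)                              # Equation 4
    return ms + (sum(scores) - ms) / M[ms]        # Equation 3


def difficulty_interval(score: float) -> str:
    """Map a score in [1, 4] to the intervals used in Table 5."""
    if score < 2:
        return "[1, 2)"
    if score < 3:
        return "[2, 3)"
    return "[3, 4]"


def dataset_difficulty(questions: Sequence[Iterable[str]]) -> float:
    """Difficulty score ds of a dataset: mean question score (Equation 6)."""
    scores = [question_difficulty(q) for q in questions]
    return sum(scores) / len(scores)
```

Applied to the examples of Figure 1, Lookup alone gives 1, Lookup with Edit gives 1 + 1/1 = 2, and Lookup with Visualize gives 2 + 1/6, which is the value reported there as 2.16. A question using all six meta operations reaches the upper bound of 3 + 8/8 = 4.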
3 Experiments and Results
Our experiments have two main goals: (1) to eval-
uate representative LLMs’ performance on our
dataset; and (2) to prove the effectiveness of pro-
posed meta operations. This section presents the
relevant experiments and findings.
3.1 Experimental Setup
Models. We conducted experiments on 16 selected
LLMs, comprising open-source LLMs, closed-
source LLMs, and tabular LLMs. The open-source
LLMs we evaluated include Llama3.1¹, Llama3
(Dubey et al., 2024), Qwen2 (Yang et al., 2024),
Qwen1.5 (Bai et al., 2023), Mistral (Jiang et al.,
2023), DeepseekCoder (Guo et al., 2024), and
Gemma (Mesnard et al., 2024). The closed-source
LLMs are GPT-4o (OpenAI, 2023), Claude-3.5-Sonnet², and Gemini-1.5-Pro (Reid et al., 2024).
We also evaluated Tablellama (Zhang et al., 2023),
a tabular model fine-tuned specifically for various
table tasks. However, most tabular models, such
as Binder (Cheng et al., 2023), require inputs to be
tables with known headers, which is not suitable
for our benchmark.
Datasets. To demonstrate the generality and ef-
fectiveness of our meta operation, we conducted
experiments on the newly proposed benchmark as
well as two existing open-source benchmarks: Wik-
iTableQuestion and WikiSQL. These widely-used
TableQA benchmarks feature tables sourced from
Wikipedia with simple headers.
Metrics. We used accuracy as the evaluation met-
ric. Except for Tablellama, the predicted answers of
other models in our experiments are all free-form.
We prompted GPT-4o to judge the correctness of
the predicted answer based on the question and
human-verified reference answer. Because a small
portion of questions in MiMoTable are open-ended,
we also instructed GPT-4o to give a score between 0 and 1 when it judges the question to be open-ended.
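One plausible way to aggregate these judgements into an accuracy figure is sketched below. The text does not state how the partial scores of open-ended questions enter the final number, so averaging them together with the 0/1 judgements is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def judged_accuracy(judgements: Sequence[float]) -> float:
    """Average judge scores over all evaluated questions.

    Each entry is the judge's verdict for one question: 0 or 1 for questions
    with a definite answer, or a value between 0 and 1 for open-ended ones.
    Assumption: partial scores are averaged together with the 0/1 verdicts.
    """
    if not judgements:
        raise ValueError("no judgements to aggregate")
    for score in judgements:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"judge score out of range: {score}")
    return sum(judgements) / len(judgements)
```

For example, two correct answers, one wrong answer, and one open-ended answer scored 0.5 would give an accuracy of 62.5% under this assumption.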
¹ https://ai.meta.com/blog/meta-llama-3-1/
² https://www.anthropic.com/news/claude-3-5-sonnet
Column groups: Language (English, Chinese); Table Difficulty (Simple, Medium, Hard); Question Difficulty ([1, 2), [2, 3), [3, 4]); Meta Operations (Lookup, Compare, Calculate, Reasoning).
Model | Overall | English | Chinese | Simple | Medium | Hard | [1, 2) | [2, 3) | [3, 4] | Lookup | Compare | Calculate | Reasoning
Claude-3.5-Sonnet | 77.4% | 79.0% | 76.2% | 81.3% | 75.5% | 72.1% | 89.0% | 77.1% | 63.3% | 89.0% | 79.7% | 76.1% | 63.3%
GPT-4o-CI | 69.2% | 70.8% | 68.1% | 81.7% | 67.1% | 50.8% | 81.0% | 71.1% | 45.8% | 81.0% | 73.1% | 70.6% | 45.8%
GPT-4o-TXT | 69.0% | 69.3% | 68.8% | 73.8% | 66.2% | 62.1% | 85.1% | 67.6% | 53.9% | 85.1% | 73.1% | 64.5% | 53.9%
Gemini-1.5-Pro | 60.2% | 61.6% | 59.1% | 64.9% | 57.4% | 55.3% | 86.1% | 55.0% | 47.6% | 86.1% | 60.3% | 50.2% | 47.6%
Llama-3.1-70B-Instruct | 57.0% | 56.6% | 57.3% | 64.0% | 51.6% | 51.3% | 82.0% | 52.1% | 45.1% | 82.0% | 57.7% | 48.8% | 45.1%
Qwen2-72B-Instruct | 55.7% | 51.5% | 58.8% | 61.4% | 52.6% | 49.2% | 80.4% | 50.1% | 46.9% | 80.4% | 56.3% | 45.5% | 46.9%
Llama-3-70B-Instruct | 53.7% | 52.3% | 54.8% | 60.1% | 48.6% | 48.5% | 78.9% | 47.8% | 46.1% | 78.9% | 51.5% | 44.4% | 46.1%
Qwen1.5-72B-Chat | 47.5% | 46.1% | 48.5% | 51.6% | 45.1% | 43.2% | 75.2% | 41.2% | 37.5% | 75.2% | 42.7% | 40.1% | 37.5%
Llama-3.1-8B-Instruct | 44.1% | 44.0% | 44.2% | 49.0% | 43.3% | 36.4% | 70.0% | 38.3% | 34.9% | 70.0% | 41.0% | 35.6% | 34.9%
Qwen2-7B-Instruct | 41.6% | 40.5% | 42.4% | 45.6% | 40.7% | 35.5% | 70.9% | 34.1% | 35.1% | 70.9% | 35.8% | 32.4% | 35.1%
Qwen1.5-14B-Chat | 40.2% | 38.6% | 41.3% | 44.1% | 38.4% | 35.3% | 73.8% | 32.4% | 28.9% | 73.8% | 32.2% | 30.9% | 28.9%
Llama-3-8B-Instruct | 39.9% | 39.8% | 40.0% | 45.0% | 37.0% | 34.4% | 68.4% | 34.0% | 27.5% | 68.4% | 36.1% | 31.6% | 27.5%
Mistral-7B-Instruct-v0.3 | 35.2% | 35.2% | 35.1% | 40.5% | 31.6% | 30.1% | 71.7% | 25.8% | 27.2% | 71.7% | 25.0% | 23.4% | 27.2%
Qwen1.5-7B-Chat | 34.4% | 33.6% | 35.0% | 40.1% | 29.9% | 29.7% | 69.3% | 25.1% | 28.3% | 69.3% | 24.8% | 22.9% | 28.3%
Deepseek-Coder-7B-Instruct-v1.5 | 34.1% | 33.9% | 34.2% | 38.4% | 33.0% | 27.6% | 68.4% | 25.1% | 26.5% | 68.4% | 22.7% | 24.2% | 26.5%
Gemma-7B-Instruct | 23.3% | 20.6% | 25.3% | 28.7% | 18.2% | 19.9% | 48.2% | 15.7% | 22.9% | 48.2% | 13.5% | 14.9% | 22.9%
Tablellama | 21.1% | 23.9% | 19.1% | 25.4% | 20.0% | 14.9% | 45.4% | 16.3% | 9.9% | 45.4% | 15.5% | 14.4% | 9.9%
Table 6: Performance of LLMs on MiMoTable. GPT-4o-CI refers to the GPT-4o model with a code interpreter plugin.
Figure 7: The performance of LLMs on MiMoTable with respect to different Meta Operations, Domains and Table Types.
Implementation Details. For all LLMs except Tablellama, we input table contents in markdown format. For GPT-4o, we also tested another popular approach in table reasoning, which is denoted as
GPT-4o-CI in Table 6. This method uploads spread-
sheets and generates Python code, executed via a
code interpreter plugin, with results fed back for
analysis. MiMoTable and WikiTableQuestion are stored as spreadsheet files, which can be used directly as part of the inputs to GPT-4o-CI. For WikiSQL, we
first saved the non-file table content as Excel files
and fed them to GPT-4o-CI. For Tablellama, we
followed the prompt format specified in the origi-
nal paper. For the existing benchmarks, we used
GPT-4o to divide the questions according to the
meta operations and then calculated the difficulty
scores of the datasets based on Equation 6. We use
the official default parameters for all models. More
details can be found in the supplementary material.
3.2 Results and Analysis
Overall Performance. As shown in Table 6, we
evaluate our proposed benchmark dataset using different LLMs. Because most of the evaluated LLMs cannot generate edited files or charts, we only run inference on questions without the Edit and Visualize meta operations, for a fair comparison with GPT-4o-CI.
As we can see, the best-performing model, Claude-
3.5-Sonnet, achieved an overall performance of
only 77.4% on our benchmark, highlighting that
MiMoTable poses significant challenges for current
LLMs. This underscores the need for further ex-
ploration to improve model performance on more
realistic table data.
Figure 8: Performance on different data types.
Analysis of different approaches. There are mainly two approaches to solving table reasoning problems on spreadsheets. One approach is to represent the spreadsheet content in text form and input it into the model to directly generate answers. The other is to use the spreadsheet directly as input, write code, run it in a sandbox, and draw a conclusion from the results, solving the problem in a ReAct (Yao et al., 2023) manner. As
shown in Figure 8, we compare the performance
of those two methods based on the GPT-4o model,
where GPT-4o-TXT is the first, text-based approach and GPT-4o-CI is the second, code-based approach. GPT-4o-CI performs better than GPT-4o-TXT
in Calculate and Compare when the table difficulty
is simple and medium, while GPT-4o-TXT per-
forms better in hard tables and in meta operations
of Lookup and Reasoning. This reveals that the
code-based approach has advantages in calculating
only when the tables are not so hard, as hard ta-
bles can cause the model to be unable to write the
correct code to locate the required data. The text-
based approach is good at Lookup and Reasoning
because the model can see the entire content of the
table as long as the context window size is enough.
The first radar chart in Figure 7 shows the results of more LLMs with respect to the different combinations of table difficulty and meta operations.
Capability of LLMs for table reasoning. Ac-
cording to the results of table difficulty in Table
6 and table types in Figure 7, most LLMs struggle with medium and hard tables. We attribute this to two factors: hierarchical headers and multiple similar tables. As illustrated in Figure 9,
although the hierarchical relations between table
cells appear very clear in the original spreadsheets,
they become much less intuitive when converted
into text, which poses challenges for the LLMs to
understand. Additionally, the tables in the spread-
sheets with multiple sheets are usually very similar.
LLMs need to comprehensively consider multiple similar tables to answer questions. In conclusion,
the ability to understand complex table structures
and multiple similar tables in table reasoning needs
to be improved for current LLMs. We also ob-
serve differences in the performance of different
models across languages. For example, Claude-
3.5-Sonnet performs better in English than in Chi-
nese, while Qwen2-72B-Instruct is the opposite.
We believe this is due to the varying proportions of
different languages used during the pretraining and
SFT stages for each model.
Effectiveness of Meta Operations. Although Wik-
iTableQuestion and WikiSQL are both datasets for
TableQA tasks with tables that have simple headers,
we found that the performance of the same LLMs
on these two datasets varies significantly. For ex-
ample, Llama3-70B achieves 82.0% accuracy on
WikiSQL but only 66.7% accuracy on WikiTable-
Question. No objective metric exists to explain this
discrepancy. By scoring the datasets according to
our meta operations, we found that WikiTableQues-
tion is significantly more difficult than WikiSQL:
the difficulty of WikiSQL is 1.5, while the diffi-
culty of WikiTableQuestion is 2.0. The questions
involving simple tables in MiMoTable, denoted as
MiMoTable-Simple, have a difficulty of 2.2. We
evaluated the performance of different LLMs on
these three datasets—WikiSQL, WikiTableQues-
tion, and MiMoTable-Simple—and the results are
shown in Figure 10. The x-axis represents the diffi-
culty of the benchmarks graded by meta operations,
and the y-axis shows the accuracy of the LLMs on
these benchmarks. We found that as the difficulty
score of the dataset increases, the model perfor-
mance declines. This indicates that our proposed
meta operations and difficulty scores are both gener-
alizable and effective across different benchmarks.
Figure 9: A spreadsheet converted to text in markdown format. In the example (a "Product Sales Analysis Table"), each of the merged header cells Product1, Product2, and Product3 spans four sub-columns (Weight, Standard Pieces, Pieces, Proportion); after conversion, each merged cell is followed by empty cells, so the hierarchy is no longer explicit.
[Figure 10 plot: accuracy (0.0% to 100.0%) against benchmark difficulty (1.5 to 2.2) for Claude-3.5-Sonnet, GPT-4o-CI, Gemini-1.5-Pro, Llama3-70B, and Tablellama.]
Figure 10: The relations between performance and the
difficulty of benchmarks. The x-axis is the difficulty of
benchmarks graded by meta operations. The y-axis is
the accuracy of tested LLMs.
4 Related Work
The main tasks for table reasoning include four
categories: TableQA, Table2Text, Table Manip-
ulation, and Advanced Data Analysis (Lu et al.,
2024). Researchers have proposed various table
benchmarks for these tasks. TableQA is the most
popular task, including benchmarks like WikiTable-
Questions (Pasupat and Liang, 2015), WikiSQL
(Zhong et al., 2017), FeTaQA (Nan et al., 2022),
HybridQA (Chen et al., 2020), TATQA (Zhu et al.,
2021), NQ-TABLES (Kwiatkowski et al., 2019),
HybriDialogue (Nakamura et al., 2022), BIRD (Li
et al., 2023b), and Spider (Yu et al., 2018). The primary
benchmark for Table2Text is ToTTo (Parikh et al.,
2020). Li et al. (2024) introduce WikiTableEdit, a
high-quality Table Manipulation benchmark, and
SPREADSHEETBENCH (Ma et al.,
2024) is a challenging spreadsheet manipulation
benchmark. For the Advanced Data Analysis task,
the benchmarks DAEval (Hu et al., 2024) and DS-1000
(Lai et al., 2023) have been proposed. Text2Analysis
(He et al., 2024) is a recently introduced benchmark
that includes both TableQA and Advanced Data
Analysis tasks. Most existing table benchmarks
feature simple table headers, but HiTab (Cheng
et al., 2022) is a TableQA and Table2Text dataset
based on hierarchical headers. AIT-QA (Katsis
et al., 2022) is a dataset for TableQA with hierarchical
headers specific to the airline industry.
Unlike the existing benchmarks, we propose a
new benchmark, MiMoTable, the first benchmark
with multi-scale spreadsheets that simultaneously
covers four tasks: TableQA, Table2Text, Table Ma-
nipulation, and Advanced Data Analysis.
5 Conclusion
We propose a multi-scale spreadsheet benchmark
with four tasks: TableQA, Table2Text, Table Ma-
nipulation, and Advanced Data Analysis, named
MiMoTable. Experiments show that even the
best-performing LLM, Claude-3.5-Sonnet, reaches only
77.4% accuracy on this benchmark, indicating that
there is still significant room for improvement
in more realistic scenarios. For table reasoning,
we also propose a new criterion for categorizing
problems based on meta operations. Compared to
task-based categorization, this criterion allows for
a deeper and more accurate analysis of problems in
table datasets. Our experiments demonstrate that
the meta operations are general and effective.
6 Limitations
When validating the effectiveness of meta operations,
we did not perform Supervised Fine-Tuning
(SFT) on the models. Future work could examine
the role and effect of each type of operation through
SFT. Meanwhile, we used the same prompt to
evaluate all models except Tablellama, without
optimizing or adapting prompts for individual models.
For hyperparameters, we used the officially
recommended defaults for each model and did not
tune them per model. Additionally, inspired by
work in other fields (Song et al., 2024a,b), develop-
ing long-context table reasoning benchmarks and
studying in-context learning for table reasoning are
valuable directions for further exploration.
Acknowledgments
We thank the three anonymous reviewers for care-
fully reading our paper and their insightful com-
ments and suggestions.
References
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang,
Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei
Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin,
Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu,
Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren,
Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong
Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang
Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian
Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi
Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang,
Yichang Zhang, Zhenru Zhang, Chang Zhou, Jin-
gren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023.
Qwen technical report. CoRR , abs/2309.16609.
Wenhu Chen, Hanwen Zha, Zhiyu Chen, Wenhan Xiong,
Hong Wang, and William Yang Wang. 2020. Hy-
bridqa: A dataset of multi-hop question answering
over tabular and textual data. In Findings of the As-
sociation for Computational Linguistics: EMNLP
2020, Online Event, 16-20 November 2020 , volume
EMNLP 2020 of Findings of ACL , pages 1026–1036.
Association for Computational Linguistics.
Zhoujun Cheng, Haoyu Dong, Zhiruo Wang, Ran Jia,
Jiaqi Guo, Yan Gao, Shi Han, Jian-Guang Lou, and
Dongmei Zhang. 2022. Hitab: A hierarchical table
dataset for question answering and natural language
generation. In Proceedings of the 60th Annual Meet-
ing of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL 2022, Dublin, Ireland,
May 22-27, 2022 , pages 1094–1110. Association for
Computational Linguistics.
Zhoujun Cheng, Tianbao Xie, Peng Shi, Chengzu
Li, Rahul Nadkarni, Yushi Hu, Caiming Xiong,
Dragomir Radev, Mari Ostendorf, Luke Zettlemoyer,
Noah A. Smith, and Tao Yu. 2023. Binding language
models in symbolic languages. In The Eleventh In-
ternational Conference on Learning Representations,
ICLR 2023, Kigali, Rwanda, May 1-5, 2023 . Open-
Review.net.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey,
Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman,
Akhil Mathur, Alan Schelten, Amy Yang, Angela
Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang,
Archi Mitra, Archie Sravankumar, Artem Korenev,
Arthur Hinsvark, Arun Rao, Aston Zhang, Aurélien
Rodriguez, Austen Gregerson, Ava Spataru, Bap-
tiste Rozière, Bethany Biron, Binh Tang, Bobbie
Chern, Charlotte Caucheteux, Chaya Nayak, Chloe
Bi, Chris Marra, Chris McConnell, Christian Keller,
Christophe Touret, Chunyang Wu, Corinne Wong,
Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Al-
lonsius, Daniel Song, Danielle Pintz, Danny Livshits,
David Esiobu, Dhruv Choudhary, Dhruv Mahajan,
Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes,
Egor Lakomkin, Ehab AlBadawy, Elina Lobanova,
Emily Dinan, Eric Michael Smith, Filip Radenovic,
Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Geor-
gia Lewis Anderson, Graeme Nail, Grégoire Mialon,
Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah
Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov,
Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan
Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan
Geffert, Jana Vranes, Jason Park, Jay Mahadeokar,
Jeet Shah, Jelmer van der Linde, Jennifer Billock,
Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi,
Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu,
Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph
Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia,
Kalyan Vasuden Alwala, Kartikeya Upasani, Kate
Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, and
et al. 2024. The llama 3 herd of models. CoRR ,
abs/2407.21783.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai
Dong, Wentao Zhang, Guanting Chen, Xiao Bi,
Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng
Liang. 2024. Deepseek-coder: When the large
language model meets programming - the rise of code
intelligence. CoRR , abs/2401.14196.
Xinyi He, Mengyu Zhou, Xinrun Xu, Xiaojun Ma,
Rui Ding, Lun Du, Yan Gao, Ran Jia, Xu Chen,
Shi Han, Zejian Yuan, and Dongmei Zhang. 2024.
Text2analysis: A benchmark of table question an-
swering with advanced data analysis and unclear
queries. In Thirty-Eighth AAAI Conference on Artifi-
cial Intelligence, AAAI 2024, Thirty-Sixth Conference
on Innovative Applications of Artificial Intelligence,
IAAI 2024, Fourteenth Symposium on Educational
Advances in Artificial Intelligence, EAAI 2024, February
20-27, 2024, Vancouver, Canada, pages 18206–
18215. AAAI Press.
Xueyu Hu, Ziyu Zhao, Shuang Wei, Ziwei Chai, Guoyin
Wang, Xuwu Wang, Jing Su, Jingjing Xu, Ming
Zhu, Yao Cheng, Jianbo Yuan, Kun Kuang, Yang
Yang, Hongxia Yang, and Fei Wu. 2024. Infiagent-
dabench: Evaluating agents on data analysis tasks.
CoRR , abs/2401.05507.
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Men-
sch, Chris Bamford, Devendra Singh Chaplot, Diego
de Las Casas, Florian Bressand, Gianna Lengyel,
Guillaume Lample, Lucile Saulnier, Lélio Re-
nard Lavaud, Marie-Anne Lachaux, Pierre Stock,
Teven Le Scao, Thibaut Lavril, Thomas Wang, Timo-
thée Lacroix, and William El Sayed. 2023. Mistral
7b. CoRR, abs/2310.06825.
Yannis Katsis, Saneem A. Chemmengath, Vishwa-
jeet Kumar, Samarth Bharadwaj, Mustafa Canim,
Michael R. Glass, Alfio Gliozzo, Feifei Pan, Jay-
deep Sen, Karthik Sankaranarayanan, and Soumen
Chakrabarti. 2022. AIT-QA: question answering
dataset over complex tables in the airline industry.
In Proceedings of the 2022 Conference of the North
American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies:
Industry Track, NAACL 2022, Hybrid: Seattle, Wash-
ington, USA + Online, July 10-15, 2022 , pages 305–
314. Association for Computational Linguistics.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red-
field, Michael Collins, Ankur P. Parikh, Chris Alberti,
Danielle Epstein, Illia Polosukhin, Jacob Devlin, Ken-
ton Lee, Kristina Toutanova, Llion Jones, Matthew
Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob
Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natu-
ral questions: a benchmark for question answering
research. Trans. Assoc. Comput. Linguistics , 7:452–
466.
Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang,
Ruiqi Zhong, Luke Zettlemoyer, Wen-Tau Yih,
Daniel Fried, Sida I. Wang, and Tao Yu. 2023. DS-
1000: A natural and reliable benchmark for data sci-
ence code generation. In International Conference
on Machine Learning, ICML 2023, 23-29 July 2023,
Honolulu, Hawaii, USA , volume 202 of Proceedings
of Machine Learning Research , pages 18319–18345.
PMLR.
Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and
Zhaoxiang Zhang. 2023a. Sheetcopilot: Bringing
software productivity to the next level through large
language models. In Advances in Neural Information
Processing Systems 36: Annual Conference on Neu-
ral Information Processing Systems 2023, NeurIPS
2023, New Orleans, LA, USA, December 10 - 16,
2023 .
Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li,
Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng,
Nan Huo, Xuanhe Zhou, Chenhao Ma, Guoliang
Li, Kevin Chen-Chuan Chang, Fei Huang, Reynold
Cheng, and Yongbin Li. 2023b. Can LLM already
serve as A database interface? A big bench for large-
scale database grounded text-to-sqls. In Advances in
Neural Information Processing Systems 36: Annual
Conference on Neural Information Processing Sys-
tems 2023, NeurIPS 2023, New Orleans, LA, USA,
December 10 - 16, 2023 .
Zheng Li, Xiang Chen, and Xiaojun Wan. 2024. Wik-
itableedit: A benchmark for table editing by natural
language instruction. CoRR , abs/2403.02962.
Qian Liu, Bei Chen, Jiaqi Guo, Morteza Ziyadi, Zeqi
Lin, Weizhu Chen, and Jian-Guang Lou. 2022.
TAPEX: table pre-training via learning a neural SQL
executor. In The Tenth International Conference on
Learning Representations, ICLR 2022, Virtual Event,
April 25-29, 2022 . OpenReview.net.
Weizheng Lu, Jiaming Zhang, Jing Zhang, and Yueguo
Chen. 2024. Large language model for table process-
ing: A survey. CoRR , abs/2402.05121.
Zeyao Ma, Bohan Zhang, Jing Zhang, Jifan Yu, Xi-
aokang Zhang, Xiaohan Zhang, Sijia Luo, Xi Wang,
and Jie Tang. 2024. Spreadsheetbench: Towards chal-
lenging real world spreadsheet manipulation. CoRR ,
abs/2406.14991.
Thomas Mesnard, Cassidy Hardin, Robert Dadashi,
Surya Bhupatiraju, Shreya Pathak, Laurent Sifre,
Morgane Rivière, Mihir Sanjay Kale, Juliette Love,
Pouya Tafti, Léonard Hussenot, Aakanksha Chowdhery,
Adam Roberts, Aditya Barua, Alex Botev, Alex
Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea
Tacchetti, Anna Bulanova, Antonia Paterson, Beth
Tsai, Bobak Shahriari, Charline Le Lan, Christo-
pher A. Choquette-Choo, Clément Crepy, Daniel Cer,
Daphne Ippolito, David Reid, Elena Buchatskaya,
Eric Ni, Eric Noland, Geng Yan, George Tucker,
George-Cristian Muraru, Grigory Rozhdestvenskiy,
Henryk Michalewski, Ian Tenney, Ivan Grishchenko,
Jacob Austin, James Keeling, Jane Labanowski,
Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan,
Jeremy Chen, Johan Ferret, Justin Chiu, and et al.
2024. Gemma: Open models based on gemini re-
search and technology. CoRR , abs/2403.08295.
Kai Nakamura, Sharon Levy, Yi-Lin Tuan, Wenhu Chen,
and William Yang Wang. 2022. Hybridialogue: An
information-seeking dialogue dataset grounded on
tabular and textual data. In Findings of the Asso-
ciation for Computational Linguistics: ACL 2022,
Dublin, Ireland, May 22-27, 2022 , pages 481–492.
Association for Computational Linguistics.
Linyong Nan, Chiachun Hsieh, Ziming Mao, Xi Victo-
ria Lin, Neha Verma, Rui Zhang, Wojciech Kryscin-
ski, Hailey Schoelkopf, Riley Kong, Xiangru Tang,
Mutethia Mutuma, Ben Rosand, Isabel Trindade,
Renusree Bandaru, Jacob Cunningham, Caiming
Xiong, and Dragomir R. Radev. 2022. Fetaqa: Free-
form table question answering. Trans. Assoc. Com-
put. Linguistics , 10:35–49.
OpenAI. 2023. Gpt-4 technical report. Preprint ,
arXiv:2303.08774.
Ankur P. Parikh, Xuezhi Wang, Sebastian Gehrmann,
Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and
Dipanjan Das. 2020. Totto: A controlled table-to-text
generation dataset. In Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language
Processing, EMNLP 2020, Online, November 16-20,
2020 , pages 1173–1186. Association for Computa-
tional Linguistics.
Panupong Pasupat and Percy Liang. 2015. Compo-
sitional semantic parsing on semi-structured tables.
In Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics and the
7th International Joint Conference on Natural Lan-
guage Processing of the Asian Federation of Natural
Language Processing, ACL 2015, July 26-31, 2015,
Beijing, China, Volume 1: Long Papers , pages 1470–
1480. The Association for Computer Linguistics.
Machel Reid, Nikolay Savinov, Denis Teplyashin,
Dmitry Lepikhin, Timothy P. Lillicrap, Jean-Baptiste
Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan
Firat, Julian Schrittwieser, Ioannis Antonoglou, Ro-
han Anil, Sebastian Borgeaud, Andrew M. Dai, Katie
Millican, Ethan Dyer, Mia Glaese, Thibault Sotti-
aux, Benjamin Lee, Fabio Viola, Malcolm Reynolds,
Yuanzhong Xu, James Molloy, Jilin Chen, Michael
Isard, Paul Barham, Tom Hennigan, Ross McIl-
roy, Melvin Johnson, Johan Schalkwyk, Eli Collins,
Eliza Rutherford, Erica Moreira, Kareem Ayoub,
Megha Goel, Clemens Meyer, Gregory Thornton,
Zhen Yang, Henryk Michalewski, Zaheer Abbas,
Nathan Schucher, Ankesh Anand, Richard Ives,
James Keeling, Karel Lenc, Salem Haykal, Siamak
Shakeri, Pranav Shyam, Aakanksha Chowdhery, Ro-
man Ring, Stephen Spencer, Eren Sezener, and et al.
2024. Gemini 1.5: Unlocking multimodal under-
standing across millions of tokens of context. CoRR ,
abs/2403.05530.
Mingyang Song, Mao Zheng, and Xuan Luo. 2024a.
Can many-shot in-context learning help llms as eval-
uators? a preliminary empirical study. Preprint ,
arXiv:2406.11629.
Mingyang Song, Mao Zheng, and Xuan Luo. 2024b.
Counting-stars: A multi-evidence, position-aware,
and scalable benchmark for evaluating long-context
large language models. Preprint , arXiv:2403.11802.
Yuan Sui, Mengyu Zhou, Mingjie Zhou, Shi Han, and
Dongmei Zhang. 2024. Table meets LLM: can large
language models understand structured table data?
A benchmark and empirical study. In Proceedings
of the 17th ACM International Conference on Web
Search and Data Mining, WSDM 2024, Merida, Mex-
ico, March 4-8, 2024 , pages 645–654. ACM.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng,
Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan
Li, Dayiheng Liu, Fei Huang, Guanting Dong, Hao-
ran Wei, Huan Lin, Jialong Tang, Jialin Wang,
Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin
Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai,
Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Ke-
qin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni,
Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize
Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan,
Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge,
Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren,
Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing
Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan,
Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang,
Zhifang Guo, and Zhihao Fan. 2024. Qwen2 techni-
cal report. CoRR , abs/2407.10671.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak
Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023.
React: Synergizing reasoning and acting in language
models. In The Eleventh International Conference
on Learning Representations, ICLR 2023, Kigali,
Rwanda, May 1-5, 2023 . OpenReview.net.
Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga,
Dongxu Wang, Zifan Li, James Ma, Irene Li,
Qingning Yao, Shanelle Roman, Zilin Zhang, and
Dragomir R. Radev. 2018. Spider: A large-scale
human-labeled dataset for complex and cross-domain
semantic parsing and text-to-sql task. In Proceed-
ings of the 2018 Conference on Empirical Methods
in Natural Language Processing, Brussels, Belgium,
October 31 - November 4, 2018 , pages 3911–3921.
Association for Computational Linguistics.
Tianshu Zhang, Xiang Yue, Yifei Li, and Huan Sun.
2023. Tablellama: Towards open large generalist
models for tables. CoRR , abs/2311.09206.Xuanliang Zhang, Dingzirui Wang, Longxu Dou,
Qingfu Zhu, and Wanxiang Che. 2024. A survey of
table reasoning with large language models. CoRR ,
abs/2402.08259.
Victor Zhong, Caiming Xiong, and Richard Socher.
2017. Seq2sql: Generating structured queries
from natural language using reinforcement learning.
CoRR , abs/1709.00103.
Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao
Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and
Tat-Seng Chua. 2021. TAT-QA: A question answer-
ing benchmark on a hybrid of tabular and textual
content in finance. In Proceedings of the 59th An-
nual Meeting of the Association for Computational
Linguistics and the 11th International Joint Confer-
ence on Natural Language Processing, ACL/IJCNLP
2021, (Volume 1: Long Papers), Virtual Event, Au-
gust 1-6, 2021 , pages 3277–3287. Association for
Computational Linguistics.
A Appendix
A.1 Data Statistics by Language
Table 7 shows the statistics of the proposed
benchmark for each language, including the
number of tables, the number of questions, and the
average question difficulty.
          Table Number   Question Number   Question Difficulty
Overall   428            1719              2.2
English   182            671               2.2
Chinese   246            1048              2.2
Table 7: Data Statistics of Different Languages
A.2 Used Prompts
Table 8, Table 9, and Table 10 show the designed
prompts for meta operations classification, model
inference, and performance evaluation in this paper.
You are a spreadsheet question classification expert.
Given a user’s question about an Excel spreadsheet,
classify the question according to the requirements and
output it in the specified format.
<Requirements>
The following operation classification already exists,
presented in the format of operation name: operation
description. If the user’s question can be classified as
some of the operations, output the operation names. One
question can be classified into multiple operations.
Lookup: Locate the position of the specific target.
Edit: Modify, delete, or add to a table.
Calculate: The numerical computation, sum, avg, max,
etc.
Compare: Compare two or more targets in a table.
Visualize: Show in chart.
Reasoning: Inferring information from the table content
that is not explicitly included.
</Requirements>
<Output Format>
operation name 1, operation name 2, ...
</Output Format>
<Question>
what country hosted the most tournaments?
</Question>
Lookup, Calculate, Compare
<Question>
QUESTION TO BE CLASSIFIED
</Question>
Table 8: Prompt for Meta Operation Classification
Below is the table content in markdown, please answer
the question according to the table content.
<Table>
<Table Name>
SPREADSHEET FILE NAME
</Table Name>
<Table Content>
<Sheet>
<Sheet Name>
SHEET NAME
</Sheet Name>
<Sheet Content>
SHEET CONTENT IN MARKDOWN
</Sheet Content>
</Sheet>
</Table Content>
</Table>
<Question>
THE QUESTION TO BE ASKED
</Question>
Table 9: Prompt for Model Inference
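To make the template in Table 9 concrete, the following sketch fills it for a spreadsheet with one or more sheets. The function name `build_inference_prompt` and the choice to repeat the <Sheet> block once per sheet are assumptions made for illustration; the paper shows only a single-sheet template.

```python
def build_inference_prompt(file_name: str, sheets: dict[str, str], question: str) -> str:
    """Fill the Table 9 template; `sheets` maps sheet name to markdown content."""
    # ASSUMPTION: one <Sheet> block per sheet, in insertion order.
    sheet_blocks = [
        "<Sheet>\n"
        f"<Sheet Name>\n{name}\n</Sheet Name>\n"
        f"<Sheet Content>\n{content}\n</Sheet Content>\n"
        "</Sheet>"
        for name, content in sheets.items()
    ]
    return "\n".join([
        "Below is the table content in markdown, please answer "
        "the question according to the table content.",
        "<Table>",
        f"<Table Name>\n{file_name}\n</Table Name>",
        "<Table Content>",
        *sheet_blocks,
        "</Table Content>",
        "</Table>",
        f"<Question>\n{question}\n</Question>",
    ])


# Hypothetical file name, sheet contents, and question.
prompt = build_inference_prompt(
    "sales.xlsx",
    {"Q1": "Item|Pieces|\n:--|:--|\nPendant|12|", "Q2": "Item|Pieces|\n:--|:--|\nPendant|15|"},
    "Which quarter sold more pendants?",
)
print(prompt)
```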
For the following questions, given the correct answer,
determine whether the candidate’s answer is correct.
If it is correct, output "Correct"; if it is incorrect, out-
put "Incorrect"; if it is uncertain whether it is correct,
output "Uncertain". As long as the candidate’s answer
contains the key information that can correctly answer
the question, it is considered correct. If the question is
open-ended, give a score between 0-1 according to the
correct answer. Do not output any other content.
<Question>
THE QUESTION
</Question>
<Correct Answer>
THE CORRECT ANSWER
</Correct Answer>
<Candidate answer>
THE CANDIDATE ANSWER
</Candidate answer>
Table 10: Prompt for Performance Evaluation
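The evaluation prompt in Table 10 yields either a verdict ("Correct", "Incorrect", "Uncertain") or a score between 0 and 1 for open-ended questions. The sketch below shows one possible way to turn such outputs into an accuracy figure. The mapping of verdicts to numbers and the exclusion of "Uncertain" outputs from the average are assumptions, not the procedure reported in the paper; the function names are hypothetical.

```python
from typing import Optional


def parse_verdict(judge_output: str) -> Optional[float]:
    """Map one judge output to a score in [0, 1], or None if it cannot be scored."""
    text = judge_output.strip().strip('"').lower()
    if text == "correct":
        return 1.0
    if text == "incorrect":
        return 0.0
    if text == "uncertain":
        # ASSUMPTION: uncertain cases are set aside (e.g. for manual review).
        return None
    try:
        score = float(text)  # open-ended questions receive a 0-1 score
    except ValueError:
        return None
    return score if 0.0 <= score <= 1.0 else None


def accuracy(judge_outputs: list[str]) -> float:
    """ASSUMPTION: accuracy = mean score over the outputs that could be scored."""
    scores = [s for s in map(parse_verdict, judge_outputs) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical judge outputs for four questions.
print(accuracy(["Correct", "Incorrect", "Uncertain", "0.5"]))
```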