arXiv:2503.05488v1 [cs.CL] 7 Mar 2025
KIEval: Evaluation Metric for Document Key Information Extraction
Minsoo Khang, Sang Chul Jung, Sungrae Park, and Teakgyu Hong
Upstage AI, South Korea
{mkhang, eric, sungrae.park, tghong}@upstage.ai
Abstract. Document Key Information Extraction (KIE) is a technology that transforms
valuable information in document images into structured data, and it has become an
essential function in industrial settings. However, current evaluation metrics of this
technology do not accurately reflect the critical attributes of its industrial
applications. In this paper, we present KIEval, a novel application-centric evaluation
metric for Document KIE models. Unlike prior metrics, KIEval assesses Document KIE
models not just on the extraction of individual information (entity) but also of the
structured information (grouping). Evaluation of structured information provides
assessment of Document KIE models that is more reflective of extracting grouped
information from documents in industrial settings. Designed with industrial application
in mind, we believe that KIEval can become a standard evaluation metric for developing
or applying Document KIE models in practice. The code will be publicly available.
Keywords: Document AI · Key Information Extraction · Evaluation
Metric.
1 Introduction
Document Key Information Extraction (KIE) is a well-known task of converting
information from document images into structured data and has gained much attention
from both academia and industry over the years [24,23,9,5,8,12,17,13,10].
One common application of Document KIE in industrial settings lies in Robotic
Process Automation (RPA) of document digitisation, which aims to extract,
structure, and store the data in document images into databases for various
downstream applications. Information extracted from documents is often presented
as key-value pairs (e.g. Menu.name: “AMERICANO”) that are frequently
interrelated (e.g. Menu.name & Menu.price), forming the basis of structured
information in documents.
Despite such application settings, a standardized evaluation metric for Document
KIE models has yet to be established, and existing metrics used in prior
works fail to consider several key components from the application’s standpoint.
The main causes of disparity between the existing evaluation metrics and application
settings can be attributed to: neglecting the structured nature of information
in assessment and insufficient alignment of metric formulation with the industrial
applications. Detailed explanations of these causes are as follows (corresponding
visualisations are shown in Fig. 1 and 2):
[Figure 1: three panels (Document Image, Ground-truth, Predictions); the ground-truth
and predictions are each split into non-grouped entities (Total, SubTotal, Tax) and
grouped entities (Menu.price, Menu.name).]
Fig. 1: Example of CORD dataset (receipts). The dataset has non-grouped and
grouped entities (non-grouped entities form a special group), and requires structured
predictions including Menu groups: Menu.name and Menu.price. Errors in
model predictions are not limited to individual key-value pair errors but also occur in
the extraction of structural relation between entities (marked in red). Both error
types must be considered in Document KIE model evaluation.
Structured nature in information refers to the presence of structural relation
between key-value pairs in documents. Referring to the example in Fig. 1 and 2,
each value of the entity-type Menu.name has a contextual linkage to a different
value of Menu.price. Existing metrics, however, mainly focus on the assessment
of individual entity extraction, while reflecting limited or no evaluation of the
extraction of such structured information. In industrial applications, however, the
lack of such structured information can lead to critical information loss when
storing data in relational databases for downstream tasks.
Insufficient alignment of existing metrics’ design refers to gaps arising from
formulations that are not fully representative of Document KIE applications
in industrial settings. Existing metrics, such as the Entity-level F1 metric,
often distinguish a KIE model’s erroneous prediction (False-Positive, FP) from a
missed prediction (False-Negative, FN) in metric formulations. Such a distinction,
while well-suited for model development, creates a clear disparity with application
settings where KIE errors are often perceived as the number of corrections
needed. It is worth noting that the correction count refers to the number of value
editing steps (each one of substitution, addition, or deletion) needed to convert KIE
predictions to ground-truth values.
To address the causes of disparity identified above, we propose an evaluation
metric with application-centric design named KIEval (Key Information Extraction
evaluation). Firstly, KIEval is formulated to provide KIE model assessment
at two different levels: entity-level (individual entities such as Menu.name) and
group-level (sets of related entities such as Menu.name, Menu.price). At both levels
of evaluation, prediction and ground-truth values are matched by conditioning
on information structure (group), facilitating structured-information level
assessment of KIE models. Secondly, KIEval formulates KIE errors in terms of
the number of substitution, addition, or deletion steps needed. Such formulation,
instead of the conventional FP and FN, aims to better represent the eventual
cost which KIE errors incur in application settings.
[Figure 2: two panels, each pairing Ground-truth (group) with Predictions for the
Menu.price / Menu.name table of Fig. 1.
Entity-level F1: Recall = 11 / 12 = 91.67%, Precision = 11 / 11 = 100.00%,
F1 = Hmean(91.67%, 100.00%) = 95.65%.
KIEval: Group-level = 4 / 7 = 57.14%, Entity-level = 10 / 13 = 76.92%.]
Fig. 2: Comparison of evaluation metrics for KIE tasks. Note that the ground-truth
and predictions follow the same Document Image in Fig. 1. The red boxes
indicate the errors accounted for by the respective metrics during evaluation.
Entity-level F1 does not account for structural relations, unlike the proposed
KIEval metric, which performs both Entity-level and Group-level evaluations
based on the group-matching information (blue links).
In this work, our key contributions can be summarized as follows. 1) We
propose the KIEval (Key Information Extraction evaluation) metric for Document
KIE, which incorporates structured information assessment in both entity and
group-level evaluation. 2) We formulate KIE model evaluation in terms of
information correction cost, bridging the disparity of existing metrics with industrial
applications. 3) We also showcase a use-case study on how KIEval can be applied
in RPA systems, highlighting its differences against existing metrics.
2 Related Works
2.1 Document KIE Methodology
Various types of approaches to Document KIE have been proposed over the
years. One of the notable earlier works, LayoutLM [24], was the first to propose
a multi-modal framework, incorporating both text and layout modalities in key
information extraction from documents. The framework’s robust performance
with a simple BIO-tagging design has motivated many follow-up works such
as BROS [8], LayoutLMv2 [23], LayoutLMv3 [9], and ERNIE-Layout [17].
Alternative forms of KIE with improved representational flexibility, such as
graph-based [10] and text generation [12,13] frameworks, were proposed in follow-up
works to better capture dependencies between entities and effectively capture
structured information from documents. With the rise of LLM applications, recent
works, such as ICL-D3IE [6] and SAIL [27], have leveraged the flexibility of
LLMs to tackle Document KIE in zero and few-shot settings.
2.2 Existing Metrics
Entity-level F1 score is one of the most commonly used metrics for Document
KIE model evaluation. Upon extracting entity-wise key-value pairs, they
are matched against the ground-truth key-value pairs, where the predicted pair
is considered valid if an exact match can be found in the ground-truth set. Such
exact-match statistics are collated across different entities in the dataset to
evaluate the model’s entity-level F1 score. Commonly used in prior works [24,23,9,17],
entity-level F1 score evaluates the degree to which the model’s extracted information
exactly matches the expected key-value content in the document.
This metric, however, not only disregards structural relation between entities
during entity-level F1 evaluation but also does not provide any assessment
for group-level information extraction. In industrial applications, extracted
entities often form meaningful information only when grouped with other entities
that are structurally related (i.e. a group), such as the grouping of Menu.name,
Menu.quantity, and Menu.price in receipts. Variations of entity-level F1 score
were employed in prior works, such as group-constrained entity-level F1 in SPADE
[10] and Entity Extraction and Linking F1 in BROS [8], offering a more
comprehensive assessment of KIE models. These variations of the entity-level F1 metric
underscore the necessity for a standardized metric for KIE model evaluation
in the field of Document AI. Furthermore, the absence of direct assessment for
group-level information extraction in these metrics highlights the disparity in
meeting the industrial application requirements.
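The exact-match entity-level F1 described above can be sketched as follows. This is a minimal illustration rather than any prior work's reference implementation: the function name `entity_f1` and the price values are assumptions made for the example. It shows why the metric is blind to grouping: swapping the prices of two menu items leaves the score unchanged.

```python
from collections import Counter

def entity_f1(pred, gold):
    """Exact-match entity-level F1 over (entity_type, value) pairs.

    Hypothetical helper: pairs are pooled per document and compared as
    multisets, ignoring which group each pair belongs to.
    """
    pred_counts, gold_counts = Counter(pred), Counter(gold)
    # Multiset intersection: each gold pair can validate one predicted pair.
    tp = sum((pred_counts & gold_counts).values())
    fp = sum(pred_counts.values()) - tp
    fn = sum(gold_counts.values()) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Illustrative values (the prices are made up for this sketch).
gold = [("Menu.name", "AMERICANO"), ("Menu.price", "3,000"),
        ("Menu.name", "LATTE"), ("Menu.price", "4,000")]
# The two prices are attached to the wrong menu items, yet every
# predicted pair still has an exact match somewhere in the gold set.
pred = [("Menu.name", "AMERICANO"), ("Menu.price", "4,000"),
        ("Menu.name", "LATTE"), ("Menu.price", "3,000")]

score = entity_f1(pred, gold)
print(score)
```

Because the pairs are pooled before matching, the swapped prices are invisible to this score, which is the limitation the paragraph above points out.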
Tree Edit Distance (TED) score is another type of KIE evaluation metric,
commonly adopted in text-generation based KIE models [12]. In contrast to the
exact-match based entity-level F1 score, TED adopts a soft-match approach to
avoid over-penalising the model’s KIE output. An edit distance based metric could provide
a more objective assessment of the model by mitigating the impact of minor
discrepancies, such as those between “ice cream” and “ice-cream”, which could
lead to underestimation of the model’s KIE performance. As discussed in Donut [12],
the TED metric can be applied to KIE models by first representing the prediction
and ground truth as trees, before evaluating the edit distance between them.
With the structural relation between entities captured using a tree representation,
this metric offers assessment of the model’s KIE performance not only at the
entity level but also at the group level.
Despite this capacity of the TED metric, its soft-match approach could exacerbate
the discrepancy when applied to industrial settings. Taking an automatic KIE
setting as an example, pairs of information with a minor edit distance could refer
to completely different items, such as “Pear” and “Pea” or “7000” and “1000”.
Consequently, evaluation metrics are required to be stringent, and the provision
of partial scores (based on edit distance) could offer a misleading KIE assessment.
Other notable metrics include Average Normalized Levenshtein Similarity
(ANLS) [2,19] and a hybrid metric of exact-match and edit distance [26]. ANLS
aims to reduce the overestimation of KIE models by constraining the
maximum edit distance between prediction and ground-truth to a pre-specified
threshold value (e.g. 0.5), beyond which no partial score is given. The hybrid metric,
on the other hand, is a weighted arithmetic mean of the KIE model’s entity-level
F1 and inverse Normalized Edit Distance (NED). Nevertheless, these metrics still
share the limitations of Entity-level F1 and TED metrics, and do not provide
group-level assessment of KIE models.
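To make the concern about soft matching concrete, the sketch below computes a thresholded, normalized Levenshtein similarity for a single prediction and ground-truth pair, in the style of ANLS. It is an assumption-laden illustration: the function names and the normalization by the longer string are choices made here, and the exact definitions in [2,19] may differ in details such as case handling.

```python
def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute / keep
        prev = curr
    return prev[-1]

def thresholded_similarity(pred: str, gold: str, tau: float = 0.5) -> float:
    """1 - normalized distance, or 0 once the distance reaches tau."""
    if not pred and not gold:
        return 1.0
    nl = levenshtein(pred, gold) / max(len(pred), len(gold))
    return 1.0 - nl if nl < tau else 0.0

for pred, gold in [("ice-cream", "ice cream"), ("Pea", "Pear"), ("1000", "7000")]:
    print(pred, gold, round(thresholded_similarity(pred, gold), 3))
```

Under these assumptions, “Pea” against “Pear” and “1000” against “7000” each receive a partial score of 0.75 even though they denote different items, which is the kind of partial credit the text argues is misleading in automated settings.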
3 Problem Definition
3.1 Document KIE Task
Document KIE is a task in the field of Document Understanding (DU), with
the objective of extracting structured key-value pairs from document images.
Commonly positioned as the task preceding various knowledge-application
operations (e.g. financial data analysis), Document KIE is often faced with two
principal challenges: (1) accurate extraction of key-value information, requiring
value extraction and entity-key classification with minimal error, and (2)
the discernment of structural relations between different key-value pairs, which
demands accurate identification of contextual links between key-value pairs to
form a coherent group-level information unit.
3.2 Measuring Document KIE Model in Industrial Settings
To create an application-centric evaluation metric, the following principal challenges
of prior metrics must be addressed: (1) absence of structural relation in metrics
and (2) insufficient alignment of metric formulation with application settings.
Structural relation refers to the contextual linkage between key-value pairs
associated with entities in documents. Entity key-value pairs that share contextual
linkage are formally defined as a group, such as Menu.name and Menu.price
in the CORD example (Fig. 1). Such structural relations present in documents
need to be considered in both entity-level and group-level evaluations. Prior
metrics in entity-level evaluation (e.g. Entity-level F1) show limited or no inclusion
of structural relation, evaluating each extracted key-value pair independently
of the remaining key-value pairs. The prediction results visualized in Fig. 2 clearly
demonstrate this: the prediction of {Menu.price: “54,545”} is not matched
with the corresponding {Menu.name: “LIPTON PITCHER”}. Despite predicting
a valid Menu.price key-value pair, this prediction is regarded as erroneous due to
failure in capturing the relation with the contextually linked Menu.name key-value
pair. This can be intuitively understood as follows: a standalone Menu.price key-value
pair does not provide meaningful information from the application point of view unless
paired with the corresponding Menu.name. In group-level evaluations, a more
direct assessment of structural relation is conducted, where each group (instead
of each key-value pair) is treated as a unit of extracted information. Group-level
evaluation provides essential assessment of the KIE model, especially in industrial
applications where information is often used in contextually
related groups (e.g. Menu.name, Menu.price).
[Figure 3: three Menu.name scenarios.
(a) Scenario #1: ground-truth {Americano, Latte}, prediction {Americano}; one FN;
Recall = 1/2, Precision = 1/1, F1 = 2/3.
(b) Scenario #2: ground-truth {Americano, Latte}, prediction {Americano, Juice}; one FN
and one FP; Recall = 1/2, Precision = 1/2, F1 = 1/2.
(c) Scenario #3: ground-truth {Americano}, prediction {Americano, Juice}; one FP;
Recall = 1/1, Precision = 1/2, F1 = 2/3.]
Fig. 3: F1 score examples in three different scenarios. From the application
perspective, the three different scenarios require the same number of error
corrections: (#1) filling missing information, (#2) replacing wrong information, and
(#3) deleting the unexpected information. However, in the view of F1 scores,
false negative (FN) and false positive (FP) are separately counted to identify
the representative score value, F1.
In addition to structural relation between key-value pairs, metric formulation
is another factor that creates the disparity between the (prior works’) model-centric
and (KIEval’s) application-centric design. Model-centric metric formulations
often distinguish the model’s erroneous prediction (FP) from missed prediction
(FN), such as in the Entity-level F1 metric. In industrial applications, however, it
is more relevant to assess KIE models in terms of the additional cost incurred due
to KIE errors. To elaborate, with reference to Fig. 3, Entity-level F1 evaluation
across the three scenarios implies lower KIE performance in scenario 2. From
the application perspective, however, all three scenarios’ predictions incur
the same cost of one editing operation (addition, substitution, or deletion) in
KIE automation. Consequently, it is imperative to develop an application-centric
metric formulation that accurately reflects the actual application settings.
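The contrast drawn from Fig. 3 can be checked with a few lines of Python. This is a sketch under one stated assumption: a false positive paired with a false negative is repaired by a single substitution, so the number of corrections is max(FP, FN). The function name `f1_and_corrections` is hypothetical.

```python
from collections import Counter

def f1_and_corrections(pred, gold):
    """Return (F1, number of edit operations) for two value lists."""
    p, g = Counter(pred), Counter(gold)
    tp = sum((p & g).values())
    fp = sum(p.values()) - tp
    fn = sum(g.values()) - tp
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # min(fp, fn) substitutions plus the leftover additions or deletions
    # add up to max(fp, fn) edit operations.
    return f1, max(fp, fn)

scenarios = {
    "#1 missing value":    (["Americano"], ["Americano", "Latte"]),
    "#2 wrong value":      (["Americano", "Juice"], ["Americano", "Latte"]),
    "#3 unexpected value": (["Americano", "Juice"], ["Americano"]),
}
results = {name: f1_and_corrections(pred, gold)
           for name, (pred, gold) in scenarios.items()}
for name, (f1, edits) in results.items():
    print(f"{name}: F1 = {f1:.3f}, corrections = {edits}")
```

The F1 values differ across the scenarios (2/3, 1/2, 2/3) while the correction count is one in every case, matching the observation in the text.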
Based on the key challenges of application-centric design defined above, our
proposed metric, KIEval, is designed with these factors in mind to bridge
the disparity between the current metrics and industrial applications.
4 KIEval
4.1 Structured Evaluation – Entity and Group Level
In KIEval, to integrate structural relation into the KIE evaluation, group-matching
is conducted between the predicted and ground-truth key-value pairs prior to
entity-level and group-level evaluations. While a variant of group-matching for
entity-level evaluation was employed in [10], the lack of a formal definition
underscores its significance in view of KIE metric standardisation. To illustrate,
let $PR = \{pr_1, pr_2, \ldots, pr_N\}$ be a set of predicted groups and
$GT = \{gt_1, gt_2, \ldots, gt_M\}$ be the ground-truth groups, where each group consists
of a set of entities represented by tuples, (entity-type, value). The non-group entities
(e.g. company.name and company.number in a receipt) are included in the first group to
represent all entities with the same structural format. For the formal definition of
group-matching, we define a matching score $S_{(n,m)}$ counting the identical entities
between $pr_n$ and $gt_m$. Based on the matching scores between groups, each prediction
group is matched with a ground-truth group through Hungarian matching
to obtain a group-matched set of groups, $G = \{(n_1, m_1), (n_2, m_2), \ldots\}$, where
$n_g$ and $m_g$ indicate the $g$-th matched indices of predicted and ground-truth groups,
respectively, and $|G|$ equals $\min(N, M)$. The group-matching can be defined
as follows:

$$G = \mathrm{Hungarian}(PR, GT, S) \tag{1}$$

where $S$ indicates the set of matching scores, $S_{(n,m)}$, between all pairs of
predictions and ground-truths. For an entity $e$ at a matching $(n, m)$, F1 statistics
such as True-Positive (TP), False-Negative (FN), and False-Positive (FP) can
be calculated as follows:

$$TP^{e}_{(n,m)} = S^{e}_{(n,m)}, \quad FN^{e}_{(n,m)} = N_e(gt_m) - S^{e}_{(n,m)}, \quad FP^{e}_{(n,m)} = N_e(pr_n) - S^{e}_{(n,m)} \tag{2}$$

where $S^{e}_{(n,m)}$ indicates the number of identical entity pairs with entity-type
$e$ between the $n$-th predicted and $m$-th ground-truth groups. $N_e(\cdot)$ represents the
operation counting entities of type $e$ in a group. In other words, $TP^{e}_{(n,m)}$ indicates
the matched entities, and $FN^{e}_{(n,m)}$ and $FP^{e}_{(n,m)}$ represent the remaining
ground-truths and predictions in the specific match $(n, m)$ in terms of entity type $e$,
respectively. To calculate a final cumulated score, KIEval Entity F1, the total
F1 statistics are identified as follows:

$$TP^{entity} = \sum_{(n,m) \in G} \sum_{e} TP^{e}_{(n,m)} \tag{3}$$

$$FN^{entity} = \sum_{m}^{M} \sum_{e} N_e(gt_m) - TP^{entity} \tag{4}$$

$$FP^{entity} = \sum_{n}^{N} \sum_{e} N_e(pr_n) - TP^{entity} \tag{5}$$

These statistics can be used to calculate the KIEval Entity F1 metric using the standard
precision and recall formulation.
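As a sketch of Eqs. 1 to 5, the code below matches predicted groups to ground-truth groups and accumulates the entity-level statistics. Two things are assumptions of this illustration and not part of the paper: the function names, and the use of a brute-force search over assignments as a stand-in for Hungarian matching (it finds the same optimum but is only practical for a handful of groups). The data reproduce the constructed scenario of Fig. 5 (top).

```python
from collections import Counter
from itertools import permutations

def match_score(pr, gt):
    """S_(n,m): number of identical (entity_type, value) tuples."""
    return sum((Counter(pr) & Counter(gt)).values())

def group_matching(PR, GT):
    """Index pairs (n, m) maximising the summed matching score.

    Brute force over assignments; a stand-in for Hungarian matching.
    """
    swap = len(PR) > len(GT)
    small, large = (GT, PR) if swap else (PR, GT)
    best_total, best_pairs = -1, []
    for perm in permutations(range(len(large)), len(small)):
        pairs = [(j, i) if swap else (i, j) for i, j in enumerate(perm)]
        total = sum(match_score(PR[n], GT[m]) for n, m in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs

def kieval_entity_f1(PR, GT):
    G = group_matching(PR, GT)
    tp = sum(match_score(PR[n], GT[m]) for n, m in G)   # Eq. 3
    fn = sum(len(gt) for gt in GT) - tp                 # Eq. 4
    fp = sum(len(pr) for pr in PR) - tp                 # Eq. 5
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def group(price, count, name):
    return [("Menu.price", price), ("Menu.count", count), ("Menu.name", name)]

gt = [group("80,000", "4", "SIAO MAI BABI"),
      group("60,000", "3", "CEKER AYAM"),
      group("42,000", "2", "BAKPAO BKR C CRISPY")]
pred1 = [list(g) for g in gt]                       # identical to ground truth
pred2 = [group("60,000", "2", "SIAO MAI BABI"),     # values right, grouping wrong
         group("80,000", "3", "BAKPAO BKR C CRISPY"),
         group("42,000", "4", "CEKER AYAM")]

print(kieval_entity_f1(pred1, gt), kieval_entity_f1(pred2, gt))
```

With these inputs the second prediction shares exactly one entity with every ground-truth group, so any matching yields three true positives out of nine entities, consistent with the KIEval Entity F1 of 1/3 reported in Fig. 5.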
Group-level evaluation, KIEval Group F1, is also conducted on the group-matched
set $G$, where F1 statistics are evaluated across different groups. Unlike
KIEval Entity F1, which treats each entity within a group as a unit of information to
evaluate F1 statistics, KIEval Group F1 evaluates the entire group as a unit
of information. It should be noted that group-level evaluation is conducted on
all but the first element of $G$ (i.e. $G'$), as the first element represents the non-group
entities.

$$G' = G \setminus (n_1, m_1) \tag{6}$$

$$TP^{group} = \sum_{(n,m) \in G'} \mathbb{1}\left[ S^{e}_{(n,m)} = N_e(gt_m) = N_e(pr_n) \;\; \forall e \right] \tag{7}$$

Eq. 7 shows the formulation of the group-level True-Positive measure, which counts
identical pairs of prediction and ground-truth groups in $G'$. In the equation,
$\mathbb{1}[\cdot]$ indicates a binary operator providing 1 when the predicted and
ground-truth groups are identical. FN and FP are calculated by counting the remaining
ground-truth and predicted groups, respectively. Finally, KIEval Group F1 can
be obtained in the same precision and recall fashion. Based on the formal
definitions of KIEval Entity F1 and KIEval Group F1 above, both formulations
aim to incorporate structural relation assessment in evaluation at the entity and
group level, respectively.
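Eq. 7 and the resulting group-level F1 can be sketched as below. The function name is hypothetical, and the matching `G` is taken as an input on the assumption that it was produced by the group-matching step and already excludes the non-group entry. The example reuses the ground truth of Fig. 5 (top), with one price altered purely for illustration.

```python
from collections import Counter

def kieval_group_f1(PR, GT, G):
    """Group-level F1 over matched pairs G (non-group entry excluded).

    A pair counts as a true positive only if both groups hold exactly
    the same multiset of (entity_type, value) tuples, as in Eq. 7.
    """
    tp = sum(1 for n, m in G if Counter(PR[n]) == Counter(GT[m]))
    fp = len(PR) - tp   # remaining predicted groups
    fn = len(GT) - tp   # remaining ground-truth groups
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def group(price, count, name):
    return [("Menu.price", price), ("Menu.count", count), ("Menu.name", name)]

gt = [group("80,000", "4", "SIAO MAI BABI"),
      group("60,000", "3", "CEKER AYAM"),
      group("42,000", "2", "BAKPAO BKR C CRISPY")]
# Hypothetical prediction: the second group has a misread price.
pred = [group("80,000", "4", "SIAO MAI BABI"),
        group("6,000", "3", "CEKER AYAM"),
        group("42,000", "2", "BAKPAO BKR C CRISPY")]
G = [(0, 0), (1, 1), (2, 2)]

score = kieval_group_f1(pred, gt, G)
print(score)
```

A single wrong value invalidates its whole group here, so two of three groups count as correct, which is the stricter unit of information the group-level evaluation is meant to capture.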
4.2 Aligned Metric Formulation
While the distinction between a model’s erroneous prediction and missed prediction as FP
and FN in metric formulation could be well-suited from the model-centric point of
view, its misalignment from the standpoint of industrial application has
motivated the formulation of our metric. KIEval’s application-centric design
addresses this misalignment by conceptualizing KIE errors as correction costs
incurred in application settings. Correction refers to one of the three editing steps:
substitution, addition, and deletion of prediction values to match the ground-truth.
For an entity $e$ at the matching condition $(n, m) \in G$, the steps can be
defined in terms of FN and FP as follows:

$$Subs^{e}_{(n,m)} = \min\left(FP^{e}_{(n,m)}, FN^{e}_{(n,m)}\right) \tag{8}$$

$$Add^{e}_{(n,m)} = FN^{e}_{(n,m)} - Subs^{e}_{(n,m)} \tag{9}$$

$$Del^{e}_{(n,m)} = FP^{e}_{(n,m)} - Subs^{e}_{(n,m)} \tag{10}$$

As can be seen, the substitution is defined as the minimum of $FP^{e}_{(n,m)}$
and $FN^{e}_{(n,m)}$, which indicates the number of predictions that require
modifications to match the corresponding ground-truth values. The addition and deletion
are the numbers of remaining $FN^{e}_{(n,m)}$ and $FP^{e}_{(n,m)}$, respectively. The number
of errors, $Error^{e}_{(n,m)} = Subs^{e}_{(n,m)} + Add^{e}_{(n,m)} + Del^{e}_{(n,m)}$, is
represented by summing the three error corrections. The total number of errors can be
defined as follows:

$$Error = \sum_{(n,m) \in G} \sum_{e} Error^{e}_{(n,m)} + \underbrace{\sum_{(*,m) \notin G} \sum_{e} N_e(gt_m)}_{\text{Add unmatched gt}} + \underbrace{\sum_{(n,*) \notin G} \sum_{e} N_e(pr_n)}_{\text{Del unmatched pr}} \tag{11}$$

[Figure 4: sample document images.]
Fig. 4: Sample images from the SROIE (left), CORD (center), and FUNSD
(right) datasets.
Here, the first term on the right-hand side indicates the number of error corrections
in the group match condition $G$, and the second and third terms represent
the number of additions and deletions, respectively, for the non-matched groups.
Finally, KIEval Aligned is calculated with the $Error$ and the number of correct values,
$TP$. The following equation shows the formulation:

$$\mathrm{KIEval}_{Aligned} = \frac{TP^{entity}}{TP^{entity} + Error} \tag{12}$$

KIEval Aligned not only better aligns with industrial applications, but also
benefits from high interpretability due to its formulation in terms of well-known
F1 components: TP, FP, and FN.
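Eqs. 8 to 12 can be traced in code as follows. This is a sketch with a hypothetical function name; the matching `G` is passed in and assumed to come from the group-matching step, and returning 0.0 for completely empty inputs is a convention chosen here, not something the paper specifies. The data are the ground truth and Prediction 2 of Fig. 5 (top).

```python
from collections import Counter

def kieval_aligned(PR, GT, G):
    """TP / (TP + Error) with Error counted in edit operations."""
    tp = error = 0
    for n, m in G:
        n_pr = Counter(t for t, _ in PR[n])     # N_e(pr_n)
        n_gt = Counter(t for t, _ in GT[m])     # N_e(gt_m)
        s = Counter()                           # S^e_(n,m)
        for (t, _), c in (Counter(PR[n]) & Counter(GT[m])).items():
            s[t] += c
        for e in set(n_pr) | set(n_gt):
            fn, fp = n_gt[e] - s[e], n_pr[e] - s[e]
            subs = min(fp, fn)                  # Eq. 8
            add, delete = fn - subs, fp - subs  # Eqs. 9 and 10
            tp += s[e]
            error += subs + add + delete
    matched_pr = {n for n, _ in G}
    matched_gt = {m for _, m in G}
    # Eq. 11: unmatched ground-truth groups are added in full,
    # unmatched predicted groups are deleted in full.
    error += sum(len(g) for m, g in enumerate(GT) if m not in matched_gt)
    error += sum(len(p) for n, p in enumerate(PR) if n not in matched_pr)
    return tp / (tp + error) if tp + error else 0.0   # Eq. 12

def group(price, count, name):
    return [("Menu.price", price), ("Menu.count", count), ("Menu.name", name)]

gt = [group("80,000", "4", "SIAO MAI BABI"),
      group("60,000", "3", "CEKER AYAM"),
      group("42,000", "2", "BAKPAO BKR C CRISPY")]
pred2 = [group("60,000", "2", "SIAO MAI BABI"),
         group("80,000", "3", "BAKPAO BKR C CRISPY"),
         group("42,000", "4", "CEKER AYAM")]

full = kieval_aligned(pred2, gt, [(0, 0), (1, 1), (2, 2)])
# Dropping the last predicted group leaves one ground-truth group unmatched.
partial = kieval_aligned(pred2[:2], gt, [(0, 0), (1, 1)])
print(full, partial)
```

In the full case each matched pair contributes one correct value and two substitutions, giving 3 / (3 + 6); in the partial case the three entities of the unmatched ground-truth group are counted as additions, giving 2 / (2 + 7).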
5 Experiment Settings
5.1 Datasets
Experiments were conducted with the KIEval metric on models trained using
three widely used benchmark datasets in the Document KIE task, namely
SROIE, CORD and FUNSD, shown in Fig. 4.
SROIE dataset refers to the dataset introduced in task 3 of the Scanned receipts
OCR and information extraction challenge of ICDAR 2019¹. This dataset
comprises 626 train and 347 test receipt images, requiring participants’ models
to extract key-value pairs of 4 entities: Company, Date, Address and Total price
from these images.
¹ https://rrc.cvc.uab.es/?ch=13
Consolidated Receipt Dataset (CORD) [16] comprises receipt images
from shops and restaurants, designed for the task of extracting grouped entities.
The dataset’s annotation consists of 30 entities, which are categorized into 4
groups: Menu, Void menu, Subtotal, and Total. Entities within each group are
contextually linked, such as Menu.name, Menu.price and Menu.quantity. There
are 800 training, 100 validation and 100 testing images.
Form Understanding in Noisy Scanned Documents (FUNSD) [11] consists
of 149 training and 50 test documents, which are noisy, scanned, and have various
layouts. This dataset is composed of three entities: Header, Question, and
Answer. FUNSD, unlike the aforementioned datasets, allows each entity to hold multiple
values within the same document image. To maintain consistency with prior
works on FUNSD, all entities are regarded as non-group in our experiments.
5.2 Document KIE Models
Current works on Document KIE can be largely categorized into two frameworks:
sequence labeling and generative frameworks.
Prior works in the sequence labeling framework adopt a tagging-based approach
to extract key-value pairs from document images. In detail, with reference
to the CORD sample image in Fig. 4, OCR is first applied to extract texts
such as “Vt Pep Mocha” before tokenizing it into “Vt”, “Pep” and “Mocha”.
The KIE model then processes these tokens, often conditioned on layout
and image information, to provide token-wise label (e.g. BIO tag) classifications
such as “B-Menu.name”, “I-Menu.name”, and “I-Menu.name” for the example
text, respectively. Tokenized texts along with their corresponding token-level
tags are then postprocessed to form the final key-value pairs (e.g. Menu.name:
“Vt Pep Mocha”). Representative works in this framework include the LayoutLM
family [24,23,25,9], StructuralLM [14], BROS [8], LiLT [21], and DocFormer [1].
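The postprocessing step described above, turning token-level BIO tags into key-value pairs, can be sketched as follows. This is a generic BIO decoder written for illustration; the function name `decode_bio` and the choice to treat a stray I- tag as the start of a new span are assumptions, and individual models may postprocess differently. The extra “4.95” token is added here only to show a second entity.

```python
def decode_bio(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, value) pairs."""
    entities = []
    current_type, current_tokens = None, []

    def flush():
        if current_tokens:
            entities.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A B- tag, or an I- tag that does not continue the open span,
            # starts a new entity.
            flush()
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" (outside) closes any open span.
            flush()
            current_type, current_tokens = None, []
    flush()
    return entities

tokens = ["Vt", "Pep", "Mocha", "4.95"]
tags = ["B-Menu.name", "I-Menu.name", "I-Menu.name", "B-Menu.price"]
pairs = decode_bio(tokens, tags)
print(pairs)
```

Note that the decoder only yields a flat list of pairs; which Menu.name belongs with which Menu.price still has to be recovered by a separate grouping step.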
Generative framework based models conduct KIE from document images by
directly generating the key-value pairs as text. Taking the same CORD example
in Fig. 4, generative KIE models generate a text sequence of key-value pairs such
as {Menu.name: “Vt Pep Mocha”, Menu.price: “4.95”}. OCR information can
also be provided as auxiliary input to these KIE models. Notable generation
methods include TILT [18], Donut [12], and Pix2Struct [13], where ResNet [7]
or ViT [4] is commonly used as the image encoder and a Transformer decoder [20] as the
text decoder.
In this work, we conduct experiments using the LayoutXLM [25] and LayoutLMv3 [9]
models for the sequence labeling framework, and the Donut [12] model for
the generative framework. Given recent advancements in large language model
(LLM) applications for document intelligence, we also conduct zero-shot LLM-based
KIE experiments with GPT-4o [15], Qwen2-VL [22] and InternVL 2.5 [3].
These experiments demonstrate how KIEval can provide additional insights into
LLM evaluation within the KIE context.
5.3 Grouping Information
With prior KIE models mainly designed for KIE at the entity level, we adopt a
simple methodology to extract grouping information prior to KIEval evaluations.
For models of the sequence labeling framework, a simple slot filling method is
adopted for grouping. To elaborate, given the set of entity-types constituting a
group (e.g. Menu.name, Menu.price, ... in CORD’s Menu group), the KIE model’s
outputs of these entity-types are sequentially filled in a slot filling manner to
form groups. While different approaches for grouping extraction can be adopted,
such as relation extraction [25] or a graph-based method [10] on top of the KIE
models, a simple grouping method was employed for the purpose of assessing the
effectiveness of the KIEval metric. For text-generation based models, group-level
information can be extracted by simple structuring of the target key-value pair
text sequence, such as JSON format strings.
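One plausible reading of the slot filling procedure is sketched below; the paper does not spell out the exact rule, so the behaviour here (start a new group whenever the incoming entity-type already occupies a slot in the current group) is an assumption, as are the function name, the reading order of the tagged entities, and the JSON layout used for the generative case.

```python
import json

def slot_fill(entities, group_types):
    """Sequentially assign (entity_type, value) pairs to groups.

    Assumed rule: a group is closed as soon as an entity arrives whose
    slot in the current group is already filled.
    """
    groups, current = [], {}
    for entity_type, value in entities:
        if entity_type not in group_types:
            continue                      # not part of this group definition
        if entity_type in current:        # slot taken: close the group
            groups.append(current)
            current = {}
        current[entity_type] = value
    if current:
        groups.append(current)
    return groups

# Sequence-labeling output in (assumed) reading order.
tagged = [("Menu.name", "SIAO MAI BABI"), ("Menu.count", "4"), ("Menu.price", "80,000"),
          ("Menu.name", "CEKER AYAM"), ("Menu.count", "3"), ("Menu.price", "60,000")]
groups = slot_fill(tagged, {"Menu.name", "Menu.count", "Menu.price"})
print(groups)

# Generative output: groups are read directly from the structured string
# (the JSON schema below is hypothetical).
generated = '{"Menu": [{"name": "Vt Pep Mocha", "price": "4.95"}]}'
parsed = json.loads(generated)["Menu"]
print(parsed)
```

A rule of this kind is sensitive to missing values: if one Menu.price is not extracted, later prices shift into the wrong groups, which is the type of structural error the group-aware metrics are designed to expose.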
5.4 Experiment Details
All sequence labeling and generation models were trained for 1,000 steps with
a batch size of 16. The initial learning rate was set to 5e-5, along with linear
learning rate decay. We used the provided OCR annotations along with images
for experiments involving LayoutXLM [25] and LayoutLMv3 [9], while only the
document image was provided for Donut [12] experiments. For reproducibility, all
experiments were conducted using the models and datasets uploaded to Hugging
Face Models and Datasets². Details can be found in Appendix A. For multimodal
LLMs, all experiments were conducted in a zero-shot setting, and the prompts used
can be found in Appendix B.
6 Results and Discussion
6.1 Structured Evaluation
The conventional Entity F1 metric fails to accurately represent the KIE model’s
performance due to the absence of structural relation consideration. Fig. 5 (top)
illustrates conceptual examples with corresponding metric scores across Entity
F1, KIEval Entity F1 and KIEval Group F1. In Fig. 5 (top), while both Prediction
1 and 2 display accurate entity-level key-value pair extractions, contextual
relations (grouping) between different key-value pairs are not well-captured in
Prediction 2. Such observations are not well-reflected in the conventional Entity
F1 metric, which scores 1.0 for both predictions.
In industrial applications where both the key-value and con textual linkage
information need to be extracted (if present), Entity F1’s i nsensitivity towards
the latter could lead to sub-optimal reflection of the KIE mod el’s performance
especially in RPA applications. KIEval Entity F1 and KIEval Group F1, on
the contrary, provide distinct evaluations across the two p redictions by taking
2https://huggingface.co
Page 12:
12 Minsoo Khang, Sang Chul Jung, Sungrae Park, and Teakgyu Ho ng
Menu.
priceMenu.
countMenu.
name
80,000 4 SIAO MAI BABI
60,000 3 CEKER AYAM
42,000 2 BAKPAO BKR C
CRISPYGround-truth
Menu.
priceMenu.
countMenu.
name
80,000 4 SIAO MAI BABI
60,000 3 CEKER AYAM
42,000 2 BAKPAO BKR C
CRISPYPr e
Menu.
priceMenu.
countMenu.
name
60,000 2 SIAO MAI BABI
80,000 3 BAKPAO BKR C
CRISPY
42,000 4 CEKER AYAMPr e
Prediction 1: Entity F1 = 1, KIEval Entity F1 = 1, KIEval Group F1 = 1
Gr o
-truth Pre
Menu.
priceMenu.
countMenu.
name
79,000 1 BLACK PEPPER MEATBALL PAS
77,000 1 TRUFFLE CREAM
59,000 1 EARL GREY MILK TEA
Menu.Price: Entity F1 = 1, KIEval Entity F1 = 0 Menu.
priceMenu.
countMenu.
name
FOOD
77,000 1 BLACK PEPPER MEATBALL PAS
59,000 1 TRUFFLE CREAM
215,000 1 EARL GREY MILK TEAPrediction 2: Entity F1 = 1, KIEval Entity F1 = 1/3, KIEval Group F1 = 0
Fig. 5: Examples illustrating the difference between Entity F1 and KIEval. The
above scenario is constructed to showcase metric dispariti es, whereas the scenario
below is based on real prediction result from the Donut model .
LayoutXLM LayoutLMv3 Donut
SROIE CORD FUNSD SROIE CORD FUNSD SROIE CORD
Entity F1 91.77 95.43 84.02 91.87 95.13 85.87 83.85 84.93
nTED 97.24 94.86 61.96 96.91 94.43 69.36 96.17 90.62
KIEval Entity F1 91.77 92.88 84.02 91.87 91.84 85.87 83.85 84 .47
KIEval Group F1 - 82.68 - - 82.11 - - 68.26
KIEval Aligned 90.32 89.02 79.22 91.15 88.15 80.22 83.57 79.70
Table 1: Comparison of Entity F1, nTED, and KIEval. When group entities are
absent, Entity F1 and KIEval Entity F1 yield identical values. Note: Donut
displays substantially lower performance than the other models because it relies
solely on image input, whereas the other models use ground-truth OCR annotations.
into account structural relations in their formulations. In KIEval Entity F1,
despite error-free extraction of key-value pairs for each entity type, Prediction
2 is penalized for its grouping errors, resulting in a score of 1/3. Similarly, in
KIEval Group F1, where each group is treated as a single unit of information
instead of as key-value pairs, Prediction 2 is evaluated as completely incorrect,
which is not discernible from the Entity F1 metric.
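The scores quoted for Prediction 2 in Fig. 5 can be reproduced with a small sketch. It is not the authors' implementation: the function names are hypothetical, and the group matching is assumed to be a brute-force one-to-one assignment that maximizes the number of shared key-value pairs, with equally many predicted and ground-truth groups. The matching procedure defined earlier in the paper may differ in detail.

```python
from collections import Counter
from itertools import permutations


def f1(tp, n_pred, n_gt):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gt
    return 2 * precision * recall / (precision + recall)


def flat_entity_f1(pred_groups, gt_groups):
    """Entity F1 over key-value pairs, ignoring which group a pair is in."""
    pred = Counter(kv for group in pred_groups for kv in group.items())
    gt = Counter(kv for group in gt_groups for kv in group.items())
    tp = sum((pred & gt).values())
    return f1(tp, sum(pred.values()), sum(gt.values()))


def shared_pairs(pred_group, gt_group):
    """Number of key-value pairs that two groups have in common."""
    return sum(1 for k, v in pred_group.items() if gt_group.get(k) == v)


def grouped_entity_f1(pred_groups, gt_groups):
    """Entity F1 where a pair only counts inside its matched group.

    Assumption: equal group counts and brute-force one-to-one matching.
    """
    tp = max(
        sum(shared_pairs(p, g) for p, g in zip(perm, gt_groups))
        for perm in permutations(pred_groups)
    )
    n_pred = sum(len(g) for g in pred_groups)
    n_gt = sum(len(g) for g in gt_groups)
    return f1(tp, n_pred, n_gt)


def group_f1(pred_groups, gt_groups):
    """F1 where a group is correct only if every pair in it is correct."""
    pred = Counter(frozenset(g.items()) for g in pred_groups)
    gt = Counter(frozenset(g.items()) for g in gt_groups)
    tp = sum((pred & gt).values())
    return f1(tp, len(pred_groups), len(gt_groups))


def row(price, count, name):
    return {"Menu.price": price, "Menu.count": count, "Menu.name": name}


# Values taken from the top panel of Fig. 5.
ground_truth = [
    row("80,000", "4", "SIAO MAI BABI"),
    row("60,000", "3", "CEKER AYAM"),
    row("42,000", "2", "BAKPAO BKR C CRISPY"),
]
prediction_2 = [
    row("60,000", "2", "SIAO MAI BABI"),
    row("80,000", "3", "BAKPAO BKR C CRISPY"),
    row("42,000", "4", "CEKER AYAM"),
]

print(flat_entity_f1(prediction_2, ground_truth))
print(grouped_entity_f1(prediction_2, ground_truth))
print(group_f1(prediction_2, ground_truth))
```

Every value in Prediction 2 appears somewhere in the ground truth, so the flat score is 1. Each predicted group shares exactly one pair with any ground-truth group, so the group-aware entity score is 3/9 = 1/3 and no group is entirely correct.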
Fig. 5 (bottom) depicts a sampled inference result of the Donut (generation
KIE) model. Despite accurate extraction of the Menu.name key-values, their
contextual linkage with other entity types is misaligned, possibly due to the
tilt of the receipt image. The conventional Entity F1 score of the Menu.price
entity does not reflect this error and assigns a full score of 1.0, unlike
KIEval Entity F1, which penalizes the prediction accordingly.
Evaluation results for different metrics across all models and datasets are
shown in Table 1. For nTED, its soft-match approach inaccurately compares
                    Donut   GPT-4o  Qwen2-VL  InternVL 2.5
Entity F1           84.93   73.56   77.07     54.99
KIEval Entity F1    84.47   72.93   77.07     54.54
Difference          0.46    0.63    0.00      0.45
Table 2: Comparison of Entity F1 and KIEval Entity F1 across generative models,
including multimodal LLMs, on the CORD dataset. All multimodal LLMs are
evaluated in a zero-shot setting. The difference between Entity F1 and KIEval
Entity F1 serves to highlight the information structure awareness of LLMs in KIE.
[Fig. 6 image: ground-truth and GPT-4o prediction tables with columns Menu.price, Menu.cnt, and Menu.name, annotated with the sample-level Entity F1 and KIEval Entity F1.]
Fig. 6: Sample from the CORD dataset illustrating the performance gap between Entity
F1 and KIEval Entity F1 for GPT-4o, highlighting the importance of evaluating structure
awareness on top of the existing KIE metric.
KIE performance, as seen in LayoutXLM and LayoutLMv3 on SROIE, where its
trend differs from that of Entity F1 and KIEval Entity F1. For Entity F1, the differences
from KIEval Entity F1 are prominent on the CORD dataset, where contextual
links (grouping) between entities are present. KIEval Entity F1 is consistently
lower than Entity F1 on CORD across all models, while the two scores are
identical on SROIE and FUNSD. This discrepancy highlights the overestimation
of KIE model performance when structural relations are ignored, while
the metric converges to Entity F1 on datasets without grouping.
The discrepancy is also evident in multimodal LLM evaluation, as shown in
Table 2. The difference between Entity F1 and KIEval Entity F1 provides deeper
insight into an LLM's ability to group correctly extracted information into
the expected semantic structures. Based on the CORD results, Qwen2-VL outperforms
the other zero-shot multimodal LLMs not only in extraction but also in grouping
this information accurately.
Fig. 6 shows an example where GPT-4o correctly extracts key information but
groups it into an incorrect structure, showcasing KIEval's utility in offering a
new perspective for assessing LLMs in KIE.
6.2 Metrics from the Correction Cost Perspective
As previously discussed in Fig. 3, the disparity in the conceptualization of KIE
errors (either as {FP, FN} or as correction cost) results in an assessment of KIE
              LayoutXLM               LayoutLMv3              Donut
              SROIE   CORD    FUNSD   SROIE   CORD    FUNSD   SROIE   CORD
FP + FN       231     190     637     226     218     566     447     404
Subs          93      37      198     102     53      142     219     124
Add           7       59      126     9       56      136     9       78
Del           38      57      115     13      56      146     0       78
Correction    138     153     439     124     165     424     228     280
Table 3: Comparison of FP + FN and Correction (Subs + Add + Del) statistics.
Both Add and Del indicate the sum of counts within the matched and unmatched
groups. Note: Correction refers to the number of correction steps taken.
Fig. 7: In CORD, Donut's KIEval Aligned and $\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$ in relation to varying
confidence score thresholds, $\tau$, alongside the corresponding automation
rates, $\text{auto-rate}^{\tau}$. An increase in the confidence score threshold leads to an increase
in $\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$, while the automation rate decreases due to the rising number
of entities requiring human revision.
models that is misaligned with industrial applications. Table 3 shows a distinct
gap between the FP + FN and Correction Cost values that is consistent across all
models and datasets. Our work brings this discrepancy to light and proposes the
KIEval Aligned formulation to better align KIE evaluation with application settings.
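The gap in Table 3 can be checked arithmetically. The sketch below copies the counts from the table and verifies two things: that Correction equals Subs + Add + Del, as the caption states, and that every FP + FN value equals 2 × Subs + Add + Del. The second relation is an observation drawn from the numbers themselves, consistent with a substitution being counted as one false positive plus one false negative; it is an inference here rather than a statement taken from this section. The dictionary layout and function names are illustrative.

```python
# (model, dataset) -> counts copied from Table 3
TABLE_3 = {
    ("LayoutXLM", "SROIE"): {"fp_fn": 231, "subs": 93, "add": 7, "del": 38, "correction": 138},
    ("LayoutXLM", "CORD"): {"fp_fn": 190, "subs": 37, "add": 59, "del": 57, "correction": 153},
    ("LayoutXLM", "FUNSD"): {"fp_fn": 637, "subs": 198, "add": 126, "del": 115, "correction": 439},
    ("LayoutLMv3", "SROIE"): {"fp_fn": 226, "subs": 102, "add": 9, "del": 13, "correction": 124},
    ("LayoutLMv3", "CORD"): {"fp_fn": 218, "subs": 53, "add": 56, "del": 56, "correction": 165},
    ("LayoutLMv3", "FUNSD"): {"fp_fn": 566, "subs": 142, "add": 136, "del": 146, "correction": 424},
    ("Donut", "SROIE"): {"fp_fn": 447, "subs": 219, "add": 9, "del": 0, "correction": 228},
    ("Donut", "CORD"): {"fp_fn": 404, "subs": 124, "add": 78, "del": 78, "correction": 280},
}


def correction_cost(counts):
    """Number of correction steps: one per substitution, addition, or deletion."""
    return counts["subs"] + counts["add"] + counts["del"]


def fp_fn_from_edits(counts):
    """FP + FN if a substitution is counted as one FP plus one FN (inferred)."""
    return 2 * counts["subs"] + counts["add"] + counts["del"]


for (model, dataset), counts in TABLE_3.items():
    saved = counts["fp_fn"] - correction_cost(counts)
    print(f"{model:<11}{dataset:<6} FP+FN={counts['fp_fn']:<4} "
          f"Correction={correction_cost(counts):<4} gap={saved}")
```

Under this reading the gap between the two error counts is exactly the number of substitutions, which is why it is largest where substitutions dominate, such as Donut on SROIE.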
7 KIE Evaluation for RPA System
In addition to the inclusion of structural relations and the alignment of the metric
formulation, human correction is a distinct factor that warrants
attention when evaluating KIE models in RPA systems. Irrespective of the KIE
model's training, it is improbable that it will consistently achieve error-free extraction
across a diverse range of documents. For this reason,
human correction (correction by human intervention) is commonly adopted
by RPA systems. Human correction, however, requires a method for selecting
a subset of predictions, as verifying and correcting all extracted information is
impractical and undermines the very goal of automation in RPA.
Existing RPA systems commonly adopt confidence-score-based correction,
where information extracted with confidence below a specific threshold (i.e.
uncertain) is selected for verification and correction (if necessary). Selection of
the optimal threshold value is an application-specific decision that differs from one
RPA system to another, contingent on the system's inclination to trade off
automation rate for KIE performance. In this work, we demonstrate this trade-off
analysis with KIEval and highlight its added insights over prior metrics.
We propose a method to analyse this trade-off in terms of post-correction KIE
performance, automation rate, and confidence score threshold value, $\tau$. We first
define the automation rate, $\text{auto-rate}^{\tau}$, which reflects the proportion of model
predictions processed without human verification, interpreted as the share of
entities with a confidence score higher than $\tau$. Post-correction KIE performance,
$\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$, denotes the final KIE performance after KIE predictions with
confidence scores below $\tau$ are verified and corrected by humans. A formal definition
of these two formulations is provided in Appendix C.
Fig. 7 presents $\text{auto-rate}^{\tau}$ and $\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$ as a function of the
confidence score threshold, $\tau$, for Donut on CORD. The trade-off trend
depicted in Fig. 7 indicates that, as the threshold value increases, the amount
of extracted information requiring human review increases, leading to a higher
post-correction KIE score at the cost of a reduced automation rate. Incorporating
such trade-off analysis in the evaluation of KIE models not only provides deeper
insights but also enables stakeholders to conduct cost-benefit evaluations
effectively and determine the optimal threshold value for their RPA system.
8 Conclusion
In this work, we bring to light the discrepancies between the existing
Document KIE evaluation metrics and the key considerations of industrial
settings, such as RPA systems. We identify the challenges behind these
discrepancies and propose KIEval, a metric formulated with an application-centric design.
Specifically, KIEval leverages group matching between the predictions and
ground-truth groupings to integrate structural relations into KIE evaluation,
differentiating itself from prior metrics that lack grouping awareness.
Additionally, KIEval formulates KIE errors in terms of the corrections incurred
in automation systems (i.e. Substitution, Addition, or Deletion), further bridging
the gap between the evaluation metric and industrial settings. The experiments
not only verify these discrepancies in existing metrics but also show how KIEval
provides a different perspective on KIE model evaluation from the industrial
application's standpoint. In addition, we demonstrate an
application use-case scenario that illustrates the valuable insights which the
trade-off analysis brings to RPA systems. This aspect has been overlooked in
prior Document KIE metrics. We believe that KIEval could serve as a standard
evaluation metric for various KIE tasks and encourage the research community
to focus on solving the remaining challenges in KIE tasks with the industrial
application in mind.
References
1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: Docformer: End-to-end
transformer for document understanding. In: Proceedings of the IEEE/CVF
international conference on computer vision. pp. 993–1003 (2021)
2. Biten, A.F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Mathew, M., Jawahar, C.,
Valveny, E., Karatzas, D.: Icdar 2019 competition on scene text visual question
answering. In: 2019 International Conference on Document Analysis and Recognition
(ICDAR). pp. 1563–1570. IEEE (2019)
3. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian,
H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal
models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271
(2024)
4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth
16x16 words: Transformers for image recognition at scale. In: International
Conference on Learning Representations (2020)
5. Garncarek, Ł., Powalski, R., Stanisławek, T., Topolski, B., Halama, P., Turski,
M., Graliński, F.: Lambert: Layout-aware language modeling for information
extraction. In: International Conference on Document Analysis and Recognition. pp.
532–547. Springer (2021)
6. He, J., Wang, L., Hu, Y., Liu, N., Liu, H., Xu, X., Shen, H.: Icl-d3ie: In-context
learning with diverse demonstrations updating for document information
extraction. arXiv preprint arXiv:2303.05063 (2023)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
8. Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: Bros: A pre-trained
language model focusing on text and layout for better key information extraction
from documents. In: Proceedings of the AAAI Conference on Artificial Intelligence.
vol. 36, pp. 10767–10775 (2022)
9. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: Layoutlmv3: Pre-training for
document ai with unified text and image masking. In: Proceedings of the 30th ACM
International Conference on Multimedia. pp. 4083–4091 (2022)
10. Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for
semi-structured document information extraction. arXiv preprint arXiv:2005.00642
(2020)
11. Jaume, G., Ekenel, H.K., Thiran, J.P.: Funsd: A dataset for form understanding in
noisy scanned documents. In: 2019 International Conference on Document Analysis
and Recognition Workshops (ICDARW). vol. 2, pp. 1–6. IEEE (2019)
12. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S.,
Han, D., Park, S.: Ocr-free document understanding transformer. In: European
Conference on Computer Vision. pp. 498–517. Springer (2022)
13. Lee, K., Joshi, M., Turc, I.R., Hu, H., Liu, F., Eisenschlos, J.M., Khandelwal, U.,
Shaw, P., Chang, M.W., Toutanova, K.: Pix2struct: Screenshot parsing as
pretraining for visual language understanding. In: International Conference on Machine
Learning. pp. 18893–18912. PMLR (2023)
14. Li, C., Bi, B., Yan, M., Wang, W., Huang, S., Huang, F., Si, L.: Structurallm:
Structural pre-training for form understanding. In: Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
pp. 6309–6318 (2021)
15. OpenAI: Gpt-4v(ision) system card (2023)
16. Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: Cord: a consolidated
receipt dataset for post-ocr parsing. In: Workshop on Document Intelligence at
NeurIPS 2019 (2019)
17. Peng, Q., Pan, Y., Wang, W., Luo, B., Zhang, Z., Huang, Z., Hu, T., Yin, W.,
Chen, Y., Zhang, Y., et al.: Ernie-layout: Layout knowledge enhanced pre-training
for visually-rich document understanding. arXiv preprint arXiv:2210.06155 (2022)
18. Powalski, R., Borchmann, Ł., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Pałka,
G.: Going full-tilt boogie on document understanding with text-image-layout
transformer. In: Document Analysis and Recognition–ICDAR 2021: 16th International
Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II
16. pp. 732–747. Springer (2021)
19. Tito, R., Mathew, M., Jawahar, C., Valveny, E., Karatzas, D.: Icdar 2021
competition on document visual question answering. In: Document Analysis and
Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland,
September 5–10, 2021, Proceedings, Part IV 16. pp. 635–649. Springer (2021)
20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser,
Ł., Polosukhin, I.: Attention is all you need. Advances in neural information
processing systems 30 (2017)
21. Wang, J., Jin, L., Ding, K.: Lilt: A simple yet effective language-independent layout
transformer for structured document understanding. In: Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers). pp. 7747–7757 (2022)
22. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J.,
Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J.,
Lin, J.: Qwen2-vl: Enhancing vision-language model's perception of the world at
any resolution. arXiv preprint arXiv:2409.12191 (2024)
23. Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C.,
Che, W., et al.: Layoutlmv2: Multi-modal pre-training for visually-rich document
understanding. arXiv preprint arXiv:2012.14740 (2020)
24. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: Layoutlm: Pre-training of
text and layout for document image understanding. In: Proceedings of the 26th
ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
pp. 1192–1200 (2020)
25. Xu, Y., Lv, T., Cui, L., Wang, G., Lu, Y., Florencio, D., Zhang, C., Wei, F.:
Xfund: A benchmark dataset for multilingual visually rich form understanding. In:
Findings of the Association for Computational Linguistics: ACL 2022. pp. 3214–
3224 (2022)
26. Yu, W., Zhang, C., Cao, H., Hua, W., Li, B., Chen, H., Liu, M., Chen, M., Kuang,
J., Cheng, M., et al.: Icdar 2023 competition on structured text extraction from
visually-rich document images. arXiv preprint arXiv:2306.03287 (2023)
27. Zhang, J., You, Z., Wang, J., Le, X.: Sail: Sample-centric in-context learning for
document information extraction. arXiv preprint arXiv:2412.17092 (2024)
Supplementary - KIEval: Evaluation Metric
for Document Key Information Extraction
Minsoo Khang, Sang Chul Jung, Sungrae Park, and Teakgyu Hong
Upstage AI, South Korea
{mkhang, eric, sungrae.park, tghong}@upstage.ai
A Datasets
The following table provides details on the datasets and models used in the
experiments conducted in this paper. All datasets and models are available on
Hugging Face to ensure experimental reproducibility.
          LayoutXLM                  LayoutLMv3                  Donut
Models    microsoft/layoutxlm-base   microsoft/layoutlmv3-base   naver-clova-ix/donut-base
SROIE     darentang/sroie (LayoutXLM and LayoutLMv3)             podbilabs/sroie-donut
CORD      nielsr/cord-layoutlmv3 (LayoutXLM and LayoutLMv3)      naver-clova-ix/cord-v2
FUNSD     nielsr/funsd-layoutlmv3 (LayoutXLM and LayoutLMv3)     -
Table 1: Hugging Face Models and Datasets used in the experiments.
B Multimodal LLM prompt for CORD KIE
The following prompt is used when experimenting with GPT-4o, Qwen2-VL and
InternVL 2.5 for KIE on the CORD dataset. The arrow symbol ֒→ marks
line wrapping in the following text.
You will be provided with a receipt as an image.
Your task is to analyze the receipt carefully and extract key
information from it.
The entities to be extracted along with their descriptions are
provided below.
| Category | Sub-Category (if applicable) | Entity | Description |
| --- | --- | --- | --- |
| menu | (not applicable) | menu.cnt | quantity of menu |
| | (not applicable) | menu.discountprice | discounted price
of menu |֒→
| | (not applicable) | menu.etc | others |
| | (not applicable) | menu.itemsubtotal | price of each menu
after discount applied |֒→
| | (not applicable) | menu.nm | name of menu |
| | (not applicable) | menu.num | identification # of menu |
| | (not applicable) | menu.price | total price of menu |
| | sub | menu.sub_cnt | quantity of submenu |
| | sub | menu.sub_nm | name of submenu |
| | sub | menu.sub_price | total price of submenu |
| | sub | menu.sub_unitprice | unit price of submenu |
| | (not applicable) | menu.unitprice | unit price of menu |
| | (not applicable) | menu.vatyn | whether the price
includes tax or not |֒→
| sub_total | (not applicable) | sub_total.discount_price |
discounted price in total |֒→
| | (not applicable) | sub_total.etc | others |
| | (not applicable) | sub_total.service_price | service
charge |֒→
| | (not applicable) | sub_total.subtotal_price |
subtotal price |֒→
| | (not applicable) | sub_total.tax_price | tax amount
|֒→
| total | (not applicable) | total.cashprice | amount of price
paid in cash |֒→
| | (not applicable) | total.changeprice | amount of change
in cash |֒→
| | (not applicable) | total.creditcardprice | amount of
price paid in credit/debit card |֒→
| | (not applicable) | total.emoneyprice | amount of price
paid in emoney, point |֒→
| | (not applicable) | total.menuqty_cnt | total count of
quantity |֒→
| | (not applicable) | total.menutype_cnt | total count of
type of menu |֒→
| | (not applicable) | total.total_etc | others |
| | (not applicable) | total.total_price | total price |
Each entity (e.g. menu.cnt) is part of a category (e.g. menu).
You are to extract the entities from the receipt and return in the
following format:֒→
```json
{{
    "menu": <dictionary or list of dictionaries>,
    "sub_total": <dictionary or list of dictionaries>,
    "total": <dictionary or list of dictionaries>
}}
```
Note the following characteristics:
1. All entities falling under the same category should be grouped
together (represented as a dictionary, such as
{total.cashprice, total.changeprice, ...}).֒→
֒→
2. If there are multiple entities of the same category, they
should be represented as a list of dictionaries.֒→
3. If an entity is not present in the receipt, it should be
excluded from the dictionary.֒→
4. Each of the entity's value should either be a string or a list
of strings.֒→
5. Note that menu.sub represents a sub-category of the menu
category. As such, all entities under menu.sub should be
grouped together (either dictionary or list of dictionaries)
under the same menu group.֒→
֒→
֒→
6. You are to respond in JSON format only and ensure that the keys
in the dictionary are exactly the same as the entities
provided above.֒→
֒→
7. If you are unable to extract any information, please return an
empty list for that category.֒→
Here is an example of the expected return format:
```json
{
    "menu": [
        {
            "menu.nm": "SPGTHY BOLOGNASE",
            "menu.cnt": "1",
            "menu.price": "58,000"
        },
        {
            "menu.nm": "PEPPER AUS",
            "menu.cnt": "1",
            "menu.price": "165,000",
            "menu.sub": {
                "menu.sub_nm": "WELL DONE"
            }
        },
        {
            "menu.nm": "WAGYU RIBEYE",
            "menu.cnt": "1",
            "menu.price": "195,000",
            "menu.sub": {
                "menu.sub_nm": "MEDIUM WELL"
            }
        }
    ],
    "sub_total": {
        "sub_total.subtotal_price": "503,000",
        "sub_total.service_price": "25,150",
        "sub_total.tax_price": "52,815"
    },
    "total": {
        "total.total_price": "580,965"
    }
}
```
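A response in the format requested by this prompt has to be turned into groups before it can be scored. The following is a minimal sketch of such a step, not the parsing code used in the experiments: `parse_kie_response` and its behaviour are hypothetical, and it assumes the model wraps its JSON in a fenced block as in the example above, falling back to parsing the raw text otherwise.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable
CATEGORIES = ("menu", "sub_total", "total")


def parse_kie_response(text):
    """Return {category: [group, ...]} from a model response.

    Hypothetical helper: it normalizes the "dictionary or list of
    dictionaries" allowed by the prompt into a list of dictionaries.
    """
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text
    data = json.loads(payload)

    groups = {}
    for category in CATEGORIES:
        value = data.get(category, [])
        # A single dictionary is one group; a list holds several groups.
        groups[category] = [value] if isinstance(value, dict) else list(value)
    return groups


response = (
    FENCE + "json\n"
    '{"menu": [{"menu.nm": "PEPPER AUS", "menu.cnt": "1"}],'
    ' "total": {"total.total_price": "580,965"}}\n'
    + FENCE
)
parsed = parse_kie_response(response)
print(parsed)
```

Normalizing every category to a list of groups keeps later matching code free of special cases, and a category the model omitted simply becomes an empty list.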
C Automation Trade-off Analysis Metric
Prior metrics, including KIEval Aligned defined above, evaluate Document KIE
models without considering the full pipeline of Document KIE applications.
An RPA system commonly employs a human-correction stage after model
inference. Specifically, the RPA system uses the confidence scores of the entities
extracted by a Document KIE model and, with a certain threshold $\tau$ on the
confidence score, identifies which entities require further manual verification and
correction. We assume that human correction is conducted only on the predictions
with lower confidence scores and consists only of substitution and deletion, without
any addition operations, because an addition operation usually requires examining
all predictions and ground truths, making the correction process and the RPA
system inefficient.
To illustrate the formulation, let $c(pr_{n,i})$ be the confidence score of $pr_{n,i}$,
where $pr_{n,i}$ indicates the $i$-th entity in the $n$-th predicted group. $PR_{<\tau}$ is the
set of predictions whose confidence scores are less than the threshold $\tau$.
Since only $PR_{<\tau}$ is reviewed out of all predictions $PR$, the automation rate of the
RPA system can be defined as follows:
$$\text{auto-rate}^{\tau} = 1 - \frac{|PR_{<\tau}|}{|PR|}. \tag{1}$$
If the automation rate approaches 0 with a high $\tau$, the system becomes
inefficient but its output becomes accurate. When the automation
rate is close to 1 with a sufficiently low $\tau$, the system becomes efficient but at
the cost of potentially containing incorrect predictions, since the
human-correction stage is skipped.
To control the trade-off between system efficiency and accuracy, we introduce
$\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$, which evaluates the accuracy of the RPA automation system
with the human-correction stage. The evaluation assumes no human error in the
correction stage and that the errors in $PR_{<\tau}$ are revised only with substitution and
deletion operations. After the correction process, the remaining errors can be
categorized into $\mathrm{Subs}^{\tau}$, $\mathrm{Del}^{\tau}$, and $\mathrm{Add}$. $\mathrm{Subs}^{\tau}$ and $\mathrm{Del}^{\tau}$ denote the errors present
in predictions with a confidence score higher than $\tau$, while $\mathrm{Add}$ represents the
number of required entities missing from $PR$. With the remaining error counts,
$\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$ can be calculated as follows:
$$\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}} = 1 - \frac{\mathrm{Subs}^{\tau} + \mathrm{Del}^{\tau} + \mathrm{Add}}{N(PR^{*}) + \mathrm{Add}}, \tag{2}$$
where $N(PR^{*})$ indicates the number of predictions, $PR^{*}$, after the human
correction stage. The denominator includes $\mathrm{Add}$ to represent the total number of
entities of the system output, including the entities missing in $PR^{*}$. Through
$\text{auto-rate}^{\tau}$ and $\mathrm{KIEval}^{\tau}_{\mathrm{Aligned}}$, the automation efficiency and accuracy of the RPA
system can be measured by adjusting the confidence threshold $\tau$, facilitating their
trade-off analysis in Document KIE.
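Eqs. (1) and (2) can be illustrated on toy data. The sketch below is one possible reading, not the authors' code: the `Prediction` class, the error labels, and the example values are hypothetical. It assumes that a prediction with confidence exactly equal to the threshold counts as unreviewed, and that a reviewed prediction needing deletion is removed and therefore no longer counts towards N(PR*).

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    confidence: float  # c(pr_{n,i})
    error: str         # hypothetical label: "none", "subs", or "del"


def auto_rate(preds, tau):
    """Eq. (1): share of predictions that skip human review."""
    reviewed = sum(1 for p in preds if p.confidence < tau)
    return 1 - reviewed / len(preds)


def kieval_aligned_tau(preds, n_add, tau):
    """Eq. (2) under the assumptions stated in the lead-in."""
    unreviewed = [p for p in preds if p.confidence >= tau]
    subs_tau = sum(1 for p in unreviewed if p.error == "subs")
    del_tau = sum(1 for p in unreviewed if p.error == "del")
    # Assumption: reviewed predictions marked "del" are removed by the human,
    # so they do not count towards N(PR*).
    removed = sum(1 for p in preds if p.confidence < tau and p.error == "del")
    n_after = len(preds) - removed
    return 1 - (subs_tau + del_tau + n_add) / (n_after + n_add)


# Toy predictions; n_add is the number of required entities missing from PR.
preds = [
    Prediction(0.95, "none"),
    Prediction(0.90, "subs"),
    Prediction(0.60, "del"),
    Prediction(0.50, "subs"),
    Prediction(0.30, "none"),
]

for tau in (0.0, 0.7, 1.0):
    rate = auto_rate(preds, tau)
    score = kieval_aligned_tau(preds, n_add=1, tau=tau)
    print(f"tau={tau:.1f}  auto-rate={rate:.2f}  KIEval_aligned={score:.2f}")
```

On this toy input, raising the threshold lowers the automation rate and raises the post-correction score, but the score stays below 1 even when everything is reviewed, because missing entities (Add) are never corrected under the stated assumption.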