arxiv

Paper 2503.05488

KIEval: Evaluation Metric for Document Key Information Extraction

Authors: Minsoo Khang, Sang Chul Jung, Sungrae Park, Teakgyu Hong

Published: 2025-03-07

Abstract:

Document Key Information Extraction (KIE) is a technology that transforms valuable information in document images into structured data, and it has become an essential function in industrial settings. However, current evaluation metrics of this technology do not accurately reflect the critical attributes of its industrial applications. In this paper, we present KIEval, a novel application-centric evaluation metric for Document KIE models. Unlike prior metrics, KIEval assesses Document KIE models not just on the extraction of individual information (entity) but also of the structured information (grouping). Evaluation of structured information provides assessment of Document KIE models that are more reflective of extracting grouped information from documents in industrial settings. Designed with industrial application in mind, we believe that KIEval can become a standard evaluation metric for developing or applying Document KIE models in practice. The code will be publicly available.

Paper Content:
arXiv:2503.05488v1 [cs.CL] 7 Mar 2025

Minsoo Khang, Sang Chul Jung, Sungrae Park, and Teakgyu Hong
Upstage AI, South Korea
{mkhang, eric, sungrae.park, tghong}@upstage.ai

Keywords: Document AI · Key Information Extraction · Evaluation Metric.

1 Introduction

Document Key Information Extraction (KIE) is the well-known task of converting information in document images into structured data, and it has gained much attention from both academia and industry over the years [24,23,9,5,8,12,17,13,10]. One common industrial application of Document KIE is Robotic Process Automation (RPA) for document digitisation, which aims to extract, structure, and store the data in document images in databases for various downstream applications. Information extracted from documents is often presented as key-value pairs (e.g.
Menu.name: “AMERICANO”) that are frequently interrelated (e.g. Menu.name & Menu.price), forming the basis of structured information in documents.

Despite such application settings, a standardized evaluation metric for Document KIE models has yet to be established, and the metrics used in prior works fail to consider several key components from the application’s standpoint. The disparity between existing evaluation metrics and application settings has two main causes: neglect of the structured nature of information in assessment, and insufficient alignment of metric formulation with industrial applications. Detailed explanations of these causes follow (corresponding visualisations are shown in Fig. 1 and 2).

[Figure 1 image: a CORD receipt with its ground truth and model predictions, each split into non-grouped entities (Total, SubTotal, Tax) and grouped entities (Menu.price, Menu.name).]
Fig. 1: Example of CORD dataset (receipts). The dataset has non-grouped and grouped entities (non-grouped entities form a special group), and requires structured predictions including Menu groups: Menu.name and Menu.price. Errors in model predictions are not limited to individual key-value pair errors but also occur in the extraction of structural relations between entities (marked in red). Both error types must be considered in Document KIE model evaluation.

Structured nature of information refers to the presence of structural relations between key-value pairs in documents. Referring to the example in Fig.
1 and 2, each value of the entity type Menu.name is contextually linked to a different value of Menu.price. Existing metrics, however, mainly focus on the assessment of individual entity extraction and offer limited or no evaluation of how well such structured information is extracted. In industrial applications, the lack of such structured information can lead to critical information loss when storing data in relational databases for downstream tasks.

Insufficient alignment of existing metrics’ design refers to gaps that arise because metric formulations are not fully representative of Document KIE applications in industrial settings. Existing metrics, such as the Entity-level F1 metric, often distinguish a KIE model’s erroneous prediction (False-Positive, FP) from its missed prediction (False-Negative, FN). Such a distinction, while well-suited for model development, creates a clear disparity with application settings, where KIE errors are often perceived in terms of the number of corrections needed. Here, the correction count refers to the number of value-editing steps (each one of substitution, addition, or deletion) needed to convert KIE predictions into the ground-truth values.
[Figure 2 image: the ground truth and predictions of Fig. 1 scored by two metrics. Entity-level F1: Recall = 11/12 = 91.67%, Precision = 11/11 = 100.00%, F1 = Hmean(91.67%, 100.00%) = 95.65%. KIEval: Group-level = 4/7 = 57.14%, Entity-level = 10/13 = 76.92%.]
Fig. 2: Comparison of evaluation metrics for KIE tasks. Note that the ground truth and predictions follow the same document image as in Fig. 1. The red boxes indicate the errors accounted for by the respective metrics during evaluation. Entity-level F1 does not account for structural relations, unlike the proposed KIEval metric, which performs both entity-level and group-level evaluations based on the group-matching information (blue links).

To address the causes of disparity identified above, we propose an evaluation metric with an application-centric design, named KIEval (Key Information Extraction evaluation). Firstly, KIEval is formulated to provide KIE model assessment at two different levels: entity-level (individual entities such as Menu.name) and group-level (sets of related entities such as Menu.name, Menu.price).
In both levels of evaluation, prediction and ground-truth values are matched conditioned on the information structure (group), facilitating structured-information-level assessment of KIE models. Secondly, KIEval formulates KIE errors in terms of the number of substitution, addition, or deletion steps needed. This formulation, instead of the conventional FP and FN, aims to better represent the eventual cost that KIE errors incur in application settings.

In this work, our key contributions can be summarized as follows. 1) We propose the KIEval (Key Information Extraction evaluation) metric for Document KIE, which incorporates structured information assessment in both entity-level and group-level evaluation. 2) We provide KIE model evaluation in terms of information correction cost, bridging the disparity between existing metrics and industrial applications. 3) We showcase a use-case study on how KIEval can be applied in RPA systems, highlighting its differences from existing metrics.

2 Related Works

2.1 Document KIE Methodology

Various approaches to Document KIE have been proposed over the years. One of the notable earlier works, LayoutLM [24], was the first to propose a multi-modal framework incorporating both text and layout modalities in key information extraction from documents. The framework’s robust performance with a simple BIO-tagging design has motivated many follow-up works, such as BROS [8], LayoutLMv2 [23], LayoutLMv3 [9], and ERNIE-Layout [17]. Alternative forms of KIE with improved representational flexibility, such as graph-based [10] and text-generation [12,13] frameworks, were proposed in follow-up works to better capture dependencies between entities and effectively capture structured information from documents.
With the rise of LLM applications, recent works such as ICL-D3IE [6] and SAIL [27] have leveraged the flexibility of LLMs to tackle Document KIE in zero-shot and few-shot settings.

2.2 Existing Metrics

Entity-level F1 score is one of the most commonly used metrics for Document KIE model evaluation. Extracted entity-wise key-value pairs are matched against the ground-truth key-value pairs, and a predicted pair is considered valid if an exact match can be found in the ground-truth set. Such exact-match statistics are collated across the different entities in the dataset to evaluate the model’s entity-level F1 score. Commonly used in prior works [24,23,9,17], the entity-level F1 score evaluates the degree to which the model’s extracted information exactly matches the expected key-value content in the document.

This metric, however, not only disregards structural relations between entities during entity-level F1 evaluation but also provides no assessment of group-level information extraction. In industrial applications, extracted entities often form meaningful information only when grouped with other structurally related entities (i.e. a group), such as the grouping of Menu.name, Menu.quantity, and Menu.price in receipts. Variations of the entity-level F1 score were employed in prior works, such as group-constrained entity-level F1 in SPADE [10] and Entity Extraction and Linking F1 in BROS [8], offering a more comprehensive assessment of KIE models. These variations of the entity-level F1 metric underscore the necessity for a standardized metric for KIE model evaluation in the field of Document AI. Furthermore, the absence of direct assessment of group-level information extraction in these metrics highlights the disparity in meeting industrial application requirements.

Tree Edit Distance (TED) score is another type of KIE evaluation metric, commonly adopted for text-generation-based KIE models [12].
In contrast to the exact-match-based entity-level F1 score, TED adopts a soft-match approach to avoid over-penalising the model’s KIE output. An edit-distance-based metric can provide a more objective assessment of the model by mitigating the impact of minor discrepancies, such as those between “ice cream” and “ice-cream”, which could otherwise lead to underestimation of the model’s KIE performance. As discussed in Donut [12], the TED metric can be applied to KIE models by first representing the prediction and ground truth as trees and then evaluating the edit distance between them. With the structural relations between entities captured by the tree representation, this metric assesses the model’s KIE performance not only at the entity level but also at the group level.

Despite this capacity, TED’s soft-match approach could exacerbate the discrepancy when applied to industrial settings. Taking an automatic KIE setting as an example, pairs of values with a minor edit distance could refer to completely different items, such as “Pear” and “Pea” or “7000” and “1000”. Consequently, evaluation metrics are required to be stringent, and providing partial scores (with edit distance) could offer a misleading KIE assessment.

Other notable metrics include Average Normalized Levenshtein Similarity (ANLS) [2,19] and a hybrid metric of exact match and edit distance [26]. ANLS aims to reduce the overestimation of KIE models by constraining the maximum edit distance between prediction and ground truth to a pre-specified threshold value (e.g. 0.5), beyond which no partial score is given. The hybrid metric, on the other hand, is a weighted arithmetic mean of the KIE model’s entity-level F1 and inverse Normalized Edit Distance (NED). Nevertheless, these metrics still share the limitations of the Entity-level F1 and TED metrics, and do not provide group-level assessment of KIE models.
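
To make the concern about soft matching concrete, the following minimal Python sketch compares exact matching with a thresholded normalized Levenshtein similarity on the string pairs mentioned above. It assumes the commonly used per-pair ANLS-style scoring (similarity 1 − NL when the normalized distance NL is below the threshold, otherwise 0, with NL normalized by the longer string). The function names are illustrative and are not taken from any of the cited implementations.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def soft_score(pred: str, gt: str, threshold: float = 0.5) -> float:
    """Thresholded normalized Levenshtein similarity for one value pair.

    Assumption of this sketch: distance is normalized by the longer string.
    """
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    nl = levenshtein(pred, gt) / longest
    return 1.0 - nl if nl < threshold else 0.0


def exact_score(pred: str, gt: str) -> float:
    """Exact-match scoring, as used by entity-level F1."""
    return 1.0 if pred == gt else 0.0


# String pairs quoted in the text: (prediction, ground truth).
pairs = [("ice-cream", "ice cream"), ("Pea", "Pear"), ("1000", "7000")]
for pred, gt in pairs:
    print(f"{pred!r} vs {gt!r}: exact={exact_score(pred, gt)}, "
          f"soft={soft_score(pred, gt):.3f}")
```

“Pea”/“Pear” and “1000”/“7000” each differ by one edit over four characters, so the soft score gives them 0.75 even though they denote different items, while exact matching gives 0.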
3 Problem Definition

3.1 Document KIE Task

Document KIE is a task in the field of Document Understanding (DU), with the objective of extracting structured key-value pairs from document images. Commonly positioned as the task preceding various knowledge-application operations (e.g. financial data analysis), Document KIE faces two principal challenges: (1) accurate extraction of key-value information, requiring value extraction and entity-key classification with minimal error, and (2) discernment of the structural relations between different key-value pairs, which demands accurate identification of the contextual links between key-value pairs to form a coherent group-level information unit.

3.2 Measuring Document KIE Model in Industrial Settings

To create an application-centric evaluation metric, the following principal challenges of prior metrics must be addressed: (1) the absence of structural relations in metrics and (2) the insufficient alignment of metric formulation with application settings.

Structural relation refers to the contextual linkage between key-value pairs associated with entities in documents. Entity key-value pairs that share a contextual linkage are formally defined as a group, such as Menu.name and Menu.price in the CORD example (Fig. 1). Such structural relations present in documents need to be considered in both entity-level and group-level evaluations. Prior metrics for entity-level evaluation (e.g. Entity-level F1) show limited or no inclusion of structural relations, evaluating each extracted key-value pair independently of the remaining key-value pairs. The prediction results visualized in Fig. 2 clearly demonstrate this: the prediction of {Menu.price: “54,545”} is not matched with the corresponding {Menu.name: “LIPTON PITCHER”}.
Despite predicting a valid Menu.price key-value pair, this prediction is regarded as erroneous due to its failure to capture the relation with the contextually linked Menu.name key-value pair. Intuitively, a standalone Menu.price key-value pair does not provide meaningful information from the application point of view unless it is paired with the corresponding Menu.name. In group-level evaluation, a more direct assessment of structural relations is conducted, where each group (instead of each key-value pair) is treated as a unit of extracted information. Group-level evaluation provides an essential assessment of the KIE model, especially in industrial applications where information is often used in contextually related groups (e.g. Menu.name, Menu.price).

[Figure 3 panels, all on Menu.name values]
(a) Scenario #1: ground truth {Americano, Latte}, prediction {Americano}; one FN; Recall = 1/2, Precision = 1/1, F1 = 2/3.
(b) Scenario #2: ground truth {Americano, Latte}, prediction {Americano, Juice}; one FN and one FP; Recall = 1/2, Precision = 1/2, F1 = 1/2.
(c) Scenario #3: ground truth {Americano}, prediction {Americano, Juice}; one FP; Recall = 1/1, Precision = 1/2, F1 = 2/3.
Fig. 3: F1 score examples in three different scenarios. From the application perspective, the three scenarios require the same number of error corrections: (#1) filling in missing information, (#2) replacing wrong information, and (#3) deleting unexpected information. In terms of F1 scores, however, false negatives (FN) and false positives (FP) are counted separately to obtain the representative score, F1.
In addition to the structural relations between key-value pairs, metric formulation is another factor that creates the disparity between the model-centric design of prior works and the application-centric design of KIEval. Model-centric metric formulations, such as the Entity-level F1 metric, often distinguish a model’s erroneous prediction (FP) from its missed prediction (FN). In industrial applications, however, it is more relevant to assess KIE models in terms of the additional cost incurred due to KIE errors. To elaborate, with reference to Fig. 3, Entity-level F1 evaluation across the three scenarios implies lower KIE performance in scenario 2. From the application perspective, however, the predictions in all three scenarios incur the same cost of one editing operation (addition, substitution, or deletion) in KIE automation. Consequently, it is imperative to develop an application-centric metric formulation that accurately reflects actual application settings.

Based on the key challenges of application-centric design defined above, our proposed metric, KIEval, is designed to bridge the disparity between current metrics and industrial applications.

4 KIEval

4.1 Structured Evaluation – Entity and Group Level

In KIEval, to integrate structural relations into KIE evaluation, group-matching is conducted between the predicted and ground-truth key-value pairs prior to the entity-level and group-level evaluations. While a variant of group-matching for entity-level evaluation was employed in [10], the lack of a formal definition underscores its significance for KIE metric standardisation. To illustrate, let $PR = \{pr_1, pr_2, \ldots, pr_N\}$ be a set of predicted groups and $GT = \{gt_1, gt_2, \ldots, gt_M\}$ be a set of ground-truth groups, where each group consists of a set of entities represented by tuples (entity-type, value). The non-group entities (i.e.
company.name and company.number in a receipt) are included in the 1-st group so that all entities are represented in the same structural format. For the formal definition of group-matching, we define a matching score $S_{(n,m)}$ that counts the identical entities between $pr_n$ and $gt_m$. Based on the matching scores between groups, each predicted group is matched with a ground-truth group through Hungarian matching to obtain a group-matched set of pairs, $G = \{(n_1, m_1), (n_2, m_2), \ldots\}$, where $n_g$ and $m_g$ indicate the $g$-th matched indices of the predicted and ground-truth groups, respectively, and $|G| = \min(N, M)$. The group-matching can be defined as follows:

$$G = \mathrm{Hungarian}(PR, GT, S) \tag{1}$$

where $S$ indicates the set of matching scores $S_{(n,m)}$ between all pairs of predicted and ground-truth groups.

For an entity type $e$ at a match $(n,m)$, F1 statistics such as True-Positive ($TP$), False-Negative ($FN$), and False-Positive ($FP$) can be calculated as follows:

$$TP^{e}_{(n,m)} = S^{e}_{(n,m)}, \qquad FN^{e}_{(n,m)} = N_e(gt_m) - S^{e}_{(n,m)}, \qquad FP^{e}_{(n,m)} = N_e(pr_n) - S^{e}_{(n,m)} \tag{2}$$

where $S^{e}_{(n,m)}$ indicates the number of identical entity pairs of entity-type $e$ between the $n$-th predicted and $m$-th ground-truth groups, and $N_e(\cdot)$ counts the entities of entity-type $e$ in a group. In other words, $TP^{e}_{(n,m)}$ indicates the matched entities, and $FN^{e}_{(n,m)}$ and $FP^{e}_{(n,m)}$ represent the remaining ground-truth entities and predictions, respectively, of entity type $e$ in the specific match $(n,m)$. To calculate the final cumulative score, KIEval Entity F1, the total F1 statistics are obtained as follows:

$$TP_{entity} = \sum_{(n,m) \in G} \sum_{e} TP^{e}_{(n,m)} \tag{3}$$

$$FN_{entity} = \sum_{m=1}^{M} \sum_{e} N_e(gt_m) - TP_{entity} \tag{4}$$

$$FP_{entity} = \sum_{n=1}^{N} \sum_{e} N_e(pr_n) - TP_{entity} \tag{5}$$

These statistics can be used to calculate the KIEval Entity F1 metric from precision and recall in the standard manner.
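
As a minimal sketch of Eq. 1 to 5, the Python below matches predicted and ground-truth groups and derives KIEval Entity F1 next to a conventional, group-agnostic entity F1. Several things here are assumptions of this sketch and not the authors' implementation: the function names are illustrative, the matching is a brute-force search over assignments that stands in for Hungarian matching (it reaches the same maximum total score but only scales to a handful of groups), the special non-group first group is not treated separately, and the two-item receipt is a hypothetical example loosely modelled on the LIPTON PITCHER case in Fig. 2.

```python
from collections import Counter
from itertools import permutations


def match_score(pr, gt):
    """S(n, m): number of identical (entity_type, value) tuples in two groups."""
    return sum((Counter(pr) & Counter(gt)).values())


def match_groups(preds, gts):
    """Return index pairs (n, m) that maximise the total matching score.

    Brute-force stand-in for Hungarian matching; |G| = min(N, M).
    """
    n_pred, n_gt = len(preds), len(gts)
    if n_pred <= n_gt:
        # Assign each predicted group n to a distinct ground-truth group m.
        candidates = (list(enumerate(perm))
                      for perm in permutations(range(n_gt), n_pred))
    else:
        # Assign each ground-truth group m to a distinct predicted group n.
        candidates = ([(n, m) for m, n in enumerate(perm)]
                      for perm in permutations(range(n_pred), n_gt))
    best_pairs, best_total = [], -1
    for pairs in candidates:
        total = sum(match_score(preds[n], gts[m]) for n, m in pairs)
        if total > best_total:
            best_pairs, best_total = pairs, total
    return best_pairs


def f1(tp, fn, fp):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def kieval_entity_f1(preds, gts):
    pairs = match_groups(preds, gts)
    tp = sum(match_score(preds[n], gts[m]) for n, m in pairs)  # Eq. 3
    fn = sum(len(g) for g in gts) - tp                          # Eq. 4
    fp = sum(len(p) for p in preds) - tp                        # Eq. 5
    return f1(tp, fn, fp)


def flat_entity_f1(preds, gts):
    """Conventional entity-level F1: exact match, group boundaries ignored."""
    flat_pr = [e for g in preds for e in g]
    flat_gt = [e for g in gts for e in g]
    tp = match_score(flat_pr, flat_gt)
    return f1(tp, len(flat_gt) - tp, len(flat_pr) - tp)


# Hypothetical receipt: the price of LIPTON PITCHER is extracted correctly
# but ends up in a group of its own.
ground_truth = [
    [("Menu.price", "54,545"), ("Menu.name", "LIPTON PITCHER")],
    [("Menu.price", "29,091"), ("Menu.name", "CHOCO PUFF")],
]
prediction = [
    [("Menu.name", "LIPTON PITCHER")],
    [("Menu.price", "54,545")],
    [("Menu.price", "29,091"), ("Menu.name", "CHOCO PUFF")],
]

print("Entity F1 (flat):", flat_entity_f1(prediction, ground_truth))
print("KIEval Entity F1:", kieval_entity_f1(prediction, ground_truth))
```

In this toy case every predicted entity has an exact counterpart, so the flat score is 1.0, while the separated price costs one FN and one FP after group matching (TP = 3, precision = recall = 3/4, F1 = 0.75).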
Group-level evaluation, KIEval Group F1, is also conducted on the group-matched set $G$, with F1 statistics evaluated across the different groups. Unlike KIEval Entity F1, which treats the individual entities within a group as the units of information when evaluating F1 statistics, KIEval Group F1 treats the entire group as a unit of information. It should be noted that group-level evaluation is conducted on all but the first element of $G$ (i.e. $G'$), as the first element represents the non-group entities.

$$G' = G \setminus (n_1, m_1) \tag{6}$$

$$TP_{group} = \sum_{(n,m) \in G'} \mathbb{1}\left[ S^{e}_{(n,m)} = N_e(gt_m) = N_e(pr_n) \;\; \forall e \right] \tag{7}$$

Eq. 7 formulates the group-level True-Positive measure, which counts the identical pairs of predicted and ground-truth groups in $G'$. In the equation, $\mathbb{1}[\cdot]$ is an indicator function that returns 1 when the predicted and ground-truth groups are identical. FN and FP are calculated by counting the remaining ground-truth and predicted groups, respectively. Finally, KIEval Group F1 is obtained from precision and recall in the same fashion. Based on the formal definitions above, both KIEval Entity F1 and KIEval Group F1 incorporate structural relation assessment into evaluation, at the entity level and the group level respectively.

4.2 Aligned Metric Formulation

While distinguishing a model’s erroneous prediction from its missed prediction as FP and FN in the metric formulation may be well-suited from the model-centric point of view, its misalignment with the standpoint of industrial applications has motivated the formulation of our metric. KIEval’s application-centric design addresses this misalignment by conceptualizing KIE errors as correction costs incurred in application settings. A correction refers to one of three editing steps: substitution, addition, or deletion of prediction values to match the ground truth.
For an entity type $e$ at a match $(n,m) \in G$, the steps can be defined in terms of FN and FP as follows:

$$Subs^{e}_{(n,m)} = \min\left(FP^{e}_{(n,m)}, FN^{e}_{(n,m)}\right) \tag{8}$$

$$Add^{e}_{(n,m)} = FN^{e}_{(n,m)} - Subs^{e}_{(n,m)} \tag{9}$$

$$Del^{e}_{(n,m)} = FP^{e}_{(n,m)} - Subs^{e}_{(n,m)} \tag{10}$$

As can be seen, substitution is defined as the minimum of $FP^{e}_{(n,m)}$ and $FN^{e}_{(n,m)}$, which indicates the number of predictions that require modification to match the corresponding ground-truth values. Addition and deletion are the numbers of remaining $FN^{e}_{(n,m)}$ and $FP^{e}_{(n,m)}$, respectively. The number of errors, $Error^{e}_{(n,m)} = Subs^{e}_{(n,m)} + Add^{e}_{(n,m)} + Del^{e}_{(n,m)}$, is the sum of the three error corrections.

[Figure 4 image]
Fig. 4: Sample images from the SROIE (left), CORD (center), and FUNSD (right) datasets.

The total number of errors can be defined as follows:

$$Error = \sum_{(n,m) \in G} \sum_{e} Error^{e}_{(n,m)} + \underbrace{\sum_{(*,m) \notin G} \sum_{e} N_e(gt_m)}_{\text{Add unmatched gt}} + \underbrace{\sum_{(n,*) \notin G} \sum_{e} N_e(pr_n)}_{\text{Del unmatched pr}} \tag{11}$$

Here, the first term on the right-hand side indicates the number of error corrections within the group matches $G$, and the second and third terms represent the number of additions and deletions, respectively, for the non-matched groups. Finally, KIEval Aligned is calculated from $Error$ and the number of correct values, $TP_{entity}$:

$$KIEval_{Aligned} = \frac{TP_{entity}}{TP_{entity} + Error} \tag{12}$$

KIEval Aligned not only aligns better with industrial applications but also benefits from high interpretability, as it is formulated in terms of the well-known F1 components TP, FP, and FN.
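
The three scenarios of Fig. 3 give a compact check of Eq. 8 to 12. The Python sketch below computes the correction counts for a single entity type within one matched group pair, so it covers only the first term of Eq. 11 (no unmatched groups). The function names and the zero-division convention are assumptions of this sketch, not part of the paper.

```python
from collections import Counter


def correction_counts(pred_values, gt_values):
    """Return (tp, subs, add, delete) for one entity type in one matched pair."""
    tp = sum((Counter(pred_values) & Counter(gt_values)).values())
    fn = len(gt_values) - tp    # ground-truth values not predicted
    fp = len(pred_values) - tp  # predicted values not in the ground truth
    subs = min(fp, fn)          # Eq. 8
    add = fn - subs             # Eq. 9
    delete = fp - subs          # Eq. 10
    return tp, subs, add, delete


def kieval_aligned(tp, errors):
    """Eq. 12; returns 0.0 when there is nothing to score (sketch convention)."""
    total = tp + errors
    return tp / total if total else 0.0


# Fig. 3 scenarios as (predicted Menu.name values, ground-truth Menu.name values).
scenarios = {
    "#1 missing value": (["Americano"], ["Americano", "Latte"]),
    "#2 wrong value": (["Americano", "Juice"], ["Americano", "Latte"]),
    "#3 extra value": (["Americano", "Juice"], ["Americano"]),
}

for name, (pred, gt) in scenarios.items():
    tp, subs, add, delete = correction_counts(pred, gt)
    errors = subs + add + delete
    print(f"{name}: subs={subs} add={add} del={delete} "
          f"aligned={kieval_aligned(tp, errors)}")
```

Each scenario needs exactly one correction (one addition, one substitution, and one deletion, respectively), so all three receive the same aligned score of 1/2, whereas their F1 scores in Fig. 3 differ.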
5 Experiment Settings

5.1 Datasets

Experiments were conducted with the KIEval metric on models trained on three widely used benchmark datasets for the Document KIE task, namely SROIE, CORD, and FUNSD, shown in Fig. 4.

SROIE refers to the dataset introduced in Task 3 of the Scanned Receipts OCR and Information Extraction challenge of ICDAR 2019 (see footnote 1). This dataset comprises 626 training and 347 test receipt images, and requires participants’ models to extract key-value pairs for 4 entities from these images: Company, Date, Address, and Total price.

Footnote 1: https://rrc.cvc.uab.es/?ch=13

Consolidated Receipt Dataset (CORD) [16] comprises receipt images from shops and restaurants and is designed for the task of extracting grouped entities. The dataset’s annotation consists of 30 entities, which are categorized into 4 groups: Menu, Void menu, Subtotal, and Total. Entities within each group are contextually linked, such as Menu.name, Menu.price, and Menu.quantity. There are 800 training, 100 validation, and 100 test images.

Form Understanding in Noisy Scanned Documents (FUNSD) [11] consists of 149 training and 50 test documents, which are noisy, scanned, and varied in layout. This dataset is composed of three entities: Header, Question, and Answer. FUNSD, unlike the aforementioned datasets, allows each entity to hold multiple values within the same document image. To maintain consistency with prior works on FUNSD, all entities are regarded as non-group in our experiments.

5.2 Document KIE Models

Current works on Document KIE can be largely categorized into two frameworks: sequence labeling and generative frameworks.

Prior works in the sequence labeling framework adopt a tagging-based approach to extract key-value pairs from document images. In detail, with reference to the CORD sample image in Fig.
4, OCR is first applied to extract text such as “Vt Pep Mocha”, which is then tokenized into “Vt”, “Pep”, and “Mocha”. The KIE model then processes these tokens, often conditioned on layout and image information, to provide token-wise label (e.g. BIO tag) classifications, such as “B-Menu.name”, “I-Menu.name”, and “I-Menu.name” for the example text, respectively. The tokenized texts, along with their corresponding token-level tags, are then post-processed to form the final key-value pairs (e.g. Menu.name: “Vt Pep Mocha”). Representative works in this framework include the LayoutLM family [24,23,25,9], StructuralLM [14], BROS [8], LiLT [21], and DocFormer [1].

Generative-framework-based models conduct KIE from document images by directly generating the key-value pairs as text. Taking the same CORD example in Fig. 4, generative KIE models generate a text sequence of key-value pairs such as {Menu.name: “Vt Pep Mocha”, Menu.price: “4.95”}. OCR information can also be provided as auxiliary input to these KIE models. Notable generative methods include TILT [18], Donut [12], and Pix2Struct [13], where ResNet [7] or ViT [4] is commonly used as the image encoder and a Transformer decoder [20] as the text decoder.

In this work, we conduct experiments using the LayoutXLM [25] and LayoutLMv3 [9] models for the sequence labeling framework, and the Donut [12] model for the generative framework. Given recent advancements in large language model (LLM) applications for document intelligence, we also conduct zero-shot LLM-based KIE experiments with GPT-4o [15], Qwen2-VL [22], and InternVL 2.5 [3]. These experiments demonstrate how KIEval can provide additional insights into LLM evaluation within the KIE context.

5.3 Grouping Information

With prior KIE models mainly designed for KIE at the entity level, we adopt a simple methodology to extract grouping information prior to KIEval evaluations.
For models of the sequence labeling framework, a simple slot-filling method is adopted for grouping. To elaborate, given the set of entity types constituting a group (e.g. Menu.name, Menu.price, ... in CORD’s Menu group), the KIE model’s outputs for these entity types are sequentially filled into slots to form groups. While different approaches to grouping extraction can be adopted, such as relation extraction [25] or a graph-based method [10] on top of the KIE models, a simple grouping method was employed for the purpose of assessing the effectiveness of the KIEval metric. For text-generation-based models, group-level information can be extracted by simply structuring the target key-value pair text sequence, for example as JSON-format strings.

5.4 Experiment Details

All sequence labeling and generation models were trained for 1,000 steps with a batch size of 16. The initial learning rate was set to 5e-5, with linear learning rate decay. We used the provided OCR annotations along with images for experiments involving LayoutXLM [25] and LayoutLMv3 [9], while only the document image was provided for Donut [12] experiments. For reproducibility, all experiments were conducted using the models and datasets uploaded to Hugging Face Models and Datasets (see footnote 2). Details can be found in Appendix A. For multimodal LLMs, all experiments were conducted in a zero-shot setting, and the prompts used can be found in Appendix B.

6 Results and Discussion

6.1 Structured Evaluation

The conventional Entity F1 metric fails to accurately represent the KIE model’s performance because it does not consider structural relations. Fig. 5 (top) illustrates conceptual examples with corresponding metric scores across Entity F1, KIEval Entity F1, and KIEval Group F1. In Fig. 5 (top), while both Prediction 1 and 2 display accurate entity-level key-value pair extractions, the contextual relations (grouping) between different key-value pairs are not well captured in Prediction 2.
Such observations are not well reflected in the conventional Entity F1 metric, which scores 1.0 for both predictions.

In industrial applications where both the key-value and contextual linkage information need to be extracted (if present), Entity F1's insensitivity towards the latter could lead to a sub-optimal reflection of the KIE model's performance, especially in RPA applications. KIEval Entity F1 and KIEval Group F1, on the contrary, provide distinct evaluations across the two predictions by taking structural relations into account in their formulations. In KIEval Entity F1, despite error-free extraction of key-value pairs for each entity type, Prediction 2 is penalized for its grouping errors, resulting in a score of 1/3. Similarly, in KIEval Group F1, where each group is treated as a single unit of information instead of key-value pairs, Prediction 2 is evaluated to be completely incorrect, which is not discernible from the Entity F1 metric.

Fig. 5(bottom) depicts a sampled inference result of the Donut (generation KIE) model. Despite accurate extraction of Menu.name key-values, contextual linkage with other entity types is misaligned, possibly due to the tilt of the receipt image. The conventional Entity F1 score of the Menu.price entity does not reflect this error and assigns a full score of 1.0, unlike KIEval Entity F1, which penalizes the prediction accordingly.

[Figure 5. Top: a constructed CORD Menu example. Ground truth (Menu.price, Menu.count, Menu.name): (80,000, 4, SIAO MAI BABI), (60,000, 3, CEKER AYAM), (42,000, 2, BAKPAO BKR C CRISPY). Prediction 1 reproduces the ground truth: Entity F1 = 1, KIEval Entity F1 = 1, KIEval Group F1 = 1. Prediction 2 groups the same values as (60,000, 2, SIAO MAI BABI), (80,000, 3, BAKPAO BKR C CRISPY), (42,000, 4, CEKER AYAM): Entity F1 = 1, KIEval Entity F1 = 1/3, KIEval Group F1 = 0. Bottom: ground truth and Donut prediction for a receipt; for Menu.price, Entity F1 = 1 and KIEval Entity F1 = 0.]

Fig. 5: Examples illustrating the difference between Entity F1 and KIEval. The top scenario is constructed to showcase metric disparities, whereas the bottom scenario is based on a real prediction result from the Donut model.

Footnote 2: https://huggingface.co

Evaluation results for different metrics across all models and datasets are shown in Table 1.

| Metric | LayoutXLM SROIE | LayoutXLM CORD | LayoutXLM FUNSD | LayoutLMv3 SROIE | LayoutLMv3 CORD | LayoutLMv3 FUNSD | Donut SROIE | Donut CORD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Entity F1 | 91.77 | 95.43 | 84.02 | 91.87 | 95.13 | 85.87 | 83.85 | 84.93 |
| nTED | 97.24 | 94.86 | 61.96 | 96.91 | 94.43 | 69.36 | 96.17 | 90.62 |
| KIEval Entity F1 | 91.77 | 92.88 | 84.02 | 91.87 | 91.84 | 85.87 | 83.85 | 84.47 |
| KIEval Group F1 | - | 82.68 | - | - | 82.11 | - | - | 68.26 |
| KIEval Aligned | 90.32 | 89.02 | 79.22 | 91.15 | 88.15 | 80.22 | 83.57 | 79.70 |

Table 1: Comparison of Entity F1, nTED, and KIEval. When group entities are absent, Entity F1 and KIEval Entity F1 yield identical values. Note: Donut displays substantially lower performance than other models due to its sole reliance on image input, unlike other models' use of ground-truth OCR annotations.
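The scores quoted for the constructed example in Fig. 5(top) can be reproduced with a short sketch. It is an illustration under stated assumptions rather than the authors' implementation: the formal KIEval definitions are given earlier in the paper, and here group matching is approximated by a brute-force one-to-one assignment that maximises the number of exactly matching (key, value) pairs, with the same number of predicted and ground-truth groups. All function names (`entity_f1`, `match_groups`, `kieval_entity_f1`, `kieval_group_f1`, `menu`) are hypothetical.

```python
from collections import Counter
from itertools import permutations


def f1(tp, n_pred, n_gt):
    """F1 score from true positives and the predicted / ground-truth item counts."""
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gt
    return 2 * precision * recall / (precision + recall)


def entity_f1(pred_groups, gt_groups):
    """Conventional Entity F1: (key, value) pairs are compared with grouping ignored."""
    pred = Counter(kv for group in pred_groups for kv in group.items())
    gt = Counter(kv for group in gt_groups for kv in group.items())
    tp = sum((pred & gt).values())
    return f1(tp, sum(pred.values()), sum(gt.values()))


def overlap(pred_group, gt_group):
    """Number of (key, value) pairs shared by one predicted and one ground-truth group."""
    return len(set(pred_group.items()) & set(gt_group.items()))


def match_groups(pred_groups, gt_groups):
    """Assumed matching: the one-to-one assignment with the most shared pairs (brute force)."""
    assert len(pred_groups) == len(gt_groups), "sketch assumes equal group counts"
    best = max(
        permutations(range(len(gt_groups))),
        key=lambda perm: sum(overlap(pred_groups[i], gt_groups[j]) for i, j in enumerate(perm)),
    )
    return [(pred_groups[i], gt_groups[j]) for i, j in enumerate(best)]


def kieval_entity_f1(pred_groups, gt_groups):
    """Entity-level F1 where a pair only counts inside its matched group."""
    tp = sum(overlap(p, g) for p, g in match_groups(pred_groups, gt_groups))
    return f1(tp, sum(len(g) for g in pred_groups), sum(len(g) for g in gt_groups))


def kieval_group_f1(pred_groups, gt_groups):
    """Group-level F1 where a group counts only if every pair in it is correct."""
    tp = sum(1 for p, g in match_groups(pred_groups, gt_groups) if p == g)
    return f1(tp, len(pred_groups), len(gt_groups))


def menu(price, count, name):
    return {"Menu.price": price, "Menu.count": count, "Menu.name": name}


# Values taken from Fig. 5(top).
ground_truth = [
    menu("80,000", "4", "SIAO MAI BABI"),
    menu("60,000", "3", "CEKER AYAM"),
    menu("42,000", "2", "BAKPAO BKR C CRISPY"),
]
prediction_1 = [dict(group) for group in ground_truth]
prediction_2 = [
    menu("60,000", "2", "SIAO MAI BABI"),
    menu("80,000", "3", "BAKPAO BKR C CRISPY"),
    menu("42,000", "4", "CEKER AYAM"),
]

for label, pred in (("Prediction 1", prediction_1), ("Prediction 2", prediction_2)):
    print(
        label,
        round(entity_f1(pred, ground_truth), 3),
        round(kieval_entity_f1(pred, ground_truth), 3),
        round(kieval_group_f1(pred, ground_truth), 3),
    )
```

For Prediction 2, every predicted group shares exactly one pair with every ground-truth group, so any assignment yields 3 correct pairs out of 9 and no fully correct group, which is where the scores 1/3 and 0 come from. Brute-force enumeration is only practical for a handful of groups; an assignment solver would be the natural replacement for larger documents.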
For nTED, its soft-match approach compares KIE performance inaccurately, as seen for LayoutXLM and LayoutLMv3 on SROIE, where its trend differs from that of Entity F1 and KIEval Entity F1. For Entity F1, the differences from KIEval Entity F1 are prominent in the CORD dataset, where contextual links (grouping) between entities are present. KIEval Entity F1 is consistently lower than Entity F1 on CORD across all models, while the two scores are identical on SROIE and FUNSD. This discrepancy highlights the overestimation of KIE model performance when structural relations are ignored, while the metric converges to Entity F1 on datasets without grouping.

| Metric | Donut | GPT-4o | Qwen2-VL | InternVL 2.5 |
| --- | --- | --- | --- | --- |
| Entity F1 | 84.93 | 73.56 | 77.07 | 54.99 |
| KIEval Entity F1 | 84.47 | 72.93 | 77.07 | 54.54 |
| Difference | 0.46 | 0.63 | 0.00 | 0.45 |

Table 2: Comparison of Entity F1 and KIEval Entity F1 across generative models, including multimodal LLMs, on the CORD dataset. All multimodal LLMs are evaluated in a zero-shot setting. The difference between Entity F1 and KIEval Entity F1 serves to highlight the information structure awareness of LLMs in KIE.

[Figure 6: ground-truth and predicted (GPT-4o) tables of Menu.price, Menu.cnt and Menu.name for one CORD sample, annotated with the sample's Entity F1 and KIEval Entity F1.]

Fig. 6: Sample from the CORD dataset illustrating the performance gap between Entity F1 and KIEval Entity F1 for GPT-4o, highlighting the importance of structure awareness evaluation on top of the existing KIE metric.

The discrepancy is also evident in multimodal LLM evaluation, as shown in Table 2. The difference between Entity F1 and KIEval Entity F1 provides deeper insight into the LLM's ability to group correctly extracted information into the expected semantic structures.
Based on the CORD results, Qwen2-VL outperforms the other multimodal LLMs not only in extraction but also in grouping the extracted information accurately. Fig. 6 shows an example where GPT-4o correctly extracts key information but groups it into an incorrect structure, showcasing KIEval's utility in offering a new perspective for assessing LLMs in KIE.

6.2 Metrics from the Correction Cost Perspective

As previously discussed in Fig. 3, the disparity in the conceptualization of KIE errors (either as {FP, FN} or as correction cost) results in an assessment of KIE models that is misaligned with industrial applications. Table 3 shows a distinct gap between the FP + FN and Correction values that is consistent across all models and datasets. Our work brings this discrepancy to light and proposes the KIEval Aligned formulation to better align KIE evaluation with application settings.

Fig. 7: In CORD, Donut's KIEval Aligned and $\text{KIEval}^{\tau}_{\text{Aligned}}$ in relation to varying confidence score thresholds, $\tau$, alongside the corresponding automation rates, $\text{auto-rate}_{\tau}$. An increase in the confidence score threshold leads to an increase in $\text{KIEval}^{\tau}_{\text{Aligned}}$, while the automation rate decreases due to the rising number of entities requiring human revision.

| Count | LayoutXLM SROIE | LayoutXLM CORD | LayoutXLM FUNSD | LayoutLMv3 SROIE | LayoutLMv3 CORD | LayoutLMv3 FUNSD | Donut SROIE | Donut CORD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FP + FN | 231 | 190 | 637 | 226 | 218 | 566 | 447 | 404 |
| Subs | 93 | 37 | 198 | 102 | 53 | 142 | 219 | 124 |
| Add | 7 | 59 | 126 | 9 | 56 | 136 | 9 | 78 |
| Del | 38 | 57 | 115 | 13 | 56 | 146 | 0 | 78 |
| Correction | 138 | 153 | 439 | 124 | 165 | 424 | 228 | 280 |

Table 3: Comparison of FP + FN and Correction (Subs + Add + Del) statistics. Both Add and Del indicate the sum of counts within the matched and unmatched groups. Note: Correction refers to the number of correction steps taken.
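The gap reported in Table 3 can be checked arithmetically. In the sketch below the counts are copied from Table 3; the relation FP + FN = 2·Subs + Add + Del is a reading of those numbers (a substitution counted once as a false positive and once as a false negative, but as a single correction step) rather than a formula stated in this section, and the names `TABLE_3` and `correction_cost` are hypothetical.

```python
# Counts copied from Table 3: (FP + FN, Subs, Add, Del) per model and dataset.
TABLE_3 = {
    ("LayoutXLM", "SROIE"): (231, 93, 7, 38),
    ("LayoutXLM", "CORD"): (190, 37, 59, 57),
    ("LayoutXLM", "FUNSD"): (637, 198, 126, 115),
    ("LayoutLMv3", "SROIE"): (226, 102, 9, 13),
    ("LayoutLMv3", "CORD"): (218, 53, 56, 56),
    ("LayoutLMv3", "FUNSD"): (566, 142, 136, 146),
    ("Donut", "SROIE"): (447, 219, 9, 0),
    ("Donut", "CORD"): (404, 124, 78, 78),
}


def correction_cost(subs, add, delete):
    """Number of correction steps: each error type costs one step per entity."""
    return subs + add + delete


costs = {}
for (model, dataset), (fp_fn, subs, add, delete) in TABLE_3.items():
    cost = correction_cost(subs, add, delete)
    costs[(model, dataset)] = cost
    # Assumed reading: a substitution shows up twice in FP + FN (one FP, one FN).
    double_counted = 2 * subs + add + delete
    print(f"{model:<10} {dataset:<5} FP+FN={fp_fn:>3} correction={cost:>3} "
          f"gap={fp_fn - cost:>3} 2*Subs+Add+Del={double_counted:>3}")
```

Under this reading the gap between FP + FN and Correction equals the number of substitutions in every column, which is why the two quantities diverge most for configurations dominated by substitution errors, such as Donut on SROIE.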
7 KIE Evaluation for RPA System

In addition to the inclusion of structural relations and the alignment of the metric formulation, human correction is a distinctive factor that warrants attention when evaluating KIE models in RPA systems. Irrespective of how a KIE model is trained, it is improbable that it will consistently achieve error-free extraction across a diverse range of documents. For this reason, human correction (correction by human intervention) is commonly adopted by RPA systems. Human correction, however, requires a method for selecting a subset of predictions, as verifying and correcting all extracted information is impractical and undermines the very goal of automation in RPA.

Existing RPA systems commonly adopt confidence-score-based correction, where information extracted with confidence below a specific threshold (i.e. uncertain) is selected for verification and, if necessary, correction. Selection of the optimal threshold value is an application-specific decision that differs from one RPA system to another, contingent on the system's inclination to trade off automation rate for KIE performance. In this work, we demonstrate this trade-off analysis with KIEval and highlight its added insights over prior metrics.

We propose a method to analyse this trade-off in terms of post-correction KIE performance, automation rate, and the confidence score threshold value, $\tau$. We first define the automation rate, $\text{auto-rate}_{\tau}$, which reflects the proportion of model predictions processed without human verification, i.e. the fraction of entities whose confidence score is not below $\tau$. Post-correction KIE performance, $\text{KIEval}^{\tau}_{\text{Aligned}}$, denotes the final KIE performance after KIE predictions with confidence scores below $\tau$ are verified and corrected by humans. A formal definition of these two formulations is provided in Appendix C.

Fig. 7 presents $\text{auto-rate}_{\tau}$ and $\text{KIEval}^{\tau}_{\text{Aligned}}$ as functions of the confidence score threshold $\tau$ for Donut on CORD. The trade-off trend depicted in Fig. 7 indicates that, as the threshold value increases, the number of extracted entities requiring human review increases, leading to a higher post-correction KIE score at the cost of a reduced automation rate. Incorporating such trade-off analysis in the evaluation of KIE models not only provides deeper insights but also enables stakeholders to conduct cost-benefit evaluations effectively and determine the optimal threshold value for their RPA system.

8 Conclusion

In this work, we bring to light the discrepancies between existing Document KIE evaluation metrics and the key considerations of industrial settings, such as RPA systems. We identify the challenges behind these discrepancies and propose KIEval, a metric formulated with an application-centric design. Specifically, KIEval leverages group matching between the predicted and ground-truth groupings to integrate structural relations into KIE evaluation, differentiating itself from prior metrics that lack grouping awareness. Additionally, KIEval formulates KIE errors in terms of the corrections incurred in automation systems (i.e. Substitution, Addition, or Deletion), further bridging the gap between the evaluation metric and industrial settings. The experiments not only verify these discrepancies in existing metrics but also show how KIEval provides a different perspective on KIE model evaluation from the industrial application's standpoint. Beyond these discrepancies, we also demonstrate an application use case that illustrates the valuable insights that the trade-off analysis brings to RPA systems. This aspect has been overlooked in prior Document KIE metrics.
We believe that KIEval could serve as a standard evaluation metric for various KIE tasks and encourage the research community to focus on solving the remaining challenges in KIE tasks with the industrial application in mind.

References

1. Appalaraju, S., Jasani, B., Kota, B.U., Xie, Y., Manmatha, R.: Docformer: End-to-end transformer for document understanding. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 993–1003 (2021)
2. Biten, A.F., Tito, R., Mafla, A., Gomez, L., Rusinol, M., Mathew, M., Jawahar, C., Valveny, E., Karatzas, D.: Icdar 2019 competition on scene text visual question answering. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). pp. 1563–1570. IEEE (2019)
3. Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
5. Garncarek, Ł., Powalski, R., Stanisławek, T., Topolski, B., Halama, P., Turski, M., Graliński, F.: Lambert: Layout-aware language modeling for information extraction. In: International Conference on Document Analysis and Recognition. pp. 532–547. Springer (2021)
6. He, J., Wang, L., Hu, Y., Liu, N., Liu, H., Xu, X., Shen, H.: Icl-d3ie: In-context learning with diverse demonstrations updating for document information extraction. arXiv preprint arXiv:2303.05063 (2023)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
8. Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., Park, S.: Bros: A pre-trained language model focusing on text and layout for better key information extraction from documents. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 36, pp. 10767–10775 (2022)
9. Huang, Y., Lv, T., Cui, L., Lu, Y., Wei, F.: Layoutlmv3: Pre-training for document ai with unified text and image masking. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 4083–4091 (2022)
10. Hwang, W., Yim, J., Park, S., Yang, S., Seo, M.: Spatial dependency parsing for semi-structured document information extraction. arXiv preprint arXiv:2005.00642 (2020)
11. Jaume, G., Ekenel, H.K., Thiran, J.P.: Funsd: A dataset for form understanding in noisy scanned documents. In: 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). vol. 2, pp. 1–6. IEEE (2019)
12. Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J., Hwang, W., Yun, S., Han, D., Park, S.: Ocr-free document understanding transformer. In: European Conference on Computer Vision. pp. 498–517. Springer (2022)
13. Lee, K., Joshi, M., Turc, I.R., Hu, H., Liu, F., Eisenschlos, J.M., Khandelwal, U., Shaw, P., Chang, M.W., Toutanova, K.: Pix2struct: Screenshot parsing as pretraining for visual language understanding. In: International Conference on Machine Learning. pp. 18893–18912. PMLR (2023)
14. Li, C., Bi, B., Yan, M., Wang, W., Huang, S., Huang, F., Si, L.: Structurallm: Structural pre-training for form understanding. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 6309–6318 (2021)
15. OpenAI: Gpt-4v(ision) system card (2023)
16. Park, S., Shin, S., Lee, B., Lee, J., Surh, J., Seo, M., Lee, H.: Cord: a consolidated receipt dataset for post-ocr parsing. In: Workshop on Document Intelligence at NeurIPS 2019 (2019)
17. Peng, Q., Pan, Y., Wang, W., Luo, B., Zhang, Z., Huang, Z., Hu, T., Yin, W., Chen, Y., Zhang, Y., et al.: Ernie-layout: Layout knowledge enhanced pre-training for visually-rich document understanding. arXiv preprint arXiv:2210.06155 (2022)
18. Powalski, R., Borchmann, Ł., Jurkiewicz, D., Dwojak, T., Pietruszka, M., Pałka, G.: Going full-tilt boogie on document understanding with text-image-layout transformer. In: Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part II 16. pp. 732–747. Springer (2021)
19. Tito, R., Mathew, M., Jawahar, C., Valveny, E., Karatzas, D.: Icdar 2021 competition on document visual question answering. In: Document Analysis and Recognition–ICDAR 2021: 16th International Conference, Lausanne, Switzerland, September 5–10, 2021, Proceedings, Part IV 16. pp. 635–649. Springer (2021)
20. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
21. Wang, J., Jin, L., Ding, K.: Lilt: A simple yet effective language-independent layout transformer for structured document understanding. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 7747–7757 (2022)
22. Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
23. Xu, Y., Xu, Y., Lv, T., Cui, L., Wei, F., Wang, G., Lu, Y., Florencio, D., Zhang, C., Che, W., et al.: Layoutlmv2: Multi-modal pre-training for visually-rich document understanding. arXiv preprint arXiv:2012.14740 (2020)
24. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: Layoutlm: Pre-training of text and layout for document image understanding. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 1192–1200 (2020)
25. Xu, Y., Lv, T., Cui, L., Wang, G., Lu, Y., Florencio, D., Zhang, C., Wei, F.: Xfund: A benchmark dataset for multilingual visually rich form understanding. In: Findings of the Association for Computational Linguistics: ACL 2022. pp. 3214–3224 (2022)
26. Yu, W., Zhang, C., Cao, H., Hua, W., Li, B., Chen, H., Liu, M., Chen, M., Kuang, J., Cheng, M., et al.: Icdar 2023 competition on structured text extraction from visually-rich document images. arXiv preprint arXiv:2306.03287 (2023)
27. Zhang, J., You, Z., Wang, J., Le, X.: Sail: Sample-centric in-context learning for document information extraction. arXiv preprint arXiv:2412.17092 (2024)

Supplementary - KIEval: Evaluation Metric for Document Key Information Extraction

Minsoo Khang, Sang Chul Jung, Sungrae Park, and Teakgyu Hong
Upstage AI, South Korea
{mkhang, eric, sungrae.park, tghong}@upstage.ai

A Datasets

The following table provides details on the datasets and models used in the experiments conducted in this paper. All datasets and models are available on Hugging Face to ensure experimental reproducibility.

| | LayoutXLM | LayoutLMv3 | Donut |
| --- | --- | --- | --- |
| Models | microsoft/layoutxlm-base | microsoft/layoutlmv3-base | naver-clova-ix/donut-base |

| Dataset | LayoutXLM / LayoutLMv3 | Donut |
| --- | --- | --- |
| SROIE | darentang/sroie | podbilabs/sroie-donut |
| CORD | nielsr/cord-layoutlmv3 | naver-clova-ix/cord-v2 |
| FUNSD | nielsr/funsd-layoutlmv3 | - |

Table 1: Hugging Face Models and Datasets used in the experiments.
B Multimodal LLM prompt for CORD KIE

The following prompt is used when experimenting with GPT-4o, Qwen2-VL and InternVL 2.5 for KIE on the CORD dataset.

You will be provided with a receipt as an image. Your task is to analyze the receipt carefully and extract key information from it.

The entities to be extracted along with their descriptions are provided below.

| Category | Sub-Category (if applicable) | Entity | Description |
| --- | --- | --- | --- |
| menu | (not applicable) | menu.cnt | quantity of menu |
| | (not applicable) | menu.discountprice | discounted price of menu |
| | (not applicable) | menu.etc | others |
| | (not applicable) | menu.itemsubtotal | price of each menu after discount applied |
| | (not applicable) | menu.nm | name of menu |
| | (not applicable) | menu.num | identification # of menu |
| | (not applicable) | menu.price | total price of menu |
| | sub | menu.sub_cnt | quantity of submenu |
| | sub | menu.sub_nm | name of submenu |
| | sub | menu.sub_price | total price of submenu |
| | sub | menu.sub_unitprice | unit price of submenu |
| | (not applicable) | menu.unitprice | unit price of menu |
| | (not applicable) | menu.vatyn | whether the price includes tax or not |
| sub_total | (not applicable) | sub_total.discount_price | discounted price in total |
| | (not applicable) | sub_total.etc | others |
| | (not applicable) | sub_total.service_price | service charge |
| | (not applicable) | sub_total.subtotal_price | subtotal price |
| | (not applicable) | sub_total.tax_price | tax amount |
| total | (not applicable) | total.cashprice | amount of price paid in cash |
| | (not applicable) | total.changeprice | amount of change in cash |
| | (not applicable) | total.creditcardprice | amount of price paid in credit/debit card |
| | (not applicable) | total.emoneyprice | amount of price paid in emoney, point |
| | (not applicable) | total.menuqty_cnt | total count of quantity |
| | (not applicable) | total.menutype_cnt | total count of type of menu |
| | (not applicable) | total.total_etc | others |
| | (not applicable) | total.total_price | total price |

Each entity (e.g. menu.cnt) is part of a category (e.g. menu). You are to extract the entities from the receipt and return in the following format:

```json
{{
  "menu": <dictionary or list of dictionaries>,
  "sub_total": <dictionary or list of dictionaries>,
  "total": <dictionary or list of dictionaries>
}}
```

Note the following characteristics:
1. All entities falling under the same category should be grouped together (represented as a dictionary, such as {total.cashprice, total.changeprice, ...}).
2. If there are multiple entities of the same category, they should be represented as a list of dictionaries.
3. If an entity is not present in the receipt, it should be excluded from the dictionary.
4. Each of the entity's value should either be a string or a list of strings.
5. Note that menu.sub represents a sub-category of the menu category. As such, all entities under menu.sub should be grouped together (either dictionary or list of dictionaries) under the same menu group.
6. You are to respond in JSON format only and ensure that the keys in the dictionary are exactly the same as the entities provided above.
7. If you are unable to extract any information, please return an empty list for that category.

Here is an example of the expected return format:

```json
{
  "menu": [
    {
      "menu.nm": "SPGTHY BOLOGNASE",
      "menu.cnt": "1",
      "menu.price": "58,000"
    },
    {
      "menu.nm": "PEPPER AUS",
      "menu.cnt": "1",
      "menu.price": "165,000",
      "menu.sub": {
        "menu.sub_nm": "WELL DONE"
      }
    },
    {
      "menu.nm": "WAGYU RIBEYE",
      "menu.cnt": "1",
      "menu.price": "195,000",
      "menu.sub": {
        "menu.sub_nm": "MEDIUM WELL"
      }
    }
  ],
  "sub_total": {
    "sub_total.subtotal_price": "503,000",
    "sub_total.service_price": "25,150",
    "sub_total.tax_price": "52,815"
  },
  "total": {
    "total.total_price": "580,965"
  }
}
```

C Automation Trade-off Analysis Metric

Prior metrics, including KIEval Aligned defined above, evaluate Document KIE models without considering the full pipeline of Document KIE applications. An RPA system commonly employs a human-correction stage after model inference. Specifically, the RPA system uses the confidence scores of the entities extracted by a Document KIE model and, with a certain threshold $\tau$ on the confidence score, identifies which entities require further manual verification and correction. We assume that human correction is conducted only on the predictions with lower confidence scores and involves only substitution and deletion, without any addition operations, because addition usually requires examining all predictions and ground truths, making the correction process and the RPA system inefficient.

To illustrate the formulation, let $c(pr_{n,i})$ be the confidence score of $pr_{n,i}$, where $pr_{n,i}$ indicates the $i$-th entity in the $n$-th predicted group. $PR_{<\tau}$ is the set of predictions whose confidence scores are less than the threshold $\tau$. Since only $PR_{<\tau}$ is reviewed among the total $PR$, the automation rate of the RPA system can be defined as follows:

$$\text{auto-rate}_{\tau} = 1 - \frac{|PR_{<\tau}|}{|PR|}. \tag{1}$$

If the automation rate becomes close to 0 with a high $\tau$, the system becomes inefficient but its output becomes accurate. When the automation rate is close to 1 with a sufficiently low $\tau$, the system becomes efficient, but at the cost of potentially containing incorrect predictions because the human-correction stage is skipped.

To control the trade-off between system efficiency and accuracy, we introduce $\text{KIEval}^{\tau}_{\text{Aligned}}$, which evaluates the accuracy of the RPA automation system with the human-correction stage. The evaluation assumes that there is no human error in the correction stage and that the errors in $PR_{<\tau}$ are revised only with substitution and deletion operations. After the correction process, the remaining errors can be categorized into $\text{Subs}_{\tau}$, $\text{Del}_{\tau}$, and $\text{Add}$. $\text{Subs}_{\tau}$ and $\text{Del}_{\tau}$ denote the errors present in predictions with confidence scores higher than $\tau$, while $\text{Add}$ represents the number of required entities missing from $PR$. With the remaining error counts, $\text{KIEval}^{\tau}_{\text{Aligned}}$ can be calculated as follows:

$$\text{KIEval}^{\tau}_{\text{Aligned}} = 1 - \frac{\text{Subs}_{\tau} + \text{Del}_{\tau} + \text{Add}}{N(PR^{*}) + \text{Add}}, \tag{2}$$

where $N(PR^{*})$ indicates the number of predictions, $PR^{*}$, after the human-correction stage. The denominator includes $\text{Add}$ to represent the total number of entities of the system output, including the entities missing in $PR^{*}$. Through $\text{auto-rate}_{\tau}$ and $\text{KIEval}^{\tau}_{\text{Aligned}}$, the automation efficiency and accuracy of the RPA system can be measured by adjusting the confidence threshold $\tau$, facilitating their trade-off analysis in Document KIE.
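Equations (1) and (2) can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not the authors' implementation: each prediction is reduced to a confidence score plus a hypothetical error label ("ok", "subs" for a value that needs substitution, "del" for a spurious entity that needs deletion), a prediction whose confidence equals the threshold is treated as not reviewed, and reviewed "del" predictions are assumed to be removed from PR*. The names `Prediction`, `auto_rate` and `kieval_aligned_tau`, and the toy data, are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    confidence: float
    status: str  # hypothetical label: "ok", "subs" or "del"


def auto_rate(preds, tau):
    """Eq. (1): share of predictions that skip human review (confidence >= tau)."""
    reviewed = sum(1 for p in preds if p.confidence < tau)
    return 1 - reviewed / len(preds)


def kieval_aligned_tau(preds, num_add, tau):
    """Eq. (2): post-correction score when predictions below tau are fixed by a human."""
    kept = [p for p in preds if p.confidence >= tau]
    subs_tau = sum(1 for p in kept if p.status == "subs")
    del_tau = sum(1 for p in kept if p.status == "del")
    # Assumption: reviewed spurious predictions are deleted, so PR* shrinks by that count.
    removed = sum(1 for p in preds if p.confidence < tau and p.status == "del")
    n_after = len(preds) - removed
    return 1 - (subs_tau + del_tau + num_add) / (n_after + num_add)


# Toy data: six predictions and one ground-truth entity the model missed (Add = 1).
predictions = [
    Prediction(0.95, "ok"),
    Prediction(0.90, "ok"),
    Prediction(0.80, "subs"),
    Prediction(0.60, "ok"),
    Prediction(0.40, "del"),
    Prediction(0.30, "subs"),
]
NUM_ADD = 1

for tau in (0.0, 0.5, 1.0):
    print(
        f"tau={tau:.1f} "
        f"auto-rate={auto_rate(predictions, tau):.3f} "
        f"KIEval_tau_Aligned={kieval_aligned_tau(predictions, NUM_ADD, tau):.3f}"
    )
```

On this toy data, raising the threshold lowers the automation rate while raising the post-correction score, mirroring the trend described for Fig. 7. The score does not reach 1 even when everything is reviewed, because Add errors are never corrected under the stated assumption.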

---