loader
Generating audio...

arxiv

Paper 2503.04506

Multi-modal Summarization in Model-Based Engineering: Automotive Software Development Case Study

Authors: Nenad Petrovic, Yurui Zhang, Moaad Maaroufi, Kuo-Yi Chao, Lukasz Mazur, Fengjunjie Pan, Vahid Zolfaghari, Alois Knoll

Published: 2025-03-06

Abstract:

Multimodal summarization integrating information from diverse data modalities presents a promising solution to aid the understanding of information within various processes. However, the application and advantages of multimodal summarization have not received much attention in model-based engineering (MBE), where it has become a cornerstone in the design and development of complex systems, leveraging formal models to improve understanding, validation and automation throughout the engineering lifecycle. UML and EMF diagrams in model-based engineering contain a large amount of multimodal information and intricate relational data. Hence, our study explores the application of multimodal large language models within the domain of model-based engineering to evaluate their capacity for understanding and identifying relationships, features, and functionalities embedded in UML and EMF diagrams. We aim to demonstrate the transformative potential benefits and limitations of multimodal summarization in improving productivity and accuracy in MBE practices. The proposed approach is evaluated within the context of automotive software development, while many promising state-of-art models were taken into account.

Paper Content:
Page 1: Multi-modal Summarization in Model-Based Engineering: Automotive Software Development Case Study Nenad Petrovic1, Yurui Zhang1, Moaad Maaroufi1, Kuo-Yi Chao1, Lukasz Mazur1, Fengjunjie Pan1, Vahid Zolfaghari1, and Alois Knoll1 Technical University of Munich, Robotics, Artificial Intelligence and Real-Time Systems, Munich, Germany, nenad.petrovic@tum.de, yurui.zhang@tum.de, moaad.maaroufi@tum.de, kuoyi.chao@tum.de, lukasz.mazur@tum.de, f.pan@tum.de, v.zolfaghari@tum.de, knoll@in.tum.de Abstract. Multimodal summarization integrating information from di- verse data modalities presents a promising solution to aid the under- standing of information within various processes. However, the appli- cation and advantages of multimodal summarization have not received much attention in model-based engineering (MBE), where it has be- come a cornerstone in the design and development of complex systems, leveraging formal models to improve understanding, validation and au- tomation throughout the engineering lifecycle. UML and EMF diagrams in model-based engineering contain a large amount of multimodal in- formation and intricate relational data. Hence, our study explores the application of multimodal large language models within the domain of model-based engineering to evaluate their capacity for understanding and identifying relationships, features, and functionalities embedded in UML and EMF diagrams. We aim to demonstrate the transformative poten- tial benefits and limitations of multimodal summarization in improving productivity and accuracy in MBE practices. The proposed approach is evaluated within the context of automotive software development, while many promising state-of-art models were taken into account. Keywords: multimodal summarization, model-based engineering (MBE), large language model (LLM) 1 Introduction Multimodal Large Language Models (MLLMs)[1] represent a significant advance- ment in artificial intelligence, extending the capabilities of traditional language models to process and generate data across multiple modalities, such as text, images, audio, and video. Unlike conventional large language models (LLMs)[2] that focus solely on textual information, MLLMs are designed to integrate and interpret diverse forms of data, enabling them to address complex, real-world challenges[3] where information is often transferred through a combination ofarXiv:2503.04506v1 [cs.SE] 6 Mar 2025 Page 2: 2 Nenad Petrovic et al. modalities. Consequently, the aviation and automotive industries are increas- ingly leveraging MLLMs to address complex real-world challenges and use cases in industrial design. In particular, Model-Based Engineering (MBE) demands that MLLMs accurately comprehend and handle systematic approaches for com- plex tasks such as requirements management, system analysis, design, validation, and verification[4][7] which are mostly presented by text descriptions and visual diagrams of Eclipse Modeling Framework (EMF)[5] and Unified Modeling Lan- guage (UML)[6] as in the automotive domain example of Centralized Car Server Metamodel from [7]. Enabling MLLMs to accurately analyze class-to-class re- lationships, as well as the properties and functionalities of classes from meta- modeling diagrams in UML and EMF, has become a key research focus in the industrial domains[17]. Although research on the capability of MLLMs in analyzing MBE diagrams remains rare attention (especially in automotive), the past three years have wit- nessed significant studies on MLLMs or Vision-Language Models (VLMs)[8][14] in domains such as multimodal summarization[13], multimodal Chain-of-Thought[9], textbook question answering[10][11][15] and diagram-based question answering[12][16], etc. However, a lot of information beneficial for maintainability and updates of older vehicles resides in different types of diagrams, which needs lots of time and effort to be analyzed by experts. Therefore, this study primarily explores whether existing techniques can be applied to the analysis of MBSE diagrams, with focus on automotive industry usage. Additionally, it examines the current challenges and limitations in this emerging field and presents the workflow aiming automated development of automotive software as one of the outcomes. Numer- ous state-of-the-art models are compared side-by-side, while the most promising one was used for proof-of-concept implementation shown towards the end of the paper. The rest of the paper has the following structure. Next section provides overview of related works, covering both MLLMs and approaches leveraging them for diagram prompting. Additionally, this section also gives tabular sum- mary of relevant state-of-art models, considering their parameter number and usage costs among other factors. The third section describes our experiment from automotive domain which was used for comparative evaluation of the selected models. The fourth section focuses on adoption of MLLMs within automotive software development toolchain. The fifth section shows results of evaluation for the selected models. Finally, the conclusion summarizes the main contributions achieved and aspects observed during evaluation. 2 Related Works 2.1 Multimodal Large Language Models Since the release of ChatGPT as a LLM in December 2022, the field of MLLMs has experienced explosive growth. Over the past two years, various AI technol- ogy companies and academic research institutions have developed and publicly Page 3: Multi-modal Summarization in Model-Based Engineering 3 released their own MLLM models. Notable examples include OpenAI’s GPT- 4 series[18], Google’s Gemini series[19], Meta’s Llama series[20], Anthropic’s Claude 3[21], Mistral’s Pixtral[22], xAI’s Grok[23], and others[24][25][26][27][28][29]. The methods for utilizing these MLLMs are highly diverse. Some offer interactive user interfaces or API access, others provide only limited test access. While some models are open-source, others are entirely closed-source. In certain cases, mod- els are limited to demos, making it impossible for external users to directly test or use them. Therefore, Table 1 presents an clear overview of currently available models that external users can directly access. It includes details such as model size, release date, development organization, access usage methods, open-source status, and whether the model requires payment. Recent researches[35][36] have explored MLLMs to generate sample UML diagrams from drawings. These models translate drawn visual elements into structured representations, enabling automated and efficient diagram-to-model conversions. This highlights the potential of MLLMs in bridging visual and for- mal representations, particularly in software engineering. 2.2 Diagram prompting and pre-processing Due to MLLM development and limitations in hardware resources, many re- search institutions are unable to leverage MLLMs for complex diagram im- age processing. Therefore, early research on diagram understanding in geom- etry problem-solving focused on integrating visual and textual information to enhance reasoning capabilities. The G-ALIGNER model[40] proposed in 2014 introduced a method for diagram understanding by combining visual element detection with textual alignment through submodular optimization, enabling accurate identification and alignment of geometric primitives with correspond- ing textual descriptions. Building on this foundation, the Weakly Supervised Learning for Textbook Question Answering (WSTQ) framework[15] in 2022 utilized weak supervision from text retrieval and object detection to develop text matching and rela- tion detection tasks, significantly improving accuracy on the CK12-QA[44] and AI2D[32] datasets through multitask learning. In the same year, PGDP-Net[41] was introduced as an end-to-end solution for plane geometry diagram parsing, employing a modified instance segmentation method for geometric primitive ex- traction and a Graph Neural Network (GNN)[45] for relation parsing, supported by the comprehensive PGDP5K dataset. In 2023, the Multimodal Chain-of- Thought (MCoT) framework[9] proposed a two-stage reasoning approach that separates rationale generation from answer inference, effectively mitigating hal- lucinations and achieving state-of-the-art performance on multimodal reasoning tasks such as ScienceQA[42] and A-OKVQA[43]. Continuing this progression, the CoG-DQA framework[16] introduced in 2024 leverages Large Language Models (LLMs) to guide Diagram Parsing Tools (DPTs) through a chain-of-guiding mechanism, integrating visual parsing with domain- specific knowledge to enhance diagram question answering tasks[31][32]. Most recently, the DiagramQG[30] dataset and its Hierarchical Knowledge Integration Page 4: 4 Nenad Petrovic et al. framework (HKI-DQG) were developed to generate concept-focused educational questions from diagrams, utilizing advanced vision-language models to surpass existing methods in educational question generation tasks. Collectively, these studies demonstrate a clear trajectory of progress in multi- modal reasoning and diagram understanding, highlighting the growing effective- ness of integrating visual and textual information in complex reasoning tasks. Table 1. A summary of commonly used MLLMs. Model Release DateOrganization Parameter Size (B)Access Usage MethodsCost Grok-2 Dec-2024 xAI >314 Website & API Paid Emu3 Sep-2024 BAAI 8 Open Source (Code & Model)Free Llama 3.2 Sep-2024 Meta 11 / 90 Open Source (Code & Model)Free Pixtral-12B Sep-2024 Mistral 12 Website & API Paid Llama 3 Jul-2024 Meta 8 / 70 / 405 Open Source (Code & Model)Free InternVL2 July-2024 OpenGVLab 8 Open Source (Code & Model)Free xGen-MM (BLIP-3)Aug-2024 Salesforce AI 4 Open Source (Code & Model)Free Chameleon May-2024 Meta 7 / 34 Open Source (Code & Model)Free GPT-4o May-2024 OpenAI unknown Website & API Paid Claude 3 Mar-2024 Anthropic >175 Website & API Paid Grok-1 Mar-2024 xAI 314 Open Source (Code & Model)Free Gemini 1.5 Feb-2024 Google unknown Website & API Paid Fuyu-8B Oct-2023 Adept 8 Open Source (Code & Model)Free PaLI-3 Oct-2023 Google Deep- Mind2 / 3 / 5 Open Source (Code & Model)Free GPT-4V Sep-2023 OpenAI unknown Website & API Paid LaVIT Sep-2023 Peking Univer- sity7 Open Source (Code & Model)Free Emu1 Jul-2023 BAAI 14 Open Source (Code & Model)Free UnIVAL Jul-2023 Sorbonne Uni- versity0.25 Open Source (Code & Model)Free KOSMOS-2 Jun-2023 Microsoft Re- search7 Open Source (Code & Model)Free GPT-4 Mar-2023 OpenAI unknown Website & API Paid Page 5: Multi-modal Summarization in Model-Based Engineering 5 3 Experiment 3.1 Research Questions Building upon the exploration of MLLMs and VLMs for diagram recognition, this subsection investigates the recognition capabilities of MLLMs in the auto- motive manufacturing and autonomous driving industry, specifically focusing on complex UML class diagrams for automotive components. We propose the following research questions inspired by [35]: – RQ1: Can MLLMs accurately identify all categories of automobile compo- nents depicted in a UML class diagram? – RQ2: Are MLLMs capable of recognizing and understanding the functional descriptions of automotive components within the UML class diagram? – RQ3: Can MLLMs correctly classify automotive components into their ap- propriate categories based on the relationships and descriptions in the UML class diagram? – RQ4: Can MLLMs accurately identify and map the relationship chains be- tween automotive components as illustrated in the UML class diagram? – RQ5: Can MLLMs accurately detect the differences between the most sim- ilar UML class diagrams? These research questions aim to assess the potential of MLLMs in analyzing and transforming automotive UML diagrams into semantically correct machine- readable formats. The focus is on testing MLLMs’ abilities in recognizing struc- tural, relational, and functional information, which is critical for the automotive manufacturing and autonomous driving sectors. 3.2 Experimental Setup To correspond with the research questions mentioned above, we designed five questions based on visually represented model with respect to [7]: – Q1: Given a UML diagram about Centralized Car Server Metamodel, list all classes in this UML diagram – Q2: Given a UML diagram about Centralized Car Server Metamodel, list all properties and functions in processing node class – Q3: Given a UML diagram about Centralized Car Server Metamodel, is FPGA one of the Co-Processor? (A)/ is the camera sensor? (B). – Q4: Given a UML diagram about Centralized Car Server Metamodel, list all classes on the relation chain between camera and component (A)/ list all subclasses that processing task class has (B). – Q5: Given twoUML diagrams about Centralized Car Server Metamodel, what are the differences between these diagrams? Page 6: 6 Nenad Petrovic et al. In Q5, we remove the GPU, FPGA, and TPU classes displayed in the UML diagram, leaving blank spaces, and assign the MLLMs to identify the differences between these two UML diagrams. We also defined ground truths (GT) for each of the five different questions to better evaluate the performance of the MLLMs. – GT1: The MLLMs are required to identify the names of all classes as well as the total number of classes. – GT2: The MLLMs are required to accurately list the names and types of attributes and functions within the Processing Node class. – GT3: The MLLMs only need to respond with a ”yes” or ”no” for question AandB. – GT4: The MLLMs need to list the correct four class names in the relation- ship chain between Camera and Component (A)and the three subclasses of the Process Task class (B). – GT5: The MLLMs need to identify that the second UML class diagram, which has been manually modified, lacks the GPU, FPGA and TPU classes. Due to hardware limitations, we evaluated mostly the performance of web- based MLLMs. During the testing, we initiated a new conversation, uploaded model diagram based on [7], and sequentially asked five questions. The perfor- mance of the MLLMs on these five questions was manually assessed based on ground truths. 4 Evaluation In Table 2, the results of MLLMs on the five questions are presented. For Q1, Q3 and Q4B, almost all MLLMs were able to provide completely correct an- swers. However, none of the MLLMs detected differences and changes between the two UML diagrams in answering Q5. Furthermore, different models exhib- ited significant capability differences for Q2. For example, Claude-3.5 provided a completely correct answer, while Gemini-2.0 gave an entirely incorrect response. Other models could answer correctly to varying extents, but many of their an- swers still contained errors and hallucinations. When asked about the relation- ship chain between the Camera and Component classes (Q4B), the results from MLLMs showed significant inconsistency. Some models were able to provide a completely accurate relationship chain, while others generated responses that were merely plausible, but ultimately hallucinated. Based on this results, we found that current MLLMs can perfectly identify non-complex content in UML diagrams, such as class names, the number of classes, and simple class inheritance relationships. But for more complex ques- tions, such as those involving class attributes and functions as well as relation chains, many models lack the capability to provide correct answers. Most notably, MLLMs entirely lack the ability to recognize differences between two similar but distinct UML diagrams. However, it can be noticed that InternVL2-8B-MPO model which is free to use exhibits capability to answer correctly in case of all Page 7: Multi-modal Summarization in Model-Based Engineering 7 the question templates, which is making it suitable for the considered automotive use case. Based on our findings, InternVL2-8B-MPO is very good at processing and reasoning across metamodels, as it manages to understand the hierarchies and attributes and relationships between the components of the metamodel, which is quite impressive with only 8B parameters for the connection layer be- tween the linguistic and visual encoder (visual encoder has 6B and the linguistic encoder has 13B parameters; overall the model has 6+8+13=27B parameters). Additionally, this model is also deployable locally without relying on external providers and services, which is of utmost importance in automotive industry, as companies usually apply policies which prevent sending data to third parties outside the boundaries of the underlying organization. In summary, Table 2 highlights the strengths and limitations of current MLLMs in answering UML diagram-related questions, providing clear directions for improvement in future research on this topic. Table 2. The results of MLLMs performance on 5 questions related to [7]. Model Q1 Q2 Q3 Q4 Q5 Grok-2 28/29 Partially correct with much hallu- cinationA&B correct Only B correct No correct differ- ence detected Pixtral-12B 29/29 Mostly correct with lacking 2 attributes and 1 functionA&B correct Only B correct No correct differ- ence detected Claude-3.5 29/29 Totally correct A&B correct A&B correct No correct differ- ence detected Gemini-2.0 28/29 Totally wrong A&B correct A&B correct No correct differ- ence detected Gemini-1.5 29/29 Partially correct with much hallu- cinationA&B correct A&B correct No correct differ- ence detected GPT-4o 28/29 Mostly correct with 1 wrong attributeA&B correct Only B correct No correct differ- ence detected GPT-o1 29/29 Mostly correct with 1 wrong attribute and lacking 1 at- tributeA&B correct A&B correct No correct differ- ence detected GPT-4 29/29 Mostly correct with few halluci- nationA&B correct A&B correct No correct differ- ence detected GPT-4o-mini 29/29 Partially correct with much hallu- cinationA&B correct Only B correct No correct differ- ence detected InternVL2-8B-MPO 29/29 Totally correct A&B correct A&B correct Correct difference detected Qwen2 VL 7B 24/29 Correct Proper- ties, hallucinated functionsA&B correct Only B correct Correct difference detected Page 8: 8 Nenad Petrovic et al. 5 Usage in Automotive Fig. 1. MLLM-based diagram prompting for product updates identification. In some domains, such as automotive, products may require continuous itera- tive updates through their life cycle. Each product is associated with a significant amount of requirements, expressed as either parameter information, diagrams (such as UML-alike representations in automotive), tables, and operational in- structions. During every product update or modification, a corresponding series of related artifacts (such as documentation, configuration and software code) must be changed accordingly. as illustrated in the Fig. 1. In this process, a large volume of product information needs to be compared and summarized to provide a clear overview of updates. In automotive, maintainability and updateability represent challenges due to strict standardization which requires lot of time and efforts, slowing down the innovation. To reduce the labor costs associated with related tasks, we in- troduce automated approach leveraging MLLMs to perform multimodal infor- mation summarization, as depicted in Fig. 2. In the first steps, user specifies changes of the requirements or new requirements, either as freeform textual de- scription or providing diagram representation of system instance with respect to pre-defined metamodel. Considering the current system representation, up- dates are identified using MLLM, such as addition of new sensors, actuators or their improvement (camera resolution increase). In the next step, the changes detected as outcome of multimodal summarization are further leveraged as input of LLM-based code generation workflow, targeting CARLA simulation environ- ment. Before the actual code generation, the new requirements are taken into account for updated model instance creation. Moreover, this updated model in- Page 9: Multi-modal Summarization in Model-Based Engineering 9 stance is checked for automotive compliance with respect to given ISO standard, making use of Object Constraint Language (OCL) rules. In case that addition of new requirements leads to system model instance which is not compliant, feedback to the user generated. In that case, user can understand which part of new requirements is not compliant and should be corrected. For model instance creation and feedback generation, we make use of Llama3-8B-Instruct LLM. Fig. 2. The workflow of MLLM-based image prompting for automotive scenarios. Services for tasks performing image prompting are based on OpenGVLab/ InternVL2-8B-MPO [38] and deployed as web application relying on Quart [39] framework for Python. It can also be deployed using lmdeploy [46] which is a toolkit for compressing, deploying, and serving LLM. Despite that there are variants with more parameters, we select the one with 8B as it is the smallest one which answers to all the questions correctly, while runnable on lower hardware configurations quantized to 4 bit. The deployment was done in Google Colab environment with the lowest Pro subscription plan, as more than 20GB of VRAM was required for execution, which is above the limit of free account. Based on our experience, the lowest configuration offered by Google Colab able to run it was L4 GPU-based, where 22GB of VRAM were occupied. In Table 3, an overview of API and average execution times in seconds for the described L4- based deployment in Google Colab is given. It can be noticed that execution time has order of magnitude of second. Moreover, we can also identify that the detection of differences between diagrams has the slowest execution time, which was expected considering the fact that it includes prompting of both diagrams, so additional processing was needed compared to other cases. As the authors mention in [38], the model uses a substantial middleware layer between the visual encoder and the LLM unlike than typical lightweight connectors most MLLMs use. Furthermore, the MPO (Mixed Preference Opti- mization) training which mixes three losses (preference ranking, quality checks, Page 10: 10 Nenad Petrovic et al. and generation guidance) allows to enhance the model’s reasoning capabilities and reduce hallucinations. Table 3. Automotive diagram prompting service REST API and execution times. Endpoint Parameters Description Execution time [s] /extractSensors Diagram Returns the list of all sensors from the dia- grams2.86 /extractActuators Diagram Returns the list of all ac- tuators from the diagram2.64 /extractElement PropertiesDiagram Element nameReturns the list of prop- erties and values for given element (such as sensor or actuator)3.08 /detectDifferences Current diagram New diagramReturns the list of dif- ferences between two diagrams, considering sensors, actuators, their properties, interfaces and parameter values9.13 6 Conclusion In this work, we considered some of the promising MLLM solutions currently available, together with the frameworks and techniques related to the processing of scientific diagrams. Subsequently, we designed five types of research questions for UML diagrams and proposed five example questions along with their corre- sponding ground truths based on the UML class diagram we have. Using these questions, we tested the web-based MLLMs currently available and evaluated their performance. On the other side, we also show a proof-of-concept imple- mentation of the MLLM-based web service relying on the Intern-VL model. Future work should be drawn attention to enhancing MLLMs’ ability to process set of large and complex UML diagrams describing singular system, par- ticularly in understanding class attributes and functions for purpose of complex scenarios targeting code generation in automotive domain, focusing on hardware abstraction aspects. Meanwhile, improving models’ capacity to detect and inter- pret subtle differences between similar UML diagrams is a critical direction. In addition, developing new benchmarks and datasets that target these challenges in automotive will also be essential to driving progress in this area. Page 11: Multi-modal Summarization in Model-Based Engineering 11 7 Acknowledgment This work has received funding from the European Chips Joint Undertaking un- der Framework Partnership Agreement No 101139789 (HAL4SDV) including the national funding from the German Federal Ministry of Education and Research (BMBF) under grant number 16MEE00471K. The responsibility for the content of this publication lies with the authors. References 1. Zhang, D., Yu, Y., Dong, J., Li, C., Su, D., Chu, C., Yu, D.: MM-LLMs: Recent Advances in MultiModal Large Language Models. arXiv preprint arXiv:2401.13601 (2024). https://arxiv.org/abs/2401.13601 2. Zhao, W.X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., et al.: A Survey of Large Language Models. arXiv preprint arXiv:2303.18223 (2024). https://arxiv.org/abs/2303.18223 3. Liang, C.X., Tian, P., Yin, C.H., et al.: A Comprehensive Survey and Guide to Multimodal Large Language Models in Vision-Language Tasks. arXiv preprint arXiv:2411.06284 (2024). https://arxiv.org/abs/2411.06284 4. Pan, F., Zolfaghari, V., Wen, L., Petrovic, N., Lin, J., Knoll, A.: Generative AI for OCL Constraint Generation: Dataset Collection and LLM Fine-tuning. In: IEEE International Symposium on Systems Engineering (ISSE), pp. 1–8. IEEE (2024). https://doi.org/10.1109/ISSE63315.2024.10741141 5. Steinberg, D., Budinsky, F., Paternostro, M., Merks, E.: EMF: Eclipse Modeling Framework. Pearson Education, Boston (2008) 6. Pilone, D., Pitman, N.: UML 2.0 in a Nutshell. O’Reilly Media, Sebastopol (2005) 7. Petrovic, N., Fengjunjie, P., Lebioda, K., Zolfaghari, V., Kirchner, S., Purschke, N., Khan, M. A., Vorobev, V., Knoll, A.: Synergy of Large Language Model and Model Driven Engineering for Automated Development of Centralized Vehicular Systems. arXiv preprint arXiv:2404.05508 (2024). https://arxiv.org/abs/2404.05508 8. Zhang, G., Zhang, Y., Zhang, K., Tresp, V.: Can Vision-Language Models Be a Good Guesser? Exploring VLMs for Times and Location Reasoning. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 636–645 (2024) 9. Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal Chain- of-Thought Reasoning in Language Models. arXiv preprint arXiv:2302.00923 (2023) 10. Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.W., Zhu, S.C., Tafjord, O., Clark, P., Kalyan, A.: Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. Advances in Neural Information Processing Systems, vol. 35, pp. 2507–2521 (2022) 11. Tan, C., Wei, J., Sun, L., Gao, Z., Li, S., Yu, B., Guo, R., Li, S.Z.: Retrieval Meets Reasoning: Even High-School Textbook Knowledge Benefits Multimodal Reasoning. arXiv preprint arXiv:2405.20834 (2024) 12. Wang, S., Zhang, L., Luo, X., Yang, Y., Hu, X., Qin, T., Liu, J.: Computer Science Diagram Understanding with Topology Parsing. ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 16, no. 6, pp. 1–20. ACM New York, NY (2022) 13. Jangra, A., Mukherjee, S., Jatowt, A., Saha, S., Hasanuzzaman, M.: A Survey on Multi-Modal Summarization. ACM Computing Surveys, vol. 55, no. 13s, pp. 1–36. ACM New York, NY (2023) Page 12: 12 Nenad Petrovic et al. 14. Che, C., Lin, Q., Zhao, X., Huang, J., Yu, L.: Enhancing Multimodal Understand- ing with CLIP-Based Image-to-Text Transformation. In: Proceedings of the 2023 6th International Conference on Big Data Technologies, pp. 414–418 (2023) 15. Ma, J., Chai, Q., Huang, J., Liu, J., You, Y., Zheng, Q.: Weakly Supervised Learn- ing for Textbook Question Answering. IEEE Transactions on Image Processing, vol. 31, pp. 7378–7388. IEEE (2022) 16. Wang, S., Zhang, L., Zhu, L., Qin, T., Yap, K.H., Zhang, X., Liu, J.: CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question An- swering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13969–13979 (2024) 17. Geipel, M.M.: Towards a Benchmark of Multimodal Large Language Models for Industrial Engineering. In: 2024 IEEE 29th International Conference on Emerging Technologies and Factory Automation (ETFA), pp. 1–4. IEEE (2024) 18. OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Leoni Ale- man, F., et al.: GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2024). https://arxiv.org/abs/2303.08774 19. Gemini Team, Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., et al.: Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv preprint arXiv:2403.05530 (2024). https://arxiv.org/abs/2403.05530 20. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al.: The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783 (2024). https://arxiv.org/abs/2407.21783 21. Anthropic Inc.: The Claude 3 Model Family: Opus, Son- net, Haiku. Semantic Scholar, CorpusID: 268232499 (2024). https://api.semanticscholar.org/CorpusID:268232499 22. Agrawal, P., Antoniak, S., Bou Hanna, E., Bout, B., Chaplot, D., Chud- novsky, J., et al.: Pixtral 12B. arXiv preprint arXiv:2410.07073 (2024). https://arxiv.org/abs/2410.07073 23. xAI Inc.: Open Release of Grok-1. xAI Blog (2024). https://x.ai/blog/grok-os 24. Wang, X., Zhang, X., Luo, Z., Sun, Q., Cui, Y., Wang, J., et al.: Emu3: Next-Token Prediction is All You Need. arXiv preprint arXiv:2409.18869 (2024). https://arxiv.org/abs/2409.18869 25. Xue, L., Shu, M., Awadalla, A., Wang, J., Yan, A., Purushwalkam, S., et al.: xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. arXiv preprint arXiv:2408.08872 (2024). https://arxiv.org/abs/2408.08872 26. Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., Ta¸ sırlar, S.: Fuyu-8B: A Multimodal Architecture for AI Agents. Adept AI Blog (2023). https://www.adept.ai/blog/fuyu-8b 27. Jin, Y., Xu, K., Chen, L., Liao, C., Tan, J., Huang, Q., et al.: Unified Language- Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization. arXiv preprint arXiv:2309.04669 (2024) 28. Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos- 2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824 (2023). https://arxiv.org/abs/2306.14824 29. OpenGVLab: InternVL Family: Closing the Gap to Commercial Multimodal Mod- els with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT- 4oOpen Release of Grok-1 (2024). https://github.com/OpenGVLab/InternVL 30. Zhang, X., Zhang, L., Wu, Y., Huang, M., Wu, W., Li, B., Wang, S., Liu, J.: DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams. arXiv preprint arXiv:2411.17771 (2024). https://arxiv.org/abs/2411.17771 Page 13: Multi-modal Summarization in Model-Based Engineering 13 31. Wang, S., Zhang, L., Yang, Y., Hu, X., Qin, T., Wei, B., Liu, J.: CSDQA: Diagram Question Answering in Computer Science. In: Knowledge Graph and Semantic Com- puting: Knowledge Graph Empowers New Infrastructure Construction: 6th China Conference, CCKS 2021, Guangzhou, China, November 4-7, 2021, Proceedings, vol. 6, pp. 274–280. Springer (2021) 32. Hiippala, T., Alikhani, M., Haverinen, J., Kalliokoski, T., Logacheva, E., Orekhova, S., Tuomainen, A., Stone, M., Bateman, J.A.: AI2D-RST: A Multimodal Corpus of 1000 Primary School Science Diagrams. Language Resources and Evaluation, vol. 55, pp. 661–688. Springer (2021) 33. Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping Language-Image Pre- Training with Frozen Image Encoders and Large Language Models. In: International Conference on Machine Learning, pp. 19730–19742. PMLR (2023) 34. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sas- try, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning Transferable Visual Models From Natural Language Supervision. arXiv preprint arXiv:2103.00020 (2021). https://arxiv.org/abs/2103.00020 35. Conrardy, A., Cabot, J.: From Image to UML: First Results of Image-Based UML Diagram Generation Using LLMs. arXiv preprint arXiv:2404.11376 (2024). https://arxiv.org/abs/2404.11376 36. Rossi, R.: The Importance of Visual Modelling Languages in Generative Software Engineering. arXiv 37. HuggingFace:Meta-Llama-3-8B-Instruct.https://huggingface.co/meta- llama/Meta-Llama-3-8B-Instruct 38. HuggingFace: OpenGVLab/InternVL2-8B. https://huggingface.co/OpenGVLab/InternVL2- 8B 39. Pallets Project: Quart. https://quart.palletsprojects.com/en/latest/ 40. Seo, M. J., Hajishirzi, H., Farhadi, A., Etzioni, O.: Diagram understanding in ge- ometry questions. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 28, no. 1 (2014). 41. Zhang, M.-L., Yin, F., Hao, Y.-H., Liu, C.-L.: Plane geometry diagram parsing. arXiv preprint arXiv:2205.09363 (2022). https://arxiv.org/abs/2205.09363 42. Saikh, T., Ghosal, T., Mittal, A., Ekbal, A., Bhattacharyya, P.: ScienceQA: A novel resource for question answering on scholarly articles. International Journal on Digital Libraries, vol. 23, no. 3, pp. 289–301 (2022). Springer. 43. Schwenk, D., Khandelwal, A., Clark, C., Marino, K., Mottaghi, R.: A-OKVQA: A benchmark for visual question answering using world knowledge. In: European Conference on Computer Vision, pp. 146–162 (2022). Springer. 44. Ma, J., Chai, Q., Liu, J., Yin, Q., Wang, P., Zheng, Q.: XTQA: Span-level Expla- nations for Textbook Question Answering. *IEEE Transactions on Neural Networks and Learning Systems* (2023). IEEE. 45. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., Yu, S.: A Comprehensive Survey on Graph Neural Networks. *IEEE Transactions on Neural Networks and Learning Systems*, **32**(1), 4–24 (2020). IEEE. 46. LMDeploy Contributors: LMDeploy: A Toolkit for Compressing, De- ploying, and Serving LLM. GitHub repository (2023). Available: https://github.com/InternLM/lmdeploy

---