arxiv

Paper 2412.11683

Multimodal LLM for Intelligent Transportation Systems

Authors: Dexter Le, Aybars Yunusoglu, Karn Tiwari, Murat Isik, I. Can Dikmen

Published: 2024-12-16

Abstract:

In the evolving landscape of transportation systems, integrating Large Language Models (LLMs) offers a promising frontier for advancing intelligent decision-making across various applications. This paper introduces a novel 3-dimensional framework that encapsulates the intersection of applications, machine learning methodologies, and hardware devices, particularly emphasizing the role of LLMs. Instead of using multiple machine learning algorithms, our framework uses a single, data-centric LLM architecture that can analyze time series, images, and videos. We explore how LLMs can enhance data interpretation and decision-making in transportation. We apply this LLM framework to different sensor datasets, including time-series data and visual data from sources like Oxford Radar RobotCar, D-Behavior (D-Set), nuScenes by Motional, and Comma2k19. The goal is to streamline data processing workflows, reduce the complexity of deploying multiple models, and make intelligent transportation systems more efficient and accurate. The study was conducted using state-of-the-art hardware, leveraging the computational power of NVIDIA RTX 3060 GPUs and Intel i9-12900 processors. The experimental results demonstrate that our framework achieves an average accuracy of 91.33% across these datasets, with the highest accuracy observed in time-series data (92.7%), showcasing the model's proficiency in handling sequential information essential for tasks such as motion planning and predictive maintenance. Through our exploration, we demonstrate the versatility and efficacy of LLMs in handling multimodal data within the transportation sector, ultimately providing insights into their application in real-world scenarios. Our findings align with the broader conference themes, highlighting the transformative potential of LLMs in advancing transportation technologies.

Paper Content:
Dexter Le∗ (Drexel University, Philadelphia, USA; dql27@drexel.edu)
Aybars Yunusoglu∗ (Purdue University, West Lafayette, USA; ayunusog@purdue.edu)
Karn Tiwari (Indian Institute of Science, Bangalore, Bengaluru, India; karntiwari@iisc.ac.in)
Murat Isik (Stanford University, Stanford, USA; misik@stanford.edu)
I. Can Dikmen (Temsa Research & Development Center, Adana, Turkey; can.dikmen@temsa.com)
∗Corresponding author

Index Terms: Multimodal LLM, Intelligent Transportation, Sensor Data, Transportation System

I. INTRODUCTION

The rise of large language models (LLMs) has contributed significantly to natural language processing applications. LLMs offer great potential for systems that require intelligent decision-making, and they do so with great versatility. In transportation systems, making fast and accurate decisions is imperative. However, much of the cost of an LLM comes from its development and from aggregating and retaining viable data to enhance it. Data-centric AI procedures are therefore essential, since they combine many techniques for improving and maintaining datasets. These methodologies include data augmentation for diversification, data labeling, data reduction, and data maintenance [1].

Additionally, a healthy dataset helps produce a robust and resilient LLM that evolves alongside newly observed data. In transportation systems, robustness and resilience are essential goals for the LLM. The cost of developing LLMs is high; however, adopting these techniques can ease the burden of evolving the LLM and improve its continuity.
Using LLMs in transportation systems can speed up accurate and intelligent responses to specific situations. In a basic overview, an LLM is first pre-trained on data and then tuned to follow instructions. Its responses to these instructions are scored by a reward model defined over tasks. The LLM is then prompted, and the response it generates is taken as its output [2].

We propose a unified LLM encompassing datasets featuring time series and visuals. The novelty of the unified LLM lies in a framework that links the physical layer with the application layer through machine learning algorithms. These advancements are achieved through a single LLM architecture, which not only enhances the efficiency and accuracy of intelligent transportation systems but also facilitates real-time processing on edge devices. This architecture is particularly suitable for edge computing environments, where computational efficiency and reduced latency are critical, ensuring seamless performance in resource-constrained settings.

The main contributions of this paper concern a unified multimodal LLM framework for enhancing intelligent transport systems. Our primary contributions are:

• A unified multimodal LLM framework, a novel approach that reduces development complexity and increases performance for intelligent transport systems.
• Examination of datasets with differing data types (time series, audio, and video), detailing various use cases with the unified architecture.
• Analysis of the unified multimodal LLM framework's integration into devices such as GPUs and CPUs, with performance benchmarks.

arXiv:2412.11683v1 [cs.LG] 16 Dec 2024

The paper is organized as follows: Section 2 discusses the current methodologies for designing an intelligent transportation system, their challenges, and an analysis of current learning algorithms.
Section 3 outlines the datasets used, along with the associated hardware and learning algorithms. Section 4 describes the machine learning algorithms evaluated. Section 5 focuses on the summary of the proposed unified multimodal LLM framework. Section 6 discusses potential future work.

II. BACKGROUND

Transportation is a critical component of everyday life, and the rising price of commodities and the demand for vehicles accelerate the need for better transportation channels. Transportation also plays a vital role in economic development, since trade depends on moving goods and services [3]. Furthermore, transportation is significant for environmental sustainability, safety, and community engagement [4] [5]. The emergence of Intelligent Transportation Systems (ITS) has improved transportation by addressing existing issues, including but not limited to traffic safety, cost efficiency, comfort, and speed [6] [7]. The applications of ITS range from autonomous vehicles to traffic management, and the rise of IoT devices has accelerated the adoption of ITS [8]. Similarly, the emergence of cloud computing has further driven the integration of IoT devices with ITS to produce highly robust and resilient systems [9]. Because transportation is essential to everyday life and to safety, fostering a sustainable model should be paramount.

Despite the emergence of ITS, transportation still faces many challenges that impact livelihoods. These challenges range from the environment to health, safety, privacy, and accessibility. Environmental challenges stem from the byproducts of transportation, whether from transporting goods or from manufacturing the vehicles that carry them. A major environmental challenge is the emission of carbon dioxide into the atmosphere as a result of fuel combustion [10].
This environmental challenge also affects human health, as mass emissions reduce air quality and can be a factor in lifelong diseases. Safety plays a critical role in transportation systems, where some of the most significant challenges are general safety, sustainability, and autonomous transportation [11].

Figure 1 illustrates how the interactions within the Multimodal LLM Framework span three key dimensions: Modality, Models, and Hardware. Each modality, including time series, audio, and visual data, is represented as an input to the network, ensuring the framework effectively processes and analyzes diverse data formats across these dimensions. Each node then feeds the subsequent learning algorithms in the physical layer. Each node in the network, together with its interactions with other layers, shows the challenges the Multimodal LLM Framework faces and the tasks it must accomplish: traditional scaling results in an exponential increase in dependencies and complexity. The Multimodal LLM Framework aims to reduce scaling complexity while improving performance and abstracting the interactions between the input layer and the physical layer.

Fig. 1. Our Framework Depicted Across Three Dimensions: Data, Models, and Hardware.

III. METHODS

A. Dataset

The dataset utilized in this study is derived from time-series sensor data and encompasses a wide range of parameters related to vehicle performance, control systems, and environmental conditions. This rich dataset is designed to support the analysis of intelligent transportation systems and their interactions with the surrounding environment.

TABLE I. OVERVIEW OF DIFFERENT DATA TYPES IN THE DATASET.
| Data Type | Description | Applications |
| Time-Series Data | Continuous sensor readings over time, including vehicle speed, tire pressure, engine torque, etc. | Analyzing dynamic behavior, fault detection, predictive maintenance |
| Audio Data | Sound recordings from the vehicle's surroundings or internal systems. | Noise analysis, engine sound and environmental conditions assessment |
| Video Data | Recordings from onboard cameras capturing the vehicle's environment. | Object detection, lane-keeping assistance, environmental monitoring |

As Table I shows, the datasets comprise three primary types of data (time series, audio, and video), each serving distinct purposes in vehicular analysis and applications. Time-series data includes continuous sensor readings over time, such as vehicle speed, tire pressure, and engine torque, and is essential for analyzing dynamic behavior, fault detection, and predictive maintenance. The audio data consists of sound recordings taken from the vehicle's surroundings and internal systems, which can be used for noise analysis, engine sound evaluation, and evaluation of the vehicle's environment. In video data, onboard cameras record the vehicle's environment, which is used to detect objects, assist with lane-keeping, and monitor the environment. We intend to evaluate LLMs' ability to process and interpret multimodal sensor data by utilizing this dataset. Using LLMs, this research aims to improve understanding of complex vehicle-environment interactions, ultimately leading to advances in intelligent transportation.

Fig. 2. Sensor Processing Diagram. (Blocks: Input Time Series Data, Time Series Processing, Time Series Optimization Engine, Processed Time Series Signal; Input Audio Data, Audio Signal Processing, Audio Optimization Engine, Processed Audio Signal; Input Video Data, Video Frame Processing, Visual Signal Processing, Visual Optimization Engine, Processed Visual Signal; Processed Information.)
Figure 2 illustrates the sensor processing of the Multimodal LLM Framework, which integrates time-series, audio, and visual data. Time-series and audio inputs are processed by their respective optimization engines in a loopback mechanism. For visual data, the input video frames are preprocessed first, and the visual optimization engine then processes the preprocessed frames in its own loopback. The optimization engines use the AdamW optimizer and continuous learning loops. The output of each signal processing module is processed information. The framework provides a scalable solution that maintains efficiency without increasing complexity while supporting various data formats.

B. Overview of the Proposed Framework

The proposed framework integrates time-series, audio, and visual data processing using recent advances, and it illustrates the versatility of transformer-based architectures across a variety of data formats. The framework consists of three main components: time series analysis, audio classification, and visual data processing.

To classify structured tabular data, the framework implements a Transformer architecture for time series data, specifically BERT. The features in the tabular data are converted into a single text string, allowing the BERT tokenizer to process the data and capture complex relationships in the dataset [12]. The architecture comprises a pre-trained BERT model followed by a classification head fine-tuned on the target dataset. Training is optimized using the AdamW optimizer, and the model's performance is evaluated using accuracy metrics [13]. This approach enables the model to handle complex time series data effectively, providing robust performance on unseen data.

The framework's audio classification component leverages a pre-trained Wav2Vec2 model, fine-tuned on a specific audio dataset for classifying environmental audio files [14].
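The conversion of tabular features into a single text string, described above for the time-series component, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `row_to_text`, the `name: value` template, the rounding precision, and the sample feature names are all assumptions made for the example. The resulting string is what would then be handed to a BERT tokenizer.

```python
from typing import Mapping, Union

Number = Union[int, float]


def row_to_text(row: Mapping[str, Number], precision: int = 2) -> str:
    """Serialize one tabular sensor record into a single text string.

    The "name: value" template is a hypothetical choice; the paper only
    states that the features are converted into one string.
    """
    parts = []
    for name, value in row.items():
        if isinstance(value, float):
            # Round floats so the tokenizer sees short, stable numerals.
            parts.append(f"{name}: {value:.{precision}f}")
        else:
            parts.append(f"{name}: {value}")
    return ", ".join(parts)


# Hypothetical record using feature names mentioned in Table I.
sample = {"vehicle_speed": 42.5, "tire_pressure": 2.3, "engine_torque": 180}
print(row_to_text(sample))
# vehicle_speed: 42.50, tire_pressure: 2.30, engine_torque: 180
```

Keeping the feature name next to each value lets a text model associate numerals with the quantity they describe, at the cost of longer input sequences.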
For audio, processing begins with data preprocessing, where Wav2Vec2Processor converts the raw audio waveforms into a format suitable for the Wav2Vec2 model, including padding and normalization. An AudioDataset class is implemented to manage the audio data, ensuring correct label assignment and the necessary transformations. Because audio sequences vary in length, a custom batching (collate) function pads the sequences in each batch to the maximum sequence length, ensuring consistent input sizes for the model during training. The model is then fine-tuned using the cross-entropy loss and the AdamW optimizer, and evaluation is based on validation accuracy and loss metrics.

In the visual data processing component, video sequences are processed by extracting frames and generating textual descriptions. The descriptions are produced with vision-language models such as BLIP and CLIP. A language model such as T5 then refines or transforms the generated descriptions for specific natural language processing (NLP) tasks, like translation or summarization. This process enables visual and textual data integration, leveraging transfer learning to minimize the need for task-specific training. The entire visual processing pipeline is implemented using the Hugging Face Transformers library, ensuring flexibility and compatibility with different pre-trained models.

The multimodal framework shows how LLMs can be applied to a variety of data types, such as time series, audio, and visual, illustrating the model's adaptability and effectiveness. In addition to leveraging the contextual understanding provided by pre-trained models, the framework achieves high performance with minimal computational resources by transforming non-textual data into formats suitable for NLP models. Multimodal systems are particularly useful for complex tasks that take advantage of the rich, contextual representations provided by transformers.
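The per-batch padding of variable-length audio sequences described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the paper's code: the name `pad_batch`, the zero pad value, and the returned mask are choices made for the example. In a PyTorch training loop, a function of this kind would typically be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`.

```python
from typing import List, Sequence, Tuple


def pad_batch(
    waveforms: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[int]]]:
    """Pad every waveform to the longest length found in this batch.

    Returns the padded waveforms and a mask (1 = real sample, 0 = padding).
    """
    max_len = max((len(w) for w in waveforms), default=0)
    padded, mask = [], []
    for w in waveforms:
        n_pad = max_len - len(w)
        padded.append(list(w) + [pad_value] * n_pad)
        mask.append([1] * len(w) + [0] * n_pad)
    return padded, mask


# Three hypothetical waveforms of different lengths.
batch = [[0.1, 0.2, 0.3], [0.5], [0.4, 0.6]]
padded, mask = pad_batch(batch)
print(padded)  # [[0.1, 0.2, 0.3], [0.5, 0.0, 0.0], [0.4, 0.6, 0.0]]
print(mask)    # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Padding to the batch maximum rather than a global maximum keeps short batches small, and the mask lets the model ignore the padded positions.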
The framework is therefore applicable to a variety of domains, including predictive maintenance, audio event detection, and video analysis.

Figure 3 shows how the proposed Multimodal LLM Framework integrates time-series, audio, and visual data processing using transformer-based models. It begins with a Preprocessing Layer that includes Time-Series, Audio, and Visual Data Preprocessing modules, Error Detection and Handling, and Feature Extraction. The data is then processed using transformers that are specific to each modality: Time-Series Analysis (BERT), Audio Classification (Wav2Vec2), and Visual Data Processing (T5). Multimodal Data Integration is at the core of the framework; it provides seamless inter-modality communication and incorporates an LLM model feedback loop for continuous learning and model updates. Models are trained and updated through the framework, guided by performance metrics and supported by training data repositories. An Asynchronous Processing Service manages tasks and facilitates user interaction through a Web UI and Human Interface. The Deployment Pipeline ensures that models are integrated into production environments.

Fig. 3. Block Diagram of Implementation. (Blocks: Time-Series Preprocessing, Audio Preprocessing, Visual Data Preprocessing, Error Detection & Handling, Data Augmentation, Feature Extraction; Time-Series Analysis (BERT), Audio Classification (Wav2Vec2), Visual Data Processing (T5); Multimodal Data Integration, Inter-Modality Communication, Model Feedback LLM, Continuous Learning/Model Update; Training Data Repositories, Model Training & Update Mechanisms, Performance Metrics, Deployment Pipeline, Scalability Modules; Time-Series Predictions, Audio Classification Results, Visual Analysis Results, Visualization Tools; User Request, Web UI, Human Interface, Memory Components, Asynchronous Processing Service.)
The framework provides sophisticated analysis and predictions across time series, audio, and visual modalities, making it a scalable and flexible multimodal data processing solution.

C. Hardware Implementation

Our algorithms were implemented using Python on CPUs and GPUs. The study was carried out on NVIDIA's GeForce RTX 3060 GPU and Intel's Core i9 12900H CPU, each optimized for different tasks, ensuring efficient execution of our implementations.

IV. EVALUATION

The evaluation of the proposed framework is centered on assessing the performance and computational efficiency of the integrated LLMs when applied to multimodal data in intelligent transportation systems. The framework's performance was demonstrated across varying data types, including time series, audio, and video data depicting different aspects of vehicle and environmental conditions.

Table II shows the performance of the multimodal LLM across three modalities: time series, audio, and video. The model achieves an accuracy of 94.48% for time-series classification with a latency of 11.5 ms and a computational complexity of 1.8 GOPs. Audio classification follows with an accuracy of 92.80%, a latency of 13.1 ms, and a computational demand of 2.7 GOPs. Video processing, tasked with captioning, presents a lower accuracy of 88.73% but requires the highest computational effort of 4.5 GOPs, with a latency similar to that of audio at 13.5 ms.

TABLE II. EVALUATION RESULTS FOR MULTIMODAL LLM IN INTELLIGENT TRANSPORTATION SYSTEMS

| Modality | Accuracy | MAC (GOP) | Task | Latency (ms) |
| Time-Series | 94.48% | 1.8 | Classification | 11.5 |
| Audio | 92.80% | 2.7 | Classification | 13.1 |
| Video | 88.73% | 4.5 | Captioning | 13.5 |

V. CONCLUSION

The results demonstrate that the proposed multimodal LLM framework balances accuracy and computational efficiency across various data modalities, making it highly suitable for autonomous driving applications.
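The figures in Table II can be related to one another with a short calculation. The sketch below derives two quantities that are not reported in the paper: the number of inferences per second implied by each latency, and the effective compute rate implied by dividing the per-inference work by the latency. It assumes that inputs are processed one at a time and that the MAC (GOP) column is the work for a single inference; the function and key names are illustrative.

```python
# Values copied from Table II: compute per inference (GOP) and latency (ms).
TABLE_II = {
    "Time-Series": {"gop": 1.8, "latency_ms": 11.5},
    "Audio": {"gop": 2.7, "latency_ms": 13.1},
    "Video": {"gop": 4.5, "latency_ms": 13.5},
}


def inferences_per_second(latency_ms: float) -> float:
    """Sequential throughput implied by a per-inference latency."""
    return 1000.0 / latency_ms


def effective_gop_per_second(gop: float, latency_ms: float) -> float:
    """Compute rate implied by doing `gop` of work in `latency_ms`."""
    return gop / (latency_ms / 1000.0)


for modality, row in TABLE_II.items():
    rate = inferences_per_second(row["latency_ms"])
    compute = effective_gop_per_second(row["gop"], row["latency_ms"])
    print(f"{modality}: {rate:.1f} inferences/s, {compute:.1f} GOP/s")
```

Read this way, the video task performs 2.5 times the work of the time-series task (4.5 versus 1.8 GOP) in only 2 ms more, which is one way to interpret the similar latencies across modalities.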
The framework excels in processing time-series data, which is crucial for motion planning and real-time decision-making in transportation systems. While audio and video data performance showed room for improvement, these results highlight potential areas for further optimization, particularly in handling more complex and high-dimensional inputs. The latency results confirm that the model is capable of real-time processing, even with computational constraints, making it viable for deployment in real-world intelligent transportation systems. Transfer learning and task-specific fine-tuning also allowed the model to achieve robust performance without excessive computational demands. The proposed framework offers a robust and efficient solution for integrating LLMs into autonomous driving and other transportation-related tasks, providing high accuracy and computational efficiency.

VI. FUTURE WORK

The LLM framework features a multimodal approach that encapsulates varying datasets for an intelligent transportation system. One avenue for improvement is testing the LLM framework with additional datasets, which can help improve the framework and make cross-validation possible. Similarly, applying dataset augmentation techniques can help check for potential overfitting in the results. Another challenge in multimodal systems is the integration of diverse data formats (time series, audio, visual) into a shared latent space. In future work, t-Distributed Stochastic Neighbor Embedding (t-SNE) or similar dimensionality reduction techniques can be employed to visualize and optimize the learned latent representations. These techniques can help assess whether the LLM framework is effectively grouping semantically similar data points across different modalities. Exploring how extra-knowledge prompting varies with sample quantity could also improve the LLM framework's performance [15].
Additionally, explorations of visual and text prompting improvements could range from reliance on linguistic biases to crucial information in the text.

VII. ACKNOWLEDGMENT

We acknowledge the Temsa Research R&D Center for their generous financial support and the reviewers for their invaluable insights and suggestions that significantly contributed to the enhancement of our paper.

REFERENCES

[1] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, and X. Hu, Data-centric AI: Perspectives and Challenges, pp. 945–948. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611977653.ch106
[2] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, "A comprehensive overview of large language models," 2024. [Online]. Available: https://arxiv.org/abs/2307.06435
[3] Y. Zhang and L. Cheng, "The role of transport infrastructure in economic growth: Empirical evidence in the uk," Transport Policy, vol. 133, pp. 223–233, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0967070X23000239
[4] Z. Hussain, B. Marcel, A. Majeed, and R. S. M. Tsimisaraka, "Effects of transport–carbon intensity, transportation, and economic complexity on environmental and health expenditures," Environment, Development and Sustainability, vol. 26, no. 7, pp. 16 523–16 553, 2024.
[5] H. Demirel, E. Sertel, S. Kaya, and D. Zafer Seker, "Exploring impacts of road transportation on environment: a spatial approach," Desalination, vol. 226, no. 1, pp. 279–288, 2008, 10th IWA International Specialized Conference on Diffuse Pollution and Sustainable Basin Management. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0011916408001604
[6] A. Waqar, A. H. Alshehri, F. Alanazi, S. Alotaibi, and H. R. Almujibah, "Evaluation of challenges to the adoption of intelligent transportation system for urban smart mobility," Research in Transportation Business & Management, vol. 51, p. 101060, 2023.
[7] M. Elassy, M. Al-Hattab, M. Takruri, and S. Badawi, "Intelligent transportation systems for sustainable smart cities," Transportation Engineering, vol. 16, p. 100252, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666691X24000277
[8] T. Khalid, A. N. Khan, M. Ali, A. Adeel, A. ur Rehman Khan, and J. Shuja, "A fog-based security framework for intelligent traffic light control system," Multimedia Tools and Applications, vol. 78, pp. 24 595–24 615, 2019.
[9] M. Mnyakin, "Applications of ai, iot, and cloud computing in smart transportation: A review," Artificial Intelligence in Society, vol. 3, no. 1, p. 9–27, Feb. 2023. [Online]. Available: https://researchberg.com/index.php/ai/article/view/108
[10] R. Colvile, E. Hutchinson, J. Mindell, and R. Warren, "The transport sector as a source of air pollution," Atmospheric Environment, vol. 35, no. 9, pp. 1537–1565, 2001. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1352231000005513
[11] S. Kaewunruen, J. M. Sussman, and A. Matsumoto, "Grand challenges in transportation and transit systems," Frontiers in Built Environment, vol. 2, 2016. [Online]. Available: https://www.frontiersin.org/journals/built-environment/articles/10.3389/fbuil.2016.00004
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "Bert: Pre-training of deep bidirectional transformers for language understanding," 2019. [Online]. Available: https://arxiv.org/abs/1810.04805
[13] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," 2019. [Online]. Available: https://arxiv.org/abs/1711.05101
[14] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," 2020. [Online]. Available: https://arxiv.org/abs/2006.11477
[15] S. Qi, Z. Cao, J. Rao, L. Wang, J. Xiao, and X. Wang, "What is the limitation of multimodal llms? a deeper look into multimodal llms through prompt probing," Information Processing & Management, vol. 60, no. 6, p. 103510, 2023.

---