Authors: Dexter Le, Aybars Yunusoglu, Karn Tiwari, Murat Isik, I. Can Dikmen
Multimodal LLM for Intelligent Transportation Systems

Dexter Le∗
Drexel University
Philadelphia, USA
Email: dql27@drexel.edu

Aybars Yunusoglu∗
Purdue University
West Lafayette, USA
ayunusog@purdue.edu

Karn Tiwari
Indian Institute of Science, Bangalore
Bengaluru, India
karntiwari@iisc.ac.in

Murat Isik
Stanford University
Stanford, USA
misik@stanford.edu

I. Can Dikmen
Temsa Research & Development Center
Adana, Turkey
can.dikmen@temsa.com

∗Corresponding author
Abstract —In the evolving landscape of transportation systems,
integrating Large Language Models (LLMs) offers a promising
frontier for advancing intelligent decision-making across various
applications. This paper introduces a novel 3-dimensional frame-
work that encapsulates the intersection of applications, machine
learning methodologies, and hardware devices, particularly em-
phasizing the role of LLMs. Instead of using multiple machine
learning algorithms, our framework uses a single, data-centric
LLM architecture that can analyze time series, images, and
videos. We explore how LLMs can enhance data interpretation
and decision-making in transportation. We apply this LLM
framework to different sensor datasets, including time-series
data and visual data from sources like Oxford Radar RobotCar,
D-Behavior (D-Set), nuScenes by Motional, and Comma2k19.
The goal is to streamline data processing workflows, reduce the
complexity of deploying multiple models, and make intelligent
transportation systems more efficient and accurate. The study
was conducted on an NVIDIA GeForce RTX 3060 GPU and an Intel Core
i9-12900H processor. The experimental results demonstrate that
our framework achieves an average accuracy of 91.33% across
these datasets, with the highest accuracy observed in time-series
data (92.7%), showcasing the model’s proficiency in handling
sequential information essential for tasks such as motion plan-
ning and predictive maintenance. Through our exploration, we
demonstrate the versatility and efficacy of LLMs in handling
multimodal data within the transportation sector, ultimately pro-
viding insights into their application in real-world scenarios. Our
findings align with the broader conference themes, highlighting
the transformative potential of LLMs in advancing transportation
technologies.
Index Terms —Multimodal LLM, Intelligent Transportation,
Sensor Data, Transportation System
I. INTRODUCTION
The rise of large language models (LLMs) has contributed
significantly to natural language processing applications.
LLMs offer great potential for systems that require intelligent
decision-making and can do so with great versatility.
In transportation systems, making fast and accurate decisions
is imperative. However, much of the cost of an LLM derives from
its development and from the aggregation and retention of viable
data to enhance it. Data-Centric AI procedures, which comprise
techniques for improving and maintaining datasets, are therefore
essential. These methodologies range from data augmentation for
data diversification to data labeling, reduction, and
maintenance [1].
Additionally, a healthy dataset helps produce a robust and
resilient LLM that evolves alongside newly observed data.
In transportation systems, robustness and resilience are
imperative goals for the LLM. The cost of developing LLMs
is high; however, adopting these techniques can ease the burden
of evolving the LLM and improve its continuity. Using LLMs in
transportation systems can speed up accurate and intelligent
responses to specific situations. In a basic overview, an LLM is
first pre-trained on data and then tuned to follow instructions.
Its responses to these instructions are scored by a reward model
defined over tasks. The LLM is then prompted, and the generated
response is taken as its output [2].
We propose a unified LLM that handles datasets featuring
time series and visuals. The novelty of the unified LLM lies in
a framework that connects the physical layer with the
application layer using machine learning algorithms.
These advancements are achieved through a singular LLM
architecture, which not only enhances the efficiency and accu-
racy of intelligent transportation systems but also facilitates
real-time processing on edge devices. This architecture is
particularly suitable for edge computing environments, where
computational efficiency and reduced latency are critical, en-
suring seamless performance in resource-constrained settings.
The main contributions of this paper center on a unified
multimodal LLM framework for enhancing intelligent transport
systems. Our primary contributions are:
• A unified multimodal LLM framework, a novel approach
that reduces development complexity and increases
performance for intelligent transport systems.
• Examination of datasets with differing data types (time
series, audio, and video), detailing various use cases with
the unified architecture.
• Analysis of the unified multimodal LLM framework’s
integration into devices such as GPUs and CPUs, with
performance benchmarks.
arXiv:2412.11683v1 [cs.LG] 16 Dec 2024
The paper is organized as follows: Section 2 discusses
the current methodologies for designing an intelligent
transportation system, their challenges, and an analysis of
current learning algorithms. Section 3 outlines the datasets
used, along with the associated hardware and learning
algorithms. Section 4 presents the evaluation of the machine
learning algorithms. Section 5 summarizes the proposed unified
multimodal LLM framework. Section 6 discusses potential future
work.
II. BACKGROUND
Transportation is a critical component of everyday life,
and the rising price of commodities and the demand
for vehicles accelerate the need for better transportation
channels. Additionally, transportation plays a vital role in
economic development by enabling the trade of goods and
services [3]. Furthermore, transportation contributes
significantly to environmental sustainability, safety, and
community engagement [4] [5]. The emergence of Intelligent
Transportation Systems (ITS) has improved existing issues in
transportation, including but not limited to traffic safety,
cost efficiency, comfort, and speed [6] [7]. The applications of
ITS range from autonomous vehicles to traffic management, and
the rise of IoT devices has accelerated the adoption of ITS [8].
Similarly, the emergence of cloud computing has further driven
the integration of IoT devices with ITS to produce highly robust
and resilient systems [9]. Because transportation is essential
to everyday life and safety, fostering a sustainable model
should be paramount.
Despite the emergence of ITS, transportation still faces
many challenges that impact livelihoods. These challenges
range from the environment to health, safety, privacy, and
accessibility. Environmental challenges stem from the
byproducts of transportation, whether from the act of
transporting or from manufacturing the products that enable it.
A major environmental challenge is the emission of carbon
dioxide into the atmosphere as a result of fuel combustion [10].
This challenge also affects public health, as mass emissions
reduce air quality and can be a factor in lifelong diseases.
Safety plays a critical role in transportation systems, where
some of the most significant challenges are general safety,
sustainability, and autonomous transportation [11].
Figure 1 illustrates how the interactions within the Multimodal
LLM Framework span three key dimensions: Modality, Models,
and Hardware. Each modality, including time series, audio,
and visual data, is represented as an input to the network,
ensuring the framework effectively processes and analyzes
diverse data formats across these dimensions. Each node then
feeds the subsequent learning algorithms in the physical layer.
Each node in the network and its interactions with other layers
display the challenges of the Multimodal LLM Framework
and the tasks to accomplish: traditional scaling results
in an exponential increase in dependencies and complexities.
The approach of the Multimodal LLM Framework aims to
reduce scaling complexities while improving performance and
abstracting interactions from the input layer to the physical
layer.
Fig. 1. Our Framework Depicted Across Three Dimensions: Data, Models,
and Hardware.
III. METHODS
A. Dataset
The dataset utilized in this study is derived from time-
series sensor data and encompasses a wide range of param-
eters related to vehicle performance, control systems, and
environmental conditions. This rich dataset is designed to
support analyzing intelligent transportation systems and their
interactions with the surrounding environment.
TABLE I
OVERVIEW OF DIFFERENT DATA TYPES IN THE DATASET.
Data Type | Description | Applications
Time-Series Data | Continuous sensor readings over time, including vehicle speed, tire pressure, engine torque, etc. | Analyzing dynamic behavior, fault detection, predictive maintenance
Audio Data | Sound recordings from the vehicle’s surroundings or internal systems. | Noise analysis, engine sound and environmental conditions assessment
Video Data | Recordings from onboard cameras capturing the vehicle’s environment. | Object detection, lane-keeping assistance, environmental monitoring
As Table I shows, the dataset comprises three primary types
of data (time series, audio, and video), each serving distinct
purposes in vehicular analysis and applications. Time-series
data includes continuous sensor readings over time, such as
vehicle speed, tire pressure, and engine torque, and is essential
for analyzing dynamic behavior, fault detection, and predictive
maintenance. The audio data consists of sound recordings
taken from the vehicle’s surroundings and internal systems,
which can be used for noise analysis, engine sound evaluation,
and assessment of the vehicle’s environment. In video data,
onboard cameras record the vehicle’s environment, which is used
to detect objects, assist with lane-keeping, and monitor the
environment. We intend to evaluate LLMs’ ability to process
and interpret multimodal sensor data using this dataset.
Using LLMs, this research aims to improve understanding of
complex vehicle-environment interactions, ultimately leading
to advances in intelligent transportation.
[Figure 2 block labels: Input Time Series Data, Time Series Processing, Time Series Optimization Engine, Processed Time Series Signal; Input Audio Data, Audio Signal Processing, Audio Optimization Engine, Processed Audio Signal; Input Video Data, Video Frame Processing, Visual Signal Processing, Visual Optimization Engine, Processed Visual Signal; Processed Information.]
Fig. 2. Sensor Processing Diagram.
Figure 2 illustrates the sensor processing of the Multimodal
LLM Framework, which integrates time-series, audio, and
visual data. Time-series and audio inputs are processed by
their respective optimization engines in a loopback mechanism.
For visual data, the input video frames are preprocessed first,
after which the visual optimization engine processes them in
the same loopback fashion. The optimization engines use the
AdamW optimizer and continuous learning loops for their tasks.
Each signal processing module outputs processed information.
The framework provides a scalable solution that maintains
efficiency without increasing complexity while supporting
various data formats.
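The AdamW update used by the optimization engines can be written out explicitly. The following is a minimal sketch of a single AdamW step [13] on a list of scalar parameters; the function name `adamw_step` and the hyperparameter defaults are assumptions chosen for illustration and are not the settings used in this study.

```python
import math


def adamw_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """Apply one AdamW update to a list of scalar parameters.

    `m` and `v` are the running first and second moment estimates and
    `t` is the 1-based step count. All defaults are illustrative.
    Returns the updated (params, m, v) as new lists.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized moments.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Decoupled weight decay: the decay term acts on the parameter
        # directly instead of being added to the gradient.
        p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
        new_params.append(p)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Usage: a few steps on the toy objective f(x) = (x - 3)^2.
params, m, v = [0.0], [0.0], [0.0]
for step in range(1, 101):
    grads = [2.0 * (params[0] - 3.0)]
    params, m, v = adamw_step(params, grads, m, v, step, lr=0.1)
```

In a training loop this step would be applied to every model weight after each backward pass; the toy objective above only demonstrates the call pattern.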
B. Overview of the Proposed Framework
The proposed framework integrates time series, audio, and
visual data processing and illustrates the versatility of
transformer-based architectures across a variety of data
formats. The framework consists of three main components: time
series analysis, audio classification, and visual data
processing. To classify structured tabular data, the framework
implements a Transformer architecture for time series data,
specifically BERT. The features in the tabular data are
converted into a single text string, allowing the BERT tokenizer
to process the data and capture complex relationships in the
dataset [12]. The architecture comprises a pre-trained BERT
model followed by a classification head fine-tuned on the target
dataset. Training is optimized using the AdamW optimizer,
and the model’s performance is evaluated using accuracy
metrics [13]. This approach enables the model to handle complex
time series data effectively, providing robust performance on
unseen data.
The framework’s audio classification component
leverages a pre-trained Wav2Vec2 model, fine-tuned on a
specific audio dataset for classifying environmental audio
files [14]. The process begins with data preprocessing, where
Wav2Vec2Processor converts the raw audio waveforms into a
format suitable for the Wav2Vec2 model, including padding
and normalization. An AudioDataset class is implemented to
manage the audio data, ensuring correct label assignment and
necessary transformations. Because audio sequences vary in
length, a custom collate function pads the sequences in each
batch to the maximum sequence length, ensuring consistent input
sizes for the model during training. The model is then
fine-tuned using the cross-entropy loss and the AdamW optimizer,
and evaluation is based on validation accuracy and loss metrics.
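The batch-padding step for variable-length audio can be sketched as follows. This is a dependency-free illustration, not the implementation used in this study: the function name `pad_collate`, the zero pad value, and the plain-list representation are assumptions, and the output keys `input_values` and `attention_mask` are chosen to mirror the field names commonly used with Wav2Vec2 inputs.

```python
def pad_collate(batch, pad_value=0.0):
    """Pad every waveform in a batch to the length of the longest one.

    `batch` is a list of (waveform, label) pairs, where each waveform
    is a list of float samples. Returns padded inputs, a mask marking
    real samples (1) versus padding (0), and the labels.
    """
    max_len = max(len(waveform) for waveform, _ in batch)
    inputs, mask, labels = [], [], []
    for waveform, label in batch:
        n_pad = max_len - len(waveform)
        inputs.append(list(waveform) + [pad_value] * n_pad)
        mask.append([1] * len(waveform) + [0] * n_pad)
        labels.append(label)
    return {"input_values": inputs, "attention_mask": mask, "labels": labels}


# Usage: two clips of different lengths end up with equal-length rows.
example = pad_collate([([0.1, 0.2, 0.3], 0), ([0.5], 1)])
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps the amount of padding small when clips of similar length are batched together.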
In the visual data processing component, video sequences are
processed by extracting frames and generating textual
descriptions. The descriptions are produced by vision-language
models such as BLIP and CLIP. A language model such as T5 then
refines or transforms the generated descriptions for specific
natural language processing (NLP) tasks, such as translation or
summarization. This process enables visual and textual data
integration, leveraging transfer learning to minimize the need
for task-specific training. The entire visual processing
pipeline is implemented using the Hugging Face Transformers
library, ensuring flexibility and compatibility with different
pre-trained models.
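The frame-extraction, captioning, and refinement stages described above can be sketched as a small pipeline. This is an illustration under stated assumptions: the uniform frame-sampling strategy, the function names, and the `caption_fn` and `refine_fn` callables are hypothetical stand-ins (in practice they would wrap a captioning model and a text-to-text model such as T5), and the paper does not specify how frames are selected.

```python
from typing import Callable, List, Sequence


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to `num_samples` evenly spaced indices in [0, num_frames)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def describe_video(frames: Sequence,
                   caption_fn: Callable[[object], str],
                   refine_fn: Callable[[str], str],
                   num_samples: int = 4) -> str:
    """Caption a subset of frames, then refine the joined captions."""
    indices = sample_frame_indices(len(frames), num_samples)
    captions = [caption_fn(frames[i]) for i in indices]
    return refine_fn(" ".join(captions))


# Usage with stub models: real captioning and refinement models would
# replace these two lambdas.
frames = [f"frame_{i}" for i in range(10)]
summary = describe_video(frames,
                         caption_fn=lambda f: f"caption of {f}.",
                         refine_fn=lambda text: text.strip())
```

Passing the models in as callables keeps the pipeline independent of any particular pre-trained checkpoint, which matches the stated goal of compatibility with different models.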
The multimodal framework shows how LLMs can be ap-
plied to a variety of data types, such as time series, audio,
and visual, illustrating the model’s adaptability and effective-
ness. In addition to leveraging the contextual understanding
provided by pre-trained models, the framework achieves high
performance with minimal computational resources by trans-
forming non-textual data into formats suitable for NLP models.
Multimodal systems are particularly useful for complex tasks
that take advantage of rich, contextual representations provided
by transformers. This makes the framework applicable to a
variety of domains, including predictive maintenance, audio
event detection, and video analysis.
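The transformation of non-textual data into text can be illustrated for a single row of tabular time-series data. The following is a minimal sketch: the `name: value` template, the rounding precision, and the column names (borrowed from the examples in Table I) are assumptions for illustration, since the exact serialization format is not specified here.

```python
def row_to_text(row: dict, precision: int = 2) -> str:
    """Serialize one tabular sensor reading into a single text string.

    Floats are rounded to `precision` decimals so that the tokenizer
    sees a stable, compact representation of each value.
    """
    parts = []
    for name, value in row.items():
        if isinstance(value, float):
            value = f"{value:.{precision}f}"
        parts.append(f"{name}: {value}")
    return ", ".join(parts)


# Usage: hypothetical column names based on the Table I examples.
reading = {"vehicle_speed": 62.5, "tire_pressure": 2.31, "engine_torque": 180}
text = row_to_text(reading)
# text == "vehicle_speed: 62.50, tire_pressure: 2.31, engine_torque: 180"
```

The resulting string can then be passed to a text tokenizer in the same way as any other sentence, which is what allows a language model to be reused for tabular classification.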
[Figure 3 block labels: Time-Series Preprocessing, Error Detection & Handling, Audio Preprocessing, Visual Data Preprocessing, Data Augmentation, Feature Extraction; Time-Series Analysis (BERT), Audio Classification (Wav2Vec2), Visual Data Processing (T5); Model Feedback LLM, Multimodal Data Integration, Inter-Modality Communication, Continuous Learning/Model Update, Training Data Repositories; Visualization Tools, Time-Series Predictions, Audio Classification Results, Visual Analysis Results, User Request, Performance Metrics; Deployment Pipeline, Model Training & Update Mechanisms, Scalability Modules, Web UI, Memory Components, Asynchronous Processing Service, Human Interface; Multimodal LLM Framework.]
Fig. 3. Block Diagram of Implementation.
Figure 3 shows that the proposed Multimodal LLM Framework
integrates time-series, audio, and visual data processing
using transformer-based models. It begins with a Preprocessing
Layer that includes Time-Series, Audio, and Visual Data
Preprocessing modules and Error Detection and Handling,
along with Feature Extraction. Then, the data is processed
using transformers that are specific to the modality:
Time-Series Analysis (BERT), Audio Classification (Wav2Vec2),
and Visual Data Processing (T5). Multimodal Data Integration
is at the core of the framework; it provides seamless
inter-modality communication and incorporates an LLM model
feedback loop for continuous learning and model updates.
Models are trained and updated through the framework, guided
by performance metrics and supported by training data
repositories. An Asynchronous Processing Service manages tasks
and facilitates user interaction through a Web UI and Human
Interface. The Deployment Pipeline ensures integration of
models into production environments. The framework provides
sophisticated analysis and predictions across time series,
audio, and visual modalities, making it a scalable and flexible
multimodal data processing solution.
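The role of the Asynchronous Processing Service, which routes each user request to the model for its modality, can be sketched with Python's asyncio. This is a hypothetical illustration: the handler names, the request format, and the placeholder results are assumptions, and the real handlers would call the BERT, Wav2Vec2, and T5 components.

```python
import asyncio


async def handle_time_series(payload):
    await asyncio.sleep(0)  # placeholder for model inference
    return f"time_series:{payload}"


async def handle_audio(payload):
    await asyncio.sleep(0)  # placeholder for model inference
    return f"audio:{payload}"


async def handle_video(payload):
    await asyncio.sleep(0)  # placeholder for model inference
    return f"video:{payload}"


# Routing table from modality name to its (hypothetical) handler.
HANDLERS = {
    "time_series": handle_time_series,
    "audio": handle_audio,
    "video": handle_video,
}


async def dispatch(request):
    """Route one request dict to the handler for its modality."""
    handler = HANDLERS.get(request["modality"])
    if handler is None:
        raise ValueError(f"unsupported modality: {request['modality']}")
    return await handler(request["payload"])


async def process_all(requests):
    """Process requests concurrently; results keep the input order."""
    return await asyncio.gather(*(dispatch(r) for r in requests))


# Usage: two requests of different modalities handled concurrently.
results = asyncio.run(process_all([
    {"modality": "audio", "payload": "clip_01"},
    {"modality": "video", "payload": "cam_front"},
]))
```

A routing table of this kind means that adding a modality only requires registering one more handler, without changing the dispatch logic.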
C. Hardware Implementation
Our algorithms were implemented using Python on CPUs
and GPUs. The study was carried out by leveraging the
computational power of NVIDIA’s GeForce RTX 3060 GPU
and Intel’s Core i9 12900H CPU, optimized for different tasks,
ensuring efficient execution of our implementations.
IV. EVALUATION
The evaluation of the proposed framework is centered on
assessing the performance and computational efficiency of the
integrated LLMs when applied to multimodal data in intel-
ligent transportation systems. The framework’s performance
was demonstrated across varying data types, including time
series, audio, and video data depicting different aspects of
vehicle and environmental conditions.
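One common way to obtain a per-inference latency figure is to average wall-clock time over repeated calls after a short warm-up. The sketch below shows that approach; it is an assumption about methodology rather than a description of how the measurements in this study were taken, and the function name and defaults are illustrative. When timing GPU work, the device would also need to be synchronized before each clock reading, since GPU kernels are launched asynchronously.

```python
import time


def mean_latency_ms(fn, n_warmup=3, n_runs=20, clock=time.perf_counter):
    """Return the mean wall-clock time of `fn()` in milliseconds.

    Warm-up calls are excluded from the measurement. `clock` can be
    replaced with a fake clock to make the function testable.
    """
    for _ in range(n_warmup):
        fn()
    start = clock()
    for _ in range(n_runs):
        fn()
    elapsed = clock() - start
    return 1000.0 * elapsed / n_runs


# Usage: time a stand-in workload; a model's forward pass would go here.
latency = mean_latency_ms(lambda: sum(range(10_000)))
```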
Table II shows the performance of the multimodal LLM
across three different modalities: time series, audio, and video.
The model achieves an accuracy of 94.48% for time-series
classification with a latency of 11.5 ms and a
computational complexity of 1.8 GOPs. Audio classification
follows with an accuracy of 92.80%, a latency of 13.1 ms,
and a computational demand of 2.7 GOPs. Video processing,
tasked with captioning, presents a lower accuracy of 88.73%
but requires the highest computational effort of 4.5 GOPs, with
a latency close to that of audio at 13.5 ms.
TABLE II
EVALUATION RESULTS FOR MULTIMODAL LLM IN INTELLIGENT
TRANSPORTATION SYSTEMS
Modality | Accuracy | MAC (GOP) | Task | Latency (ms)
Time-Series | 94.48% | 1.8 | Classification | 11.5
Audio | 92.80% | 2.7 | Classification | 13.1
Video | 88.73% | 4.5 | Captioning | 13.5
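The compute and latency columns of Table II can be combined into an effective throughput per modality. The sketch below assumes that the MAC (GOP) column gives the operations needed for one inference and that the latency is also per inference; under that assumption, throughput is simply operations divided by time.

```python
# Values copied from Table II.
RESULTS = {
    "time_series": {"accuracy": 94.48, "gop": 1.8, "latency_ms": 11.5},
    "audio": {"accuracy": 92.80, "gop": 2.7, "latency_ms": 13.1},
    "video": {"accuracy": 88.73, "gop": 4.5, "latency_ms": 13.5},
}


def throughput_gops(gop: float, latency_ms: float) -> float:
    """Effective giga-operations per second for a single inference."""
    return gop / (latency_ms / 1000.0)


# Effective throughput for each modality, in GOP/s.
throughput = {
    name: throughput_gops(row["gop"], row["latency_ms"])
    for name, row in RESULTS.items()
}
```

Under these assumptions the video path sustains the highest effective throughput, which is consistent with its latency staying close to the audio latency despite the larger operation count.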
V. CONCLUSION
The results demonstrate that the proposed multimodal LLM
framework balances accuracy and computational efficiency
across various data modalities, making it highly suitable for
autonomous driving applications. The framework excels in
processing time-series data, which is crucial for motion plan-
ning and real-time decision-making in transportation systems.
While audio and video data performance showed room for
improvement, these results highlight potential areas for further
optimization, particularly in handling more complex and high-
dimensional inputs. The latency results confirm that the model
is capable of real-time processing, even with computational
constraints, making it viable for deployment in real-world
intelligent transportation systems. Transfer learning and task-
specific fine-tuning also allowed the model to achieve robust
performance without excessive computational demands. The
proposed framework offers a robust and efficient solution
for integrating LLMs into autonomous driving and other
transportation-related tasks, providing high accuracy and com-
putational efficiency.
VI. FUTURE WORK
The LLM framework features a multimodal approach that
encapsulates varying datasets for an intelligent transportation
system. Potential improvements could come from testing the LLM
framework with additional datasets. Using other datasets can
help improve the LLM framework and make cross-validation
possible. Similarly, applying dataset augmentation techniques
can help check for potential overfitting in the results.
Another challenge in multimodal systems is the integration of
diverse data formats (time series, audio, visual) into a shared
latent space. In future work, t-Distributed Stochastic Neighbor
Embedding (t-SNE) or similar dimensionality reduction
techniques can be employed to visualize and optimize the learned
latent representations. These techniques can help assess
whether the LLM framework is effectively grouping semantically
similar data points across different modalities. Exploring how
extra-knowledge prompting varies with sample quantity could
improve the LLM framework’s performance [15]. Additionally,
improvements to visual and text prompting could be explored,
addressing issues that range from reliance on linguistic biases
to the handling of crucial information in the text.
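Alongside a visual inspection with t-SNE, the cross-modal grouping question can be checked numerically. The sketch below is not t-SNE; it is a simple complementary metric, introduced here as an assumption, that reports how often the nearest neighbor of an embedding in a different modality carries the same semantic label. The function names and the toy labels are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_modal_agreement(points):
    """Fraction of points whose nearest other-modality neighbor
    shares their label.

    `points` is a list of (embedding, modality, label) tuples that
    all live in the same latent space.
    """
    hits, total = 0, 0
    for emb, modality, label in points:
        candidates = [(cosine(emb, other_emb), other_label)
                      for other_emb, other_mod, other_label in points
                      if other_mod != modality]
        if not candidates:
            continue
        _, best_label = max(candidates, key=lambda c: c[0])
        hits += int(best_label == label)
        total += 1
    return hits / total if total else 0.0


# Usage: toy 2-D embeddings with hypothetical labels.
toy = [
    ([1.0, 0.0], "audio", "brake"),
    ([0.9, 0.1], "video", "brake"),
    ([0.0, 1.0], "audio", "horn"),
    ([0.1, 0.9], "video", "horn"),
]
score = cross_modal_agreement(toy)
```

A score near 1 suggests that samples with the same meaning sit close together regardless of modality, while a score near chance suggests the modalities occupy separate regions of the latent space.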
VII. ACKNOWLEDGMENT
We acknowledge the Temsa Research & Development Center for their
generous financial support and the reviewers for their invalu-
able insights and suggestions that significantly contributed to
the enhancement of our paper.
REFERENCES
[1] D. Zha, Z. P. Bhat, K.-H. Lai, F. Yang, and X. Hu, Data-centric AI: Perspectives and Challenges, pp. 945–948. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1.9781611977653.ch106
[2] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2307.06435
[3] Y. Zhang and L. Cheng, “The role of transport infrastructure in economic growth: Empirical evidence in the UK,” Transport Policy, vol. 133, pp. 223–233, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0967070X23000239
[4] Z. Hussain, B. Marcel, A. Majeed, and R. S. M. Tsimisaraka, “Effects of transport–carbon intensity, transportation, and economic complexity on environmental and health expenditures,” Environment, Development and Sustainability, vol. 26, no. 7, pp. 16523–16553, 2024.
[5] H. Demirel, E. Sertel, S. Kaya, and D. Zafer Seker, “Exploring impacts of road transportation on environment: a spatial approach,” Desalination, vol. 226, no. 1, pp. 279–288, 2008, 10th IWA International Specialized Conference on Diffuse Pollution and Sustainable Basin Management. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0011916408001604
[6] A. Waqar, A. H. Alshehri, F. Alanazi, S. Alotaibi, and H. R. Almujibah, “Evaluation of challenges to the adoption of intelligent transportation system for urban smart mobility,” Research in Transportation Business & Management, vol. 51, p. 101060, 2023.
[7] M. Elassy, M. Al-Hattab, M. Takruri, and S. Badawi, “Intelligent transportation systems for sustainable smart cities,” Transportation Engineering, vol. 16, p. 100252, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2666691X24000277
[8] T. Khalid, A. N. Khan, M. Ali, A. Adeel, A. ur Rehman Khan, and J. Shuja, “A fog-based security framework for intelligent traffic light control system,” Multimedia Tools and Applications, vol. 78, pp. 24595–24615, 2019.
[9] M. Mnyakin, “Applications of AI, IoT, and cloud computing in smart transportation: A review,” Artificial Intelligence in Society, vol. 3, no. 1, pp. 9–27, Feb. 2023. [Online]. Available: https://researchberg.com/index.php/ai/article/view/108
[10] R. Colvile, E. Hutchinson, J. Mindell, and R. Warren, “The transport sector as a source of air pollution,” Atmospheric Environment, vol. 35, no. 9, pp. 1537–1565, 2001. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1352231000005513
[11] S. Kaewunruen, J. M. Sussman, and A. Matsumoto, “Grand challenges in transportation and transit systems,” Frontiers in Built Environment, vol. 2, 2016. [Online]. Available: https://www.frontiersin.org/journals/built-environment/articles/10.3389/fbuil.2016.00004
[12] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2019. [Online]. Available: https://arxiv.org/abs/1810.04805
[13] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” 2019. [Online]. Available: https://arxiv.org/abs/1711.05101
[14] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” 2020. [Online]. Available: https://arxiv.org/abs/2006.11477
[15] S. Qi, Z. Cao, J. Rao, L. Wang, J. Xiao, and X. Wang, “What is the limitation of multimodal LLMs? A deeper look into multimodal LLMs through prompt probing,” Information Processing & Management, vol. 60, no. 6, p. 103510, 2023.