AutoIOT: LLM-Driven Automated Natural Language
Programming for AIoT Applications
Leming Shen1, Qiang Yang2, Yuanqing Zheng1, Mo Li3
1The Hong Kong Polytechnic University,2University of Cambridge,
3Hong Kong University of Science and Technology
The advent of Large Language Models (LLMs) has profoundly
transformed our lives, revolutionizing interactions with AI
and lowering the barrier to AI usage. While LLMs are primar-
ily designed for natural language interaction, the extensive
embedded knowledge empowers them to comprehend digital
sensor data. This capability enables LLMs to engage with
the physical world through IoT sensors and actuators, per-
forming a myriad of AIoT tasks. Consequently, this evolution
triggers a paradigm shift in conventional AIoT application
development, democratizing its accessibility to all by facil-
itating the design and development of AIoT applications
via natural language. However, some limitations need to be
addressed to unlock the full potential of LLMs in AIoT ap-
plication development. First, existing solutions often require
transferring raw sensor data to LLM servers, which raises
privacy concerns, incurs high query fees, and is limited by
token size. Moreover, the reasoning processes of LLMs are
opaque to users, making it difficult to verify the robustness
and correctness of inference results. This paper introduces
AutoIOT, an LLM-based automated program generator for
AIoT applications. AutoIOT enables users to specify their
requirements using natural language (input) and automati-
cally synthesizes interpretable programs with documenta-
tion (output). AutoIOT automates the iterative optimization
to enhance the quality of generated code with minimum user
involvement. AutoIOT not only makes the execution of AIoT
tasks more explainable but also mitigates privacy concerns
and reduces token costs with local execution of synthesized
programs. Extensive experiments and user studies demonstrate
AutoIOT’s remarkable capability in program synthesis for
various AIoT tasks. The synthesized programs can match
and even outperform some representative baselines.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, or republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from
MobiCom ’25, Nov 4–8, 2025, Hong Kong, China
©2025 Association for Computing Machinery.
ACM ISBN nnn-n-nnnn-nnnn-n/nn/nn. . . $15.00
•Computing methodologies →Artificial intelligence ;
•Computer systems organization →Embedded and
cyber-physical systems .
Large Language Model, Penetrative AI, Program Synthesis
ACM Reference Format:
Leming Shen, Qiang Yang, Yuanqing Zheng, Mo Li. 2025. AutoIOT:
LLM-Driven Automated Natural Language Programming
for AIoT Applications. In The 31st Annual International Conference
on Mobile Computing and Networking (ACM MobiCom ’25), Nov
4–8, 2025, Hong Kong, China. ACM, New York, NY, USA, 15 pages.
Artificial Intelligence of Things (AIoT) [ 19,41,48,51,65,77]
is an emerging paradigm that leverages advanced artificial
intelligence (AI) algorithms to process a vast amount of data
generated by Internet of Things (IoT) devices. This technol-
ogy brings a new level of intelligence and automation to
various applications, including healthcare [ 57,59], smart
sensing [82–84], and autonomous driving [36].
Recent advances in large language models (LLMs) (e.g.,
GPT-4 [16]) have fundamentally changed the way we interact with
AI. While LLMs were initially designed to understand natural language,
recent pioneering works [43, 53, 79] have demonstrated their
considerable proficiency in exploiting embedded world
knowledge to interpret IoT sensor data and perform various
AIoT tasks. Recent work [79] terms this endeavor
Penetrative AI. Fig. 1 illustrates how LLMs can be tasked
to comprehend and even interact with the physical world
through integration with IoT sensors and actuators.
However, current LLM-based solutions for AIoT tasks [53, 61, 79] fall
short in supporting AIoT applications [20, 38–40, 86, 87]: 1)
The trustworthiness of the inference results is compromised
since the inference process is performed inside a "black box"
and is opaque to users. Thus, the robustness of the applications
and the correctness of the inference results are hard to
verify; 2) Transmitting the raw or intermediate sensor data
from user devices to LLM servers raises privacy concerns,
incurs prohibitive query fees, and increases response latency;
3) Sensor data typically exhibits extensive length and high
dimensionality, making remote processing at LLM servers
infeasible due to token limits [15, 85]. Ideally, the integration
of LLMs with AIoT applications should be trustworthy,
privacy-preserving, and communication-efficient.
arXiv:2503.05346v1 [cs.CL] 7 Mar 2025
Figure 1: Illustration of how LLMs can sense and interact with the physical world in AIoT applications. (Figure content: various sensors and pre-collected datasets feed an LLM, which reasons that a sudden change in IMU data from a smartwatch implies that the user has fallen and that falling is very dangerous for elderly people. The LLM responds "It looks like you’ve taken a hard fall", and the depicted follow-up action is first aid.)
On the other hand, existing works on LLMs have show-
cased their remarkable capabilities in code generation to
accomplish a variety of programming tasks [44, 47, 54]. Can
we leverage LLMs to synthesize programs to fulfill AIoT ap-
plication requirements? This approach can 1) enhance the
explainability and trustworthiness of the AIoT applications
as the synthesized programs can be examined and interpreted
by developers, 2) mitigate privacy concerns, and reduce the
communication cost since the programs can be executed lo-
cally on user devices without offloading raw sensor data, and
3) efficiently process high-dimensional continuous sensor
data without being limited by the token size or bounded
by the round trip time over the network. To this end, we
propose AutoIOT, a user-friendly natural language programming
system based on LLMs. AutoIOT automatically identifies
and retrieves the necessary domain knowledge over the
Internet, intelligently synthesizes programs, and evolves the
programs iteratively given sample inputs and ground truth.
Surprisingly, we found that the synthesized programs can
sometimes outperform some representative baselines and
sample programs of recent academic papers.
While the automatic program synthesis for AIoT applica-
tions is promising and exciting, it entails tremendous techni-
cal challenges. 1) High Complexity of AIoT Tasks. Con-
trary to existing works that generate code for individual
modules or well-defined functions [ 17], AIoT applications
typically need a systematic design and integration involv-
ing multiple functional components, leading to much higher
reasoning and planning complexity beyond the capability of
current LLMs. To address this issue, AutoIOT decomposes the
programming task into several distinct modules and gener-
ates the corresponding code segments. In particular, AutoIOT
leverages chain-of-thought (CoT) prompts [76] to divide the task
into a few sub-tasks and integrate their solutions,
eventually making the sub-tasks manageable by LLMs. 2) Lack
of Domain Knowledge in AIoT. LLMs are trained on pre-
collected general corpus datasets, which may not include
the latest domain-specific knowledge needed for the devel-
opment of various emerging AIoT applications. To tackle
this problem, AutoIOT guides the LLMs to search and re-
trieve necessary knowledge and algorithms, thereby enabling
in-context training and inference augmented with domain
knowledge for LLMs. 3) Heavy Intervention and Constant
Feedback. Our preliminary experiments (§ 2.2) reveal that
to generate functionally correct programs, developers have
to give timely feedback to LLMs and constantly intervene
in the entire development process. For example, developers
need to provide specific reference materials and describe
algorithms in great detail, which can be time-consuming
and defeat the very purpose of automated natural language
programming. Ideally, AutoIOT should be able to synthesize
the program with no intervention from users and require
minimum user input only when necessary. To this end, we
develop AutoIOT that can execute, debug, and optimize the
synthesized program given sample inputs and outputs.
We fully implement AutoIOT¹ and evaluate its synthesized
programs with four representative AIoT applications:
heartbeat detection, IMU-based human activity recognition
(HAR), mmWave-based HAR, and multimodal HAR. Extensive
experiments and user studies show that the synthesized
programs achieve performance comparable to the corresponding
baselines and significantly improve user
satisfaction. Besides, AutoIOT substantially reduces the communication
cost and the total execution time. These findings
demonstrate the LLM’s exceptional proficiency and great
potential in synthesizing programs for AIoT applications. In
summary, we make the following contributions:
• To the best of our knowledge, AutoIOT is the first work that enables systematic natural language programming for AIoT applications.
• We design and implement three novel technical modules (i.e., a background knowledge retrieval module, a CoT prompting-based program synthesis module, and an automated code improvement module) to synthesize and optimize programs for AIoT applications.
• Our comprehensive experiments demonstrate that the synthesized programs can achieve performance comparable to the baselines and sometimes outperform them.
We first revisit the pioneering Penetrative AI efforts that
leverage the embedded knowledge of LLMs to address AIoT
tasks. Then, we present the results of our preliminary experiments
to demonstrate the feasibility and identify the
key challenges of implementing an LLM-driven automated
natural language programming system for AIoT applications.
¹ The project is available at
Figure 2: Prior work processes sensor data with LLMs. (Figure content: data from various IoT sensors is down-sampled and rounded, then digitized, e.g., "100, 101, 102, 103, …", and textualized, e.g., "the sampling rate is …". The result is embedded in a data interpretation prompt with the sections * Objective * ("Find the R-peaks in ECG data"), * Background Knowledge * ("An R-peak is …"), * Response Format * ("The response must follow …"), * Reasoning Examples *, and * Input Sensor Data * ("Sensor name 1: {……} …"). The LLM answers "The R-peaks are ……".)
Figure 3: An example of direct code generation.
User: Given the ‘MIT-BIH Arrhythmia Database’, please write some Python code to detect all the R-peaks in each ECG data sample and output the detection results …
LLM: Certainly! First, we need to load all records in the database, then we can adopt …… Here is an example to implement R-peak detection:
    import os
    import numpy as np
    import wfdb, ecgDetector
    def detect_r_peaks(ecg_data):
        # Implement R-peak detection algorithm
        # Return R-peak indices
    def evaluate_detection(gt_r_peaks, detected_r_peaks):
        # ......
    def process_ecg_record(record_name, ecg_data):
        # ......
    if __name__ == "__main__":
        # Specify the path to the database folder
        data_path = '/path_to/MIT-BIH-Arrhythmia-Database'
        main(data_path)
2.1 Background
Fig. 2 illustrates an example of recent work in Penetrative AI
[79], where various sensor data are textualized and embedded
in a prompt, which is further used to instruct the LLM to
perform inference tasks on the textualized sensor data.
Preliminary results indicate that the common knowledge
embedded in LLMs can be leveraged to accomplish various
real-world AIoT tasks ( e.g., inferring a user’s location via
WiFi fingerprints, and counting heartbeats via raw ECG data).
However, the inference processes of LLMs are largely opaque
to users, rendering the results less explainable and trustwor-
thy. Moreover, raw sensor data is transmitted from users to
LLM servers, raising privacy concerns about sensor data. Fur-
thermore, limited by token size, existing work down-samples
and quantizes raw sensor data, leading to degraded infer-
ence performance. The remote processing at LLM servers
also necessitates the round-trip transmission of prompts and
results over the network, which increases response latency.
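The textualization step described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of [79]: the function names (`downsample`, `textualize`, `build_prompt`), the decimation factor, and the prompt wording are hypothetical, and only the section titles follow Fig. 2.

```python
def downsample(samples, factor):
    """Keep every `factor`-th sample (naive decimation, no anti-alias filter)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    return samples[::factor]


def textualize(samples, sampling_rate_hz, factor=4, decimals=0):
    """Down-sample, round, and render a sensor trace as prompt text."""
    reduced = downsample(samples, factor)
    rounded = [round(x, decimals) for x in reduced]
    if decimals == 0:
        # Integers cost fewer tokens than floats such as "100.0".
        rounded = [int(x) for x in rounded]
    digits = ", ".join(str(x) for x in rounded)
    effective_rate = sampling_rate_hz / factor
    return f"The sampling rate is {effective_rate:g} Hz.\nDigitized: {digits}"


def build_prompt(objective, background, sensor_name, sensor_text):
    """Assemble a sectioned data interpretation prompt (section titles from Fig. 2)."""
    return "\n".join([
        "* Objective *", objective,
        "* Background Knowledge *", background,
        "* Input Sensor Data *", f"{sensor_name}: {sensor_text}",
    ])


# Hypothetical ECG excerpt; the values are made up for illustration.
ecg = [100.2, 100.9, 101.4, 103.6, 150.7, 104.1, 101.0, 100.3]
prompt = build_prompt(
    "Find the R-peaks in ECG data",
    "An R-peak is ...",
    "ECG",
    textualize(ecg, sampling_rate_hz=360, factor=2),
)
```

The sketch makes the trade-off visible: every increase in `factor` or decrease in `decimals` shortens the prompt but discards signal detail, which is the resolution loss that local execution of a synthesized program avoids.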
To overcome these issues, we explore a new approach
that leverages LLMs to synthesize AIoT programs and exe-
cutes the programs locally to process users’ data. This new
approach allows developers to examine and verify the synthesized
programs, protect data privacy, process sensor data
streams without compromising data resolution or quality,
and avoid transmission time over networks.
Figure 4: Code generation with user intervention.
Provided context: definition of R-peak and ECG data; research article: An Adaptive and Time-Efficient ECG R-Peak Detection Algorithm, https://doi.org/10.1155/2017/5980541
User prompts, in order:
* Target * According to the paper, please provide an algorithm outline to implement R-peak detection.
* Target * Next, please enrich the generated outline with more advanced algorithms.
* Target * Then, please write some Python code for each step in the algorithm.
* Target * Finally, please combine all the code and provide a final version.
LLM response excerpts:
Step 1: [Load Data] - Load the ECG data from … Step 2: [Preprocessing] - Apply bandpass filter …
Here is the enriched version: Step 1: [Load Data] … Step 2: [Preprocessing]
Here is the code for Step 1: … Here is the code for Step 2: …
Final version:
    import numpy as np
    # Step 1: Load and preprocess ECG signal
    def load_process_ecg_record(parameters):
    # Step 2: R-peak detection
    def detect_r_peaks(parameters):
    # Step 3: Post-processing
    def adaptive_thresholding(parameters):
    # Step 4: Validation and correction
    def validate_and_correct_r_peaks(parameters):
    # Step 5: Evaluate detection result
    def evaluate_detection(parameters):
    # Step 6: Output generation
    def generate_r_peak_output(parameters):
    def main(data_path):
        load_process_ecg_record()
        # ...
2.2 Preliminary Experiments
The latest LLMs [ 16,28,71] have demonstrated extraordinary
proficiency in generating code snippets. For example, Mer-
cury [ 28] leverages LLMs to generate code for well-defined
programming tasks. Is it feasible to instruct LLMs to synthe-
size programs that can tackle AIoT tasks? Our preliminary
results show that it is possible yet extremely challenging for
LLMs to synthesize functionally correct programs for AIoT
tasks. Take heartbeat detection as an example, as depicted
in Fig. 3. When we instruct the LLM to generate a program to
process raw ECG waveforms and detect heartbeats, the LLM
generates only a few null functions without concrete
implementations, or imports nonexistent packages.
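The two failure modes observed here, placeholder functions and imports of packages that do not exist, can be detected mechanically before a program is ever executed. The following sketch is our own illustration and not part of AutoIOT as described in this paper; it uses only Python's standard `ast` and `importlib` modules, and the helper names and the sample source are hypothetical.

```python
import ast
import importlib.util


def find_placeholder_functions(source):
    """Return names of functions whose body contains no executable statement."""
    empty = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            real_statements = [
                stmt for stmt in node.body
                # `pass`, docstrings, and a bare `...` do not count as an implementation.
                if not isinstance(stmt, ast.Pass)
                and not (isinstance(stmt, ast.Expr)
                         and isinstance(stmt.value, ast.Constant))
            ]
            if not real_statements:
                empty.append(node.name)
    return sorted(empty)


def find_missing_imports(source):
    """Return top-level modules imported by `source` that cannot be located."""
    missing = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top_level = name.split(".")[0]
            if importlib.util.find_spec(top_level) is None:
                missing.add(top_level)
    return sorted(missing)


# Hypothetical LLM output in the spirit of Fig. 3.
SAMPLE = '''
import os
import ecg_detector_that_does_not_exist

def detect_r_peaks(ecg_data):
    """Implement R-peak detection algorithm."""
    pass

def count_peaks(peaks):
    return len(peaks)
'''

placeholders = find_placeholder_functions(SAMPLE)
missing = find_missing_imports(SAMPLE)
```

Note that functions containing only comments, as in Fig. 3, are not even valid Python, so `ast.parse` would raise a `SyntaxError` for them; a caller would treat that error as a third failure signal.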
We hypothesize that the reasons for this might be three-
fold: 1) LLMs lack domain-specific knowledge, let alone the
latest algorithms tailored for AIoT tasks. As a result, for
highly specialized AIoT applications, LLMs can only offer
some suggestions or generate high-level code outlines rather
than detailed functional implementation. 2) AIoT applications
typically require systematic programming, where multiple
functional modules are first developed for different
subtasks (e.g., signal preprocessing, data cleaning, neural
network initialization) and then constructively integrated
to form a comprehensive and cohesive program.
This development process involves much higher reasoning,
planning, and programming complexity than simple,
well-defined programming tasks. 3) Current LLMs inherently
lack code validation and optimization mechanisms to ensure
the correctness of synthesized programs and improve their
performance in terms of execution efficiency
and inference accuracy in AIoT program synthesis tasks.
Figure 5: The system overview and workflow of AutoIOT.
User problem shown in the figure:
* Target *
Given the “MIT-BIH Arrhythmia Database”, please load all the ECG records from my local disk and detect all the R-peaks. Then, for each record, output its name along with the detection accuracy, adhering to the specified output format below.
* Remarks *
- I only download a zip file from the official website.
* Input *
The path to the dataset
* Output Format *
Case {ECG data record index}
Detection accuracy: 0.92
Case ……
Workflow shown in the figure: the AutoIOT agent, supported by a tool pool, passes the user problem through three modules. Background Knowledge Retrieval: terminology list (Term1, Term2, …), search tool, store the results. Automated Program Synthesis: algorithm outline generation (1. Load Data, 2. ……), detailed design (1. Load Data: input data path, load each record, …; 8. Output Results: output index, output accuracy), modularized code generation (code segment 1, code segment 2, ……), and modularized code integration. Code Improvement: integrated code, debug, algorithm. The output is the final program and instructions.
To validate the above hypotheses, we conduct follow-
up experiments 1) by providing necessary domain-specific
knowledge to facilitate the design and implementation of
corresponding algorithms to address the AIoT task and 2) by
explicitly instructing the LLM to synthesize programs with
clear structures via a divide-and-conquer approach. Fig. 4
illustrates the code generation process involving user in-
tervention. We first manually retrieve relevant background
knowledge ( e.g., definitions of ECG data and R-peak, research
papers about heartbeat detection) from the Internet and feed
the information to the LLM, enabling in-context learning.
Second, we instruct the LLM to learn the relevant context
and comprehend the papers. Then, we ask the LLM to gen-
erate an outline of the algorithm in the paper. We further
request the LLM to enrich the outline with more advanced
and detailed technologies. Later, we ask the LLM to generate
code snippets corresponding to each step of the algorithm.
The final program is thereafter synthesized by integrating all
the code snippets. Finally, we fix bugs if there are any, and
execute the program to evaluate its performance with test
data. We further give feedback and ask the LLM to refine the
program. With several rounds of iterations, the synthesized
program evolves and improves its performance in the task.
In summary, although the synthesized programs eventu-
ally achieve reasonable performance in the tested AIoT tasks,
this LLM-driven development method demands specialized
domain expertise and constant manual intervention through
several rounds of iterations for program optimization.
2.3 Motivation & Key Ideas
In this paper, we aim to develop an LLM-driven automated
natural language programming system named AutoIOT to synthesize programs for AIoT applications. AutoIOT features
three key modules: 1) Background knowledge retrieval module
that automatically collects domain knowledge from the Inter-
net for in-context learning; 2) Automated program synthesis
module that emulates the program development lifecycle
[42] via CoT prompting. This module decomposes an AIoT
task into several subtasks and generates corresponding func-
tional code snippets; and 3) Code improvement module that
executes the synthesized program and feeds the compiler
and interpreter feedback to the LLM, facilitating iterative
code correction and improvement. We note that although
the program synthesis process needs communication and in-
teraction with remote LLM servers, the synthesized program
can be executed locally on the client side. This approach
fundamentally differs from existing approaches such as Pen-
etrative AI [ 79], and allows users to not only preserve data
privacy but also improve the interpretability of synthesized
programs as well as inference results.
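The feedback loop of the code improvement module can be sketched as follows. This is a simplified illustration under stated assumptions: the program is executed in-process with `exec` for brevity (a real deployment would isolate it, e.g., in a separate interpreter process), and `repair` is a hypothetical callable that stands in for the LLM, receiving the failing source and the captured error text and returning a revised source.

```python
import contextlib
import io
import traceback


def run_program(source):
    """Execute `source`; return (ok, captured stdout, error text)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(compile(source, "<synthesized>", "exec"), {"__name__": "__main__"})
    except Exception:
        # Syntax errors (from compile) and runtime errors both end up here.
        return False, buffer.getvalue(), traceback.format_exc()
    return True, buffer.getvalue(), ""


def improve(source, repair, max_rounds=3):
    """Run the program; on failure, feed the error text back and retry."""
    for round_index in range(max_rounds + 1):
        ok, stdout, error_text = run_program(source)
        if ok:
            return source, stdout
        if round_index < max_rounds:
            source = repair(source, error_text)
    raise RuntimeError(f"program still failing after {max_rounds} repair rounds")


def toy_repair(source, error_text):
    """Stand-in for the LLM: patches the one bug used in this example."""
    if "NameError" in error_text:
        return source.replace("undefined_name", "42")
    return source


fixed_source, output = improve("print(undefined_name)\n", toy_repair)
```

Because the loop only ever sends source code and error messages to the model, the sensor data that the program processes never leaves the device, which is the privacy property highlighted above.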
AutoIOT builds an intelligent agent that can automatically
synthesize programs to fulfill user requirements in AIoT ap-
plications. As shown in Fig. 5, AutoIOT comprises three key
modules: background knowledge retrieval (§ 4.2), automated
program synthesis (§ 4.3), and code improvement (§ 4.4).
Users can specify their requirements on AIoT applications
in natural language ( ①). Then, the background knowledge
retrieval module identifies a set of relevant terminologies
(②) and searches over the Internet ( ③). With the retrieved
domain-specific knowledge, the automated program synthesis
module instructs the agent to draft an algorithm outline ( ④).
The agent is then requested to elaborate on each step of the
algorithm and produce a detailed design ( ⑤). Such a process
decomposes a complex AIoT task into several manageable
subtasks. Then, the agent is instructed to generate a code seg-
ment for each subtask ( ⑥). Afterward, the agent is requested
to integrate the codes for subtasks and synthesize the final
program (⑦). Next, the code improvement module executes
the synthesized program and feeds the compiler and
interpreter output back to the agent. The agent iteratively
corrects syntax and semantic errors (⑧). With the obtained
output from the synthesized program, AutoIOT explicitly
instructs the agent to explore more advanced algorithms
using the web search tool, aiming to optimize the program
and improve the performance of inference tasks (⑨). After
several iterations (④-⑨), AutoIOT presents the final program
with detailed documentation (⑩). In addition, AutoIOT
provides an interface for users to offer specific algorithms or
instructions for code improvement.
Figure 6: AutoIOT agent answers user’s prompt with LLMs and LangChain tools. (Figure content: the input prompt "Search for definitions of ECG data from Wikipedia and store the retrieved content in a local database." is combined with the accessible tool list and a mapping table into an encapsulated prompt containing the original prompt, the target, and the available tools. The LLM uses the tools, here search and store, and its output is parsed into the final response.)
To enable the interaction between the LLM and the web
search tool, the knowledge database, and the code executor,
we leverage the LangChain [ 7] framework to build an intel-
ligent agent. LangChain assembles various tools (abstracted
as functions) and provides descriptions of the available func-
tions (added into prompts) to the LLM. With such a prompt,
the LLM performs reasoning and selects suitable tools to
answer the user’s query (potentially via multiple rounds of
function invocations and message exchanges initiated by the
LLM). This approach allows the LLM to answer queries that
require context information ( e.g., local weather, user’s local
documents), augmenting the LLM with retrieved knowledge.
Taking step ③ in Fig. 5 as an example, Fig. 6 shows how the
AutoIOT agent works with LangChain tools and the LLM to
answer a query about terminology searching over the network.
Given an input prompt, AutoIOT encapsulates it with additional
information (e.g., a list of available tools) and sends it
to the LLM. Then, the LLM performs a sequence of actions,
possibly leveraging the available tools in the list, and generates
an output. The output is then passed to a parser to
generate the final response. We note that the above processes
involved in step ③ as well as the other steps (①-⑩ in Fig. 5) are
all automatically orchestrated by the AutoIOT agent.
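To make the encapsulate, act, and parse cycle concrete, the sketch below re-creates it in plain Python without calling LangChain itself. Everything in it is an assumption made for illustration: the JSON reply convention, the tool registry `TOOLS`, the stub `search_stub`, and the scripted `fake_llm` that replaces a real model.

```python
import json


def search_stub(query):
    """Hypothetical stand-in for a web search tool."""
    return f"(search results for: {query})"


# Tool registry: name -> description (shown to the model) and callable.
TOOLS = {
    "web_search": {
        "description": "Search the web and return a text snippet.",
        "run": search_stub,
    },
}


def encapsulate(prompt, tools):
    """Prepend descriptions of the available tools to the original prompt."""
    lines = ["* Available Tools *"]
    for name, tool in tools.items():
        lines.append(f"- {name}: {tool['description']}")
    lines += ["* Original Prompt *", prompt]
    return "\n".join(lines)


def agent_step(llm, prompt, tools, max_calls=5):
    """Let the model call tools until it returns a final answer."""
    transcript = encapsulate(prompt, tools)
    for _ in range(max_calls):
        reply = json.loads(llm(transcript))  # parser for the assumed JSON format
        if "final" in reply:
            return reply["final"]
        result = tools[reply["tool"]]["run"](reply["input"])
        transcript += f"\n* Tool Result ({reply['tool']}) *\n{result}"
    raise RuntimeError("no final answer within max_calls tool invocations")


def fake_llm(transcript):
    """Scripted model: search once, then answer."""
    if "* Tool Result" in transcript:
        return json.dumps({"final": "ECG data records the heart's electrical activity."})
    return json.dumps({"tool": "web_search", "input": "definition of ECG data"})


answer = agent_step(fake_llm, "Search for definitions of ECG data.", TOOLS)
```

In the actual system these responsibilities are delegated to the LangChain framework [7]; the sketch only shows why the tool descriptions must be part of the prompt for the model to select a tool.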
Usage Scenario. Suppose a user wants to develop a heartbeat
detection application. She can describe her requirements for
the application to AutoIOT in natural language.
Then, AutoIOT automatically synthesizes
a corresponding program and documentation for the user.
Following the instructions in the documentation, she can
deploy and execute the program on a target device that
contains the patient’s ECG data. The program then generates
the final heartbeat detection results for her.
4 SYSTEM DESIGN
In this section, we select heartbeat detection as an example
to illustrate how AutoIOT works.
4.1 User Interface
Users can describe an AIoT task in natural language as input
to AutoIOT. To help the LLM interpret the intention and
desired outcome ( i.e., synthesized program and inference
results), we design a prompt template for the user to describe
the problem, since LLMs can comprehend and process well-
structured instructions more efficiently [ 30]. As shown in
Fig. 5, the user problem includes four parts: target, remarks,
and program input and output specifications. The target part
describes the user’s objective and task in natural language,
e.g., "Given the MIT-BIH Arrhythmia Database, please load
all the ECG records and detect all the R-peaks. Then, evaluate
the detection results and output the detection accuracy for
each record." The remarks provide additional information,
e.g., "I only downloaded a zip file from the dataset’s official
website". The program input and output specifications clarify
the I/O format of the synthesized program. The path to the
dataset is required during the code improvement process for
program execution and optimization.
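One possible way to represent the four-part user problem in code is a small data class that renders the sectioned prompt. This is a sketch under our own assumptions; the class name, field names, and rendering order are hypothetical, and only the section titles and example text are taken from Fig. 5.

```python
from dataclasses import dataclass, field


@dataclass
class UserProblem:
    """The four parts of the user problem template (field names are assumptions)."""
    target: str
    program_input: str
    output_format: str
    remarks: list = field(default_factory=list)

    def render(self):
        """Render the problem as a sectioned prompt in the style of Fig. 5."""
        sections = [("Target", self.target)]
        if self.remarks:
            sections.append(("Remarks", "\n".join(f"- {r}" for r in self.remarks)))
        sections.append(("Input", self.program_input))
        sections.append(("Output Format", self.output_format))
        return "\n".join(f"* {title} *\n{body}" for title, body in sections)


problem = UserProblem(
    target=("Given the MIT-BIH Arrhythmia Database, please load all the ECG "
            "records and detect all the R-peaks."),
    remarks=["I only downloaded a zip file from the dataset's official website."],
    program_input="The path to the dataset",
    output_format="Case {ECG data record index}\nDetection accuracy: 0.92",
)
rendered = problem.render()
```

Keeping the parts as separate fields means the same problem description can be re-inserted verbatim into later prompts, which the synthesis stages rely on.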
4.2 Background Knowledge Retrieval
LLMs are typically trained on extensive and pre-collected
corpus datasets that include a wide range of general com-
mon knowledge over the Internet. These training datasets,
however, may not include domain-specific knowledge or the
latest advances in research literature. In the rapidly evolving
AIoT field, with new technologies and algorithms constantly
emerging, the knowledge gap is particularly pronounced. To
fill this gap, we develop the background knowledge retrieval
module to automatically identify and fetch necessary infor-
mation online so as to enable in-context learning for LLMs
augmented with up-to-date domain knowledge.
Terminology Determination. The background knowledge
retrieval module first instructs the agent to identify some
relevant key terminologies given the user problem, with the
prompt shown in Fig. 7(a). For example, given the user prob-
lem in Fig. 5, the terminologies generated by the LLM are
"MIT-BIH Arrhythmia Database", "ECG data", and "R-peaks".
Next, to obtain the relevant knowledge, AutoIOT instructs
(Fig. 7(b)) the LLM to actively search for the definitions and
descriptions of these terminologies from public websites,
such as Wikipedia and GitHub, with the web search tool.
Additionally, for the terminologies with multiple interpreta-
tions, AutoIOT requests the LLM to filter out the irrelevant
content and focus on those pertinent to the user problem.
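The terminology determination step amounts to filling a prompt template and parsing a comma-separated reply. A minimal sketch follows; the template text is taken from Fig. 7(a), while the function names and the clean-up rules (quote stripping, case-insensitive de-duplication) are our own assumptions.

```python
TERMINOLOGY_TEMPLATE = """* User Problem *
<user_input>{user_input}</user_input>
* Motivation *
I need to understand the meaning of some concepts in the user's problem to gain some background knowledge.
* Target *
Based on the user's problem, please determine a list of terminologies that I need to search online. Please only output a list of terminologies.
* Response Format *
term1, term2, ..."""


def build_terminology_prompt(user_input):
    """Fill the user problem into the terminology determination template."""
    return TERMINOLOGY_TEMPLATE.format(user_input=user_input)


def parse_terms(response):
    """Split a 'term1, term2, ...' reply into a clean, de-duplicated list."""
    terms = []
    seen = set()
    for raw in response.replace("\n", ",").split(","):
        term = raw.strip().strip("\"'").strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms


reply = '"MIT-BIH Arrhythmia Database", ECG data, R-peaks, ecg data'
terms = parse_terms(reply)
```

Each resulting term would then be substituted into the `{terminology}` slot of the terminology searching prompt in Fig. 7(b).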
Figure 7: The prompt template for (a) terminology determination, (b) terminology searching, (c) algorithm outline generation, and (d) detailed design generation.
(a) Terminology determination
* User Problem *
<user_input>{…}</user_input>
* Motivation *
I need to understand the meaning of some concepts in the user’s problem to gain some background knowledge.
* Target *
Based on the user’s problem, please determine a list of terminologies that I need to search online. Please only output a list of terminologies.
* Response Format *
term1, term2, …
(b) Terminology searching
* User Problem *
<user_input>{…}</user_input>
* Target *
Use the web search tool to search for precise definition or description of {terminology}.
* Rules *
- Wikipedia is preferred.
- Filter out the contents that are irrelevant to the problem.
- Do not provide algorithms or implementation details.
* Response Format *
URL1, URL2, ……
(c) Algorithm outline generation
* User Problem *
<user_input>{…}</user_input>
* Target *
Based on the background information in the context documents, please provide an algorithm outline step by step to solve the user’s problem.
* Rules *
- Use the web search tool multiple times.
- Analyze the retrieved information and filter out irrelevant contents.
- Refer to the context.
(d) Detailed design generation
* User Problem *
<user_input>{…}</user_input>
* Target *
Please elaborate on each step with detailed technologies or algorithms.
* Rules *
- Do not modify the outline.
- Use the web search tool.
- Filter out irrelevant content.
* Response Format *
Step 1: [title]
- xxx
Step 2: xxx
Context Database Construction. After retrieving necessary
information from relevant websites, AutoIOT uses OpenAI’s
text embedding [56] to convert the HTML documents
into vector representations, containing semantic meanings
comprehensible to the LLM. These representations are then
used to build a local vector database with Faiss [27], serving
as a contextual knowledge base for the LLM. During the inference
process, the LLM retrieves relevant content from the
database with a high degree of similarity to the user problem
in the vector space. This approach ensures that the LLM
understands the user problem and objective with necessary
context information and domain knowledge.
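The retrieval principle behind the context database, embedding documents and returning those closest to the query in vector space, can be illustrated with a self-contained toy. The sketch deliberately replaces both components named above: a bag-of-words count vector stands in for OpenAI's text embedding, and a linear scan with cosine similarity stands in for a Faiss index. The class `ContextStore` and the example documents are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lower-cased word counts (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ContextStore:
    """In-memory stand-in for a vector database."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text):
        self._items.append((text, embed(text)))

    def search(self, query, k=2):
        """Return the k stored texts most similar to the query."""
        query_vector = embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ContextStore()
store.add("An R-peak is the most prominent upward deflection of the QRS complex in an ECG.")
store.add("The MIT-BIH Arrhythmia Database contains annotated ECG recordings.")
store.add("Millimeter-wave radar measures range and velocity.")
top = store.search("What is an R-peak in ECG data?", k=1)
```

Word-count vectors only match exact tokens, whereas learned embeddings also capture paraphrases; the ranking logic, however, is the same nearest-neighbor search that a vector index accelerates.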
Remarks. The background knowledge retrieval module is
user-friendly, as it is operated automatically by our AutoIOT
agent without any user intervention. In addition, we provide
an interface for users to explicitly complement necessary
background information as well as highly specialized do-
main knowledge ( e.g., research papers, detailed algorithm
descriptions) to enrich the context database.
4.3 Automated Program Synthesis
As observed in our preliminary experiments (§ 2.2), if we
directly instruct LLMs to generate programs for AIoT applications,
the user must undertake multiple subtasks manually.
Specifically, the user needs to decompose the AIoT
task into several subtasks and request the LLM to generate a
solution for each subtask. After integrating the solutions, the
user has to manually debug, execute, and improve the pro-
gram. Although the synthesized program eventually meets
the user’s requirement, the program synthesis necessitates
frequent user intervention and active involvement through-
out the process, which is cumbersome and time-consuming.
To address this problem, we develop the automated program
synthesis module, aiming to automate the programming pro-
cedure, reduce the involved workload, and improve the de-
velopment efficiency and user experience. In particular, the
automated program synthesis module uses Chain-of-Thought
(CoT) prompts to guide LLMs through step-by-step reason-
ing processes, thereby enhancing their capability of tackling
complex AIoT problems by mimicking human-like divide-
and-conquer reasoning processes.
CoT 1: Algorithm Outline Generation. AutoIOT first
prompts the LLM to analyze the user problem and design a
preliminary algorithm outline. As illustrated in Fig. 7(c), the prompt for algorithm outline generation consists of three
parts: 1) The "User Problem" is reiterated at the beginning
to ensure continuity and coherence in the LLM’s responses
since the LLM may forget the previous context [ 75]. 2) The
"Target" specifies our request, i.e., algorithm outline gen-
eration. 3) The "Rules" add detailed instructions for qual-
ity assurance. For example, AutoIOT explicitly requests the
LLM to actively search for advanced AIoT algorithms using
the web search tool throughout the process. AutoIOT also
asks the LLM to filter out irrelevant information. With such
well-structured prompts, the LLM can generate an algorithm
outline according to the problem specification.
CoT 2: Detailed Design Generation. Given the gener-
ated algorithm outline, the LLM is then tasked with further
elaborating on each step in the outline with more detailed
technologies and algorithms to refine the approach compre-
hensively. The prompt for this stage includes "User Problem",
"Target", and "Rules". In addition, it also includes a new re-
quirement - "Response Format" to specify the expected for-
mat of the LLM’s output, as illustrated in Fig. 7(d). Given such
a prompt, the LLM can generate detailed steps and specific
actions in each step to achieve the overall objective and solve
the user problem. In this stage, essentially AutoIOT guides
the LLM to decompose the AIoT task into multiple subtasks,
facilitating a divide-and-conquer strategy to synthesize the
corresponding program in the next stage.
CoT 3: Modularized Code Generation. Given a set of sub-
tasks generated at the previous stage, AutoIOT instructs the
LLM to generate one function for each module with a clear
function name, signature, and input/output specification.
The modules can then be developed independently, ensuring
specific functionalities are effectively implemented according
to the algorithm descriptions generated above. This divide-
and-conquer approach is proven effective in synthesizing
complex programs by fully exploiting the modularized code
generation capabilities [17] of LLMs.
In our initial trials, we observed that LLMs sometimes
generate null functions with placeholders, invoke undefined
functions, or import nonexistent packages. To tackle this
AutoIOT: LLM-Driven Automated Natural Language Programming for AIoT Applications MobiCom ’25, Nov 4–8, 2025, Hong Kong, China
(a) Modularized code generation
* Target *
According to step {…} in the outline, write some Python code to implement it.
* Rules *
- Develop one well-structured function with detailed comments, including clear function name, signatures, and I/O specifications.
- Do not provide a null function with only placeholders. Do not import nonexistent packages and use undefined functions.
- Consider edge cases.
- Ensure consistency.

(b) Modularized code integration
* Target *
Constructively integrate all the previous code segments.
* Rules *
- Do not provide null functions with only placeholders, but detailed
- The code should be executed via command line: {Python -i <input_file>}
- Ensure consistency.
* Response Format * ```

(c) Code debugging
* Target *
The compiler/interpreter can’t successfully run the code. Analyze and correct the code.
* Rules *
- Provide revised code in complete format. The following is not allowed:
  # (same as before)
  # (function remains the same)
- Use the web search tool for the correct usage of some packages or
* Compiler/Interpreter Logs *
{…}

(d) Algorithm modification
* User Problem *
<user_input>{…}</user_input>
* Target *
- The program output is listed below. Please first analyze it and modify the algorithm outline to improve the performance by integrating more advanced algorithms.
* Program Output *
* Rules *
- Omit any warnings.
- Refer to the chat history and the context documents.
Figure 8: The prompt template for (a) modularized code generation, (b) modularized code integration, (c) code
debugging, and (d) code improvement via algorithm modification
problem, AutoIOT adds more stringent rules and require-
ments to explicitly ask the LLM to avoid generating null
functions and verify the availability of imported packages
and invoked functions by involving the web search tool.
With the prompts shown in Fig. 8(a), the LLM can generate
cohesive code segments with detailed comments for each
module, facilitating the module integration in the next stage.
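AutoIOT handles these three failure modes through prompt rules and the web search tool. As a complementary illustration only (this static check is an assumption of this sketch and is not described in the paper), the same three problems can also be detected locally with Python's standard `ast` and `importlib` modules. The function name `check_generated_code` is hypothetical.

```python
import ast
import builtins
import importlib.util


def _is_placeholder(stmt: ast.stmt) -> bool:
    """True for `pass`, a bare `...`, or a lone docstring/constant."""
    if isinstance(stmt, ast.Pass):
        return True
    return isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Constant)


def check_generated_code(source: str) -> list[str]:
    """Return a list of problems found in LLM-generated Python source."""
    tree = ast.parse(source)
    problems = []
    defined = set(dir(builtins))  # names that are always available
    imported = set()              # top-level package names that are imported

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.Import):
            for alias in node.names:
                imported.add(alias.name.split(".")[0])
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            if node.module and node.level == 0:
                imported.add(node.module.split(".")[0])
            for alias in node.names:
                defined.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            defined.add(node.id)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)

    for node in ast.walk(tree):
        # 1) Null functions: every statement in the body is a placeholder.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if all(_is_placeholder(stmt) for stmt in node.body):
                problems.append(f"null function: {node.name}")
        # 2) Calls to bare names that are never defined or imported.
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in defined:
                problems.append(f"undefined function: {node.func.id}")

    # 3) Imported packages that cannot be found in the local environment.
    for name in sorted(imported):
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")
    return problems
```

The check is deliberately conservative: it ignores attribute calls such as `os.listdir(...)` and does not track scopes, so it can miss problems that only an actual run would reveal.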
CoT 4: Modularized Code Integration. Given the gener-
ated code segments, AutoIOT prompts the LLM to construc-
tively integrate all modularized code segments and create a
cohesive and comprehensive program. Since the code gener-
ated for different modules may have disparate input/output
variable names, AutoIOT first prompts the LLM to ensure the
consistency among all the modules and synthesize the final
program without null functions as illustrated in Fig. 8(b).
For the convenience of code execution, debugging, and
optimization, AutoIOT asks the LLM to add a main function
so that the program can be directly executed from the command
console (e.g., python3 -i <input_file>).
Moreover, AutoIOT also asks the LLM to generate user docu-
mentation in Markdown format, specifying how to properly
install, execute, and troubleshoot the program for end users.
Remarks. The automated program synthesis module facili-
tates a seamless transformation from natural language to a
readily executable program with CoT prompts. AutoIOT can be
regarded as an experienced developer, adept at decomposing
complex AIoT tasks into multiple modules, generating modu-
larized code, and organically integrating them. The program
synthesis process can be fully automated by the agent.
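The four CoT stages above can be read as a chain of LLM calls in which each stage consumes the previous stage's output. The following is a minimal sketch under stated assumptions: `llm` is a hypothetical callable that maps a prompt string to a reply string, the prompt wording is abbreviated, and the detailed design is assumed to come back as one step per line, which is a simplification of the "Response Format" used by AutoIOT.

```python
from typing import Callable


def synthesize_program(user_problem: str, llm: Callable[[str], str]) -> str:
    """Chain the four CoT stages and return the integrated program text."""
    # CoT 1: algorithm outline generation.
    outline = llm(
        f"User problem: {user_problem}\nTarget: generate an algorithm outline."
    )
    # CoT 2: detailed design generation (assumed format: one step per line).
    design = llm(
        f"User problem: {user_problem}\nOutline:\n{outline}\n"
        "Target: elaborate each step; reply with one step per line."
    )
    steps = [line.strip() for line in design.splitlines() if line.strip()]
    # CoT 3: modularized code generation, one function per step.
    modules = [
        llm(f"Target: write one Python function implementing this step.\nStep: {step}")
        for step in steps
    ]
    # CoT 4: modularized code integration.
    return llm(
        "Target: integrate all code segments into one program with a main function.\n"
        + "\n\n".join(modules)
    )
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular client library and makes the chain easy to exercise with a stub.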
4.4 Code Improvement
In § 2.2, we found that the LLM can evaluate and improve
the code with heavy user intervention. To alleviate the user’s
workload involved in debugging and code optimization, we
develop the code improvement module.
Automated Debugging. Upon obtaining the final program
after integration, AutoIOT constructs a code executor to
run the generated code within a virtual environment (e.g., a sandbox), ensuring safe and controlled code execution. The
code executor loads the sensor dataset from the user’s local
device for program execution and exports the compiler or
interpreter output to the LLM. If the program encounters execution
issues (e.g., syntax or I/O errors), AutoIOT embeds the
logs from the compiler or interpreter into a prompt (Fig. 8(c)) and
instructs the LLM to debug the code over several rounds of interaction
until the generated code can be executed successfully.
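The execute-and-repair loop described above can be sketched as follows. This is an illustration under assumptions, not AutoIOT's code executor: `ask_llm_to_fix` is a hypothetical callable that takes the current source and the captured logs and returns revised source, and a child process in a temporary directory stands in for the sandbox (it provides no real isolation).

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable


def run_program(source: str, timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run Python source in a child process; return (succeeded, logs)."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "program.py"
        script.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout_s} seconds"
    return result.returncode == 0, result.stdout + result.stderr


def debug_until_runnable(
    source: str,
    ask_llm_to_fix: Callable[[str, str], str],
    max_rounds: int = 5,
) -> tuple[str, bool]:
    """Feed interpreter logs back to the LLM until the program runs."""
    for _ in range(max_rounds):
        ok, logs = run_program(source)
        if ok:
            return source, True
        source = ask_llm_to_fix(source, logs)  # one round of interaction
    ok, _ = run_program(source)  # check the last revision
    return source, ok
```

Bounding the number of rounds reflects the observation later in the paper that the LLM may not always be able to fix a bug, in which case the user has to step in.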
Code Optimization via Algorithm Modification. To
achieve better performance, AutoIOT progressively refines
the synthesized program via several iterations. In particular,
AutoIOT processes the test dataset with the first version of
the integrated program. Then, AutoIOT prompts (Fig. 8(d))
the LLM with the context information (e.g., algorithm outline,
chat history) from generating the first program as well as
the program output, and asks the LLM to improve the performance
by integrating more advanced algorithms. Specifically,
AutoIOT uses the web search tool to search academic papers
and relevant webpages for solutions that can achieve higher
accuracy. This initiates a new recursive cycle
of program synthesis, starting from refining the algorithm
outline accordingly, enriching the outline with the retrieved
algorithms, generating modularized code for the updated
design, combining the modularized code, debugging, and improving
code quality. This code optimization cycle is not a
one-time process but is repeated multiple times. Empirically,
AutoIOT takes five iterations to progressively generate five
different programs, striking a balance between thoroughness
and efficiency. Finally, AutoIOT requests the LLM to analyze
the execution results of all the programs and select the one
that achieves the best performance as the final program.
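The iterate-then-select procedure can be sketched as a short loop. Note the simplification: AutoIOT asks the LLM to analyze the execution results and pick the final program, whereas this sketch selects by a single numeric score. `evaluate` and `revise` are hypothetical callables standing in for running a program on the test dataset and for one full re-synthesis cycle, respectively.

```python
from typing import Callable, NamedTuple


class Candidate(NamedTuple):
    version: int
    source: str
    accuracy: float


def improve_and_select(
    first_source: str,
    evaluate: Callable[[str], float],
    revise: Callable[[str, float], str],
    iterations: int = 5,
) -> tuple[Candidate, list[Candidate]]:
    """Generate `iterations` program versions and return the best one."""
    candidates = []
    source = first_source
    for version in range(1, iterations + 1):
        accuracy = evaluate(source)
        candidates.append(Candidate(version, source, accuracy))
        if version < iterations:
            # Start a new synthesis cycle from the current program and result.
            source = revise(source, accuracy)
    best = max(candidates, key=lambda c: c.accuracy)
    return best, candidates
```

Keeping every candidate rather than only the latest one matters because a later iteration is not guaranteed to outperform an earlier one.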
Fig. 9 shows a specific example of how the synthesized
program evolves over multiple iterations. Version 1 is gen-
erated with a bug. By providing the error message to the
LLM, AutoIOT can automatically fix the bug and generate
Version 2. AutoIOT further instructs the LLM to modify the
algorithm and adopt advanced technologies to generate new
versions that iteratively achieve higher accuracies.
MobiCom ’25, Nov 4–8, 2025, Hong Kong, China Leming Shen, Qiang Yang, Yuanqing Zheng, Mo Li
[Figure 9 content, in iteration order]
Version 1:
    import wfdb
    def load_mit_bih_database():
        record_names = wfdb.get_record_list('mitdb')
        # Iterate over each record name
        for name in record_names:
            # Load the record and its annotations
            record = wfdb.rdrecord(name, pb_dir='')
Executor:
    $ python3 -i <path>
    TypeError: rdrecord() got an unexpected keyword argument
Version 2:
    import wfdb, os
    def load_mit_bih_database():
        record_names = wfdb.get_record_list('mitdb')
        # Iterate over each record name
        for name in record_names:
            # Load the record and its annotations
            record = wfdb.rdrecord(os.path.join(path, name))
Executor:
    $ python3 -i <path>
    Average detection accuracy: 0.85
Algorithm outline: 1. Load data 2. Preprocess
Modify algorithm outline and adopt more advanced technologies
Version 3:
    import wfdb
    def load_mit_bih_database():
        # Iterate over each ...
        # ......
    def bandpass_filter(param):
        # Apply bandpass ...
        # ......
Executor:
    $ python3 ...
    Accuracy: 0.94
Modify … and ……
Version 4:
    import wfdb
    def load_mit_bih_database():
        # ......
    def bandpass_filter(param):
        # ......
    def adaptive_thresholding(param):
        # ......
Executor:
    $ python3 ...
    Accuracy: 0.97
Figure 9: An example of code improvement via iterations (details omitted).
Supporting User-in-the-Loop Optimization. After each
iteration of code optimization, AutoIOT also provides an
interface for the user to optionally provide instructions that
can help the LLM improve the synthesized program. For
example, when the user finds that the LLM fails to recall
relevant information from the retrieved contents, the user
can prompt AutoIOT to refer to a specified algorithm or
provide a recent academic paper. This enables a user-in-the-
loop optimization that requires minimal user intervention
and promotes code optimization iteratively.
Remarks. The code improvement module automates code
debugging and optimization by leveraging the LLM’s profi-
ciency in debugging and refining code based on the compiler
and interpreter feedback [ 23,55]. This automation not only
relieves the manual burden but also heralds a new era where
AIoT applications can evolve iteratively and autonomously
with minimum user intervention.
5.1 Implementation
We implement AutoIOT with GPT-4 [ 16] based on LangChain
[7], which provides various tools (e.g., a web search engine and
a vector database) for LLMs to collect relevant information.
We select Tavily [9] as the web search tool to search
for relevant information. It uses OpenAI’s text embedding
model [ 56] to convert the retrieved webpages into vector
representations. AutoIOT then uses Faiss [ 27] for efficient
similarity search of vector representations. The code execu-
tor controlled by AutoIOT is deployed on a Linux Ubuntu
workstation equipped with an NVIDIA RTX 4090 GPU.
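To make the retrieval step concrete, the sketch below shows what "similarity search of vector representations" computes. It is a pure-Python brute-force stand-in and does not use Faiss or the OpenAI embedding model; the function names and the tiny two-dimensional vectors are assumptions chosen only for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], documents: dict[str, list[float]], k: int = 2):
    """Return the k (doc_id, score) pairs most similar to the query vector."""
    scored = [
        (doc_id, cosine_similarity(query, vector))
        for doc_id, vector in documents.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

A dedicated library such as Faiss serves the same purpose but is built to scale to far larger collections of high-dimensional embeddings than this exhaustive scan.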
5.2 AIoT Applications & Datasets
We select four representative AIoT applications from the do-
mains of healthcare and human activity recognition (HAR).
Unlike other program synthesis tasks, these AIoT tasks re-
quire domain-specific knowledge and highly specialized al-
gorithms in signal processing and machine learning.
Heartbeat Detection. R-peak detection in electrocardiogram
(ECG) data is a crucial task in cardiac signal processing,
serving as a foundational step for heart rate variability studies and arrhythmia detection [78]. We use the MIT-BIH
Arrhythmia Database [1] and five representative baseline
algorithms, including Hamilton [ 35], Christov [ 24], Engzee
[31], Pan-Tompkins [60], and SWT [45].
IMU-based Human Activity Recognition. Inertial mea-
surement unit (IMU)-based HAR enables continuous identifi-
cation of a wide range of daily activities ( e.g., sitting, walking)
by capturing and analyzing motion characteristics from the
IMU data [ 33,80]. For the baselines, we select five open-
source GitHub repositories: LSTM-RNN [ 3], 1D-CNN [ 4],
Conv-LSTM [ 6], BiLSTM [ 5], and NN [ 8]. We compare their
performance with AutoIOT on the WISDM dataset [46].
mmWave-based Human Activity Recognition. mmWave
can capture fine-grained human gestures with high resolu-
tion [ 25,26,81]. We select the XRF55 dataset [ 73] with the
models proposed in the paper as the baselines, including
ResNet-18, 34, 50, 101, and 152. This task is more challenging
because: 1) The XRF55 dataset was published online
only a few months ago, which means LLMs have
not yet seen knowledge about this dataset; 2) The mmWave
data has high dimensionality, necessitating the use of more
sophisticated models with optimized configurations.
Multimodal Human Activity Recognition. By leverag-
ing different sensors to capture complementary information,
HAR systems can achieve higher robustness and versatility.
We select the Harmony dataset [ 59] containing three sen-
sor modalities: audio, depth image, and radar. The baseline
system consists of three encoders, each designed to extract
unique features from the respective modalities, followed by
feature concatenation and a classifier model. This task is
also challenging for AutoIOT as it involves the fusion of data
from different modalities with cross-modal interaction.
6.1 Metrics
We adopt the following evaluation metrics: 1) Task accuracy :
we repeat the experiment 10 times and calculate the average
task accuracy. In heartbeat detection, we consider task accu-
racy as the percentage of correctly identified peaks within a
[Figure 10 plots omitted. Panels: (a) Heartbeat detection; (b) IMU & mmWave-based HAR; (c) Multimodal HAR; (d) Inference time per sample. Axes: Baselines vs. Average Acc. (%).]
Figure 10: The overall performance of the four IoT applications. In (a), A. for AutoIOT, H. for Hamilton, C. for
Christov, E. for Engzee, P. for Pan-Tompkins, and S. for SWT. In (b), N. for NN, 1D for 1D-CNN, Bi. for BiLSTM, C.
for Conv-LSTM, L. for LSTM-RNN, and n for ResNet-n. In (c) & (d), A.1, A.2, and A.3 for three different AutoIOT-generated
programs; B. for the baseline in the multimodal HAR application.
predefined tolerance window compared to the ground truth.
In HAR, we consider classification accuracy. 2) MAE: We use
mean absolute error (MAE) to measure the discrepancy in
beat positions between the predicted R-peaks and the ground
truth. 3) Communication cost : we use psutil [2] to monitor
the network traffic. 4) Wall-clock execution time : we record
the total time consumed from the moment the user inputs
the problem to the generation of the final inference results
for all the sensor data. 5) Memory consumption : we record
the GPU memory consumption during code execution if AI
models are adopted. 6) Inference time per sample : we compute
the inference time per data sample if AI models are used.
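The two heartbeat metrics can be made concrete with a short sketch. Several details are assumptions, since the text does not spell them out: peaks are given as sample indices, each ground-truth peak is greedily paired with the nearest unused prediction inside the tolerance window, accuracy is the share of ground-truth peaks that are matched, and MAE is taken as the mean absolute position error over matched pairs.

```python
def match_peaks(predicted: list[int], truth: list[int], tolerance: int):
    """Pair each ground-truth peak with the nearest unused prediction in range."""
    unused = sorted(predicted)
    pairs = []
    for t in sorted(truth):
        in_window = [p for p in unused if abs(p - t) <= tolerance]
        if in_window:
            best = min(in_window, key=lambda p: abs(p - t))
            unused.remove(best)  # each prediction can match only once
            pairs.append((best, t))
    return pairs


def detection_accuracy(predicted, truth, tolerance) -> float:
    """Fraction of ground-truth peaks matched within the tolerance window."""
    if not truth:
        return 0.0
    return len(match_peaks(predicted, truth, tolerance)) / len(truth)


def mean_absolute_error(predicted, truth, tolerance) -> float:
    """Mean absolute position error, in samples, over the matched pairs."""
    pairs = match_peaks(predicted, truth, tolerance)
    if not pairs:
        return float("nan")
    return sum(abs(p - t) for p, t in pairs) / len(pairs)
```

For the toy inputs `predicted=[100, 305, 520]` and `truth=[102, 300, 700]`, a tolerance of 10 samples matches two of the three ground-truth peaks, while a tolerance of 200 matches all three. This shows how strongly the reported accuracy depends on the chosen window.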
6.2 Performance against Baselines
6.2.1 Average accuracy & MAE. Fig. 10(a) shows the heart-
beat detection accuracy with MAE of AutoIOT (denoted as A.)
and baselines. First, AutoIOT can automatically synthesize a program
that achieves performance comparable to the baselines
in the heartbeat detection task. More notably, the
synthesized program even outperforms some of
the baselines. For example, it achieves
higher detection accuracy than Pan-Tompkins (P.) and Engzee
(E.). Moreover, it yields a lower error than Christov
(C.) and Pan-Tompkins (P.). To investigate the reasons behind
this, we examine and analyze the synthesized program.
We learned: 1) Armed with the web search tool, the synthesized
program implemented both basic and sophisticated
signal processing methods, including bandpass filtering
and stationary wavelet transformation in preprocessing,
and adaptive thresholding in postprocessing. Some selected
algorithms are well-known and widely adopted, while oth-
ers are less likely to be chosen, even by experienced pro-
grammers with domain expertise. Unlike narrowly focused,
well-defined simple programming tasks, AIoT tasks typically
require systematic integration of multiple algorithms and
components to achieve optimal system performance, which
creates opportunities for AutoIOT to explore more possibili-
ties in automatically synthesizing optimized programs that
could outperform some, though not all, representative baselines.
2) Given a single performance objective, we notice that
the LLM carries out extensive optimization, sometimes at the
cost of other equally important metrics. For example, when
instructed to improve the detection accuracy, the synthesized program sets a larger tolerance window, which increases the
chance of correctly detecting heartbeats (true positives) at
the cost of increased false positives. Considering AIoT appli-
cations’ complexity and multiple competing or even contra-
dicting objectives, user requirement specification needs to be
as complete and comprehensive as possible, which necessi-
tates domain expertise and system development experience.
3) We found that the webpages returned by the web search
tools are often about general algorithms due to their popu-
larity and higher page rankings. Such popular algorithms,
however, may not perform the best in domain-specific tasks.
With minimum user intervention by providing specialized
algorithms, AutoIOT can synthesize programs accordingly
and achieve comparable performance to the baselines.
Fig. 10(b) shows the classification accuracy in two HAR
tasks. We observe that AutoIOT outperforms NN and 1D-
CNN but underperforms BiLSTM, Conv-LSTM, and LSTM-
RNN. The main reasons are twofold: 1) HAR tasks require
both signal processing and machine learning algorithms, in-
creasing the programming complexity to some extent; 2)
Training neural networks requires fine-tuning of a vast ar-
ray of hyper-parameters ( e.g., network architecture config-
urations, epoch number, learning rate, optimizer, and loss
function). This significantly amplifies the instability of the
generated code and calls for careful fine-tuning to achieve the
best performance in practice. As a result, AutoIOT surpasses
those baselines adopting simple model architectures (NN
and one-dimensional CNN) but falls short against baselines
using sophisticated architectures (BiLSTM and Conv-LSTM)
with highly optimized hyper-parameters. Although during
code improvement, some synthesized programs define a set
of configurations and adopt a searching strategy to obtain
optimal hyper-parameters, the performance still remains
slightly lower than some baselines. This is because the deter-
mination of the optimal configurations for machine learning
models is typically a trial-and-error process, requiring sub-
stantial human effort. Fortunately, we observe that if the user
provides a potential search space in advance, the LLM can
design a search algorithm to try different hyper-parameter
configurations and select the one with the best performance.
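The kind of search algorithm mentioned above can be as simple as an exhaustive grid search over the user-provided space. The sketch below is illustrative only: `train_and_evaluate` is a hypothetical callable that trains a model with one configuration and returns its validation accuracy, and the hyper-parameter names in the usage note are examples rather than values taken from the paper.

```python
import itertools
from typing import Any, Callable


def grid_search(
    search_space: dict[str, list[Any]],
    train_and_evaluate: Callable[[dict[str, Any]], float],
) -> tuple[dict[str, Any], float]:
    """Try every configuration in the search space and return the best one."""
    names = list(search_space)
    best_config, best_score = None, float("-inf")
    # Cartesian product of all candidate values, one configuration at a time.
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        score = train_and_evaluate(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

A call such as `grid_search({"learning_rate": [1e-2, 1e-3], "epochs": [10, 20]}, train_and_evaluate)` evaluates four configurations. The cost grows multiplicatively with every added hyper-parameter, which is one reason a user-supplied search space helps.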
For multimodal HAR, the input instruction (A1) includes
the basic information of the task, i.e., the task target, the
[Figure 11 plots omitted. Panels: (a) Single level intervention, x-axis: Different intervention level; (b) Combined intervention, x-axis: Different intervention combination. y-axes: Average Acc. (%) and Synthesis Time.]
Figure 11: Different levels of user intervention.
[Figure 12 plots omitted. Panels: (a) GPT-4; (b) GPT-3.5. x-axis: Epochs; y-axes: Average Acc. (%) and Time.]
Figure 12: Different numbers of iteration.
dataset specifications, and the output format. Based on
that, we create two additional variations: one with a GPU-
memory-constrained requirement (A2) and another with a
high accuracy requirement (A3). We then feed the instruc-
tions into AutoIOT and measure the accuracy and inference
time of the synthesized programs, with results shown in
Fig. 10(c). By analyzing the three synthesized different pro-
grams, we observe that: 1) All the synthesized programs
adopt a similar workflow as the baseline system, i.e., they first
construct three encoders to extract effective features from the
three modalities, then concatenate these features and feed
them into a classifier for activity recognition. This implies
that benefiting from our CoT-based problem-solving para-
digm, AutoIOT recognizes the workflow and architecture as
effective and standard for handling multimodal data-related
tasks, which is consistent with most of the existing meth-
ods [ 59]. 2) AutoIOT can adjust the generated code to fulfill
different requirements. The second program consumes less
memory than others due to the resource constraint require-
ment, resulting in lower accuracy but reduced inference time
(Fig. 10(d)). On the other hand, the third program adopts a
more complex and larger model architecture, requiring more
GPU memory and incurring a longer inference time. Such
differences validate the capabilities of AutoIOT in accurately
understanding and processing natural language-based user
requirements. These observations further demonstrate the
effectiveness of AutoIOT in ensuring the correctness of user
requirement understanding and the generated code, benefit-
ing from our automatic self-improvement component.
6.2.2 Communication cost & wall-clock time. We select
heartbeat detection as an example and measure the total
communication cost with wall-clock execution time of Au-
toIOT and direct LLM inference as done in Penetrative AI
[79]. Specifically, ECG data is first down-sampled and seg-
mented into multiple windows and then embedded into the
prompt for LLMs’ inference. Experiment results show that
AutoIOT requires 8 MB of network traffic, mainly for prompt
transmissions, while [79] consumes more than 50 MB, mainly
for sensor data transmissions. Besides, AutoIOT takes 25
minutes to complete the task, with a dramatic reduction in
inference time compared to [79], which needs to send all
windowed signals to a remote LLM server for processing.
6.3 Sensitivity Analysis
Different levels of user intervention. To show how au-
tomated program synthesis improves user experiences, we
[Figure 13 plots omitted. Panels: (a) Average accuracy & MAE; (b) Wall-clock time & network. x-axis: Different LLMs; y-axes: Average Acc. (%), MAE, Time (min), and Network.]
Figure 13: Different LLMs. (4 for GPT-4, 3.5 for GPT-3.5,
C. for Cohere, A. for Anthropic Claude 2, L. for Llama2-
7b, and G. for Gemma-7b.)
evaluate AutoIOT's performance under five different levels
of user intervention: 1) No intervention; 2) Intervention
with user-provided domain knowledge; 3) Intervention with
user-specified algorithms for program synthesis; 4) Inter-
vention with user-based debugging; 5) Intervention with
user-decided algorithm modification for code improvement.
Fig. 11(a) shows the performance of AutoIOT with different
levels of user intervention. When the user manually instructs
the LLM to generate code according to specific hand-picked
algorithms ( e.g., designed by experts or research papers), the
average accuracy can be improved. This user-in-the-loop
process becomes particularly advantageous when users have
a higher level of domain expertise in AIoT, enabling them
to design or select more advanced and robust algorithms.
But it leads to increased synthesis time as the LLM has to
revise outputs until the user is satisfied. Fig. 11(b) shows the
performance with different user intervention combinations.
We see that increased user involvement in the program syn-
thesis process correlates with higher accuracy. However, this
heightened engagement leads to significantly longer synthe-
sis time and extra user overhead. Note that AutoIOT may
not always be able to fix bugs and finish program synthesis
tasks. In this case, user intervention with minimum effort is
still a must. Thus, AutoIOT allows users to provide detailed
instructions necessary for program synthesis by the LLMs.
Different numbers of improvement iterations. We vary
the number of epochs for code optimization from 0 to 10
and evaluate the impact on the synthesized programs. As
shown in Fig. 12, with more improvement epochs, both the accuracy
and the synthesis time of the programs generated by GPT-3.5 and GPT-4
gradually increase. However, after around 5 epochs, the marginal
gain of average accuracy starts to diminish while the
synthesis time increases dramatically. This is because, with
longer conversation history, LLMs may fail to recall past con-
text information and tend to generate inconsistent responses
[75]. Therefore, we empirically set the epoch number to five.
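The diminishing-returns argument can be expressed as a simple rule of thumb. The paper fixes the number of iterations empirically, so the stopping rule below is only an illustration, and the accuracy values in the usage note are hypothetical numbers, not measurements from Fig. 12.

```python
def marginal_gains(accuracies: list[float]) -> list[float]:
    """Accuracy gain contributed by each successive improvement iteration."""
    return [later - earlier for earlier, later in zip(accuracies, accuracies[1:])]


def iterations_before_plateau(accuracies: list[float], min_gain: float) -> int:
    """Number of iterations run before the gain first drops below min_gain.

    accuracies[0] is the accuracy with no improvement iteration, and
    accuracies[i] is the accuracy after i iterations.
    """
    for i, gain in enumerate(marginal_gains(accuracies)):
        if gain < min_gain:
            return i
    return len(accuracies) - 1
```

With the hypothetical sequence `[0.80, 0.85, 0.89, 0.90, 0.902]` and `min_gain=0.005`, the function returns 3, because the fourth iteration adds only about 0.002.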
[Figure 14 plots omitted. Panels: (a) Single component ablation; (b) Combined component ablation; (c) User study (subjective). y-axis in (a) and (b): ESR (%).]
Figure 14: (a) & (b): Ablation study. B. for background knowledge retrieval, O. for algorithm outline generation, D.
for detailed design generation, and I. for code improvement. (c): User study on subjective metrics.
[Figure 15 panels. (a) Task accuracy: x-axis Trial (times), y-axis Task accuracy (%); recoverable legend entry: Expert. (b) Code correctness verification: x-axis Iteration (epochs), y-axis Number of occurrences; recoverable legend entries: Security, Smell.]
Figure 15: User study (Objective Evaluation).
Impact of Different LLMs. We select the following LLMs
for comparison: GPT-4 [16], GPT-3.5 [18], Llama2-7b [72],
Cohere [10], Claude 2 [11], and Gemma-7b [70]. Llama2-7b
and Gemma-7b are deployed locally in our lab. We use
R-peak detection as the example task, with the Christov
algorithm as the baseline. Fig. 13(a) shows that GPT-4
performs the best. Even when given the knowledge retrieved
by the same tools, LLMs still need strong language
understanding and reasoning capabilities to comprehend AIoT
tasks and synthesize programs. The results suggest that
GPT-4 has stronger language understanding and reasoning
capabilities for this specific task. Although Llama2-7b and
Gemma-7b achieve relatively lower accuracy, these two local
models respond much faster.
6.4 Ablation Study
In the ablation study, we use two metrics to evaluate the
code quality: 1) Execution success rate (ESR): the proportion
of synthesized programs that execute successfully on the
first attempt. 2) Average iteration round (AIR): the average number
of improvement iterations required to achieve 80% accuracy.
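The two metrics can be made concrete with a minimal sketch. The record layout (`SynthesisRun`) and the function names are hypothetical and do not come from the paper, and the convention of reporting an infinite AIR as soon as any run misses the target is an assumption modeled on how the unreachable case is reported below. The list of runs is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SynthesisRun:
    """One synthesized program (hypothetical record layout)."""
    first_run_ok: bool              # did the program execute on its first attempt?
    accuracy_per_iter: List[float]  # task accuracy after each improvement iteration


def execution_success_rate(runs: List[SynthesisRun]) -> float:
    """ESR: share of programs that execute successfully the first time."""
    return sum(r.first_run_ok for r in runs) / len(runs)


def iterations_to_target(run: SynthesisRun, target: float = 0.80) -> Optional[int]:
    """1-based index of the first iteration that reaches the target, else None."""
    for i, acc in enumerate(run.accuracy_per_iter, start=1):
        if acc >= target:
            return i
    return None


def average_iteration_round(runs: List[SynthesisRun], target: float = 0.80) -> float:
    """AIR: mean iterations needed to reach the target accuracy.

    Assumption: if any run never reaches the target, AIR is reported as infinite.
    """
    counts = [iterations_to_target(r, target) for r in runs]
    if any(c is None for c in counts):
        return float("inf")
    return sum(counts) / len(counts)
```

With 20 synthesized programs per configuration, `execution_success_rate` would be evaluated once over all 20 records, and `average_iteration_round` over the same records with `target=0.80`.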
Background knowledge retrieval. To explore the influ-
ence of the background knowledge retrieval module, we dis-
able the web search tool and the knowledge database. We
then instruct the LLM to synthesize 20 different programs
with no user intervention. Fig. 14(a) shows the ESR and AIR
of two AIoT tasks. We see that the ESR of heartbeat detec-
tion drops slightly while the ESR of mmWave-based HAR
exhibits a significant drop. This is because the mmWave-
based HAR uses a newly published dataset, which has not
been seen by the LLM. Therefore, the LLM does not know the
dimensionality of the dataset and only randomly configures
the hyper-parameters of the neural network. Additionally,
both applications require more iterations to improve the
accuracy of synthesized programs. We note that mmWave-based
HAR even fails to achieve the expected accuracy (thus, its
AIR is marked as infinite). A similar phenomenon is also
observed in Fig. 14(b). The experiment results
indicate that the background knowledge retrieval module
plays a pivotal role for the LLM to retrieve up-to-date domain
knowledge to augment the program synthesis process.
Chain-of-thought. We evaluate the contribution of the
algorithm outline generation step and the detailed design
generation step during the CoT prompting, respectively. As
shown in Fig. 14(a), when only one step is enabled, we ob-
serve a slight drop in ESR and an increase in AIR for both
applications. When we disable both steps, the ESR drops sig-
nificantly as shown in Fig. 14(b). Without the explicit guid-
ance specified in the two steps, the LLM cannot synthesize
executable programs and presents null functions with place-
holders. Therefore, the CoT method with detailed instruc-
tions emulating the software development lifecycle plays a
crucial role in helping LLMs synthesize executable programs.
Code improvement. The code improvement module con-
tains automated debugging and code optimization. Since
automated debugging is essential to ensure the executability
of synthesized programs, we only conduct an ablation study
on the code optimization step. With the program after the
debugging step, we evaluate the code improvement module
by directly instructing the LLM to modify the program over
several iterations without providing the compiler or inter-
preter feedback. As shown in Fig. 14(a), without the code
improvement module, the ESR is almost unaffected while the
AIR increases significantly. This is because generating an
executable program with no syntax errors is seldom influ-
enced by this module. However, without feedback from the
compiler or interpreter, more iterations are required since
the LLM does not know which step of the algorithm should
be modified or improved. Besides, due to the laziness of LLMs
[52], synthesized programs tend to adopt simple and popular
algorithms. This necessitates the code improvement mod-
ule, which progressively directs the LLM to explore more
advanced algorithms to improve the synthesized program.
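The role of interpreter feedback can be illustrated with a short sketch. It is not AutoIOT's implementation: the function names and the prompt wording are hypothetical, and the sketch only shows one plausible way to run a candidate program in a fresh Python interpreter and collect the output or traceback that a later improvement prompt could include.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Tuple


def run_and_collect_feedback(source: str, timeout_s: float = 30.0) -> Tuple[bool, str]:
    """Execute `source` in a fresh interpreter; return (succeeded, feedback text)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"Timed out after {timeout_s} s"
    ok = proc.returncode == 0
    # On success keep the program output; on failure keep the traceback.
    feedback = proc.stdout if ok else proc.stderr
    return ok, feedback.strip()


def build_improvement_prompt(source: str, feedback: str) -> str:
    """Assemble a follow-up prompt (hypothetical wording) that carries the feedback."""
    return (
        "The program below was executed locally.\n"
        f"--- program ---\n{source}\n"
        f"--- interpreter feedback ---\n{feedback}\n"
        "Revise the step of the algorithm that the feedback points to."
    )
```

Dropping the `feedback` argument from the prompt corresponds to the ablated setting described above, in which the LLM is asked to modify the program without knowing which step needs attention.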
6.5 User Study
To investigate the utilities of AutoIOT , we conduct a user
study (N=20) by inviting 5 expert and 15 non-expert users,
whose detailed background information is listed in Table 1.
The expert users are PhD students and professors with work
Page 12:
MobiCom ’25, Nov 4–8, 2025, Hong Kong, China Leming Shen, Qiang Yang, Yuanqing Zheng, Mo Li
Table 1: Participant information of user study (N=20)
Category    Background Information
Gender      Female (45%), Male (40%), Prefer not to say (15%)
Age         Under 18 (10%), 18-30 (75%), 30-39 (10%), 40 and older (5%)
Education   Bachelor (20%), Master (15%), Doctoral (60%), Others (5%)
English     Beginner (5%), Intermediate (25%), Advanced (60%), Fluent (10%)
Expertise   Expert (25%), Non-expert (75%)
or research experience in the IoT field and have developed
many IoT applications. We select human activity recognition
using RFID data (XRF55 dataset [73]) as the IoT application,
where a 1D Conv-based ResNet18 is the baseline.
Objective Evaluation. We first repeatedly measure the aver-
age task accuracy (classification accuracy) after executing the
synthesized programs, with the results shown in Fig. 15(a).
We see that the programs synthesized by the two groups of
users outperform the baseline across multiple trials. Besides,
the programs synthesized for experts typically perform better
than those for non-experts. The main reason is that experts
tend to provide more information in the specifications ( e.g.,
the dataset format, the training workflow). Consequently,
AutoIOT can synthesize programs with more advanced
algorithms and detailed specifications for expert users. Next, we
use SonarQube [13] to verify the correctness of the generated
code, covering bugs and logic errors, security issues, and code
smells [14], after every improvement iteration. Code smells
are not bugs but bad coding styles (e.g., variable names that
do not match the expected regular expression) or potential
weaknesses (e.g., package version incompatibility). From
Fig. 15(b), we observe that several bugs and one security
issue present initially are ultimately fixed by AutoIOT.
AutoIOT may not be able to address all code smells, as they
are closely related to coding styles [50]. Applying code smell
correction to the retrieved algorithms and programs is a
promising way for AutoIOT to iteratively detect and fix code
smell-related issues.
Subjective Measurement. We ask the users to execute the
synthesized programs and rate AutoIOT based on four
subjective metrics: 1) System Utility (SU) measures the user’s
overall satisfaction with AutoIOT’s performance; 2)
Requirement Coverage (RC) evaluates how well the user
requirements are fulfilled by AutoIOT; 3) Code & Documentation
Readability (CDR) measures the clarity and structure of the
code and documentation; 4) Generation Efficiency (GE)
assesses how acceptable the waiting time is for synthesizing
the final program. All the above metrics are rated by the
users on a scale from 1 (not at all) to 6 (more than expected).
As shown in Fig. 14(c), the average GE of both user groups
reaches 4.5, indicating that the waiting time for AutoIOT to
synthesize the final program is acceptable for most users. Ad-
ditionally, we find that non-experts tend to give higher scores
(SU, RC, and CDR) than experts. Further examination reveals
that non-experts tend to under-specify their requirements.
Surprisingly, LLMs can sometimes provide comprehensive
responses to meet their requirements. This ability is rooted
in LLMs’ extensive training on diverse datasets and retrieved
online information, enabling them to infer and bridge the
gaps with relevant information [32, 69].
[Figure 16 panels. (a) Runtime efficiency optimization: the user problem is extended with "Make the code more efficient"; snippet shown for AutoIOT: , device="cuda"). (b) Tailored for target platform: the user problem is extended with "The target platform is Jetson Nano"; snippets shown for AutoIOT: model.eval() and model.half() with model.eval().]
Figure 16: Further experiments.
Runtime Efficiency. AutoIOT focuses on generating func-
tionally correct code for IoT data processing. While the run-
time efficiency of the programs is not optimized, the users
can specify the extra requirements in the prompts. Fig. 16(a)
shows an example that given the user specification, the gen-
erated code adopts CUDA optimization and thereby achieves
similar runtime efficiency compared to expert-optimized pro-
grams. This is because AutoIOT can retrieve and learn from
hand-crafted optimizations via various online sources.
Workflow Generalizability. Existing IoT applications can
be primarily categorized into four types based on the specific
stages of the IoT workflow they address: 1) data collection
applications rely on sensors ( e.g., IMU and radar) to gather
raw information from the environment; 2) data transmis-
sion applications enable seamless communication across IoT
networks to cloud or edge systems; 3) data processing appli-
cations typically adopt advanced algorithms or AI models
to analyze and process IoT data; 4) decision-making and
actuation applications parse the processed data to perform
task automation or actuator control. In this paper, AutoIOT
is designed primarily for IoT data processing tasks by of-
fering end-to-end solutions with executable programs. The
prompts designed in AutoIOT are not limited to specific data
patterns (time-series or high-dimensional) or processing
pipelines (sequential or parallel). For other IoT application
workflows, users can specify the requirements in the
prompt. For example, users can first instruct AutoIOT to
develop a WiFi data collection application. Then, the synthe-
sized program can be deployed on WiFi-related devices to
capture WiFi data. Next, users can further request AutoIOT
to process and analyze the collected WiFi data to perform
HAR or other tasks. Moreover, some IoT applications may re-
quire executing multiple programs simultaneously for cross-
device communication or synchronization. To tackle such
applications, a possible solution can be deploying multiple
AutoIOT agents with advanced collaboration mechanisms
for enhanced cross-device and cross-program interaction.
We leave the comprehensive generalization to other types of
IoT tasks beyond pipelined data processing for future work.
Real-World Deployment. We show that AutoIOT can syn-
thesize functionally correct IoT programs in Python with
data processing accuracy as the main performance metric.
AutoIOT is not limited to specific target IoT platforms. For
Page 13:
example, for resource-constrained MCU-class
devices [63, 74], various practical requirements (e.g., power
management and network protocols) of the IoT device can
be factored into AutoIOT’s prompts for program synthesis.
Fig. 16(b) shows an example where a user explicitly specifies
Jetson Nano, which has limited GPU resources, as the target
platform. AutoIOT will then adopt half-precision training
and inference for the AI model to save GPU memory. More-
over, users can even provide AutoIOT with handbooks of
some dedicated IoT devices for reference.
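The memory argument behind half precision can be checked with simple arithmetic. The sketch below is illustrative only: the parameter count is a hypothetical value rather than a figure from the paper, and the estimate covers stored weights alone, not activations, gradients, or optimizer state.

```python
# Bytes used to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Memory needed to store the model weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


num_params = 11_000_000  # hypothetical model size, chosen only for illustration
fp32 = weight_memory_mib(num_params, "float32")
fp16 = weight_memory_mib(num_params, "float16")
print(f"float32: {fp32:.1f} MiB, float16: {fp16:.1f} MiB")
```

Storing each weight in two bytes instead of four halves the weight footprint, which is the saving that matters on a device with limited GPU memory.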
Memorization Issue. In AutoIOT, sometimes the synthesized
programs’ performance cannot be improved, even after
multiple improvement iterations. This is because LLMs may
forget the previous context, resulting in inconsistent code
improvement suggestions during iterations [22]. In such cases,
our user-in-the-loop optimization allows users to optionally
provide instructions that can help the LLM improve
the synthesized programs. For instance, users can explicitly
instruct the LLM to use the Pan-Tompkins algorithm for
ECG data processing. Additionally, we can further upgrade
AutoIOT by adopting existing memorization enhancement
methods, such as context compression via RAG [67] and
iterative summarization [68]. By integrating these orthogonal
approaches with AutoIOT, each self-improvement iteration
can contribute more positively to the performance of the
generated code with enhanced context consistency.
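As a rough illustration of iterative summarization, the sketch below folds older conversation turns into a running summary and keeps only the most recent turns verbatim. It is not the method of [68] or part of AutoIOT: `compress_history` and `first_sentence_summarizer` are hypothetical names, and the toy summarizer merely stands in for what would be an LLM call in practice.

```python
from typing import Callable, List


def compress_history(
    turns: List[str],
    summarize: Callable[[str, str], str],
    keep_recent: int = 2,
) -> List[str]:
    """Fold all but the last `keep_recent` turns into one running summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(turns) <= keep_recent:
        return list(turns)
    cut = len(turns) - keep_recent
    older, recent = turns[:cut], turns[cut:]
    summary = ""
    for turn in older:
        # Iterative step: combine the summary so far with the next older turn.
        summary = summarize(summary, turn)
    return [f"Summary of earlier iterations: {summary}"] + recent


def first_sentence_summarizer(summary: str, turn: str) -> str:
    """Toy stand-in for an LLM summarizer: keep each turn's first sentence."""
    head = turn.split(".")[0].strip()
    return f"{summary} {head}.".strip()
```

The compressed list is shorter than the full history, so each improvement prompt carries a bounded amount of earlier context while still recording which approaches were already tried.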
Privacy Concerns. In AutoIOT, only user requirements are
transmitted to the cloud LLM for processing. We proactively
instruct AutoIOT to treat user configuration data (e.g., the
local file path) as program input. As such, during local
execution, users can directly input the private configurations
in the console (§4.3), rather than pre-defining them in the
prompts for cloud LLMs. To further mitigate privacy
concerns, one possible solution can be deploying a local LLM to
handle all code generation and debugging tasks [64, 66].
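Treating private configuration as program input might look like the following sketch in a synthesized program. The flag name `--data-path` and the function name are hypothetical; the point is only that the path is supplied at local run time, through a command-line argument or a console prompt, and therefore never has to appear in a prompt sent to a cloud LLM.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_local_config(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Read private settings at run time instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Run a synthesized program locally.")
    parser.add_argument(
        "--data-path",
        type=Path,
        default=None,
        help="Local dataset location (hypothetical flag name).",
    )
    args = parser.parse_args(argv)
    if args.data_path is None:
        # Fall back to asking in the console, as during interactive local execution.
        args.data_path = Path(input("Path to the local dataset: ").strip())
    return args
```

A program would call `parse_local_config()` at start-up and pass `args.data_path` to its data loader, so the generated source contains no user-specific paths.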
Integration of LLMs with AIoT. Existing integration
methods fall into two main types: 1) Prompt-based methods embed
raw sensor data into tailored prompts and instruct LLMs to
perform various AIoT tasks. HARGPT [43] and LLMSense
[58] embed textualized sensor data into prompts to show the
proficiency of LLMs in comprehending IoT sensor data. They
require transmitting raw sensor data to LLM servers and thus
suffer from issues similar to those of Penetrative AI. 2)
Fine-tuning-based methods retrain LLMs with labeled datasets
containing sensor data inference examples. LLM4TS [21]
fine-tunes LLMs using sensor data with labels for time-series
data prediction. However, these works demand high compute
and memory resources. In contrast, AutoIOT explores a new
approach to automatically synthesizing programs for AIoT applications
without extra overheads.
Table 2: AutoIOT vs SOTA code LLMs
Type Name Mod. Gen. RAG Auto. Auto Debug & Impro.
Coder [34]✓ ✗ ✗ ✗ ✗
CodeLlama [62] ✓ ✗ ✗ ✗ ✗
WizardCoder [54] ✓ ✗ ✗ ✗ ✗
Auto.AutoGPT [12] ✓ ✓ ✗ ✓ ✗
MetaGPT [37] ✓ ✓ ✗ ✓ ✗
AutoIOT ✓ ✓ ✓ ✓ ✓
Code LLMs. Recent advances in code LLMs [34, 62] have
demonstrated the potential to revolutionize software
development. They can produce functionally correct code across
various programming languages and input compiler output
into LLMs for debugging and program refining [29, 88].
However, they require high computing resources with carefully
selected datasets (StarCoder [49] needs an 815GB dataset
and 512 A100 80GB GPUs). Worse still, they are unaware of
the latest advances in highly specialized domains and can-
not generate comprehensive solutions for complex IoT tasks.
AutoIOT draws strength from these works and addresses spe-
cific technical challenges in IoT program synthesis, which re-
quire ever-evolving domain knowledge that the above LLMs
have not yet assimilated. Table 2 compares AutoIOT with
SOTA code LLMs. LLM-based code generation synthesizes
functionally correct programs for a well-defined module,
whereas LLM-based task automation generates a complete
automation chain from requirements to solutions. AutoIOT
lies at the intersection of these two approaches, with the ob-
jective of code generation and task automation. Additionally,
RAG and automated code improvement jointly improve the
code generation process in AutoIOT .
We propose AutoIOT , an LLM-driven automated natural lan-
guage programming system for AIoT applications. Our sys-
tem features three novel technical modules: background
knowledge retrieval, automated program synthesis, and code
improvement, transforming natural language descriptions
into executable programs. Our experiments demonstrate the
competitive performance of AutoIOT in synthesizing pro-
grams for a variety of AIoT applications, with comparable
performance in challenging AIoT tasks and sometimes out-
performing some representative baselines. This showcases
the strong potential of exploiting the embedded common
knowledge of LLMs to evolve AIoT application development.
We sincerely thank our shepherd – Mi Zhang, and anony-
mous reviewers for their constructive comments and invalu-
able suggestions that helped improve this paper. This work is
supported by Hong Kong GRF Grant No. 15211924, 15206123,
and 16204224. Yuanqing Zheng and Mo Li are the correspond-
ing authors.
Page 14:
[1] 2001.
[2] 2009.
[3] 2018.
[4] 2018.
[6] 2020.
[7] 2022.
[9] 2023.
[10] 2023.
[11] 2023.
[12] 2023.
[13] 2024.
[14] 2024.
[15] 2025.
[16] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge
Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt,
Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report.
arXiv preprint arXiv:2303.08774 (2023).
[17] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk
Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry,
Quoc Le, et al. 2021. Program synthesis with large language models.
arXiv preprint arXiv:2108.07732 (2021).
[18] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D
Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish
Sastry, Amanda Askell, et al. 2020. Language models are few-shot
learners. NeurIPS 33 (2020), 1877–1901.
[19] Dongqi Cai, Yaozong Wu, Shangguang Wang, Felix Xiaozhu Lin, and
Mengwei Xu. 2023. Efficient federated learning for modern nlp. In
ACM MobiCom . 1–16.
[20] Jiani Cao, Chengdong Lin, Yang Liu, and Zhenjiang Li. 2022. Gaze
tracking on any surface with your phone. In ACM SenSys .
[21] Ching Chang, Wei-Yao Wang, Wen-Chih Peng, and Tien-Fu Chen.
2023. Llm4ts: Aligning pre-trained llms as data-efficient time-series
forecasters. arXiv preprint arXiv:2308.08469 (2023).
[22] Juo-Tung Chen and Chien-Ming Huang. 2023. Forgetful large language
models: Lessons learned from using LLMS in robot programming. In
Proceedings of the AAAI Symposium Series .
[23] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou.
2023. Teaching large language models to self-debug. arXiv preprint
arXiv:2304.05128 (2023).
[24] Ivaylo I Christov. 2004. Real time electrocardiogram QRS detection
using combined adaptive threshold. Biomedical engineering online 3, 1
(2004), 1–9.
[25] Kaiyan Cui, Leming Shen, Yuanqing Zheng, Fu Xiao, and Jinsong Han.
2024. Talk2Radar: Talking to mmWave Radars via Smartphone Speaker.
[26] Kaiyan Cui, Qiang Yang, Yuanqing Zheng, and Jinsong Han. 2023.
mmRipple: Communicating with mmWave radars through smartphone
vibration. In ACM IPSN .
[27] Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson,
Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hos-
seini, and Hervé Jégou. 2024. The faiss library. arXiv preprint
arXiv:2401.08281 (2024).
[28] Mingzhe Du, Anh Tuan Luu, Bin Ji, and See-Kiong Ng. 2024. Mercury:
An Efficiency Benchmark for LLM Code Synthesis. arXiv preprint
arXiv:2402.07844 (2024).
[29] Shukai Duan, Nikos Kanakaris, Xiongye Xiao, Heng Ping, Chenyu
Zhou, Nesreen K Ahmed, Guixiang Ma, Mihai Capota, Theodore L
Willke, Shahin Nazarian, et al. 2023. Leveraging Reinforcement
Learning and Large Language Models for Code Optimization. arXiv preprint
arXiv:2312.05657 (2023).
[30] Alexander Dunn, John Dagdelen, Nicholas Walker, Sanghoon Lee,
Andrew S Rosen, Gerbrand Ceder, Kristin Persson, and Anubhav Jain.
2022. Structured information extraction from complex scientific text
with fine-tuned large language models. arXiv preprint arXiv:2212.05238
[31] Willem AH Engelse and Cees Zeelenberg. 1979. A single scan algorithm
for QRS-detection and feature extraction. Computers in cardiology 6,
1979 (1979), 37–42.
[32] Luciano Floridi and Massimo Chiriatti. 2020. GPT-3: Its nature, scope,
limits, and consequences. Minds and Machines 30 (2020), 681–694.
[33] Ming Gao, Lingfeng Zhang, Leming Shen, Xiang Zou, Jinsong Han,
Feng Lin, and Kui Ren. 2023. Exploring practical acoustic transduction
attacks on inertial sensors in MDOF systems. IEEE TMC (2023).
[34] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao
Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al. 2024.
DeepSeek-Coder: When the Large Language Model Meets Programming–The
Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
[35] Pat Hamilton. 2002. Open source ECG analysis. In Computers in cardi-
ology . IEEE, 101–104.
[36] Yuze He, Chen Bian, Jingfei Xia, Shuyao Shi, Zhenyu Yan, Qun Song,
and Guoliang Xing. 2023. VI-Map: Infrastructure-Assisted Real-Time
HD Mapping for Autonomous Driving. In ACM MobiCom . 1–15.
[37] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang,
Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou,
et al. 2023. MetaGPT: Meta programming for multi-agent collaborative
framework. arXiv preprint arXiv:2308.00352 (2023).
[38] Ningning Hou, Yifeng Wang, Xianjin Xia, Shiming Yu, Yuanqing Zheng,
and Tao Gu. 2025. MoLoRa: Intelligent Mobile Antenna System for
Enhanced LoRa Reception in Urban Environments. In ACM SenSys .
[39] Ningning Hou, Xianjin Xia, Yifeng Wang, and Yuanqing Zheng. 2024.
One shot for all: Quick and accurate data aggregation for LPWANs.
ACM ToSN 32, 3 (2024), 2285–2298.
[40] Ningning Hou, Xianjin Xia, and Yuanqing Zheng. 2023. Jamming of
LoRa PHY and countermeasure. ACM ToSN 19, 4 (2023), 1–27.
[41] Kai Huang and Wei Gao. 2022. Real-time neural network inference on
extremely weak devices: agile offloading with explainable AI. In ACM
MobiCom . 200–213.
[42] Ritu Jain and Ugrasen Suman. 2015. A systematic literature review
on global software development life cycle. ACM SIGSOFT Software
Engineering Notes 40, 2 (2015), 1–14.
[43] Sijie Ji, Xinzhe Zheng, and Chenshu Wu. 2024. HARGPT: Are
LLMs Zero-Shot Human Activity Recognizers? arXiv preprint
arXiv:2403.02727 (2024).
[44] Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A Code
Evolution Framework via Large Language Models. arXiv preprint
arXiv:2306.02907 (2023).
[45] Vignesh Kalidas and Lakshman Tamil. 2017. Real-time QRS detector
using stationary wavelet transform for automated ECG analysis. In
IEEE BIBE . 457–461.
[46] Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. 2011. Ac-
tivity recognition using cell phone accelerometers. ACM SigKDD
Explorations Newsletter 12, 2 (2011), 74–82.
[47] Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and
Shafiq Joty. 2023. CodeChain: Towards Modular Code Generation
Through Chain of Self-revisions with Representative Sub-modules.
arXiv:2310.08992 [cs.AI]
[48] Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun
Choi, Steve Ko, Sangeun Oh, and Insik Shin. 2024. Mobilegpt: Aug-
menting llm with human-like app memory for mobile task automation.
Page 15:
In ACM MobiCom.
[49] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis
Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li,
Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv
preprint arXiv:2305.06161 (2023).
[50] Hui Liu, Jiahao Jin, Zhifeng Xu, Yanzhen Zou, Yifan Bu, and Lu Zhang.
2019. Deep learning based code smell detection. IEEE TSE (2019).
[51] Jianwei Liu, Wenfan Song, Leming Shen, Jinsong Han, Xian Xu, and
Kui Ren. 2021. Mandipass: Secure and usable user authentication via
earphone imu. In IEEE ICDCS .
[52] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele
Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle:
How language models use long contexts. TACL 12 (2024), 157–173.
[53] Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sun-
shine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, and
Shwetak Patel. 2023. Large Language Models are Few-Shot Health
Learners. arXiv preprint arXiv:2305.15525 (2023).
[54] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang
Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023.
WizardCoder: Empowering Code Large Language Models with Evol-
Instruct. arXiv preprint arXiv:2306.08568 (2023).
[55] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu
Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye,
Yiming Yang, et al. 2024. Self-refine: Iterative refinement with
self-feedback. NeurIPS 36 (2024).
[56] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael
Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim,
Chris Hallacy, et al. 2022. Text and code embeddings by contrastive
pre-training. arXiv preprint arXiv:2201.10005 (2022).
[57] Xiaomin Ouyang, Xian Shuai, Yang Li, Li Pan, Xifan Zhang, Heming
Fu, Sitong Cheng, Xinyan Wang, Shihua Cao, Jiang Xin, et al. 2024.
ADMarker: A Multi-Modal Federated Learning System for Monitoring
Digital Biomarkers of Alzheimer’s Disease. In ACM MobiCom.
[58] Xiaomin Ouyang and Mani Srivastava. 2024. LLMSense: Harnessing
LLMs for High-level Reasoning Over Spatiotemporal Sensor Traces.
arXiv preprint arXiv:2403.19857 (2024).
[59] Xiaomin Ouyang, Zhiyuan Xie, Heming Fu, Sitong Cheng, Li Pan,
Neiwen Ling, Guoliang Xing, Jiayu Zhou, and Jianwei Huang. 2023.
Harmony: Heterogeneous Multi-Modal Federated Learning through
Disentangled Model Training. In ACM MobiSys . 530–543.
[60] Jiapu Pan and Willis J Tompkins. 1985. A real-time QRS detection
algorithm. IEEE transactions on biomedical engineering (1985).
[61] Pengrui Quan, Xiaomin Ouyang, Jeya Vikranth Jeyakumar, Ziqi Wang,
Yang Xing, and Mani Srivastava. 2025. SensorBench: Benchmarking
LLMs in Coding-Based Sensor Processing. In ACM HOTMOBILE .
[62] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat,
Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin,
et al. 2023. Code Llama: Open foundation models for code. arXiv
preprint arXiv:2308.12950 (2023).
[63] Leming Shen, Qiang Yang, Kaiyan Cui, Yuanqing Zheng, Xiao-Yong
Wei, Jianwei Liu, and Jinsong Han. 2024. Fedconv: A learning-on-model
paradigm for heterogeneous federated clients. In ACM MobiSys .
[64] Leming Shen, Qiang Yang, Xinyu Huang, Zijing Ma, and Yuanqing
Zheng. 2025. GPIoT: Tailoring Small Language Models for IoT Program
Synthesis and Development. In ACM SenSys .
[65] Leming Shen and Yuanqing Zheng. 2023. FedDM: data and model
heterogeneity-aware federated learning via dynamic weight sharing.
[66] Leming Shen and Yuanqing Zheng. 2024. IoTCoder: A Copilot for IoT
Application Development. In ACM MobiCom .
[67] Kaize Shi, Xueyao Sun, Qing Li, and Guandong Xu. 2024. Compressing
Long Context for Enhancing RAG with AMR-based Concept
Distillation. arXiv preprint arXiv:2405.03085 (2024).
[68] Shichao Sun, Ruifeng Yuan, Ziqiang Cao, Wenjie Li, and Pengfei Liu.
2024. Prompt Chaining or Stepwise Prompt? Refinement in Text
Summarization. arXiv preprint arXiv:2406.00507 (2024).
[69] Alex Tamkin, Miles Brundage, Jack Clark, and Deep Ganguli. 2021.
Understanding the capabilities, limitations, and societal impact of large
language models. arXiv preprint arXiv:2102.02503 (2021).
[70] Gemma Team et al. 2024. Gemma: Open models based on Gemini
research and technology. arXiv preprint arXiv:2403.08295 (2024).
[71] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu,
Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M
Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable
multimodal models. arXiv preprint arXiv:2312.11805 (2023).
[72] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad
Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal
Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and
fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
[73] Fei Wang, Yizhe Lv, Mengdie Zhu, Han Ding, and Jinsong Han. 2024.
XRF55: A Radio Frequency Dataset for Human Indoor Action Analysis.
ACM IMWUT (2024).
[74] Kun Wang, Zimu Zhou, and Zhenjiang Li. 2024. LATTE: Layer
Algorithm-aware Training Time Estimation for Heterogeneous Feder-
ated Learning. In ACM MobiCom .
[75] Qingyue Wang, Liang Ding, Yanan Cao, Zhiliang Tian, Shi Wang,
Dacheng Tao, and Li Guo. 2023. Recursively summarizing enables
long-term dialogue memory in large language models. arXiv preprint
arXiv:2308.15022 (2023).
[76] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia,
Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought
prompting elicits reasoning in large language models. NeurIPS (2022).
[77] Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby
Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu.
2024. Autodroid: Llm-powered task automation in android. In ACM
MobiCom .
[78] Lu Wu, Xiaoyun Xie, and Yinglong Wang. 2021. ECG enhancement and
r-peak detection based on window variability. In Healthcare . MDPI.
[79] Huatao Xu, Liying Han, Qirui Yang, Mo Li, and Mani Srivastava. 2024.
Penetrative ai: Making llms comprehend the physical world. In ACL.
[80] Huatao Xu, Pengfei Zhou, Rui Tan, Mo Li, and Guobin Shen. 2021.
Limu-bert: Unleashing the potential of unlabeled data for imu sensing
applications. In ACM SenSys .
[81] Hongfei Xue, Qiming Cao, Yan Ju, Haochen Hu, Haoyu Wang, Aidong
Zhang, and Lu Su. 2022. M4esh: mmwave-based 3d human mesh
construction for multiple subjects. In ACM SenSys . 391–406.
[82] Qiang Yang, Kaiyan Cui, and Yuanqing Zheng. 2023. VoShield: Voice
liveness detection with sound field dynamics. In IEEE INFOCOM .
[83] Qiang Yang and Yuanqing Zheng. 2022. Deepear: Sound localization
with binaural microphones. IEEE TMC 23, 1 (2022), 359–375.
[84] Qiang Yang and Yuanqing Zheng. 2023. Aquahelper: Underwater sos
transmission and detection in swimming pools. In ACM SenSys .
[85] Yuqing Yang, Lei Jiao, and Yuedong Xu. 2024. A queueing theoretic
perspective on low-latency LLM inference with variable token length.
In IEEE WiOpt.
[86] Shiming Yu, Xianjin Xia, Ningning Hou, Yuanqing Zheng, and Tao Gu.
2024. Revolutionizing lora gateway with xgate: Scalable concurrent
transmission across massive logical channels. In ACM MobiCom .
[87] Shiming Yu, Xianjin Xia, Ziyue Zhang, Ningning Hou, and Yuanqing
Zheng. 2024. FDLoRa: Tackling Downlink-Uplink Asymmetry with
Full-duplex LoRa Gateways. In ACM SenSys .
[88] Li Zhong, Zilong Wang, and Jingbo Shang. 2024. LDB: A Large Lan-
guage Model Debugger via Verifying Runtime Execution Step-by-step.
arXiv preprint arXiv:2402.16906 (2024).