Authors: Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua
Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
Shengqiong Wu 1,2,*, Weicai Ye 1,✉, Jiahao Wang 1, Quande Liu 1, Xintao Wang 1, Pengfei Wan 1, Di Zhang 1, Kun Gai 1, Shuicheng Yan 2, Hao Fei 2,✉, Tat-Seng Chua 2
1 Kuaishou Technology, 2 National University of Singapore
https://sqwu.top/Any2Cap/
[Figure 1: overview diagram. Stage-1, condition interpretation (updated): Any2Caption takes any/diverse input conditions (text, camera pose, multiple identities, first frame, depth sequence, human pose sequence, sketch, segmentation, normal, style, key video frames, ...) and interprets them into disentangled, structured dense captions (dense, main object, background, camera, style, and action captions). Stage-2, video generation (frozen): off-the-shelf SoTA video generators (CTRL-Adapter, VideoComposer, CameraCtrl, ControlVideo, ConceptMaster, MotionCtrl, HunYuan, CogVideoX) produce the output videos, e.g., (Text + Depth + Camera) → Video, (Text + First Frame) → Video, (Text + Multiple Identities) → Video, (Text + Human Pose) → Video, (Text + Style) → Video, (Text + Key Frames) → Video.]
Figure 1. We propose Any2Caption, an efficient and versatile framework for interpreting diverse conditions to structured captions, which then can be fed into any video generator to generate highly controllable videos.
* Work done during internship at Kuaishou Technology.
✉ Corresponding author.
Abstract
To address the bottleneck of accurate user intent interpretation within the current video generation community, we present Any2Caption, a novel framework for controllable video generation under any condition. The key idea is to decouple various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that provide backbone video generators with better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models.
arXiv:2503.24379v1 [cs.CV] 31 Mar 2025
1. Introduction
Video serves as a fundamental medium for capturing real-world dynamics, making diverse and controllable video generation a key capability for modern artificial intelligence (AI) systems. Recently, video generation has gained significant attention, driven by advancements in Diffusion Transformers (DiT) [2, 29, 44, 52, 76], which have demonstrated the ability to generate realistic, long-duration videos from text prompts. These advancements have even led to industrial applications, such as filmmaking. However, we observe that a major bottleneck in the current video generation community lies in accurately interpreting user intention, so as to produce high-quality, controllable videos.
In text-to-video (T2V) generation, studies [24, 30, 67]
have suggested that detailed prompts, specifying objects,
actions, attributes, poses, camera movements, and style,
significantly enhance both controllability and video qual-
ity. Thus, a series of works have explored video recaption
techniques (e.g., ShareGPT4Video [10], MiraData [30],
and InstanceCap [15]) to build dense structured captions
for optimizing generative models. While dense captions
are used during training, in real-world inference scenarios,
users most likely provide concise or straightforward input
prompts [15]. Such a gap inevitably weakens instruction following and leads to suboptimal generation due to an incomplete understanding of user intent. To combat this, there are two possible solutions: manual prompt refinement, or automatic prompt enrichment [15, 67] using large language models (LLMs). Yet these approaches either require substantial human effort or risk introducing noise from incorrect prompt interpretations. As a result, this limitation in precisely interpreting user intent hinders the adoption of controllable video generation for demanding applications such as anime creation and filmmaking.
In addition, to achieve more fine-grained controllable video generation, one effective strategy is to provide additional visual conditions besides text prompts, such as reference images [17, 62], identity [22, 46, 69], style [42, 68], human pose [33, 45], or camera [21, 75], or even combinations of multiple conditions [41, 58, 74]. This multimodal conditioning approach aligns well with real-world scenarios, as users often prefer interactive ways to articulate their creative intent. Several studies have examined
video generation under various conditions, such as Video-
Composer [58], Ctrl-Adapter [41], and ControlVideo [74].
Unfortunately, these methods tend to rely on the internal en-
coders of Diffusion/DiT to parse rich heterogeneous input
conditions with intricate requirements (e.g., multiple object
IDs, and complex camera movements). Before generation,
the model must accurately interpret the semantics of varied
visual conditions in tandem with textual prompts. Yet even
state-of-the-art (SoTA) DiT backbones have limited capac-
ity for reasoning across different input modalities, resulting
in suboptimal generation quality.
This work is dedicated to addressing these bottlenecks of any-conditioned video generation. Our core idea is to decouple the first job of interpreting various conditions from the second job of video generation, motivated by two important observations:
a) SoTA video generation models (e.g., DiT) already excel at producing high-quality videos when presented with sufficiently rich text captions;
b) Current MLLMs have demonstrated robust vision-language comprehension.
Based on these, we propose Any2Caption, an MLLM-based universal condition interpreter designed not only to handle text, image, and video inputs but also equipped with specialized modules for motion and camera pose inputs. As illustrated in Fig. 1, Any2Caption takes any condition (or combination of conditions) as input and produces a densely structured caption, which is then passed on to any backbone video generator for controllable, high-quality video production. Because Any2Caption disentangles the complex interpretation of multimodal inputs from the backbone generator, it integrates seamlessly into a wide range of well-trained video generators without the extra cost of fine-tuning.
To facilitate any-to-caption instruction tuning for Any2Caption, we construct Any2CapIns, a large-scale dataset that converts a concise user prompt and diverse non-text conditions into detailed, structured captions. Concretely, the dataset encompasses four main categories of conditions: depth maps, multiple identities, human poses, and camera poses. Through extensive manual labeling combined with automated annotation by GPT-4V [1], followed by rigorous human verification, we curate a total of 337K high-quality instances with 407K condition annotations; the short prompts and structured captions average 55 and 231 words, respectively. In addition, we devise a comprehensive evaluation strategy to thoroughly measure the model’s capacity for interpreting user intent under these various conditions.
Experimentally, we first validate Any2Caption on our Any2CapIns, where results demonstrate that it achieves a captioning quality that faithfully reflects the original input conditions. We then integrate Any2Caption with multiple SoTA video generators, finding that (a) the long-form, semantically rich prompts produced by Any2Caption are pivotal for generating high-quality videos under arbitrary conditions, and (b) Any2Caption consistently enhances performance across different backbone models, yielding noticeably improved outputs. Furthermore, Any2Caption shows a pronounced advantage when handling multiple combined conditions, effectively interpreting and synthesizing intricate user constraints into captions that align closely with user
expectations. Our contributions are threefold:
• We pioneer, for the first time, a novel any-condition-to-caption paradigm of video generation, which bridges the gap between user-provided multimodal conditions and structured video generation instructions, leading to highly controllable video generation.
• We propose Any2Caption to effectively integrate and comprehend diverse multimodal conditions, producing semantically enriched, long-form, structured captions, which consistently improve both condition flexibility and video quality. Any2Caption can also be widely integrated as a plug-in module for any existing video generation framework.
• We introduce Any2CapIns, a large-scale, high-quality benchmark dataset for the any-condition-to-caption task, and also establish a suite of evaluation metrics to rigorously assess the quality and fidelity of condition-based caption generation.
[Figure 2: diagram of the dataset pipeline. Step-1, data collection: tools such as Depth Anything and SAM derive conditions from the source videos. Step-2, structured video caption generation: a structured video caption system with a manual checker produces six captions per target video (1. dense, 2. main object, 3. background, 4. camera, 5. style, 6. action). Step-3, user-centric short prompt generation: GPT-4V, under constraints and followed by a manual checker, turns the target video, conditions, and structured caption into a short prompt, e.g., "A stylish woman in a dark jacket walks through a bustling urban area. Capture the tall buildings and busy street behind her." The output is Any2CapIns (337K).]
Figure 2. The pipeline for constructing the Any2CapIns dataset involves three key steps: 1) data collection, 2) structured video caption generation, and 3) user-centric short prompt generation.
2. Related Work
Controllable video generation [11, 16, 21, 53] has long been
a central topic in AI. Recent advanced DiT methods, such as
OpenAI’s Sora [3] and HunyuanVideo [34], yield photore-
alistic videos over extended durations. Early work focused
on text-controlled video generation [24, 52], the prevalent
approach. Yet, text prompts alone may insufficiently cap-
ture user intent, spurring exploration of additional inputs
including static images [17, 62], sketches [58, 74], human
poses [33, 45, 77], camera views [21, 75], and even extra
videos [14, 32, 70]. Thus, unifying these diverse conditions
into an “any-condition” framework is highly valuable.
Recent works such as VideoComposer [58], Ctrl-Adapter [41], and ControlVideo [74] have explored any-condition video generation. However, they face challenges in controlling multiple modalities due to the limited interpretive capacity of the text encoders in Diffusion or DiT models. Motivated
by existing MLLMs’ multimodal reasoning [39, 43, 56], we
propose leveraging an MLLM to consolidate all possible
conditions into structured dense captions for better control-
lable generation. SoTA DiT models already exhibit the ca-
pacity to interpret dense textual descriptions as long as the
input captions are sufficiently detailed in depicting both the
scene and the intended generation goals. Thus, our MLLM-
based encoder alleviates the comprehension bottleneck, en-
abling higher-quality video generation. To our knowledge,
this is the first attempt in the field of any-condition video
generation. Moreover, as the captioning stage is decoupled from the backbone DiT, Any2Caption can integrate with existing video generation solutions without additional retraining.
Our approach also relates to video recaptioning, as our
system produces dense captions from given conditions. In
text-to-video settings, prior work [15, 27, 47, 67] shows that
recaptioning yields detailed annotations that improve DiT
training. For instance, ShareGPT4Video [10] uses GPT-4V
[1] to reinterpret video content, while MiraData [30] and In-
stanceCap [15] focus on structured and instance-consistent
recaptioning. Unlike these methods, we avoid retraining powerful DiT models with dense captions by training an MLLM as an any-condition encoder on pairs of short and dense captions, which are easier to obtain. Moreover, recaptioning
entire videos can introduce noise or hallucinations that un-
dermine DiT training, whereas our framework sidesteps this
risk. Finally, because previous studies rely on dense-caption-trained DiT models, the concise prompts that real-world users provide can create a mismatch that degrades generation quality.
3. Any2CapIns Dataset Construction
While relevant studies recaption target videos for dense
captions for enhanced T2V generation [10, 15, 30], these
datasets exhibit two key limitations: (1) the absence of
non-text conditions, and (2) short prompts that do not ac-
count for interactions among non-text conditions, poten-
tially leading to discrepancies in real-world applications.
To address these limitations, we introduce a new dataset,
Any2CapIns , specifically designed to incorporate diverse
multimodal conditions for generating structured video cap-
tions. The dataset is constructed through a three-step pro-
cess (cf. Fig. 2), including data collection, structured cap-
tion construction, and user-centric short prompt generation.
Step-1: Data Collection. We begin by systematically categorizing conditions into four primary types: 1) Spatial-wise conditions, which focus on the structural and spatial properties of the video, e.g., depth maps, sketches, and video frames. 2) Action-wise conditions, which emphasize motion and human dynamics in the target video, e.g., human pose, motion. 3) Composition-wise conditions, which focus on scene composition, particularly in terms of object interactions and multiple identities in the target video. 4) Camera-wise conditions, which control video generation from a cinematographic perspective, e.g., camera angles, movement trajectories. Since it is infeasible to encompass all possible conditions in dataset collection, we curate representative datasets under each category, specifically including depth maps, human pose, multiple identities, and camera motion. During the data collection process, we leverage SoTA tools to construct conditions. For instance, Depth Anything [65] is used to generate depth maps, DWPose [66] provides human pose annotations, and SAM2 [50] is utilized for segmentation construction. In total, we collect 337K video instances and 407K conditions, with detailed statistics of the dataset presented in Tab. 1.
[Figure 3: histograms of the number of videos (y-axis, 0k to 80k) against caption length in words (x-axis), with short captions spanning roughly 0 to 92 words and structured captions roughly 170 to 316 words.]
Figure 3. Distribution of the short/structured caption length (in words) in Any2CapIns.
Step-2: Structured Video Caption Generation. The granularity of a caption, specifically the selection of elements it encompasses, plays a critical role in guiding the model to produce videos that closely align with the users’ intentions while preserving coherence and realism. Drawing inspiration from [30], we inherit its structured caption format consisting of (1) Dense caption, (2) Main object caption, (3) Background caption, (4) Camera caption, and (5) Style caption. Furthermore, because the action descriptions of the subjects significantly influence the motion smoothness of the videos [57], we explicitly incorporate the (6) Action caption of subjects to form the final structured caption. Technically, following [57], we independently generate each caption component and subsequently integrate them to construct the final structured caption.
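The six-component format can be pictured as a small record type. The sketch below is only illustrative: the class name `StructuredCaption`, its field names, and the numbered rendering (modeled on the layout shown in Fig. 2) are assumptions, not the released data schema.

```python
from dataclasses import dataclass, fields


@dataclass
class StructuredCaption:
    """Hypothetical container for the six caption components of Step-2."""
    dense: str
    main_object: str
    background: str
    camera: str
    style: str
    action: str

    def render(self) -> str:
        """Integrate the independently generated components into one numbered block."""
        lines = []
        for index, field in enumerate(fields(self), start=1):
            # "main_object" -> "Main object caption"
            label = field.name.replace("_", " ").capitalize() + " caption"
            lines.append(f"{index}. {label}: {getattr(self, field.name)}")
        return "\n".join(lines)


example = StructuredCaption(
    dense="A woman walks through a busy street.",
    main_object="A woman in a dark jacket.",
    background="Tall buildings and traffic.",
    camera="The camera follows her from behind.",
    style="Realistic, natural light.",
    action="She walks at a steady pace.",
)
print(example.render())
```

Keeping the components as separate fields mirrors the statement that each component is generated independently and only then merged.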
Step-3: User-centric Short Prompt Generation. In this step, we construct short prompts from a user-centric perspective, considering how users naturally express their intentions. Our analysis highlights three key characteristics of user-generated prompts: 1) Conciseness and Simplicity, where users favor brief and straightforward wording; 2) Condition-Dependent Omission, whereby users often omit explicit descriptions of certain attributes (e.g., camera movement) when such conditions are already specified; and 3) Implicit Instruction of Target Video, where users convey their intent indirectly (e.g., specifying multiple identities without explicitly detailing their interactions). Guided by these observations, we employ GPT-4V to infer potential user prompts under condition-specific constraints. Given a structured video caption, target video, and associated conditions, we apply tailored constraints to preserve condition invariance when relevant. We also explicitly control the length of the generated prompts to ensure conciseness. Finally, we conduct manual verification and filtering to further refine the dataset. Fig. 3 presents the length distribution of the resulting short and structured prompts.
Category | #Inst. | #Condition | #Avg. Len. | #Total Len.
Depth | 182,945 | 182,945 | 9.87s | 501.44h
Human Pose | 44,644 | 44,644 | 8.38s | 108.22h
Multi-Identities | 68,255 | 138,089 | 13.01s | 246.69h
Camera Movement | 41,112 | 41,112 | 6.89s | 78.86h
Table 1. Statistics of the collected dataset across four types of conditions. #Inst. means the number of instances, and #Condition denotes the number of unique conditions. #Avg. / #Total Len. indicate the average and total video durations, respectively.
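As a quick consistency check, the per-category counts in Table 1 can be summed and compared with the headline figures of 337K instances and 407K conditions. The snippet below only re-adds the numbers printed in the table; the variable names are ours.

```python
# (instances, conditions) per category, copied from Table 1.
table1 = {
    "Depth": (182_945, 182_945),
    "Human Pose": (44_644, 44_644),
    "Multi-Identities": (68_255, 138_089),
    "Camera Movement": (41_112, 41_112),
}

total_instances = sum(inst for inst, _ in table1.values())
total_conditions = sum(cond for _, cond in table1.values())

print(total_instances, total_conditions)  # 336956 406790
print(round(total_instances / 1000), round(total_conditions / 1000))  # 337 407
```

Only the multi-identity category has more conditions than instances, since a single instance there can carry several identity images.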
4. Any2Caption Model
In this section, we introduce Any2Caption, a novel MLLM explicitly designed to comprehensively model and interpret arbitrary multimodal conditions for controllable video caption generation, as illustrated in Fig. 4(a). Formally, the user provides a short text prompt $T$ along with non-text conditions $C = [c_1, \cdots, c_n]$, where the non-text conditions can be either a single condition ($n = 1$) or a combination of multiple conditions. The objective of this task is to generate a detailed and structured caption that serves as a control signal for video generation.
[Figure 4: (a) text, image, video, motion, and camera encoders feed Qwen2-LLM, which outputs a structured caption for downstream generators (CTRL-Adapter, VideoComposer, CameraCtrl, ControlVideo, ConceptMaster, MotionCtrl); (b) progressive mixed training: Any2CapIns conditions (depth, camera, human pose, multi-IDs) and short captions with dropout, joint training with additional vision/text instruction data, and a training-ratio axis marked 0.4, 0.6, 0.8.]
Figure 4. Architecture illustration of Any2Caption (a), where Qwen2-LLM serves as the backbone and is paired with text, image, video, motion, and camera encoders to produce structured captions. After alignment learning, we perform a progressive mixed training strategy (b), where additional vision/text instruction datasets are progressively added for joint training, and meanwhile, for the input short caption, we adopt a random-dropout mechanism at the sentence level to enhance robustness.
Architecture. Similar to existing MLLMs [39, 43, 63], Any2Caption incorporates an image encoder $F_I$, a video encoder $F_V$, a motion encoder $F_M$, and a camera encoder $F_C$ to process non-text conditions. These encoders are then integrated into an LLM backbone $F_{LLM}$ (i.e., Qwen2-LLM) to facilitate structured video captioning. Specifically, we leverage a ViT-based visual encoder from Qwen2-VL as $F_I$ and $F_V$ for the unified modeling of images and videos, achieving effective interpretation of input conditions represented in image or video formats, such as depth maps and multiple identities. To enable human pose understanding, we represent the extracted human pose trajectories as $\mathcal{H} = \{(x_n^k, y_n^k) \mid k = 1, \cdots, K,\ n = 1, \cdots, N\}$, where $N$ denotes the number of video frames and $K$ is the number of keypoints. These trajectories are then visualized within video frames to enable further processing by the motion encoder, which shares the same architectural structure and initialization as the vision encoder. For camera motion understanding, inspired by [21], we introduce a camera encoder that processes a Plücker embedding sequence $P \in \mathbb{R}^{N \times 6 \times H \times W}$, where $H$ and $W$ are the height and width of the video. This embedding accurately captures camera pose information, enabling precise modeling of camera trajectories. Finally, in line with Qwen2-VL, we employ special tokens to distinguish non-text conditions from texts. Besides the existing tokens, we introduce <|motion_start|>, <|motion_end|>, <|camera_start|>, and <|camera_end|> to demarcate the start and end of human and camera pose features.
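To make the $N \times 6 \times H \times W$ shape of the camera input concrete, the following is a minimal sketch of a per-pixel Plücker embedding under stated assumptions: a pinhole camera with intrinsics `fx, fy, cx, cy`, a camera-to-world rotation `R`, a camera center `o`, rays through pixel centers, and the channel order (moment $o \times d$, direction $d$). The paper does not spell out these conventions, so the function names and choices here are illustrative only.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plucker_frame(R, o, fx, fy, cx, cy, height, width):
    """Return a 6 x height x width nested list for one camera pose."""
    emb = [[[0.0] * width for _ in range(height)] for _ in range(6)]
    for v in range(height):
        for u in range(width):
            # Ray through the pixel center in camera coordinates (assumed convention).
            d_cam = ((u + 0.5 - cx) / fx, (v + 0.5 - cy) / fy, 1.0)
            # Rotate into world coordinates and normalize.
            d = tuple(sum(R[r][k] * d_cam[k] for k in range(3)) for r in range(3))
            norm = math.sqrt(sum(x * x for x in d))
            d = tuple(x / norm for x in d)
            moment = cross(o, d)
            for channel, value in enumerate(moment + d):
                emb[channel][v][u] = value
    return emb


def plucker_sequence(poses, fx, fy, cx, cy, height, width):
    """Stack per-frame embeddings into an N x 6 x H x W nested list."""
    return [plucker_frame(R, o, fx, fy, cx, cy, height, width) for R, o in poses]
```

Because each pixel stores both a ray direction and its moment about the origin, the embedding encodes camera rotation and translation jointly, which is what lets a vision-style encoder consume camera trajectories as dense feature maps.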
Training Recipes. To accurately interpret user generation intent under arbitrary conditions and yield structured target video captions, large-scale pretraining and instruction tuning are required. To this end, we adopt a two-stage training procedure.
Stage-I: Alignment Learning. In this stage, as the image and video encoders have been well trained in Qwen2-VL, we focus only on aligning human pose features from the motion encoder and camera movement features from the camera encoder with the word embeddings of the LLM backbone. To achieve this, we freeze the LLM and vision encoder while keeping the motion encoder trainable and optimizing it on a human pose description task. Similarly, for camera movement alignment, we unfreeze the camera encoder and train it on a camera movement description task, ensuring that camera-related conditions are embedded into the model’s latent space. This alignment phase establishes a strong foundation for effective representation learning for these conditions.
Stage-II: Condition-Interpreting Learning. Building upon the aligned encoders and pretrained Qwen2-VL weights, we fine-tune the model on the Any2CapIns dataset for multimodal condition interpretation. However, direct fine-tuning leads to catastrophic forgetting due to the fixed output structure. To address this, we propose a progressive mixed training strategy. Specifically, the model is first trained on a single condition to establish a strong condition-specific understanding. As new conditions are introduced, we gradually incorporate additional vision and text instruction data, such as LLaVA-instruction [43] and Alpaca-52K [54]. This stepwise strategy ensures robust multimodal condition interpretation while preventing knowledge degradation.
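One way to picture the progressive mixed training strategy is as a staged data schedule: each stage adds one more condition type and mixes in a growing share of general instruction data. The sketch below is a loose illustration under our own assumptions; the function name `build_stage_mix`, the condition order, and the ratio values passed in are hypothetical and are not the authors' settings.

```python
import random


def build_stage_mix(any2cap_by_condition, aux_instructions, condition_order,
                    aux_ratios, seed=0):
    """Build one training pool per stage.

    Stage k uses the first k+1 condition types and adds auxiliary instruction
    examples amounting to aux_ratios[k] times the size of the condition data.
    """
    rng = random.Random(seed)
    stages = []
    for k, ratio in enumerate(aux_ratios):
        active = condition_order[:k + 1]
        core = [ex for cond in active for ex in any2cap_by_condition[cond]]
        n_aux = min(len(aux_instructions), round(ratio * len(core)))
        mixed = core + rng.sample(aux_instructions, n_aux)
        rng.shuffle(mixed)
        stages.append({"conditions": active, "examples": mixed, "n_aux": n_aux})
    return stages


# Toy usage with placeholder example IDs and made-up ratios.
stages = build_stage_mix(
    {"depth": ["d1", "d2"], "human_pose": ["p1", "p2"]},
    ["a1", "a2", "a3", "a4"],
    ["depth", "human_pose"],
    [0.0, 0.5],
)
```

Mixing general instruction data alongside the fixed-format captioning data is the mechanism the text credits with countering catastrophic forgetting.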
5. Evaluation Suite
In this section, we introduce the evaluation suite for com-
prehensively assessing the capability of Any2Caption in
interpreting user intent and generating structured captions.
Lexical Matching Score. To assess caption generation
quality from a lexical standpoint, we employ standard eval-
uation metrics commonly used in image/video captioning
tasks, including BLEU [48], ROUGE [40], and METEOR [4].
We also introduce a Structural Integrity score to
verify whether the generated captions adhere to the required
six-component format, thereby ensuring completeness.
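The exact computation of the Structural Integrity score is not given here, so the following is a minimal sketch under an assumed definition: the percentage of generated captions in which all six component headers appear in order, each followed by some text. The header strings and function names are illustrative.

```python
import re

REQUIRED_COMPONENTS = [
    "Dense caption",
    "Main object caption",
    "Background caption",
    "Camera caption",
    "Style caption",
    "Action caption",
]


def has_all_components(caption: str) -> bool:
    """True if every header appears in order and is followed by non-empty text."""
    position = 0
    for name in REQUIRED_COMPONENTS:
        pattern = re.compile(rf"{re.escape(name)}\s*:\s*\S", re.IGNORECASE)
        match = pattern.search(caption, position)
        if match is None:
            return False
        position = match.end()
    return True


def structural_integrity(captions) -> float:
    """Percentage of captions that follow the six-component format."""
    captions = list(captions)
    if not captions:
        return 0.0
    return 100.0 * sum(has_all_components(c) for c in captions) / len(captions)
```

A check of this kind measures only format compliance; it says nothing about whether the content of each component is correct, which is what the lexical, semantic, and intent metrics cover.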
Semantic Matching Score. To evaluate the semantic alignment of generated captions, we employ BERTScore [71], which greedily matches the contextual token embeddings of the generated and reference captions by cosine similarity and aggregates the matches into precision, recall, and F1, thereby capturing meaning preservation beyond exact lexical overlap. Additionally, we utilize CLIP Score [23] to assess the semantic consistency between the input visual condition and the generated videos.
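The matching step behind BERTScore can be illustrated with toy vectors. The sketch below implements only the greedy cosine matching and the F1 aggregation; in practice the embeddings come from a pretrained language model, and options such as IDF weighting and baseline rescaling are omitted here. Inputs are assumed to be non-empty lists of non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(candidate, reference):
    """F1 over greedy token matches between candidate and reference embeddings."""
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy 2-d "embeddings": the candidate has one token the reference lacks.
print(greedy_match_f1([(1.0, 0.0), (0.0, 1.0)], [(1.0, 0.0)]))
```

In the toy call, precision is 0.5 (one of two candidate tokens has a match) and recall is 1.0, so the score rewards coverage of the reference while penalizing the unmatched extra token.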
Category | Structural Integrity | B-2 | R-L | METEOR | BERTScore | Accuracy | Quality
Entire Structured Caption | 91.25 | 54.99 | 48.63 | 52.47 | 91.95 | 68.15 | 3.43
Dense Caption | - | 44.24 | 42.89 | 49.51 | 92.42 | 78.47 | 3.47
Main Object Caption | - | 38.54 | 47.46 | 52.48 | 92.02 | 56.28 | 2.74
Background Caption | - | 44.65 | 46.73 | 48.87 | 92.90 | 69.37 | 2.69
Camera Caption | - | 60.21 | 96.10 | 94.32 | 99.31 | 66.31 | 3.75
Style Caption | - | 41.71 | 47.70 | 55.9 | 93.48 | 63.75 | 3.05
Action Caption | - | 31.91 | 39.83 | 45.25 | 91.44 | 57.98 | 2.13
Column groups: Lexical Matching (B-2, R-L, METEOR); Semantic Matching (BERTScore); Intent Reasoning (Accuracy, Quality).
Table 2. Quantitative results of structured caption generation quality under four aspects: structural integrity, lexical matching, semantic matching, and intent reasoning. We demonstrate the overall caption generation capability and the individual component generation performance within the structure. “B-2” and “R-L” denote BLEU-2 and ROUGE-L, respectively.
Caption Enrich | CLIP-T ↑ | Smoothness ↑ | Aesthetic ↑ | Integrity ↑
Short Cap. | 18.31 | 93.46 | 5.32 | 55.39
Short Cap. w/ Condition Cap. | 19.19 | 93.41 | 5.41 | 54.91
Structured Cap. | 19.87 | 94.38 | 5.46 | 57.47
Column groups: Text (CLIP-T); Video Generation (Smoothness, Aesthetic, Integrity).
Table 3. Quantitative results comparing short caption, short caption combined with condition caption, and structured caption for multi-identity video generation.
Training Strategy | B-2 ↑ | Accuracy ↑ | Smoothness ↑ | Dynamics ↑ | Aesthetic ↑
Any2Caption | 47.69 | 67.35 | 94.60 | 17.67 | 5.53
w/o Two-Stage | 33.70 | 51.79 | 93.31 | 16.36 | 5.50
w/o Dropout | 49.24 | 69.51 | 94.16 | 14.54 | 5.51
Column groups: Caption (B-2, Accuracy); Video Generation (Smoothness, Dynamics, Aesthetic).
Table 4. Ablation study on training strategy. “w/o Two-Stage” means alignment learning is not applied during training, while “w/o Dropout” denotes that short captions are not randomly dropped.
Intent Reasoning Score. Evaluating structured captions typically focuses on their linguistic quality, yet it rarely assesses whether the model truly understands user intent and accurately reflects it across aspects such as style, emotion, and cinematic language. To bridge this gap, and drawing inspiration from [8], we introduce the Intent Reasoning Score (IRScore), a novel quantitative metric that leverages LLMs to assess whether generated captions accurately capture user intentions. IRScore identifies user-specified control factors, then decomposes the generated caption into QA pairs aligned with these factors. The framework has four steps: (1) User Intention Extraction: Analyze provided conditions to categorize user intent into six aspects: subject, background, movement, camera, interaction, and style. (2) Ground-Truth QA Pair Construction: Formulate aspect-specific QA pairs with defined requirements (e.g., for subject-related attributes, emphasize object count, appearance). (3) Predicted Answer Generation: Prompt GPT-4V to parse the predicted caption and generate answers based solely on it. (4) Answer Evaluation: Following [36], GPT-4V outputs two scores (correctness and quality) for each answer, which are then averaged across all QA pairs. More details are in Appendix §E.
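The final aggregation step of IRScore, averaging the two judge scores over all QA pairs, can be sketched as follows. The record type `JudgedAnswer`, the boolean encoding of correctness, and the reporting of accuracy as a percentage are assumptions made for illustration; the judge prompts and score scale are described in the appendix and are not reproduced here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JudgedAnswer:
    """Hypothetical record of one judged QA pair."""
    aspect: str      # e.g., "subject", "camera", "style"
    correct: bool    # judge's correctness verdict
    quality: float   # judge's quality rating


def irscore(judged):
    """Return (accuracy in percent, mean quality) over all QA pairs."""
    judged = list(judged)
    if not judged:
        raise ValueError("at least one judged QA pair is required")
    accuracy = 100.0 * mean(1.0 if j.correct else 0.0 for j in judged)
    quality = mean(j.quality for j in judged)
    return accuracy, quality


print(irscore([
    JudgedAnswer("subject", True, 4.0),
    JudgedAnswer("camera", False, 2.0),
]))  # (50.0, 3.0)
```

Keeping the aspect label on each record would also allow per-aspect breakdowns, although the aggregation above ignores it.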
Video Generation Quality Score. The primary objective
of generating structured captions is to enhance the video
generation quality and controllability. Therefore, we adopt
a series of metrics to assess the quality of videos generated
based on structured captions. Following [26, 30], we eval-
uate video generation quality across four key dimensions:
motion smoothness, dynamic degree, aesthetic quality, and
image integrity. To further verify adherence to specific
non-text conditions, we use specialized metrics: RotErr, TransErr, and CamMC [21] for camera motion accuracy; Mean Absolute Error (MAE) for depth consistency [18]; DINO-I [51] and CLIP-I [51] scores to evaluate identity preservation under multiple identities; and Pose Accuracy (Pose Acc.) [45] to assess pose alignment in the generated videos.
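For the camera metrics, a minimal sketch is given below, assuming the commonly used definitions: RotErr sums, over frames, the geodesic angle between ground-truth and estimated rotations, and TransErr sums the Euclidean distance between translations. The text above does not restate the formulas, so these definitions and the function names should be read as assumptions; CamMC and the pose-estimation step that recovers camera parameters from generated videos are left out.

```python
import math


def rot_err(gt_rotations, gen_rotations):
    """Sum over frames of arccos((trace(R_gen @ R_gt^T) - 1) / 2), in radians."""
    total = 0.0
    for r_gt, r_gen in zip(gt_rotations, gen_rotations):
        # trace(R_gen @ R_gt^T) equals the sum of elementwise products.
        trace = sum(r_gen[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
        # Clamp to guard against rounding slightly outside [-1, 1].
        cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
        total += math.acos(cos_angle)
    return total


def trans_err(gt_translations, gen_translations):
    """Sum over frames of the L2 distance between translation vectors."""
    return sum(math.dist(t_gt, t_gen)
               for t_gt, t_gen in zip(gt_translations, gen_translations))
```

For example, a single frame whose estimated rotation differs from the ground truth by a quarter turn about one axis contributes $\pi/2$ to RotErr.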
6. Experiments
6.1. Setups
Dataset. We manually construct 200 test cases for each type of condition (i.e., depth, human pose, multiple identities, camera, and compositional conditions) to evaluate the model’s performance. Additionally, we assess the model on publicly available benchmarks (e.g., [41, 68, 78]). For further details, please refer to the Appendix §F.
Implementation Details. We leverage Qwen2-VL-7B [56] as the backbone of our model, which supports both image and video understanding. The human pose in the input conditions is encoded and processed in the video format. The camera encoder adopts the vision encoder architecture with the following settings: 96 input channels, a patch size of 16, a depth of 8, and 8 attention heads. During training, to simulate the brevity and randomness of user inputs, we randomly drop sentences from the short caption with a dropout rate of 0.6; a similar dropout strategy is applied to non-textual conditions. We conduct training on 8×A800 GPUs. For further details on the training parameters for each stage, please refer to the Appendix §F.
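The sentence-level dropout on short captions can be sketched in a few lines. The splitting rule (sentence boundaries at `.`, `!`, or `?` followed by whitespace), the independent per-sentence draw, and the possibility of an empty result are our assumptions; only the dropout rate of 0.6 comes from the text.

```python
import random
import re


def drop_sentences(short_caption, drop_rate=0.6, rng=None):
    """Independently drop each sentence of the short caption with probability drop_rate."""
    rng = rng or random.Random()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", short_caption.strip()) if s]
    # Keep a sentence when its random draw falls at or above the drop rate.
    kept = [s for s in sentences if rng.random() >= drop_rate]
    return " ".join(kept)  # may be empty if every sentence is dropped


prompt = ("A stylish woman in a dark jacket walks through a bustling urban area. "
          "Capture the tall buildings and busy street behind her.")
print(drop_sentences(prompt, drop_rate=0.6, rng=random.Random(0)))
```

Training on randomly shortened prompts forces the model to recover the missing details from the non-text conditions, which matches the stated goal of simulating brief and incomplete user inputs.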
6.2. Experimental Results and Analyses
In this section, we present the experimental results and provide in-depth analyses of where the system’s gains come from, organized around six key research questions.
RQ-1: How good is the structured caption generation quality? We first evaluate whether our proposed model can accurately interpret user intent and generate high-quality structured captions. From a caption-generation perspective, we compare the predicted captions with gold-standard captions across various metrics (refer to Tab. 2). We observe that our model successfully produces the desired structured content, achieving 91.25% in structural integrity. Moreover, it effectively captures the key elements of the gold captions, attaining a ROUGE-L score of 48.63 and a BERTScore of 91.95. Notably, the model demonstrates the strongest lexical and semantic matching on camera-related details (a ROUGE-L of 96.10 and a BERTScore of 99.31 for the camera caption). Finally, regarding user intent analysis, the model incorporates user preferences into its structured outputs, reaching an intent reasoning accuracy of 68.15 and a quality score of 3.43 on the entire structured caption.
[Figure 5: video frames generated by CogVideoX-2B (left) and HunyuanVideo (right) from structured captions that Any2Caption produces for inputs labeled Short Cap. + ID, Short Cap. + IDs, Short Cap. + IDs + Depth, and Short Cap. + Camera + Depth. Example-1: "A man gestures while the woman listens. They sit in a sunny park. The camera captures close-up shots of their heads and shoulders." Example-2: "A serene winter backyard with snow-covered ground and bare trees, revealing a blue shed with a white garage door and a doghouse". Example-3: "A young man carrying a messenger bag runs down a narrow, cobblestone street filled with sandbags and crates, suggesting a wartime." Example-4: "A woman walks in a minimalist, modern room. She is holding two mugs and looks slightly displeased. The room has natural light."]
Figure 5. Illustrations of generated videos where only the structured captions yielded by Any2Caption are fed into CogVideoX-2B (Left) and HunyuanVideo (Right). We can observe that some key features of the input identity images, such as the background and main object, can be accurately visualized in the generated videos.
Model | CLIP-T ↑ | RotErr ↓ | TransErr ↓ | CamMC ↓ | DINO-I ↑ | CLIP-I ↑ | MAE ↓ | Pose Acc. ↑ | Smoothness ↑ | Dynamic ↑ | Aesthetic ↑ | Integrity ↑
• Camera to Video
MotionCtrl [60] | 19.67 | 1.54 | 4.49 | 4.80 | - | - | - | - | 96.13 | 9.75 | 5.40 | 73.69
+ Structured Cap. | 20.16 | 1.45 | 4.37 | 4.78 | - | - | - | - | 96.16 | 11.43 | 5.71 | 74.63
CameraCtrl [21] | 18.89 | 1.37 | 3.51 | 4.78 | - | - | - | - | 94.11 | 12.59 | 4.26 | 71.84
+ Structured Cap. | 21.70 | 0.94 | 2.97 | 4.37 | - | - | - | - | 95.16 | 13.72 | 4.66 | 72.47
• Depth to Video
Ctrl-Adapter [41] | 20.37 | - | - | - | - | - | 25.63 | - | 94.53 | 20.73 | 4.63 | 46.98
+ Structured Cap. | 23.30 | - | - | - | - | - | 21.87 | - | 95.54 | 15.14 | 5.31 | 54.20
ControlVideo [74] | 22.17 | - | - | - | - | - | 30.11 | - | 92.88 | 5.94 | 5.29 | 63.85
+ Structured Cap. | 24.18 | - | - | - | - | - | 23.92 | - | 94.47 | 18.27 | 9.77 | 66.28
• Identities to Video
ConceptMaster [25] | 16.04 | - | - | - | 36.37 | 65.31 | - | - | 94.71 | 8.18 | 5.21 | 43.68
+ Structured Cap. | 17.15 | - | - | - | 39.42 | 66.74 | - | - | 95.05 | 10.14 | 5.68 | 49.73
• Human Pose to Video
FollowYourPose [45] | 21.11 | - | - | - | - | - | - | 30.47 | 91.71 | 14.29 | 4.95 | 58.84
+ Structured Cap. | 21.39 | - | - | - | - | - | - | 31.59 | 92.87 | 16.47 | 5.88 | 56.30
Column groups: Text (CLIP-T); Camera (RotErr, TransErr, CamMC); Identities (DINO-I, CLIP-I); Depth (MAE); Human Pose (Pose Acc.); Overall Quality (Smoothness, Dynamic, Aesthetic, Integrity).
Table 5. Performance comparison on four types of conditions (camera, depth, identities, and human pose) between directly using short captions and integrating structured captions under various video quality evaluation metrics. Better results are marked in bold.
To further showcase the model’s capacity to under-
stand and leverage input conditions, we directly feed the
structured captions—derived from our model’s interpre-
tation—into downstream text-to-video generation systems
(e.g., CogVideoX-2B [67] and HunyuanVideo [34]), as illustrated
in Fig. 5. Even without explicit visual conditions (e.g.,
identities, depth, or camera movement), the resulting videos
clearly align with the input prompts, such as the hat’s color
and style, or the woman’s clothing texture in Example 1,
indicating that our structured captions successfully capture
intricate visual details. In particular, the model is able to
accurately grasp dense conditions, such as depth sequences
or compositional requirements in Example 4, ultimately en-
abling controllable video generation. While certain fine-
grained elements may not be exhaustively described in text,
resulting in occasional discrepancies with the actual visual
content, the overall controllability remains robust.
RQ-2: Is the video generation quality enhanced with
structured caption? Here, we investigate whether inte-
grating structured captions consistently improves control-
lable video generation in multiple methods. We explore the
impact of adding camera, depth, identities, and human pose
conditions to various controllable video generation methods.
As shown in Tab. 5, most tested models show gains in overall
video quality, such as smoothness and frame fidelity, after
incorporating structured captions (exceptions include Dynamic
for Ctrl-Adapter and Integrity for FollowYourPose), without
requiring any changes to the model architectures or additional
training. Moreover, these models show enhanced adher-
ence to the specified conditions, suggesting that our gener-
ated captions precisely capture user requirements and lead
to more accurate, visually coherent video outputs. More ex-
amples can be found in the Appendix §G.2.
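To make the comparison in Tab. 5 concrete, the sketch below computes direction-aware changes for one row pair (CameraCtrl with and without structured captions). The numbers are copied from Tab. 5; the helper name `signed_gain` and the dictionary layout are illustrative assumptions and are not taken from the evaluation code of this work.

```python
# Illustrative sketch (not the paper's evaluation code): direction-aware gains
# for CameraCtrl with vs. without structured captions, using values from Tab. 5.

# True means higher is better; False means lower is better.
HIGHER_IS_BETTER = {
    "CLIP-T": True, "RotErr": False, "TransErr": False, "CamMC": False,
    "Smoothness": True, "Dynamic": True, "Aesthetic": True, "Integrity": True,
}

baseline = {"CLIP-T": 18.89, "RotErr": 1.37, "TransErr": 3.51, "CamMC": 4.78,
            "Smoothness": 94.11, "Dynamic": 12.59, "Aesthetic": 4.26,
            "Integrity": 71.84}
structured = {"CLIP-T": 21.70, "RotErr": 0.94, "TransErr": 2.97, "CamMC": 4.37,
              "Smoothness": 95.16, "Dynamic": 13.72, "Aesthetic": 4.66,
              "Integrity": 72.47}


def signed_gain(metric: str, before: float, after: float) -> float:
    """Return the change oriented so that a positive value means 'better'."""
    delta = after - before
    return delta if HIGHER_IS_BETTER[metric] else -delta


# Round to two decimals to match the precision reported in the table.
gains = {m: round(signed_gain(m, baseline[m], structured[m]), 2) for m in baseline}
improved = [m for m, g in gains.items() if g > 0]

if __name__ == "__main__":
    for metric, gain in gains.items():
        print(f"{metric:<11} {gain:+.2f}")
    print(f"{len(improved)}/{len(gains)} metrics improved")
```

Orienting every metric so that positive means better makes it easy to spot rows where a structured caption helps on some metrics and hurts on others.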
RQ-3: Is the structured caption necessary? We exam-
ine whether a structured caption design is essential. We
Metric groups: Text (CLIP-T); Camera (RotErr, TransErr, CamMC); Identities (DINO-I, CLIP-I); Depth (MAE); Overall Quality (Smoothness, Dynamic, Aesthetic, Integrity).

| Compositional Condition | CLIP-T ↑ | RotErr ↓ | TransErr ↓ | CamMC ↓ | DINO-I ↑ | CLIP-I ↑ | MAE ↓ | Smoothness ↑ | Dynamic ↑ | Aesthetic ↑ | Integrity ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Camera+Identities | 14.81 | 1.37 | 4.04 | 4.24 | 25.63 | 64.14 | - | 94.43 | 28.87 | 4.99 | 59.81 |
| + Structured Cap. | 19.03 | 1.30 | 4.36 | 4.03 | 26.75 | 68.45 | - | 94.38 | 34.99 | 5.25 | 63.02 |
| Camera+Depth | 20.80 | 1.57 | 3.88 | 4.77 | - | - | 32.15 | 95.36 | 30.12 | 4.82 | 63.90 |
| + Structured Cap. | 21.19 | 1.49 | 4.41 | 4.84 | - | - | 25.37 | 95.40 | 30.10 | 4.96 | 65.05 |
| Depth+Identities | 20.01 | - | - | - | 35.24 | 57.82 | 23.00 | 93.15 | 32.21 | 4.96 | 61.21 |
| + Structured Cap. | 20.76 | - | - | - | 36.25 | 63.48 | 24.78 | 92.50 | 36.43 | 5.18 | 60.81 |
| Camera+Identities+Depth | 18.49 | 2.05 | 7.74 | 8.47 | 35.86 | 64.25 | 18.37 | 92.02 | 30.09 | 3.91 | 60.62 |
| + Structured Cap. | 19.52 | 1.57 | 7.74 | 8.20 | 38.74 | 64.37 | 17.41 | 93.03 | 32.81 | 4.99 | 61.22 |

Table 6. Quantitative comparison of structured captions when handling compositional conditions. Better results are marked in bold.
[Figure 6 panels: (a) Segmentation, (b) Style, (c) Masked Image, (d) Sketch. Axes in each panel: CLIP-T, Motion Smoothness, Aesthetic Quality, Image Quality. Legend: Short Cap., Structured Cap.]
Figure 6. Quantitative results on unseen conditions (i.e., segmen-
tation [58], style [68], masked image [58], and sketch [58]) when
using short and structured captions, respectively.
compare our structured caption approach with a simpler
method, where we first caption the input condition (e.g.,
multiple identity images) and then concatenate that caption
with the original short prompt, as shown in Tab. 3. Our
results indicate that merely appending the condition’s cap-
tion to the short prompt can reduce video smoothness and
image quality. One likely reason is that the identity images
may contain extraneous details beyond the target subject,
potentially conflicting with the original prompt and causing
inconsistencies. Consequently, controllability in the final
output is compromised. In contrast, our structured caption
method accurately identifies the target subject and augments
the prompt with relevant information, yielding more con-
trollable video generation (cf. §G.2).
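The contrast drawn in RQ-3 can be sketched as two ways of building the text prompt. The six aspects follow the caption types named in Fig. 1 (dense, main object, background, camera, style, action); the class name, field names, serialization format, and example strings are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, fields


@dataclass
class StructuredCaption:
    """Six caption aspects named in Fig. 1; the field names are illustrative."""
    dense: str
    main_object: str
    background: str
    camera: str
    style: str
    action: str

    def to_prompt(self) -> str:
        # Hypothetical serialization: one labeled line per aspect.
        return "\n".join(
            f"{f.name.replace('_', ' ').title()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


def naive_concat(short_prompt: str, condition_caption: str) -> str:
    """Baseline from RQ-3: append the condition's caption to the short prompt."""
    return f"{short_prompt} {condition_caption}"


if __name__ == "__main__":
    # Made-up strings for illustration only.
    short = "A woman walks in a minimalist, modern room."
    identity_caption = "A woman in a coat stands next to a parked car on a street."
    print(naive_concat(short, identity_caption))  # carries the unrelated car and street

    caption = StructuredCaption(
        dense="A woman walks through a minimalist, modern room.",
        main_object="A woman wearing a coat.",
        background="A bright room with natural light.",
        camera="The camera is fixed.",
        style="Realistic.",
        action="She walks slowly.",
    )
    print(caption.to_prompt())
```

In the naive variant, details of the identity image that are irrelevant to the target subject leak into the prompt, which is the conflict the paragraph above describes; the structured variant keeps only the aspects attributed to the subject.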
RQ-4: How effective is the training strategy? Next,
we investigate the contribution of the training mechanism,
and the results are shown in Tab. 4. During training, we
employ a two-stage training approach, consisting of alignment
learning followed by instruction tuning. When alignment
learning is omitted and the model proceeds directly to
instruction tuning, both captioning and video generation
performance degrade significantly. A possible explanation
is that bypassing alignment learning disrupts the
encoder’s adaptation process, which has been aligned to the
LLM backbone, leading to suboptimal results in subsequent
stages. Additionally, we compare the performance of the
model without the dropout mechanism. Although removing
dropout yields a marked improvement in captioning qual-
ity, the benefit to video generation is marginal. This sug-
gests that without dropout, the model may rely on shortcuts
from the input captions rather than fully understanding the
underlying intent, thereby increasing the risk of overfitting.
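One way to picture the dropout mechanism is as random removal of parts of the short caption during training, so the model cannot simply copy the input text. The sketch below assumes sentence-level dropping with a fixed probability; both the granularity and the rate are assumptions made here for illustration, and the function name `drop_sentences` is hypothetical.

```python
import random
import re
from typing import Optional


def drop_sentences(short_caption: str, drop_prob: float = 0.5,
                   rng: Optional[random.Random] = None) -> str:
    """Randomly remove sentences from a short caption (assumed granularity)."""
    rng = rng or random.Random()
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", short_caption.strip()) if s]
    # Keep each sentence independently with probability 1 - drop_prob.
    kept = [s for s in sentences if rng.random() >= drop_prob]
    return " ".join(kept)


if __name__ == "__main__":
    caption = "A man gestures while the woman listens. They sit in a sunny park."
    print(drop_sentences(caption, drop_prob=0.5, rng=random.Random(0)))
```

With parts of the short caption missing, the remaining information has to come from the non-text conditions, which matches the intuition that dropout discourages shortcuts from the input captions.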
RQ-5: How well does the model perform on composi-
tional conditions? We examine the impact of structured
captions under compositional conditions. As shown in Tab.
6, we evaluate combinations of camera, identities, and depth
conditions on our customized model and observe that structured
captions improve its performance on most metrics. Moreover, as
Examples 2 and 4 in Fig. 5 show, our model demonstrates
a thorough understanding of the interactions among various
conditions, for instance, capturing a woman’s hair color and
the position of a mug, accurately guiding the production of
videos that align with the specified requirements. This find-
ing further highlights that our approach can automatically
equip existing T2V models with the ability to handle com-
positional conditions without requiring additional training.
RQ-6: How well does Any2Caption generalize? Finally,
we investigate the model’s generalization ability by
evaluating its performance on “unseen” conditions, including
style, segmentation, sketch, and masked images. As
demonstrated in Fig. 6, the structured
captions generated by our model consistently enhance ex-
isting T2V frameworks, offering benefits such as increased
motion smoothness, aesthetic quality, and more accurate
generation control. We attribute these advantages to two
primary factors: the strong reasoning capabilities of our
MLLM backbone and our training strategy, i.e., progressive
mixed training, which leverages existing vision and text in-
structions for fine-tuning while mitigating knowledge for-
getting, thereby ensuring robust generalization.
7. Conclusion
In this work, we address the challenge of interpreting
user generation intent more accurately from any condition
for controllable video generation. We introduce
Any2Caption, a framework that decouples multimodal
condition interpretation from video synthesis. Built on an
MLLM, Any2Caption converts diverse inputs into dense
captions that drive high-quality video generation. We
further present Any2CapIns, a large-scale
dataset for effective instruction tuning. Experiments show
that our method improves controllability and video quality
across various backbones.
References
[1] Gpt-4v(ision) system card, 2023. 2, 3, 14
[2] Kuaishou. kling, 2024. 2
[3] Sora, 2024. 3, 13
[4] Satanjeev Banerjee and Alon Lavie. METEOR: An auto-
matic metric for MT evaluation with improved correlation
with human judgments. In ACL, pages 65–72, 2005. 5
[5] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel
Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi,
Zion English, Vikram Voleti, Adam Letts, Varun Jampani,
and Robin Rombach. Stable video diffusion: Scaling
latent video diffusion models to large datasets. CoRR,
abs/2311.15127, 2023. 13
[6] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dock-
horn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis.
Align your latents: High-resolution video synthesis with la-
tent diffusion models. In CVPR , pages 22563–22575, 2023.
13
[7] Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun
Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A.
Efros, and Tero Karras. Generating long videos of dynamic
scenes. In NeurIPS , 2022. 13
[8] Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng,
Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Sain-
ing Xie, and Christopher D. Manning. Auroracap: Efficient,
performant video detailed captioning and a new benchmark.
CoRR , abs/2410.03051, 2024. 5
[9] Ling-Hao Chen, Shunlin Lu, Ailing Zeng, Hao Zhang,
Benyou Wang, Ruimao Zhang, and Lei Zhang. Motionllm:
Understanding human behaviors from human motions and
videos. CoRR , abs/2405.20340, 2024. 14
[10] Lin Chen, Xilin Wei, Jinsong Li, Xiaoyi Dong, Pan Zhang,
Yuhang Zang, Zehui Chen, Haodong Duan, Lin Bin, Zhenyu
Tang, Li Yuan, Yu Qiao, Dahua Lin, Feng Zhao, and Jiaqi
Wang. Sharegpt4video: Improving video understanding and
generation with better captions. In NeurIPS , 2024. 2, 3, 14
[11] Yiran Chen, Anyi Rao, Xuekun Jiang, Shishi Xiao, Ruiqing
Ma, Zeyu Wang, Hui Xiong, and Bo Dai. Cinepregen: Cam-
era controllable video previsualization via engine-powered
diffusion. CoRR , abs/2408.17424, 2024. 3, 13
[12] Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixé,
and Nils Thuerey. Learning temporal coherence via self-supervision for gan-based video generation. ACM Trans.
Graph. , 39(4):75, 2020. 13
[13] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhi-
fang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang
Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical
report. CoRR , abs/2407.10759, 2024. 14
[14] Yufan Deng, Ruida Wang, Yuhao Zhang, Yu-Wing Tai, and
Chi-Keung Tang. Dragvideo: Interactive drag-style video
editing. In ECCV , pages 183–199, 2024. 3, 13
[15] Tiehan Fan, Kepan Nan, Rui Xie, Penghao Zhou, Zhenheng
Yang, Chaoyou Fu, Xiang Li, Jian Yang, and Ying Tai. In-
stancecap: Improving text-to-video generation via instance-
aware structured caption. CoRR , abs/2412.09283, 2024. 2,
3, 13, 14
[16] Haopeng Fang, Di Qiu, Binjie Mao, Pengfei Yan, and He
Tang. Motioncharacter: Identity-preserving and motion con-
trollable human video generation. CoRR , abs/2411.18281,
2024. 3, 13
[17] Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan
Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu,
Zhengjun Zha, Haibin Huang, and Chongyang Ma. I2v-
adapter: A general image-to-video adapter for diffusion
models. In ACM SIGGRAPH , page 112, 2024. 2, 3, 13
[18] Yuwei Guo, Ceyuan Yang, Anyi Rao, Maneesh Agrawala,
Dahua Lin, and Bo Dai. Sparsectrl: Adding sparse controls
to text-to-video diffusion models. In ECCV , pages 330–348,
2024. 6
[19] Shir Gur, Sagie Benaim, and Lior Wolf. Hierarchical patch
VAE-GAN: generating diverse videos from a single sample.
In NeurIPS, 2020. 13
[20] Niv Haim, Ben Feinstein, Niv Granot, Assaf Shocher, Shai
Bagon, Tali Dekel, and Michal Irani. Diverse generation
from a single video made possible. In ECCV , pages 491–
509, 2022. 13
[21] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo
Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: En-
abling camera control for text-to-video generation. CoRR ,
abs/2404.02101, 2024. 2, 3, 4, 6, 7, 13
[22] Xuanhua He, Quande Liu, Shengju Qian, Xin Wang, Tao
Hu, Ke Cao, Keyu Yan, Man Zhou, and Jie Zhang. Id-
animator: Zero-shot identity-preserving human video gen-
eration. CoRR , abs/2404.15275, 2024. 2
[23] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras,
and Yejin Choi. Clipscore: A reference-free evaluation met-
ric for image captioning. In EMNLP , pages 7514–7528,
2021. 5
[24] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and
Jie Tang. Cogvideo: Large-scale pretraining for text-to-video
generation via transformers. In ICLR , 2023. 2, 3, 13
[25] Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang,
Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, and
Kun Gai. Conceptmaster: Multi-concept video customiza-
tion on diffusion transformer models without test-time tun-
ing. CoRR , abs/2501.04698, 2025. 7
[26] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si,
Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin,
Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin
Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Vbench: Com-
prehensive benchmark suite for video generative models. In
CVPR , pages 21807–21818, 2024. 6
[27] Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Na-
garajan, Lorenzo Torresani, and Gedas Bertasius. Video re-
cap: Recursive captioning of hour-long videos. In CVPR ,
pages 18198–18208, 2024. 3, 13
[28] Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao
Chen. Motiongpt: Human motion as a foreign language. In
NeurIPS , 2023. 14
[29] Junpeng Jiang, Gangyi Hong, Lijun Zhou, Enhui Ma, Heng-
tong Hu, Xia Zhou, Jie Xiang, Fan Liu, Kaicheng Yu,
Haiyang Sun, Kun Zhan, Peng Jia, and Miao Zhang. Dive:
Dit-based video generation with enhanced control. CoRR ,
abs/2409.01595, 2024. 2
[30] Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xin-
tao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, and Ying Shan.
Miradata: A large-scale video dataset with long durations
and structured captions. In NeurIPS , 2024. 2, 3, 4, 6, 14
[31] Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao
Wang, Pengfei Wan, Di Zhang, Kun Gai, and Qiang Xu.
Fulldit: Multi-task video generative foundation model with
full attention. arXiv preprint arXiv:2503.19907 , 2025. 18
[32] Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe, James M.
Rehg, and Pinar Yanardag. RAVE: randomized noise shuffling
for fast and consistent video editing with diffusion models.
In CVPR, pages 6507–6516, 2024. 3, 13
[33] Johanna Karras, Aleksander Holynski, Ting-Chun Wang,
and Ira Kemelmacher-Shlizerman. Dreampose: Fashion
image-to-video synthesis via stable diffusion. In ICCV ,
pages 22623–22633, 2023. 2, 3, 13
[34] Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai,
Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang,
Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin
Wang, Andong Wang, Changlin Li, Duojun Huang, Fang
Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai,
Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang
Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu,
Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng,
Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan
Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Daquan Zhou,
Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang,
and Caesar Zhong. Hunyuanvideo: A systematic framework
for large video generative models. CoRR , abs/2412.03603,
2024. 3, 7, 13
[35] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H.
Hoi. BLIP-2: bootstrapping language-image pre-training
with frozen image encoders and large language models. In
ICML, pages 19730–19742, 2023. 14
[36] Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang,
Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat:
Chat-centric video understanding. CoRR , abs/2305.06355,
2023. 5
[37] Yanwei Li, Chengyao Wang, and Jiaya Jia. Llama-vid: An
image is worth 2 tokens in large language models. In ECCV ,
pages 323–340, 2024. 14
[38] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng
Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya
Jia. Mini-gemini: Mining the potential of multi-modality
vision language models. CoRR , abs/2403.18814, 2024. 14
[39] Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng
Jin, and Li Yuan. Video-llava: Learning united visual repre-
sentation by alignment before projection. In EMNLP , pages
5971–5984, 2024. 3, 4, 13, 14
[40] Chin-Yew Lin. ROUGE: A package for automatic evaluation
of summaries. In Text Summarization Branches Out , pages
74–81, 2004. 5
[41] Han Lin, Jaemin Cho, Abhay Zala, and Mohit Bansal.
Ctrl-adapter: An efficient and versatile framework for
adapting diverse controls to any diffusion model. CoRR ,
abs/2404.09967, 2024. 2, 3, 6, 7, 13
[42] Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen,
Jinbo Xing, Yibo Wang, Xintao Wang, Ying Shan, and Yu-
jiu Yang. Stylecrafter: Taming artistic video diffusion with
reference-augmented adapter learning. ACM Trans. Graph. ,
43(6):251:1–251:10, 2024. 2
[43] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee.
Improved baselines with visual instruction tuning. In CVPR ,
pages 26286–26296, 2024. 3, 4, 5, 13, 14
[44] Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Zi-
wei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte:
Latent diffusion transformer for video generation. CoRR ,
abs/2401.03048, 2024. 2
[45] Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran
Chen, Xiu Li, and Qifeng Chen. Follow your pose: Pose-
guided text-to-video generation using pose-free videos. In
AAAI , pages 4117–4125, 2024. 2, 3, 6, 7, 13
[46] Ze Ma, Daquan Zhou, Chun-Hsiao Yeh, Xue-She Wang, Xi-
uyu Li, Huanrui Yang, Zhen Dong, Kurt Keutzer, and Jiashi
Feng. Magic-me: Identity-specific video customized diffu-
sion. CoRR , abs/2402.09368, 2024. 2
[47] Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhen-
heng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai.
Openvid-1m: A large-scale high-quality dataset for text-to-
video generation. CoRR , abs/2407.02371, 2024. 3, 13
[48] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing
Zhu. Bleu: a method for automatic evaluation of machine
translation. In ACL, pages 311–318, 2002. 5
[49] Zhiwu Qing, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yujie
Wei, Yingya Zhang, Changxin Gao, and Nong Sang. Hierarchical
spatio-temporal decoupling for text-to-video generation.
In CVPR, pages 6635–6645, 2024. 13
[50] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman
Rädle, Chloé Rolland, Laura Gustafson, Eric Mintun, Junting
Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu,
Ross B. Girshick, Piotr Dollár, and Christoph
Feichtenhofer. SAM 2: Segment anything in images and
videos. CoRR , abs/2408.00714, 2024. 4
[51] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch,
Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine
tuning text-to-image diffusion models for subject-driven
generation. In CVPR , pages 22500–22510, 2023. 6
[52] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An,
Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual,
Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman.
Make-a-video: Text-to-video generation without text-video
data. In ICLR , 2023. 2, 3, 13
[53] Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi
Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any
3d and 4d scenes from a single image with controllable video
diffusion. CoRR , abs/2411.04928, 2024. 3, 13
[54] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois,
Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B.
Hashimoto. Stanford alpaca: An instruction-following
llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023. 5
[55] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo,
Adithya Iyer, Sai Charitha Akula, Shusheng Yang, Jihan
Yang, Manoj Middepogu, Ziteng Wang, Xichen Pan, Rob
Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A
fully open, vision-centric exploration of multimodal llms. In
NeurIPS , 2024. 14
[56] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao
Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang,
Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng
Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou,
and Junyang Lin. Qwen2-vl: Enhancing vision-language
model’s perception of the world at any resolution. CoRR ,
abs/2409.12191, 2024. 3, 6, 13
[57] Qiuheng Wang, Yukai Shi, Jiarong Ou, Rui Chen, Ke Lin,
Jiahao Wang, Boyuan Jiang, Haotian Yang, Mingwu Zheng,
Xin Tao, Fei Yang, Pengfei Wan, and Di Zhang. Koala-
36m: A large-scale video dataset improving consistency be-
tween fine-grained conditions and video content. CoRR ,
abs/2410.08260, 2024. 4
[58] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Ji-
uniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jin-
gren Zhou. Videocomposer: Compositional video synthesis
with motion controllability. In NeurIPS , 2023. 2, 3, 8, 13
[59] Yaohui Wang, Piotr Bilinski, François Brémond, and Antitza
Dantcheva. Imaginator: Conditional spatio-temporal GAN
for video generation. In WACV , pages 1149–1158, 2020. 13
[60] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li,
Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan.
Motionctrl: A unified and flexible motion controller for
video generation. In ACM SIGGRAPH , page 114, 2024. 7
[61] Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai,
Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei
Lu, Tong Lu, Ping Luo, Yu Qiao, and Jifeng Dai. Vision-
llm v2: An end-to-end generalist multimodal large language
model for hundreds of vision-language tasks. In NeurIPS ,
2024. 14
[62] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian
Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu
Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning
of image diffusion models for text-to-video generation. In
ICCV , pages 7589–7599, 2023. 2, 3, 13
[63] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng
Chua. Next-gpt: Any-to-any multimodal LLM. In ICML .
OpenReview.net, 2024. 4, 14
[64] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen
Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng
Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin,
Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jian-
wei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou,
Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming
Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na
Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao,
Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu,
Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiao-
huan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xu-
ancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang
Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru
Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical re-
port. CoRR , abs/2407.10671, 2024. 14
[65] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiao-
gang Xu, Jiashi Feng, and Hengshuang Zhao. Depth any-
thing V2. In NeurIPS , 2024. 4
[66] Zhendong Yang, Ailing Zeng, Chun Yuan, and Yu Li. Effec-
tive whole-body pose estimation with two-stages distillation.
In ICCV, pages 4212–4222, 2023. 4
[67] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu
Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan
Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang,
Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong,
and Jie Tang. Cogvideox: Text-to-video diffusion models
with an expert transformer, 2024. 2, 3, 7, 13
[68] Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan,
Di Zhang, and Wenhan Luo. Stylemaster: Stylize your
video with artistic generation and translation. CoRR ,
abs/2412.07744, 2024. 2, 6, 8
[69] Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yu-
jun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-
preserving text-to-video generation by frequency decompo-
sition. CoRR , abs/2411.17440, 2024. 2
[70] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao,
Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jin-
gren Zhou. I2vgen-xl: High-quality image-to-video synthe-
sis via cascaded diffusion models. CoRR , abs/2311.04145,
2023. 3, 13
[71] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Wein-
berger, and Yoav Artzi. Bertscore: Evaluating text genera-
tion with BERT. In ICLR , 2020. 5
[72] Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong
Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan.
Omg-llava: Bridging image-level, object-level, pixel-level
reasoning and understanding. In NeurIPS , 2024. 14
[73] Yiming Zhang, Zhening Xing, Yanhong Zeng, Youqing
Fang, and Kai Chen. PIA: your personalized image anima-
tor via plug-and-play modules in text-to-image models. In
CVPR , pages 7747–7756, 2024. 13
[74] Min Zhao, Rongzhen Wang, Fan Bao, Chongxuan Li, and
Jun Zhu. Controlvideo: Adding conditional control for one
shot text-to-video editing. CoRR , abs/2305.17098, 2023. 2,
3, 7, 13
[75] Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu,
and Xi Li. Cami2v: Camera-controlled image-to-video dif-
fusion model. CoRR , abs/2410.15957, 2024. 2, 3, 13
[76] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen,
Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang
You. Open-sora: Democratizing efficient video production
for all. CoRR , abs/2412.20404, 2024. 2
[77] Yong Zhong, Min Zhao, Zebin You, Xiaofeng Yu, Chang-
wang Zhang, and Chongxuan Li. Posecrafter: One-shot per-
sonalized video synthesis following flexible pose control. In
ECCV , pages 243–260, 2024. 3, 13
[78] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe,
and Noah Snavely. Stereo magnification: learning view syn-
thesis using multiplane images. ACM Trans. Graph. , 37(4):
65, 2018. 6
Appendix Overview
The appendix presents more details and additional results
not included in the main paper due to page limitations. The
list of items included is:
• Limitation in Section §A.
• Ethics Statement in Section §B.
• Extended Related Work in Section §C.
• Extended Dataset Construction Details in Section §D.
• More Statistics Information of IRScore in Section §E.
• Detailed Setups in Section §F.
• Extended Experiment Results and Analyses in Section
§G.
A. Limitation
Despite the advancement of our proposed framework, sev-
eral limitations may remain:
Firstly, the diversity of annotated data is constrained
by the capabilities of the current annotation tools, which
may limit the variety of generated content. Moreover,
the scarcity of real-world data introduces potential domain
gaps, reducing the model’s generalizability in practical sce-
narios.
Secondly, due to inherent model limitations, hallucina-
tions may occur, resulting in inaccurate structured captions
and consequently degrading the quality of generated videos.
A possible direction to mitigate this issue is to develop an
end-to-end approach that jointly interprets complex condi-
tions and handles video generation.
Lastly, the additional condition-understanding modules
inevitably increase inference time. However, our empiri-
cal results suggest that the performance gains from these
modules are substantial, and future work may explore more
efficient architectures or optimization techniques to balance
speed and accuracy.
B. Ethics Statement
This work relies on publicly available datasets and manu-
ally constructed datasets, ensuring that all data collection
and usage adhere to established privacy standards. We rec-
ognize that automatic annotation processes may introduce
biases, and we have taken measures to evaluate and mit-
igate these biases. Nonetheless, we remain committed to
ongoing improvements in this area.
By enhancing video generation capabilities,
Any2Caption could inadvertently facilitate negative
societal impacts, such as the production of deepfakes and
misinformation, breaches of privacy, or the creation of
harmful content. We, therefore, emphasize the importance
of strict ethical guidelines, robust privacy safeguards,
and careful dataset curation to minimize these risks and
promote responsible research practices.
C. Extended Related Work
C.1. Text-to-Video Generation
The development of video generation models has progressed
from early GAN- and VAE-based approaches [7,
12, 19, 20, 59] to the increasingly popular diffusion-
based methods [5, 6, 49, 73]. Among these, diffusion-
in-transformer (DiT) architectures, such as OpenAI’s Sora
[3] and HunyuanVideo [34], have demonstrated remark-
able performance, producing photorealistic videos over ex-
tended durations. Controllable video generation [11, 16, 21,
53] has become an essential aspect of this field. Initially, re-
search efforts centered predominantly on text-to-video gen-
eration [24, 52], which remains the most common approach.
However, relying solely on text prompts can be insuffi-
cient for accurately capturing user intent, prompting explo-
ration into other conditioning inputs such as static images
[17, 62], user sketches [58, 74], human poses [33, 45, 77],
camera perspectives [21, 75], and even additional videos
[14, 32, 70]. Given this diversity of potential conditions,
unifying them into a single “any-condition” video genera-
tion framework is highly valuable.
C.2. Controllable Video Generation
Recent methods like VideoComposer [58], Ctrl-Adapter
[41], and ControlVideo [74] have investigated any-
condition video generation. Nevertheless, they still strug-
gle with comprehensive controllability due to the complex-
ity of multiple modalities and the limited capacity of stan-
dard diffusion or DiT encoders to interpret them. Inspired
by the strong multimodal reasoning capabilities of modern
MLLMs [39, 43, 56], we propose leveraging an MLLM to
consolidate all possible conditions into a unified reasoning
process, producing structured dense captions as inputs to
a backbone Video DiT. SoTA DiT models already exhibit
the capacity to interpret dense textual descriptions, as long
as the input captions are sufficiently detailed in depicting
both the scene and the intended generation goals. Build-
ing on this, our MLLM-based condition encoder directly
addresses the comprehension bottleneck, theoretically en-
abling higher-quality video generation. To our knowledge,
this work is the first to develop an MLLM specifically
tailored for any-condition video generation. Because the
caption-generation mechanism is decoupled from DiT, our
proposed Any2Caption can be integrated into existing
DiT-based methods without additional retraining.
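The decoupling described above can be sketched as two independent callables, where the generator only ever receives text. All names and signatures below are hypothetical stand-ins chosen for illustration; they are not the interface of Any2Caption or of any specific video generator.

```python
from typing import Any, Callable, Dict, List

# Hypothetical interfaces: neither signature comes from the paper.
# Interpreter: (short prompt, conditions) -> structured caption text.
Interpreter = Callable[[str, Dict[str, Any]], str]
# Generator: caption text -> video frames (placeholder strings here).
Generator = Callable[[str], List[str]]


def controllable_generation(short_prompt: str, conditions: Dict[str, Any],
                            interpret: Interpreter, generate: Generator) -> List[str]:
    """Stage 1 turns conditions into text; Stage 2 only ever sees that text."""
    structured_caption = interpret(short_prompt, conditions)
    return generate(structured_caption)


def toy_interpreter(short_prompt: str, conditions: Dict[str, Any]) -> str:
    # Stand-in for the MLLM: just records which conditions were supplied.
    names = ", ".join(sorted(conditions))
    return f"{short_prompt} [conditions interpreted: {names}]"


def toy_generator(caption: str) -> List[str]:
    # Stand-in for a frozen text-to-video backbone.
    return [f"frame {i}: {caption}" for i in range(2)]


if __name__ == "__main__":
    frames = controllable_generation(
        "A woman walks in a minimalist, modern room.",
        {"depth": None, "camera": None},
        toy_interpreter,
        toy_generator,
    )
    print(frames[0])
```

Because the generator's input type is plain text, swapping in a different backbone only requires a different `generate` callable, which is the sense in which no retraining of the generator is needed.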
C.3. Video Captioning
Our approach is closely related to video recaptioning re-
search, as our MLLM must produce dense captions based
on the given conditions. In text-to-video settings, prior
work [15, 27, 47, 67] has demonstrated the benefits of re-
captioning videos to obtain more detailed textual annota-
tions, thereby improving the training of longer and higher-
quality video generation via DiT. ShareGPT4Video [10],
for example, employs GPT-4V [1] to reinterpret video con-
tent and produce richer captions. MiraData [30] introduces
structured dense recaptioning, while InstanceCap [15] fo-
cuses on instance-consistent dense recaptioning. Although
we also pursue structured dense captions to enhance gen-
eration quality, our method diverges fundamentally from
these previous approaches. First, because DiT models are
already sufficiently powerful, we directly adopt an off-the-
shelf Video DiT without incurring the substantial cost of
training it with dense captions. Instead, we train an MLLM
as an any-condition encoder at a comparatively lower cost;
in the text-to-video scenario, for instance, we only need to
train on pairs of short and dense captions, which are far
easier and more abundant to obtain. Second, prior methods
that recaption the entire video risk introducing noise or
even hallucinated content due to the current limitations of
MLLMs in video understanding, potentially undermining
DiT training quality, whereas our framework avoids this is-
sue. Most importantly, while these approaches may rely on
dense-caption-trained DiT models, real-world inference of-
ten involves very concise user prompts, creating a mismatch
that can diminish final generation quality.
C.4. Multimodal Large Language Models
Recent advances in Large Language Models (LLMs) [64]
have catalyzed a surge of interest in extending their capa-
bilities to multimodal domains [37–39, 61, 72]. A num-
ber of works integrate a vision encoder (e.g., CLIP [43],
DINOv2 [55], OpenCLIP [38]) with an LLM, often through
a lightweight “connector” module (e.g., MLP [43], Q-
former [35]), enabling the model to process image and
video inputs with minimal additional training data and pa-
rameters [39, 43, 63]. These approaches have demonstrated
promising performance on tasks such as image caption-
ing, visual question answering, and video understanding.
Beyond purely visual data, some researchers have inves-
tigated broader modalities, such as 3D motion [9, 28] or
audio [13, 63], thereby expanding the application range of
multimodal LLMs. Despite these advances, most existing
MLLMs focus on a limited set of visual modalities and do
not readily accommodate more specialized inputs like hu-
man pose or camera motion. This gap restricts their ability
to handle more diverse and complex conditions in the con-
trollable video generation field. In contrast, our work targets
a broader spectrum of modalities, aiming to build a single
model capable of interpreting and unifying image, video,
human pose, and camera conditions. Specifically, we aug-
ment an existing MLLM with dedicated encoders for mo-
tion and camera features, thereby equipping the model to
process arbitrary multimodal conditions and facilitate
controllable video generation.
D. Extended Dataset Details
Visualization of Short and Structured Caption. Here,
we visualize the short and structured captions in Figs. 7 and 8.
Notably, the structured caption captures the video content
in greater detail and across multiple aspects. In contrast,
the short caption deliberately omits any information al-
ready supplied by a non-text condition—for example, cam-
era movement is excluded in Fig. 7’s short caption but in-
cluded in Fig. 8 because it is not specified by the provided
multiple-identity images. Moreover, we visualize the word
distribution of the structured captions in Any2CapIns in
Fig. 9.
Prompt Visualization for Short Caption Construction.
In Tab. 7 and 8, we show the system prompts used by GPT-
4V to generate short captions. The prompt explicitly in-
structs GPT-4V to consider the given conditions compre-
hensively and produce short prompts that focus on informa-
tion not covered by the non-textual inputs. For instance,
when multiple identities are specified, the short prompt
should avoid repeating their appearance attributes and in-
stead highlight interactions among the identities. Con-
versely, when depth is the input condition, the short prompt
should include more detailed appearance-related informa-
tion.
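The condition-aware rule described above can be illustrated with a small sketch. The coverage mapping is an assumption distilled from the examples in this section (identity images supply appearance, depth maps supply layout, camera poses supply camera movement); the aspect names and the identifiers `COVERED_BY_CONDITION` and `aspects_for_short_caption` are hypothetical and are not part of the actual data pipeline.

```python
# Hypothetical sketch: decide which aspects a short caption should still
# describe, given the non-text conditions that accompany it.
ALL_ASPECTS = ["appearance", "interaction", "action", "background",
               "layout", "camera movement", "style"]

# Assumed coverage: what each non-text condition already specifies.
COVERED_BY_CONDITION = {
    "identities": {"appearance"},
    "depth": {"layout"},
    "camera": {"camera movement"},
}

def aspects_for_short_caption(conditions):
    """Return the aspects left for the text prompt, in a stable order."""
    covered = set()
    for name in conditions:
        covered |= COVERED_BY_CONDITION.get(name, set())
    return [a for a in ALL_ASPECTS if a not in covered]

print(aspects_for_short_caption(["identities"]))
print(aspects_for_short_caption(["camera", "depth"]))
```

Under these assumptions, a multi-identity input drops appearance from the short prompt but keeps camera movement, whereas a camera input drops camera movement.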
E. More Statistics of IRScore
We generate a total of 15,378 question-answer (QA) pairs,
averaging 19.2 pairs per structured caption. Fig. 10 presents
the distribution of constructed questions across different as-
pects in the structured caption, and Tab. 9 shows representa-
tive QA pairs for each aspect. Notably, questions under the
main object category emphasize fine-grained details such
as clothing color or hairstyles, while action questions focus
on object interactions and movements. This level of speci-
ficity allows us to rigorously assess whether the generated
captions are both complete and precise.
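To make the role of these QA pairs concrete, the sketch below groups QA pairs by aspect and computes the fraction judged correct. It is only an illustration: the exact scoring formula is not restated in this appendix, the `is_correct` flags are made-up placeholders for the verdict of some external judge, and `QAPair` and `per_aspect_accuracy` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class QAPair:
    aspect: str       # e.g. "main object", "background", "camera", "style", "action"
    question: str
    answer: str
    is_correct: bool  # assumed verdict of an external judge on the predicted caption

def per_aspect_accuracy(pairs):
    """Fraction of QA pairs judged correct, grouped by aspect."""
    hits, totals = defaultdict(int), defaultdict(int)
    for p in pairs:
        totals[p.aspect] += 1
        hits[p.aspect] += int(p.is_correct)
    return {a: hits[a] / totals[a] for a in totals}

# Questions and answers taken from Tab. 9; the correctness flags are illustrative.
pairs = [
    QAPair("main object", "What color is the young woman's T-shirt?", "Light blue.", True),
    QAPair("main object", "What is the young woman's hair length?", "Long.", False),
    QAPair("camera", "How does the camera move in the video?", "It pans to the right.", True),
]
print(per_aspect_accuracy(pairs))
```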
F. Detailed Setups
F.1. Detailed Testing Dataset
Here, we present the statistics of the test set in Tab. 10, which
covers four types of single conditions (i.e., Depth, Camera,
Identities, and Human pose) and four types of compositional
conditions (i.e., Camera+Identities, Camera+Depth,
Identities+Depth, and Camera+Identities+Depth). Each
category contains 200 instances.
F.2. Detailed Training Procedure
We employ a two-stage training process to enhance the
alignment and interpretability of multimodal conditions in
Any2Caption .
Short Caption: A cozy and well-lit home. Start by showing a dining table with chairs and a chandelier, then capture the living room with a sofa and fireplace. Move towards the large windows to reveal the deck outside and the grassy area. Emphasize the warm and inviting atmosphere.
Structured Caption:
1. Dense caption: An elegantly composed interior scene unfolds, where the dining and living areas are interwoven through warm tones and architectural cohesion. The narrative begins with a glimpse of a chandelier-lit dining table, then flows organically toward a plush seating arrangement and a glowing fireplace. A slow and smooth camera shift unveils tall glass windows that eventually open up to a serene deck, gracefully bridging interior comfort with exterior peace.
2. Main object caption: Key interior elements include a traditional dining ensemble centered under a sparkling chandelier, and a living room anchored by a soft sofa and an old-fashioned fireplace—framed by panoramic glass windows that serve as portals to nature.
3. Background caption: The setting is enriched by a timeless ambiance: gentle upholstery, a vintage-style wall clock, and natural daylight pouring through large windows. Outside, a sunlit wooden patio bordered by greenery extends the space beyond the glass.
4. Camera caption: Captured with a steady cam at eye-level. Begins with a lateral sweep from the dining corner before proceeding with a slow dolly movement toward the windowed wall, culminating in a reveal of the outdoor patio.
5. Style caption: Cozy and classic, with a warm color palette, soft lighting gradients, and a slightly rustic design aesthetic. The stylistic cohesion evokes a calm and welcoming atmosphere.
6. Action caption: Executing a deliberate leftward pan, the camera surveys the interior before advancing forward, transitioning the viewer’s perspective from indoor warmth to the outdoor environment.
Short Caption + Camera → Video
Figure 7. Illustrations of constructed short and structured captions under the camera-to-video generation.
Short Caption: A person stands in a vast field under a stormy sky. They raise their hands to their face, then above their head, transforming their headpiece into a horn-like structure. The camera moves backward, capturing the gloomy atmosphere and their confident stance.
Structured Caption:
1. Dense caption: Amid a thunderous, cloud-laden sky, a lone figure clad in futuristic, obsidian-colored armor stands motionless at the center of an expansive, empty field. The moment intensifies as the individual lifts their arms toward their face in a calculated motion. Gradually, their helmet morphs into an imposing, horned structure. As this metamorphosis completes, the figure returns to a commanding stance, now equipped with two elegant, blade-like weapons—one held firmly in each hand.
2. Main object caption: The armored character—fully suited in sleek, dark tech-gear—occupies the center of the frame. Their headgear shifts shape, evolving into horn-like extensions as they raise their hands in a slow, deliberate movement.
3. Background caption: Sweeping grasslands stretch beneath a foreboding sky, with rolling hills far in the distance. The atmosphere is heavy with tension, and the setting remains still, emphasizing the transformation's magnitude and isolating the character in a vast, ominous world.
4. Camera caption: Camera begins with a mid-range, centered composition. As the sequence unfolds, it subtly dollies backward to reveal more of the character’s form while maintaining a consistent eye-level perspective.
5. Style caption: Dramatic and futuristic, with a moody color palette dominated by greys and blacks. The visual tone draws from dystopian cinema, focusing on solitude, metamorphosis, and subtle power.
6. Action caption: A solitary warrior lifts her arms toward her helmet, triggering its transformation. Once reformed into a horned shape, she lowers her arms, retrieves dual blades, and assumes a poised, forward-facing stance.
Short Caption + Multi-IDs → Video
Figure 8. Illustrations of constructed short and structured captions under the multiIDs-to-video generation.
Figure 9. Word cloud of different structured captions in the Any2CapIns dataset, showing the diversity. Panels: (1) Dense Caption, (2) Main Object Caption, (3) Background Caption, (4) Action Caption, (5) Style Caption, (6) Camera Caption.
Multi-IDs
Here is the scenario: We have an MLLM model that supports a text & image-conditioned intrinsic-video-caption generation task. The system
input consists of:
1. A reference image composed of 2-3 horizontally stitched images (provided by the user), each stitched image containing one or several
target objects for reference); and
2. A concise textual prompt (referred to as text B, the user’s instruction).
The model’s output is a detailed descriptive caption (**text A**) that thoroughly describes the video corresponding to the user’s input prompt
(**text B**) in great detail. Your task is to perform a reverse engineering. Based on the given reference image (the target objects) and the
detailed target video caption (text A), you need to generate a **reasonable and concise user prompt (text B)** through your understanding,
analysis, and imagination. You must adhere to the following rules:
1. Text A is a dense caption of a video, including all the key objects, their attributes, relationships, background, camera movements, style,
and more. Carefully analyze this caption for all relevant details.
2. Analyze the provided reference images in detail to identify the differences or missing details compared to the target video description.
These may include environment details, the interaction between objects, the progression of actions, camera movements, style, or any elements
not covered by the reference image. Based on these analyses, generate the user’s instructions.
3. The user’s prompt must include the following aspects: first, an overall description of where the target objects are and what they are doing,
along with the temporal progression of their actions. Then, it should describe the background, style, and camera movements.
4. If the target video introduces new objects not present in the reference images, the user’s prompt should describe the attributes of the new
target objects and their interactions with the other target objects.
5. If the video’s style differs from the reference, briefly describe the style in a few words.
6. When the background needs to be described, include details about people, settings, and styles present in the background.
7. Avoid repeating information that can be inferred from the reference images, and eliminate redundant descriptions in the user prompt.
8. The user prompt (text B) must be written in simple wording, maintaining a concise style with short sentences.
9. The user’s instructions should vary in expression; For example, prompts do not always need to start with the main subject. They can begin
with environmental details, camera movements, or other contextual aspects.
Here are three examples representing the desired pattern:
====================================================================================================
[In-context Examples]
====================================================================================================
[Input]
Table 7. Demonstration of the prompt used for GPT-4V to generate the short prompt when the input condition is the multi-IDs.
Stage-1: Alignment learning. This stage focuses on
aligning features extracted by the camera encoder with the
LLM feature space. To achieve this, we first extract camera
movement descriptions (e.g., fixed, backward, pan to the
right) from the camera captions in Any2CapIns to construct
a camera movement description dataset. We then introduce
two specialized tokens, <|camera_start|> and <|camera_end|>,
at the beginning and end of the camera feature embeddings.
During training, only the camera encoder is optimized, while
all other parameters in Any2Caption remain frozen.
Similarly, for motion alignment, we construct a motion
description dataset by extracting action descriptions (e.g.,
walking, dancing, holding) from the action captions in
Any2CapIns. We then freeze all model parameters except
those in the motion encoder to ensure the LLM effectively
understands motion-related conditions.
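The freezing scheme in Stage-1 amounts to selecting a single trainable module per alignment run. The following framework-agnostic sketch assumes that parameters are addressed by dotted names; the parameter names and the helper `trainable_mask` are hypothetical and do not reflect the actual Any2Caption code.

```python
def trainable_mask(param_names, trainable_prefix):
    """Map each parameter name to True (updated) or False (frozen)."""
    return {name: name.startswith(trainable_prefix) for name in param_names}

# Hypothetical parameter names, for illustration only.
PARAMS = [
    "vision_encoder.proj.weight",
    "camera_encoder.layer0.weight",
    "motion_encoder.layer0.weight",
    "llm.layers.0.attn.weight",
    "llm.lm_head.weight",
]

# Camera alignment run: only the camera encoder is updated.
camera_run = trainable_mask(PARAMS, "camera_encoder.")
# Motion alignment run: only the motion encoder is updated.
motion_run = trainable_mask(PARAMS, "motion_encoder.")

print([n for n, t in camera_run.items() if t])
print([n for n, t in motion_run.items() if t])
```

In a PyTorch-style setup, such a mask would typically be applied by setting `requires_grad` on each parameter accordingly.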
Stage-2: Condition-Interpreting Learning. After align-
ment learning, we initialize Any2Caption with the pre-
trained Qwen2-VL, motion encoder, and camera encoder
Depth
Here is the scenario: We have an MLLM model that supports a text & image-conditioned intrinsic-video-caption generation task. The system
input consists of:
1. A reference image composed of 3-5 horizontally stitched depth maps in temporal sequence (provided by the user, each map containing
depth information for reference); and
2. A concise textual prompt (referred to as text B, the user’s instruction).
The model’s output is a detailed descriptive caption (text A) that thoroughly describes the video corresponding to the user’s input prompt (text
B) in great detail. Now, I need you to perform a reverse engineering task. Based on the given reference image (the depths) and the detailed
target video caption (text A), you must generate a reasonable and concise user prompt (text B) through your understanding, analysis, and
imagination. To ensure accurate and effective outputs, follow these rules strictly:
1. Text A is a dense caption of a video, including all the key objects, their attributes, relationships, background, camera movements, style,
and more. Carefully analyze this caption to extract the necessary details.
2. Since the depth information already provides the necessary geometric outlines and layout details. Do not repeat this information in the
user prompt. Instead, focus on the aspects not covered by the depth maps.
3. The user’s instruction should highlight details not included in the depth map, such as environmental details, the appearance of the
subjects, interactions between subjects, the progression of actions, relationships between the subjects and the environment, camera movements,
and overall style.
4. For dense depth maps (more than 5 maps), assume the maps provide the camera movements and actions between objects, focusing on
describing the appearance of the subjects and environment, the atmosphere, and subtle interactions between subjects and their environment.
5. For sparse depth maps (5 maps or fewer), assume the maps only provide scene outlines. Emphasize details about the subjects’ appearance,
environment, interactions between subjects, relationships between subjects and the environment, and camera movements.
6. The user prompt (text B) must be written in simple wording, maintaining a concise style with short sentences, with a total word count not
exceeding 100.
7. Your output should be a continuous series of sentences, not a list or bullet points.
8. The user’s instructions should vary in expression; they don’t always need to begin with a description of the main subject. They could also
start with environmental details or camera movements.
Here are three examples representing the desired pattern:
====================================================================================================
[In-context Examples]
====================================================================================================
[Input]
Table 8. Demonstration of the prompt used for GPT-4V to generate the short prompt when the input condition is the depth.
Figure 10. QA pairs proportion in structured captions. Main Object Caption: 36.84%; Background Caption: 17.54%; Action Caption: 19.3%; Camera Caption: 10.53%; Style Caption: 15.79%.
weights. We then employ a progressive mixed training strategy,
updating only the lm_head while keeping the multimodal
encoders frozen. The training follows a sequential order based
on condition complexity: identities ⇒ human pose ⇒ camera ⇒
depth. Correspondingly, the integration ratio of additional
vision/text instruction datasets is progressively increased,
set at 0.0, 0.4, 0.6, and 0.8, respectively, ensuring a balanced
learning process between condition-specific specialization
and generalization.
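The progressive schedule can be sketched as follows. This is a minimal illustration that assumes the integration ratio is the per-sample probability of drawing from the additional instruction data; that reading, the toy sample pools, and the helper `mixed_batch` are assumptions rather than details confirmed by the training code.

```python
import random

# Stage order and joint-training ratios as stated in the text.
SCHEDULE = [("identities", 0.0), ("human pose", 0.4), ("camera", 0.6), ("depth", 0.8)]

def mixed_batch(condition_data, instruction_data, ratio, batch_size, rng):
    """Fill each batch slot from the instruction pool with probability `ratio`."""
    batch = []
    for _ in range(batch_size):
        pool = instruction_data if rng.random() < ratio else condition_data
        batch.append(rng.choice(pool))
    return batch

rng = random.Random(0)
for condition, ratio in SCHEDULE:
    # Toy placeholder pools standing in for the real datasets.
    cond_pool = [f"{condition}-sample-{i}" for i in range(100)]
    instr_pool = [f"instruction-sample-{i}" for i in range(100)]
    batch = mixed_batch(cond_pool, instr_pool, ratio, batch_size=1000, rng=rng)
    share = sum(s.startswith("instruction") for s in batch) / len(batch)
    print(f"{condition}: instruction share = {share:.2f}")
```

With a ratio of 0.0 the first stage sees only condition-specific samples, and the share of general instruction data grows with each later stage.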
F.3. Detailed Implementations
In Tab. 11, we list the detailed hyperparameter settings in
two stages. All the training is conducted on 8× A800 (80G) GPUs.
G. Extended Experiment Results and Analyses
G.1. Capability for Understanding Complex Instructions
We further examine Any2Caption ’s ability to handle
complex user instructions, particularly regarding whether
it accurately captures the user’s intended generation targets.
From Fig. 11, we observe that the model focuses precisely on
the user-specified content, such as a “woman warrior” as the
main object or a background “filled with chaos and destruction”,
when producing structured captions. In contrast, a
short caption combined with condition captions often in-
cludes extraneous objects or background details present in
the identity images, which distract from the user’s intended
targets in the final video generation.
Additionally, we assess the model’s performance on in-
structions containing implicit objects or actions, as shown
in Figs. 12 and 13. In these examples, the model correctly
interprets phrases like “the most right person” as “a young
Black woman with long, curly brown hair, wearing a black
and white outfit” and similarly associates implicitly speci-
fied objects with the provided conditions, generating struc-
tured captions that align with the user’s goals.
Lastly, Fig. 16 compares videos generated using differ-
ent captions. The results indicate that structured captions
Aspect QA Pairs
Main Object
What is the young woman adjusting as she walks down the corridor? Her wide-brimmed hat.
What color is the young woman’s T-shirt? Light blue.
How does the young woman feel as she walks down the corridor? Happy and carefree.
What is the young woman wearing? Light blue t-shirt with pink lettering, blue jeans, and a wide-brimmed hat.
What is the young woman’s hair length? Long.
What is the position of the young woman in the frame? In the center of the frame.
What is the main object in the video? A large shark.
What is the color of the underwater scene? Blue.
What are the two scientists wearing? White lab coats and gloves.
What is the first scientist using? A microscope.
Background
Where is the young woman walking? Down a corridor.
What time of day does the scene appear to be set? Daytime.
What can be seen in the background of the corridor? Beige walls and large windows.
What is the weather like in the video? Clear.
Where is the shark located? On the ocean floor.
What surrounds the shark in the video? Smaller fish.
Where is the laboratory setting? In a brightly lit environment with shelves filled with bottles.
What detail does the background highlight? The scientific setting with static emphasis.
Camera
How does the camera follow the young woman? Moving backward.
What is the camera’s height relative to the person? Roughly the same height as the person.
What shot type does the camera maintain? Medium close-up shot of the upper body.
How does the camera position itself to capture the subject? At a higher angle, shooting downward.
How does the camera capture the environment? From a medium distance.
How is the camera positioned? At approximately the same eye level as the subjects, maintaining a close-up shot.
How does the camera move in the video? It pans to the right.
Style
What is the style of the video? Casual and candid.
What kind of design does the corridor have? Modern and clean design.
What style does the video portray? Naturalistic style with clear, vivid visuals.
What does the video style emphasize? Clinical, high-tech, and scientific precision.
What is the color theme of the lighting? Bright and cool.
What kind of atmosphere does the laboratory have? Professional and scientific.
Action
What does the young woman do with both hands occasionally? Adjusts her hat.
What is the young woman doing as she moves? Walking forward with her hands on her hat.
What is the main action of the shark in the video? Lying motionless.
What is the movement of the fish like? Calm and occasionally darting.
What is the movement of the first scientist at the beginning? Examines a microscope.
What task is the second scientist engaged in? Handling a pipette and a beaker filled with green liquid.
How does the second scientist transfer the liquid? Carefully using a pipette into the beaker.
Are there any noticeable movements in the background? Occasional small particles floating.
Table 9. Demonstration of generated question-answer pairs utilized in IRScore calculation.
Type | #Inst. | #Condi. | Short Cap. #Avg. Len. | Structured Cap. #Avg. Len.
Identities | 200 | 350 | 65.28 | 284.97
Camera | 200 | 200 | 50.25 | 208.01
Depth | 200 | 200 | 54.22 | 225.09
Human Pose | 200 | 200 | 58.38 | 259.03
Camera+Identities | 200 | 622 | 53.41 | 209.17
Camera+Depth | 200 | 400 | 51.43 | 208.81
Identities+Depth | 200 | 555 | 53.14 | 286.83
Camera+Identities+Depth | 200 | 756 | 58.35 | 289.21
Table 10. Statistics of the constructed test datasets. #Inst. denotes
the number of instances, and #Condi. indicates the number of
unique conditions. Short Cap. #Avg. Len represents the average
caption length of short captions, and Structured Cap. #Avg. Len.
represents the average caption length of structured captions.
significantly improve both the smoothness of motion and
the overall consistency of the generated videos.
G.2. More Video Visualization
Here, we compare the results of different customized mod-
els [31] after integrating structured captions. Fig. 14 shows
camera-controlled video generation results, and Fig. 15 il-
lustrates depth-controlled outcomes. We observe that struc-
tured captions improve image quality and motion smooth-
ness by providing richer scene details.
For multi-identity-controlled video generation, as depicted
in Figs. 16, 17, and 18, structured captions lead to
more expressive and realistic representations of the refer-
enced identities, with more accurate colors and smoother
actions.
Stage-1: Alignment Learning; Stage-2: Condition-Interpreting Learning
Configuration | Stage-1 Camera | Stage-1 Motion | Stage-2 Identities | Stage-2 Human pose | Stage-2 Camera | Stage-2 Depth
Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW
Precision | bfloat16 | bfloat16 | bfloat16 | bfloat16 | bfloat16 | bfloat16
Learning Rate | 5e-5 | 5e-5 | 5e-5 | 5e-5 | 5e-5 | 1e-5
Weight Decay | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01
Joint Train Ratio | 0.0 | 0.0 | 0.0 | 0.4 | 0.6 | 0.8
Betas | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99) | (0.9, 0.99)
Dropout Possibility | 0.0 | 0.0 | 0.4 | 0.6 | 0.6 | 0.6
Dropout (Short Cap.) | 0.0 | 0.0 | 0.6 | 0.6 | 0.6 | 0.6
Batch Size Per GPU | 4 | 4 | 4 | 4 | 4 | 4
Training Data | Camera Movement Description Dataset (Manual) | Action Description Dataset (Manual) | MultiIDs | Human Pose, LLaVA-150K | Camera, LLaVA-150K | Depth, LLaVA-150K, Alpaca-50K
Table 11. Training recipes for Any2Caption .
Input Short Caption: A woman warrior in armor wields a shield and sword, fighting a soldier in a chaotic, battle-torn room.
Input Multiple Identities
Gold Video
Short Caption w/ Condition Caption:
A woman warrior in armor wields a shield and sword, fighting a soldier in a chaotic, battle-torn room.
ID Image 1: A behind-the-scenes moment from a superhero film set, featuring Wonder Woman, Batman, and Superman in costume, with a film crew and green screen in the background.
ID Image 2: A round shield with intricate designs and a central starburst pattern.
Introduce noise information
Predicted Structured Caption:
1. Dense caption: Set in a war-torn interior littered with fractured machinery and scattered debris, a fierce confrontation unfolds between a battle-hardened woman and an armored soldier. Standing amid the wreckage, she brandishes a polished sword and a finely adorned shield, moving with the precision of a trained combatant. The surrounding space, filled with remnants of violence, amplifies the urgency and focus of the encounter. Despite the destruction, both fighters hold their ground, locked in a tense standoff.
2. Main object caption: A female warrior clad in segmented armor combining dark leather and polished metal stands prominently. She wields a sword in her right hand and a shield in her left—its ornate design featuring an embossed emblem at its center. Her posture is alert and deliberate, facing off against a cape-wearing soldier lurking just behind her in full tactical gear.
3. Background caption: The room is a battlefield—walls partially collapsed, wires dangling, shattered glass on the floor. Military-grade vehicles and fragmented gear lie overturned, and thick shadows stretch across the space, lit only by scattered sparks and faint ambient lighting from damaged fixtures.
4. Camera caption: The scene is captured from a static low-angle position, slightly tilted upward to magnify the warrior's authority. Framing is tight enough to highlight detail in the characters’ armor but wide enough to expose the chaotic environment surrounding them.
5. Style caption: Gritty and immersive, the visual aesthetic leans toward post-apocalyptic realism, employing low-key lighting and subdued tones to emphasize both the destruction and the emotional weight of the scene.
6. Action caption: With sharp, calculated motions, the woman shifts between offense and defense—raising her shield to deflect and stepping forward to strike with her blade. Her movements convey strength, control, and combat experience.
Figure 11. Illustrations of input short caption, predicted structured caption, and short caption combined with condition captions. The condition captions may introduce noisy information due to the focus distraction of the background of input identities, for example, the Batman, Superman, film crew.
Input Short Caption: The most right person in the picture 1 is dancing on the desk in picture 2.
Input Multiple Identities
Implicit objects for generation
Predicted Structured Caption:
1. Dense caption: Inside a sunlit modern kitchen, a young woman confidently dances atop a wooden dining table, her movements vibrant and self-assured. Clad in a monochrome outfit that blends fashion with flair, she glides, twirls, and gestures to an internal rhythm. Light pours in from generous windows, casting soft highlights on the table’s surface—where everyday elements like fruit and a meal serve as subtle reminders of domestic life. The warm tones of the wood and the greenery visible outside enhance the feeling of life, spontaneity, and celebration.
2. Main object caption: A confident Black woman with voluminous, curly brown hair performs expressive dance movements on a dining table. She wears a black-and-white color-block jacket, slim black pants, and ankle boots. Every motion of her body radiates joy, from the sway of her arms to the rhythm in her steps, all anchored at the center of the scene.
3. Background caption: The room is bright and contemporary—featuring polished wooden cabinets, clean surfaces, and a dining table set casually with fruit and a dish. Beyond the expansive windows lies a view of lush foliage, infusing the interior with calm, natural energy. Nothing in the setting feels staged—it’s a real kitchen full of warmth and life.
4. Camera caption: The scene is observed from a slightly elevated camera angle, keeping the subject fully in frame while subtly emphasizing the table’s height and the dancer’s grounded energy.
5. Style caption: Effortless and candid. The style emphasizes natural light, domestic textures, and a relaxed color palette, evoking authenticity and comfort with a hint of artistic spontaneity.
6. Action caption: She steps, spins, and gestures with increasing enthusiasm, transforming the tabletop into her stage. Her expressive movements seem improvised yet intentional, turning an everyday moment into a spontaneous performance.
Figure 12. Illustrations of predicted structured captions based on the input multiple identities and the short instruction that expresses the
implicit objects for the video generation.
Input Multiple Identities
Input Human Pose
Input Short Caption: The right person in picture 1 is following the human pose movement in the video while the other person is watching beside them.
Implicit objects and action for generation
Predicted Structured Caption:
1. Dense caption: In a sun-drenched home interior, a young woman executes a series of delicate and measured dance-like motions. With arms raised and gestures flowing seamlessly from one to the next, her performance is composed and deliberate, harmonizing with the serene domestic surroundings. Her movement unfolds within a modest space, where light filters softly through sheer curtains, accentuating the gentle rhythm of the moment.
2. Main object caption: The focus is a light-skinned woman in her early twenties, wearing a pastel blue shirt and fitted dark jeans. Her long brown hair sways gently as she smiles and moves with rhythmic grace. Just off to the side, a man dressed in a black coat stands quietly, watching with a neutral expression and relaxed posture, offering a quiet counterpoint to her energy.
3. Background caption: Set in a cozy, well-maintained living room with soft white walls and a large window draped in translucent curtains. Minimal yet thoughtful decor—like a clean-lined sofa, a floor plant, and a small lamp-lit side table—evoke warmth and simplicity. Bright daylight streams in, creating a welcoming and peaceful domestic space.
4. Camera caption: Stationary low-angle setup with the lens tilted slightly upward, capturing the dancer’s full body in frame. The perspective subtly magnifies the performer’s presence and motion while anchoring her firmly in the surrounding domestic setting.
5. Style caption: Simple and unfiltered, with a naturalistic tone. The visual style mimics an everyday candid recording, emphasizing authenticity over polish, with even lighting and muted colors contributing to the understated charm.
6. Action caption: The woman moves with gentle enthusiasm—stretching, pivoting, and sweeping her arms in smooth arcs. Her expressive dance contrasts with the stillness of the man nearby, who watches calmly, contributing to the quiet intimacy of the scene.
Figure 13. Illustrations of predicted structured captions based on the input multiple identities and the short instruction that expresses the
implicit objects and the action for the target video generation.
Example-1
Short Caption: A serene video of a large house with a red roof and a spacious porch, surrounded by lush greenery. A peaceful countryside setting with vibrant colors and a tranquil atmosphere.
Short Caption + Camera → Video
Structured Caption + Camera → Video
Example-2
Short Caption: A well-lit dining and living room with elegant and classic decor. The dining table is surrounded by chairs and has a chandelier above it. There's a wooden cabinet against the wall. The background features a hallway with a staircase and another dining area visible. The decor includes wooden furniture and framed pictures on the walls.
Short Caption + Camera → Video
Structured Caption + Camera → Video
Example-3
Short Caption: The scene is bathed in bright sunlight, emphasizing the warm and inviting atmosphere. A modern house with large windows and a balcony is showcased. Potted plants accent the architectural details. Distant mountains frame the view. Lush greenery surrounds the scene. The sky is a clear blue, dotted with scattered clouds. A dreamy lens flare effect adds to the serene quality. The overall ambiance is tranquil and picturesque.
Short Caption + Camera → Video
Structured Caption + Camera → Video
Figure 14. Camera-controlled video generation results using short captions versus structured captions.
Example-4
Short Caption: A dark, static background enhances the brightly colored, rotating spiral of small blocks. The camera remains fixed, capturing the mesmerizing effect of the colors shifting subtly. The dynamic movement creates a hypnotic, abstract visual style.
Short Caption + Depth → Video
Structured Caption + Depth → Video
Example-5
Short Caption: The park is sunny and lush with green trees and a small pond. A young couple in their late twenties embraces and shares a kiss. The woman, in a white sleeveless wedding dress and holding a bouquet, playfully touches the man's face. He is dressed in a black suit. The scene is romantic and intimate, with soft, natural lighting. The camera pans gently, capturing their affectionate interaction and the serene environment.
Short Caption + Depth → Video
Structured Caption + Depth → Video
Figure 15. Depth-controlled video generation results using short captions versus structured captions.
Page 23:
[Figure 16 panels. Labels: Diffusion Models; Short Caption + Identities → Video; Structured Caption + Identities → Video; Short Caption w/ Condition Caption + Identities → Video.]
Example-6. Short Caption: A woman in a military uniform talks on the phone while holding a document, standing beside a man in a blue uniform against a wall. The setting is formal and professional, suggesting an official procedure in an institutional environment with light-colored walls and framed documents. The camera captures their upper bodies, moving backward and tilting upward, transitioning from a close-up to a medium close-up shot.
Example-7. Short Caption: A young woman in a traditional colorful outfit rides a galloping black horse through a lush green landscape. The camera follows her movements, capturing the dynamic and vibrant scene, with her hair flowing in the wind. The background is blurred to emphasize the speed and joy of the rider. The overall feel is natural and bright.
Figure 16. Illustrations of predicted structured captions based on the input multiple identities and the short instruction that expresses the
implicit objects and the action for the target video generation.
Page 24:
[Figure 17 panels. Labels: Diffusion Models; Short Caption + Identities → Video; Structured Caption + Identities → Video; Short Caption w/ Condition Caption + Identities → Video.]
Example-8. Short Caption: A martial arts dojo scene where an instructor in black demonstrates techniques, throwing a student in white to the ground. Students sit in a circle on the green mat floor, observing attentively. In the background, banners and signs indicate martial arts training, with a seated audience and standing spectators. The camera moves from a medium to a long shot, capturing the full scene with respect and focus.
Example-9. Short Caption: A young girl wearing a school uniform and a young man in casual clothes are walking side by side along a dimly lit concrete wall at night. The girl walks on the left while the boy rides a bicycle on the right. The background is urban and gritty, with warm, moody lighting. The camera follows them closely, capturing a medium close-up shot of their upper bodies from different angles as they move. The scene has a nostalgic and contemplative atmosphere.
Figure 17. Illustrations of predicted structured captions based on the input multiple identities and the short instruction that expresses the
implicit objects and the action for the target video generation.
Page 25:
[Figure 18 panels. Labels: Diffusion Models; Short Caption + Identities → Video; Structured Caption + Identities → Video; Short Caption w/ Condition Caption + Identities → Video.]
Example-10. Short Caption: A woman wearing fashionable clothes stands in the room, smiling and showing the goods in her hand. Then the camera zooms in and focuses on the details of the goods in the person's hand.
Example-11. Short Caption: Two cartoon characters are smiling at the camera together.
Figure 18. Illustrations of predicted structured captions based on the input multiple identities and the short instruction that expresses the
implicit objects and the action for the target video generation.