Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Xingyu Chen¹, Yue Chen¹, Yuliang Xiu¹,², Andreas Geiger³, Anpei Chen¹,³
¹Westlake University, ²Max Planck Institute for Intelligent Systems,
³University of Tübingen, Tübingen AI Center
easi3r.github.io
Abstract
Recent advances in DUSt3R have enabled robust estima-
tion of dense point clouds and camera parameters of static
scenes, leveraging Transformer network architectures and
direct supervision on large-scale 3D datasets. In contrast,
the limited scale and diversity of available 4D datasets
present a major bottleneck for training a highly general-
izable 4D model. This constraint has driven conventional
4D methods to fine-tune 3D models on scalable dynamic
video data with additional geometric priors such as opti-
cal flow and depths. In this work, we take an opposite path
and introduce Easi3R, a simple yet efficient training-free
method for 4D reconstruction. Our approach applies at-
tention adaptation during inference, eliminating the need
for from-scratch pre-training or network fine-tuning. We
find that the attention layers in DUSt3R inherently encode
rich information about camera and object motion. By care-
fully disentangling these attention maps, we achieve accu-
rate dynamic region segmentation, camera pose estimation,
and 4D dense point map reconstruction. Extensive exper-
iments on real-world dynamic videos demonstrate that our
lightweight attention adaptation significantly outperforms
previous state-of-the-art methods that are trained or fine-
tuned on extensive dynamic datasets.
1. Introduction
Recovering geometry and motions from dynamic im-
age collections is still a fundamental challenge in com-
puter vision, with broad downstream applications in novel
view synthesis, AR/VR, autonomous navigation, and
robotics. This problem is commonly referred to as
Structure-from-Motion (SfM) and has been a core focus
of 3D vision for decades, yielding mature algorithms that
perform well under stationary conditions and wide base-
lines. However, these algorithms often fail when applied
to dynamic video input.
Figure 1. We present Easi3R, a training-free, plug-and-play approach that efficiently disentangles object and camera motion, enabling the adaptation of DUSt3R for 4D reconstruction. [Diagram labels: Video, Frame, Easi3R, Static Scene, Camera Motion, Object Motion, 4D Reconstruction]
The main reason for the accuracy and robustness gap between static and dynamic SfM is object dynamics, a common component in real-world videos. Moving objects violate fundamental assumptions of homography and epipolar consistency in traditional SfM methods [37, 48]. In addition, in dynamic videos, where camera and object motions are often entangled, these methods struggle to disentangle the two motions, so that richly textured moving regions often erroneously dominate camera pose estimation. Recent efforts, such as MonST3R [73] and CUT3R [63], have made strides to address these challenges. However, their success is based on extensive training data [19, 25, 63, 68, 73] or task-specific prior models [22, 25, 73, 74], such as depth, optical flow, and object mask estimators. These limitations motivate us to innovate further to minimize the gap between static and dynamic reconstruction.
We ask ourselves if there are lessons from human perception that can be used as design principles for dynamic 4D reconstruction: Human beings are capable of perceiving body motion and the structure of the scene, identifying dynamic objects, and disentangling ego-motion from object motion through the inherent attention mechanisms of the brain [58]. Yet, the learning process rarely relies on explicit dynamic labels.
arXiv:2503.24391v1 [cs.CV] 31 Mar 2025
We observe that DUSt3R implicitly learned a similar
mechanism, and based on this, we introduce Easi3R, a
training-free method to achieve dynamic object segmenta-
tion, dense point map reconstruction, and robust camera
pose estimation from dynamic videos, as shown in Figure 1.
DUSt3R uses attention layers at its core, taking two image
features as input and producing pixel-aligned point maps
as output. These attention layers are trained to directly
predict pointmaps in the reference view coordinate space,
implicitly matching the image features between the input
views [4] and estimating the rigid view transformation in
the feature space. In practice, performance drops signifi-
cantly when processing pairs with object dynamics [73], as
shown in Figure 2. By analyzing the attention maps in the
transformer layers, we find that texture-less regions,
under-observed regions, and dynamic objects tend to yield low attention
values. Therefore, we propose a simple yet effective
decomposition strategy to isolate the above components,
which enables long-horizon dynamic object detection and
segmentation. With this segmentation, we perform a second
inference pass by applying a re-weighting [17] in the cross-
attention layers, enabling robust dynamic 4D reconstruction
and camera motion recovery without fine-tuning on a dy-
namic dataset, all at minimal additional cost to DUSt3R.
Despite its simplicity, we demonstrate that our inference-
time scaling approach for 4D reconstruction is remarkably
robust and accurate on in-the-wild casual dynamic videos.
We evaluate our Easi3R adaptation on the DUSt3R and
MonST3R backbones in three task categories: camera pose
estimation, dynamic object segmentation, and pointcloud
reconstruction in dynamic scenes. Easi3R performs sur-
prisingly well across a wide range of datasets, even surpass-
ing concurrent methods (e.g., CUT3R [63], MonST3R [73],
and DAS3R [68]) that are trained on dynamic datasets.
2. Related Work
SfM and SLAM. Structure-from-Motion (SfM) [2, 41, 42, 48, 51, 52] and Simultaneous Localization and Mapping (SLAM) [9, 13, 32, 34] have long been the foundation for 3D structure and camera pose estimation. These methods work by associating 2D correspondences [5, 10, 28, 32, 47] or minimizing photometric errors [12, 13], followed by bundle adjustment (BA) [3, 6, 55, 57, 59, 62] to refine structure and motion estimates. Although highly effective with dense input, these approaches often struggle with limited camera parallax or ill-posed conditions, leading to degraded performance. To overcome these limitations, DUSt3R [64] introduced a learning-based approach that directly predicts two pointmaps from an image pair in the coordinate space of the first view. This approach implicitly performs both image feature matching and rigid view transformation. By leveraging a Transformer-based architecture [11] and direct point supervision on large-scale 3D datasets, DUSt3R establishes a robust Multi-View Stereo (MVS) foundation model. However, DUSt3R and the follow-up methods [27, 33, 56, 61] primarily assume static scenes, which can lead to significant performance degradation on videos with dynamic objects.
Pose-free Dynamic Scene Reconstruction. Adaptations of SLAM to dynamic scenes involve robust pose estimation to mitigate interference from moving objects and dynamic map management to update changing environments; techniques such as semantic segmentation [72] and optical flow [75] enhance SLAM's resilience in dynamic scenarios. Another line of work focuses on estimating stable video
depth by incorporating geometric constraints [29] and gen-
erative priors [18, 49]. These methods enhance monocu-
lar depth accuracy but lack global point cloud lifting due
to missing camera intrinsics and poses. For joint pose and
depth estimation, optimization-based methods such as Ca-
sualSAM [74] fine-tune a depth network [45] at test time
using pre-computed optical flow [66]. Robust-CVD [22] re-
fines pre-computed depth [45] and camera pose by leverag-
ing masked optical flow [16, 66] to improve stability in oc-
cluded and moving regions. Concurrently, MegaSaM [25]
further enhances pose and depth accuracy by integrating
DROID-SLAM [57], optical flow [66], and depth initializa-
tions from [40, 71], achieving state-of-the-art results. Alter-
natively, point-map-based approaches like MonST3R [73]
extend DUSt3R to dynamic scenes by fine-tuning with dy-
namic datasets and incorporating optical flow [66] to infer
dynamic object segmentation. DAS3R trains a DPT [44] on
top of MonST3R, enabling feedforward segmentation esti-
mation. CUT3R [63] fine-tunes MASt3R [24] on both static
and dynamic datasets, achieving feedforward reconstruc-
tion but without predicting dynamic object segmentation,
thereby entangling the static scene with dynamic objects.
Although effective, these methods require costly training on
diverse motion patterns to generalize well.
In contrast, we take an opposite path, exploring a
training-free and plug-and-play adaptation that enhances the
generalization of DUSt3R variants for dynamic scene re-
construction. Our method requires no fine-tuning and
comes at almost no additional cost, offering a scalable and
efficient alternative for handling real-world dynamic videos.
Motion Segmentation. Motion segmentation aims to pre-
dict dynamic object masks from video inputs. Classical
approaches generally rely on optical flow estimation [26,
31, 67, 70] and point tracking [7, 21, 35, 50, 69] to distin-
guish moving objects from the background. Being trained
solely on 2D data, they often struggle with occlusions and
distinguishing between object and camera motion. To im-
prove robustness, RoMo [15] incorporates epipolar geome-
try [30] and SAM2 [46] to better disambiguate object mo-
tion from camera motion. While RoMo successfully com-
bines COLMAP [48] for accurate camera calibration, it fo-
cuses primarily on removing dynamic objects and recon-
structing static scene elements only. For the complete 4D
reconstruction, MonST3R [73] integrates optical flow [66]
with estimated pose and depth to predict dynamic object
segmentation. DAS3R builds on MonST3R and trains a
DPT [44] for segmentation inference. Using this segmenta-
tion, they align static components globally while preserving
dynamic point clouds from each frame, enabling temporally
consistent reconstruction of moving objects.
In this work, we discover that dynamic segmentation can
be extracted from pre-trained 3D reconstruction models like
DUSt3R. We propose a simple, yet robust strategy to isolate
this information from the attention layers, without the need
for optical flow or pre-training on segmentation datasets.
3. Method
Given a casually captured video sequence $\{I^t \in \mathbb{R}^{W \times H \times 3}\}_{t=1}^{T}$, our goal is to estimate object and camera movements, as well as the canonical point clouds present in the input video. Object motion is represented as the segmentation sequence $M^t$, camera motion as the extrinsic and intrinsic pose sequences $P^t, K^t$, and point clouds as $X^*$. First,
we formulate how the DUSt3R model handles videos (Sec-
tion 3.1). Next, we explore the mechanisms of attention ag-
gregation in spatial and temporal dimensions (Section 3.2).
Finally, we introduce how aggregated cross-attention maps
can be leveraged to decompose dynamic object segmenta-
tion (Section 3.3), which in turn helps re-weight attention
values for robust point cloud and camera pose reconstruc-
tion (Section 3.4).
3.1. DUSt3R with Dynamic Video
DUSt3R is designed for pose-free reconstruction, taking two RGB images $I^a, I^b$, where $a, b \in [1, \dots, T]$, as input and outputting two pointmaps in the reference view coordinate space, $X^{a \to a}, X^{b \to a} \in \mathbb{R}^{W \times H \times 3}$:
$$X^{a \to a}, X^{b \to a} = \mathrm{DUSt3R}(I^a, I^b) \tag{1}$$
Here, $X^{b \to a}$ denotes the pointmap of input $I^b$ predicted in the view $a$ coordinate space. In particular, both pointmaps are expressed in the reference view coordinate space, i.e., view $a$ in this example.
Given multi-view images, DUSt3R processes them in pairs and globally aligns the pairwise predictions into a joint coordinate space using a connectivity graph across all views. However, this approach introduces computational redundancy for video sequences, as the view connectivity is largely known. Instead, we process videos using a sliding temporal window and run the network on the pair set $\varepsilon^t = \{(a, b) \mid a, b \in [t - \frac{n-1}{2}, \dots, t + \frac{n-1}{2}],\ a \neq b\}$ within the symmetric temporal window of size $n$ centered at time $t$, as illustrated in the top row of Figure 2.
Figure 2. DUSt3R with Dynamic Video. We process videos using a sliding window and run the DUSt3R network pairwise. Reconstruction degrades with misalignment when dynamic objects occupy a considerable portion of the frames. [Panel labels: video frames t-2, ..., t+2; Sliding Window; DUSt3R reconstruction; Video frame]
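The pair set $\varepsilon^t$ defined above can be enumerated with a few lines of Python. This is a minimal sketch and not the authors' implementation: the function name `window_pairs`, the 0-based frame indexing, and the clipping of the window at the sequence boundaries are all assumptions made for illustration.

```python
def window_pairs(t: int, n: int, num_frames: int) -> list[tuple[int, int]]:
    """Ordered frame pairs (a, b), a != b, inside a symmetric window of odd size n centred at t."""
    if n % 2 == 0:
        raise ValueError("this sketch assumes an odd window size n")
    half = (n - 1) // 2
    # Assumption: the window is clipped at the sequence boundaries (0-based frame indices).
    lo, hi = max(0, t - half), min(num_frames - 1, t + half)
    frames = range(lo, hi + 1)
    return [(a, b) for a in frames for b in frames if a != b]


pairs = window_pairs(t=5, n=3, num_frames=10)
pairs_with_t = [p for p in pairs if 5 in p]   # pairs in which frame t takes part
print(len(pairs), len(pairs_with_t))
```

For a window of size 3 the sketch lists six ordered pairs, four of which involve the centre frame, in line with the $2(n-1)$ maps per frame used later for temporal aggregation.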
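With pairwise predictions, the global alignment recovers globally aligned pointmaps $\{X^t \in \mathbb{R}^{W \times H \times 3}\}_{t=1}^{T}$ by optimizing the transformation $P_i^t \in \mathbb{R}^{3 \times 4}$ from the coordinate space of each pair to the world coordinate space, and a scale factor $s_i^t$ for the $i$-th pair within the set of pairs $\varepsilon^t$:
$$X^* = \arg\min_{X, P, s} \sum_{t \in T} \sum_{i \in \varepsilon^t} \| X^a - s_i^t P_i^t X^{a \to a} \|_1 + \| X^b - s_i^t P_i^t X^{b \to a} \|_1 \tag{2}$$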
Note that the above optimization process assumes a re-
liable pairwise reconstruction and that global content can
be registered by minimizing the linear equations in Eq. 2.
However, since DUSt3R is trained on RGB-D images
of static scenes, dynamic objects disrupt the learned epipo-
lar matching policy. As a result, registration fails when dy-
namic content occupies a considerable portion of pixels, as
shown in Figure 2.
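To make the objective in Eq. 2 concrete, the following sketch evaluates the L1 residual of a single pair. It only evaluates the loss for given variables; the actual optimization over $X$, $P$, and $s$ is not shown. The helper names and the toy data are hypothetical, and the $3 \times 4$ transform is assumed to have the form $[R \mid t]$.

```python
import numpy as np


def apply_transform(P: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Apply a 3x4 transform [R | t] to an (H, W, 3) pointmap."""
    return X @ P[:, :3].T + P[:, 3]


def pair_alignment_residual(X_a, X_b, X_aa, X_ba, P, s):
    """L1 residual of a single pair in Eq. 2, summed over all pixels."""
    res_a = np.abs(X_a - s * apply_transform(P, X_aa)).sum()
    res_b = np.abs(X_b - s * apply_transform(P, X_ba)).sum()
    return res_a + res_b


rng = np.random.default_rng(0)
X_aa = rng.normal(size=(4, 5, 3))   # toy pairwise prediction for view a
X_ba = rng.normal(size=(4, 5, 3))   # toy pairwise prediction for view b
P = np.hstack([np.eye(3), [[0.1], [0.2], [0.3]]])   # identity rotation plus a translation
s = 2.0
# Global pointmaps that agree exactly with the transformed pairwise predictions.
X_a = s * apply_transform(P, X_aa)
X_b = s * apply_transform(P, X_ba)
residual_consistent = pair_alignment_residual(X_a, X_b, X_aa, X_ba, P, s)
residual_shifted = pair_alignment_residual(X_a + 0.5, X_b, X_aa, X_ba, P, s)
```

When the pairwise predictions are consistent with the global pointmaps the residual vanishes; any content that moves between frames leaves a residual that no single rigid transform and scale can remove, which is the failure mode discussed above.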
3.2. Secrets Behind DUSt3R
We now examine the network architecture to identify the
components that cause failures for dynamic video input.
As shown in Figure 3, DUSt3R consists of two branches: the top one for the reference image $I^a$ and the bottom one for the source image $I^b$. The two input images are first processed by a weight-sharing ViT encoder [11], producing token representations $\mathbf{F}_0^a = \mathrm{Encoder}(I^a)$ and $\mathbf{F}_0^b = \mathrm{Encoder}(I^b)$. Next, two decoders, composed of a sequence of decoder blocks, exchange information both within and between views. In each block, self-attention is applied to the token outputs from the previous block, while cross-attention is performed using the corresponding block outputs from the other branch,
$$\mathbf{F}_l^a = \mathrm{DecoderBlock}_l^{\mathrm{ref}}(\mathbf{F}_{l-1}^a, \mathbf{F}_{l-1}^b), \qquad \mathbf{F}_l^b = \mathrm{DecoderBlock}_l^{\mathrm{src}}(\mathbf{F}_{l-1}^b, \mathbf{F}_{l-1}^a) \tag{3}$$
where $l = 1, \dots, L$ is the block index. Using the feature tokens, two regression heads produce pointmap predictions: $X^{a \to a} = \mathrm{Head}^{\mathrm{ref}}(\mathbf{F}_0^a, \dots, \mathbf{F}_L^a)$ and $X^{b \to a} = \mathrm{Head}^{\mathrm{src}}(\mathbf{F}_0^b, \dots, \mathbf{F}_L^b)$, respectively. The blocks are trained by minimizing the Euclidean distance between the predicted and ground-truth pointmaps.
Figure 3. DUSt3R and our Easi3R adaptation. DUSt3R encodes two images $I^a, I^b$ into feature tokens $\mathbf{F}_0^a, \mathbf{F}_0^b$, which are then decoded into point maps in the reference view coordinate space using two decoders. Our Easi3R aggregates the cross-attention maps from the decoders, producing four semantically meaningful maps: $A_\mu^{b=\mathrm{src}}, A_\sigma^{b=\mathrm{src}}, A_\mu^{a=\mathrm{ref}}, A_\sigma^{b=\mathrm{ref}}$. These maps are then used for a second inference pass to enhance reconstruction quality.
Figure 4. Visualization for Cross-Attention Maps. We color the normalized values of attention maps, ranging from one to zero. We highlight the patterns captured by each type of attention map using relatively high values. For a more detailed demonstration, we invite reviewers to visit our webpage under easi3r.github.io. [Legend: Dynamics, Texture-less, Under-observed, Camera Motion]
Observation. Our key insight is that DUSt3R implicitly
learns rigid view transformations through its cross-attention
layers, assigning low attention values to tokens that violate
epipolar geometry constraints, such as texture-less, under-observed, and dynamic regions. By aggregating cross-attention outputs across spatial and temporal dimensions, we extract motions from the attention layers.
Spatial attention maps. As illustrated on the left of Figure 3, the image features $\mathbf{F}$ are projected into a query matrix for their respective branch, $Q = \ell_Q(\mathbf{F}) \in \mathbb{R}^{(h \times w) \times c}$, while also serving as the key and value matrices for the other branch, $K = \ell_K(\mathbf{F}) \in \mathbb{R}^{(h \times w) \times c}$, where $c$ is the feature dimension. The projections are obtained using trainable linear functions $\ell_Q(\cdot), \ell_K(\cdot)$. As illustrated on the right side of Figure 3, this results in the cross-attention map:
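$$A_l^{a \leftarrow b} = Q_l^a {K_l^b}^{\top} / \sqrt{c}, \qquad A_l^{b \leftarrow a} = Q_l^b {K_l^a}^{\top} / \sqrt{c} \tag{4}$$
in which the cross-attention maps $A_l^{a \leftarrow b}, A_l^{b \leftarrow a} \in \mathbb{R}^{(h \times w) \times h \times w}$ are used to guide the warping of the value matrix $V = \ell_V(\mathbf{F}) \in \mathbb{R}^{(h \times w) \times c}$, and the cross-attention output in the reference view branch is given by $\mathrm{softmax}(A_l^{a \leftarrow b}) V^b$. Intuitively, the attention map $A_l^{a \leftarrow b}$ determines how the information is aggregated from view $b$ to view $a$ in the $l$-th decoder block.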
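The cross-attention in Eq. 4 can be illustrated with a small NumPy sketch. It assumes a single attention head, bias-free linear projections given as plain weight matrices, and random toy features; none of these details are taken from the DUSt3R code base, and the names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(F_a, F_b, W_q, W_k, W_v):
    """Single-head cross-attention from view b into view a; F_a, F_b are (h*w, c) tokens."""
    c = W_q.shape[1]
    Q_a = F_a @ W_q                      # queries come from the branch being updated
    K_b = F_b @ W_k                      # keys come from the other branch
    V_b = F_b @ W_v                      # values come from the other branch
    A_ab = Q_a @ K_b.T / np.sqrt(c)      # pre-softmax map A^{a<-b}, shape (h*w, h*w)
    out_a = softmax(A_ab, axis=-1) @ V_b
    return A_ab, out_a


rng = np.random.default_rng(0)
h, w, c = 2, 3, 4
F_a, F_b = rng.normal(size=(h * w, c)), rng.normal(size=(h * w, c))
W_q, W_k, W_v = (rng.normal(size=(c, c)) for _ in range(3))
A_ab, out_a = cross_attention(F_a, F_b, W_q, W_k, W_v)
```

Each row of `A_ab` corresponds to one query token in view $a$ and each column to one key token in view $b$, so reshaping the column axis to $h \times w$ gives the layout used in the text.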
To evaluate the spatial contribution of each token in view $b$ to all tokens in view $a$, we average the attention values along the query and layer dimensions:
$$A^{b=\mathrm{src}} = \sum_l \sum_x A_l^{a \leftarrow b}(x, y, z) / (L \times h \times w), \qquad A^{a=\mathrm{ref}} = \sum_l \sum_x A_l^{b \leftarrow a}(x, y, z) / (L \times h \times w) \tag{5}$$
where $A^{b=\mathrm{src}}, A^{a=\mathrm{ref}} \in \mathbb{R}^{h \times w}$ are the averaged attention maps, capturing the overall influence of tokens from one view on another across all decoder layers. For example, $A^{b=\mathrm{src}}$ denotes the overall contribution of view $b$ to the reference view when it serves as the source view.
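The averaging in Eq. 5 amounts to a mean over the layer and query axes. The sketch below assumes the per-layer maps are stacked into an array of shape `(L, h*w, h*w)` with queries on axis 1 and keys on axis 2; whether the pre- or post-softmax maps are aggregated is not specified here, so the function accepts either. The function name is hypothetical.

```python
import numpy as np


def spatial_attention_map(attn: np.ndarray, h: int, w: int) -> np.ndarray:
    """attn: (L, h*w, h*w) stacked cross-attention maps; returns an (h, w) map over key tokens."""
    num_layers, n_query, n_key = attn.shape
    if n_query != h * w or n_key != h * w:
        raise ValueError("token count must equal h * w")
    # The mean over layers and queries equals the double sum divided by L * h * w.
    return attn.mean(axis=(0, 1)).reshape(h, w)


h, w, L = 2, 2, 3
attn = np.zeros((L, h * w, h * w))
attn[:, :, 1] = 1.0            # toy case: every query attends only to key token 1
A_src = spatial_attention_map(attn, h, w)
```

In the toy case only the second key token receives attention, so the resulting map is one at that position and zero elsewhere.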
Temporal attention maps. In the following, we extend the above single-pair formulation to multiple pairs to explore their temporal attention correlations. A specific frame $I^t$ is paired with multiple frames, resulting in $2(n-1)$ attention maps per frame. As shown in the upper row of Figure 2, a window size of 3 corresponds to 4 pairs. To aggregate the pairwise cross-attention maps temporally, we compute the mean and standard deviation over the pairs in which the view serves as source and reference:
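$$A_\mu^{b=\mathrm{src}} = \mathrm{Mean}(A_i^{b=\mathrm{src}}), \qquad A_\sigma^{b=\mathrm{src}} = \mathrm{Std}(A_i^{b=\mathrm{src}}) \tag{6}$$
where $i \in \varepsilon^{b=\mathrm{src}}$ and
$$\varepsilon^{b=\mathrm{src}} = \{(a, b) \mid \mathrm{src} = b,\ a \in [t - n, \dots, t + n],\ a \neq b\} \tag{7}$$
Similarly, we compute $A_\mu^{b=\mathrm{ref}}$ and $A_\sigma^{b=\mathrm{ref}}$ for the set of pairs where view $a$ acts as the source view:
$$\varepsilon^{b=\mathrm{ref}} = \{(b, a) \mid \mathrm{ref} = b,\ a \in [t - n, \dots, t + n],\ a \neq b\} \tag{8}$$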
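The temporal aggregation of Eq. 6 reduces to a per-pixel mean and standard deviation over the pairs in which a frame plays a given role. A minimal sketch, assuming the per-pair spatial maps have already been stacked along the first axis and using the population standard deviation (the choice of `ddof` is an assumption):

```python
import numpy as np


def temporal_attention_stats(maps: np.ndarray):
    """maps: (P, h, w), one spatial attention map per pair; returns per-pixel (mean, std)."""
    return maps.mean(axis=0), maps.std(axis=0)   # population std (ddof=0), an assumption


# Toy example with two pairs: attention is 0 in one pair and 2 in the other.
maps = np.stack([np.zeros((2, 2)), 2.0 * np.ones((2, 2))])
A_mu, A_sigma = temporal_attention_stats(maps)
```

Calling the function once on the maps where the frame is the source and once on those where it is the reference yields the four aggregated maps discussed in the text.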
Secrets. We visualize the aggregated temporal cross-
attention maps in Figure 4. Recall that DUSt3R infers
pointmaps from two images in the reference view coordi-
nate frame, implicitly aligning points from the source view
to the reference view.
(i) The reference view serves as the registration standard and is assumed to be static. As a result, the average attention map $A_\mu^{a=\mathrm{ref}}$ tends to be smooth, with texture-less regions (e.g., ground, sky, swing supports, boxing fences) and under-observed areas (e.g., image boundary) naturally exhibiting low attention values, since DUSt3R considers them less useful for registration. These regions can be highlighted and extracted using $(1 - A_\mu^{a=\mathrm{ref}})$, as shown in the corresponding column of Figure 4.
(ii) By calculating the standard deviation of $(1 - A_\mu^{a=\mathrm{ref}})$ between neighboring frames, we obtain $A_\sigma^{a=\mathrm{ref}}$ (see the corresponding column of Figure 4), which represents the changes of the token contribution in the image coordinate space. Pixels perpendicular to the direction of motion generally share similar pixel flow speeds, resulting in consistent deviations that allow us to infer camera motion from the attention pattern. For example, in the "Walking Man" case in the fourth row of Figure 4, with the camera moving from left to right, we can observe pixels along a column sharing similar attention values.
(iii) Similar to the reference view, we also compute the average inverse attention map in the source view, $1 - A_\mu^{a=\mathrm{src}}$. As shown in the corresponding column of Figure 4, the result not only indicates texture-less and under-observed areas but also highlights dynamic objects, because they violate the rigid body transformation prior that DUSt3R has learned from the 3D dataset, resulting in low $A_\mu^{a=\mathrm{src}}$ values.
(iv) The corresponding column of Figure 4 shows the standard deviation of the source view attention map, $A_\sigma^{a=\mathrm{src}}$. It highlights both camera and object motion, as the attention of these areas continuously changes over time, leading to high deviation in image space.
3.3. Dynamic Object Segmentation
By observing the compositional properties of the derived cross-attention maps, we propose extracting dynamic object segmentation for free, which provides a key for bridging static and dynamic scene reconstruction. To this end, we identify attention activations attributed to object motion. We infer the dynamic attention map for frame $a$ by computing the joint attention of the first two attention columns in Figure 4 using the element-wise product $(1 - A_\mu^{a=\mathrm{src}}) \cdot A_\sigma^{a=\mathrm{src}}$. To further mitigate the effects of texture-less regions, under-observed areas, and camera motion (as shown in the third and fourth columns of Figure 4), we multiply the result by their inverse attention, resulting in the final formula:
$$A^{a=\mathrm{dyn}} = (1 - A_\mu^{a=\mathrm{src}}) \cdot A_\sigma^{a=\mathrm{src}} \cdot A_\mu^{a=\mathrm{ref}} \cdot (1 - A_\sigma^{a=\mathrm{ref}}) \tag{9}$$
We then obtain the per-frame dynamic object segmentation $M^t = [A^{t=\mathrm{dyn}} > \alpha]$ using Eq. 9, where $M^t \in \mathbb{R}^{h \times w}$, $\alpha$ denotes a pre-defined attention threshold, and $[\cdot]$ is the Iverson bracket. Note that the segmentation is processed frame by frame. To enhance temporal consistency, we apply a feature clustering method that fuses information across all frames; see the supplementary materials for more details.
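Eq. 9 and the thresholding step can be sketched as follows. The min-max normalization of each map to $[0, 1]$ and the threshold value `alpha=0.5` are assumptions chosen for illustration; the text only states that normalized attention values are visualized and that $\alpha$ is pre-defined. The feature clustering used for temporal consistency is not included.

```python
import numpy as np


def normalize01(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Min-max normalize a map to [0, 1] (an assumed normalization)."""
    return (x - x.min()) / (x.max() - x.min() + eps)


def dynamic_mask(src_mu, src_sigma, ref_mu, ref_sigma, alpha: float = 0.5):
    """Element-wise product of Eq. 9 followed by thresholding; alpha is a hypothetical value."""
    src_mu, src_sigma, ref_mu, ref_sigma = map(
        normalize01, (src_mu, src_sigma, ref_mu, ref_sigma)
    )
    a_dyn = (1.0 - src_mu) * src_sigma * ref_mu * (1.0 - ref_sigma)
    return a_dyn, a_dyn > alpha


# Toy 2x2 maps in which only the top-left token behaves like a dynamic object:
# low mean and high deviation as source, high mean and low deviation as reference.
src_mu = np.array([[0.0, 1.0], [1.0, 1.0]])
src_sigma = np.array([[1.0, 0.0], [0.0, 0.0]])
ref_mu = np.array([[1.0, 0.0], [1.0, 1.0]])
ref_sigma = np.array([[0.0, 1.0], [0.0, 0.0]])
a_dyn, mask = dynamic_mask(src_mu, src_sigma, ref_mu, ref_sigma)
```

Because the four factors are multiplied, a token is marked as dynamic only when all four conditions hold at once, which is what suppresses texture-less regions and camera-motion patterns.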
3.4. 4D Reconstruction
With dynamic object segmentation, the most intuitive way
to adapt static models to dynamic scenes is by masking
out dynamic objects during inference at both the image
and token levels. This can be done by replacing dynamic
regions with black pixels in the image and substituting
the corresponding tokens with mask tokens. In practice,
this approach significantly degrades reconstruction perfor-
mance [73], mainly because black pixels and mask tokens
lead to out-of-distribution input. This motivates us to ap-
ply masking directly within the attention layers instead of
modifying the input images.
Attention re-weighting. We propose to modify the cross-
attention maps by weakening the attention values associated
with dynamic regions. To achieve this, we perform a sec-
ond inference pass through the network, masking the atten-
tion map for assigned dynamic regions. This results in zero
attention for those regions while keeping the rest of the at-
tention maps unchanged:
$$\mathrm{softmax}(\tilde{A}_l^{a \leftarrow b}) = \begin{cases} 0 & \text{if } M^{a \leftarrow b} \\ \mathrm{softmax}(A_l^{a \leftarrow b}) & \text{otherwise} \end{cases} \tag{10}$$
Here, $M^{a \leftarrow b} = (1 - M^a) \otimes {M^b}^{\top}$, where $M^{a \leftarrow b} \in \mathbb{R}^{(h \times w) \times (h \times w)}$ and $\otimes$ denotes the outer product. As a result, tokens from dynamic regions in view $b$ do not contribute to static regions in view $a$. It is important to note that re-weighting is applied only to the reference view decoder, as the source view requires a static reference (i.e., the reference view), as described in secret (i). To achieve this, the source view decoder must perform cross-attention with all tokens from the reference view. Re-weighting dynamic attention on both branches could result in the loss of the static standard, leading to noisy outputs. We conducted an ablation study on this insight.
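The masking of Eq. 10 can be written directly with an outer product of the flattened masks. The sketch below operates on a single post-softmax attention matrix and, following the equation, leaves the remaining entries unchanged without renormalizing; the function name and the toy values are hypothetical.

```python
import numpy as np


def reweight_attention(attn_probs: np.ndarray, mask_a: np.ndarray, mask_b: np.ndarray) -> np.ndarray:
    """attn_probs: (h*w, h*w) softmax(A^{a<-b}); mask_a, mask_b: (h, w) booleans, True = dynamic."""
    m_a = mask_a.reshape(-1).astype(float)
    m_b = mask_b.reshape(-1).astype(float)
    # M^{a<-b} is 1 where a static query token in view a meets a dynamic key token in view b.
    M_ab = np.outer(1.0 - m_a, m_b)
    return np.where(M_ab > 0, 0.0, attn_probs)


attn = np.array([[0.3, 0.7],
                 [0.4, 0.6]])
mask_a = np.array([[False, True]])    # token 0 of view a is static, token 1 is dynamic
mask_b = np.array([[True, False]])    # token 0 of view b is dynamic, token 1 is static
attn_reweighted = reweight_attention(attn, mask_a, mask_b)
```

Only the entry linking the static query in view $a$ to the dynamic key in view $b$ is zeroed; the row belonging to the dynamic query is left untouched.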
Global alignment. We align the predicted pointmaps
from the sliding windows with the global world coordi-
nate using Eq. 2. Moreover, thanks to dynamic region seg-
mentation, our method also supports segmentation-aware
global alignment with optical flow. In particular, we in-
corporate a reprojection loss to ensure that the projected
point flow remains consistent with the optical flow estimation [66]. Specifically, given an image pair $(a, b)$, we compute the flow induced by camera motion from frame $a$ to frame $b$, denoted by $\hat{\mathcal{F}}^{a \to b}$, by projecting the global point map $X^b$ from camera $(P^a, K^a)$ to camera $(P^b, K^b)$. We then enforce consistency between the computed flow and the estimated optical flow $\mathcal{F}^{a \to b}$ in static regions $(1 - M^t)$:
$$L_{\mathrm{flow}} = \sum_{t \in T} \sum_{i \in \varepsilon^t} (1 - M^a) \cdot \| \hat{\mathcal{F}}_i^{a \to b} - \mathcal{F}_i^{a \to b} \|_1 + (1 - M^b) \cdot \| \hat{\mathcal{F}}_i^{b \to a} - \mathcal{F}_i^{b \to a} \|_1 \tag{11}$$
where $\cdot$ indicates the element-wise product. By incorporating the flow constraint into the optimization process in Eq. 2, we achieve a more robust output in terms of global pointmaps $X^*$ and pose sequences $P^t, K^t$. Note that this term is optional, which allows a fair comparison with baselines that do not incorporate a flow-estimation model.
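One direction of the flow term in Eq. 11 can be sketched as below. The sketch assumes a pinhole camera, a world-to-camera $3 \times 4$ matrix (a hypothetical convention that may differ from the one used for $P$ in Eq. 2), and a world pointmap for frame $a$; the projected flow is the position of each point in camera $b$ minus its pixel location in frame $a$. All names and toy values are illustrative.

```python
import numpy as np


def project(X_world: np.ndarray, P_w2c: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project (H, W, 3) world points with a 3x4 world-to-camera matrix and 3x3 intrinsics."""
    X_cam = X_world @ P_w2c[:, :3].T + P_w2c[:, 3]
    uvw = X_cam @ K.T
    return uvw[..., :2] / uvw[..., 2:3]


def induced_flow(X_a_world: np.ndarray, P_b_w2c: np.ndarray, K_b: np.ndarray) -> np.ndarray:
    """Flow a -> b caused by camera motion alone: projection into b minus the pixel grid of a."""
    H, W, _ = X_a_world.shape
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    grid = np.stack([u, v], axis=-1).astype(float)
    return project(X_a_world, P_b_w2c, K_b) - grid


def masked_flow_l1(flow_pred: np.ndarray, flow_est: np.ndarray, dyn_mask: np.ndarray) -> float:
    """L1 flow difference summed over static pixels only (dyn_mask True = dynamic)."""
    static = 1.0 - dyn_mask.astype(float)
    return float((static * np.abs(flow_pred - flow_est).sum(axis=-1)).sum())


# Toy scene: a fronto-parallel plane at depth 2 seen by camera a at the world origin.
H, W, depth = 3, 4, 2.0
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]])
v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)
X_a = depth * (pix @ np.linalg.inv(K).T)
P_same = np.hstack([np.eye(3), np.zeros((3, 1))])
P_shift = np.hstack([np.eye(3), [[-0.1], [0.0], [0.0]]])   # points shift by -0.1 along x in camera b
flow_same = induced_flow(X_a, P_same, K)
flow_shift = induced_flow(X_a, P_shift, K)
loss = masked_flow_l1(flow_shift, np.zeros((H, W, 2)), np.zeros((H, W), dtype=bool))
```

With identical cameras the induced flow is zero, while the small translation produces a uniform horizontal flow of $f_x \cdot (-0.1) / \text{depth} = -5$ pixels on this plane.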
4. Experiments
We evaluate our method in a variety of tasks, including dy-
namic object segmentation (Section 4.1), camera pose es-
timation (Section 4.2) and 4D reconstruction (Section 4.3).
Ablation studies are provided in the supplementary material.
Baselines. We compare Easi3R with state-of-the-art pose-free
4D reconstruction methods, including DUSt3R [64],
MonST3R [73], DAS3R [68], and CUT3R [63]. Among
these works, the latter three are concurrent works that also
aim to extend DUSt3R to handle dynamic videos, but take
a different approach by fine-tuning on dynamic datasets,
such as [20, 54, 65], and optimization under the supervi-
sion of optical flow [66]. Unlike previous work, our method
performs a second inference pass on top of the pre-trained
DUSt3R or MonST3R model without requiring fine-tuning
or optimization on additional data.
4.1. Dynamic Object Motion
Figure 5. Qualitative Results of Dynamic Object Segmentation. "Ours" refers to the Easi3R_monst3r setting. Here, we present the enhanced setting, where outputs from different methods serve as prompts and are used with SAM2 [46] for mask inference. [Columns: Video Frame, MonST3R [73], DAS3R [68], Ours, GT]
Table 1. Dynamic Object Segmentation on the DAVIS dataset. The best and second best results are bold and underlined, respectively. Easi3R_dust3r/monst3r denotes the Easi3R experiment with the backbones of DUSt3R/MonST3R.
                              DAVIS-16                DAVIS-17               DAVIS-all        
                          w/o SAM2     w/ SAM2    w/o SAM2     w/ SAM2    w/o SAM2     w/ SAM2
Method          Flow     JM↑   JR↑   JM↑   JR↑   JM↑   JR↑   JM↑   JR↑   JM↑   JR↑   JM↑   JR↑
DUSt3R [64]     ✓       42.1  45.7  58.5  63.4  35.2  35.3  48.7  50.2  35.9  34.0  47.6  48.7
MonST3R [73]    ✓       40.9  42.2  64.3  73.3  38.6  38.2  56.4  59.6  36.7  34.3  51.9  54.1
DAS3R [68]      ✗       41.6  39.0  54.2  55.8  43.5  42.1  57.4  61.3  43.4  38.7  53.9  54.8
Easi3R_dust3r   ✗       53.1  60.4  67.9  71.4  49.0  56.4  60.1  65.3  44.5  49.6  54.7  60.6
Easi3R_monst3r  ✗       57.7  71.6  70.7  79.9  56.5  68.6  67.9  76.1  53.0  63.4  63.1  72.6
Table 2. Benefits of Easi3R on Camera Pose Estimation on the DyCheck, ADT and TUM-dynamics datasets. The best and second best results are bold and underlined, respectively. Easi3R_dust3r/monst3r denotes the Easi3R experiment with the backbones of DUSt3R/MonST3R.
                             DyCheck                ADT             TUM-dynamics     
Method          Flow     ATE↓   RTE↓   RRE↓   ATE↓   RTE↓   RRE↓   ATE↓   RTE↓   RRE↓
DUSt3R [64]     ✗       0.035  0.030  2.323  0.042  0.025  1.212  0.100  0.087  2.692
Easi3R_dust3r   ✗       0.029  0.025  1.774  0.040  0.021  0.880  0.093  0.076  2.366
DUSt3R [64]     ✓       0.029  0.021  1.875  0.076  0.030  0.974  0.071  0.067  3.711
Easi3R_dust3r   ✓       0.021  0.014  1.092  0.042  0.015  0.655  0.070  0.061  2.361
MonST3R [73]    ✗       0.040  0.034  1.820  0.045  0.024  0.759  0.183  0.148  6.985
Easi3R_monst3r  ✗       0.038  0.032  1.736  0.045  0.024  0.715  0.184  0.149  6.311
MonST3R [73]    ✓       0.033  0.024  1.501  0.055  0.025  0.776  0.170  0.155  6.455
Easi3R_monst3r  ✓       0.030  0.021  1.390  0.039  0.016  0.640  0.168  0.150  5.925
Table 3. Quantitative Comparisons of Camera Pose Estimation on the DyCheck, ADT and TUM-dynamics datasets. The best and second best results are bold and underlined, respectively.
                             DyCheck                ADT             TUM-dynamics     
Method          Flow     ATE↓   RTE↓   RRE↓   ATE↓   RTE↓   RRE↓   ATE↓   RTE↓   RRE↓
DUSt3R [64]     ✗       0.035  0.030  2.323  0.042  0.025  1.212  0.100  0.087  2.692
CUT3R [63]      ✗       0.029  0.020  1.383  0.084  0.025  0.490  0.079  0.088  10.41
MonST3R [73]    ✓       0.033  0.024  1.501  0.055  0.025  0.776  0.170  0.155  6.455
DAS3R [68]      ✓       0.033  0.022  1.467  0.040  0.017  0.685  0.173  0.157  8.341
Easi3R_monst3r  ✓       0.030  0.021  1.390  0.039  0.016  0.640  0.168  0.150  5.925
Easi3R_dust3r   ✓       0.021  0.014  1.092  0.042  0.015  0.655  0.070  0.061  2.361
We represent object motion as a segmentation sequence and evaluate performance on the video object segmentation benchmark DAVIS-16 [39], the more challenging DAVIS-17 [43], and DAVIS-all. We present two experiment settings: direct evaluation of network outputs and an enhanced setting where outputs serve as prompts for SAM2 [46], improving results. These settings are denoted as w/o and w/ SAM2 in Table 1, respectively. Following DAVIS [39], we evaluate performance using IoU mean (JM) and IoU recall (JR) metrics. Since DUSt3R originally does not support dynamic object segmentation, we extend it as a baseline by incorporating the same flow-guided segmentation as MonST3R. By applying our attention-guided decomposition, both DUSt3R and MonST3R show improved segmentation, without the need for flow, even surpassing DAS3R, which is explicitly trained on dynamic mask labels.
Qualitative Results. Figure 5 presents the qualitative comparison between our method and existing approaches. Since MonST3R relies on optical flow estimation, it struggles in textureless regions, failing to disentangle dynamic objects from the background (e.g., koala, rhino, sheep). On the other hand, DAS3R learns a mask head for dynamic segmentation but tends to over-segment in most cases. Our method, built on DUSt3R and enhanced with our Easi3R attention-guided decomposition, accurately segments dynamic objects while maintaining robustness in handling textureless regions (e.g., trunks, rocks, walls), small dynamic objects (e.g., goose), and casual motions (e.g., girls, pedestrian). The results provide a surprising insight that 3D models, such as DUSt3R in our case, may inherently possess a strong understanding of the scene and can generalize well to standard 2D tasks.
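For reference, the IoU mean (JM) and IoU recall (JR) metrics reported in Table 1 can be computed as in the following sketch. It assumes the common DAVIS convention that recall is the fraction of frames whose IoU exceeds 0.5 and that an empty prediction matched with an empty ground truth counts as IoU 1; both conventions are stated here as assumptions rather than taken from the evaluation code used in the paper.

```python
import numpy as np


def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter / union)   # empty vs. empty counted as 1 (assumption)


def jaccard_mean_recall(preds, gts, thresh: float = 0.5):
    """Mean IoU over frames and fraction of frames with IoU above thresh."""
    scores = np.array([iou(p, g) for p, g in zip(preds, gts)])
    return float(scores.mean()), float((scores > thresh).mean())


gt = [np.array([[1, 1], [0, 0]], dtype=bool), np.array([[1, 0], [0, 0]], dtype=bool)]
pred = [np.array([[1, 1], [0, 0]], dtype=bool), np.array([[1, 1], [0, 0]], dtype=bool)]
jm, jr = jaccard_mean_recall(pred, gt)
```

In this toy example the first frame is segmented perfectly and the second has an IoU of 0.5, so the mean is 0.75 while only one of the two frames exceeds the recall threshold.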
4.2. Camera Motion
We evaluate camera motion by using the estimated extrin-
sic sequence on three dynamic benchmarks: DyCheck [14],
TUM-dynamics [53], and ADT [23, 38] datasets. Specifi-
cally, the ADT dataset features egocentric videos, which are
out-of-distribution for DUSt3R’s training set. The DyCheck
dataset includes diverse, in-the-wild dynamic videos cap-
tured from handheld cameras. The TUM-dynamics dataset
contains major dynamic objects in relatively simple indoor
scenarios. Instead of evaluating video clips as in previous
methods, we adopt a more challenging setting by processing entire sequences. Specifically, we downsample frames at different rates: every 5 frames for ADT, 10 for DyCheck, and 30 for TUM-dynamics, resulting in approximately 40 frames. We use standard error metrics: Absolute Translation Error (ATE), Relative Translation Error (RTE), and Relative Rotation Error (RRE), after applying Sim(3) alignment [60] of the estimated camera trajectory to the ground truth (GT).
Table 4. Benefits of Easi3R on Point Cloud Reconstruction on the DyCheck dataset. The best and second best results are bold and underlined, respectively. Easi3R_dust3r/monst3r denotes the Easi3R experiment with the backbones of DUSt3R/MonST3R.
                         Accuracy ↓     Completeness ↓     Distance ↓   
Method          Flow      Mean  Median    Mean  Median    Mean  Median
DUSt3R [64]     ✗        0.802   0.595   1.950   0.815   0.353   0.233
Easi3R_dust3r   ✗        0.772   0.596   1.813   0.757   0.336   0.219
DUSt3R [64]     ✓        0.738   0.599   1.669   0.678   0.313   0.196
Easi3R_dust3r   ✓        0.703   0.589   1.474   0.586   0.301   0.186
MonST3R [73]    ✗        0.855   0.693   1.916   1.035   0.398   0.295
Easi3R_monst3r  ✗        0.846   0.660   1.840   0.983   0.390   0.290
MonST3R [73]    ✓        0.851   0.689   1.734   0.958   0.353   0.254
Easi3R_monst3r  ✓        0.834   0.643   1.661   0.916   0.350   0.255
Table 5. Quantitative Comparisons of Point Cloud Reconstruction on the DyCheck dataset. The best and second best results are bold and underlined, respectively.
                         Accuracy ↓     Completeness ↓     Distance ↓   
Method          Flow      Mean  Median    Mean  Median    Mean  Median
DUSt3R [64]     ✗        0.802   0.595   1.950   0.815   0.353   0.233
CUT3R [63]      ✗        0.458   0.342   1.633   0.792   0.326   0.229
MonST3R [73]    ✓        0.851   0.689   1.734   0.958   0.353   0.254
DAS3R [68]      ✓        1.772   1.438   2.503   1.548   0.475   0.352
Easi3R_monst3r  ✓        0.834   0.643   1.661   0.916   0.350   0.255
Easi3R_dust3r   ✓        0.703   0.589   1.474   0.586   0.301   0.186
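The ATE metric used in Tables 2 and 3 is computed after a Sim(3) alignment of the estimated trajectory to the ground truth. The sketch below uses the closed-form Umeyama solution for the alignment and reports the root-mean-square position error, which is a common convention; the exact evaluation script used for the tables is not given in the text, so the function names and the RMSE choice are assumptions.

```python
import numpy as np


def umeyama_alignment(src: np.ndarray, dst: np.ndarray):
    """Least-squares similarity (scale, R, t) with dst ~ scale * R @ src + t; inputs are (N, 3)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_s, dst - mu_d
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                       # flip the last axis to avoid a reflection
    R = U @ S @ Vt
    scale = (D * np.diag(S)).sum() / (src_c ** 2).sum(axis=1).mean()
    t = mu_d - scale * R @ mu_s
    return scale, R, t


def absolute_translation_error(est: np.ndarray, gt: np.ndarray) -> float:
    """RMSE between Sim(3)-aligned estimated camera positions and ground truth."""
    scale, R, t = umeyama_alignment(est, gt)
    aligned = scale * est @ R.T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))


# Toy check: an estimate that differs from the ground truth only by a similarity transform.
rng = np.random.default_rng(0)
gt_positions = rng.normal(size=(40, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1.0                          # make Q a proper rotation
est_positions = 1.7 * gt_positions @ Q.T + np.array([0.3, -1.0, 2.0])
ate = absolute_translation_error(est_positions, gt_positions)
```

Because the toy estimate is an exact similarity transform of the ground truth, the aligned error is zero up to floating-point precision; real trajectories leave a residual that the metric reports.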
Benefits from Easi3R. We use DUSt3R and MonST3R
without optical flow as the backbone settings, indepen-
dently analyzing the benefits that Easi3R offers for each.
We show qualitative comparisons of the estimation of the
camera trajectory (Figure 7) and quantitative pose accuracy
in Table 2, including w/ and w/o flow settings. Easi3R
demonstrates more accurate and robust camera pose and tra-
jectory estimation over both backbones and settings. Our
Easi3R effectively leverages the inherent knowledge of
DUSt3R with just a few lines of code, even achieving an
improvement compared to models with optical flow prior.
Comparison. In Table 3, we compare Easi3R with state-of-the-art
variants of DUSt3R. Unlike the plug-and-play
comparison in Table 2, where we optionally disable the
optical flow prior for a fair evaluation, here we report
baseline performance using their original experimental settings;
whether the flow model is used is specified
in the second column. For clarity, we denote the setting
“MonST3R + Easi3R” as Easi3R$_\text{monst3r}$ and “DUSt3R
+ Easi3R” as Easi3R$_\text{dust3r}$. Notably, our approach achieves
significant improvements and delivers the best overall performance
among all methods, without any fine-tuning on
additional dynamic datasets or mask labels.
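As an illustration of the pose evaluation protocol above, the following minimal sketch aligns an estimated trajectory to the ground truth with a Sim(3) transform in the style of [60] and then computes ATE. It is not the evaluation code used in the paper: the function names `umeyama_sim3` and `ate_rmse` are hypothetical, and reporting ATE as the root-mean-square position error is an assumption. RTE and RRE are omitted for brevity.

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Least-squares scale s, rotation R, translation t with dst ~ s * R @ src + t."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)              # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.ones(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # avoid returning a reflection
        S[2] = -1.0
    R = U @ np.diag(S) @ Vt
    var_src = (src_c ** 2).sum() / len(src)       # variance of the source points
    s = (D * S).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t

def ate_rmse(est, gt):
    """ATE after Sim(3) alignment (assumed here to be an RMSE over camera positions)."""
    s, R, t = umeyama_sim3(est, gt)
    aligned = s * est @ R.T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))

# Toy usage: the estimate is the ground truth under an unknown similarity transform.
rng = np.random.default_rng(0)
gt = rng.normal(size=(50, 3))
a = 0.7
Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
               [np.sin(a),  np.cos(a), 0.0],
               [0.0,        0.0,       1.0]])
est = (gt - np.array([1.0, -2.0, 0.5])) @ Rz / 2.5
scale, _, _ = umeyama_sim3(est, gt)
error = ate_rmse(est, gt)
```

In this toy case the alignment should recover the scale factor and drive the error to numerical zero; on real trajectories the residual after alignment is what ATE measures.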
4.3. 4D Reconstruction
We evaluate 4D reconstruction on DyCheck [14] by measur-
ing distances to ground-truth point clouds. Following prior
work [1, 8, 61], we use accuracy, completeness, and distance
metrics. Accuracy is the nearest Euclidean distance
from a reconstructed point to ground truth, completeness is
the reverse, and distance is the Euclidean distance based on
ground-truth point matching.
Page 8:
[Figure 6 image: columns Video, CUT3R [63], MonST3R [73], DAS3R [68], and Ours; rows are sequences of 86, 75, 52, and 80 frames.]
Figure 6. Qualitative Comparison. We visualize cross-frame globally aligned static scenes with dynamic point clouds at a selected
timestamp. Notably, instead of using ground-truth dynamic masks as in previous work, we apply the estimated per-frame dynamic masks to
filter out dynamic points at other timestamps for comparison. Our method (top two and bottom two rows as Easi3R$_\text{dust3r}$ and Easi3R$_\text{monst3r}$, respectively)
achieves temporally consistent reconstruction of both static scenes and moving objects, whereas baselines suffer from static structure
misalignment, unstable camera pose estimation, and ghosting artifacts due to inaccurate estimation of dynamic segmentation.
[Figure 7 image: two camera trajectory plots with axes X (m) and Y (m); left legend: Ground Truth, DUSt3R, Easi3R$_\text{dust3r}$; right legend: Ground Truth, MonST3R, Easi3R$_\text{monst3r}$.]
Figure 7. Visualization of estimated camera trajectories. Our
robustly estimated camera trajectory (orange) deviates less from the
ground truth (gray) compared to the original backbones (blue).
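The accuracy, completeness, and distance metrics of Section 4.3 can be sketched with brute-force nearest-neighbor search as below. This is an illustrative sketch rather than the benchmark implementation: the function names are hypothetical, the point clouds are assumed to be already aligned, and `matched_distance` assumes that `pred[i]` and `gt[i]` are corresponding points.

```python
import math
from statistics import mean, median

def nearest_distances(src, dst):
    """For every point in src, the Euclidean distance to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def reconstruction_metrics(pred, gt):
    """Mean and median of accuracy (pred -> gt) and completeness (gt -> pred)."""
    acc = nearest_distances(pred, gt)
    comp = nearest_distances(gt, pred)
    return {
        "accuracy": (mean(acc), median(acc)),
        "completeness": (mean(comp), median(comp)),
    }

def matched_distance(pred, gt):
    """Mean and median distance when pred[i] and gt[i] are known to correspond."""
    d = [math.dist(p, q) for p, q in zip(pred, gt)]
    return mean(d), median(d)

# Toy usage: the reconstruction misses one ground-truth point.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
metrics = reconstruction_metrics(pred, gt)
```

In the toy example every reconstructed point lies exactly on the ground truth, so accuracy is perfect, while the missing ground-truth point only degrades completeness. For real point clouds a spatial index such as a k-d tree would replace the quadratic search.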
Quantitative Results. Tables 4 and 5 show the benefits of
Easi3R: it reconstructs more accurately than its backbones
and outperforms most baselines, and is even comparable
to the concurrent CUT3R [63], which is trained on
many large-scale datasets.
Qualitative Results. We also compare the reconstruction
quality of our method with CUT3R [63], MonST3R [73]
and DAS3R [68] in Figure 6. All baselines struggle with
misalignment and entanglement of dynamic and static reconstructions,
resulting in broken geometry, distortions, and
ghosting artifacts. The key to our success lies in: (1)
attention-guided segmentation for robust motion disentan-
glement, (2) attention re-weighting for improved pairwise
reconstruction, and (3) segmentation-aware global align-
ment for enhanced overall quality.
5. Conclusion
We presented Easi3R, an adaptation of DUSt3R that builds on
the spatial and temporal attention mechanisms behind
DUSt3R to achieve training-free and robust 4D reconstruction.
We observed that the attention maps are compositional,
and proposed a simple yet effective decomposition
strategy that isolates the textureless, under-observed,
and dynamic-object components, allowing for robust
dynamic object segmentation. With the segmentation, we
perform a second inference pass with attention re-weighting,
enabling robust dynamic 4D reconstruction and
camera motion recovery at almost no additional cost
on top of DUSt3R. Surprisingly, our experimental results
demonstrate that Easi3R outperforms state-of-the-art methods
in most cases. We hope that our findings on attention
map disentanglement can inspire other tasks.
Page 9:
Acknowledgments. We thank the members of Inception3D
and Endless AI Labs for their help and discussions. Xingyu
Chen and Yue Chen are funded by the Westlake Education
Foundation. Xingyu Chen is also supported by the Natu-
ral Science Foundation of Zhejiang province, China (No.
QKWL25F0301). Yuliang Xiu received funding from the
Max Planck Institute for Intelligent Systems. Anpei Chen
and Andreas Geiger are supported by the ERC Starting
Grant LEGO-3D (850533) and DFG EXC number 2064/1 -
project number 390727645.
References
[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis,
Engin Tola, and Anders Bjorholm Dahl. Large-scale data for
multiple-view stereopsis. Proc. of the IEEE International
Conf. on Computer Vision (ICCV) , 2016. 7
[2] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Si-
mon, Brian Curless, Steven M Seitz, and Richard Szeliski.
Building rome in a day. ACM Communications , 2011. 2
[3] Sameer Agarwal, Noah Snavely, Steven M Seitz, and
Richard Szeliski. Bundle adjustment in the large. In Proc. of
the European Conf. on Computer Vision (ECCV) , 2010. 2
[4] Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung,
Jisang Han, Sunghwan Hong, and Seungryong Kim. Cross-
view completion models are zero-shot correspondence esti-
mators. arXiv , 2412.09072, 2024. 2
[5] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc
Van Gool. Speeded-up robust features (surf). Computer vi-
sion and image understanding , 2008. 2
[6] Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari,
Áron Monszpart, Daniyar Turmukhambetov, and Victor
Adrian Prisacariu. Scene coordinate reconstruction: Posing
of image collections via incremental learning of a relocalizer.
In Proc. of the European Conf. on Computer Vision
(ECCV), 2024. 2
[7] Thomas Brox and Jitendra Malik. Object segmentation by
long term analysis of point trajectories. In Proc. of the Euro-
pean Conf. on Computer Vision (ECCV) , 2010. 2
[8] Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll,
and Yuliang Xiu. Feat2gs: Probing visual foundation models
with gaussian splatting. arXiv , 2412.09606, 2024. 7
[9] Andrew J Davison, Ian D Reid, Nicholas D Molton, and
Olivier Stasse. Monoslam: Real-time single camera slam.
IEEE Trans. on Pattern Analysis and Machine Intelligence
(PAMI) , 2007. 2
[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi-
novich. Superpoint: Self-supervised interest point detection
and description. In CVPRW , 2018. 2
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl-
vain Gelly, et al. An image is worth 16x16 words: Trans-
formers for image recognition at scale. arXiv preprint
arXiv:2010.11929 , 2020. 2, 3
[12] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct
sparse odometry. IEEE Trans. on Pattern Analysis and Ma-
chine Intelligence (PAMI) , 2017. 2
[13] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam:
Large-scale direct monocular slam. In Proc. of the
European Conf. on Computer Vision (ECCV), 2014. 2
[14] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell,
and Angjoo Kanazawa. Monocular dynamic view synthesis:
A reality check. Advances in Neural Information Processing
Systems , 2022. 7
[15] Lily Goli, Sara Sabour, Mark Matthews, Marcus Brubaker,
Dmitry Lagun, Alec Jacobson, David J Fleet, Saurabh Sax-
ena, and Andrea Tagliasacchi. Romo: Robust motion seg-
mentation improves structure from motion. arXiv preprint
arXiv:2411.18650 , 2024. 3
[16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Gir-
shick. Mask R-CNN. In IEEE International Conference on
Computer Vision (ICCV) , 2017. 2
[17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman,
Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image
editing with cross-attention control. In International Confer-
ence on Learning Representations (ICLR) , 2023. 2
[18] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong
Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter:
Generating consistent long depth sequences for open-world
videos. arXiv preprint arXiv:2409.02095 , 2024. 2
[19] Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah
Snavely, and Aleksander Holynski. Stereo4d: Learning
how things move in 3d from internet stereo videos. arXiv ,
2412.09621, 2024. 1
[20] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia
Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy-
namicstereo: Consistent dynamic depth from stereo videos.
In Proc. IEEE Conf. on Computer Vision and Pattern Recognition
(CVPR), 2023. 6
[21] Laurynas Karazija, Iro Laina, Christian Rupprecht, and An-
drea Vedaldi. Learning segmentation from point trajectories.
arXiv preprint arXiv:2501.12392 , 2025. 2
[22] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust
consistent video depth estimation. In Proc. IEEE Conf. on
Computer Vision and Pattern Recognition (CVPR) , 2021. 1,
2
[23] Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward,
Joao Carreira, Andrew Zisserman, Gabriel Brostow, and Carl
Doersch. Tapvid-3d: A benchmark for tracking any point in
3d, 2024. arXiv preprint arXiv:2407.05921 . 7
[24] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground-
ing image matching in 3d with MASt3R. In European Con-
ference on Computer Vision (ECCV) , 2024. 2
[25] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang,
Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holyn-
ski, and Noah Snavely. MegaSaM: accurate, fast, and robust
structure and motion from casual dynamic videos. arXiv ,
2412.04463, 2024. 1, 2
[26] Long Lian, Zhirong Wu, and Stella X Yu. Bootstrapping
objectness from videos by relaxed common fate and visual
grouping. In Proc. IEEE Conf. on Computer Vision and Pat-
tern Recognition (CVPR) , 2023. 2
[27] Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yanchao Yang,
Qingnan Fan, and Baoquan Chen. SLAM3R: real-time dense
scene reconstruction from monocular RGB videos. arXiv ,
2412.09401, 2024. 2
[28] David G Lowe. Distinctive image features from scale-invariant
keypoints. International journal of computer vision,
2004. 2
Page 10:
[29] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen,
and Johannes Kopf. Consistent video depth estimation. ACM
Trans. on Graphics , 2020. 2
[30] Quan-Tuan Luong and Olivier D Faugeras. The fundamental
matrix: Theory, algorithms, and stability analysis. Proc. of
the IEEE International Conf. on Computer Vision (ICCV) ,
1996. 3
[31] Etienne Meunier, Anaïs Badoual, and Patrick Bouthemy.
Em-driven unsupervised learning for efficient motion seg-
mentation. IEEE Trans. on Pattern Analysis and Machine
Intelligence (PAMI) , 2022. 2
[32] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D
Tardos. Orb-slam: a versatile and accurate monocular slam
system. IEEE transactions on robotics , 2015. 2
[33] Riku Murai, Eric Dexheimer, and Andrew J. Davison.
MASt3R-SLAM: real-time dense SLAM with 3d reconstruc-
tion priors. arXiv , 2412.12392, 2024. 2
[34] Richard A Newcombe, Steven J Lovegrove, and Andrew J
Davison. Dtam: Dense tracking and mapping in real-time.
In Proc. of the IEEE International Conf. on Computer Vision
(ICCV) , 2011. 2
[35] Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation
of moving objects by long term video analysis. IEEE Trans.
on Pattern Analysis and Machine Intelligence (PAMI) , 2013.
2
[36] Nobuyuki Otsu et al. A threshold selection method from
gray-level histograms. Automatica , 11(285-296):23–27,
1975. 1
[37] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes
Lutz Schönberger. Global structure-from-motion
revisited. In European Conference on Computer Vision
(ECCV) , 2024. 1
[38] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Pe-
ters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard
Newcombe, and Yuheng Carl Ren. Aria digital twin: A new
benchmark dataset for egocentric 3d machine perception. In
Proc. of the IEEE International Conf. on Computer Vision
(ICCV) , 2023. 7
[39] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc
Van Gool, Markus Gross, and Alexander Sorkine-Hornung.
A benchmark dataset and evaluation methodology for video
object segmentation. In Proc. IEEE Conf. on Computer Vi-
sion and Pattern Recognition (CVPR) , 2016. 6
[40] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia
Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth:
universal monocular metric depth estimation. In Computer
Vision and Pattern Recognition (CVPR) , 2024. 2
[41] Marc Pollefeys, Reinhard Koch, and Luc Van Gool. Self-
calibration and metric reconstruction inspite of varying and
unknown intrinsic camera parameters. International Journal
of Computer Vision , 1999. 2
[42] Marc Pollefeys, Luc Van Gool, Maarten Vergauwen, Frank
Verbiest, Kurt Cornelis, Jan Tops, and Reinhard Koch. Visual
modeling with a hand-held camera. International Journal of
Computer Vision , 2004. 2
[43] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez,
Alex Sorkine-Hornung, and Luc Van Gool. The 2017
davis challenge on video object segmentation. arXiv preprint
arXiv:1704.00675, 2017. 6
[44] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision
transformers for dense prediction. In Proc. of the IEEE
International Conf. on Computer Vision (ICCV), 2021. 2, 3
[45] René Ranftl, Katrin Lasinger, David Hafner, Konrad
Schindler, and Vladlen Koltun. Towards robust monocular
depth estimation: Mixing datasets for zero-shot cross-dataset
transfer. IEEE Trans. on Pattern Analysis and Machine In-
telligence (PAMI) , 2020. 2
[46] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang
Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman
Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2:
Segment anything in images and videos. arXiv preprint
arXiv:2408.00714 , 2024. 3, 6
[47] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz,
and Andrew Rabinovich. Superglue: Learning feature
matching with graph neural networks. In Proc. IEEE Conf.
on Computer Vision and Pattern Recognition (CVPR) , 2020.
2
[48] Johannes Lutz Schönberger and Jan-Michael Frahm.
Structure-from-motion revisited. In CVPR , 2016. 1, 2, 3
[49] Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang,
Yujun Shen, Vitor Guizilini, Yue Wang, Matteo Poggi, and
Yiyi Liao. Learning temporally consistent video depth from
video diffusion priors. arXiv preprint arXiv:2406.01493 ,
2024. 2
[50] Yaser Sheikh, Omar Javed, and Takeo Kanade. Background
subtraction for freely moving cameras. In Proc. of the IEEE
International Conf. on Computer Vision (ICCV) , 2009. 2
[51] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo
tourism: exploring photo collections in 3d. In SIGGRAPH .
2006. 2
[52] Noah Snavely, Steven M Seitz, and Richard Szeliski. Model-
ing the world from internet photo collections. International
journal of computer vision , 2008. 2
[53] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram
Burgard, and Daniel Cremers. A benchmark for the evalua-
tion of rgb-d slam systems. In 2012 IEEE/RSJ international
conference on intelligent robots and systems , 2012. 7
[54] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou,
Yuning Chai, Benjamin Caine, et al. Scalability in perception
for autonomous driving: Waymo open dataset. In Proc. IEEE
Conf. on Computer Vision and Pattern Recognition (CVPR) ,
2020. 6
[55] Chengzhou Tang and Ping Tan. Ba-net: Dense bundle ad-
justment network. arXiv preprint arXiv:1806.04807 , 2018.
2
[56] Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu,
Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan.
MV-DUSt3R+: single-stage scene reconstruction from
sparse views in 2 seconds. arXiv , 2412.06974, 2024. 2
[57] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam
for monocular, stereo, and rgb-d cameras. Advances in Neu-
ral Information Processing Systems (NIPS) , 2021. 2
[58] Stefan Treue and John HR Maunsell. Attentional modula-
tion of visual motion processing in cortical areas mt and mst.
Nature , 1996. 2
Page 11:
[59] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and An-
drew W Fitzgibbon. Bundle adjustment—a modern synthe-
sis. In Vision Algorithms: Theory and Practice: Interna-
tional Workshop on Vision Algorithms Corfu, Greece , 2000.
2
[60] Shinji Umeyama. Least-squares estimation of transforma-
tion parameters between two point patterns. IEEE Trans. on
Pattern Analysis and Machine Intelligence (PAMI) , 1991. 7
[61] Hengyi Wang and Lourdes Agapito. 3d reconstruction with
spatial memory. arXiv , 2408.16061, 2024. 2, 7
[62] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and
David Novotny. Vggsfm: Visual geometry grounded deep
structure from motion. In Proc. IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR) , 2024. 2
[63] Qianqian Wang, Yifei Zhang, Aleksander Holynski,
Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d
perception model with persistent state. arXiv , 2501.12387,
2025. 1, 2, 6, 7, 8, 3
[64] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris
Chidlovskii, and Jerome Revaud. DUSt3R: geometric 3d vi-
sion made easy. In Computer Vision and Pattern Recognition
(CVPR) , 2024. 2, 6, 7
[65] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu,
Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Se-
bastian Scherer. Tartanair: A dataset to push the limits of
visual slam. In Proc. IEEE International Conf. on Intelligent
Robots and Systems (IROS) , 2020. 6
[66] Yihan Wang, Lahav Lipson, and Jia Deng. Sea-raft: Sim-
ple, efficient, accurate raft for optical flow. In Proc. of the
European Conf. on Computer Vision (ECCV) , 2024. 2, 3, 6
[67] Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting
moving objects via an object-centric layered representation.
Advances in neural information processing systems , 2022. 2
[68] Kai Xu, Tze Ho Elden Tse, Jizong Peng, and Angela Yao.
DAS3R: dynamics-aware gaussian splatting for static scene
reconstruction. arXiv , 2412.19584, 2024. 1, 2, 6, 7, 8, 3
[69] Jingyu Yan and Marc Pollefeys. A general framework for
motion segmentation: Independent, articulated, rigid, non-
rigid, degenerate and non-degenerate. In Proc. of the IEEE
International Conf. on Computer Vision (ICCV) , 2006. 2
[70] Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman,
and Weidi Xie. Self-supervised video object segmentation by
motion grouping. In Proc. of the IEEE International Conf.
on Computer Vision (ICCV) , 2021. 2
[71] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi
Feng, and Hengshuang Zhao. Depth anything: Unleashing
the power of large-scale unlabeled data. In Computer Vision
and Pattern Recognition (CVPR) , 2024. 2
[72] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi
Wei, and Fei Qiao. DS-SLAM: A semantic visual SLAM
towards dynamic environments. In Proc. IEEE International
Conf. on Intelligent Robots and Systems (IROS) , 2018. 2
[73] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jam-
pani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-
Hsuan Yang. MonST3R: a simple approach for estimating
geometry in the presence of motion. arXiv , 2410.03825,
2024. 1, 2, 3, 5, 6, 7, 8
[74] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Noah Snavely,
Michael Rubinstein, and William T. Freeman. Structure and
motion from casual videos. In Proc. of the European Conf.
on Computer Vision (ECCV) , 2022. 1, 2
[75] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and
Yong-Jin Liu. Particlesfm: Exploiting dense point trajecto-
ries for localizing moving cameras in the wild. In Proc. of
the European Conf. on Computer Vision (ECCV) , 2022. 2
Page 12:
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Supplementary Material
In this supplementary document, we first present additional
method details on temporally consistent dynamic
object segmentation in Appendix A. Next, we conduct ablation
studies of Easi3R in Appendix B and analyze limitations
in Appendix C. Lastly, we report additional qualitative
results in Appendix D. We invite readers to easi3r.github.io
for better visualization.
A. Dynamic Object Segmentation
We have presented dynamic object segmentation for a single
frame in Section 3.3. We now describe how to ensure
consistency along the temporal axis. Given the image feature
tokens $F_0^t$ for the frame at time $t$, output by the image encoder,
we concatenate them along the temporal dimension,
$$\bar{F} = [F_0^1; F_0^2; \ldots; F_0^T] \in \mathbb{R}^{(T \times h \times w) \times c} \tag{12}$$
where $c$ is the feature dimension of the tokens. This allows
us to apply k-means clustering to group similar features
across frames, producing cluster assignments,
$$C = \mathrm{KMeans}(\bar{F}, k), \quad C^t(x, y) \in \{1, \ldots, k\}, \ \forall t, x, y \tag{13}$$
where $k$ is the number of clusters; we use $k = 64$ for all
experiments.
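The cross-frame clustering of Eqs. (12) and (13) can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: the function name `kmeans_tokens` is hypothetical, the encoder tokens are assumed to be available as an array of shape (T, h, w, c), labels are zero-based, and the farthest-point initialization is an assumption made only to keep the toy example deterministic.

```python
import numpy as np

def kmeans_tokens(feats, k, iters=20):
    """Cluster encoder tokens of shape (T, h, w, c) jointly over all frames."""
    T, h, w, c = feats.shape
    flat = feats.reshape(T * h * w, c)            # F-bar in Eq. (12)

    # Farthest-point initialization (an assumption, not specified in the text).
    centers = [flat[0]]
    for _ in range(1, k):
        d = np.min([((flat - m) ** 2).sum(axis=1) for m in centers], axis=0)
        centers.append(flat[int(np.argmax(d))])
    centers = np.stack(centers)

    # Plain Lloyd iterations: assign tokens, then recompute cluster means.
    for _ in range(iters):
        dist = ((flat[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):               # keep old center if cluster is empty
                centers[j] = flat[labels == j].mean(axis=0)
    return labels.reshape(T, h, w)                # C^t(x, y) in Eq. (13), zero-based

# Toy usage: the right half of every frame carries a distinct feature.
T, h, w, c = 3, 2, 4, 5
feats = np.zeros((T, h, w, c))
feats[:, :, 2:, :] = 10.0
feats += 0.01 * np.arange(T * h * w * c).reshape(T, h, w, c) / (T * h * w * c)
labels = kmeans_tokens(feats, k=2)
```

Because all frames are clustered together, tokens with similar features receive the same label in every frame, which is what makes the later per-cluster averaging act as a temporal fusion step. The dense distance matrix is only practical for small inputs; a library k-means would be the natural choice at k = 64.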
For each cluster $c \in \{1, \ldots, k\}$, we compute a dynamic
score $s_c$ by averaging the base dynamic attention values of
all tokens within that cluster:
$$s_c = \frac{\sum_t \sum_{x,y} \mathbb{1}[C^t(x, y) = c] \cdot A^{t=\mathrm{dyn}}(x, y)}{\sum_t \sum_{x,y} \mathbb{1}[C^t(x, y) = c]} \tag{14}$$
where $\mathbb{1}[\cdot]$ denotes the indicator function. We then use these
scores to generate a cluster-fused dynamic attention map,
mapping each pixel’s cluster assignment back to its corresponding
dynamic score,
$$A^{t=\mathrm{dyn}}_{\mathrm{fuse}}(x, y) = s_{C^t(x, y)} \tag{15}$$
The refined dynamic attention map $A^{t=\mathrm{dyn}}_{\mathrm{fuse}} \in \mathbb{R}^{h \times w}$ is
used to infer the dynamic object segmentation by
$$M^t(x, y) = \mathbb{1}[A^{t=\mathrm{dyn}}_{\mathrm{fuse}}(x, y) > \alpha] \tag{16}$$
where $\alpha$ is a threshold selected automatically using Otsu’s
method [36]. The resulting dynamic object segmentation
is further utilized in the second inference pass and global
optimization.
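Eqs. (14) to (16) amount to a per-cluster average, a lookup, and a threshold. The sketch below illustrates this with NumPy under stated assumptions: the function names are hypothetical, cluster labels are zero-based, the Otsu threshold is a simple histogram-based version written out here, and a single threshold is computed over all frames (whether thresholding is done per frame or jointly is not specified in the text).

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the histogram split that maximizes between-class variance."""
    hist, edges = np.histogram(values, bins=bins)
    mids = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist)                          # samples at or below each bin
    w1 = w0[-1] - w0                              # samples above each bin
    m0 = np.cumsum(hist * mids)
    mu0 = m0 / np.maximum(w0, 1)
    mu1 = (m0[-1] - m0) / np.maximum(w1, 1)
    between = w0 * w1 * (mu0 - mu1) ** 2
    return edges[int(np.argmax(between)) + 1]     # upper edge of the best split bin

def fuse_dynamic_attention(labels, attn, k):
    """labels: (T, h, w) ints in [0, k); attn: (T, h, w) dynamic attention values."""
    sums = np.bincount(labels.ravel(), weights=attn.ravel(), minlength=k)
    counts = np.bincount(labels.ravel(), minlength=k)
    scores = sums / np.maximum(counts, 1)         # s_c, Eq. (14)
    fused = scores[labels]                        # fused attention map, Eq. (15)
    alpha = otsu_threshold(fused)                 # one threshold for all frames (assumption)
    return fused, fused > alpha                   # dynamic mask M^t, Eq. (16)

# Toy usage: T = 2 frames of size 1 x 4 with k = 2 clusters.
labels = np.array([[[0, 0, 1, 1]], [[0, 1, 1, 1]]])
attn = np.array([[[0.0, 0.2, 0.9, 0.7]], [[0.1, 0.8, 1.0, 0.6]]])
fused, mask = fuse_dynamic_attention(labels, attn, k=2)
```

In the toy example the noisy per-pixel attention values are replaced by their cluster means, so every pixel of a cluster receives the same score and therefore the same mask value in every frame.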
[Figure 8 image: two video examples.]
Figure 8. Benefits of Cross-frame Feature Clustering. We visualize
the dynamic attention map $A^{t=\mathrm{dyn}}$, cluster assignments $C^t$,
and cluster-fused dynamic attention map $A^{t=\mathrm{dyn}}_{\mathrm{fuse}}$. Features from
the DUSt3R encoder exhibit temporal consistency, as cluster assignments
($C^t$) remain unchanged across frames, thereby enhancing
temporal consistency in dynamic segmentation ($A^{t=\mathrm{dyn}}_{\mathrm{fuse}}$)
through clustering-guided temporal fusing. For better visual intuition,
we invite readers to easi3r.github.io.
Table 6. Ablation of Dynamic Object Segmentation on DAVIS.
                                                 DAVIS-16        DAVIS-17        DAVIS-all
Backbone   Ablation                              JM ↑    JR ↑    JM ↑    JR ↑    JM ↑    JR ↑
DUSt3R     w/o $A^{a=\mathrm{src}}_{\mu}$        45.1    45.2    42.8    39.9    42.2    38.5
DUSt3R     w/o $A^{a=\mathrm{src}}_{\sigma}$     42.3    50.0    35.0    37.0    30.9    28.3
DUSt3R     w/o $A^{a=\mathrm{ref}}_{\mu}$        33.3    28.4    31.5    27.9    32.5    29.7
DUSt3R     w/o $A^{a=\mathrm{ref}}_{\sigma}$     47.7    54.1    46.2    54.3    43.7    48.6
DUSt3R     w/o Clustering                        40.0    38.5    38.3    38.3    34.3    30.5
DUSt3R     Full                                  53.1    60.4    49.0    56.4    44.5    49.6
MonST3R    w/o $A^{a=\mathrm{src}}_{\mu}$        47.2    51.5    44.4    46.7    40.9    41.5
MonST3R    w/o $A^{a=\mathrm{src}}_{\sigma}$     49.7    60.1    48.7    57.8    44.9    49.6
MonST3R    w/o $A^{a=\mathrm{ref}}_{\mu}$        46.4    54.0    47.4    55.9    45.3    50.7
MonST3R    w/o $A^{a=\mathrm{ref}}_{\sigma}$     50.7    62.6    51.0    60.2    50.3    56.8
MonST3R    w/o Clustering                        45.5    46.7    45.3    48.1    42.1    43.5
MonST3R    Full                                  57.7    71.6    56.5    68.6    53.0    63.4
B. Ablation Study
Our ablation is twofold: dynamic object segmentation
and 4D reconstruction. For dynamic object segmentation,
as shown in Table 6, we ablate the contribution
of the four aggregated temporal cross-attention maps,
$A^{a=\mathrm{src}}_{\mu}$, $A^{a=\mathrm{src}}_{\sigma}$, $A^{a=\mathrm{ref}}_{\mu}$, $A^{a=\mathrm{ref}}_{\sigma}$, and of feature clustering. The
ablation results show that (1) disabling any temporal cross-attention
map leads to a performance drop, indicating that
all attention maps contribute to improved dynamic object
segmentation; and (2) features from the DUSt3R encoder
exhibit temporal consistency and enhance dynamic segmentation
through cross-frame clustering.
Table 7 presents ablation studies on 4D reconstruction,
evaluating two key design choices: (1) the impact of two-
branch re-weighting (applying attention re-weighting to
both reference and source decoders) and (2) global align-
ment using optical flow with and without segmentation. The
Page 13:
Table 7. Ablation Study of Camera Pose Estimation and Point Cloud Reconstruction on the DyCheck dataset.
                                    Pose Estimation            Reconstruction
                                                               Accuracy ↓        Completeness ↓    Distance ↓
Backbone   Re-weighting  Flow-GA    ATE ↓   RTE ↓   RRE ↓      Mean    Median    Mean    Median    Mean    Median
DUSt3R     Ref + Src     ✗          0.030   0.026   1.777      0.775   0.596     1.848   0.778     0.342   0.224
DUSt3R     Ref           ✗          0.029   0.025   1.774      0.772   0.596     1.813   0.757     0.336   0.219
DUSt3R     Ref           w/o Mask   0.026   0.017   1.472      0.940   0.831     1.654   0.685     0.336   0.220
DUSt3R     Ref           w/ Mask    0.021   0.014   1.092      0.703   0.589     1.474   0.586     0.301   0.186
MonST3R    Ref + Src     ✗          0.040   0.032   1.751      0.848   0.744     1.850   1.003     0.398   0.292
MonST3R    Ref           ✗          0.038   0.032   1.736      0.846   0.660     1.840   0.983     0.390   0.290
MonST3R    Ref           w/o Mask   0.033   0.023   1.495      0.969   0.796     1.752   0.998     0.368   0.273
MonST3R    Ref           w/ Mask    0.030   0.021   1.390      0.834   0.643     1.661   0.916     0.350   0.255
[Figure 9 image: columns DUSt3R [64] and Ours.]
Figure 9. Limitation. We visualize static reconstructions from
two different viewpoints in the top and bottom rows. Easi3R
improves camera pose estimation and point cloud reconstruc-
tion (top row), enhancing alignment in structures like swing
supports through attention re-weighting and segmentation-aware
global alignment. However, from another viewpoint (bottom row),
Easi3R still produces floaters near object boundaries.
ablation results show that (1) re-weighting only the reference
view decoder outperforms re-weighting both branches,
which aligns with our design intuition (i), since the
reference and source decoders serve different roles and
the reference view acts as the static standard; and (2) incorporating
segmentation in global alignment consistently improves 4D
reconstruction quality.
C. Limitations
Despite strong performance on various in-the-wild videos,
Easi3R can fail when the DUSt3R/MonST3R backbones
produce inaccurate depth predictions. Easi3R effectively
improves camera pose estimation and point cloud
reconstruction: as shown in Table 5 of the main paper, it
provides clear improvements in the completeness and distance
metrics, which are measured on the global point cloud.
However, a noticeable gap remains in depth accuracy, which
is evaluated on per-view outputs. This is because our
method focuses mainly on improving dynamic regions and
global alignment rather than correcting depth predictions in
static parts, as illustrated in Figure 9. We leave per-view
depth correction for future work.
D. Additional Results
We report additional qualitative results of disentangled 4D
reconstruction in Figure 10, Figure 11 and Figure 12.
We find that MonST3R tends to predict under-segmented
dynamic masks, while DAS3R tends to predict over-
segmented dynamic masks. CUT3R, although it produces
more accurate depth estimation, is prone to being affected
by dynamic objects, leading to misaligned static structures,
unstable camera pose estimation, and ghosting artifacts due
to the lack of dynamic segmentation prediction. In contrast,
Easi3R achieves more accurate segmentation, camera pose
estimation, and 4D reconstruction, resulting in renderings
with better visual quality.
Page 14:
[Figure 10 image: columns Video, CUT3R [63], MonST3R [73], DAS3R [68], and Ours; rows are sequences of 68 and 80 frames.]
Figure 10. Qualitative Comparison. We visualize cross-frame globally aligned static scenes with dynamic point clouds at a selected
timestamp. Notably, instead of using ground-truth dynamic masks as in previous work, we apply the estimated per-frame dynamic masks to
filter out dynamic points at other timestamps for comparison. Top and bottom rows are Easi3R$_\text{dust3r}$ and Easi3R$_\text{monst3r}$, respectively.
[Figure 11 image: columns 4D Reconstruction, Static Scene, Dynamic Object, and Video; rows Easi3R and MonST3R on an 80-frame video, shown at frames 4, 55, and 78.]
Figure 11. Disentanglement vs. MonST3R [73]. We visualize the disentangled 4D reconstruction, static scene and dynamic objects at
different frames. MonST3R tends to predict under-segmented dynamic masks.
[Figure 12 image: columns 4D Reconstruction, Static Scene, Dynamic Object, and Video; rows Easi3R and DAS3R on a 79-frame video, shown at frames 0, 44, and 57.]
Figure 12. Disentanglement vs. DAS3R [68]. We visualize the disentangled 4D reconstruction, static scene and dynamic objects at
different frames. DAS3R tends to predict over-segmented dynamic masks.