arxiv

Paper 2503.24391

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training

Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

Published: 2025-03-31

Abstract:

Recent advances in DUSt3R have enabled robust estimation of dense point clouds and camera parameters of static scenes, leveraging Transformer network architectures and direct supervision on large-scale 3D datasets. In contrast, the limited scale and diversity of available 4D datasets present a major bottleneck for training a highly generalizable 4D model. This constraint has driven conventional 4D methods to fine-tune 3D models on scalable dynamic video data with additional geometric priors such as optical flow and depths. In this work, we take an opposite path and introduce Easi3R, a simple yet efficient training-free method for 4D reconstruction. Our approach applies attention adaptation during inference, eliminating the need for from-scratch pre-training or network fine-tuning. We find that the attention layers in DUSt3R inherently encode rich information about camera and object motion. By carefully disentangling these attention maps, we achieve accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction. Extensive experiments on real-world dynamic videos demonstrate that our lightweight attention adaptation significantly outperforms previous state-of-the-art methods that are trained or finetuned on extensive dynamic datasets. Our code is publicly available for research purpose at https://easi3r.github.io/

Xingyu Chen¹, Yue Chen¹, Yuliang Xiu¹,², Andreas Geiger³, Anpei Chen¹,³
¹Westlake University, ²Max Planck Institute for Intelligent Systems, ³University of Tübingen, Tübingen AI Center
easi3r.github.io

1. Introduction

Recovering geometry and motions from dynamic image collections is still a fundamental challenge in computer vision, with broad downstream applications in novel view synthesis, AR/VR, autonomous navigation, and robotics.
The literature commonly identifies this problem as Structure-from-Motion (SfM), which has been a core focus of 3D vision for decades, yielding mature algorithms that perform well under stationary conditions and wide baselines. However, these algorithms often fail when applied to dynamic video input.

The main reason for the accuracy and robustness gap between static and dynamic SfM is object dynamics, a common component in real-world videos. Moving objects violate fundamental assumptions of homography and epipolar consistency in traditional SfM methods [37, 48]. In addition, in dynamic videos, where camera and object motions are often entangled, these methods struggle to disentangle the two motions, so that the motion with rich texture often erroneously dominates camera pose estimation. Recent efforts, such as MonST3R [73] and CUT3R [63], have made strides to address these challenges. However, their success is based on extensive training data [19, 25, 63, 68, 73] or task-specific prior models [22, 25, 73, 74], such as depth, optical flow, and object mask estimators. These limitations motivate us to innovate further to minimize the gap between static and dynamic reconstruction.

Figure 1. We present Easi3R, a training-free, plug-and-play approach that efficiently disentangles object and camera motion, enabling the adaptation of DUSt3R for 4D reconstruction. (Panel labels: Video, Easi3R, 4D Reconstruction, Static Scene, Object Motion, Camera Motion.)

arXiv:2503.24391v1 [cs.CV] 31 Mar 2025

We ask ourselves if there are lessons from human perception that can be used as design principles for dynamic 4D reconstruction: Human beings are capable of perceiving body motion and the structure of the scene, identifying dynamic objects, and disentangling ego-motion from object motion through the inherent attention mechanisms of the brain [58].
Yet, the learning process rarely relies on explicit dynamic labels. We observe that DUSt3R implicitly learned a similar mechanism, and based on this, we introduce Easi3R, a training-free method to achieve dynamic object segmentation, dense point map reconstruction, and robust camera pose estimation from dynamic videos, as shown in Figure 1.

DUSt3R uses attention layers at its core, taking two image features as input and producing pixel-aligned point maps as output. These attention layers are trained to directly predict pointmaps in the reference view coordinate space, implicitly matching the image features between the input views [4] and estimating the rigid view transformation in the feature space. In practice, performance drops significantly when processing pairs with object dynamics [73], as shown in Figure 2. By analyzing the attention maps in the transformer layers, we find that texture-less regions, under-observed regions, and dynamic objects can yield low attention values. Therefore, we propose a simple yet effective decomposition strategy to isolate the above components, which enables long-horizon dynamic object detection and segmentation. With this segmentation, we perform a second inference pass by applying a re-weighting [17] in the cross-attention layers, enabling robust dynamic 4D reconstruction and camera motion recovery without fine-tuning on a dynamic dataset, all at minimal additional cost to DUSt3R.

Despite its simplicity, we demonstrate that our inference-time scaling approach for 4D reconstruction is remarkably robust and accurate on in-the-wild casual dynamic videos. We evaluate our Easi3R adaptation on the DUSt3R and MonST3R backbones in three task categories: camera pose estimation, dynamic object segmentation, and pointcloud reconstruction in dynamic scenes.
Easi3R performs surprisingly well across a wide range of datasets, even surpassing concurrent methods (e.g., CUT3R [63], MonST3R [73], and DAS3R [68]) that are trained on dynamic datasets.

2. Related Work

SfM and SLAM. Structure-from-Motion (SfM) [2, 41, 42, 48, 51, 52] and Simultaneous Localization and Mapping (SLAM) [9, 13, 32, 34] have long been the foundation for 3D structure and camera pose estimation. These methods work by associating 2D correspondences [5, 10, 28, 32, 47] or minimizing photometric errors [12, 13], followed by bundle adjustment (BA) [3, 6, 55, 57, 59, 62] to refine structure and motion estimates. Although highly effective with dense input, these approaches often struggle with limited camera parallax or ill-posed conditions, leading to performance degeneracy. To overcome these limitations, DUSt3R [64] introduced a learning-based approach that directly predicts two pointmaps from an image pair in the coordinate space of the first view. This approach inherently matches image features and rigid body view transformation. By leveraging a Transformer-based architecture [11] and direct point supervision on large-scale 3D datasets, DUSt3R establishes a robust Multi-View Stereo (MVS) foundation model. However, DUSt3R and the follow-up methods [27, 33, 56, 61] assume primarily static scenes, which can lead to significant performance degradation when dealing with videos with dynamic objects.

Pose-free Dynamic Scene Reconstruction. Modifications to SLAM for dynamic scenes involve robust pose estimation to mitigate moving object interference and dynamic map management for updating changing environments; techniques such as semantic segmentation [72] and optical flow [75] enhance SLAM's resilience in dynamic scenarios. Another line of work focuses on estimating stable video depth by incorporating geometric constraints [29] and generative priors [18, 49].
These methods enhance monocular depth accuracy but lack global point cloud lifting due to missing camera intrinsics and poses. For joint pose and depth estimation, optimization-based methods such as CasualSAM [74] fine-tune a depth network [45] at test time using pre-computed optical flow [66]. Robust-CVD [22] refines pre-computed depth [45] and camera pose by leveraging masked optical flow [16, 66] to improve stability in occluded and moving regions. Concurrently, MegaSaM [25] further enhances pose and depth accuracy by integrating DROID-SLAM [57], optical flow [66], and depth initializations from [40, 71], achieving state-of-the-art results. Alternatively, point-map-based approaches like MonST3R [73] extend DUSt3R to dynamic scenes by fine-tuning with dynamic datasets and incorporating optical flow [66] to infer dynamic object segmentation. DAS3R trains a DPT [44] on top of MonST3R, enabling feedforward segmentation estimation. CUT3R [63] fine-tunes MASt3R [24] on both static and dynamic datasets, achieving feedforward reconstruction but without predicting dynamic object segmentation, thereby entangling the static scene with dynamic objects. Although effective, these methods require costly training on diverse motion patterns to generalize well.

In contrast, we take an opposite path, exploring a training-free and plug-and-play adaptation that enhances the generalization of DUSt3R variants for dynamic scene reconstruction. Our method requires no fine-tuning and comes at almost no additional cost, offering a scalable and efficient alternative for handling real-world dynamic videos.

Motion Segmentation. Motion segmentation aims to predict dynamic object masks from video inputs. Classical approaches generally rely on optical flow estimation [26, 31, 67, 70] and point tracking [7, 21, 35, 50, 69] to distinguish moving objects from the background.
Being trained solely on 2D data, they often struggle with occlusions and distinguishing between object and camera motion. To improve robustness, RoMo [15] incorporates epipolar geometry [30] and SAM2 [46] to better disambiguate object motion from camera motion. While RoMo successfully combines COLMAP [48] for accurate camera calibration, it focuses primarily on removing dynamic objects and reconstructing static scene elements only. For the complete 4D reconstruction, MonST3R [73] integrates optical flow [66] with estimated pose and depth to predict dynamic object segmentation. DAS3R builds on MonST3R and trains a DPT [44] for segmentation inference. Using this segmentation, they align static components globally while preserving dynamic point clouds from each frame, enabling temporally consistent reconstruction of moving objects.

In this work, we discover that dynamic segmentation can be extracted from pre-trained 3D reconstruction models like DUSt3R. We propose a simple yet robust strategy to isolate this information from the attention layers, without the need for optical flow or pre-training on segmentation datasets.

3. Method

Given a casually captured video sequence $\{I^t \in \mathbb{R}^{W \times H \times 3}\}_{t=1}^{T}$, our goal is to estimate object and camera movements, as well as the canonical point clouds present in the input video. Object motion is represented as the segmentation sequence $M^t$, camera motion as the extrinsic and intrinsic pose sequences $P^t, K^t$, and point clouds as $X^*$. First, we formulate how the DUSt3R model handles videos (Section 3.1). Next, we explore the mechanisms of attention aggregation in spatial and temporal dimensions (Section 3.2). Finally, we introduce how aggregated cross-attention maps can be leveraged to decompose dynamic object segmentation (Section 3.3), which in turn helps re-weight attention values for robust point cloud and camera pose reconstruction (Section 3.4).

3.1.
DUSt3R with Dynamic Video DUSt3R is designed for pose-free reconstruction, taking two RGB images - Ia, Ib, where a, b∈[1, . . . , T ]- as input and output two pointmaps in the reference view coordinate space, Xa→a, Xb→a∈RW×H×3: Xa→a, Xb→a=DUSt3R (Ia, Ib) (1) Here, Xb→adenotes the pointmap of input Ibpredicted in the view acoordinate space. In particular, both pointmaps are expressed in the reference view coordinate, i.e., view a in this example. Given multi-view images, DUSt3R processes them in pairs and globally aligns the pairwise predictions into a joint coordinate space using a connectivity graph across all views. However, this approach introduces computational redundancy for video sequences, as the view connectiv- ity is largely known. Instead, we process videos using a t t+1 t+2 t-1 t-2··· ··· Sliding Window DUSt3R reconstruction Video fram eFigure 2. DUSt3R with Dynamic Video. We process videos us- ing a sliding window and infer the DUSt3R network pairwise. Re- construction degrades with misalignment when dynamic objects occupy a considerable portion of the frames. sliding temporal window and infer the network for pair set εt={(a, b)|a, b∈[t−n−1 2, . . . , t +n−1 2], a̸=b}within the symmetric temporal window of size ncentered at time t, as illustrated in the top row of Figure 2. With pairwise predictions, it recovers globally aligned pointmaps {Xt∈RW×H×3}T t=1by optimizing the trans- formation Pt i∈R3×4from the coordinate space of each pair to the world coordinate, and a scale factor st ifor the i-th pair within the set of pairs εt: X∗= arg min X,P,sX t∈TX i∈εt∥Xa−st iPt iXa→a∥1 +∥Xb−st iPt iXb→a∥1(2) Note that the above optimization process assumes a re- liable pairwise reconstruction and that global content can be registered by minimizing the linear equations in Eq. 2. However, since DUSt3R are learned from RGB-D images of static scenes, dynamic objects disrupt the learned epipo- lar matching policy. 
Registration therefore fails when dynamic content occupies a considerable portion of pixels, as shown in Figure 2.

3.2. Secrets Behind DUSt3R

Figure 3. DUSt3R and our Easi3R adaptation. DUSt3R encodes two images $I^a, I^b$ into feature tokens $F_0^a, F_0^b$, which are then decoded into point maps in the reference view coordinate space using two decoders. Our Easi3R aggregates the cross-attention maps from the decoders, producing four semantically meaningful maps: $A_\mu^{b=\mathrm{src}}, A_\sigma^{b=\mathrm{src}}, A_\mu^{a=\mathrm{ref}}, A_\sigma^{b=\mathrm{ref}}$. These maps are then used for a second inference pass to enhance reconstruction quality.

Figure 4. Visualization for Cross-Attention Maps. We color the normalized values of attention maps, ranging from one to zero. We highlight the patterns captured by each type of attention map using relatively high values. For a more detailed demonstration, we invite reviewers to visit our webpage under easi3r.github.io. (Legend labels: Dynamics, Texture-less, Under-observed, Camera Motion.)

We now examine the network architecture to identify the components that cause failures for dynamic video input. As shown in Figure 3, DUSt3R consists of two branches: the top one for the reference image $I^a$ and the bottom one for the source image $I^b$. The two input images are first processed by a weight-sharing ViT encoder [11], producing token representations $F_0^a = \mathrm{Encoder}(I^a)$ and $F_0^b = \mathrm{Encoder}(I^b)$. Next, two decoders, composed of a sequence of decoder blocks, exchange information both within and between views. In each block, self-attention is applied to the token outputs from the previous block, while cross-attention is performed using the corresponding block outputs from the other branch,

$$F_l^a = \mathrm{DecoderBlock}_l^{\mathrm{ref}}(F_{l-1}^a, F_{l-1}^b), \qquad F_l^b = \mathrm{DecoderBlock}_l^{\mathrm{src}}(F_{l-1}^b, F_{l-1}^a) \tag{3}$$

where $l = 1, \dots, L$ is the block index. Using the feature tokens, two regression heads produce pointmap predictions: $X^{a \to a} = \mathrm{Head}^{\mathrm{ref}}(F_0^a, \dots, F_L^a)$ and $X^{b \to a} = \mathrm{Head}^{\mathrm{src}}(F_0^b, \dots, F_L^b)$, respectively.
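
The two-branch exchange in Eq. 3 can be pictured with a deliberately simplified Python sketch. Everything here is a toy stand-in rather than DUSt3R code: `attend`, `decoder_block`, and `decode` are hypothetical names, the attention is single-head with no learned projections, MLPs, or normalization, and both branches share one block function, whereas the text describes separate reference and source decoders. The only point it tries to make is that each branch at block $l$ reads the other branch's tokens from block $l-1$.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, context):
    """Toy single-head attention: queries (N, c) attend over context (M, c)."""
    logits = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(logits, axis=-1) @ context


def decoder_block(own, other):
    """Toy block: residual self-attention, then residual cross-attention."""
    own = own + attend(own, own)      # information exchange within the view
    return own + attend(own, other)   # information exchange between views


def decode(F_a0, F_b0, num_blocks):
    """Run both branches and keep every block's tokens, as the heads use them all."""
    feats_a, feats_b = [F_a0], [F_b0]
    for _ in range(num_blocks):
        prev_a, prev_b = feats_a[-1], feats_b[-1]
        # Both updates read the previous block's outputs, mirroring Eq. 3.
        feats_a.append(decoder_block(prev_a, prev_b))
        feats_b.append(decoder_block(prev_b, prev_a))
    return feats_a, feats_b
```

Calling `decode(F_a0, F_b0, L)` returns two lists of $L + 1$ token arrays, the toy counterparts of $F_0^a, \dots, F_L^a$ and $F_0^b, \dots, F_L^b$.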
The blocks are trained by minimizing the Euclidean distance between the predicted and ground-truth pointmaps.

Observation. Our key insight is that DUSt3R implicitly learns rigid view transformations through its cross-attention layers, assigning low attention values to tokens that violate epipolar geometry constraints, such as texture-less, under-observed, and dynamic regions. By aggregating cross-attention outputs across spatial and temporal dimensions, we extract motions from the attention layers.

Spatial attention maps. As illustrated in Figure 3 (left), the image features $F$ are projected into a query matrix for their respective branch with $Q = \ell_Q(F) \in \mathbb{R}^{(h \times w) \times c}$, while also serving as a key and value matrix for the other branch, $K = \ell_K(F) \in \mathbb{R}^{(h \times w) \times c}$, where $c$ is the feature dimension. The projections are obtained using trainable linear functions $\ell_Q(\cdot), \ell_K(\cdot)$. As illustrated on the right side of Figure 3, this results in the cross-attention maps:

$$A_l^{a \leftarrow b} = Q_l^a (K_l^b)^\top / \sqrt{c}, \qquad A_l^{b \leftarrow a} = Q_l^b (K_l^a)^\top / \sqrt{c} \tag{4}$$

in which the cross-attention maps $A_l^{a \leftarrow b}, A_l^{b \leftarrow a} \in \mathbb{R}^{(h \times w) \times h \times w}$ are used to guide the warping of the value matrix $V = \ell_V(F) \in \mathbb{R}^{(h \times w) \times c}$, and the cross-attention output in the reference view branch is given by $\mathrm{softmax}(A_l^{a \leftarrow b}) V^b$. Intuitively, the attention map $A_l^{a \leftarrow b}$ determines how the information is aggregated from view $b$ to view $a$ in the $l$-th decoder block.

To evaluate the spatial contribution of each token in view $b$ to all tokens in view $a$, we average the attention values between different tokens along the query and layer dimensions. This is given by

$$A^{b=\mathrm{src}} = \sum_l \sum_x A_l^{a \leftarrow b}(x, y, z) / (L \times h \times w), \qquad A^{a=\mathrm{ref}} = \sum_l \sum_x A_l^{b \leftarrow a}(x, y, z) / (L \times h \times w) \tag{5}$$

where $A^{b=\mathrm{src}}, A^{a=\mathrm{ref}} \in \mathbb{R}^{h \times w}$ represent the averaged attention maps, capturing the overall influence of tokens from one view to another across all decoder layers; i.e., $A^{b=\mathrm{src}}$ denotes the overall contribution of view $b$ to the reference view when it serves as the source view.

Temporal attention maps.
In the following, we extend the above single-pair formulation to multiple pairs to explore their temporal attention correlations. For a specific frame $I^t$, it pairs with multiple frames, resulting in $2(n-1)$ attention maps per frame. As shown in the upper row of Figure 2, a window size of 3 corresponds to 4 pairs. To aggregate the pairwise cross-attention maps temporally, we compute the mean and standard deviation over the pairs in which the view serves as source and reference:

$$A_\mu^{b=\mathrm{src}} = \mathrm{Mean}(A_i^{b=\mathrm{src}}), \qquad A_\sigma^{b=\mathrm{src}} = \mathrm{Std}(A_i^{b=\mathrm{src}}) \tag{6}$$

where $i \in \varepsilon^{b=\mathrm{src}}$ and

$$\varepsilon^{b=\mathrm{src}} = \{(a, b) \mid \mathrm{src} = b,\ a \in [t - n, \dots, t + n],\ a \neq b\} \tag{7}$$

Similarly, we compute $A_\mu^{b=\mathrm{ref}}$ and $A_\sigma^{b=\mathrm{ref}}$ for the set of pairs where view $a$ acts as the source view:

$$\varepsilon^{b=\mathrm{ref}} = \{(b, a) \mid \mathrm{ref} = b,\ a \in [t - n, \dots, t + n],\ a \neq b\} \tag{8}$$

Secrets. We visualize the aggregated temporal cross-attention maps in Figure 4. Recall that DUSt3R infers pointmaps from two images in the reference view coordinate frame, implicitly aligning points from the source view to the reference view.

(i) The reference view serves as the registration standard and is assumed to be static. As a result, the average attention map $A_\mu^{a=\mathrm{ref}}$ tends to be smooth, with texture-less regions (e.g., ground, sky, swing supports, boxing fences) and under-observed areas (e.g., image boundary) naturally exhibiting low attention values, since DUSt3R believes that they are less useful for registration. These regions can be highlighted and extracted using $(1 - A_\mu^{a=\mathrm{ref}})$, as shown in the corresponding column of Figure 4.

(ii) By calculating the standard deviation of $(1 - A_\mu^{a=\mathrm{ref}})$ between neighboring frames, we have $A_\sigma^{a=\mathrm{ref}}$ (see the corresponding column of Figure 4), representing the changes of the token contribution in the image coordinate space. Pixels perpendicular to the direction of motion generally share similar pixel flow speeds, resulting in consistent deviations that allow us to infer camera motion from the attention pattern.
For example, in the “Walking Man” case in the fourth row of Figure 4, with the camera motion from left to right, we can observe pixels along a column sharing similar attention values.

(iii) Similar to the reference view, we also compute the average inverse attention map in the source view, $1 - A_\mu^{a=\mathrm{src}}$. As shown in the corresponding column of Figure 4, the result not only indicates areas with less texture and under-observed areas but also highlights dynamic objects, because they violate the rigid body transformation prior that DUSt3R has learned from the 3D dataset, resulting in low $A_\mu^{a=\mathrm{src}}$ values.

(iv) The corresponding column of Figure 4 shows the standard deviation of the source view attention map, $A_\sigma^{a=\mathrm{src}}$. It highlights both camera and object motion, as the attention of these areas continuously changes over time, leading to high deviation in image space.

3.3. Dynamic Object Segmentation

By observing the compositional properties of the derived cross-attention maps, we propose extracting dynamic object segmentation for free, which provides a key for bridging static and dynamic scene reconstruction. To this end, we identify attention activations attributed to object motion. We infer the dynamic attention map for frame $a$ by computing the joint attention of the first two attention columns in Figure 4 using the element-wise product: $(1 - A_\mu^{a=\mathrm{src}}) \cdot A_\sigma^{a=\mathrm{src}}$. To further mitigate the effects of texture-less regions, under-observed areas, and camera motion (as shown in the third and fourth columns of Figure 4), we incorporate the outputs with their inverse attention, resulting in the final formula:

$$A^{a=\mathrm{dyn}} = (1 - A_\mu^{a=\mathrm{src}}) \cdot A_\sigma^{a=\mathrm{src}} \cdot A_\mu^{a=\mathrm{ref}} \cdot (1 - A_\sigma^{a=\mathrm{ref}}) \tag{9}$$

We then obtain the per-frame dynamic object segmentation $M^t = [A^{t=\mathrm{dyn}} > \alpha]$ using Eq. 9, where $M^t \in \mathbb{R}^{h \times w}$, $\alpha$ denotes a pre-defined attention threshold, and $[\cdot]$ is the Iverson bracket. Note that the segmentation is processed frame by frame.
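
Under the assumption that the per-pair averaged attention maps of Eq. 5 are already available as arrays, Eq. 6, Eq. 9, and the thresholding step could be sketched as follows. The helper names, the min-max normalization to [0, 1], and the default threshold `alpha=0.5` are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def normalize(a):
    """Min-max normalize a map to [0, 1].

    Assumption: the text only says the visualized maps are normalized; min-max
    scaling is one plausible choice that keeps (1 - A) inside [0, 1].
    """
    lo, hi = a.min(), a.max()
    return np.zeros_like(a) if hi == lo else (a - lo) / (hi - lo)


def temporal_stats(maps):
    """Eq. 6: mean and standard deviation over all pairs of one frame.

    maps: array of shape (num_pairs, h, w) holding per-pair maps from Eq. 5.
    """
    maps = np.asarray(maps, dtype=float)
    return normalize(maps.mean(axis=0)), normalize(maps.std(axis=0))


def dynamic_mask(src_maps, ref_maps, alpha=0.5):
    """Eq. 9 followed by thresholding at alpha (alpha is a hypothetical default)."""
    src_mu, src_sigma = temporal_stats(src_maps)
    ref_mu, ref_sigma = temporal_stats(ref_maps)
    a_dyn = (1.0 - src_mu) * src_sigma * ref_mu * (1.0 - ref_sigma)
    return a_dyn, a_dyn > alpha
```

A pixel ends up in the mask only if it receives low and fluctuating attention as a source token while receiving high and stable attention as a reference token, which is the combination Eq. 9 attributes to object motion.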
To enhance temporal consistency, we apply a feature clustering method that fuses information across all frames; see the supplementary materials for more details.

3.4. 4D Reconstruction

With dynamic object segmentation, the most intuitive way to adapt static models to dynamic scenes is by masking out dynamic objects during inference at both the image and token levels. This can be done by replacing dynamic regions with black pixels in the image and substituting the corresponding tokens with mask tokens. In practice, this approach significantly degrades reconstruction performance [73], mainly because black pixels and mask tokens lead to out-of-distribution input. This motivates us to apply masking directly within the attention layers instead of modifying the input images.

Attention re-weighting. We propose to modify the cross-attention maps by weakening the attention values associated with dynamic regions. To achieve this, we perform a second inference pass through the network, masking the attention map for assigned dynamic regions. This results in zero attention for those regions while keeping the rest of the attention maps unchanged:

$$\mathrm{softmax}(\tilde{A}_l^{a \leftarrow b}) = \begin{cases} 0 & \text{if } M^{a \leftarrow b} \\ \mathrm{softmax}(A_l^{a \leftarrow b}) & \text{otherwise} \end{cases} \tag{10}$$

Here, $M^{a \leftarrow b} = (1 - M^a) \otimes (M^b)^\top$, where $M^{a \leftarrow b} \in \mathbb{R}^{(h \times w) \times (h \times w)}$ and $\otimes$ denotes the outer product. As a result, tokens from dynamic regions in view $b$ do not contribute to static regions in view $a$. It is important to note that re-weighting is applied only to the reference view decoder, as the source view requires a static reference (i.e., the reference view), as described in secret (i). To achieve this, the source view decoder must perform cross-attention with all tokens from the reference view. Re-weighting dynamic attention on both branches could result in the loss of the static standard, leading to noisy outputs. We conducted an ablation study on this insight.

Global alignment.
We align the predicted pointmaps from the sliding windows with the global world coordinate using Eq. 2. Moreover, thanks to dynamic region segmentation, our method also supports segmentation-aware global alignment with optical flow. In particular, we incorporate a reprojection loss to ensure that the projected point flow remains consistent with the optical flow estimation [66]. Specifically, given an image pair $(a, b)$, we compute the camera motion from frame $a$ to frame $b$, denoted by $\hat{\mathcal{F}}^{a \to b}$, by projecting the global point map $X^b$ from camera $(P^a, K^a)$ to camera $(P^b, K^b)$. We then enforce the consistency between the computed flow and the estimated optical flow $\mathcal{F}^{a \to b}$ in static regions $(1 - M^t)$:

$$L_{\mathrm{flow}} = \sum_{t \in T} \sum_{i \in \varepsilon^t} (1 - M^a) \cdot \| \hat{\mathcal{F}}_i^{a \to b} - \mathcal{F}_i^{a \to b} \|_1 + (1 - M^b) \cdot \| \hat{\mathcal{F}}_i^{b \to a} - \mathcal{F}_i^{b \to a} \|_1 \tag{11}$$

where $\cdot$ indicates the element-wise product. By incorporating the flow constraint into the optimization process in Eq. 2, we achieve a more robust output in terms of global pointmaps $X^*$ and pose sequences $P^t, K^t$. Note that this term is used optionally to ensure a fair comparison with the baseline that does not incorporate the flow-estimation model.

4. Experiments

We evaluate our method in a variety of tasks, including dynamic object segmentation (Section 4.1), camera pose estimation (Section 4.2) and 4D reconstruction (Section 4.3). We performed ablation studies in the supplementary material.

Baselines. We compare Easi3R with state-of-the-art pose-free 4D reconstruction methods, including DUSt3R [64], MonST3R [73], DAS3R [68], and CUT3R [63]. Among these works, the latter three are concurrent works that also aim to extend DUSt3R to handle dynamic videos, but take a different approach by fine-tuning on dynamic datasets, such as [20, 54, 65], and optimization under the supervision of optical flow [66]. Unlike previous work, our method performs a second inference pass on top of the pre-trained DUSt3R or MonST3R model without requiring fine-tuning or optimization on additional data.

4.1.
Dynamic Object Motion

We represent object motion as a segmentation sequence and evaluate performance on the video object segmentation benchmark DAVIS-16 [39], the more challenging DAVIS-17 [43], and DAVIS-all. We present two experiment settings: direct evaluation of network outputs and an enhanced setting where outputs serve as prompts for SAM2 [46], improving results. These settings are denoted as w/o and w/ SAM2 in Table 1, respectively. Following DAVIS [39], we evaluate performance using IoU mean (JM) and IoU recall (JR) metrics. Since DUSt3R originally does not support dynamic object segmentation, we extend it as a baseline by incorporating the same flow-guided segmentation as MonST3R. By applying our attention-guided decomposition, both DUSt3R and MonST3R show improved segmentation, without the need for flow, even surpassing DAS3R, which is explicitly trained on dynamic mask labels.

Figure 5. Qualitative Results of Dynamic Object Segmentation (columns: Video Frame, MonST3R [73], DAS3R [68], Ours, GT). “Ours” refers to the Easi3R_monst3r setting. Here, we present the enhanced setting, where outputs from different methods serve as prompts and are used with SAM2 [46] for mask inference.

Table 1. Dynamic Object Segmentation on the DAVIS dataset. The best and second best results are bold and underlined, respectively. Easi3R_dust3r / Easi3R_monst3r denote the Easi3R experiments with the DUSt3R / MonST3R backbones. (D16 = DAVIS-16, D17 = DAVIS-17, All = DAVIS-all; w/o and w/ refer to SAM2.)

| Method | Flow | D16 w/o JM↑ | D16 w/o JR↑ | D16 w/ JM↑ | D16 w/ JR↑ | D17 w/o JM↑ | D17 w/o JR↑ | D17 w/ JM↑ | D17 w/ JR↑ | All w/o JM↑ | All w/o JR↑ | All w/ JM↑ | All w/ JR↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R [64] | ✓ | 42.1 | 45.7 | 58.5 | 63.4 | 35.2 | 35.3 | 48.7 | 50.2 | 35.9 | 34.0 | 47.6 | 48.7 |
| MonST3R [73] | ✓ | 40.9 | 42.2 | 64.3 | 73.3 | 38.6 | 38.2 | 56.4 | 59.6 | 36.7 | 34.3 | 51.9 | 54.1 |
| DAS3R [68] | ✗ | 41.6 | 39.0 | 54.2 | 55.8 | 43.5 | 42.1 | 57.4 | 61.3 | 43.4 | 38.7 | 53.9 | 54.8 |
| Easi3R_dust3r | ✗ | 53.1 | 60.4 | 67.9 | 71.4 | 49.0 | 56.4 | 60.1 | 65.3 | 44.5 | 49.6 | 54.7 | 60.6 |
| Easi3R_monst3r | ✗ | 57.7 | 71.6 | 70.7 | 79.9 | 56.5 | 68.6 | 67.9 | 76.1 | 53.0 | 63.4 | 63.1 | 72.6 |

Qualitative Results.
Figure 5 presents the qualitative comparison between our method and existing approaches. Since MonST3R relies on optical flow estimation, it struggles in textureless regions, failing to disentangle dynamic objects from the background (e.g., koala, rhino, sheep). On the other hand, DAS3R learns a mask head for dynamic segmentation but tends to over-segment in most cases. Our method, built on DUSt3R and enhanced with our Easi3R attention-guided decomposition, accurately segments dynamic objects while maintaining robustness in handling textureless regions (e.g., trunks, rocks, walls), small dynamic objects (e.g., goose), and casual motions (e.g., girls, pedestrian). The results provide a surprising insight that 3D models, such as DUSt3R in our case, may inherently possess a strong understanding of the scene and can generalize well to standard 2D tasks.

4.2. Camera Motion

We evaluate camera motion by using the estimated extrinsic sequence on three dynamic benchmarks: DyCheck [14], TUM-dynamics [53], and ADT [23, 38] datasets. Specifically, the ADT dataset features egocentric videos, which are out-of-distribution for DUSt3R's training set. The DyCheck dataset includes diverse, in-the-wild dynamic videos captured from handheld cameras. The TUM-dynamics dataset contains major dynamic objects in relatively simple indoor scenarios. Instead of evaluating video clips as in previous methods, we adopt a more challenging setting by processing entire sequences. Specifically, we downsample frames at different rates: every 5 frames for ADT, 10 for DyCheck, and 30 for TUM-dynamics, resulting in approximately 40 frames. We use standard error metrics: Absolute Translation Error (ATE), Relative Translation Error (RTE), and Relative Rotation Error (RRE), after applying the Sim(3) alignment [60] on the estimated camera trajectory to the GT.

Table 2. Benefits of Easi3R on Camera Pose Estimation on the DyCheck, ADT and TUM-dynamics datasets. The best and second best results are bold and underlined, respectively. Easi3R_dust3r / Easi3R_monst3r denote the Easi3R experiments with the DUSt3R / MonST3R backbones.

| Method | Flow | DyCheck ATE↓ | DyCheck RTE↓ | DyCheck RRE↓ | ADT ATE↓ | ADT RTE↓ | ADT RRE↓ | TUM-dynamics ATE↓ | TUM-dynamics RTE↓ | TUM-dynamics RRE↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R [64] | ✗ | 0.035 | 0.030 | 2.323 | 0.042 | 0.025 | 1.212 | 0.100 | 0.087 | 2.692 |
| Easi3R_dust3r | ✗ | 0.029 | 0.025 | 1.774 | 0.040 | 0.021 | 0.880 | 0.093 | 0.076 | 2.366 |
| DUSt3R [64] | ✓ | 0.029 | 0.021 | 1.875 | 0.076 | 0.030 | 0.974 | 0.071 | 0.067 | 3.711 |
| Easi3R_dust3r | ✓ | 0.021 | 0.014 | 1.092 | 0.042 | 0.015 | 0.655 | 0.070 | 0.061 | 2.361 |
| MonST3R [73] | ✗ | 0.040 | 0.034 | 1.820 | 0.045 | 0.024 | 0.759 | 0.183 | 0.148 | 6.985 |
| Easi3R_monst3r | ✗ | 0.038 | 0.032 | 1.736 | 0.045 | 0.024 | 0.715 | 0.184 | 0.149 | 6.311 |
| MonST3R [73] | ✓ | 0.033 | 0.024 | 1.501 | 0.055 | 0.025 | 0.776 | 0.170 | 0.155 | 6.455 |
| Easi3R_monst3r | ✓ | 0.030 | 0.021 | 1.390 | 0.039 | 0.016 | 0.640 | 0.168 | 0.150 | 5.925 |

Table 3. Quantitative Comparisons of Camera Pose Estimation on the DyCheck, ADT and TUM-dynamics datasets. The best and second best results are bold and underlined, respectively.

| Method | Flow | DyCheck ATE↓ | DyCheck RTE↓ | DyCheck RRE↓ | ADT ATE↓ | ADT RTE↓ | ADT RRE↓ | TUM-dynamics ATE↓ | TUM-dynamics RTE↓ | TUM-dynamics RRE↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R [64] | ✗ | 0.035 | 0.030 | 2.323 | 0.042 | 0.025 | 1.212 | 0.100 | 0.087 | 2.692 |
| CUT3R [63] | ✗ | 0.029 | 0.020 | 1.383 | 0.084 | 0.025 | 0.490 | 0.079 | 0.088 | 10.41 |
| MonST3R [73] | ✓ | 0.033 | 0.024 | 1.501 | 0.055 | 0.025 | 0.776 | 0.170 | 0.155 | 6.455 |
| DAS3R [68] | ✓ | 0.033 | 0.022 | 1.467 | 0.040 | 0.017 | 0.685 | 0.173 | 0.157 | 8.341 |
| Easi3R_monst3r | ✓ | 0.030 | 0.021 | 1.390 | 0.039 | 0.016 | 0.640 | 0.168 | 0.150 | 5.925 |
| Easi3R_dust3r | ✓ | 0.021 | 0.014 | 1.092 | 0.042 | 0.015 | 0.655 | 0.070 | 0.061 | 2.361 |

Table 4.
Benefits of Easi3R on Point Cloud Reconstruction on the DyCheck dataset. The best and second best results are bold and underlined, respectively. Easi3R_dust3r / Easi3R_monst3r denote the Easi3R experiments with the DUSt3R / MonST3R backbones.

| Method | Flow | Accuracy↓ Mean | Accuracy↓ Median | Completeness↓ Mean | Completeness↓ Median | Distance↓ Mean | Distance↓ Median |
|---|---|---|---|---|---|---|---|
| DUSt3R [64] | ✗ | 0.802 | 0.595 | 1.950 | 0.815 | 0.353 | 0.233 |
| Easi3R_dust3r | ✗ | 0.772 | 0.596 | 1.813 | 0.757 | 0.336 | 0.219 |
| DUSt3R [64] | ✓ | 0.738 | 0.599 | 1.669 | 0.678 | 0.313 | 0.196 |
| Easi3R_dust3r | ✓ | 0.703 | 0.589 | 1.474 | 0.586 | 0.301 | 0.186 |
| MonST3R [73] | ✗ | 0.855 | 0.693 | 1.916 | 1.035 | 0.398 | 0.295 |
| Easi3R_monst3r | ✗ | 0.846 | 0.660 | 1.840 | 0.983 | 0.390 | 0.290 |
| MonST3R [73] | ✓ | 0.851 | 0.689 | 1.734 | 0.958 | 0.353 | 0.254 |
| Easi3R_monst3r | ✓ | 0.834 | 0.643 | 1.661 | 0.916 | 0.350 | 0.255 |

Table 5. Quantitative Comparisons of Point Cloud Reconstruction on the DyCheck dataset. The best and second best results are bold and underlined, respectively.

| Method | Flow | Accuracy↓ Mean | Accuracy↓ Median | Completeness↓ Mean | Completeness↓ Median | Distance↓ Mean | Distance↓ Median |
|---|---|---|---|---|---|---|---|
| DUSt3R [64] | ✗ | 0.802 | 0.595 | 1.950 | 0.815 | 0.353 | 0.233 |
| CUT3R [63] | ✗ | 0.458 | 0.342 | 1.633 | 0.792 | 0.326 | 0.229 |
| MonST3R [73] | ✓ | 0.851 | 0.689 | 1.734 | 0.958 | 0.353 | 0.254 |
| DAS3R [68] | ✓ | 1.772 | 1.438 | 2.503 | 1.548 | 0.475 | 0.352 |
| Easi3R_monst3r | ✓ | 0.834 | 0.643 | 1.661 | 0.916 | 0.350 | 0.255 |
| Easi3R_dust3r | ✓ | 0.703 | 0.589 | 1.474 | 0.586 | 0.301 | 0.186 |

Benefits from Easi3R. We use DUSt3R and MonST3R without optical flow as the backbone settings, independently analyzing the benefits that Easi3R offers for each. We show qualitative comparisons of the estimation of the camera trajectory (Figure 7) and quantitative pose accuracy in Table 2, including w/ and w/o flow settings. Easi3R demonstrates more accurate and robust camera pose and trajectory estimation over both backbones and settings. Our Easi3R effectively leverages the inherent knowledge of DUSt3R with just a few lines of code, even achieving an improvement compared to models with optical flow prior.

Comparison. In Table 3, we compare Easi3R with state-of-the-art variants of DUSt3R.
Unlike the plug-and-play comparison in Table 2, where we optionally disable the optical flow prior for a fair evaluation, here we report baseline performance using each method's original experimental settings; the second column specifies whether a flow model is used. For clarity, we denote the setting "MonST3R + Easi3R" as Easi3R$_{\text{monst3r}}$ and "DUSt3R + Easi3R" as Easi3R$_{\text{dust3r}}$. Notably, our approach achieves significant improvements and delivers the best overall performance among all methods, without any fine-tuning on additional dynamic datasets or mask labels.

[Figure 6 panels: Video, CUT3R [63], MonST3R [73], DAS3R [68], Ours; sequences of 86, 75, 52, and 80 frames.]

Figure 6. Qualitative Comparison. We visualize cross-frame globally aligned static scenes with dynamic point clouds at a selected timestamp. Notably, instead of using ground truth dynamic masks as in previous work, we apply the estimated per-frame dynamic masks to filter out dynamic points at other timestamps for comparison. Our method (top two and bottom two rows as Easi3R$_{\text{dust3r}}$ and Easi3R$_{\text{monst3r}}$, respectively) achieves temporally consistent reconstruction of both static scenes and moving objects, whereas baselines suffer from static structure misalignment, unstable camera pose estimation, and ghosting artifacts due to inaccurate estimation of dynamic segmentation.

[Figure 7 plots: camera trajectories in the X-Y plane (axes in meters). Left: Ground Truth, DUSt3R, Easi3R$_{\text{dust3r}}$. Right: Ground Truth, MonST3R, Easi3R$_{\text{monst3r}}$.]

Figure 7. Visualization of estimated camera trajectories. Our robustly estimated camera trajectory (orange) deviates less from the ground truth (gray) than that of the original backbones (blue).

4.3. 4D Reconstruction

We evaluate 4D reconstruction on DyCheck [14] by measuring distances to ground-truth point clouds. Following prior work [1, 8, 61], we use accuracy, completeness, and distance metrics. Accuracy is the nearest Euclidean distance from a reconstructed point to ground truth, completeness is the reverse, and distance is the Euclidean distance based on ground-truth point matching.

Quantitative Results. Table 4 and Table 5 show the benefits of Easi3R: it yields more accurate reconstruction and outperforms most baselines, and is even comparable to the concurrent CUT3R [63], which is trained on many extensive datasets.

Qualitative Results. We also compare the reconstruction quality of our method with CUT3R [63], MonST3R [73] and DAS3R [68] in Figure 6. All baselines struggle with misalignment and entanglement of dynamic and static reconstructions, resulting in broken geometry, distortions, and ghosting artifacts. The key to our success lies in: (1) attention-guided segmentation for robust motion disentanglement, (2) attention re-weighting for improved pairwise reconstruction, and (3) segmentation-aware global alignment for enhanced overall quality.

5. Conclusion

We presented Easi3R, an adaptation to DUSt3R, which introduces the spatial and temporal attention mechanism behind DUSt3R, to achieve training-free and robust 4D reconstruction. We identified compositional complexity in the attention maps and proposed a simple yet effective decomposition strategy that isolates the textureless, under-observed, and dynamic-object components, allowing for robust dynamic object segmentation. With the segmentation, we perform a second inference pass with attention re-weighting, enabling robust dynamic 4D reconstruction and camera motion recovery at almost no additional cost on top of DUSt3R. Surprisingly, our experimental results demonstrate that Easi3R outperforms state-of-the-art methods in most cases. We hope that our findings on attention map disentanglement can inspire other tasks.

Acknowledgments. We thank the members of Inception3D and Endless AI Labs for their help and discussions.
Xingyu Chen and Yue Chen are funded by the Westlake Education Foundation. Xingyu Chen is also supported by the Natural Science Foundation of Zhejiang province, China (No. QKWL25F0301). Yuliang Xiu received funding from the Max Planck Institute for Intelligent Systems. Anpei Chen and Andreas Geiger are supported by the ERC Starting Grant LEGO-3D (850533) and DFG EXC number 2064/1 - project number 390727645.

References

[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2016.
[2] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day. ACM Communications, 2011.
[3] Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. In Proc. of the European Conf. on Computer Vision (ECCV), 2010.
[4] Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, and Seungryong Kim. Cross-view completion models are zero-shot correspondence estimators. arXiv preprint arXiv:2412.09072, 2024.
[5] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer vision and image understanding, 2008.
[6] Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In Proc. of the European Conf. on Computer Vision (ECCV), 2024.
[7] Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In Proc. of the European Conf. on Computer Vision (ECCV), 2010.
[8] Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, and Yuliang Xiu. Feat2gs: Probing visual foundation models with gaussian splatting. arXiv preprint arXiv:2412.09606, 2024.
[9] Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2007.
[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPRW, 2018.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[12] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2017.
[13] Jakob Engel, Thomas Schöps, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. In Proc. of the European Conf. on Computer Vision (ECCV), 2014.
[14] Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. Advances in Neural Information Processing Systems, 2022.
[15] Lily Goli, Sara Sabour, Mark Matthews, Marcus Brubaker, Dmitry Lagun, Alec Jacobson, David J Fleet, Saurabh Saxena, and Andrea Tagliasacchi. Romo: Robust motion segmentation improves structure from motion. arXiv preprint arXiv:2411.18650, 2024.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), 2017.
[17] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In International Conference on Learning Representations (ICLR), 2023.
[18] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
[19] Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos. arXiv preprint arXiv:2412.09621, 2024.
[20] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dynamicstereo: Consistent dynamic depth from stereo videos. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023.
[21] Laurynas Karazija, Iro Laina, Christian Rupprecht, and Andrea Vedaldi. Learning segmentation from point trajectories. arXiv preprint arXiv:2501.12392, 2025.
[22] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021.
[23] Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, Joao Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Doersch. Tapvid-3d: A benchmark for tracking any point in 3d. arXiv preprint arXiv:2407.05921, 2024.
[24] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with MASt3R. In European Conference on Computer Vision (ECCV), 2024.
[25] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: accurate, fast, and robust structure and motion from casual dynamic videos. arXiv preprint arXiv:2412.04463, 2024.
[26] Long Lian, Zhirong Wu, and Stella X Yu. Bootstrapping objectness from videos by relaxed common fate and visual grouping. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023.
[27] Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yanchao Yang, Qingnan Fan, and Baoquan Chen. SLAM3R: real-time dense scene reconstruction from monocular RGB videos. arXiv preprint arXiv:2412.09401, 2024.
[28] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 2004.
[29] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Trans. on Graphics, 2020.
[30] Quan-Tuan Luong and Olivier D Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. Proc. of the IEEE International Conf. on Computer Vision (ICCV), 1996.
[31] Etienne Meunier, Anaïs Badoual, and Patrick Bouthemy. Em-driven unsupervised learning for efficient motion segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2022.
[32] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 2015.
[33] Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: real-time dense SLAM with 3d reconstruction priors. arXiv preprint arXiv:2412.12392, 2024.
[34] Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2011.
[35] Peter Ochs, Jitendra Malik, and Thomas Brox. Segmentation of moving objects by long term video analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2013.
[36] Nobuyuki Otsu et al. A threshold selection method from gray-level histograms. Automatica, 11(285-296):23–27, 1975.
[37] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes Lutz Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision (ECCV), 2024.
[38] Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2023.
[39] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
[40] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: universal monocular metric depth estimation. In Computer Vision and Pattern Recognition (CVPR), 2024.
[41] Marc Pollefeys, Reinhard Koch, and Luc Van Gool. Self-calibration and metric reconstruction in spite of varying and unknown intrinsic camera parameters. International Journal of Computer Vision, 1999.
[42] Marc Pollefeys, Luc Van Gool, Maarten Vergauwen, Frank Verbiest, Kurt Cornelis, Jan Tops, and Reinhard Koch. Visual modeling with a hand-held camera. International Journal of Computer Vision, 2004.
[43] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
[44] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2021.
[45] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2020.
[46] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
[47] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
[48] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
[49] Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Vitor Guizilini, Yue Wang, Matteo Poggi, and Yiyi Liao. Learning temporally consistent video depth from video diffusion priors. arXiv preprint arXiv:2406.01493, 2024.
[50] Yaser Sheikh, Omar Javed, and Takeo Kanade. Background subtraction for freely moving cameras. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2009.
[51] Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. In SIGGRAPH, 2006.
[52] Noah Snavely, Steven M Seitz, and Richard Szeliski. Modeling the world from internet photo collections. International journal of computer vision, 2008.
[53] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems, 2012.
[54] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.
[55] Chengzhou Tang and Ping Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.
[56] Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: single-stage scene reconstruction from sparse views in 2 seconds. arXiv preprint arXiv:2412.06974, 2024.
[57] Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in Neural Information Processing Systems (NIPS), 2021.
[58] Stefan Treue and John HR Maunsell. Attentional modulation of visual motion processing in cortical areas mt and mst. Nature, 1996.
[59] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, 2000.
[60] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 1991.
[61] Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024.
[62] Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024.
[63] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.
[64] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: geometric 3d vision made easy. In Computer Vision and Pattern Recognition (CVPR), 2024.
[65] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. In Proc. IEEE International Conf. on Intelligent Robots and Systems (IROS), 2020.
[66] Yihan Wang, Lahav Lipson, and Jia Deng. Sea-raft: Simple, efficient, accurate raft for optical flow. In Proc. of the European Conf. on Computer Vision (ECCV), 2024.
[67] Junyu Xie, Weidi Xie, and Andrew Zisserman. Segmenting moving objects via an object-centric layered representation. Advances in neural information processing systems, 2022.
[68] Kai Xu, Tze Ho Elden Tse, Jizong Peng, and Angela Yao. DAS3R: dynamics-aware gaussian splatting for static scene reconstruction. arXiv preprint arXiv:2412.19584, 2024.
[69] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2006.
[70] Charig Yang, Hala Lamdouar, Erika Lu, Andrew Zisserman, and Weidi Xie. Self-supervised video object segmentation by motion grouping. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2021.
[71] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In Computer Vision and Pattern Recognition (CVPR), 2024.
[72] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Fei Qiao. DS-SLAM: A semantic visual SLAM towards dynamic environments. In Proc. IEEE International Conf. on Intelligent Robots and Systems (IROS), 2018.
[73] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: a simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
[74] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Noah Snavely, Michael Rubinstein, and William T. Freeman. Structure and motion from casual videos. In Proc. of the European Conf. on Computer Vision (ECCV), 2022.
[75] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. Particlesfm: Exploiting dense point trajectories for localizing moving cameras in the wild. In Proc. of the European Conf. on Computer Vision (ECCV), 2022.

Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Supplementary Material

In this supplementary document, we first present additional method details on temporally consistent dynamic object segmentation in Appendix A. Next, we conduct ablation studies of Easi3R in Appendix B and analyze limitations in Appendix C.
Lastly, we report additional qualitative results in Appendix D. We invite readers to easi3r.github.io for better visualization.

A. Dynamic Object Segmentation

We presented dynamic object segmentation for a single frame in Section 3.3; we now describe how to ensure consistency along the temporal axis. Given image feature tokens $F_0^t$ for the frame at time $t$, output from the image encoder, we concatenate them along the temporal dimension,

$$\bar{F} = [F_0^1; F_0^2; \ldots; F_0^T] \in \mathbb{R}^{(T \times h \times w) \times c} \tag{12}$$

where $c$ is the feature dimension of the tokens. This allows us to apply k-means clustering to group similar features across frames, producing cluster assignments,

$$C = \mathrm{KMeans}(\bar{F}, k), \quad C^t(x, y) \in \{1, \ldots, k\}, \quad \forall t, x, y \tag{13}$$

where $k$ is the number of clusters; we use $k = 64$ for all experiments. For each cluster $c \in \{1, \ldots, k\}$, we compute a dynamic score $s_c$ by averaging the base dynamic attention values of all tokens within that cluster:

$$s_c = \frac{\sum_t \sum_{x,y} \mathbb{1}[C^t(x, y) = c] \cdot A^{t=\text{dyn}}(x, y)}{\sum_t \sum_{x,y} \mathbb{1}[C^t(x, y) = c]} \tag{14}$$

where $\mathbb{1}[\cdot]$ denotes the indicator function. We then use these scores to generate a cluster-fused dynamic attention map, mapping each pixel's cluster assignment back to its corresponding dynamic score,

$$A^{t=\text{dyn}}_{\text{fuse}}(x, y) = s_{C^t(x, y)} \tag{15}$$

The refined dynamic attention map $A^{t=\text{dyn}}_{\text{fuse}} \in \mathbb{R}^{h \times w}$ is used to infer the dynamic object segmentation by

$$M^t(x, y) = \mathbb{1}[A^{t=\text{dyn}}_{\text{fuse}}(x, y) > \alpha] \tag{16}$$

where $\alpha$ is a threshold selected automatically using Otsu's method [36]. The resulting dynamic object segmentation is further utilized in the second inference pass and global optimization.

Figure 8. Benefits of Cross-frame Feature Clustering. We visualize the dynamic attention map $A^{t=\text{dyn}}$, cluster assignments $C^t$, and cluster-fused dynamic attention map $A^{t=\text{dyn}}_{\text{fuse}}$. Features from the DUSt3R encoder exhibit temporal consistency, as cluster assignments ($C^t$) remain unchanged across frames, thereby enhancing temporal consistency in dynamic segmentation ($A^{t=\text{dyn}}_{\text{fuse}}$) through clustering-guided temporal fusing.
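The cluster-level fusion and thresholding of Eqs. (14)-(16) in Appendix A can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`cluster_fused_attention`, `otsu_threshold`, `segment_dynamic`) are hypothetical, cluster assignments are taken as given (the k-means step of Eq. (13) is omitted), the 256-bin histogram is an arbitrary choice, and a single Otsu threshold over all frames is assumed because the text does not state whether $\alpha$ is chosen per frame or per sequence.

```python
from collections import defaultdict


def cluster_fused_attention(assignments, attention):
    """Eqs. (14)-(15): average attention per cluster, then map scores back.

    assignments[t][y][x] is a cluster id, attention[t][y][x] a dynamic
    attention value; both are nested lists of identical shape.
    """
    total = defaultdict(float)
    count = defaultdict(int)
    for c_frame, a_frame in zip(assignments, attention):
        for c_row, a_row in zip(c_frame, a_frame):
            for c, a in zip(c_row, a_row):
                total[c] += a
                count[c] += 1
    score = {c: total[c] / count[c] for c in total}  # s_c in Eq. (14)
    fused = [[[score[c] for c in row] for row in frame] for frame in assignments]
    return fused, score


def otsu_threshold(values, bins=256):
    """Otsu's method: pick the cut that maximizes between-class variance."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant input: nothing to separate
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    n = len(values)
    total_sum = sum(h * c for h, c in zip(hist, centers))
    best_var, best_t = -1.0, lo
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]
        sum0 += hist[i] * centers[i]
        w1 = n - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (total_sum - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, lo + (i + 1) * width
    return best_t


def segment_dynamic(assignments, attention):
    """Eq. (16): threshold the fused map (one threshold for all frames, assumed)."""
    fused, _ = cluster_fused_attention(assignments, attention)
    flat = [v for frame in fused for row in frame for v in row]
    alpha = otsu_threshold(flat)
    return [[[v > alpha for v in row] for row in frame] for frame in fused]
```

Because every pixel inherits the mean score of its cluster, a pixel with weak attention in one frame is still marked dynamic if its cluster is dynamic on average across the sequence, which is the temporal fusing effect the appendix describes.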
For better visual intuition, we invite readers to easi3r.github.io.

Table 6. Ablation of Dynamic Object Segmentation on DAVIS.

| Backbone | Ablation | DAVIS-16 JM↑ | DAVIS-16 JR↑ | DAVIS-17 JM↑ | DAVIS-17 JR↑ | DAVIS-all JM↑ | DAVIS-all JR↑ |
|---|---|---|---|---|---|---|---|
| DUSt3R | w/o $A^{a=\text{src}}_{\mu}$ | 45.1 | 45.2 | 42.8 | 39.9 | 42.2 | 38.5 |
| DUSt3R | w/o $A^{a=\text{src}}_{\sigma}$ | 42.3 | 50.0 | 35.0 | 37.0 | 30.9 | 28.3 |
| DUSt3R | w/o $A^{a=\text{ref}}_{\mu}$ | 33.3 | 28.4 | 31.5 | 27.9 | 32.5 | 29.7 |
| DUSt3R | w/o $A^{a=\text{ref}}_{\sigma}$ | 47.7 | 54.1 | 46.2 | 54.3 | 43.7 | 48.6 |
| DUSt3R | w/o Clustering | 40.0 | 38.5 | 38.3 | 38.3 | 34.3 | 30.5 |
| DUSt3R | Full | 53.1 | 60.4 | 49.0 | 56.4 | 44.5 | 49.6 |
| MonST3R | w/o $A^{a=\text{src}}_{\mu}$ | 47.2 | 51.5 | 44.4 | 46.7 | 40.9 | 41.5 |
| MonST3R | w/o $A^{a=\text{src}}_{\sigma}$ | 49.7 | 60.1 | 48.7 | 57.8 | 44.9 | 49.6 |
| MonST3R | w/o $A^{a=\text{ref}}_{\mu}$ | 46.4 | 54.0 | 47.4 | 55.9 | 45.3 | 50.7 |
| MonST3R | w/o $A^{a=\text{ref}}_{\sigma}$ | 50.7 | 62.6 | 51.0 | 60.2 | 50.3 | 56.8 |
| MonST3R | w/o Clustering | 45.5 | 46.7 | 45.3 | 48.1 | 42.1 | 43.5 |
| MonST3R | Full | 57.7 | 71.6 | 56.5 | 68.6 | 53.0 | 63.4 |

B. Ablation Study

Our ablation is twofold: dynamic object segmentation and 4D reconstruction. For dynamic object segmentation, as shown in Table 6, we ablate the contribution of the four aggregated temporal cross-attention maps, $A^{a=\text{src}}_{\mu}$, $A^{a=\text{src}}_{\sigma}$, $A^{a=\text{ref}}_{\mu}$, $A^{a=\text{ref}}_{\sigma}$, and of feature clustering. The ablation results show that (1) disabling any temporal cross-attention map leads to a performance drop, indicating that all attention maps contribute to improved dynamic object segmentation; and (2) features from the DUSt3R encoder exhibit temporal consistency and enhance dynamic segmentation through cross-frame clustering.

Table 7 presents ablation studies on 4D reconstruction, evaluating two key design choices: (1) the impact of two-branch re-weighting (applying attention re-weighting to both reference and source decoders) and (2) global alignment using optical flow with and without segmentation. The ablation results show that (1) re-weighting only the reference view decoder outperforms re-weighting both branches. Since the reference and source decoders serve different roles, and the reference view acts as the static standard, this aligns with our design intuition (i); and (2) incorporating segmentation in global alignment consistently improves 4D reconstruction quality.

Table 7. Ablation Study of Camera Pose Estimation and Point Cloud Reconstruction on the DyCheck dataset.

| Backbone | Re-weighting | Flow-GA | ATE↓ | RTE↓ | RRE↓ | Accuracy↓ Mean | Accuracy↓ Median | Completeness↓ Mean | Completeness↓ Median | Distance↓ Mean | Distance↓ Median |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DUSt3R | Ref + Src | ✗ | 0.030 | 0.026 | 1.777 | 0.775 | 0.596 | 1.848 | 0.778 | 0.342 | 0.224 |
| DUSt3R | Ref | ✗ | 0.029 | 0.025 | 1.774 | 0.772 | 0.596 | 1.813 | 0.757 | 0.336 | 0.219 |
| DUSt3R | Ref | w/o Mask | 0.026 | 0.017 | 1.472 | 0.940 | 0.831 | 1.654 | 0.685 | 0.336 | 0.220 |
| DUSt3R | Ref | w/ Mask | 0.021 | 0.014 | 1.092 | 0.703 | 0.589 | 1.474 | 0.586 | 0.301 | 0.186 |
| MonST3R | Ref + Src | ✗ | 0.040 | 0.032 | 1.751 | 0.848 | 0.744 | 1.850 | 1.003 | 0.398 | 0.292 |
| MonST3R | Ref | ✗ | 0.038 | 0.032 | 1.736 | 0.846 | 0.660 | 1.840 | 0.983 | 0.390 | 0.290 |
| MonST3R | Ref | w/o Mask | 0.033 | 0.023 | 1.495 | 0.969 | 0.796 | 1.752 | 0.998 | 0.368 | 0.273 |
| MonST3R | Ref | w/ Mask | 0.030 | 0.021 | 1.390 | 0.834 | 0.643 | 1.661 | 0.916 | 0.350 | 0.255 |

[Figure 9 panels: DUSt3R [64], Ours.]

Figure 9. Limitation. We visualize static reconstructions from two different viewpoints in the top and bottom rows. Easi3R improves camera pose estimation and point cloud reconstruction (top row), enhancing alignment in structures like swing supports through attention re-weighting and segmentation-aware global alignment. However, from another viewpoint (bottom row), Easi3R still produces floaters near object boundaries.

C. Limitations

Despite strong performance on various in-the-wild videos, Easi3R can fail when the DUSt3R/MonST3R backbones produce inaccurate depth predictions. Easi3R effectively improves camera pose estimation and point cloud reconstruction: as shown in Table 5 of the main paper, it provides clear improvements in the completeness and distance metrics, which are measured on the global point cloud. However, a noticeable gap remains in depth accuracy, which is evaluated on per-view outputs.
This is because our method focuses mainly on improving dynamic regions and global alignment rather than correcting depth predictions in static parts, as illustrated in Figure 9. We leave per-view depth correction for future work.

D. Additional Results

We report additional qualitative results of disentangled 4D reconstruction in Figure 10, Figure 11 and Figure 12. We find that MonST3R tends to predict under-segmented dynamic masks, while DAS3R tends to predict over-segmented dynamic masks. CUT3R, although it produces more accurate depth estimation, is prone to being affected by dynamic objects, leading to misaligned static structures, unstable camera pose estimation, and ghosting artifacts due to the lack of dynamic segmentation prediction. In contrast, Easi3R achieves more accurate segmentation, camera pose estimation, and 4D reconstruction, resulting in renderings with better visual quality.

[Figure 10 panels: Video, CUT3R [63], MonST3R [73], DAS3R [68], Ours; sequences of 68 and 80 frames.]

Figure 10. Qualitative Comparison. We visualize cross-frame globally aligned static scenes with dynamic point clouds at a selected timestamp. Notably, instead of using ground truth dynamic masks as in previous work, we apply the estimated per-frame dynamic masks to filter out dynamic points at other timestamps for comparison. Top and bottom rows are Easi3R$_{\text{dust3r}}$ and Easi3R$_{\text{monst3r}}$, respectively.

[Figure 11 panels: Video, 4D Reconstruction, Static Scene, Dynamic Object; rows Easi3R and MonST3R; frames 4, 55, and 78 of an 80-frame sequence.]

Figure 11. Disentanglement vs. MonST3R [73]. We visualize the disentangled 4D reconstruction, static scene and dynamic objects at different frames. MonST3R tends to predict under-segmented dynamic masks.

[Figure 12 panels: Video, 4D Reconstruction, Static Scene, Dynamic Object; rows Easi3R and DAS3R; frames 0, 44, and 57 of a 79-frame sequence.]

Figure 12. Disentanglement vs. DAS3R [68]. We visualize the disentangled 4D reconstruction, static scene and dynamic objects at different frames. DAS3R tends to predict over-segmented dynamic masks.
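As a reading aid for the point-cloud numbers in Tables 4, 5, and 7, the accuracy and completeness metrics described in Section 4.3 can be sketched as nearest-neighbor distances between two point sets. This is a brute-force illustration under stated assumptions, not the evaluation code used for the reported results: the function names are hypothetical, point clouds are plain lists of 3D tuples assumed to be already aligned, and the alignment and sampling protocol of the benchmark is not reproduced.

```python
import math
from statistics import mean, median


def nearest_distances(source, target):
    """For each point in `source`, the Euclidean distance to its
    nearest neighbor in `target` (brute force, O(len(source) * len(target)))."""
    return [min(math.dist(p, q) for q in target) for p in source]


def accuracy_and_completeness(reconstructed, ground_truth):
    """Accuracy: reconstructed -> ground truth nearest distances.
    Completeness: ground truth -> reconstructed nearest distances.
    Returns (mean, median) pairs, matching the columns of the tables."""
    acc = nearest_distances(reconstructed, ground_truth)
    comp = nearest_distances(ground_truth, reconstructed)
    return {
        "accuracy": (mean(acc), median(acc)),
        "completeness": (mean(comp), median(comp)),
    }
```

A reconstruction that covers only part of the scene can score well on accuracy while scoring poorly on completeness, which is why the tables report both directions. For realistic point counts a spatial index such as a KD-tree would replace the brute-force search.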