
arxiv

Paper 2503.24382

Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis from Extremely Sparse and Unposed Views

Authors: Chong Bao, Xiyu Zhang, Zehao Yu, Jiale Shi, Guofeng Zhang, Songyou Peng, Zhaopeng Cui

Published: 2025-03-31

Abstract:

Neural rendering has demonstrated remarkable success in high-quality 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. However, applying it to extremely sparse, unposed views in unbounded 360° scenes remains a challenging problem. In this paper, we propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. To resolve the spatial ambiguity inherent in unbounded scenes with sparse input views, we propose a layered Gaussian-based representation to effectively model the scene with distinct spatial layers. By employing a dense stereo reconstruction model to recover coarse geometry, we introduce a layer-specific bootstrap optimization to refine the noise and fill occluded regions in the reconstruction. Furthermore, we propose an iterative fusion of reconstruction and generation alongside an uncertainty-aware training approach to facilitate mutual conditioning and enhancement between these two processes. Comprehensive experiments show that our approach outperforms existing state-of-the-art methods in terms of rendering quality and surface reconstruction accuracy. Project page: https://zju3dv.github.io/free360/

Paper Content: on Alphaxiv
Chong Bao¹§, Xiyu Zhang¹, Zehao Yu³, Jiale Shi¹, Guofeng Zhang¹, Songyou Peng², Zhaopeng Cui¹†
¹State Key Lab of CAD&CG, Zhejiang University; ²ETH Zürich; ³University of Tübingen, Tübingen AI Center
†Corresponding author. §The work was partially done when visiting ETHZ.
arXiv:2503.24382v1 [cs.CV] 31 Mar 2025

[Figure 1 panels: (a) Overview of Our Framework: input of extremely sparse (3-4) views, iterative layered 360° reconstruction and video generation. (b) Novel View Synthesis Comparisons and Our Surface Reconstruction: Ground Truth, InstantSplat [10], FSGS* [57], ViewCrafter [51], Ours Rendering, Our Surface Reconstruction.]
Figure 1. Free360. (a) We propose a novel Gaussian-based framework, which can reconstruct unbounded 360° scenes from extremely (3-4) sparse views through an iterative fusion of layered reconstruction and generation. (b) Our method outperforms other state-of-the-art methods in rendering quality and supports complete surface reconstruction.

1. Introduction

3D Gaussian Splatting (3DGS) [18] has shown great success in achieving high efficiency and quality in 3D neural reconstruction and novel view synthesis with dense input views and accurate poses. With the development of large reconstruction models [14, 51, 58] and generative models [25, 26, 37, 53], it has even become feasible to recover 3D objects or bounded scenes from sparse views.

Most sparse-view 3DGS approaches [4, 23, 60] rely on precise camera poses and relatively sparse sets of views (typically more than 10). To handle pose-free scenarios, some methods [10] learn a 3DGS from the point cloud and camera poses obtained from dense stereo reconstruction models [22, 41]; however, these stereo models often yield noisy geometry in unbounded scenes, leading to more artifacts and corrupted structures at novel viewpoints. Other methods [11, 42, 52, 53] attempt to fine-tune video diffusion models for direct novel view generation, but such methods are limited to narrow-baseline or highly overlapping scenarios due to the inherent absence of 3D information in the video generation model. In unbounded 360° scenes with low-overlap sparse views, these generated views exhibit severe multi-view inconsistencies, further contaminating the reconstruction.
Therefore, performing neural reconstruction and unbounded 360° view synthesis from extremely sparse and unposed views remains an underexplored and challenging problem.

In this paper, we present a novel neural rendering framework to realize unbounded 360° view synthesis and 3D reconstruction from extremely sparse views (e.g., 3-4 views) by analyzing and addressing two key challenges in unposed and unbounded scenarios. First, the dense stereo reconstruction model [22, 41] is employed to recover the coarse geometry and camera poses. However, the large depth span of the unbounded scene and insufficient multi-view correspondences from extremely sparse 360° views lead to two critical issues: 1) unreliable depth estimation, which hinders the clear differentiation between near-camera and distant structures, and 2) different visibility characteristics, which pose challenges for unified optimization, as near-camera content appears in multiple views, whereas distant structures are often partially occluded and visible from only a single viewpoint. To tackle these issues, we propose a layered Gaussian-based representation enabling layer-specific bootstrap optimizations. Our approach explicitly constructs the scene's layered structure and uses a photometric-guided optimization to mitigate noise and correct detail distortions for near-camera reconstruction, alongside a prior-guided inpainting to complete missing regions in distant reconstruction.

Second, given the extremely sparse views, dense observations are missing for 360° reconstruction, requiring the use of generative models to generate additional observations. The latest video diffusion models [53] have demonstrated the capability of generating novel views from a good condition, e.g., accurate per-frame point cloud renderings, which however are hard to obtain given extremely sparse views.
To tackle this challenge, we propose an iterative fusion strategy that seamlessly integrates reconstruction and generation. The generation provides supplementary observations for reconstruction to alleviate the sparse-view ambiguity. In turn, the reconstruction resolves inconsistency within generated views and produces consistent renderings to condition the generation process. Moreover, to prevent error propagation during the fusion process and ensure a robust reconstruction upon generation, we develop an uncertainty-aware training that filters out inconsistent generated content.

In summary, the contributions of our paper are as follows. (1) We propose a novel neural rendering framework to accomplish the unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. (2) We propose a layered Gaussian-based representation to address the spatial ambiguity of unbounded scenes, along with a layer-specific bootstrap optimization upon the layered representation to refine the noisy reconstruction. (3) We design an iterative fusion strategy of reconstruction and novel view generation, facilitating mutual conditioning and enhancement between the two processes. Besides, we incorporate uncertainty-aware learning to mitigate error propagation and ensure robust reconstruction. (4) We conduct extensive experiments on various large-scale unbounded 360° scenes using only 3-9 views. The results demonstrate that our method achieves the best rendering quality and surface reconstruction accuracy among the compared methods.

2. Related Works

Sparse-view Novel View Synthesis. Novel view synthesis (NVS) aims to generate new images from viewpoints outside the input views. Since the introduction of NeRF [29], the NVS field has evolved rapidly, with 3D Gaussian Splatting [18, 55] emerging as the new standard.
While both NeRF and 3DGS produce impressive photometric NVS results from dense input views, their performance degrades significantly with sparse input views due to limited optimization constraints [5, 56]. Subsequent work has aimed to enhance sparse-view NVS by incorporating semantic regularization [2, 3, 16], smoothness priors [30, 48], or geometric priors [9, 23, 35, 46, 54, 60]. For instance, FSGS [60] and DNGaussian [23] improve sparse-view NVS by regularizing depth maps rendered from 3D Gaussians with monocular depth priors [32, 33]. However, these methods either focus on forward-facing scenes [1, 28] or rely on relatively dense input views (e.g., 24 views in the Mip-NeRF 360 dataset [5]). In contrast, we address an extremely challenging scenario involving only 3 or 4 views for 360° inward-facing scenes without assuming ground-truth poses.

Pose-free Novel View Synthesis. Optimizing a NeRF or 3DGS model typically requires known camera poses beforehand. In practice, these poses are often recovered using established structure-from-motion (SfM) methods, such as COLMAP [38]. However, traditional SfM techniques face significant challenges in accurately estimating poses when provided with sparse input views, making it impractical to assume known poses in such scenarios. While pose-free methods [6, 8, 12, 24] have emerged to tackle this issue by jointly optimizing camera poses and scene representation, these approaches often struggle under sparse input conditions. Recently, DUSt3R [41], an unstructured dense stereo reconstruction model, has demonstrated robust performance in estimating dense 3D points and camera poses from sparse views. Building on DUSt3R's output, InstantSplat [10] performs joint optimization on poses and Gaussian attributes, achieving improved NVS renderings. Following this line of research, we leverage DUSt3R to estimate camera poses and dense 3D points from sparse input views.
However, we observe needle-like artifacts when using the noisy points from DUSt3R. To address this, we build a layered Gaussian Splatting with layer-specific bootstrap optimization.

[Figure 2 panels: (a) Sparse Input Images; (b) Bootstrap Optimization; (c) Multi-layer Reconstruction; (d) Multi-layer Gaussian Splatting; (e) Iterative Fusion of Reconstruction and Generation.]
Figure 2. Pipeline. (a-d) Given unposed extremely sparse views, we employ the dense stereo reconstruction model [22, 41] to recover camera poses and an initial point cloud of the scene. A layered Gaussian-based representation is built upon the initial point cloud to enable layer-specific bootstrap optimization. (e) We design the iterative fusion of reconstruction and generation with a diffusion model [53]. Unknown views are iteratively generated under conditions of consistent GS rendering of known views. In turn, generated views are used to enhance the GS training.

Generative Sparse Novel View Synthesis. Generative models have achieved remarkable progress in recent years, delivering impressive results in image [36], video [7], and 3D generation [31]. This success has led to the exploration of generative models for sparse novel view synthesis (NVS) [13, 26, 27, 37, 44, 45, 47]. For instance, ReconFusion [43] trains a diffusion model to refine the noisy rendering of a NeRF. With the diffusion model, ReconFusion optimizes a NeRF with a sampling loss from the denoising model for novel views and a reconstruction loss on the sparse input views.
More recent approaches [11, 42, 52, 53] fine-tune video diffusion models for novel view synthesis from sparse inputs. Although these methods perform well when the input images overlap significantly, they often suffer from identity shift and multi-view inconsistency when given low-overlap sparse views. In our work, we build upon video diffusion models by introducing a progressive fusion step that enhances reconstruction by selecting reliable images generated by the video diffusion model. Furthermore, we compute uncertainty maps to capture inconsistencies in the generated images, guiding the reconstruction process.

3. Method

As shown in Fig. 2, given unposed extremely sparse views in an unbounded 360° scene, we propose a novel neural rendering framework to faithfully reconstruct the structure of the scene and render high-quality novel views. By utilizing the dense stereo reconstruction model [22, 41] to obtain the camera poses and an initial point cloud from sparse views, we partition the scene into a layered structure and build a layered Gaussian-based representation upon it (see Sec. 3.1). Then, we introduce a layer-specific bootstrap optimization technique to refine reconstruction errors (see Sec. 3.2). Due to the inadequate constraints provided by sparse views, we incorporate the video diffusion model [53] as a prior and propose an iterative fusion approach that combines reconstruction and generation, enabling mutual conditioning and enhancement between the two processes, i.e., generation provides novel observations for reconstruction while reconstruction resolves inconsistencies within the generated outputs (see Sec. 3.3). Furthermore, we propose an uncertainty-aware training technique to achieve robust reconstruction with the awareness of inconsistent generated content (see Sec. 3.4).

3.1. Layered Gaussian Splatting

In the spirit of partitioning the near-camera and far-away scene content into different layers for layer-specific optimization, existing unbounded NeRF methods [5, 57] intuitively treat the scene inside the smallest sphere encompassing all camera positions as foreground, with all other areas as background. Following this spirit, we use a more compact bounding volume to partition an unbounded scene into a front layer and a back layer. Specifically, the front layer is the scene inside the smallest bounding sphere that encapsulates the intersection area of all camera frustums, while the back layer constitutes the residual scene. Alternatively, a more precise partition is to annotate the front and back layers on the monocular depth [49] of each sparse-input view.

Given unposed sparse images, we use the dense stereo reconstruction model [22, 41] to recover the scene's camera poses and initial point cloud. Then, the point cloud is partitioned into the front-layer and back-layer point clouds as described above. We build our layered representation upon 2D Gaussian Splatting [15]. Each layer $i$ is independently modeled as a group of Gaussian primitives $G_i = \{G_{i,k} \mid k = 1, \ldots, K\}$, which is initialized using the partitioned point clouds. The Gaussian primitives are parameterized by the center position $\mu_{i,k}$, a scaling vector $S_{i,k}$, and a rotation matrix $R_{i,k}$. The color $c_{i,k}$ of the Gaussian primitives is characterized by view-dependent spherical harmonics (SH). To render the entire scene, we merge all Gaussian primitives within each layer and execute Gaussian rasterization once for efficiency and anti-aliasing. The pixel color is composited by the point-based α-blending of the rasterized Gaussians that overlap the pixel.

[Figure 3 panels: Input sparse images; Stereo Reconstruction Results; Bootstrap-optimized Reconstruction Results (projected from rendered depth).]
Figure 3. Visualization of Point Clouds. We compare results from the stereo reconstruction model and our bootstrap optimization.

[Figure 4 panels: Input Images; Generated Novel View; Uncertainty Map.]
Figure 4. Uncertainty Map. We show the uncertainty map estimated from generated novel views. The flower has severe multi-view inconsistency with high uncertainty.

3.2. Bootstrap Optimization

The point cloud produced by reconstructing an entire scene using a stereo model is typically error-prone due to the insufficient visual constraints from sparse views and the model's inability to perceive fine structures and occlusion. With the layered GS, we apply layer-specific bootstrap optimization to refine these errors.

For the front layer, a noisy point cloud and missing geometry of fine structures often occur, such as the noisy floating points of Bonsai and the missing leg of Horse in Fig. 3. This can further diminish the quality of Gaussian rendering and generation consistency. Inspired by RAIN-GS [17], which demonstrates that photometric-based GS optimization can recover clean geometry with fine details when initialized with as few as 10 points, we perform an initial GS optimization on the front layer using a downsampled point cloud. This downsampling operation employs a voxel-based uniform downsampling strategy and thereby avoids initializing the many erroneous points of the point cloud as Gaussians. The voxel size ranges from 1% to 3% of the bounding box of the point cloud. Given the sparsity of input views, we further integrate monocular normal constraints [50] to enhance geometric accuracy. During optimization, we enable densification to prune noisy Gaussians while expanding new Gaussians to capture missing fine structures, guided by the images and monocular normals.
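The layer partition of Sec. 3.1 and the voxel-based downsampling of the front layer described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bounding sphere (`center`, `radius`) is taken as given rather than derived from the camera frustums, the voxel edge is assumed to be a fraction of the longest bounding-box side (the text only says "1% to 3% of the bounding box"), and the function names `split_layers` and `voxel_downsample` are hypothetical.

```python
import math
from collections import defaultdict


def split_layers(points, center, radius):
    """Split 3D points into (front, back) lists using a bounding sphere.

    Assumption: the sphere is supplied by the caller; the paper derives it
    from the intersection of all camera frustums.
    """
    front, back = [], []
    for p in points:
        (front if math.dist(p, center) <= radius else back).append(p)
    return front, back


def voxel_downsample(points, voxel_ratio=0.02):
    """Keep one centroid per occupied voxel.

    Assumption: voxel edge = voxel_ratio * longest bounding-box side,
    with voxel_ratio chosen in the 1% to 3% range quoted in the text.
    """
    if not points:
        return []
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    extent = max(maxs[a] - mins[a] for a in range(3))
    voxel = max(extent * voxel_ratio, 1e-12)  # guard against a degenerate box

    buckets = defaultdict(list)
    for p in points:
        key = tuple(int((p[a] - mins[a]) // voxel) for a in range(3))
        buckets[key].append(p)

    # Average the points that fall into the same voxel.
    return [
        tuple(sum(q[a] for q in pts) / len(pts) for a in range(3))
        for pts in buckets.values()
    ]
```

Averaging per voxel is one common choice; keeping an arbitrary representative point per voxel would serve the same purpose of thinning out erroneous stereo points before the initial GS optimization.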
Specifically, we use a photometric loss $L_c$ between the rendered images and input images to capture geometric details, a binary entropy loss $L_m$ between the rendered alpha map and the layer mask to remove dislocated outliers, a cosine similarity loss $L_n$ between the rendered normal map and the monocular normal to fix bumpy surfaces, and an L1 loss $L_d$ between the rendered and stereo-reconstructed [41] depths to regularize the geometry:

$$L_{bs} = L_c + \lambda_m L_m + \lambda_n L_n + \lambda_d L_d, \quad (1)$$

where $\lambda_m = 1.0$, $\lambda_n = 0.25$, $\lambda_d = 0.1$. As shown in Fig. 3, our optimization yields a cleaner point cloud in Bonsai and fills the missing geometry of the leg in Horse.

For the back layer, we inpaint the depth of regions in the back layer that are not visible from the input viewpoints. To achieve this, we sample several novel viewpoints and render their visibility masks based on the back layer's point cloud. Then, we obtain the bounding sphere of the back layer's point cloud, and a ray-box intersection is performed to estimate the depth of invisible pixels. Based on the inpainted depth, new points are subsequently added to the back layer's point cloud.

3.3. Iterative Fusion of Reconstruction and Generation

Due to underdetermined constraints in the extremely sparse views, we leverage a video diffusion model [53] to generate new observations for 360° reconstruction. Here we use ViewCrafter [53] as the generative prior. ViewCrafter [53] takes $P$ images of a sequential motion as conditions, which consist of point cloud renderings and real images, and generates $P$ novel views following the conditioned motions. A naïve way to use the generative prior is to generate all required novel views at once and train Gaussian primitives directly on them. However, ViewCrafter [53] exhibits pronounced multi-view inconsistency when generating novel views between low-overlap sparse views, as it is trained on videos characterized by minimal camera motion and substantial per-frame overlap.
These inconsistencies degrade the GS optimization process, leading to blurry and distorted renderings.

To mitigate the inconsistency in the generation, we propose an iterative fusion strategy of reconstruction and generation where the diffusion model is iteratively conditioned on rendered images and generates consistent novel views to enhance the reconstruction in turn. Specifically, a known image set $I_{known}$ and pose set $V_{known}$ are defined to denote the cameras with known image supervision (real or generated). We initialize the known image set and pose set with the input sparse views, $I_{known} = I_{input}$, $V_{known} = V_{input}$. Starting from a start pose sampled among the input poses, we sequentially interpolate $P$ poses $V_{gen}$ that consist of known cameras $V_{known,gen}$ (generated at the previous iteration) and unknown cameras $V_{novel}$, $V_{gen} = V_{known,gen} \cup V_{novel}$. For the known cameras, we use real images or GS-rendered images as generative conditions. For unknown cameras, we use renderings of the whole-scene point cloud produced with PyTorch3D [34] as the generative conditions. Next, we utilize the video diffusion model to generate images $I_{gen}$ based on these conditions, $I_{gen} = I_{known,gen} \cup I_{novel}$. The video diffusion model [53] generates $P$ sequential frames in a single forward pass, but the frame quality is not constant. The inconsistency accumulates with distance, i.e., the consistency of novel views deteriorates as the camera deviates from the known cameras. Therefore, we define a reliable frame selection strategy. Only $P'$ frames $I'_{novel}$ and poses $V'_{novel}$ are selected within the novel frames $I_{novel}$ and poses $V_{novel}$ according to the minimum distance to the known poses:

$$I'_{novel}, V'_{novel} = \arg\min_{I'_{novel},\, V'_{novel}} \sum_{v^{(p)}_{novel} \in V_{novel}}^{P'} \min_{v \in V_{known,gen}} \left\| v^{(p)}_{novel} - v \right\|. \quad (2)$$

Then, we append the selected novel poses and frames into the known set, $I_{known} = I_{known} \cup I'_{novel}$, $V_{known} = V_{known} \cup V'_{novel}$. At the end of an iteration, we train the layered GS on these known poses and images to progressively learn the consistent content from the video prior. This layered GS optimization is introduced in Sec. 3.4. We empirically define 300 to 400 unknown camera poses in total and repeat this process multiple times until all cameras become known. Please refer to the supp. material for the camera definition.

Table 1. Comparison on Mip-NeRF 360 [5] Dataset. We compare the rendering quality with baselines given 3, 6 and 9 views.
Method | 3 views PSNR↑ | 3 views SSIM↑ | 3 views LPIPS↓ | 6 views PSNR↑ | 6 views SSIM↑ | 6 views LPIPS↓ | 9 views PSNR↑ | 9 views SSIM↑ | 9 views LPIPS↓
FSGS* [60] | 14.25 | 0.286 | 0.545 | 15.65 | 0.318 | 0.509 | 16.37 | 0.347 | 0.489
InstantSplat [10] | 14.10 | 0.296 | 0.529 | 15.83 | 0.351 | 0.480 | 17.16 | 0.402 | 0.443
ZeroNVS* [37] | 15.16 | 0.327 | 0.632 | 15.46 | 0.313 | 0.614 | 15.81 | 0.328 | 0.607
ViewCrafter [53] | 15.63 | 0.358 | 0.515 | 16.32 | 0.366 | 0.555 | 16.68 | 0.382 | 0.551
Ours | 16.20 | 0.378 | 0.499 | 16.83 | 0.397 | 0.458 | 17.35 | 0.423 | 0.435

Table 2. Comparison on Tanks and Temples [20] Dataset. We compare the rendering quality with baselines given 4, 6 and 9 views.
Method | 4 views PSNR↑ | 4 views SSIM↑ | 4 views LPIPS↓ | 6 views PSNR↑ | 6 views SSIM↑ | 6 views LPIPS↓ | 9 views PSNR↑ | 9 views SSIM↑ | 9 views LPIPS↓
FSGS* [60] | 13.14 | 0.357 | 0.509 | 14.29 | 0.425 | 0.451 | 15.13 | 0.457 | 0.423
InstantSplat [10] | 13.03 | 0.390 | 0.496 | 14.22 | 0.457 | 0.439 | 15.58 | 0.51 | 0.391
ZeroNVS* [37] | 12.62 | 0.363 | 0.619 | 12.96 | 0.349 | 0.597 | 13.18 | 0.364 | 0.592
ViewCrafter [53] | 13.56 | 0.412 | 0.504 | 14.12 | 0.424 | 0.483 | 14.70 | 0.443 | 0.467
Ours | 14.69 | 0.476 | 0.409 | 15.67 | 0.523 | 0.368 | 16.73 | 0.564 | 0.328

3.4. Uncertainty-aware Training

To avoid error drifting during mutual conditioning and achieve a robust reconstruction, an uncertainty measurement is required to distinguish the reliable parts of generated novel views. In our task, uncertainty arises from the limited constraints of the sparse-view input, which leave excessive degrees of freedom in the content of generated novel views, as shown in Fig. 4, i.e., diffusion hallucinates incorrect geometry and appearance in high-freedom regions.
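Before the uncertainty measure is detailed, the reliable frame selection of Eq. (2) in Sec. 3.3 can be made concrete with a short sketch. Because the objective is a sum of independent per-pose terms, minimizing it over subsets of size $P'$ amounts to keeping the $P'$ novel poses whose nearest known pose is closest. The sketch assumes each pose is reduced to a 3D camera position compared with the Euclidean norm; the pose distance actually used is not specified in this section, and `select_reliable_frames` is a hypothetical name.

```python
import math


def select_reliable_frames(novel_poses, known_poses, num_keep):
    """Return sorted indices of the num_keep novel poses nearest to a known pose.

    Assumption: poses are 3D camera positions and the distance is Euclidean.
    """
    def dist_to_known(pose):
        # Inner min of Eq. (2): distance to the closest known camera.
        return min(math.dist(pose, k) for k in known_poses)

    # Outer argmin of Eq. (2): the sum is separable, so take the smallest terms.
    order = sorted(
        range(len(novel_poses)),
        key=lambda i: dist_to_known(novel_poses[i]),
    )
    return sorted(order[:num_keep])


# Usage: two known cameras, four candidate novel cameras, keep the best two.
known = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
novel = [(1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
kept = select_reliable_frames(novel, known, num_keep=2)
```

The selected frames and poses would then be appended to the known sets before the next fusion iteration, as described in Sec. 3.3.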
Inspired by NeRF Ensembles [40], we exploit epistemic uncertainty [21] to model this hallucination. Specifically, we generate multiple images from the same viewpoint, each conditioned on either unperturbed or perturbed point cloud renderings. As depicted in Fig. 4, we measure the variance across these generated images as epistemic uncertainty, which reflects the model's lack of knowledge in under-constrained regions. The perturbation process is based on the L1 difference between the generated views and their corresponding unperturbed conditions. Perturbations are then selectively applied to areas of the conditions where this difference exceeds a predefined threshold $\beta$.

Moreover, the uncertainty map is combined with several loss terms to supervise the training. We employ the photometric loss $L_c$ and perceptual loss $L_{lpips}$ between rendered images and generated images, and a cosine loss $L_n$ between the rendered normal and the monocular normal at input viewpoints. Additionally, we use a binary entropy loss $L_m$ between alpha maps of the front GS and front-layer masks to retain the shape. We also exploit regularization terms $L_{reg}$, such as the distortion loss and normal consistency:

$$L = u * \left[ \lambda_c (L_c + L_{lpips}) + \lambda_m L_m + \lambda_n L_n + \lambda_{reg} L_{reg} \right], \quad (3)$$

where $\lambda_m = 1.0$, $\lambda_n = 0.25$, and $\lambda_{reg}$ is 100 for the distortion loss and 0.25 for the normal consistency loss. As the inconsistency accumulates as the camera deviates from the input views (see Sec. 3.3), we relate $\lambda_c$ to the distance $x$ to the input sparse views, $\lambda_c = f(x) = \max(0.5 \cdot e^{-20x}, 0.05)$. During training, the densification process is deactivated to prevent densifying Gaussians based on inaccurate positional gradients arising from inconsistencies of generated views. Please refer to the supp. material for training details.

4. Experiments

4.1. Dataset & Baselines

Dataset. We evaluate our method on two unbounded 360° datasets: Mip-NeRF 360 [5] and Tanks and Temples [20].
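As a side note on the training objective of Sec. 3.4, the distance-dependent weight $\lambda_c = \max(0.5 \cdot e^{-20x}, 0.05)$ and the weighting in Eq. (3) can be checked numerically with a small sketch. It makes several simplifying assumptions: every loss term is an already-reduced scalar, the uncertainty-derived weight `u` is a single scalar rather than a per-pixel map, and $\lambda_{reg} L_{reg}$ is read as 100 times the distortion loss plus 0.25 times the normal consistency loss. The function names are hypothetical.

```python
import math


def lambda_c(x):
    """Photometric/perceptual weight as a function of distance x to the input views."""
    return max(0.5 * math.exp(-20.0 * x), 0.05)


def total_loss(u, x, l_c, l_lpips, l_m, l_n, l_dist, l_nc):
    """Scalar instance of Eq. (3) with the weights quoted in the text.

    Assumptions: u is a scalar weight; l_dist and l_nc are the distortion and
    normal-consistency regularizers that together form lambda_reg * L_reg.
    """
    weighted = (
        lambda_c(x) * (l_c + l_lpips)  # lambda_c (L_c + L_lpips)
        + 1.0 * l_m                    # lambda_m = 1.0
        + 0.25 * l_n                   # lambda_n = 0.25
        + 100.0 * l_dist               # lambda_reg for the distortion loss
        + 0.25 * l_nc                  # lambda_reg for normal consistency
    )
    return u * weighted


# The weight starts at 0.5 at an input view and is clamped to the 0.05 floor
# once 0.5 * exp(-20 x) drops below it, i.e. for x > ln(10) / 20.
floor_distance = math.log(10.0) / 20.0
```

Under this schedule, generated views far from any input view still contribute to the photometric and perceptual terms, but with a weight ten times smaller than at the input views.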
For Mip-NeRF 360 [5], we choose 8 out of 9 sequences containing indoor and outdoor scenes, excluding the Flower scene as DUSt3R [41] fails using sparse views. We follow 3DGS's [18] train-test split and then select 3, 6, 9 views from its training split as training images, and evaluate novel views on its test split. For Tanks and Temples [20], we use 6 unbounded outdoor scenes, each captured in 360°. Similarly, we uniformly split the train and test set by sampling every 8th image as the test set, and the remaining images are training images. As 3 views are insufficient to capture the 360-degree scenes, we select 4, 6, and 9 views from the training images as sparse input and evaluate novel views on the test split. We align estimated poses from DUSt3R [41] with COLMAP [39] poses recovered using all images for evaluation. Please refer to the supp. material for more explanations.

[Figure 5 panels: columns FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours, Reference Ground Truth; rows Bicycle, Stump, Garden, Treehill, Kitchen.]
Figure 5. Comparison on Mip-NeRF 360 [5] Dataset on the 3-View Setting. We qualitatively compare rendering quality with FSGS* [60], InstantSplat [10], ZeroNVS* [37], ViewCrafter [53] given 3 input views.

Baselines. We compare with the following state-of-the-art methods [10, 37, 53, 60]. FSGS [60] is a 3DGS-based method that regularizes the Gaussian optimization with a monocular depth prior. ZeroNVS [37] trains a diffusion model for single-view reconstruction in unbounded scenes. We follow ReconFusion [44] to adapt it to multi-view input. For fair comparisons, we use the same input point cloud and camera poses from DUSt3R for FSGS* and ZeroNVS*. InstantSplat [10] initializes Gaussian primitives from DUSt3R's point cloud and optimizes 3D Gaussians and camera poses jointly. ViewCrafter [53] fine-tunes a video diffusion model conditioned on point cloud rendering from DUSt3R and trains a final 3DGS [18] based on generated views and sparse input. Please refer to the supp. material for more details on baseline training.

[Figure 6 panels: columns FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours, Reference Ground Truth; rows Garden, Family, Truck, Horse, Ignatius.]
Figure 6. Comparison on Tanks and Temples [20] Dataset on the 4-View Setting. We qualitatively compare rendering quality with FSGS* [60], InstantSplat [10], ZeroNVS* [37], ViewCrafter [53] given 4 views.

[Figure 7 panels: (a) 3-view Reconstruction on Mip-NeRF 360; (b) 4-view Reconstruction on Tanks and Temples; columns Ground Truth, 2DGS*, Ours.]
Figure 7. Comparison on 3D Surface Reconstruction. We qualitatively compare surface reconstruction with 2DGS [15] on (a) Bicycle (top) and Garden (bottom) from the Mip-NeRF 360 [5] dataset and (b) Barn (top) and Ignatius (bottom) from the Tanks and Temples dataset [20].

4.2. Novel View Synthesis

We now evaluate our method on the Mip-NeRF 360 dataset [5]. As shown in Tab. 1, our method outperforms all baselines across all metrics. As shown in Fig. 5, this dataset is particularly challenging as it contains many fine-grained structures, like the delicate flowers in the Garden. FSGS [60] and InstantSplat [10] show strong artifacts, such as "foggy" geometry and needle-like distorted Gaussians in the background. Their reconstruction quality is limited by the noisy point cloud from DUSt3R, especially for fine structures. In contrast, our Bootstrap Optimization corrects these errors, restoring details like the flowers in the Garden scene and the leaves in the Stump scene. ZeroNVS [37] fails to generate consistent novel views for 3DGS training, while ViewCrafter [53] has similar issues. Inconsistent novel views from the video diffusion model also impair quality, resulting in blurred renderings and view-dependent artifacts, as seen in the flowers of the Garden scene and the split wheels of the Lego in the Kitchen scene. Our Iterative Fusion effectively mitigates inconsistencies, improving reconstruction.
We further evaluate our method on the Tanks and Temples dataset [20], which presents larger scenes and significant depth variations, challenging reconstruction from sparse views. Similar to our observation above, our method outperforms all baselines, as shown in Tab. 2. We observe similar artifacts for baseline methods in Fig. 6: FSGS [60] and InstantSplat [10] produce needle-like Gaussians for the foreground object due to noisy point cloud initialization. Further, they fail to complete the background regions as they do not employ generative priors. ZeroNVS [37] produces overly blurred images, while ViewCrafter [53] produces distorted geometry due to noise in the point cloud initialization and inconsistency in the generative model. For instance, the leg of the horse is missing in ViewCrafter's rendering, and the sky sticks to the foreground objects. By contrast, our multi-layer representation mitigates depth ambiguity, preserving foreground integrity. Our bootstrap optimization improves the scene's geometry, resulting in better NVS results.

Table 3. Comparison on 3D Surface Reconstruction. We compare with 2DGS [15] on four scenes of Tanks and Temples [20]. The evaluation metric is F1-score and higher is better.
Methods | 4 views | 6 views | 9 views
2DGS [15] | 0.284 | 0.367 | 0.418
Ours | 0.423 | 0.430 | 0.439

Table 4. Ablation on Design Choices. We perform ablation studies on the baseline with different designs of our framework.
Settings | PSNR↑ | SSIM↑ | LPIPS↓
Baseline | 16.16 | 0.289 | 0.532
+ Multi-layer (Sec. 3.1) | 16.56 | 0.308 | 0.501
+ Bootstrap Opt. (Sec. 3.2) | 16.65 | 0.314 | 0.491
+ Iterative Fusion (Sec. 3.3) | 16.81 | 0.317 | 0.484
+ Uncertainty (Ours, Sec. 3.4) | 16.95 | 0.321 | 0.480

4.3. Geometry Evaluation

We also evaluate surface reconstruction accuracy against 2DGS [15]. We use camera poses and point clouds from DUSt3R [22, 41] as input for 2DGS, denoted as 2DGS*.
For quantitative comparisons, we conduct the experiment on four scenes with ground-truth meshes in Tanks and Temples [20]: Barn, Ignatius, Caterpillar, and Truck. As sparse-view reconstruction is error-prone, we enlarge the error threshold 10 times for all methods to compute the F1-score. As shown in Tab. 3, our method achieves superior results to 2DGS [15]. We show qualitative comparisons on Mip-NeRF 360 [5] and Tanks and Temples [20] in Fig. 7. 2DGS [15] cannot reconstruct the geometry of unseen regions; as a consequence, there are large holes on the ground and the background, as shown in Fig. 7(a). Moreover, noisy point clouds from dense stereo models [22, 41] lead to error-prone densification, resulting in artifacts in the geometry, particularly in the sky region, where 2DGS [15] grows Gaussian points on the foreground to minimize appearance loss; see the Barn scene in Fig. 7(b). By contrast, through Bootstrap Optimization and Iterative Fusion, we obtain a clean and detailed point cloud for initialization, allowing a smoother distribution of Gaussian points with less noise in the seen regions and complete geometry in the unseen regions.

[Figure 8 panels: Input Images; Viewpoint (V0, V1); Generated View w/ Iterative Fusion; Generated View w/o Iterative Fusion.]
Figure 8. Ablation on Iterative Fusion. We compare the rendering quality of different viewpoints when ablating the iterative fusion (Sec. 3.3) on the Counter scene of the Mip-NeRF 360 [5] dataset.

4.4. Ablation Study

We conduct ablations using Bicycle and Garden from Mip-NeRF 360 [5] with 3 input views. Given the stereo-reconstructed point cloud and poses, "Baseline" first generates images under the same unknown poses as ours, then optimizes a final 2DGS [15] on generated and input images. Adding multiple layers into GS optimization boosts the rendering quality. Bootstrap Optimization provides a cleaner and more detailed geometry for Gaussian Splatting to capture detailed geometry and appearance.
Iterative Fusion and Uncertainty-aware Training generate more consistent novel views, leading to better results in inconsistent regions, which further improves the rendering quality. As shown in Tab. 4, each component effectively enhances the quality of reconstruction. We also compare generated novel views from the video diffusion model on 3 views of Counter from Mip-NeRF 360 [5]. As shown in Fig. 8, the results generated without Iterative Fusion suffer from inconsistency across multiple viewpoints, while with Iterative Fusion, the video diffusion model can generate consistent and high-quality novel views.

5. Conclusion

We have presented a novel framework for unposed and extremely sparse-view 3D reconstruction in unbounded 360° scenes. We build a layered GS upon the output of a dense stereo reconstruction model [22,41] and use bootstrap optimization to generate a clean and detailed geometry from sparse views. Besides, we propose an iterative fusion of reconstruction and generation [53] to enable mutual conditioning and enhancement, and an uncertainty-aware training for robust reconstruction. It is worth noting that Free360 is a general framework that is both complementary and orthogonal to video diffusion models. A future direction is incorporating an enhanced video diffusion model into our framework. As a limitation, we cannot handle scenes with extensive repetitive texture, where the stereo reconstruction models [22, 41] fail to produce faithful global reconstructions.

Acknowledgment: This work was partially supported by the NSFC (No. 62441222), Information Technology Center and State Key Lab of CAD&CG, Zhejiang University.

Page 9:

References

[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. IJCV, 2016. 2
[2] Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan, Zesong Yang, Hujun Bao, Guofeng Zhang, and Zhaopeng Cui.
Sine: Semantic-driven image-based nerf editing with prior-guided editing field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20919–20929, 2023. 2
[3] Chong Bao, Yinda Zhang, Yuan Li, Xiyu Zhang, Bangbang Yang, Hujun Bao, Marc Pollefeys, Guofeng Zhang, and Zhaopeng Cui. Geneavatar: Generic expression-aware volumetric head avatar editing from a single image. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8952–8963, 2024. 2
[4] Zhenyu Bao, Guibiao Liao, Kaichen Zhou, Kanglin Liu, Qing Li, and Guoping Qiu. Loopsparsegs: Loop based sparse-view friendly gaussian splatting. arXiv preprint arXiv:2408.00254, 2024. 1
[5] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. CVPR, 2022. 2, 3, 5, 6, 7, 8, 12, 13
[6] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance field with no pose prior. In CVPR, 2023. 2
[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. 3
[8] Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, and Ameesh Makadia. Lu-nerf: Scene and pose estimation by synchronizing local unposed nerfs. In ICCV, 2023. 2
[9] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan. Depth-supervised nerf: Fewer views and faster training for free. In CVPR, 2022. 2
[10] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds, 2024.
1, 2, 5, 6, 7, 8, 12, 13
[11] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Victoria Abrevaya, Michael J Black, and Xuaner Zhang. Explorative inbetweening of time and space. In European Conference on Computer Vision, pages 378–395. Springer, 2025. 1, 3
[12] Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A. Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting. In CVPR, 2024. 2
[13] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T Barron, and Ben Poole. Cat3d: Create anything in 3d with multi-view diffusion models. arXiv preprint arXiv:2405.10314, 2024. 3
[14] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d, 2024. 1
[15] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 3, 7, 8, 12, 13
[16] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf on a diet: Semantically consistent few-shot view synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5885–5894, 2021. 2
[17] Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang, Seonghoon Park, and Seungryong Kim. Relaxing accurate initialization constraint for 3d gaussian splatting. arXiv preprint arXiv:2403.09413, 2024. 4
[18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023. 1, 2, 5, 6, 12
[19] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 12
[20] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics, 36(4), 2017.
5, 7, 8, 12, 13, 14
[21] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30, 2017. 5
[22] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. arXiv preprint arXiv:2406.09756, 2024. 1, 2, 3, 8, 12
[23] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20775–20785, 2024. 1, 2
[24] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. Barf: Bundle-adjusting neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5741–5751, 2021. 2
[25] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single image to 3d mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 2024. 1
[26] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot one image to 3d object. In ICCV, 2023. 1, 3
[27] Xinhang Liu, Jiaben Chen, Shiu-hong Kao, Yu-Wing Tai, and Chi-Keung Tang. Deceptive-nerf/3dgs: Diffusion-generated pseudo-observations for high-quality sparse-view reconstruction. arXiv preprint arXiv:2305.15171, 2023. 3
[28] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM TOG, 2019. 2

Page 10:

[29] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng.
Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021. 2
[30] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall, Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Regnerf: Regularizing neural radiance fields for view synthesis from sparse inputs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5480–5490, 2022. 2
[31] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion, 2022. 3
[32] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020. 2
[33] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. ArXiv preprint, 2021. 2
[34] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia Gkioxari. Accelerating 3d deep learning with pytorch3d. arXiv:2007.08501, 2020. 4
[35] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P Srinivasan, and Matthias Nießner. Dense depth priors for neural radiance fields from sparse input views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12892–12901, 2022. 2
[36] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021. 3
[37] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann, Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry Lagun, Li Fei-Fei, Deqing Sun, et al. Zeronvs: Zero-shot 360-degree view synthesis from a single image. In CVPR, 2024. 1, 3, 5, 6, 7, 8, 12, 13
[38] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
2
[39] Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016. 6
[40] Niko Sünderhauf, Jad Abou-Chakra, and Dimity Miller. Density-aware nerf ensembles: Quantifying predictive uncertainty in neural radiance fields. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 9370–9376. IEEE, 2023. 5
[41] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024. 1, 2, 3, 4, 5, 6, 8, 12
[42] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024. 1, 3
[43] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, and Aleksander Holynski. Reconfusion: 3d reconstruction with diffusion priors. arXiv, 2023. 3
[44] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin, Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d reconstruction with diffusion priors. In CVPR, 2024. 3, 6
[45] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay, Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time 360° sparse view synthesis using gaussian splatting. Arxiv, 2023. 3
[46] Bangbang Yang, Chong Bao, Junyi Zeng, Hujun Bao, Yinda Zhang, Zhaopeng Cui, and Guofeng Zhang. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In European Conference on Computer Vision, pages 597–614. Springer, 2022. 2
[47] Chen Yang, Sikuang Li, Jiemin Fang, Ruofan Liang, Lingxi Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian.
Gaussianobject: High-quality 3d object reconstruction from four views with gaussian splatting. ACM Transactions on Graphics, 2024. 3
[48] Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Improving few-shot neural rendering with free frequency regularization. In CVPR, 2023. 2
[49] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024. 3, 12
[50] Chongjie Ye, Lingteng Qiu, Xiaodong Gu, Qi Zuo, Yushuang Wu, Zilong Dong, Liefeng Bo, Yuliang Xiu, and Xiaoguang Han. Stablenormal: Reducing diffusion variance for stable and sharp normal. ACM Transactions on Graphics, 2024. 4, 12
[51] Xu Yinghao, Shi Zifan, Yifan Wang, Chen Hansheng, Yang Ceyuan, Peng Sida, Shen Yujun, and Wetzstein Gordon. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation, 2024. 1
[52] Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. arXiv preprint arXiv:2405.15364, 2024. 1, 3
[53] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024. 1, 2, 3, 4, 5, 6, 7, 8, 12, 13
[54] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler, and Andreas Geiger. Monosdf: Exploring monocular geometric cues for neural implicit surface reconstruction. Advances in Neural Information Processing Systems (NeurIPS), 2022. 2
[55] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

Page 11:

[56] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv:2010.07492, 2020.
2
[57] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen Koltun. Nerf++: Analyzing and improving neural radiance fields. arXiv preprint arXiv:2010.07492, 2020. 3
[58] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. In ECCV, 2024. 1
[59] Zheng Zhang, Wenbo Hu, Yixing Lao, Tong He, and Hengshuang Zhao. Pixel-gs: Density control with pixel-aware gradient for 3d gaussian splatting. arXiv preprint arXiv:2403.15530, 2024. 12
[60] Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang. Fsgs: Real-time few-shot view synthesis using gaussian splatting. In ECCV, 2024. 1, 2, 5, 6, 7, 8, 13

Page 12:

Supplementary Material

In this supplementary material, we first present detailed implementation aspects in Section A. More experimental details are given in Section B. We show more comparisons in Section C. Additionally, we include a short video summarizing the method with video results, and an offline webpage for interactive visualization of our full results and comparisons.

A. Implementation Details

For the fine-grained front and back layer masks, we annotate the maximum depth of the front layer in the monocular depth [49] of each view. Pixels are assigned to the front layer if their depth is smaller than the annotated maximum depth. We build Free360 upon the 2DGS [15] framework. We follow the version implemented in StableNormal [50]. We use default settings in the dense stereo reconstruction models [22,41] and use the point cloud filtered by the predicted confidence map. We transform the world origin to the center of the scene, which is determined by the center depth of the first image. Besides, we rescale the cameras to fit within a sphere of radius 2.

In reconstruction bootstrap optimization, we downsample the point cloud of the front layer before initializing its Gaussian primitives.
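As a hedged illustration of two steps mentioned in this section, the sketch below thresholds a depth map at the annotated maximum depth to obtain a front-layer mask, and downsamples a point cloud. The text does not state how the downsampling is performed; voxel-grid averaging is an assumption made here, and the names `front_layer_mask`, `voxel_downsample`, and `voxel_size` are hypothetical.

```python
import numpy as np

def front_layer_mask(depth, max_front_depth):
    """Return a boolean mask that is True where a pixel belongs to the front layer.

    A pixel is in the front layer if its depth is smaller than the
    annotated maximum front-layer depth for that view.
    """
    return np.asarray(depth) < max_front_depth

def voxel_downsample(points, voxel_size):
    """Average all points that fall into the same voxel (assumed strategy)."""
    points = np.asarray(points, dtype=np.float64)
    # Integer voxel index of every point.
    keys = np.floor(points / voxel_size).astype(np.int64)
    # Group points by voxel; `inverse` maps each point to its voxel group.
    _, inverse, counts = np.unique(
        keys, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    sums = np.zeros((counts.size, points.shape[1]))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    return sums / counts[:, None]
```

Any other downsampling scheme (for example, random subsampling) would fit the description equally well; the voxel variant is shown only because it keeps the spatial coverage roughly uniform.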
We initialize the front layer’s Gaussian primitives using its point cloud and train for 10,000 iterations based on the loss defined in Eq. (1). We enable densification from the 166th iteration to the 5,000th iteration.

In the iterative fusion of reconstruction and generation, we define the unknown cameras in two ways. First, we interpolate the poses between the input sparse views using a cubic spline interpolator. Second, we define a target camera pose by jittering the position of an input camera pose while orienting its rotation to face the world origin, and interpolate the poses between the target pose and the closest input pose. We empirically define 300 to 400 unknown camera poses in total in these two ways. We use ViewCrafter [53] to generate 25 frames at a time at a resolution of 1024 × 576.

In uncertainty-aware training, we set the maximum L1 difference β between conditions and generations to 0.2. We train the Gaussian primitives of the front layer and back layer using Eq. (3) for 10,000 iterations without densification. All experiments are conducted on an NVIDIA RTX 6000 GPU.

B. Experimental Details

B.1. Dataset

We double the official downsample factor used by 3DGS [18] in Mip-NeRF 360 [5]. For Tanks and Temples [20], we use the processed data from PixelGS [59] and downsample the images by a factor of 2. To evaluate the metrics between rendered images and ground-truth images, we follow InstantSplat [10] to align the estimated poses from the stereo reconstruction models [22,41] to the ground-truth poses. Initially, a coarse alignment is obtained through rigid point registration between the estimated and ground-truth camera positions at the training viewpoints. Subsequently, for each rendered image, we fix the Gaussian primitives, and a test-time optimization is performed on the camera pose by minimizing the L1 difference between the rendered and ground-truth images.
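The coarse alignment step described above can be sketched as a closed-form fit between the two sets of camera centers. This is a minimal sketch, not the authors' implementation: it uses an Umeyama-style least-squares fit, the function name `align_camera_centers` is hypothetical, and the optional scale factor (`with_scale`) is an assumption, since the text only says "rigid point registration" and does not state whether scale is estimated.

```python
import numpy as np

def align_camera_centers(est, gt, with_scale=True):
    """Fit s, R, t such that gt_i is approximately s * R @ est_i + t.

    est, gt: (N, 3) arrays of estimated and ground-truth camera positions.
    Returns the scale s (1.0 if with_scale is False), a 3x3 rotation R,
    and a translation vector t.
    """
    est = np.asarray(est, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    # Cross-covariance between target (gt) and source (est) points.
    cov = G.T @ E / len(est)
    U, S, Vt = np.linalg.svd(cov)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(U) * np.linalg.det(Vt) >= 0 else -1.0
    D = np.array([1.0, 1.0, d])
    R = U @ np.diag(D) @ Vt
    if with_scale:
        var_e = (E ** 2).sum() / len(est)
        s = float((S * D).sum() / var_e)
    else:
        s = 1.0
    t = mu_g - s * R @ mu_e
    return s, R, t
```

With only 3 or 4 training views the fit can be poorly conditioned (for example, when the camera centers are nearly collinear), which is one reason a per-image refinement is applied on top of this coarse estimate.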
This optimization is executed for 500 iterations using the Adam optimizer [19], with a learning rate of 0.0003 for position and 0.0001 for rotation.

B.2. Baselines

All compared methods use the same camera poses and point clouds from dense stereo reconstruction [22,41]. Since ZeroNVS [37] and ViewCrafter [53] are agnostic to the reconstruction backbone, we adopt the same 2DGS backbone [15,50] for both methods to ensure a fair and rigorous comparison. In geometry evaluation, the same 2DGS [15,50] backbone is used as the baseline. For low-overlap sparse views of an unbounded scene, ViewCrafter [53] needs to iteratively generate novel views within a small region, run DUSt3R on the generated views to recover the scene’s point cloud, and then repeat the process to generate the next portion of the scene conditioned on the previously generated images and the DUSt3R [41] point cloud. However, DUSt3R’s feed-forward nature struggles with inconsistencies in generated images, leading to inaccurate depths that degrade subsequent generations. In contrast, our iterative fusion framework integrates an uncertainty-aware GS optimization after each iteration to refine the generative error promptly. The optimized 3D-consistent GS rendering is used to condition subsequent generations, yielding consistent multi-view generations that guide the next GS optimization. ViewCrafter [53] and ZeroNVS [37] use the same group of unknown cameras to generate novel views as our method.

Page 13:

Figure I. Comparison on Mip-NeRF 360° [5] Dataset on the 3-View Setting. We qualitatively compare rendering quality with FSGS* [60], InstantSplat [10], ZeroNVS* [37], ViewCrafter [53] given 3 input views. (Panels: columns FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours, Reference Ground Truth; rows Counter, Room, Bonsai.)

(Figure J panels: columns FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours, Reference Ground Truth; rows Barn, Caterpillar.)
Figure J. Comparison on Tanks and Temples [20] Dataset on the 4-View Setting.
We qualitatively compare rendering quality with FSGS* [60], InstantSplat [10], ZeroNVS* [37], ViewCrafter [53] given 4 views.

(Figure K panels: (a) 3-view Reconstruction on Mip-NeRF 360 and (b) 4-view Reconstruction on Tanks and Temples; columns Ground Truth, 2DGS*, Ours.)
Figure K. Comparison on 3D Surface Reconstruction. We qualitatively compare surface reconstruction with 2DGS [15] on (a) Counter (top) and Treehill (bottom) from the Mip-NeRF 360° [5] dataset and (b) Caterpillar (top) and Horse (bottom) from the Tanks and Temples dataset [20].

C. More Experiments

C.1. Novel View Synthesis

We show more rendering comparisons on Mip-NeRF 360 [5] and Tanks and Temples [20], as illustrated in Fig. I and Fig. J, respectively. FSGS [60] and InstantSplat [10] exhibit severe distortion and needle-like Gaussian artifacts in the rendering results. ZeroNVS [37] fails to synthesize clear novel views due to the limited consistency of its generative prior. ViewCrafter [53] cannot present a detailed and consistent rendering of the scene. In contrast, our method produces crisp renderings and a complete scene structure.

C.2. Geometry Evaluation

We show more geometry evaluation on Mip-NeRF 360 [5] and Tanks and Temples [20], as illustrated in Fig. K. Given the sparse views, the geometry from 2DGS has many missing areas and distorted surfaces, such as the holes in Treehill, the missing legs in Horse, and the floater at the top of Caterpillar.

Page 14:

Settings       PSNR ↑   SSIM ↑   LPIPS ↓
w/o L_d & L_n  16.723   0.310    0.505
w/o L_d        16.799   0.314    0.499
w/o L_n        16.825   0.313    0.498
Ours           16.95    0.321    0.480

Table E. We perform ablation studies on the geometric priors in our method.

In contrast, our method produces not only a complete and smooth geometry of the scene but also detailed structures. We show the F1-score precision and recall curves of 2DGS and our method in Fig. L.
Since extremely sparse-view surface reconstruction is ambiguous and error-prone, few points lie within the official error threshold in Tanks and Temples [20]. To facilitate a clearer comparison, we increase the error threshold by a factor of 10 (represented by the black dotted line in Fig. L).

C.3. Ablation on geometry prior

As shown in Tab. E, we ablate the geometry priors L_d and L_n in Eqs. (1) and (2) on Bicycle and Garden (3 views). The geometry prior improves the rendering quality but is not the key to sparse-view reconstruction.

Page 15:

(Figure L panels: precision and recall plots; x-axis: Meters; y-axis: # of points (%); curves: precision, recall. F-scores in panel titles, 2DGS* / Ours: Barn 18.17 / 36.10; Caterpillar 13.24 / 11.51; Ignatius 31.42 / 77.84; Truck 50.71 / 43.57.)
Figure L. The Precision and Recall Curves of F1-score Comparisons. We show detailed precision and recall curves of F1-score comparisons between 2DGS and our method on Barn, Caterpillar, Ignatius, and Truck, given 4 views. The black dotted line in each subfigure denotes the error threshold.
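To make the role of the error threshold concrete, the following sketch computes precision, recall, and F-score between two point sets at a distance threshold `tau`. It is an illustration under simplifying assumptions, not the official Tanks and Temples evaluation: alignment, cropping, and mesh sampling are omitted, nearest neighbors are found by brute force, and the names `nearest_distances` and `f_score` are hypothetical.

```python
import numpy as np

def nearest_distances(src, dst, chunk=2048):
    """Distance from each point in src to its nearest neighbor in dst."""
    src = np.asarray(src, dtype=np.float64)
    dst = np.asarray(dst, dtype=np.float64)
    out = np.empty(len(src))
    # Process src in chunks to bound the size of the pairwise distance matrix.
    for i in range(0, len(src), chunk):
        diff = src[i:i + chunk, None, :] - dst[None, :, :]
        out[i:i + chunk] = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return out

def f_score(pred, gt, tau):
    """Return (precision, recall, F-score) at distance threshold tau.

    precision: fraction of reconstructed points within tau of the ground truth.
    recall:    fraction of ground-truth points within tau of the reconstruction.
    """
    precision = float((nearest_distances(pred, gt) < tau).mean())
    recall = float((nearest_distances(gt, pred) < tau).mean())
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)
```

Enlarging the threshold by a factor of 10, as done for Tab. 3, corresponds to calling `f_score(pred, gt, 10 * tau)` with the same point sets; more points then count as correct, so the scores of different methods become easier to separate when almost nothing falls within the original threshold.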

---