Authors: Chong Bao, Xiyu Zhang, Zehao Yu, Jiale Shi, Guofeng Zhang, Songyou Peng, Zhaopeng Cui
Free360: Layered Gaussian Splatting for Unbounded 360-Degree View Synthesis
from Extremely Sparse and Unposed Views
Chong Bao1§ Xiyu Zhang1 Zehao Yu3 Jiale Shi1 Guofeng Zhang1
Songyou Peng2 Zhaopeng Cui1†
1State Key Lab of CAD&CG, Zhejiang University 2ETH Zürich
3University of Tübingen, Tübingen AI Center
[Figure 1 panels: (a) Overview of Our Framework: Input: Extremely (3-4) Sparse Views; Iteratively Layered 360 Reconstruction & Generation (Layered Reconstruction, Video Generation). (b) Novel View Synthesis Comparisons and Our Surface Reconstruction: Ground Truth, InstantSplat, FSGS*, ViewCrafter, Ours Rendering, Our Surface Reconstruction.]
Figure 1. Free360. (a) We propose a novel Gaussian-based framework, which can reconstruct unbounded 360° scenes from extremely (3-4)
sparse views through an iterative fusion of layered reconstruction and generation. (b) Our method outperforms other state-of-the-art methods
in rendering quality and supports complete surface reconstruction.
Abstract
Neural rendering has demonstrated remarkable success in
high-quality 3D neural reconstruction and novel view
synthesis with dense input views and accurate poses. However,
applying it to extremely sparse, unposed views in unbounded
360° scenes remains a challenging problem. In this paper, we
propose a novel neural rendering framework to accomplish
unposed and extremely sparse-view 3D reconstruction
in unbounded 360° scenes. To resolve the spatial ambiguity
inherent in unbounded scenes with sparse input views,
we propose a layered Gaussian-based representation to
effectively model the scene with distinct spatial layers. By
employing a dense stereo reconstruction model to recover
coarse geometry, we introduce a layer-specific bootstrap
optimization to refine the noise and fill occluded regions
in the reconstruction. Furthermore, we propose an iterative
fusion of reconstruction and generation alongside an
uncertainty-aware training approach to facilitate mutual
conditioning and enhancement between these two processes.
Comprehensive experiments show that our approach outperforms
existing state-of-the-art methods in terms of rendering
quality and surface reconstruction accuracy. Project page:
https://zju3dv.github.io/free360/.
†Corresponding author.
§The work was partially done when visiting ETHZ.
1. Introduction
3D Gaussian Splatting (3DGS) [18] has shown great success
in achieving high efficiency and quality in 3D neural
reconstruction and novel view synthesis with dense input
views and accurate poses. With the development of
large reconstruction models [14, 51, 58] and generative
models [25, 26, 37, 53], it has even become feasible to recover 3D
objects or bounded scenes from sparse views.
Most sparse-view 3DGS approaches [4, 23, 60] rely on
precise camera poses and relatively sparse sets of views
(typically more than 10). To handle pose-free scenarios, some
methods [10] learn a 3DGS from the point cloud
and camera poses obtained from dense stereo reconstruction
models [22, 41]; however, these stereo models often yield
noisy geometry in unbounded scenes, leading to more
artifacts and corrupted structures at novel viewpoints. Other
methods [11, 42, 52, 53] attempt to fine-tune video diffusion
models for direct novel view generation, but such methods
are limited to narrow-baseline or highly overlapping scenarios
due to the inherent absence of 3D information in the video
generation model. In unbounded 360° scenes with
low-overlap sparse views, the generated views exhibit severe
multi-view inconsistencies, further contaminating the
reconstruction. Therefore, performing neural reconstruction and
unbounded 360° view synthesis from extremely sparse and
unposed views remains an underexplored and challenging
problem.
arXiv:2503.24382v1 [cs.CV] 31 Mar 2025
In this paper, we present a novel neural rendering framework
to realize unbounded 360° view synthesis and 3D
reconstruction from extremely sparse views (e.g., 3-4 views)
by analyzing and addressing two key challenges in unposed
and unbounded scenarios. First, a dense stereo
reconstruction model [22, 41] is employed to recover the coarse
geometry and camera poses. However, the large depth span
of the unbounded scene and insufficient multi-view
correspondences from extremely sparse 360° views lead to two
critical issues: 1) unreliable depth estimation, which hinders
the clear differentiation between near-camera and distant
structures, and 2) different visibility characteristics, which
pose challenges for unified optimization, as near-camera
content appears in multiple views, whereas distant
structures are often partially occluded and visible from only a
single viewpoint. To tackle these issues, we propose a layered
Gaussian-based representation enabling layer-specific
bootstrap optimizations. Our approach explicitly constructs the
scene's layered structure and uses a photometric-guided
optimization to mitigate noise and correct detail distortions for
near-camera reconstruction, alongside a prior-guided
inpainting to complete missing regions in distant reconstruction.
Second, given the extremely sparse views, dense
observations are missing for 360° reconstruction, requiring the
use of generative models to generate additional observations.
The latest video diffusion models [53] have
demonstrated the capability of generating novel views when given
good conditions, e.g., accurate per-frame point cloud
renderings, which, however, are hard to obtain from extremely
sparse views. To tackle this challenge, we propose an
iterative fusion strategy that seamlessly integrates reconstruction
and generation. The generation provides supplementary
observations for reconstruction to alleviate the sparse-view
ambiguity. In turn, the reconstruction resolves inconsistency
within generated views and produces consistent renderings
to condition the generation process. Moreover, to prevent
error propagation during the fusion process and ensure a robust
reconstruction upon generation, we develop an
uncertainty-aware training that filters out inconsistent generated content.
In summary, the contributions of our paper are as
follows. (1) We propose a novel neural rendering framework
to accomplish unposed and extremely sparse-view 3D
reconstruction in unbounded 360° scenes. (2) We propose a
layered Gaussian-based representation to address the spatial
ambiguity of unbounded scenes, along with a layer-specific
bootstrap optimization upon the layered representation to
refine the noisy reconstruction. (3) We design an iterative
fusion strategy of reconstruction and novel view generation,
facilitating mutual conditioning and enhancement between the
two processes. Besides, we incorporate uncertainty-aware
learning to mitigate error propagation and ensure robust
reconstruction. (4) We conduct extensive experiments on
various large-scale unbounded 360° scenes using only 3-9
views. The results demonstrate that our method reaches the
best rendering quality and surface reconstruction accuracy.
2. Related Works
Sparse-view Novel View Synthesis. Novel view synthesis
(NVS) aims to generate new images from viewpoints out-
side the input views. Since the introduction of NeRF [ 29],
the NVS field has evolved rapidly, with 3D Gaussian Splat-
ting [ 18,55] emerging as the new standard. While both
NeRF and 3DGS produce impressive photometric NVS re-
sults from dense input views, their performance degrades
significantly with sparse input views due to limited opti-
mization constraints [ 5,56]. Subsequent work has aimed to
enhance sparse-view NVS by incorporating semantic regu-
larization [ 2,3,16], smoothness priors [ 30,48], or geometric
priors [ 9,23,35,46,54,60]. For instance, FSGS [ 60] and
DNGaussian [ 23] improve sparse-view NVS by regularizing
depth maps rendered from 3D Gaussians with monocular
depth priors [ 32,33]. However, these methods either focus
on forward-facing scenes [ 1,28] or rely on relatively dense
input views (e.g., 24 views in the Mip-NeRF 360 dataset [ 5]).
In contrast, we solve an extremely challenging scenario
involving only 3 or 4 views for 360° inward-facing scenes
without assuming ground-truth poses.
Pose-free Novel View Synthesis. Optimizing a NeRF or
3DGS model typically requires known camera poses before-
hand. In practice, these poses are often recovered using
established structure-from-motion (SfM) methods, such as
COLMAP [ 38]. However, traditional SfM techniques face
significant challenges in accurately estimating poses when
provided with sparse input views, making it impractical to
assume known poses in such scenarios. While pose-free
methods [ 6,8,12,24] have emerged to tackle this issue by
jointly optimizing camera poses and scene representation,
these approaches often struggle under sparse input condi-
tions. Recently, DUSt3R [41], an unstructured dense stereo
reconstruction model, has demonstrated robust performance
in estimating dense 3D points and camera poses from sparse
views. Building on DUSt3R’s output, InstantSplat [ 10] per-
forms joint optimization on poses and Gaussian attributes,
achieving improved NVS renderings. Following this line
of research, we leverage DUSt3R to estimate camera poses
and dense 3D points from sparse input views. However, we
observe needle-like artifacts when using the noisy points
from DUSt3R. To address this, we build a layered Gaussian
Splatting with layer-specific bootstrap optimization.
[Figure 2 panels: (a) Sparse Input Images; (b) Bootstrap Optimization; (c) Multi-layer Reconstruction; (d) Multi-layer Gaussian Splatting; (e) Iterative Fusion of Reconstruction and Generation.]
Figure 2. Pipeline. (a-d) Given unposed extremely sparse views, we employ the dense stereo reconstruction model [22, 41] to recover
camera poses and initial point cloud of the scene. A layered Gaussian-based representation is built upon the initial point cloud to enable
layer-specific bootstrap optimization. (e) We design the iterative fusion of reconstruction and generation with diffusion model [53]. Unknown
views are iteratively generated under conditions of consistent GS rendering of known views. In turn, generated views are used to enhance the
GS training.
Generative Sparse Novel View Synthesis. Generative mod-
els have achieved remarkable progress in recent years, de-
livering impressive results in image [ 36], video [ 7], and
3D generation [ 31]. This success has led to the explo-
ration of generative models for sparse novel view synthesis
(NVS) [ 13,26,27,37,44,45,47]. For instance, ReconFu-
sion [ 43] trains a diffusion model to refine the noisy render-
ing of a NeRF. With the diffusion model, ReconFusion opti-
mizes a NeRF with a sampling loss from the denoising model
for novel views and a reconstruction loss between sparse in-
put views. More recent approaches [ 11,42,52,53] fine-tune
video diffusion models for novel view synthesis from sparse
inputs. Although these methods perform well with input
images with significant overlap, they often face identity shift
and multi-view inconsistency issues when provided with
low-overlapped sparse views. In our work, we build upon
video diffusion models by introducing a progressive fusion
step to enhance reconstruction by selecting reliable images
generated by the video diffusion model. Furthermore, we
compute uncertainty maps to capture inconsistencies in the
generated images, guiding the reconstruction process.
3. Method
As shown in Fig. 2, given unposed, extremely sparse views
in an unbounded 360° scene, we propose a novel neural
rendering framework to faithfully reconstruct the structure of
the scene and render high-quality novel views. By utilizing
the dense stereo reconstruction model [22, 41] to obtain the
camera poses and an initial point cloud from sparse views,
we partition the scene into a layered structure and build a
layered Gaussian-based representation upon it (see Sec. 3.1).
Then, we introduce a layer-specific bootstrap optimization
technique to reduce reconstruction errors (see Sec. 3.2). Due to
the inadequate constraints provided by sparse views, we
incorporate the video diffusion model [53] as a prior and propose
an iterative fusion approach that combines reconstruction
and generation, enabling mutual conditioning and
enhancement between the two processes, i.e., generation provides
novel observations for reconstruction while reconstruction
resolves inconsistencies within the generated outputs (see
Sec. 3.3). Furthermore, we propose an uncertainty-aware
training technique to achieve robust reconstruction that is
aware of inconsistent generated content (see Sec. 3.4).
3.1. Layered Gaussian Splatting
In the spirit of partitioning the near-camera and far-away
scene content into different layers for layer-specific optimiza-
tion, existing unbounded NeRF methods [ 5,57] intuitively
treat the scene inside the smallest sphere encompassing all
camera positions as foreground, with all other areas as back-
ground. Following this spirit, we use a more compact bound-
ing volume to partition an unbounded scene into a front
layer and a back layer. Specifically, the front layer is the
scene inside the smallest bounding sphere that encapsulates
the intersection area of all camera frustums, while the back
layer constitutes the residual scene. Alternatively, a more
precise partition is to annotate the front and back layers on
the monocular depth [49] of each sparse-input view.
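The sphere-based partition described above can be illustrated with a short sketch. It is not the authors' implementation: it assumes the bounding sphere has already been computed from the camera frustums and is passed in as the hypothetical arguments `center` and `radius`, and the function name `partition_by_sphere` is chosen here for illustration only.

```python
import math

def partition_by_sphere(points, center, radius):
    """Split 3D points into a front layer (inside the sphere) and a back layer (outside)."""
    front, back = [], []
    for p in points:
        # Euclidean distance from the point to the (assumed, precomputed) sphere center.
        if math.dist(p, center) <= radius:
            front.append(p)
        else:
            back.append(p)
    return front, back

points = [(0.0, 0.0, 0.5), (1.0, 1.0, 1.0), (8.0, 0.0, 0.0), (0.0, -20.0, 3.0)]
front, back = partition_by_sphere(points, center=(0.0, 0.0, 0.0), radius=2.0)
print(len(front), "front-layer points,", len(back), "back-layer points")
```

The alternative partition mentioned in the text, annotating layers on monocular depth maps, would replace the distance test with a per-pixel mask lookup and is not shown here.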
Given unposed sparse images, we use the dense stereo
reconstruction model [ 22,41] to recover the scene’s cam-
era poses and initial point cloud. Then, the point cloud is
partitioned into the front-layer and back-layer point clouds
as described above. We build our layered representation
upon 2D Gaussian Splatting [15]. Each layer $i$
is independently modeled as a group of Gaussian primitives
$G_i = \{G_{i,k} \mid k = 1, \dots, K\}$, which is initialized using
the partitioned point clouds. The Gaussian primitives are
parameterized by the center position $\mu_{i,k}$, a scaling vector $S_{i,k}$, and
a rotation matrix $R_{i,k}$. The color $c_{i,k}$ of each Gaussian primitive
is characterized by view-dependent spherical harmonics
(SH). To render the entire scene, we merge the Gaussian
primitives of all layers and execute Gaussian rasterization
once for efficiency and anti-aliasing. The pixel color
is composited by point-based α-blending of the rasterized
Gaussians that overlap the pixel.
[Figure 3 panels: Input sparse images; Stereo Reconstruction Results; Bootstrap-optimized Reconstruction Results (Projected from rendered depth).]
Figure 3. Visualization of Point Clouds. We compare results from
the stereo reconstruction model and our bootstrap optimization.
[Figure 4 panels: Input Images; Generated Novel View; Uncertainty Map.]
Figure 4. Uncertainty Map. We show the uncertainty map
estimated from generated novel views. The flower has severe
multi-view inconsistency with high uncertainty.
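The point-based α-blending mentioned at the end of Sec. 3.1 can be sketched for a single pixel as follows. This is the standard front-to-back compositing rule, written here as an illustration rather than as the rasterizer used in the paper. It assumes the splats covering the pixel are already sorted from near to far and that `alphas` holds their per-pixel opacities after evaluating each Gaussian. Both inputs are hypothetical.

```python
def composite_pixel(colors, alphas):
    """Front-to-back alpha blending of depth-sorted splats covering one pixel."""
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by closer splats
    for color, alpha in zip(colors, alphas):
        weight = alpha * transmittance
        for channel in range(3):
            pixel[channel] += weight * color[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# A half-transparent red splat in front of a half-transparent green splat.
pixel, remaining = composite_pixel([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.5, 0.5])
print(pixel, remaining)
```

Because all layers are merged before rasterization, front-layer and back-layer splats would enter the same sorted list in such a scheme.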
3.2. Bootstrap Optimization
The point cloud produced by reconstructing an entire scene
using a stereo model is error-prone due to the insufficient
visual constraints from sparse views and the model's
inability to perceive fine structures and occlusion. With
the layered GS, we apply layer-specific bootstrap
optimization to correct these errors.
For the front layer, noisy points and missing geometry
of fine structures frequently occur, such as the noisy
floating points of Bonsai and the missing leg of Horse in Fig. 3.
This can further diminish the quality of Gaussian rendering
and the consistency of generation. Inspired by RAIN-GS [17],
which demonstrates that photometric-based GS optimization
can recover clean geometry with fine details when
initialized with as few as 10 points, we perform an initial GS
optimization on the front layer using a downsampled point
cloud. The downsampling uses a voxel-based uniform strategy,
which avoids initializing a considerable number of erroneous
points as Gaussians.
The voxel size ranges from 1% to 3% of the bounding box
of the point cloud. Given the sparsity of input views, we
further integrate monocular normal constraints [50] to
enhance geometric accuracy. During optimization, we enable
densification to prune noisy Gaussians while growing new
Gaussians to capture missing fine structures, guided by the
images and monocular normals. Specifically, we use a
photometric loss $\mathcal{L}_c$ between the rendered images and input
images to capture geometric details, a binary entropy
loss $\mathcal{L}_m$ between the rendered alpha map and the layer mask to
remove dislocated outliers, a cosine similarity loss $\mathcal{L}_n$
between the rendered normal map and the monocular normal to fix
bumpy surfaces, and an L1 loss $\mathcal{L}_d$ between the rendered and
stereo-reconstructed [41] depths to regularize the geometry:
$$\mathcal{L}_{bs} = \mathcal{L}_c + \lambda_m \mathcal{L}_m + \lambda_n \mathcal{L}_n + \lambda_d \mathcal{L}_d, \tag{1}$$
where $\lambda_m = 1.0$, $\lambda_n = 0.25$, $\lambda_d = 0.1$. As shown in Fig. 3,
our optimization yields a cleaner point cloud in
Bonsai and fills in the missing geometry of the leg in Horse.
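The voxel-based uniform downsampling used to initialize the front layer can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's code. The voxel size is taken as a ratio (1% to 3% in the text) of the longest bounding-box side, and each occupied voxel is replaced by the centroid of its points. Both choices are assumptions, since the text does not specify them, and the name `voxel_downsample` is hypothetical.

```python
import math

def voxel_downsample(points, voxel_ratio=0.02):
    """Keep one averaged point per occupied voxel of a uniform grid."""
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    # Assumption: the ratio refers to the longest side of the bounding box.
    voxel = voxel_ratio * max(hi - lo for lo, hi in zip(mins, maxs))
    if voxel <= 0.0:
        return [tuple(points[0])]  # degenerate cloud: all points coincide
    buckets = {}
    for p in points:
        key = tuple(math.floor((p[a] - mins[a]) / voxel) for a in range(3))
        buckets.setdefault(key, []).append(p)
    # Assumption: each voxel is represented by the centroid of its points.
    return [
        tuple(sum(q[a] for q in pts) / len(pts) for a in range(3))
        for pts in buckets.values()
    ]

cloud = [(0.0, 0.0, 0.0), (0.001, 0.0, 0.0), (1.0, 1.0, 1.0)]
print(voxel_downsample(cloud, voxel_ratio=0.02))
```

A larger ratio discards more of the noisy stereo points, leaving more geometry for densification to recover, and a smaller ratio does the opposite.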
For the back layer, we inpaint the depth of regions that
are not visible from the input viewpoints. To
achieve this, we sample several novel viewpoints and render
their visibility masks based on the back layer's point cloud.
Then, we obtain the bounding sphere of the back layer's point
cloud, and a ray-box intersection is performed to estimate the
depth of invisible pixels. Based on the inpainted depth, new
points are subsequently added to the back layer's point cloud.
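One way to picture the depth inpainting step is to intersect each invisible pixel's camera ray with a bounding volume of the back layer and use the far hit as its depth. The sketch below assumes a bounding sphere and a unit-length ray direction. The text names both a bounding sphere and a ray-box intersection, so the sphere test here is an illustrative substitute rather than the authors' exact procedure, and `ray_sphere_exit_depth` is a hypothetical name.

```python
import math

def ray_sphere_exit_depth(origin, direction, center, radius):
    """Distance along a unit-length ray to where it exits the sphere, or None if it misses."""
    oc = [o - c for o, c in zip(origin, center)]
    # Solve |oc + t * d|^2 = r^2, i.e. t^2 + 2*b*t + c = 0 for unit-length d.
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None  # the ray never touches the sphere
    t_far = -b + math.sqrt(disc)
    return t_far if t_far > 0.0 else None

# A camera at the sphere center looking along +z exits at distance 5.
print(ray_sphere_exit_depth((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0), 5.0))
```

The returned value is a distance along the ray. Converting it to a z-depth would additionally require the camera's viewing direction.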
3.3. Iterative Fusion of Reconstruction and Generation
Because extremely sparse views leave the scene underconstrained,
we leverage a video diffusion model [53] to generate
new observations for 360° reconstruction. Here we use
ViewCrafter [53] as the generative prior. ViewCrafter [53] takes
$P$ images of a sequential motion as conditions, which
consist of point cloud renderings and real images, and generates
$P$ novel views following the conditioned motion. A naïve
way to use the generative prior is to generate all required novel
views at once and train Gaussian primitives directly on them.
However, ViewCrafter [53] exhibits pronounced multi-view
inconsistency when generating novel views between
low-overlap sparse views, as it is trained on videos
characterized by minimal camera motion and substantial per-frame
overlap. These inconsistencies degrade the GS optimization
process, leading to blurry and distorted renderings.
To mitigate the inconsistency in the generation, we
propose an iterative fusion strategy of reconstruction and
generation, in which the diffusion model is iteratively conditioned
on rendered images and, in turn, generates consistent novel
views to enhance the reconstruction. Specifically, a known
image set $I_{\text{known}}$ and pose set $V_{\text{known}}$ are defined to denote
the cameras with known image supervision (real or
generated). We initialize the known image set and pose set
with the input sparse views, $I_{\text{known}} = I_{\text{input}}$, $V_{\text{known}} = V_{\text{input}}$.
Starting from a start pose sampled among the input poses, we
sequentially interpolate $P$ poses $V_{\text{gen}}$ that consist of known
cameras $V_{\text{known,gen}}$ (generated at the previous iteration) and
unknown cameras $V_{\text{novel}}$, $V_{\text{gen}} = V_{\text{known,gen}} \cup V_{\text{novel}}$. For
the known cameras, we use real images or GS-rendered
images as generative conditions. For unknown cameras, we use
renderings of the whole-scene point cloud produced with PyTorch3D [34]
as the generative conditions. Next, we utilize the video
diffusion model to generate images $I_{\text{gen}}$ based on these conditions,
$I_{\text{gen}} = I_{\text{known,gen}} \cup I_{\text{novel}}$. The video diffusion model [53]
generates $P$ sequential frames in a single forward pass, but the
frame quality is not constant. The inconsistency
accumulates with distance, i.e., the consistency of novel views
deteriorates as the camera deviates from the known
cameras. Therefore, we define a reliable frame selection strategy.
Only $P'$ frames $I'_{\text{novel}}$ and poses $V'_{\text{novel}}$ are selected among the
novel frames $I_{\text{novel}}$ and poses $V_{\text{novel}}$ according to the minimum
distance to the known poses:
$$I'_{\text{novel}}, V'_{\text{novel}} = \operatorname*{argmin}_{I'_{\text{novel}},\, V'_{\text{novel}}} \sum_{v^{(p)}_{\text{novel}} \in V_{\text{novel}}}^{P'} \min_{v \in V_{\text{known,gen}}} \left\| v^{(p)}_{\text{novel}} - v \right\|. \tag{2}$$
Then, we append the selected novel poses and frames to
the known sets, $I_{\text{known}} = I_{\text{known}} \cup I'_{\text{novel}}$, $V_{\text{known}} = V_{\text{known}} \cup V'_{\text{novel}}$.
At the end of an iteration, we train the layered GS
on these known poses and images to progressively learn the
consistent content from the video prior. This layered
GS optimization is introduced in Sec. 3.4. We empirically
define 300 to 400 unknown camera poses in total and repeat
this process until all cameras become known.
Please refer to the supp. material for the camera definition.

Method               3 views                  6 views                  9 views
                     PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓
FSGS* [60]           14.25  0.286  0.545      15.65  0.318  0.509      16.37  0.347  0.489
InstantSplat [10]    14.10  0.296  0.529      15.83  0.351  0.480      17.16  0.402  0.443
ZeroNVS* [37]        15.16  0.327  0.632      15.46  0.313  0.614      15.81  0.328  0.607
ViewCrafter [53]     15.63  0.358  0.515      16.32  0.366  0.555      16.68  0.382  0.551
Ours                 16.20  0.378  0.499      16.83  0.397  0.458      17.35  0.423  0.435
Table 1. Comparison on Mip-NeRF 360 [5] Dataset. We compare the rendering quality with baselines given 3, 6 and 9 views.

Method               4 views                  6 views                  9 views
                     PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓     PSNR↑  SSIM↑  LPIPS↓
FSGS* [60]           13.14  0.357  0.509      14.29  0.425  0.451      15.13  0.457  0.423
InstantSplat [10]    13.03  0.390  0.496      14.22  0.457  0.439      15.58  0.51   0.391
ZeroNVS* [37]        12.62  0.363  0.619      12.96  0.349  0.597      13.18  0.364  0.592
ViewCrafter [53]     13.56  0.412  0.504      14.12  0.424  0.483      14.70  0.443  0.467
Ours                 14.69  0.476  0.409      15.67  0.523  0.368      16.73  0.564  0.328
Table 2. Comparison on Tanks and Temples [20] Dataset. We compare the rendering quality with baselines given 4, 6 and 9 views.
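The reliable frame selection of Eq. (2) in Sec. 3.3 can be sketched as follows. Each novel pose contributes an independent term (its distance to the nearest known pose), so the sum over $P'$ selected poses is minimized by simply taking the $P'$ poses with the smallest such distances. The sketch assumes that a pose is represented by its camera center as a 3-vector and that the norm is Euclidean. The paper defers the camera definition to its supplementary material, so both are assumptions, and `select_reliable_frames` is a hypothetical name.

```python
import math

def select_reliable_frames(novel_poses, known_poses, num_select):
    """Return indices of the num_select novel poses closest to any known pose."""
    def distance_to_known(pose):
        # Inner min of Eq. (2): distance to the nearest known camera.
        return min(math.dist(pose, k) for k in known_poses)

    ranked = sorted(
        range(len(novel_poses)),
        key=lambda i: distance_to_known(novel_poses[i]),
    )
    # Keep the selected frames in their original (temporal) order.
    return sorted(ranked[:num_select])

known = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
novel = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(select_reliable_frames(novel, known, num_select=2))
```

The selected indices would then pick both the generated frames and their poses, which are appended to the known sets before the next fusion iteration.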
3.4. Uncertainty-aware Training
To avoid error drifting during mutual conditioning and
achieve a robust reconstruction, an uncertainty measure-
ment is required to distinguish the reliable parts of generated
novel views. In our task, uncertainty arises due to limited
constraints from sparse-view input, which exhibits excessive
degrees of freedom in the content of generated novel views,
as shown in Fig. 4, i.e., diffusion hallucinates incorrect ge-
ometry and appearance in high-freedom regions. Inspired by
NeRF Ensembles [ 40], we exploit epistemic uncertainty [ 21]
to model this hallucination. Specifically, we generate mul-
tiple images from the same viewpoint, each conditioned on
either unperturbed or perturbed point cloud renderings. As
depicted in Fig. 4, we measure the variance across these
generated images as epistemic uncertainty, which reflects
the model’s lack of knowledge in under-constrained regions.
The perturbation process is based on the L1 difference
between the generated views and their corresponding
unperturbed conditions. Perturbations are then selectively applied
to areas of conditions where this difference exceeds a
predefined threshold β.
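The two ingredients of the uncertainty estimate, the per-pixel variance across several generations and the thresholded L1 mask that decides where to perturb the condition, can be sketched as below. Images are represented as nested lists of grayscale values purely for illustration. The function names and the single-channel simplification are assumptions, and how a perturbation is actually applied inside the mask is not specified in the text and is not shown.

```python
def pixelwise_variance(images):
    """Per-pixel variance across several same-sized grayscale images (lists of rows)."""
    n = len(images)
    height, width = len(images[0]), len(images[0][0])
    variance = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [img[y][x] for img in images]
            mean = sum(values) / n
            variance[y][x] = sum((v - mean) ** 2 for v in values) / n
    return variance

def perturbation_mask(generated, condition, beta):
    """True where the L1 difference between a generated view and its condition exceeds beta."""
    return [
        [abs(g - c) > beta for g, c in zip(g_row, c_row)]
        for g_row, c_row in zip(generated, condition)
    ]

# Two generations agree on the first pixel and disagree on the second.
print(pixelwise_variance([[[0.2, 0.0]], [[0.2, 1.0]]]))
print(perturbation_mask([[0.1, 0.9]], [[0.1, 0.2]], beta=0.3))
```

Pixels with high variance correspond to regions where the diffusion model hallucinates differently from run to run, such as the flower in Fig. 4.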
Moreover, the uncertainty map, denoted $u$, is combined with
several loss terms to supervise the training. We employ the
photometric loss $\mathcal{L}_c$ and perceptual loss $\mathcal{L}_{lpips}$ between rendered
images and generated images, and a cosine loss $\mathcal{L}_n$ between
rendered normal and monocular normal at input viewpoints.
Additionally, we use a binary entropy loss $\mathcal{L}_m$ between alpha
maps of the front GS and front-layer masks to retain the
shape. We also exploit some regularization terms $\mathcal{L}_{reg}$, such
as distortion loss and normal consistency:
$$\mathcal{L} = u * [\lambda_c(\mathcal{L}_c + \mathcal{L}_{lpips}) + \lambda_m \mathcal{L}_m + \lambda_n \mathcal{L}_n + \lambda_{reg} \mathcal{L}_{reg}], \tag{3}$$
where $\lambda_m = 1.0$ and $\lambda_n = 0.25$; $\lambda_{reg}$ is 100 for the distortion
loss and 0.25 for the normal consistency loss. As the
inconsistency accumulates as the camera deviates from the
input views, as mentioned in Sec. 3.3, we relate $\lambda_c$ to
the distance $x$ to the input sparse views, $\lambda_c = f(x) =
\max(0.5 \cdot e^{-20x}, 0.05)$. During training, the densification
process is deactivated to prevent densifying Gaussians
based on inaccurate positional gradients arising from
inconsistencies of generated views. Please refer to the supp.
material for training details.
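The distance-dependent weight $\lambda_c = f(x)$ from Sec. 3.4 is simple enough to evaluate directly. The sketch below only transcribes the formula. The function name `photometric_weight` is hypothetical, and the unit of the distance `x` depends on the scene normalization, which the text does not state.

```python
import math

def photometric_weight(distance):
    """lambda_c = max(0.5 * exp(-20 * x), 0.05) for distance x to the input views."""
    return max(0.5 * math.exp(-20.0 * distance), 0.05)

for x in (0.0, 0.05, 0.1, 0.5):
    print(f"x = {x:.2f} -> lambda_c = {photometric_weight(x):.4f}")
```

The exponential term drops to the floor of 0.05 at $x = \ln(10)/20 \approx 0.115$, so views beyond that distance all receive the same small photometric weight.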
4. Experiments
4.1. Dataset & Baselines
Dataset. We evaluate our method on two unbounded 360°
datasets: Mip-NeRF 360 [5] and Tanks and Temples [20].
For Mip-NeRF 360 [5], we choose 8 out of 9 sequences
containing indoor and outdoor scenes, excluding the Flower
scene as DUSt3R [41] fails using sparse views. We follow
3DGS's [18] train-test split and then select 3, 6, 9 views
from its training split as training images, and evaluate novel
views on its test split. For Tanks and Temples [20], we use 6
unbounded outdoor scenes, each captured in 360°. Similarly,
we uniformly split the train and test set by sampling every 8th
image as the test set, and the remaining images are training
images. As 3 views are insufficient to capture the 360-degree
scenes, we select 4, 6, and 9 views from the training images
as sparse input and evaluate novel views on the test split. We
align estimated poses from DUSt3R [41] with COLMAP [39]
poses recovered using all images for evaluation. Please refer
to the supp. material for more explanations.
[Figure 5 panels: columns FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours, Reference Ground Truth; rows Bicycle, Stump, Garden, Treehill, Kitchen.]
Figure 5. Comparison on Mip-NeRF 360 [5] Dataset on the 3-View Setting. We qualitatively compare rendering quality with FSGS* [60],
InstantSplat [10], ZeroNVS* [37], ViewCrafter [53] given 3 input views.
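The uniform train/test split used for Tanks and Temples (every 8th image held out for testing) can be sketched as follows. The starting offset of 0 and the file names are assumptions made for illustration. The selection of the 4, 6, or 9 sparse input views from the training images is not specified in this section and is therefore not shown.

```python
def split_every_nth(image_names, step=8):
    """Hold out every step-th image (starting at index 0) as the test set."""
    test = image_names[::step]
    train = [name for i, name in enumerate(image_names) if i % step != 0]
    return train, test

names = [f"{i:03d}.jpg" for i in range(20)]  # hypothetical file names
train, test = split_every_nth(names, step=8)
print(test)
print(len(train), "training images")
```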
Baselines. We compare with the following state-of-the-
art methods[ 10,37,53,60]. FSGS [ 60] is a 3DGS-based
method that regularizes the Gaussian optimization with
monocular depth prior. ZeroNVS [ 37] trains a diffusion
model for the single-view reconstruction in unbounded
scenes. We follow ReconFusion [44] to adapt it to multi-view
input. For fair comparisons, we use the same input
point cloud and camera poses from DUSt3R for FSGS* and
ZeroNVS*. InstantSplat [ 10] initializes Gaussian primitives
from DUSt3R’s point cloud and optimizes 3D Gaussians and
camera poses jointly. ViewCrafter [ 53] fine-tunes a video
diffusion model conditioned on point cloud rendering from
DUSt3R and trains a final 3DGS [ 18] based on generated
views and sparse input. Please refer to the supp. material for
more details on baseline training.
4.2. Novel View Synthesis
We now evaluate our method on the Mip-NeRF 360
dataset [ 5]. As shown in Tab. 1, our method outperforms
all baselines across all metrics. As shown in Fig. 5, this
dataset is particularly challenging as it contains many fine-
grained structures, like the delicate flowers in the Garden.
FSGS [60] and InstantSplat [10] show strong artifacts, such
as "foggy" geometry and needle-like distorted Gaussians
in the background. Their reconstruction quality is limited
by the noisy point cloud from DUSt3R, especially for fine
structures. In contrast, our Bootstrap Optimization corrects
these errors, restoring details like the flowers in the Garden
scene and the leaves in the Stump scene. ZeroNVS [37] fails
to generate consistent novel views for 3DGS training, and
ViewCrafter [53] has similar issues. Inconsistent novel views
from the video diffusion model also impair quality, resulting
in blurred renderings and view-dependent artifacts, as seen
in the flowers of the Garden scene and the split wheels of the
Lego in the Kitchen scene. Our Iterative Fusion effectively
mitigates these inconsistencies, improving reconstruction.
[Figure 6 panels: columns FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours, Reference Ground Truth; rows Garden, Family, Truck, Horse, Ignatius.]
Figure 6. Comparison on Tanks and Temples [20] Dataset on the 4-View Setting. We qualitatively compare rendering quality with
FSGS* [60], InstantSplat [10], ZeroNVS* [37], ViewCrafter [53] given 4 views.
[Figure 7 panels: (a) 3-view Reconstruction on Mip-NeRF 360; (b) 4-view Reconstruction on Tanks and Temples; columns Ground Truth, 2DGS*, Ours.]
Figure 7. Comparison on 3D Surface Reconstruction. We qualitatively compare surface reconstruction with 2DGS [15] on (a) Bicycle
(top) and Garden (bottom) from the Mip-NeRF 360 [5] dataset and (b) Barn (top) and Ignatius (bottom) from the Tanks and Temples dataset [20].
We further evaluate our method on the Tanks and Temples
dataset [20], which presents larger scenes and significant
depth variations, challenging reconstruction from sparse
views. As on Mip-NeRF 360, our method outperforms
all baselines, as shown in Tab. 2. We observe similar
artifacts for baseline methods in Fig. 6: FSGS [60] and
InstantSplat [10] produce needle-like Gaussians for the
foreground object due to noisy point cloud initialization. Further,
they fail to complete the background regions as they do
not employ generative priors. ZeroNVS [37] produces overly
blurred images, while ViewCrafter [53] produces distorted
geometry due to noise in the point cloud initialization and
inconsistency in the generative model. For instance, the leg of
the horse is missing in ViewCrafter's rendering, and the sky
sticks to the foreground objects. By contrast, our multi-layer
representation mitigates depth ambiguity, preserving
foreground integrity. Our bootstrap optimization improves
the scene's geometry, resulting in better NVS results.

Methods        4 views   6 views   9 views
2DGS [15]      0.284     0.367     0.418
Ours           0.423     0.430     0.439
Table 3. Comparison on 3D Surface Reconstruction. We compare
with 2DGS [15] on four scenes of Tanks and Temples [20].
The evaluation metric is F1-score and higher is better.

Settings                          PSNR↑   SSIM↑   LPIPS↓
Baseline                          16.16   0.289   0.532
+ Multi-layer (Sec. 3.1)          16.56   0.308   0.501
+ Bootstrap Opt. (Sec. 3.2)       16.65   0.314   0.491
+ Iterative Fusion (Sec. 3.3)     16.81   0.317   0.484
+ Uncertainty (Ours, Sec. 3.4)    16.95   0.321   0.480
Table 4. Ablation on Design Choices. We perform ablation studies
on the baseline with different designs of our framework.
4.3. Geometry Evaluation
We also evaluate surface reconstruction accuracy against
2DGS [ 15]. We use camera poses and point clouds from
DUSt3R [ 22,41] as input for 2DGS, denoted as 2DGS*.
For quantitative comparisons, we conduct the experiment
on four scenes with ground-truth mesh in Tanks and Tem-
ples [ 20], including Barn, Ignatius, Caterpillar, and Truck.
As sparse-view reconstruction is error-prone, we enlarge the
error threshold 10 times for all methods to compute the F1-
score. As is shown in Tab. 3, our method achieves superior
results to 2DGS [ 15]. We show qualitative comparisons on
Mip-NeRF 360 [ 5] and Tanks and Temples [ 20] in Fig. 7.
2DGS [15] cannot reconstruct the geometry of unseen
regions; as a consequence, there are large holes on the ground
and in the background, as shown in Fig. 7(a). Moreover, noisy
point clouds from dense stereo models [22, 41] lead to
error-prone densification, resulting in artifacts in the geometry,
particularly in the sky region, where 2DGS [15] grows
Gaussian points on the foreground to minimize appearance loss
(see the Barn scene in Fig. 7(b)). By contrast, through Bootstrap
Optimization and Iterative Fusion, we obtain a clean and
detailed point cloud for initialization, allowing a smoother
distribution of Gaussian points with less noise in the seen
regions and complete geometry in the unseen regions.
[Figure 8 panels: Input Images; Viewpoint (V0, V1); Generated View w/ Iterative Fusion; Generated View w/o Iterative Fusion.]
Figure 8. Ablation on Iterative Fusion. We compare the rendering
quality of different viewpoints when ablating the iterative fusion
(Sec. 3.3) on the Counter scene of the Mip-NeRF 360 [5] dataset.
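The F1-score reported in Tab. 3 combines precision (how much of the reconstruction lies near the ground truth) and recall (how much of the ground truth is covered) at a distance threshold, which Sec. 4.3 enlarges 10 times. The sketch below shows that computation on small point sets with brute-force nearest-neighbor search. It is an illustration only: the official Tanks and Temples evaluation additionally aligns and crops the reconstruction, and the name `f_score` is hypothetical.

```python
import math

def f_score(reconstructed, ground_truth, threshold):
    """Harmonic mean of precision and recall at a given distance threshold."""
    def fraction_within(source, target):
        # Share of source points whose nearest target point is within the threshold.
        hits = sum(
            1 for p in source
            if min(math.dist(p, q) for q in target) <= threshold
        )
        return hits / len(source)

    precision = fraction_within(reconstructed, ground_truth)
    recall = fraction_within(ground_truth, reconstructed)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

rec = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.05)]
print(f_score(rec, gt, threshold=0.1))
```

Enlarging the threshold, as done for all methods in the paper, counts more points as hits and therefore makes the metric more tolerant of the errors typical of sparse-view reconstruction.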
4.4. Ablation Study
We conduct ablations using Bicycle and Garden from
Mip-NeRF 360 [5] with 3 input views. Given the stereo-reconstructed
point cloud and poses, "Baseline" first generates images
under the same unknown poses as ours, then optimizes a final
2DGS [15] on the generated and input images. Adding multiple
layers to the GS optimization boosts the rendering quality.
Bootstrap Optimization provides a cleaner and more detailed
geometry for Gaussian Splatting to capture detailed
geometry and appearance. Iterative Fusion and Uncertainty-aware
Training generate more consistent novel views, leading to
better results in inconsistent regions, which further improves
the rendering quality. As shown in Tab. 4, each component
improves the quality of reconstruction. We
also compare generated novel views from the video diffusion
model on 3 views of Counter from Mip-NeRF 360 [5]. As
shown in Fig. 8, the results generated without Iterative
Fusion suffer from inconsistency across multiple viewpoints, while
with Iterative Fusion, the video diffusion model can generate
consistent and high-quality novel views.
5. Conclusion
We have presented a novel framework for unposed and extremely
sparse-view 3D reconstruction in unbounded 360° scenes.
We build a layered GS upon the output of a dense stereo
reconstruction model [22, 41] and use bootstrap optimization
to obtain a clean and detailed geometry from sparse views.
In addition, we propose an iterative fusion of reconstruction
and generation [53] to enable mutual conditioning and
enhancement, and an uncertainty-aware training scheme for
robust reconstruction. It is worth noting that Free360 is a
general framework that is both complementary and orthogonal
to video diffusion models. A future direction is to incorporate
an enhanced video diffusion model into our framework. As a
limitation, our method cannot handle scenes with extensive
repetitive texture, where the stereo reconstruction model
[22, 41] fails to produce a faithful global reconstruction.
Acknowledgment: This work was partially supported by
the NSFC (No. 62441222), Information Technology Center
and State Key Lab of CAD&CG, Zhejiang University.
References
[1] Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis,
Engin Tola, and Anders Bjorholm Dahl. Large-scale data for
multiple-view stereopsis. IJCV, 2016. 2
[2] Chong Bao, Yinda Zhang, Bangbang Yang, Tianxing Fan,
Zesong Yang, Hujun Bao, Guofeng Zhang, and Zhaopeng
Cui. Sine: Semantic-driven image-based nerf editing with
prior-guided editing field. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pages 20919–20929, 2023. 2
[3] Chong Bao, Yinda Zhang, Yuan Li, Xiyu Zhang, Bangbang
Yang, Hujun Bao, Marc Pollefeys, Guofeng Zhang,
and Zhaopeng Cui. Geneavatar: Generic expression-aware
volumetric head avatar editing from a single image. In
Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition, pages 8952–8963, 2024. 2
[4] Zhenyu Bao, Guibiao Liao, Kaichen Zhou, Kanglin Liu, Qing
Li, and Guoping Qiu. Loopsparsegs: Loop based sparse-view
friendly gaussian splatting. arXiv preprint arXiv:2408.00254,
2024. 1
[5] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P.
Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded
anti-aliased neural radiance fields. CVPR, 2022. 2, 3, 5, 6, 7,
8, 12, 13
[6] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and
Victor Adrian Prisacariu. Nope-nerf: Optimising neural radiance
field with no pose prior. In CVPR, 2023. 2
[7] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei
Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric
Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh.
Video generation models as world simulators. 2024. 3
[8] Zezhou Cheng, Carlos Esteves, Varun Jampani, Abhishek
Kar, Subhransu Maji, and Ameesh Makadia. Lu-nerf: Scene
and pose estimation by synchronizing local unposed nerfs. In
ICCV, 2023. 2
[9] Kangle Deng, Andrew Liu, Jun-Yan Zhu, and Deva Ramanan.
Depth-supervised nerf: Fewer views and faster training for
free. In CVPR, 2022. 2
[10] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian
Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco
Pavone, Georgios Pavlakos, Zhangyang Wang, and Yue Wang.
Instantsplat: Unbounded sparse-view pose-free gaussian splat-
ting in 40 seconds, 2024. 1, 2, 5, 6, 7, 8, 12, 13
[11] Haiwen Feng, Zheng Ding, Zhihao Xia, Simon Niklaus, Vic-
toria Abrevaya, Michael J Black, and Xuaner Zhang. Explo-
rative inbetweening of time and space. In European Confer-
ence on Computer Vision , pages 378–395. Springer, 2025. 1,
3
[12] Yang Fu, Sifei Liu, Amey Kulkarni, Jan Kautz, Alexei A.
Efros, and Xiaolong Wang. Colmap-free 3d gaussian splatting.
In CVPR, 2024. 2
[13] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur
Brussee, Ricardo Martin-Brualla, Pratul Srinivasan,
Jonathan T Barron, and Ben Poole. Cat3d: Create anything
in 3d with multi-view diffusion models. arXiv preprint
arXiv:2405.10314, 2024. 3
[14] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou,
Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao
Tan. Lrm: Large reconstruction model for single image to 3d,
2024. 1
[15] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and
Shenghua Gao. 2d gaussian splatting for geometrically accu-
rate radiance fields. In ACM SIGGRAPH 2024 Conference
Papers , pages 1–11, 2024. 3, 7, 8, 12, 13
[16] Ajay Jain, Matthew Tancik, and Pieter Abbeel. Putting nerf
on a diet: Semantically consistent few-shot view synthesis.
In Proceedings of the IEEE/CVF International Conference
on Computer Vision, pages 5885–5894, 2021. 2
[17] Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang,
Seonghoon Park, and Seungryong Kim. Relaxing accurate ini-
tialization constraint for 3d gaussian splatting. arXiv preprint
arXiv:2403.09413 , 2024. 4
[18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and
George Drettakis. 3d gaussian splatting for real-time radiance
field rendering. ACM Trans. Graph., 42(4):139–1, 2023. 1, 2,
5, 6, 12
[19] Diederik P Kingma. Adam: A method for stochastic opti-
mization. arXiv preprint arXiv:1412.6980 , 2014. 12
[20] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen
Koltun. Tanks and temples: Benchmarking large-scale scene
reconstruction. ACM Transactions on Graphics , 36(4), 2017.
5, 7, 8, 12, 13, 14
[21] Balaji Lakshminarayanan, Alexander Pritzel, and Charles
Blundell. Simple and scalable predictive uncertainty estima-
tion using deep ensembles. Advances in neural information
processing systems , 30, 2017. 5
[22] Vincent Leroy, Yohann Cabon, and Jérôme Revaud.
Grounding image matching in 3d with mast3r. arXiv preprint
arXiv:2406.09756, 2024. 1, 2, 3, 8, 12
[23] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun
Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d
gaussian radiance fields with global-local depth normalization.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 20775–20785, 2024.
1, 2
[24] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon
Lucey. Barf: Bundle-adjusting neural radiance fields. In
Proceedings of the IEEE/CVF International Conference on
Computer Vision , pages 5741–5751, 2021. 2
[25] Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund
Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any single
image to 3d mesh in 45 seconds without per-shape optimiza-
tion. Advances in Neural Information Processing Systems , 36,
2024. 1
[26] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov,
Sergey Zakharov, and Carl Vondrick. Zero-1-to-3: Zero-shot
one image to 3d object. In ICCV, 2023. 1, 3
[27] Xinhang Liu, Jiaben Chen, Shiu-hong Kao, Yu-Wing Tai, and
Chi-Keung Tang. Deceptive-nerf/3dgs: Diffusion-generated
pseudo-observations for high-quality sparse-view reconstruc-
tion. arXiv preprint arXiv:2305.15171 , 2023. 3
[28] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz-Cayon,
Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and
Abhishek Kar. Local light field fusion: Practical view synthe-
sis with prescriptive sampling guidelines. ACM TOG , 2019.
2
[29] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik,
Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf:
Representing scenes as neural radiance fields for view
synthesis. Communications of the ACM, 2021. 2
[30] Michael Niemeyer, Jonathan T Barron, Ben Mildenhall,
Mehdi SM Sajjadi, Andreas Geiger, and Noha Radwan. Reg-
nerf: Regularizing neural radiance fields for view synthesis
from sparse inputs. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition , pages
5480–5490, 2022. 2
[31] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall.
Dreamfusion: Text-to-3d using 2d diffusion, 2022. 3
[32] René Ranftl, Katrin Lasinger, David Hafner, Konrad
Schindler, and Vladlen Koltun. Towards robust monocular
depth estimation: Mixing datasets for zero-shot cross-dataset
transfer. IEEE Transactions on Pattern Analysis and Machine
Intelligence (TPAMI), 2020. 2
[33] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun.
Vision transformers for dense prediction. ArXiv preprint, 2021.
2
[34] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Tay-
lor Gordon, Wan-Yen Lo, Justin Johnson, and Georgia
Gkioxari. Accelerating 3d deep learning with pytorch3d.
arXiv:2007.08501 , 2020. 4
[35] Barbara Roessle, Jonathan T Barron, Ben Mildenhall, Pratul P
Srinivasan, and Matthias Nießner. Dense depth priors for
neural radiance fields from sparse input views. In Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition , pages 12892–12901, 2022. 2
[36] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models, 2021. 3
[37] Kyle Sargent, Zizhang Li, Tanmay Shah, Charles Herrmann,
Hong-Xing Yu, Yunzhi Zhang, Eric Ryan Chan, Dmitry La-
gun, Li Fei-Fei, Deqing Sun, et al. Zeronvs: Zero-shot 360-
degree view synthesis from a single image. In CVPR , 2024.
1, 3, 5, 6, 7, 8, 12, 13
[38] Johannes Lutz Schönberger and Jan-Michael Frahm.
Structure-from-motion revisited. In Conference on Computer
Vision and Pattern Recognition (CVPR), 2016. 2
[39] Johannes L Schonberger and Jan-Michael Frahm. Structure-
from-motion revisited. In Proceedings of the IEEE confer-
ence on computer vision and pattern recognition , pages 4104–
4113, 2016. 6
[40] Niko Sünderhauf, Jad Abou-Chakra, and Dimity Miller.
Density-aware nerf ensembles: Quantifying predictive
uncertainty in neural radiance fields. In 2023 IEEE International
Conference on Robotics and Automation (ICRA), pages
9370–9376. IEEE, 2023. 5
[41] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris
Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d
vision made easy. In CVPR , 2024. 1, 2, 3, 4, 5, 6, 8, 12
[42] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li,
Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan.
Motionctrl: A unified and flexible motion controller for video
generation. In ACM SIGGRAPH 2024 Conference Papers,
pages 1–11, 2024. 1, 3
[43] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park,
Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin,
Jonathan T. Barron, Ben Poole, and Aleksander Holynski.
Reconfusion: 3d reconstruction with diffusion priors. arXiv ,
2023. 3
[44] Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park,
Ruiqi Gao, Daniel Watson, Pratul P Srinivasan, Dor Verbin,
Jonathan T Barron, Ben Poole, et al. Reconfusion: 3d recon-
struction with diffusion priors. In CVPR , 2024. 3, 6
[45] Haolin Xiong, Sairisheek Muttukuru, Rishi Upadhyay,
Pradyumna Chari, and Achuta Kadambi. Sparsegs: Real-time
360° sparse view synthesis using gaussian splatting. arXiv,
2023. 3
[46] Bangbang Yang, Chong Bao, Junyi Zeng, Hujun Bao, Yinda
Zhang, Zhaopeng Cui, and Guofeng Zhang. Neumesh: Learn-
ing disentangled neural mesh-based implicit field for geome-
try and texture editing. In European Conference on Computer
Vision , pages 597–614. Springer, 2022. 2
[47] Chen Yang, Sikuang Li, Jiemin Fang, Ruofan Liang, Lingxi
Xie, Xiaopeng Zhang, Wei Shen, and Qi Tian. Gaussianobject:
High-quality 3d object reconstruction from four views with
gaussian splatting. ACM Transactions on Graphics , 2024. 3
[48] Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Im-
proving few-shot neural rendering with free frequency regu-
larization. In CVPR , 2023. 2
[49] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao,
Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything
v2. arXiv:2406.09414, 2024. 3, 12
[50] Chongjie Ye, Lingteng Qiu, Xiaodong Gu, Qi Zuo, Yushuang
Wu, Zilong Dong, Liefeng Bo, Yuliang Xiu, and Xiaoguang
Han. Stablenormal: Reducing diffusion variance for stable
and sharp normal. ACM Transactions on Graphics , 2024. 4,
12
[51] Xu Yinghao, Shi Zifan, Yifan Wang, Chen Hansheng, Yang
Ceyuan, Peng Sida, Shen Yujun, and Wetzstein Gordon. Grm:
Large gaussian reconstruction model for efficient 3d recon-
struction and generation, 2024. 1
[52] Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver:
Video diffusion model as zero-shot novel view synthesizer.
arXiv preprint arXiv:2405.15364 , 2024. 1, 3
[53] Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li,
Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan,
and Yonghong Tian. Viewcrafter: Taming video diffusion
models for high-fidelity novel view synthesis. arXiv preprint
arXiv:2409.02048 , 2024. 1, 2, 3, 4, 5, 6, 7, 8, 12, 13
[54] Zehao Yu, Songyou Peng, Michael Niemeyer, Torsten Sattler,
and Andreas Geiger. Monosdf: Exploring monocular geomet-
ric cues for neural implicit surface reconstruction. Advances
in Neural Information Processing Systems (NeurIPS) , 2022.
2
[55] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and
Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splat-
ting. Conference on Computer Vision and Pattern Recognition
(CVPR) , 2024. 2
[56] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen
Koltun. Nerf++: Analyzing and improving neural radiance
fields. arXiv:2010.07492 , 2020. 2
[57] Kai Zhang, Gernot Riegler, Noah Snavely, and Vladlen
Koltun. Nerf++: Analyzing and improving neural radiance
fields. arXiv preprint arXiv:2010.07492 , 2020. 3
[58] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao,
Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large recon-
struction model for 3d gaussian splatting. In ECCV , 2024.
1
[59] Zheng Zhang, Wenbo Hu, Yixing Lao, Tong He, and
Hengshuang Zhao. Pixel-gs: Density control with pixel-
aware gradient for 3d gaussian splatting. arXiv preprint
arXiv:2403.15530 , 2024. 12
[60] Zehao Zhu, Zhiwen Fan, Yifan Jiang, and Zhangyang Wang.
Fsgs: Real-time few-shot view synthesis using gaussian splat-
ting. In ECCV , 2024. 1, 2, 5, 6, 7, 8, 13
Supplementary Material
In this supplementary material, we first present detailed
implementation aspects in Section A. More experimental
details are given in Section B, and more comparisons are
shown in Section C. Additionally, we include a short video
summarizing the method with video results, and an offline
webpage for interactive visualization of our full results and
comparisons.
A. Implementation Details
For the fine-grained front and back layer masks, we annotate
the maximum depth of the front layer in the monocular
depth map [49] of each view. Pixels are assigned to the front
layer if their depth is smaller than the annotated maximum
depth. We build Free360 upon the 2DGS [15] framework,
following the version implemented in StableNormal [50].
We use the default settings of the dense stereo reconstruction
models [22, 41] and use the point cloud filtered by the predicted
confidence map. We transform the world origin to the center
of the scene, which is determined by the center depth of the
first image. In addition, we rescale the cameras to fit within a
sphere of radius 2.
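A minimal sketch of the two preprocessing steps above, given for illustration only and not taken from the released implementation. The function names, the array layout, and the choice to scale so that the farthest camera center lands on the sphere surface are assumptions; `scene_center` stands in for the point derived from the center depth of the first image.

```python
import numpy as np


def front_layer_mask(depth, max_front_depth):
    """Pixels whose monocular depth is below the annotated maximum go to the front layer."""
    return depth < max_front_depth


def recenter_and_rescale(cam_centers, points, scene_center, radius=2.0):
    """Move the world origin to scene_center and scale cameras into a sphere.

    Assumption: the scale maps the farthest camera center onto the sphere surface.
    """
    cams = cam_centers - scene_center
    pts = points - scene_center
    max_dist = np.linalg.norm(cams, axis=1).max()
    scale = radius / max_dist if max_dist > 0 else 1.0
    # The same scale is applied to the point cloud so geometry and cameras stay consistent.
    return cams * scale, pts * scale, scale
```

Only camera centers are handled here; with full camera poses, the same shift and scale would be applied to the translation part while rotations stay unchanged.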
In reconstruction bootstrap optimization, we downsample
the point cloud of the front layer before initializing its
Gaussian primitives. We initialize the front layer’s Gaussian
primitives using its point cloud and train for 10,000 iterations
using the loss defined in Eq. (1). We enable densification
from the 166th iteration to the 5,000th iteration.
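The schedule above can be written down as a small configuration sketch. The iteration counts come from the text; the voxel-grid strategy in `voxel_downsample` is an assumption (the text does not say how the point cloud is downsampled), and treating both ends of the densification window as inclusive is also an assumption.

```python
import numpy as np

DENSIFY_FROM_ITER = 166     # from the text
DENSIFY_UNTIL_ITER = 5_000  # from the text
TOTAL_ITERS = 10_000        # from the text


def voxel_downsample(points, voxel_size):
    """Keep one averaged point per occupied voxel (assumed downsampling strategy)."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    counts = np.bincount(inverse)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)  # accumulate the points falling in each voxel
    return sums / counts[:, None]


def densification_enabled(iteration):
    """True while the iteration lies inside the densification window."""
    return DENSIFY_FROM_ITER <= iteration <= DENSIFY_UNTIL_ITER
```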
In the iterative fusion of reconstruction and generation,
we define the unknown cameras in two ways. First, we
interpolate the poses between the input sparse views with a
cubic spline interpolator. Second, we define a target camera
pose by jittering the position of an input camera pose while
orienting its rotation to face the world origin, and interpolate
the poses between the target pose and the closest input pose. We
empirically define 300 to 400 unknown camera poses in total
from these two ways. We use ViewCrafter [53] to generate
25 frames at a time at a resolution of 1024 × 576.
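The two ways of defining unknown cameras can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: a Catmull-Rom spline is used as one concrete cubic spline, the rotation convention (camera-to-world, columns right/down/forward) and the world up axis are assumptions, and all function names are hypothetical. Orientation interpolation between poses is omitted for brevity.

```python
import numpy as np


def catmull_rom_positions(centers, samples_per_segment):
    """Sample a cubic (Catmull-Rom) spline through the camera centers."""
    padded = np.vstack([centers[:1], centers, centers[-1:]])  # clamp endpoints
    path = []
    for i in range(len(centers) - 1):
        p0, p1, p2, p3 = padded[i:i + 4]
        for t in np.linspace(0.0, 1.0, samples_per_segment, endpoint=False):
            path.append(0.5 * (2 * p1 + (p2 - p0) * t
                               + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                               + (3 * p1 - p0 - 3 * p2 + p3) * t ** 3))
    path.append(centers[-1])
    return np.array(path)


def look_at_origin(position, world_up=(0.0, 1.0, 0.0)):
    """Camera-to-world rotation (columns: right, down, forward) facing the origin."""
    forward = -position / np.linalg.norm(position)
    right = np.cross(forward, np.asarray(world_up))
    right /= np.linalg.norm(right)  # undefined if forward is parallel to world_up
    down = np.cross(forward, right)
    return np.stack([right, down, forward], axis=1)


def jittered_target(centers, index, sigma, rng):
    """Jitter one input camera center, face the origin, and find the closest input camera."""
    target = centers[index] + rng.normal(scale=sigma, size=3)
    closest = int(np.argmin(np.linalg.norm(centers - target, axis=1)))
    return target, look_at_origin(target), closest
```

For example, `catmull_rom_positions(centers, 25)` yields 25 positions per pair of neighboring input views plus the final view, which could then be grouped into 25-frame clips for the video model.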
In uncertainty-aware training, we set the maximum L1
difference β between conditions and generations to 0.2. We
train the Gaussian primitives of the front layer and back layer
using Eq. (3) for 10,000 iterations without densification. All
experiments are conducted on an NVIDIA RTX 6000 GPU.
B. Experimental Details
B.1. Dataset
We double the official downsample factor used by 3DGS [18]
on Mip-NeRF 360 [5]. For Tanks and Temples [20], we use
the processed data from PixelGS [59] and downsample the
images by a factor of 2. To evaluate the metrics between
rendered images and ground-truth images, we follow
InstantSplat [10] to align the estimated poses from the stereo
reconstruction model [22, 41] to the ground-truth poses.
Initially, a coarse alignment is obtained through rigid point
registration between the estimated and ground-truth camera
positions at the training viewpoints. Subsequently, for
each rendered image, we fix the Gaussian primitives and
perform a test-time optimization on the camera pose
by minimizing the L1 difference between the rendered and
ground-truth images. This optimization is executed for 500
iterations using the Adam optimizer [19], with a learning
rate of 0.0003 for position and 0.0001 for rotation.
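The coarse alignment step can be illustrated with a closed-form least-squares registration of camera positions. The sketch below uses the Umeyama solution; the function name is hypothetical, and estimating a scale factor in addition to rotation and translation is an assumption (setting `with_scale=False` gives a strictly rigid fit). The subsequent test-time pose refinement requires a differentiable renderer and is not shown.

```python
import numpy as np


def align_positions(src, dst, with_scale=True):
    """Least-squares s, R, t such that dst ≈ s * R @ src + t (Umeyama solution)."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)          # cross-covariance of the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                        # avoid returning a reflection
    R = U @ S @ Vt
    if with_scale:
        scale = np.trace(np.diag(D) @ S) / src_c.var(axis=0).sum()
    else:
        scale = 1.0
    t = mu_dst - scale * R @ mu_src
    return scale, R, t
```

Here `src` would hold the estimated camera positions at the training viewpoints and `dst` the corresponding ground-truth positions, both as arrays of shape (N, 3).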
B.2. Baselines
All compared methods use the same camera poses and point
clouds from dense stereo reconstruction [22, 41]. Since
ZeroNVS [37] and ViewCrafter [53] are agnostic to the
reconstruction backbone, we adopt the same 2DGS backbone
[15, 50] for both methods to ensure a fair comparison.
In geometry evaluation, the same 2DGS [15, 50]
backbone is used as the baseline. For sparse views of an
unbounded scene with low overlap, ViewCrafter [53] needs to
iteratively generate novel views within a small region, run
DUSt3R [41] on the generated views to recover the scene’s point
cloud, and then repeat the process to generate the next portion
of the scene, conditioned on the previously generated images and
the DUSt3R point cloud. However, the feed-forward nature of
DUSt3R makes it struggle with inconsistencies in the generated
images, leading to inaccurate depths that degrade subsequent
generations. In contrast, our iterative fusion framework integrates
an uncertainty-aware GS optimization after each iteration
to correct generative errors promptly. The optimized,
3D-consistent GS rendering is then used to condition subsequent
generations, yielding consistent multi-view generations that
guide the next GS optimization. ViewCrafter [53] and ZeroNVS [37]
use the same group of unknown cameras to generate novel
views as our method.
Figure I. Comparison on the Mip-NeRF 360° [5] Dataset in the 3-View Setting. We qualitatively compare rendering quality with FSGS* [60],
InstantSplat [10], ZeroNVS* [37], and ViewCrafter [53] given 3 input views. (Columns: FSGS*, InstantSplat, ZeroNVS*, ViewCrafter, Ours,
Reference Ground Truth; rows: Counter, Room, Bonsai.)
Figure J. Comparison on the Tanks and Temples [20] Dataset in the 4-View Setting. We qualitatively compare rendering quality with
FSGS* [60], InstantSplat [10], ZeroNVS* [37], and ViewCrafter [53] given 4 views. (Columns: FSGS*, InstantSplat, ZeroNVS*, ViewCrafter,
Ours, Reference Ground Truth; rows: Barn, Caterpillar.)
Figure K. Comparison on 3D Surface Reconstruction. We qualitatively compare surface reconstruction with 2DGS [15] on (a) Counter
(top) and Treehill (bottom) from the Mip-NeRF 360° [5] dataset (3-view reconstruction) and (b) Caterpillar (top) and Horse (bottom) from
the Tanks and Temples dataset [20] (4-view reconstruction). (Columns in each panel: Ground Truth, 2DGS*, Ours.)
C. More Experiments
C.1. Novel View Synthesis
We show more rendering comparisons on Mip-NeRF 360 [5]
and Tanks and Temples [20], as illustrated
in Fig. I and Fig. J, respectively. FSGS [60] and
InstantSplat [10] exhibit severe distortion and needle-like
Gaussian artifacts in the rendering results. ZeroNVS [37] fails
to synthesize clear novel views due to the limited consistency
of its generative prior. ViewCrafter [53] cannot produce a
detailed and consistent rendering of the scene. In contrast, our
method produces crisp renderings and a complete scene
structure.
C.2. Geometry Evaluation
We show more geometry evaluation on Mip-NeRF 360 [5]
and Tanks and Temples [20], as illustrated in Fig. K. Given
the sparse views, the geometry from 2DGS has many missing
areas and distorted surfaces, such as the holes in Treehill, the
missing legs in Horse, and the floater at the top of Caterpillar.
Settings         PSNR ↑    SSIM ↑    LPIPS ↓
w/o L_d & L_n    16.723    0.310     0.505
w/o L_d          16.799    0.314     0.499
w/o L_n          16.825    0.313     0.498
Ours             16.95     0.321     0.480
Table E. Ablation studies on the geometric priors in our method.
In contrast, our method produces not only a complete and
smooth geometry of the scene but also detailed structures.
We show the F1-score precision and recall curves of 2DGS
and our method in Fig. L. Since the extremely sparse-view
surface reconstruction is ambiguous and error-prone, few
points lie within the official error threshold in Tanks and
Temples [ 20]. To facilitate a clearer comparison, we increase
the error threshold by a factor of 10 (represented by the black
dotted line in Fig. L).
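To make the effect of the enlarged threshold concrete, the following sketch computes precision, recall, and F-score between two point sets for a given distance threshold. It is a brute-force illustration, not the official Tanks and Temples evaluation toolbox; the function name is hypothetical and the toy threshold value is a placeholder.

```python
import numpy as np


def f_score(pred, gt, threshold):
    """Precision, recall, and F-score of point set pred against gt at a distance threshold."""
    # Pairwise distances, shape (len(pred), len(gt)); fine for small toy inputs only.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    precision = float((dists.min(axis=1) < threshold).mean())  # pred points near gt
    recall = float((dists.min(axis=0) < threshold).mean())     # gt points near pred
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.05], [1.0, 0.0, 0.05]])
base_threshold = 0.01  # placeholder value, not the official per-scene threshold
strict = f_score(pred, gt, base_threshold)
relaxed = f_score(pred, gt, 10 * base_threshold)
```

In this toy case no point falls within the strict threshold, so the score is zero and carries no information, while the threshold enlarged by a factor of 10 separates the two well-reconstructed points from the outlier.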
C.3. Ablation on geometry prior
As shown in Tab. E, we ablate the geometry priors L_d
and L_n in Eqs. (1) and (2) on Bicycle and Garden (3 views). The
geometry priors improve the rendering quality but are not the
key to sparse-view reconstruction.
[Figure L plots: two columns (2DGS*, Ours), one row per scene; x-axis: Meters; y-axis: # of points (%); curves: precision and recall.
Panel titles, in order: Barn, 18.17 f-score; Barn, 36.10 f-score; Caterpillar, 13.24 f-score; Caterpillar, 11.51 f-score;
Ignatius, 31.42 f-score; Ignatius, 77.84 f-score; Truck, 50.71 f-score; Truck, 43.57 f-score.]
Figure L. The Precision and Recall Curves of F1-score Comparisons. We show detailed precision and recall curves of F1-score
comparisons between 2DGS and our method on Barn, Caterpillar, Ignatius, and Truck, given 4 views. The black dotted line in each subfigure
denotes the error threshold.