Authors: Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi
SynCity: Training-Free Generation of 3D Worlds
Paul Engstler∗, Aleksandar Shtedritski∗, Iro Laina, Christian Rupprecht,
Andrea Vedaldi
Visual Geometry Group, University of Oxford
{paule,suny,iro,chrisr,vedaldi}@robots.ox.ac.uk
Figure 1. We introduce SynCity, a novel method that generates, from a prompt, complex and immersive 3D worlds that can be navigated
freely. Our method is training-free and leverages powerful language, 2D and 3D generators via novel prompt engineering strategies.
Abstract
We address the challenge of generating 3D worlds from
textual descriptions. We propose SynCity, a training- and
optimization-free approach, which leverages the geometric
precision of pre-trained 3D generative models and the artis-
tic versatility of 2D image generators to create large, high-
quality 3D spaces. While most 3D generative models are
object-centric and cannot generate large-scale worlds, we
show how 3D and 2D generators can be combined to gener-
ate ever-expanding scenes. Through a tile-based approach,
we allow fine-grained control over the layout and the
appearance of scenes. The world is generated tile-by-tile, and
each new tile is generated within its world-context and then
fused with the scene. SynCity generates compelling and
immersive scenes that are rich in detail and diversity.
∗Equal contribution
1. Introduction
We consider the problem of generating 3D worlds from tex-
tual descriptions. Generating 3D content, for example, for
video games, virtual reality, special effects and simulation,
is highly laborious and time-consuming. When it comes to
generating entire 3D scenes, much of the content is not of
particular artistic value, and its manual creation, which is
still necessary, may be seen as a waste of human resources,
talent and creativity. Generative models can help to reduce
or even remove this burden by largely automating many of
these mundane tasks.
arXiv:2503.16420v1 [cs.CV] 20 Mar 2025
The advent of modern generative AI has significantly im-
pacted 3D content generation and promises to reduce the
cost of its production dramatically. DreamFusion [39] was
among the first to co-opt state-of-the-art diffusion-based 2D
image generators [43] to create 3D objects. The area has
since matured significantly by fine-tuning 2D image gener-
ators to produce multiple consistent views of an object [14,
32, 47, 52] and by learning few-view 3D reconstruction net-
works [24, 63]. More recently, the focus has shifted to
methods that learn a 3D latent space [10, 25, 26, 61, 66],
which can then be sampled to generate 3D objects. Because
the latent space directly encodes the 3D structure of the ob-
ject rather than its 2D appearance, these methods can gen-
erate much more accurate and regular geometry.
Despite their advantages, 3D generative methods have
been so far mostly limited to generating individual objects.
However, the most promising usage of 3D generative AI is
the construction of entire virtual worlds, as this is where
automation can make by far the most difference. There is
ample literature on generating 3D scenes from textual or
image prompts. Most such methods are image-based, and
progressively reconstruct larger scene regions by expanding
from an initial image [8, 12, 13, 17, 30, 37, 42, 64, 67], com-
bining depth prediction, image and depth outpainting, and
3D reconstruction using NeRF [35] or 3D Gaussian Splat-
ting [19]. The main advantage of these approaches is that
they can leverage powerful 2D image generator models to
create the first and subsequent views of the scene. These 2D
generators allow the overall system to understand complex
textual prompts and generate corresponding 3D scenes with
good artistic quality. However, it is difficult for these ap-
proaches to maintain a coherent 3D structure across a large
scene. For example, while the reconstructed scene may en-
velop the observer in a 360° manner, it is generally not pos-
sible to ‘walk’ into the scene for more than a few steps.
This is the case even for state-of-the-art implementations
like the one of World Labs [21], a company specialized in
SpatialAI.
A challenge with extending scenes beyond these ‘3D
bubbles’ is that it is difficult for image-based methods to
maintain consistency incrementally without drifting. We
argue that 3D generative models might be preferable as
they can regularize and constrain the reconstructed geom-
etry, including hallucinating shape and textures in regions
behind the visible sides of objects. This is clearly shown
in prior works like BlockFusion [59] and LT3SD [33],
where large coherent spaces can be generated. However,
these methods are limited in the quality and diversity of the
generated scenes as it is difficult to train 3D generative mod-
els directly for scene generation. In particular, unlike their
view-based peers, these methods do not build on an image
generator. They thus cannot benefit from the artistic quality
and ability to interpret complex textual prompts that come
from pre-training 2D generators on billions of images.
In this work, we thus seek to build on 3D generative
models while still leveraging the strengths of 2D image
generators to generate large, high-quality 3D spaces that can
be navigated freely (Fig. 1). First, we note that while 3D
models like TRELLIS [61] are trained for object-level
reconstruction, they can reconstruct fairly complex local
compositions of multiple objects. Borrowing ideas from video
game world building, we show that TRELLIS can effectively
generate, if not an entire world, at least a tile
representing a local region of the world. In particular, we
show how to prompt the model with an ‘isometric’ view
of the tile and then generate the tile in 3D.
Given this basic capability, we then look at the problem
of generating a large scene by generating and stitching to-
gether multiple tiles. We build on a text-to-image generator
(Flux [20]) and propose a novel way of prompting that sta-
bilizes it to consistently produce tiles with a similar isomet-
ric framing. In this manner, we encourage the tiles’ framing
to be stable and compatible between different tiles, making
it easier to stitch them together.
To ensure that tiles fit together appropriately, we propose
two mechanisms. First, we encourage consistency in ap-
pearance by using previously generated tiles to draw con-
text for the image generator, where each new tile inpaints
a missing region in a 2D isometric view of the scene. Sec-
ondly, we enforce geometric consistency by blending the
3D representations of neighbouring tiles using the 3D gen-
erative model.
2. Related Work
Novel view synthesis for scenes. Expanding an image
beyond its boundaries has been a long-standing task in
computer vision. Early methods that sought to expand
object-centric scenes rely on layer-structured representa-
tions [23, 34, 48, 50, 54, 55], which disregard the scene’s
true geometry. SynSin [58] is a pivotal work, where im-
age features are projected and used as conditioning to gen-
erate novel views, achieving geometric and semantic con-
sistency. ZeroNVS [44] introduces high-quality results
with fine-grained control of the camera but remains object-
centric. GenWarp [45] integrates semantic information
through cross-attention when generating a novel view.
The major challenges for these methods remain seman-
tic drift and object permanence. To obtain an explicit
3D representation, the generated views need to be trans-
ferred into such a representation, e.g., NeRF [35] or Gaus-
sians [18, 19], where any geometric conflicts would need to
be resolved.
[Figure 2 diagram. Panels: 2D prompting, 3D prompting, 3D blending. Block labels: Text prompt, Flux Inpainting, FG Extract & Rebase, TRELLIS Image-to-3D, Crop & Reorient, Stitch & Render, Stitch, TRELLIS Denoise & Blend, Condition, Render, Add to world.]
Figure 2. Overview of SynCity. 2D prompting: To generate a new tile, we first render a view of where that tile should be placed, including
context from neighbouring tiles. 3D prompting: We extract the new tile image and construct an image prompt for TRELLIS by adding
a wider base under the tile. 3D blending: The 3D model that TRELLIS outputs is usually not well blended with the rest of the scene.
To address that, we render a view of the new tile next to each neighbouring tile, and inpaint the region between the two with an image
inpainting model. Next, we condition using that well-blended view to refine the region between the two 3D tiles. Finally, the new, blended,
tile is added to the world.
Image projection-based scene generation. A different
line of work follows the paradigm of building the 3D repre-
sentation of a scene sequentially using 2D image generation
models [8, 13, 17, 22, 37, 42, 57, 60, 64, 67]. Most of them
employ an image generation model to outpaint the existing
scene using pre-defined camera poses. The results are then
fused in 3D with depth prediction models. Text2Room [17]
generates meshes of indoor scenes. As the scene is clearly
delimited by the bounds of the mesh, it can be freely ex-
plored. LucidDreamer [8] and Text2Immersion [37] go be-
yond indoor scenes, but their generated scenes reveal geo-
metric inconsistencies when stepping away from the camera
poses used to generate the scene. Invisible Stitch [12] ad-
dresses this issue by inpainting depth (rather than naively
aligning it) and RealmDreamer [49] proposes multiple op-
timization losses to refine the generated scene. Despite
these improvements, the resulting scenes still suffer from
geometric artifacts and remain rather small. WonderJour-
ney [64] introduces novel ideas for depth fusion, such as
grouping objects at similar disparity to planes and sky depth
refinement, enabling large ‘scene journeys’, where indepen-
dent representations are built between scene ‘keyframes’,
but these are not merged into one coherent scene.
WonderWorld [65] leverages these improvements to build a
single scene, allowing interactive updates, but the true extent
of the generated scenes remains limited. Other works use
panoramas [51, 56] or implicit representations [2, 44] but
the freedom of movement remains constricted.
Procedural scene generation. Further methods permit
long-range fly-overs over nature [5–7, 27, 30] or cities [29,
46, 62]. These usually generate procedural unbounded im-
ages (e.g., the terrain make-up or a city layout). While
those methods create realistic-looking images, they are of-
ten monotonous as the methods are domain-specific and
thus highly constrained in the variety they can generate.
3D scene generation. Instead of merely generating images
of a scene or outpainting it only in 2D, further methods gen-
erate the representation directly. Set-the-Scene [9] adds a
layer of control to the layout of NeRF scenes by defining
object proxies. BlockFusion [59] learns a network to auto-
regressively diffuse small blocks to extend a mesh. A 2D
layout conditioning is used to control the generation pro-
cess, allowing users to generate scenes of rooms, a village,
and a city. While the method allows building large-scale
scenes, the variety of the objects it generates is severely lim-
ited as it requires domain-specific 3D training data. Further-
more, it generates untextured meshes. LT3SD [33] learns a
diffusion model that generates 3D environments in a patch-
by-patch and coarse-to-fine fashion. However, this method
is only trained to produce indoor scenes.
At the same time, the synthesis of complex, high-fidelity
objects has been enabled by the rapid progress in the fields
of text-to-3D and image-to-3D generation [25, 26, 28, 31,
40, 41, 47, 53, 61, 68, 70]. Trained on large-scale curated
subsets of 3D datasets such as Objaverse-XL [11], these
models can generate a large variety of different objects.
However, to the best of our knowledge, no prior work
has leveraged object generators for scene generation.
3. Method
Our goal is to generate a 3D world $G$ from an initial
textual prompt $p_0$. Our main result is to show how prompt
engineering can be used in combination with off-the-shelf
language, 2D and 3D generators to create the entire world
automatically, with no need to retrain the models.
We structure the world as a $W \times H$ grid
$\mathcal{T} = \{0, \dots, W-1\} \times \{0, \dots, H-1\}$ of square tiles, each of
which can contain several complex 3D objects (e.g., a
building, a bridge, trees, etc.) as well as the ground surface. We
generate the world progressively, tile by tile, as shown in
Fig. 3. Hence, when tile $(x, y) \in \mathcal{T}$ is generated, tiles
$$\mathcal{T}(x, y) = \{(x', y') \in \mathcal{T} : y' < y \lor (y' = y \land x' < x)\}$$
have already been generated.
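The set $\mathcal{T}(x, y)$ corresponds to a row-major traversal of the grid. The following is a minimal sketch of that ordering; the function names `generation_order` and `already_generated` are illustrative and do not come from the paper.

```python
from typing import List, Tuple

Tile = Tuple[int, int]


def generation_order(W: int, H: int) -> List[Tile]:
    """All tiles of a W x H grid in row-major order (x varies fastest)."""
    return [(x, y) for y in range(H) for x in range(W)]


def already_generated(x: int, y: int, W: int, H: int) -> List[Tile]:
    """Tiles generated before (x, y): earlier rows, or same row with smaller x."""
    return [
        (xp, yp)
        for (xp, yp) in generation_order(W, H)
        if yp < y or (yp == y and xp < x)
    ]


if __name__ == "__main__":
    print(already_generated(1, 1, W=3, H=2))
```

For a 3×2 grid, the tiles preceding (1, 1) are the whole first row plus (0, 1).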
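[Figure 3 diagram: a tile grid with axes $x$ (0 to $W-1$) and $y$ (0 to $H-1$), and an isometric view of a tile.]
Figure 3. Left: Progressive generation of world tiles $\mathcal{T}$. Right:
Isometric framing of a tile for image-based prompting.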
An overview of our approach is shown in Fig. 2. The first
step of our method is to expand the world description $p_0$ into
tile-specific prompts (Sec. 3.1). The second step is to pass
these tile-specific prompts to a 2D image generator and
inpainter to create an isometric view of each tile, accounting
for the part of the world generated so far (Sec. 3.2). The
third step is to extract image prompts from these isometric
views and use them as input to an image-to-3D generator
to reconstruct each tile’s geometry and appearance in
3D (Sec. 3.3). The final step is to align and blend the 3D
reconstructions of the tiles to create a coherent 3D world
(Sec. 3.4).
3.1. Prompting the Language Model
The goal of language prompting is to take a high-level
textual description of the world $p_0$ and expand it into a set $p$
of tile-specific textual prompts that can be used to generate
the 3D world. Specifically, $p$ is a collection of sub-prompts
$p_{xy} \in \Sigma^*$, one for each tile, and a world-level ‘style’ prompt
$p^\star \in \Sigma^*$, so that we can write $p = \{p_{xy}\}_{(x,y) \in \mathcal{T}} \cup \{p^\star\}$,
where $\Sigma^*$ is the set of all possible strings.
The prompt $p$ could be constructed manually (which
allows controlling the content of each tile) or generated by a
large language model (LLM) such as ChatGPT [36] from
a ‘seed’ prompt. For the latter, we prompt ChatGPT o3-
mini-high to generate a grid-like world with tile-specific
descriptions, providing it with an example JSON file (see
Appendix A.1 for details).
3.2. Prompting the 2D Generator
We use the language prompts $p$ from Sec. 3.1 to prompt an
off-the-shelf 2D image generator $\Phi_{2D}$ to output a 2D image
$I(x, y)$ of each tile to be generated, as shown for example in
Fig. 4. The image $I(x, y)$ must satisfy several constraints:
(1) It must reflect the tile-specific instructions $p_{xy}$ of the
target tile as well as the world-level instructions $p^\star$. (2) It
must be suitable for prompting the image-to-3D generator
in the next step. (3) It must be consistent with the previously
generated tiles.
Our prompting strategy is designed to satisfy these
constraints. The image is drawn as a sample
$I(x, y) \sim \Phi_{2D}(q, B, M)$ from the 2D image generator $\Phi_{2D}$, where
$q = p_{xy} \cdot p^\star$ is a prompt that combines the tile-specific
and world-level descriptions. The generator $\Phi_{2D}$ is also
provided with a base image $B$ and an inpainting mask $M$ that
constrain the output. We assume that $\Phi_{2D}$ is capable of
inpainting, a common feature of modern image generators.
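As an illustration of how the prompt collection and the combined prompt $q$ could be represented, here is a small sketch. The dictionary layout, the example strings, the separator and the helper name `compose_prompt` are all assumptions made for this example; the paper's actual JSON schema is described in its appendix and may differ.

```python
from typing import Dict, Tuple

Tile = Tuple[int, int]

# Hypothetical prompt collection: one style prompt plus one prompt per tile.
style_prompt: str = "a medieval town, isometric view"
tile_prompts: Dict[Tile, str] = {
    (0, 0): "a stone well surrounded by market stalls",
    (1, 0): "a blacksmith's forge with a chimney",
}


def compose_prompt(
    tile_prompts: Dict[Tile, str], style_prompt: str, x: int, y: int, sep: str = ". "
) -> str:
    """Concatenate the tile-specific prompt with the world-level style prompt.

    The separator is an assumption; the text only states that the two
    descriptions are combined.
    """
    return f"{tile_prompts[(x, y)]}{sep}{style_prompt}"


if __name__ == "__main__":
    print(compose_prompt(tile_prompts, style_prompt, 0, 0))
```

Keeping the style prompt separate from the tile prompts makes it easy to reuse the same world-level description for every tile.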
Tile inpainting. To satisfy constraint (2), we need to en-
courage the image generator to generate regular tiles so that
the image-to-3D model can output tiles with regular geom-
etry that fit well together. We assume that tiles have a fixed
square basis of unit size and that they are imaged in an ‘iso-
metric’ manner. This framing of the tiles is conducive to the
generation of regular 3D tiles. Furthermore, it is a common
choice in video games and might have been observed by the
image generator during training as these are often trained
on game-like data.
Hence, our goal is to condition the image generator $\Phi_{2D}$
to produce images of this kind. While one possible
approach is to fine-tune the generator on such images, we
demonstrate that it is possible to obtain this effect through
prompt engineering alone, avoiding any retraining. We
achieve this by carefully constructing the inputs $B$ and $M$
as shown in Fig. 4. Specifically, we set $B$ to be the image
of the base of the tile, as a square, grey slab imaged from
a fixed isometric vantage point. The mask $M$ is a binary
mask covering a cube on top of the base.
Figure 4 shows the result of prompting the model in this
manner as well as what happens if signals $B$ and $M$ are
removed: the viewpoint and general frame of the tile are
random and not suitable for 3D generation.
[Figure 4 diagram. Left: $\Phi_{2D}$ with inputs $B$, $M$ and $q$. Right: $\Phi_{2D}$ with prompt $q$ only.]
Figure 4. Left: Generation of the 2D image prompt for the first
world tile at $x = 0$ and $y = 0$. The image generator $\Phi_{2D}$ is
conditioned on $q = p_{00} \cdot p^\star$ and tasked with inpainting the base
image $B$ in the masked region $M$. Right: If we do not ‘frame’ the
image by using $B$ and $M$, the generator produces an image which
is not suitable for tiling.
Taking the context into account. Except for the first tile
$(0, 0)$, the tiles are generated in the context of the world
already generated. In order to account for this context, for
tiles with $x, y > 0$, we modify the base image $B$ and the
mask $M$ as shown in Fig. 5. For the base image $B$, we
render the part of the 3D world generated so far,
providing context for the inpainting network. We also modify
the mask $M$ to avoid covering tiles already generated to
the left (‘west’; i.e., for a tile $(i, j) \in \mathcal{T}$ these are tiles
$\{(x, y) \in \mathcal{T} : x < i \land y = j\}$).
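The mask adjustment can be pictured with two small helpers. This is only a sketch: `west_tiles` follows the set definition above, while `exclude_from_mask` assumes that the image-space footprint of the already generated west tiles is available as a boolean grid of the same size as the mask, which is a simplification introduced here.

```python
from typing import List, Tuple

Tile = Tuple[int, int]
BoolGrid = List[List[bool]]


def west_tiles(i: int, j: int) -> List[Tile]:
    """Tiles in the same row as (i, j) that lie to its left (smaller x)."""
    return [(x, j) for x in range(i)]


def exclude_from_mask(mask: BoolGrid, footprint: BoolGrid) -> BoolGrid:
    """Clear every mask pixel that falls on the footprint of existing tiles."""
    return [
        [m and not f for m, f in zip(mask_row, foot_row)]
        for mask_row, foot_row in zip(mask, footprint)
    ]


if __name__ == "__main__":
    print(west_tiles(2, 1))
    print(exclude_from_mask([[True, True]], [[True, False]]))
```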
Figure 5. Left: Base image $B$ and inpainting mask $M$ (white
overlay) to prompt the image generator $\Phi_{2D}$ to generate an image for a
world tile with $x > 0$, $y > 0$. Right: Result of inpainting.
Because we wish to ensure continuity of the ground, be-
fore rendering this contextual image, we trim any 3D geom-
etry that is sufficiently high to occlude the tile we wish to
generate, as shown in Fig. 6 (see the result in Fig. 5).
Figure 6. Trimming tall structures for 2D prompting.
The appendix discusses a special case for tiles at the
boundaries of the world (see Appendix A.2).
3.3. Prompting the 3D Generator
Given the tile image $I(x, y)$ obtained from the 2D image
generator in Sec. 3.2, the next goal is to generate a
corresponding 3D reconstruction $G(x, y)$ of the tile utilizing
an image-to-3D model $\Phi_{3D}$. We opt for using a robust 3D
generator and select TRELLIS [61] due to its good
performance, ability to generate both shape and texture, and latent
space structure, which will be easy to manipulate for
blending as we show later in Sec. 3.4.
Hence, 3D reconstruction amounts to drawing a sample
$G(x, y) \sim \Phi_{3D}(J(x, y))$ from the image-to-3D generator
$\Phi_{3D}$. Rather than conditioning on the image $I(x, y)$, we use
a pre-processed version $J(x, y)$, as described next.
2D foreground extraction and rebasing. Recall that the
image $I(x, y)$ output by the 2D generator of Sec. 3.2 is an
image of the tile and its context. However, the 3D
generator $\Phi_{3D}$ expects the input image to only show the
object that needs to be reconstructed, i.e., the new tile. The
first step is thus extracting only that part from $I(x, y)$ that
corresponds to the new tile, which we do by applying the
inpainting mask and then running rembg [15] with alpha
matting [4] to remove the background, as shown in Fig. 7.
[Figure 7 diagram. Foreground extraction: mask, rembg. Rebasing: paste onto image of base (2D).]
Figure 7. Left: Isolating the image of the new tile from $I(x, y)$.
Right: Placing a slightly larger base underneath.
The resulting image is narrowly cropped around the new
tile. Similar to Sec. 3.2, we found it beneficial to hallucinate
a base for the tile, an operation that we call ‘rebasing’, as
shown in Fig. 7. We simply compose the image of the tile
with a slightly larger gray slab (in 2D) to obtain J(x, y),
which in effect provides a ‘frame’ for the 3D generator to
work with. The base is reconstructed as part of the tile’s
geometry, which can be used for validation and as a simple-
to-detect handle for further 3D processing.
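The 2D composition step can be illustrated with standard alpha compositing ("over" operator). The sketch below assumes that the extracted tile is an RGBA image and that a pre-rendered image of the larger slab is available at the same resolution; the pure-Python pixel representation and the name `alpha_over` are choices made for this example only, not the authors' implementation.

```python
from typing import List, Tuple

RGBA = Tuple[float, float, float, float]  # channels in [0, 1]
RGB = Tuple[float, float, float]


def alpha_over(tile: List[List[RGBA]], base: List[List[RGB]]) -> List[List[RGB]]:
    """Composite the tile image over the slab image, pixel by pixel."""
    out: List[List[RGB]] = []
    for tile_row, base_row in zip(tile, base):
        out_row: List[RGB] = []
        for (r, g, b, a), (br, bg, bb) in zip(tile_row, base_row):
            # Opaque tile pixels win; transparent ones reveal the slab.
            out_row.append(
                (a * r + (1 - a) * br, a * g + (1 - a) * bg, a * b + (1 - a) * bb)
            )
        out.append(out_row)
    return out


if __name__ == "__main__":
    tile = [[(1.0, 0.0, 0.0, 1.0), (0.0, 0.0, 0.0, 0.0)]]
    slab = [[(0.5, 0.5, 0.5), (0.5, 0.5, 0.5)]]
    print(alpha_over(tile, slab))
```

In this toy case the opaque red pixel is kept and the transparent pixel shows the grey slab.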
The ‘rebased’ image $J(x, y)$ is fed to the 3D generator
$\Phi_{3D}$ to obtain the 3D reconstruction $G(x, y)$ of the tile,
which is represented as 3D Gaussian Splats (3DGS). The effect of
rebasing on the 3D result is shown in Fig. 8.
[Figure 8 panels: TRELLIS input (2D); TRELLIS output (3D), front view and back view.]
Figure 8. Top: 3D reconstruction using a tight base. Bottom: the
same, but with a slightly larger base, which helps to contain the
tile’s geometry above ground (see the back of the reconstruction),
and creates an easy-to-detect 3D base.
3D geometric validation. Because the generators are
imperfect, we verify the 3D reconstruction $G(x, y)$ to ensure
that it is of sufficient quality. If not, we discard it and
generate the tile again using a different random seed. To verify
the tile, we use a few heuristics that check that the tile’s
geometry occupies a square region of sufficient size and that
the base of the tile has been reconstructed faithfully. Please
see Appendix A.3.
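The discard-and-retry logic can be sketched as a simple loop over seeds. Everything here is illustrative: `generate` stands in for the image-to-3D call, `is_valid` for the heuristics of Appendix A.3, and the cap `max_attempts` is an assumption added so that the loop terminates.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_until_valid(
    generate: Callable[[int], T],
    is_valid: Callable[[T], bool],
    first_seed: int = 0,
    max_attempts: int = 10,
) -> Optional[T]:
    """Try consecutive seeds until a candidate passes validation.

    Returns None if no candidate is accepted within max_attempts tries.
    """
    for seed in range(first_seed, first_seed + max_attempts):
        candidate = generate(seed)
        if is_valid(candidate):
            return candidate
    return None


if __name__ == "__main__":
    # Toy stand-ins: the "tile" is just a number and is valid once it reaches 9.
    print(generate_until_valid(lambda seed: seed * seed, lambda v: v >= 9))
```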
3D post-processing. At this point, we have verified the 3D
reconstruction $G(x, y)$ for the tile as a mixture of 3D
Gaussians. However, the actual 3D footprint, orientation, and
size of the tile are controlled by the 3D generator and are in-
consistent. The post-processing step applies simple heuris-
tics to refine the 3D Gaussian representation by first crop-
ping out the added base, then rescaling the tile to a unit size,
and finally reorienting it to match the 2D image prompt. We
explain this in more detail in Appendix A.4.
3.4. 3D Blending
At this point of the pipeline, we have reconstructed all 3D
tiles $G(x, y)$, $(x, y) \in \mathcal{T}$. As a result of the prompting
and post-processing steps in Secs. 3.2 and 3.3, the tiles are
already approximately aligned and correctly oriented, with
their ground level at roughly the same height in 3D space.
Because the 3D reconstructions are 3D Gaussian Splats, it
is easy to simply take their union as the reconstruction of
the whole world.
Even so, the boundaries of the tiles may not match per-
fectly. This is largely due to the fact that TRELLIS does
not reconstruct the input images exactly and to the fact that
only a single view of each tile is provided to it, which only
indirectly controls the reconstruction of the back of the tile.
In this section, we thus propose a method to improve the
blending of tiles, ensuring that the 3D world is coherent and
continuous.
In particular, we regenerate the boundary region of two
adjacent tiles in the latent space of TRELLIS, in essence
blending it, and then decode it into 3DGS. Next, we discuss
the specifics of this process.
[Figure 9 panels: Original tile; Naïve upsampling; Proposed upsampling.]
Figure 9. Upsampling sparse latents. We need to resize or
upsample sparse latents in order to stitch them. Due to the sparsity
of the latents and the behaviour of the latent decoder, naively
resampling in latent space leads to artifacts. Our proposed resizing
of the sparse latents better preserves textures and fine structures.
Blending in 2D. To blend the latents of two neighbouring
tiles, we first predict the appearance of the boundary
between the two tiles. To that end, we place the two 3D tiles
next to each other, render a frontal view, and inpaint the
middle region of the rendering (Fig. 2) with $\Phi_{2D}$. This leads
to a well-blended image, which we use as the conditioning input for $\Phi_{3D}$.
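Blending in 3D. Next, we use $\Phi_{3D}$ to blend the latents. We
take the latents of the two tiles $\gamma^1$ and $\gamma^2$, where
$\gamma^1, \gamma^2 \in \mathbb{R}^{D \times R \times R \times R}$ are $D$-dimensional features in the $R$-sized 3D
grid that TRELLIS denoises. We put them together in a new
volume $\gamma$, where the side where they meet is in the middle:
$$\gamma_{:,x,y,z} = \begin{cases} \gamma^1_{:,x+R/2,y,z}, & \text{if } x < R/2 \\ \gamma^2_{:,x-R/2,y,z}, & \text{if } x \geq R/2. \end{cases}$$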
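The stitched volume can be sketched with plain nested lists indexed as `gamma[d][x][y][z]`. This is an illustrative toy and assumes an even grid size $R$ and dense latents; TRELLIS uses sparse latents, so an actual implementation would look different.

```python
from typing import List

Volume = List[List[List[List[float]]]]  # indexed as [d][x][y][z]


def stitch_latents(gamma1: Volume, gamma2: Volume) -> Volume:
    """Place the second x-half of gamma1 next to the first x-half of gamma2.

    The seam between the two tiles ends up at x = R / 2 of the new volume.
    """
    R = len(gamma1[0])
    half = R // 2
    return [ch1[half:] + ch2[:half] for ch1, ch2 in zip(gamma1, gamma2)]


if __name__ == "__main__":
    # D = 1 channel, R = 4 along x, a single cell along y and z.
    g1 = [[[[10.0]], [[11.0]], [[12.0]], [[13.0]]]]
    g2 = [[[[20.0]], [[21.0]], [[22.0]], [[23.0]]]]
    print(stitch_latents(g1, g2))
```

The slices are shallow, so the result shares its inner y/z lists with the inputs; copy them first if the stitched volume is modified in place.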
We apply the denoising function $\Omega$, which is the latent
denoiser of $\Phi_{3D}$, to the volume $\gamma$, but only within the middle
region where we have applied the stitch, i.e., for
$x \in [R/2 - r, R/2 + r]$ for some $r < R/2$, while keeping the rest fixed.
Formally, we initialise $\tilde{\gamma} \sim \mathcal{N}(0, I)$ and at each denoising
step $t$, we update $\tilde{\gamma}$ as:
$$\tilde{\gamma}_{t+1,:,x,y,z} = \begin{cases} \Omega(\tilde{\gamma}_{t,:,x,y,z}), & \text{if } |x - R/2| \leq r \\ \gamma_{t+1,:,x,y,z}, & \text{otherwise,} \end{cases}$$
where $\gamma_t$ is obtained by adding noise to the original $\gamma$ at the
corresponding noise level for step $t$. In practice, we only
denoise the second stage of TRELLIS, keeping the occupancy
latents of the first stage fixed. The reason for that is that
the first stage is at a very low spatial resolution ($R = 16$,
compared to $R = 64$ of the second stage), which gives little
flexibility for the size of the denoised region $r$.
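A toy version of one masked update is shown below, collapsed to the x axis so that each list entry stands for a whole slab of latents. The `denoise` callable is a placeholder for the latent denoiser, and the noised copy of the original volume is passed in directly rather than computed from a noise schedule; both are simplifications made for this sketch.

```python
from typing import Callable, List


def blend_step(
    gamma_tilde_t: List[float],
    gamma_noised_next: List[float],
    denoise: Callable[[List[float]], List[float]],
    r: int,
) -> List[float]:
    """One masked update: denoise inside the seam band, copy elsewhere.

    Positions with |x - R/2| <= r take the denoiser output; all other
    positions are reset to the noised original latents for the next step.
    """
    R = len(gamma_tilde_t)
    denoised = denoise(gamma_tilde_t)
    return [
        denoised[x] if abs(x - R // 2) <= r else gamma_noised_next[x]
        for x in range(R)
    ]


if __name__ == "__main__":
    R = 8
    current = [0.3] * R
    noised_original = [1.0] * R
    # Placeholder denoiser that maps everything to zero.
    print(blend_step(current, noised_original, lambda v: [0.0] * len(v), r=1))
```

Resetting the region outside the band at every step is what keeps the two tiles unchanged away from the seam.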
Win Rate (%)
| Overall | Geometry | Exploration | Diversity | Realism |
|---------|----------|-------------|-----------|---------|
| 90.9    | 81.8     | 90.9        | 90.9      | 86.4    |
Table 1. Win rates of our method against BlockFusion. We asked
participants to select which scene they prefer overall, as well as
which one has better geometry, would be more interesting to
explore, is more diverse, and has better realism.
[Figure 10 panels: With context; Without context.]
Figure 10. Left: 2×2 grid generated with our method, where
context is taken into account as described in Sec. 3.2. Right: Generated
with our method using the same prompts, but not taking into
account context. Here, the scale of the buildings is not consistent.
Upsampling the latents. Remember that due to the
rebasing, $G(x, y)$ contains a 3D base. While we have removed
the base in 3DGS space, we have yet to do the same in the
latent space. We use the same cuts we applied in 3DGS
space to now crop the latents, rounding the cuts to account
for the discrete nature of the latent voxel grid. In the
previous step, however, we assumed that the latents $\gamma^1$ and $\gamma^2$
have the same spatial resolution, $\gamma \in \mathbb{R}^{D \times R \times R \times R}$. After
cropping, this is not the case any more if the cuts of
neighbouring tiles differ. Thus, similarly to how we resize the
tiles’ 3DGS to a unit size, we have to upsample the
now-cropped latents back to the original grid resolution $R$.
We found that naively upsampling the latents by interpo-
lation leads to poor reconstructions, as shown in Fig. 9. We
attribute this to the sparse structure of the latents and quirks
of the latent decoder of TRELLIS.
We propose the following upsampling scheme. First,
we upsample the cropped occupancy volume that TRELLIS
predicted to the original resolution, $V \in \{0, 1\}^{R \times R \times R}$.
Next, we denoise a new set of latents $\gamma$ on the upsampled
occupancy volume. To preserve the details and textures of
the original 3D tile, we render it from multiple views and
jointly condition the denoising on all of them. In practice,
when denoising with multiple conditioning views, at each
timestep, the denoising step is computed as the average de-
noising step across all views. We show that this upsampling
scheme leads to superior reconstructions in Fig. 9.
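The multi-view conditioning rule described above, averaging the denoising step across views, can be sketched as follows. The latents are flattened to plain lists of floats for brevity, and the function name is illustrative.

```python
from typing import List


def average_denoising_step(per_view_steps: List[List[float]]) -> List[float]:
    """Element-wise mean of the denoising steps predicted for each view."""
    n_views = len(per_view_steps)
    return [sum(values) / n_views for values in zip(*per_view_steps)]


if __name__ == "__main__":
    # Two conditioning views, two latent entries each.
    print(average_denoising_step([[1.0, 2.0], [3.0, 6.0]]))
```

With a single view the average reduces to the ordinary conditioned step, which matches the "single frame" setting in the ablation.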
4. Experiments
Experimental details. We generate the text prompts using
ChatGPT o3-mini-high. For the 2D inpainter, we use the
Flux ControlNet of [1].
| Method      | Base Area | Squareness ↑ | Completeness ↑ |
|-------------|-----------|--------------|----------------|
| No Rebasing | 2271      | 0.92         | 0.73           |
| Ours        | 4096      | 1.00         | 1.00           |
Table 2. Average tile 3D geometry metrics for an approach
without rebasing and our method. Rebasing is crucial to ensure a tile
is square and its base has been reconstructed faithfully. The
metrics we use are the area of the base in voxels, a measure for the
‘squareness’ of the base, and how many border voxels have been
faithfully reconstructed. For details, please refer to the appendix.
Human preference. We evaluate human preference for the
results generated by our method and those obtained with
BlockFusion [59]. In particular, we compare a ‘city’ scene,
showing the entire scene as well as close-up detail views.
As seen in Tab. 1, participants ($n = 22$) find our method
better overall, with better geometry, realism, and diversity.
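As a quick sanity check on the reported percentages, the sketch below searches for the vote counts out of $n = 22$ that round to each win rate in Tab. 1. The resulting counts are an inference from the published numbers and are not stated in the paper.

```python
from typing import Dict, List

N_PARTICIPANTS = 22
REPORTED: Dict[str, float] = {
    "Overall": 90.9,
    "Geometry": 81.8,
    "Exploration": 90.9,
    "Diversity": 90.9,
    "Realism": 86.4,
}


def consistent_vote_counts(rate_percent: float, n: int) -> List[int]:
    """Vote counts k such that 100 * k / n rounds to the reported rate."""
    return [k for k in range(n + 1) if round(100 * k / n, 1) == rate_percent]


if __name__ == "__main__":
    for criterion, rate in REPORTED.items():
        print(criterion, consistent_vote_counts(rate, N_PARTICIPANTS))
```

Under this reading, 90.9%, 86.4% and 81.8% correspond to 20, 19 and 18 of the 22 participants, respectively.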
4.1. Ablations
Here we ablate several components of our approach, show-
ing the importance of each of them.
Building a grid. A naive approach to generating a 3D
scene is querying the image generator to produce an image
of a large-scale scene (using our 2D image prompt setup)
and then obtaining the entire 3D world directly with TREL-
LIS. To achieve the same level of control provided by our
method, the textual prompt needs to be highly detailed and
include layout instructions. However, we found neither pre-
cise nor abstract prompts to be effective at steering the gen-
erations of Flux (for details, see A.4).
2D prompting context. We remove context from neigh-
bouring tiles, as described in Sec. 3.2. Doing that, each tile
is sampled independently, and the relative scale between ob-
jects is inconsistent, as shown in Fig. 10.
Rebasing. To place tiles on a grid, they need to be square
(otherwise the grid would be jagged) and their base needs to
have been reconstructed faithfully (clearly delimiting where
the tile stops). Without rebasing, the geometry generated
by TRELLIS might extend beyond the base and make the
tile’s ‘true’ extent difficult to detect, as shown in Fig. 8. We
ablate the effect of rebasing, using a small 2×2 scene to
curb the effect of error accumulation. As seen in Tab. 2,
omitting rebasing causes TRELLIS to generate tiles that, on
average, are not perfectly square and lack a solid border.
| Method              | LPIPS ↓ | SSIM ↑ | FID ↓ | KID ↓ |
|---------------------|---------|--------|-------|-------|
| Naive upsampling    | 0.5914  | 0.3093 | 200.5 | 0.243 |
| Ours (single frame) | 0.3517  | 0.5149 | 111.6 | 0.069 |
| Ours (multi frame)  | 0.3212  | 0.5312 | 89.1  | 0.051 |
Table 3. Perceptual metrics for our methods and the naive
approach. Lower values for LPIPS [69], FID [16], and KID [3] are
better, while higher values for SSIM are better. We see that even
using a single conditioning frame leads to better upsampling
results, and multiple frames further improve performance.
Figure 11. Exploring a 3D world. We show trajectories exploring the 3D worlds we generate. Please see the supplement for videos.
[Figure 12 panels: Before blending; After blending.]
Figure 12. Left: Tiles before applying the 3D blending step. Right:
After the 3D blending step. We see that where boundaries between
tiles were obvious, they are now well-blended.
Latent upsampling. We sample 10 random views each
from 200 tiles generated by TRELLIS and compute per-
ceptual metrics in Tab. 3 when upsampling with our pro-
posed upsampling approach in Sec. 3.4 and a naive inter-
polation. We see that the proposed method leads to better
results across a range of metrics, even when using a single
conditioning frame.
3D blending. In Fig. 12, we generate a scene where we do
not apply the 3D blending (Sec. 3.4), resulting in disconti-
nuities between the tiles.
4.2. Qualitative Results
We present example scenes generated by our method
in Fig. 1. Further, we show detail views, highlighting the
quality and diversity of the scene. Please see the
supplementary material for many more examples.
Exploring a generated world. We can sample trajectories
exploring the generated 3D worlds (Fig. 11). A skybox has
been added for visual effect. Unlike the trajectories gen-
erated by world video models [38], ours are guaranteed to
be consistent and do not suffer from semantic drift. Dif-
ferent from other systems that only generate a bubble, our
method creates spaces sufficiently large to be navigated in a
non-trivial manner.
5. Conclusion
We have introduced SynCity, an approach to generate di-
verse, high-quality, and complex 3D worlds with fine-
grained control over their layout and appearance. SynCity
creates worlds by autoregressively generating tiles on a grid,
enabling scalability to arbitrary grid sizes. By accounting
for local context and by means of 3D inpainting, the tiles are
seamlessly stitched together into coherent scenes. SynCity
is flexible: it can either generate worlds from a brief ‘world’
text prompt or allow control of the individual tiles via tile-
specific instructions, all the while maintaining an overall
thematic consistency of the generated world. The rich detail
of the generated worlds can be fully explored, not restricted
to a single ‘3D bubble’ as in many prior works.
We have demonstrated the effectiveness of off-the-shelf
generation by utilizing pre-trained language, 2D, and 3D
generators through carefully designed prompting strategies
and without requiring retraining of any of these components.
Nevertheless, we expect that in cases where 3D scene-scale
data is available, fine-tuning some components would
result in further improved results and simplifications in the
alignment and rebasing steps of the pipeline. Future work
could also consider relaxing the tile structure, for example,
by randomly shifting and scaling tiles and using coarse-to-
fine modeling to ensure coherent global structure and fine-
grained local details.
Ethics. For details on ethics, data protection, and copy-
right, please see https://www.robots.ox.ac.uk/
~vedaldi/research/union/ethics.html .
Acknowledgments. The authors of this work are supported
by ERC 101001212-UNION, AIMS EP/S024050/1, and
Meta Research.
References
[1] AlimamaCreative. Flux-controlnet-inpainting. https:
/ / github . com / alimama - creative / FLUX -
Controlnet-Inpainting , 2024. GitHub repository. 7
[2] Miguel Ángel Bautista et al. GAUDI: A neural architect for
immersive 3D scene generation. arXiv.cs , abs/2207.13751,
2022. 3
[3] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel and
Arthur Gretton. Demystifying mmd gans. arXiv preprint
arXiv:1801.01401 , 2018. 7
[4] Ron Brinkmann. The art and science of digital compositing:
Techniques for visual effects, animation and motion graph-
ics. Morgan Kaufmann, 2008. 5
[5] Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad
Shahbazi, Anton Obukhov, Luc Van Gool and Gordon Wet-
zstein. DiffDreamer: Towards consistent unsupervised
single-view scene extrapolation with conditional diffusion
models. In Proc. ICCV , 2023. 3
[6] Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola and
Noah Snavely. Persistent nature: A generative model of un-
bounded 3d worlds. In Proceedings of the IEEE/CVF con-
ference on computer vision and pattern recognition , pages
20863–20874, 2023.
[7] Zhaoxi Chen, Guangcong Wang and Ziwei Liu. Scene-
dreamer: Unbounded 3d scene generation from 2d image
collections. IEEE transactions on pattern analysis and ma-
chine intelligence , 45(12):15562–15576, 2023. 3
[8] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin
Lee and Kyoung Mu Lee. Luciddreamer: Domain-free
generation of 3d gaussian splatting scenes. arXiv.cs ,
abs/2311.13384, 2023. 2, 3
[9] Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes
and Daniel Cohen-Or. Set-the-scene: Global-local training
for generating controllable nerf scenes. In Proc. ICCV Work-
shops , 2023. 3
[10] Deemos. Rodin text-to-3D gen-1 (0525) v0.5, 2024. 2
[11] Matt Deitke et al. Objaverse-XL: A universe of 10M+ 3D
objects. CoRR , abs/2307.05663, 2023. 4
[12] Paul Engstler, Andrea Vedaldi, Iro Laina and Christian Rup-
precht. Invisible stitch: Generating smooth 3D scenes with
depth inpainting. In Proceedings of the International Conference
on 3D Vision (3DV), 2025. 2, 3
[13] Rafail Fridman, Amit Abecasis, Yoni Kasten and Tali Dekel.
Scenescape: Text-driven consistent scene generation. CoRR ,
abs/2302.01133, 2023. 2, 3
[14] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur
Brussee, Ricardo Martin-Brualla, Pratul Srinivasan,
Jonathan T. Barron and Ben Poole. CAT3D: create anything
in 3d with multi-view diffusion models. arXiv , 2405.10314,
2024. 2
[15] Daniel Gatis. rembg, 2025. 5
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Bernhard Nessler and Sepp Hochreiter. GANs trained by
a two time-scale update rule converge to a local nash equi-
librium. In Proc. NeurIPS , 2017. 7
[17] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson
and Matthias Nießner. Text2Room: Extracting textured 3D
meshes from 2D text-to-image models. In Proc. ICCV , 2023.
2, 3
[18] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger and
Shenghua Gao. 2d gaussian splatting for geometrically ac-
curate radiance fields. In ACM SIGGRAPH 2024 conference
papers , pages 1–11, 2024. 2
[19] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler and
George Drettakis. 3D Gaussian Splatting for real-time radi-
ance field rendering. Proc. SIGGRAPH , 42(4), 2023. 2
[20] Black Forest Labs. Flux. https://github.com/
black-forest-labs/flux , 2023. 2
[21] World Labs. Generating worlds, 2024. 2
[22] Jiabao Lei, Jiapeng Tang and Kui Jia. RGBD2: generative
scene synthesis via incremental view inpainting using RGBD
diffusion models. In Proc. CVPR , 2023. 3
[23] Jiaxin Li, Zijian Feng, Qi She, Henghui Ding, Changhu
Wang and Gim Hee Lee. MINE: towards continuous depth
MPI with NeRF for novel view synthesis. In Proc. ICCV ,
2021. 2
[24] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun
Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg
Shakhnarovich and Sai Bi. Instant3D: Fast text-to-3D
with sparse-view generation and large reconstruction model.
Proc. ICLR , 2024. 2
[25] Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen,
Ping Tan and Xiaoxiao Long. CraftsMan: high-fidelity mesh
generation with 3d native generation and interactive geome-
try refiner. arXiv , 2405.14979, 2024. 2, 4
[26] Yangguang Li et al. TripoSG: high-fidelity 3D shape
synthesis using large-scale rectified flow models. arXiv ,
2502.06608, 2025. 2, 4
[27] Zhengqi Li, Qianqian Wang, Noah Snavely and Angjoo
Kanazawa. Infinitenature-zero: Learning perpetual view
generation of natural scenes from single images. In Eu-
ropean Conference on Computer Vision , pages 515–534.
Springer, 2022. 3
[28] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa,
Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler,
Ming-Yu Liu and Tsung-Yi Lin. Magic3D: High-resolution
text-to-3D content creation. arXiv.cs , abs/2211.10440, 2022.
4
[29] Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei
Chai, Aliaksandr Siarohin, Ming-Hsuan Yang and Sergey
Tulyakov. Infinicity: Infinite-scale city synthesis. In Pro-
ceedings of the IEEE/CVF international conference on com-
puter vision , pages 22808–22818, 2023. 3
[30] A Liu, R Tucker, V Jampani, A Makadia and N Snavely. . . .
Infinite nature: Perpetual view generation of natural scenes
from a single image. In Proc. ICCV , 2021. 2, 3
[31] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov,
Sergey Zakharov and Carl Vondrick. Zero-1-to-3:
Zero-shot one image to 3D object. In Proc. ICCV , 2023. 4
[32] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia
Neverova, Andrea Vedaldi, Oran Gafni and Filippos Kokki-
nos. IM-3D: Iterative multiview diffusion and reconstruction
for high-quality 3D generation. In Proceedings of the Inter-
national Conference on Machine Learning (ICML) , 2024. 2
[33] Quan Meng, Lei Li, Matthias Nießner and Angela Dai.
LT3SD: latent trees for 3D scene diffusion. arXiv ,
2409.08215, 2024. 2, 3
[34] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz Cayon,
Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng and
Abhishek Kar. Local light field fusion: practical view syn-
thesis with prescriptive sampling guidelines. Proc. SIG-
GRAPH , 38(4), 2019. 2
[35] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik,
Jonathan T. Barron, Ravi Ramamoorthi and Ren Ng. NeRF:
Representing scenes as neural radiance fields for view syn-
thesis. In Proc. ECCV , 2020. 2
[36] OpenAI et al. GPT-4 technical report. arXiv , 2303.08774,
2024. 4
[37] Hao Ouyang, Kathryn Heal, Stephen Lombardi and
Tiancheng Sun. Text2Immersion: Generative immersive
scene with 3D gaussians. arXiv.cs , abs/2312.09242, 2023.
2, 3
[38] Jack Parker-Holder et al. Genie 2: A large-scale foundation
world model, 2024. 8
[39] Ben Poole, Ajay Jain, Jonathan T. Barron and Ben Milden-
hall. DreamFusion: Text-to-3D using 2D diffusion. In Proc.
ICLR , 2023. 2
[40] Guocheng Qian et al. Magic123: One image to high-quality
3D object generation using both 2D and 3D diffusion priors.
arXiv.cs , abs/2306.17843, 2023. 4
[41] Amit Raj et al. DreamBooth3D: subject-driven text-to-3D
generation. In Proc. ICCV , 2023. 4
[42] Chris Rockwell, David F. Fouhey and Justin Johnson. Pixel-
Synth: Generating a 3D-consistent experience from a single
image. In Proc. ICCV , 2021. 2, 3
[43] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser and Björn Ommer. High-resolution image syn-
thesis with latent diffusion models. In Proc. CVPR , 2022.
2
[44] Kyle Sargent et al. ZeroNVS: Zero-shot 360-degree view
synthesis from a single real image. arXiv.cs , abs/2310.17994,
2023. 2, 3
[45] Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya
Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Seungryong
Kim and Yuki Mitsufuji. GenWarp: single image to novel views with
semantic-preserving generative warping.
arXiv , 2405.17251, 2024. 2
[46] Yu Shang, Yuming Lin, Yu Zheng, Hangyu Fan, Jingtao
Ding, Jie Feng, Jiansheng Chen, Li Tian and Yong Li. Urban-
world: An urban world model for 3d city generation. arXiv
preprint arXiv:2407.11965 , 2024. 3
[47] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li
and Xiao Yang. MVDream: Multi-view diffusion for 3D
generation. In Proc. ICLR , 2024. 2, 4
[48] Meng-Li Shih, Shih-Yang Su, Johannes Kopf and Jia-Bin
Huang. 3d photography using context-aware layered depth
inpainting. In Proc. CVPR , 2020. 2
[49] Jaidev Shriram, Alex Trevithick, Lingjie Liu and Ravi Ra-
mamoorthi. RealmDreamer: text-driven 3d scene generation
with inpainting and depth diffusion. In Proc. 3DV , 2025. 3
[50] Pratul P Srinivasan, Richard Tucker, Jonathan T Barron,
Ravi Ramamoorthi, Ren Ng and Noah Snavely. Pushing the
boundaries of view extrapolation with multiplane images. In
Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition , pages 175–184, 2019. 2
[51] Gabriela Ben Melech Stan et al. LDM3D: Latent diffusion
model for 3D. arXiv.cs , (2305.10853), 2023. 3
[52] Stanislaw Szymanowicz, Christian Rupprecht and Andrea
Vedaldi. Viewset diffusion: (0-)image-conditioned 3D gen-
erative models from 2D data. In Proceedings of the Interna-
tional Conference on Computer Vision (ICCV) , 2023. 2
[53] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang,
Gang Zeng and Ziwei Liu. LGM: Large multi-view Gaus-
sian model for high-resolution 3D content creation. arXiv ,
2402.05054, 2024. 4
[54] Richard Tucker and Noah Snavely. Single-view view synthe-
sis with multiplane images. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition ,
pages 551–560, 2020. 2
[55] Shubham Tulsiani, Richard Tucker and Noah Snavely.
Layer-structured 3d scene inference via view synthesis. In
Proc. ECCV , 2018. 2
[56] Guangcong Wang, Peng Wang, Zhaoxi Chen, Wenping
Wang, Chen Change Loy and Ziwei Liu. PERF: panoramic
neural radiance field from a single panorama. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 3
[57] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan
Li, Hang Su and Jun Zhu. Prolificdreamer: High-fidelity and
diverse text-to-3d generation with variational score distilla-
tion. Advances in Neural Information Processing Systems ,
36:8406–8441, 2023. 3
[58] Olivia Wiles, Georgia Gkioxari, Richard Szeliski and Justin
Johnson. Synsin: End-to-end view synthesis from a single
image. In Proc. CVPR , 2020. 2
[59] Zhennan Wu et al. BlockFusion: Expandable 3D scene gen-
eration using latent tri-plane extrapolation. arXiv.cs , 2024.
2, 3, 7
[60] Jianfeng Xiang, Jiaolong Yang, Binbin Huang and Xin
Tong. 3D-aware image generation using 2D diffusion models.
arXiv.cs, abs/2303.17905, 2023. 3
[61] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng
Wang, Bowen Zhang, Dong Chen, Xin Tong and Jiaolong
Yang. Structured 3D latents for scalable and versatile 3d
generation. arXiv , 2412.01506, 2024. 2, 4, 5
[62] Haozhe Xie, Zhaoxi Chen, Fangzhou Hong and Ziwei Liu.
Citydreamer: Compositional generative model of unbounded
3d cities. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition , pages 9666–9675,
2024. 3
[63] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen,
Ceyuan Yang, Sida Peng, Yujun Shen and Gordon Wetzstein.
GRM: Large gaussian reconstruction model for efficient 3D
reconstruction and generation. arXiv , 2403.14621, 2024. 2
[64] Hong-Xing Yu et al. Wonderjourney: Going from anywhere
to everywhere. arXiv.cs , abs/2312.03884, 2023. 2, 3
[65] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T
Freeman and Jiajun Wu. Wonderworld: Interactive 3d
scene generation from a single image. arXiv preprint
arXiv:2406.09394 , 2024. 3
[66] Biao Zhang, Jiapeng Tang, Matthias Niessner and Peter
Wonka. 3dshape2vecset: A 3d shape representation for neu-
ral fields and generative diffusion models. In ACM Transac-
tions on Graphics , 2023. 2
[67] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang and Jing
Liao. Text2NeRF: Text-driven 3D scene generation with
neural radiance fields. arXiv.cs , abs/2305.11588, 2023. 2,
3
[68] Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu,
Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu and Jingyi Yu.
Clay: A controllable large-scale generative model for creat-
ing high-quality 3d assets. ACM Transactions on Graphics
(TOG) , 43(4):1–20, 2024. 4
[69] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman
and Oliver Wang. The unreasonable effectiveness of deep
features as a perceptual metric. In Proc. CVPR , pages 586–
595, 2018. 7, 13
[70] Zibo Zhao et al. Hunyuan3d 2.0: Scaling diffusion mod-
els for high resolution textured 3d assets generation. arXiv
preprint arXiv:2501.12202 , 2025. 4
SynCity: Training-Free Generation of 3D Worlds
Supplementary Material
A. Appendix
A.1. Language Model Prompting Details
While the prompt p can be constructed manually, an LLM
may also be employed. For it to understand the task it is
asked to solve, we utilize the following prompt for ChatGPT
o3-mini-high:
Assume you had access to an AI model that can
generate small-scale cities on an isometric grid by
creating individual tiles. For each of these tiles
(identified by their 2D position), a short but ex-
pressive text prompt has to be provided. Addi-
tionally, a global prompt is used, which provides
context, lighting, time of day, as well as the art
style. The prompts of the tiles can be generic but
they might have a semantic connection to neigh-
bouring tiles (such that a river can flow through
the city on multiple tiles). The format for the in-
structions to the AI model is JSON. Consider the
following example: <TEMPLATE> The art style
and perspective mentioned in the prompt should
be maintained. The rest may be freely adapted.
There are no limits to the setting, the sky is your
limit. Now, please generate a 3×3 grid.
In this prompt, a template or a ‘seed’ prompt is provided.
We use a simple JSON file, as exemplified in Fig. 13.
{"tiles": [
    {"prompt": "ancient stone bridge over a stream", "x": 0, "y": 0},
    {"prompt": "lively stream past mossy banks", "x": 1, "y": 0},
    {"prompt": "serene pond reflecting moonlight", "x": 0, "y": 1},
    {"prompt": "bustling medieval market street", "x": 1, "y": 1}],
 "prompt": "{tile_prompt}, medieval setting, isometric view, glowing lanterns, soft shading, vibrant colors, detailed textures"
}
Figure 13. Example JSON file to describe each tile in a 2×2
world.
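As an illustration of how such a file might be consumed, the following Python sketch loads a world description and builds one full prompt per tile. It assumes that the `{tile_prompt}` placeholder in the global prompt is filled by plain string substitution, which the text does not state explicitly. The function name `expand_prompts` is hypothetical, and the global prompt is shortened for brevity.

```python
import json

# World description in the format of Fig. 13 (global prompt shortened).
WORLD_JSON = """
{"tiles": [
    {"prompt": "ancient stone bridge over a stream", "x": 0, "y": 0},
    {"prompt": "lively stream past mossy banks", "x": 1, "y": 0},
    {"prompt": "serene pond reflecting moonlight", "x": 0, "y": 1},
    {"prompt": "bustling medieval market street", "x": 1, "y": 1}],
 "prompt": "{tile_prompt}, medieval setting, isometric view"}
"""


def expand_prompts(world):
    """Map each tile position (x, y) to its full prompt.

    Assumption: the global prompt is a template whose "{tile_prompt}"
    placeholder is replaced by the tile-specific prompt.
    """
    template = world["prompt"]
    return {
        (tile["x"], tile["y"]): template.replace("{tile_prompt}", tile["prompt"])
        for tile in world["tiles"]
    }


if __name__ == "__main__":
    prompts = expand_prompts(json.loads(WORLD_JSON))
    for position in sorted(prompts):
        print(position, "->", prompts[position])
```

Using `str.replace` rather than `str.format` keeps the sketch robust if a prompt happens to contain other braces.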
A.2. 2D Prompting Details
In Sec. 3.2, we describe how tiles are generated in the
context of those that already exist. There is a special case
that we address separately, where context has to be bootstrapped,
namely tiles $\mathcal{L} := \{(x, y) \in \mathcal{T} : x = 0 \wedge y > 0\}$.
Due to our build order and the trimming of obstructing 3D
geometry, these tiles might lack sufficient contextual cues.
As a remedy, we temporarily provide context with a previously
generated tile: for a tile $(0, y) \in \mathcal{L}$, we duplicate
the tile $(0, y-1) \in \mathcal{T}$ and place the copy at position
$(-1, y)$. During inpainting, this tile serves to provide
context in terms of scale and general appearance. Once inpainting
is completed, this copy is removed.
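The bootstrapping rule above can be written down in a few lines. This is only a sketch of the index bookkeeping: the helper names are hypothetical, and the actual copying and removal of tile geometry is not shown.

```python
def needs_bootstrap(tile):
    """True for tiles in L = {(x, y) : x = 0 and y > 0}."""
    x, y = tile
    return x == 0 and y > 0


def temporary_context(tile):
    """Return (source, target) for the temporary context copy, or None.

    For a tile (0, y) in L, the already generated tile (0, y - 1) is
    copied to position (-1, y); the copy is removed after inpainting.
    """
    if not needs_bootstrap(tile):
        return None
    _, y = tile
    return (0, y - 1), (-1, y)


if __name__ == "__main__":
    for tile in [(0, 0), (0, 1), (0, 2), (1, 1)]:
        print(tile, temporary_context(tile))
```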
Figure 14. Tile geometry validation. To check the geometric
qualities of a reconstructed tile, we look at the occupancy grid
$V \in \{0,1\}^{R \times R \times R}$ generated by TRELLIS. Activated voxels are
indicated in orange (■). Left: The extent of an object
at height $w$ (slice visualized in 2D). Right: An example of a 3D
tile base template $V_B$.
A.3. 3D Geometry Validation Details
TRELLIS is a two-stage method and produces an occupancy
volume $V \in \{0,1\}^{R \times R \times R}$ in the first stage (before
the 3D Gaussian mixture is output), where $R = 64$ is the
resolution of the grid. To perform geometric validation, we
utilize this occupancy volume, which captures the rough 3D
geometry of the tile, and check that it conforms to the desired
geometry.
First, we test whether the reconstructed tile is supported
by a square by computing its 2D rectangular footprint and
ensuring that the latter is sufficiently large and isotropic.
To this end, let $(u, v, w)$ index the $R \times R \times R$ voxel grid,
where $u$ and $v$ correspond to world directions $x$ and $y$. Let
$(u_{\min}, u_{\max}, v_{\min}, v_{\max})$ be the bounding box containing all
the active voxels at height $w$.* Let $\operatorname{ext}_u = \max\{0, 1 + u_{\max} - u_{\min}\}$
be the width of the bounding box and $\operatorname{ext}_v$ its
height. We discard the tile if the area is too small, i.e., if
$\operatorname{ext}_u \cdot \operatorname{ext}_v < (R/2)^2$. We also discard it if it is not square,
i.e., if $\min\{\operatorname{ext}_u, \operatorname{ext}_v\} / \max\{\operatorname{ext}_u, \operatorname{ext}_v\} < \alpha = 1$.
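A minimal NumPy sketch of this footprint test is given below. It is not the authors' code: the function name `footprint_ok` is hypothetical, and the bounding box is taken over all active voxels regardless of height, following the footnote definition of $u_{\min}$. For $R = 64$, the area threshold $(R/2)^2$ equals 1024 voxels.

```python
import numpy as np


def footprint_ok(V, alpha=1.0):
    """Check that the (u, v) footprint of occupancy grid V is large and square.

    V is an R x R x R array indexed as V[u, v, w] with entries in {0, 1}.
    Assumption: the bounding box spans all active voxels, at any height w.
    """
    R = V.shape[0]
    active = np.argwhere(V > 0)          # rows of (u, v, w) indices
    if active.size == 0:
        return False                     # an empty tile has no footprint
    u_min, v_min, _ = active.min(axis=0)
    u_max, v_max, _ = active.max(axis=0)
    ext_u = max(0, 1 + int(u_max - u_min))
    ext_v = max(0, 1 + int(v_max - v_min))
    if ext_u * ext_v < (R / 2) ** 2:     # footprint area too small
        return False
    # With alpha = 1 the footprint has to be exactly square.
    return bool(min(ext_u, ext_v) / max(ext_u, ext_v) >= alpha)


if __name__ == "__main__":
    V = np.zeros((64, 64, 64), dtype=np.uint8)
    V[8:56, 8:56, 0:4] = 1               # 48 x 48 slab
    print(footprint_ok(V))
```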
Second, we check that the base that we have added in
2D in the ‘rebasing’ step has been faithfully reconstructed
in 3D.
*So for instance $u_{\min} = \min\{u : \exists v, w : V(u, v, w) = 1\}$.
We define a 3D tile base template, which we call $V_B$.
Let $u_{\min}(w)$ be the minimum $u$ of the bounding box that
contains the volume slice at height $w$, and define $u_{\max}(w)$
and so on in a similar manner, so for instance
$u_{\min}(w) = \min\{u : \exists v : V(u, v, w) = 1\}$. Let $w^*$ be the height at
which the base is the largest, i.e., $w^* = \operatorname{argmax}_w \operatorname{ext}_u(w) \cdot \operatorname{ext}_v(w)$.
Then, $V_B$ is the indicator function of the voxels
$(u, v, w)$ such that $w = w^*$ and
$$\max\left\{ \frac{|2u - u_{\max}(w^*) - u_{\min}(w^*)|}{u_{\max}(w^*) - u_{\min}(w^*)},\; \frac{|2v - v_{\max}(w^*) - v_{\min}(w^*)|}{v_{\max}(w^*) - v_{\min}(w^*)} \right\} = 1.$$
Note that the template $V_B$ is constructed adaptively to
match the input tile $V$.
We discard a generated tile if $(V \cdot V_B)/(V_B \cdot V_B) < \beta = 0.95$,
where $\cdot$ denotes the inner product of tensors.
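The base check can be sketched in NumPy as follows. The condition defining the template selects exactly the voxels on the border of the bounding rectangle at height $w^*$, so the sketch builds that border directly. The function names are hypothetical, and ties in the argmax are broken by taking the lowest such height, which the text does not specify.

```python
import numpy as np


def base_template(V):
    """Build V_B: the border of the bounding rectangle of the largest slice."""
    best_w, best_area, best_box = None, 0, None
    for w in range(V.shape[2]):
        active = np.argwhere(V[:, :, w] > 0)     # rows of (u, v) indices
        if active.size == 0:
            continue
        u_min, v_min = active.min(axis=0)
        u_max, v_max = active.max(axis=0)
        area = int(1 + u_max - u_min) * int(1 + v_max - v_min)
        if area > best_area:                     # first maximum wins ties
            best_w, best_area, best_box = w, area, (u_min, u_max, v_min, v_max)
    VB = np.zeros_like(V)
    if best_box is None:
        return VB
    u_min, u_max, v_min, v_max = best_box
    VB[u_min:u_max + 1, v_min:v_max + 1, best_w] = 1   # filled rectangle
    VB[u_min + 1:u_max, v_min + 1:v_max, best_w] = 0   # carve out the interior
    return VB


def base_ok(V, beta=0.95):
    """Keep the tile only if at least a fraction beta of the template is occupied."""
    VB = base_template(V)
    denom = float((VB * VB).sum())
    if denom == 0.0:
        return False
    return bool(float((V * VB).sum()) / denom >= beta)


if __name__ == "__main__":
    V = np.zeros((64, 64, 64), dtype=np.uint8)
    V[8:56, 8:56, 10:14] = 1      # square base slab
    V[20:40, 20:40, 14:40] = 1    # a building on top of the base
    print(base_ok(V))
```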
A.4. 3DGS Post-Processing Details
3D cropping, resizing and centering. Given the 3D Gaussian
mixture $G(x, y)$ initially output by the 3D generator,
we first identify the extent of the tile ‘proper’ (discounting
the extended base). We consider the $xy$ footprint of the tile
(i.e., we look at the tile from above) and seek to identify
four cuts (from the left, right, top, and bottom) that define an
axis-aligned rectangle strictly containing the tile. For example,
to determine the location of the left cut $x^*$, we consider
slices $V_x = \{(x', y', z') \in \mathbb{R}^3 : x - \delta \le x' < x + \delta\}$.
We find the 3D Gaussians whose centers fall within $V_x$
and compute their average color $c_x$. Then, we compute
the distance $d(x) = \|c_x - c_{x_{\min}}\|$, where $c_{x_{\min}}$ is the
average color of the leftmost slice (used as a reference). We set
$x^* = \min\{x : d(x) > \tau\}$, where $\tau$ is a threshold; this
corresponds to the first slice that transitions from the ‘background’
color to something else.
We find in this manner the four cuts, keep only the Gaus-
sians contained in the resulting rectangular footprint, and
recenter and resize this footprint to fill the standard tile size.
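The search for a single cut can be sketched as below for the left cut; the other three follow by symmetry. The sketch makes assumptions that the text leaves open: slice centers are sampled on a regular grid with step $2\delta$ starting at the leftmost Gaussian, empty slices are skipped, and the function name `find_left_cut` is hypothetical.

```python
import numpy as np


def find_left_cut(centers, colors, delta, tau):
    """Return the center x of the first slice whose mean color deviates
    from the leftmost slice by more than tau, or None if there is none.

    centers: (N, 3) Gaussian centers; colors: (N, 3) Gaussian colors.
    """
    xs = centers[:, 0]
    reference = None
    # Assumption: non-overlapping slices of width 2 * delta, swept left to right.
    for x in np.arange(xs.min() + delta, xs.max(), 2 * delta):
        in_slice = (xs >= x - delta) & (xs < x + delta)
        if not in_slice.any():
            continue
        mean_color = colors[in_slice].mean(axis=0)
        if reference is None:
            reference = mean_color       # color of the leftmost slice
            continue
        if np.linalg.norm(mean_color - reference) > tau:
            return float(x)
    return None


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(0.0, 1.0, size=(2000, 3))
    # Grey 'background' left of x = 0.3, red tile content to the right.
    colors = np.where(centers[:, :1] < 0.3, [0.5, 0.5, 0.5], [1.0, 0.0, 0.0])
    print(find_left_cut(centers, colors, delta=0.025, tau=0.1))
```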
Additionally, the base allows us to figure out the position
of the tile’s surface: As TRELLIS centers objects vertically,
the ground surface level of any two tiles will vary. We use
the average height of the tile’s four corners to determine the
position of the surface, allowing us to align it with others.
3D reorientation. We also note that TRELLIS generates
the 3D object with an arbitrary orientation with respect to
the input image $\tilde{I}(x, y)$. However, the tile must be inserted
with the correct orientation in the 3D world, as otherwise the
continuity between tiles, which the inpainting method of
Fig. 4 encourages, will be lost. In practice, the ambiguity
is limited to 90-degree rotations around the vertical axis
and is very easy to remove.* To do so, we test four possible
90-degree rotations of the tile around the vertical axis,
rendering the corresponding views and comparing them to
$\tilde{I}(x, y)$ using the LPIPS [69] loss. The minimizer is taken
as the correct orientation.
*This is likely due to the implicit bias in the TRELLIS training set that
consists of synthetic 3D objects which are almost invariably axis aligned.
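The rotation test amounts to an argmin over four candidates, as the following sketch shows. It is only an illustration: the renderer is passed in as a callable, a top-down image rotated with `np.rot90` stands in for rendering the tile, and mean squared error stands in for the LPIPS [69] loss used in the paper. All function names are hypothetical.

```python
import numpy as np


def best_rotation(render, reference, distance):
    """Return k in {0, 1, 2, 3} minimising distance(render(k), reference),
    where render(k) is the view of the tile rotated by k * 90 degrees."""
    scores = [distance(render(k), reference) for k in range(4)]
    return int(np.argmin(scores))


def mse(a, b):
    """Stand-in image distance (assumption); the paper uses LPIPS instead."""
    return float(np.mean((a - b) ** 2))


if __name__ == "__main__":
    tile_view = np.arange(16, dtype=float).reshape(4, 4)   # toy top-down view
    reference = np.rot90(tile_view, 3)                     # toy conditioning image

    def render(k):
        # Toy renderer: rotate the top-down view by k * 90 degrees.
        return np.rot90(tile_view, k)

    print(best_rotation(render, reference, mse))
```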
A.5. Ablation Details
Building a Grid. In the following, we present results from
our experiments attempting to generate a large scene non-iteratively.
Here, we generate a single image with Flux that
is used as conditioning for TRELLIS to directly create the
desired scene.
Figure 15. Non-iterative city building. We obtain conditioning
images generated by Flux (left) and directly use them to build a
large-scale scene with TRELLIS (center). While the generated 3D
structures are visually appealing, the level of detail (right) is very
limited. The first row uses generic prompting for the conditioning
image (“a city scene on top of a base”), whereas the second row
uses a more involved prompt with an explicit layout (e.g., “a house
in the bottom left corner”, “a pharmacy in the top right corner”).
In the first set of experiments, we do not use our 2D
prompting design. To obtain an isolated 3D object that can
be generated by TRELLIS, we use prompts with the prefix
“a 3d object of”. We show those results in Fig. 15. While
the generated objects are visually appealing, they have sev-
eral limitations: (i) The resolution of the conditioning im-
age and the 3D structures TRELLIS can generate is limited.
Therefore, this approach is not scalable to arbitrarily large
scenes. (ii) Due to the lack of perfect control over the base
structure, the result cannot be easily extended or edited. (iii)
The layout instructions are mostly ignored, thus severely
limiting the level of control over the generation.
For the second set of experiments, we use our 2D
prompting design along with the Flux ControlNet for in-
painting (Fig. 16). However, with this setup, the quality of
the results is not improved. The layout instructions in the
prompt are mostly ignored, again.
Querying Flux to generate large-scale scenes directly has
not been successful in our experiments, prompting the need
for our grid-based method that allows fine-grained layout
and appearance control for each tile.
Figure 16. Non-iterative city building (with our 2D prompt-
ing). We obtain conditioning images generated by Flux (left) and
directly use them to build a large-scale scene with TRELLIS (center).
Despite the initial visual appeal, the structures lack detail.
The first row uses generic prompting for the conditioning image
(“a vibrant city scene”), whereas the second row uses a more in-
volved prompt with an explicit layout (e.g., “a house in the bottom
left corner”, “a pharmacy in the top right corner”).
A.6. Additional Qualitative Results
In Figs. 17 to 20, we show additional results of our method.
As we leverage a pre-trained 2D image generator trained
on a very large dataset, we are able to generate highly di-
verse scenes. Thanks to our fine-grained control at the tile
level, we can generate interesting patterns, such as a transi-
tion between seasons across a grid (observe the largest grid
in Fig. 20).
A.7. Limitations
While our method allows creating large and diverse scenes,
there are some limitations to be addressed in future work.
Atomic tiles. Although we inpaint tiles conditioned on
their surroundings, they remain individual units. While
structures can be created that span across multiple tiles, this
requires harmonious cooperation between Flux and TREL-
LIS.
Use of heuristics. To determine the ground surface height
for each tile and to remove the base we added during rebasing,
we employ heuristics. We have designed these carefully
with fallback mechanisms, but they are not infallible.
Inherited limitations. As our method builds on top of Flux
and TRELLIS, their limitations also apply to ours. During
our experiments, we have observed that, despite good inpainting
results, TRELLIS at times only vaguely adheres
to the conditioning image in terms of appearance, in particular
color. Thus, transitions between tiles might not look
perfectly smooth (even if they were generated that way in
the inpainting result).
Figure 17. Exploring a 3D world. We show trajectories exploring the 3D worlds we generate. A sky box has been added for visual effect.
Figure 18. Generated scenes. We show scenes generated with the same prompts, but different seeds in 2D inpainting.
Figure 19. Generated scenes.
Figure 20. Generated scenes. Our method can easily generate large scenes. Further, interesting patterns can be injected thanks to fine-
grained control over each tile. Top: The scene transitions in season, from winter to spring to summer to autumn. Bottom: The scenery
transitions from a city-like to a rural environment.