arxiv

Paper 2503.16420

SynCity: Training-Free Generation of 3D Worlds

Authors: Paul Engstler, Aleksandar Shtedritski, Iro Laina, Christian Rupprecht, Andrea Vedaldi

Published: 2025-03-20

Abstract:

We address the challenge of generating 3D worlds from textual descriptions. We propose SynCity, a training- and optimization-free approach, which leverages the geometric precision of pre-trained 3D generative models and the artistic versatility of 2D image generators to create large, high-quality 3D spaces. While most 3D generative models are object-centric and cannot generate large-scale worlds, we show how 3D and 2D generators can be combined to generate ever-expanding scenes. Through a tile-based approach, we allow fine-grained control over the layout and the appearance of scenes. The world is generated tile-by-tile, and each new tile is generated within its world-context and then fused with the scene. SynCity generates compelling and immersive scenes that are rich in detail and diversity.

Paper Content: on Alphaxiv
Paul Engstler*, Aleksandar Shtedritski*, Iro Laina, Christian Rupprecht, Andrea Vedaldi
Visual Geometry Group, University of Oxford
{paule,suny,iro,chrisr,vedaldi}@robots.ox.ac.uk
*Equal contribution
arXiv:2503.16420v1 [cs.CV] 20 Mar 2025

Figure 1. We introduce SynCity, a novel method that can generate from a prompt complex and immersive 3D worlds that can be navigated freely. Our method is training-free and leverages powerful language, 2D and 3D generators via novel prompt engineering strategies.

1. Introduction

We consider the problem of generating 3D worlds from textual descriptions. Generating 3D content, for example, for video games, virtual reality, special effects and simulation, is highly laborious and time-consuming. When it comes to generating entire 3D scenes, much of the content is not of particular artistic value, and its manual creation, which is still necessary, may be seen as a waste of human resources, talent and creativity. Generative models can help to reduce or even remove this burden by largely automating many of these mundane tasks.

The advent of modern generative AI has significantly impacted 3D content generation and promises to reduce the cost of its production dramatically. DreamFusion [39] was among the first to co-opt state-of-the-art diffusion-based 2D image generators [43] to create 3D objects. The area has since matured significantly by fine-tuning 2D image generators to produce multiple consistent views of an object [14, 32, 47, 52] and by learning few-view 3D reconstruction networks [24, 63]. More recently, the focus has shifted to methods that learn a 3D latent space [10, 25, 26, 61, 66], which can then be sampled to generate 3D objects. Because the latent space directly encodes the 3D structure of the object rather than its 2D appearance, these methods can generate much more accurate and regular geometry.

Despite their advantages, 3D generative methods have so far been mostly limited to generating individual objects. However, the most promising use of 3D generative AI is the construction of entire virtual worlds, as this is where automation can make by far the most difference. There is ample literature on generating 3D scenes from textual or image prompts. Most such methods are image-based, and progressively reconstruct larger scene regions by expanding from an initial image [8, 12, 13, 17, 30, 37, 42, 64, 67], combining depth prediction, image and depth outpainting, and 3D reconstruction using NeRF [35] or 3D Gaussian Splatting [19]. The main advantage of these approaches is that they can leverage powerful 2D image generator models to create the first and subsequent views of the scene. These 2D generators allow the overall system to understand complex textual prompts and generate corresponding 3D scenes with good artistic quality. However, it is difficult for these approaches to maintain a coherent 3D structure across a large scene.
For example, while the reconstructed scene may envelop the observer in a 360° manner, it is generally not possible to ‘walk’ into the scene for more than a few steps. This is the case even for state-of-the-art implementations like the one of World Labs [21], a company specialized in SpatialAI. A challenge with extending scenes beyond these ‘3D bubbles’ is that it is difficult for image-based methods to maintain consistency incrementally without drifting.

We argue that 3D generative models might be preferable as they can regularize and constrain the reconstructed geometry, including hallucinating shape and textures in regions behind the visible sides of objects. This is clearly shown in prior works like BlockFusion [59] and LT3SD [33], where large coherent spaces can be generated. However, these methods are limited in the quality and diversity of the generated scenes as it is difficult to train 3D generative models directly for scene generation. In particular, unlike their view-based peers, these methods do not build on an image generator. They thus cannot benefit from the artistic quality and ability to interpret complex textual prompts that come from pre-training 2D generators on billions of images.

In this work, we thus seek to build on 3D generative models while still building on the strengths of 2D image generators to generate large, high-quality 3D spaces that can be navigated freely (Fig. 1). First, we note that while 3D models like TRELLIS [61] are trained for object-level reconstruction, they can reconstruct fairly complex local compositions of multiple objects. Borrowing ideas from video game world building, we show in particular that TRELLIS can effectively generate, if not an entire world, at least a tile representing a local region of the world. We show, in particular, how to prompt the model with an ‘isometric’ view of the tile and then generate the tile in 3D.

Given this basic capability, we then look at the problem of generating a large scene by generating and stitching together multiple tiles. We build on a text-to-image generator (Flux [20]) and propose a novel way of prompting that stabilizes it to consistently produce tiles with a similar isometric framing. In this manner, we encourage the tiles’ framing to be stable and compatible between different tiles, making it easier to stitch them together.

To ensure that tiles fit together appropriately, we propose two mechanisms. First, we encourage consistency in appearance by using previously generated tiles to draw context for the image generator, where each new tile inpaints a missing region in a 2D isometric view of the scene. Secondly, we enforce geometric consistency by blending the 3D representations of neighbouring tiles using the 3D generative model.

2. Related Work

Novel view synthesis for scenes. Expanding an image beyond its boundaries has been a long-standing task in computer vision. Early methods that sought to expand object-centric scenes rely on layer-structured representations [23, 34, 48, 50, 54, 55], which disregard the scene’s true geometry. SynSin [58] is a pivotal work, where image features are projected and used as conditioning to generate novel views, achieving geometric and semantic consistency. ZeroNVS [44] introduces high-quality results with fine-grained control of the camera but remains object-centric. GenWarp [45] integrates semantic information through cross-attention when generating a novel view. The major challenges for these methods remain semantic drift and object permanence. To obtain an explicit 3D representation, the generated views need to be transferred into such a representation, e.g., NeRF [35] or Gaussians [18, 19], where any geometric conflicts would need to be resolved.

Figure 2. Overview of SynCity. 2D prompting: To generate a new tile, we first render a view of where that tile should be placed, including context from neighbouring tiles. 3D prompting: We extract the new tile image and construct an image prompt for TRELLIS by adding a wider base under the tile. 3D blending: The 3D model that TRELLIS outputs is usually not well blended with the rest of the scene. To address that, we render a view of the new tile next to each neighbouring tile, and inpaint the region between the two with an image inpainting model. Next, we condition using that well-blended view to refine the region between the two 3D tiles. Finally, the new, blended, tile is added to the world.

Image projection-based scene generation. A different line of work follows the paradigm of building the 3D representation of a scene sequentially using 2D image generation models [8, 13, 17, 22, 37, 42, 57, 60, 64, 67]. Most of them employ an image generation model to outpaint the existing scene using pre-defined camera poses. The results are then fused in 3D with depth prediction models. Text2Room [17] generates meshes of indoor scenes. As the scene is clearly delimited by the bounds of the mesh, it can be freely explored. LucidDreamer [8] and Text2Immersion [37] go beyond indoor scenes, but their generated scenes reveal geometric inconsistencies when stepping away from the camera poses used to generate the scene. Invisible Stitch [12] addresses this issue by inpainting depth (rather than naively aligning it) and RealmDreamer [49] proposes multiple optimization losses to refine the generated scene. Despite these improvements, the resulting scenes still suffer from geometric artifacts and remain rather small.
WonderJourney [64] introduces novel ideas for depth fusion, such as grouping objects at similar disparity to planes and sky depth refinement, enabling large ‘scene journeys’, where independent representations are built between scene ‘keyframes’, but these are not merged into one coherent scene. WonderWorld [65] leverages these improvements to build a single scene, allowing interactive updates, but the true extent of the generated scenes remains limited. Other works use panoramas [51, 56] or implicit representations [2, 44] but the freedom of movement remains constricted.

Procedural scene generation. Further methods permit long-range fly-overs over nature [5–7, 27, 30] or cities [29, 46, 62]. These usually generate procedural unbounded images (e.g., the terrain make-up or a city layout). While those methods create realistic-looking images, they are often monotonous as the methods are domain-specific and thus highly constrained in the variety they can generate.

3D scene generation. Instead of merely generating images of a scene or outpainting it only in 2D, further methods generate the representation directly. Set-the-Scene [9] adds a layer of control to the layout of NeRF scenes by defining object proxies. BlockFusion [59] learns a network to autoregressively diffuse small blocks to extend a mesh. A 2D layout conditioning is used to control the generation process, allowing users to generate scenes of rooms, a village, and a city. While the method allows building large-scale scenes, the variety of the objects it generates is severely limited as it requires domain-specific 3D training data. Furthermore, it generates untextured meshes. LT3SD [33] learns a diffusion model that generates 3D environments in a patch-by-patch and coarse-to-fine fashion. However, this method is only trained to produce indoor scenes.

At the same time, the synthesis of complex, high-fidelity objects has been enabled by the rapid progress in the fields of text-to-3D and image-to-3D generation [25, 26, 28, 31, 40, 41, 47, 53, 61, 68, 70]. Trained on large-scale curated subsets of 3D datasets such as Objaverse-XL [11], these models can generate a large variety of different objects. However, to the best of our knowledge, no prior work has leveraged object generators for scene generation.

3. Method

Our goal is to generate a 3D world $G$ from an initial textual prompt $p_0$. Our main result is to show how prompt engineering can be used in combination with off-the-shelf language, 2D and 3D generators to create the entire world automatically, with no need to retrain the models.

We structure the world as a $W \times H$ grid $\mathcal{T} = \{0, \dots, W-1\} \times \{0, \dots, H-1\}$ of square tiles, each of which can contain several complex 3D objects (e.g., a building, a bridge, trees, etc.) as well as the ground surface. We generate the world progressively, tile by tile, as shown in Fig. 3. Hence, when tile $(x, y) \in \mathcal{T}$ is generated, the tiles

$$\mathcal{T}(x, y) = \{(x', y') \in \mathcal{T} : y' < y \lor (y' = y \land x' < x)\}$$

have already been generated.

Figure 3. Left: Progressive generation of world tiles $\mathcal{T}$. Right: Isometric framing of a tile for image-based prompting.

An overview of our approach is shown in Fig. 2. The first step of our method is to expand the world description $p_0$ into tile-specific prompts (Sec. 3.1). The second step is to pass these tile-specific prompts to a 2D image generator and inpainter to create an isometric view of each tile, accounting for the part of the world generated so far (Sec. 3.2). The third step is to extract image prompts from these isometric views and use them as input to an image-to-3D generator to reconstruct each tile’s geometry and appearance in 3D (Sec. 3.3). The final step is to align and blend the 3D reconstructions of the tiles to create a coherent 3D world (Sec. 3.4).

3.1. Prompting the Language Model

The goal of language prompting is to take a high-level textual description of the world $p_0$ and expand it into a set $p$ of tile-specific textual prompts that can be used to generate the 3D world. Specifically, $p$ is a collection of sub-prompts $p_{xy} \in \Sigma^*$, one for each tile, and a world-level ‘style’ prompt $p_\star \in \Sigma^*$, so that we can write $p = \{p_{xy}\}_{(x,y) \in \mathcal{T}} \cup \{p_\star\}$, where $\Sigma^*$ is the set of all possible strings.

The prompt $p$ could be constructed manually (which allows controlling the content of each tile) or generated by a large language model (LLM) such as ChatGPT [36] from a ‘seed’ prompt. For the latter, we prompt ChatGPT o3-mini-high to generate a grid-like world with tile-specific descriptions, providing it with an example JSON file (see Appendix A.1 for details).

3.2. Prompting the 2D Generator

We use the language prompts $p$ from Sec. 3.1 to prompt an off-the-shelf 2D image generator $\Phi_{2D}$ to output a 2D image $I(x, y)$ of each tile to be generated, as shown for example in Fig. 4. The image $I(x, y)$ must satisfy several constraints:
(1) It must reflect the tile-specific instructions $p_{xy}$ of the target tile as well as the world-level instructions $p_\star$.
(2) It must be suitable for prompting the image-to-3D generator in the next step.
(3) It must be consistent with the previously generated tiles.

Our prompting strategy is designed to satisfy these constraints. The image is drawn as a sample $I(x, y) \sim \Phi_{2D}(q, B, M)$ from the 2D image generator $\Phi_{2D}$, where $q = p_{xy} \cdot p_\star$ is a prompt that combines the tile-specific and world-level descriptions. The generator $\Phi_{2D}$ is also provided with a base image $B$ and an inpainting mask $M$ that constrain the output. We assume that $\Phi_{2D}$ is capable of inpainting, a common feature of modern image generators.

Tile inpainting. To satisfy constraint (2), we need to encourage the image generator to generate regular tiles so that the image-to-3D model can output tiles with regular geometry that fit well together.
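To make the grid notation concrete, the following is a minimal Python sketch of the raster generation order $\mathcal{T}(x, y)$ from Sec. 3 and of the prompt set $p$ from Sec. 3.1. The function names, the dictionary layout, the example prompt strings and the use of plain string concatenation for $q = p_{xy} \cdot p_\star$ are illustrative assumptions, not the authors' code.

```python
def generated_before(x, y, width, height):
    """Return T(x, y): the tiles already generated when tile (x, y) is produced.

    Tiles are visited row by row, so a tile (xp, yp) precedes (x, y) when it
    lies in an earlier row, or in the same row further to the left.
    """
    return [
        (xp, yp)
        for yp in range(height)
        for xp in range(width)
        if yp < y or (yp == y and xp < x)
    ]


def tile_query(tile_prompts, style_prompt, x, y):
    """Build q = p_xy . p_star (assumed here to be simple concatenation)."""
    return f"{tile_prompts[(x, y)]} {style_prompt}"


# Hypothetical 3x2 world; the prompt texts are made up for illustration.
tile_prompts = {
    (x, y): f"tile {x},{y} description" for y in range(2) for x in range(3)
}
style_prompt = "isometric view, consistent lighting"

order_before_1_1 = generated_before(1, 1, width=3, height=2)
query_0_0 = tile_query(tile_prompts, style_prompt, 0, 0)
```

In this sketch, the first tile (0, 0) has an empty context, which matches the special treatment of the first tile in the 2D prompting step.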
We assume that tiles have a fixed square basis of unit size and that they are imaged in an ‘isometric’ manner. This framing of the tiles is conducive to the generation of regular 3D tiles. Furthermore, it is a common choice in video games and might have been observed by the image generator during training as these are often trained on game-like data.

Hence, our goal is to condition the image generator $\Phi_{2D}$ to produce images of this kind. While one possible approach is to fine-tune the generator on such images, we demonstrate that it is possible to obtain this effect through prompt engineering alone, avoiding any retraining. We achieve this by carefully constructing the inputs $B$ and $M$ as shown in Fig. 4. Specifically, we set $B$ to be the image of the base of the tile, as a square, grey slab imaged from a fixed isometric vantage point. The mask $M$ is a binary mask covering a cube on top of the base.

Figure 4 shows the result of prompting the model in this manner as well as what happens if signals $B$ and $M$ are removed: the viewpoint and general frame of the tile are random and not suitable for 3D generation.

Figure 4. Left: Generation of the 2D image prompt for the first world tile at $x = 0$ and $y = 0$. The image generator $\Phi_{2D}$ is conditioned on $q = p_{00} \cdot p_\star$ and tasked with inpainting the base image $B$ in the masked region $M$. Right: If we do not ‘frame’ the image by using $B$ and $M$, the generator produces an image which is not suitable for tiling.

Taking the context into account. Except for the first tile $(0, 0)$, the tiles are generated in the context of the world already generated. In order to account for this context, for tiles with $x, y > 0$, we modify the base image $B$ and the mask $M$ as shown in Fig. 5. For the base image $B$, we render the part of the 3D world generated so far, providing context for the inpainting network.
We also modify the mask $M$ to avoid covering tiles already generated to the left (‘west’; i.e., for a tile $(i, j) \in \mathcal{T}$ these are tiles $\{(x, y) \in \mathcal{T} : x < i \land y = j\}$).

Figure 5. Left: Base image $B$ and inpainting mask $M$ (white overlay) to prompt the image generator $\Phi_{2D}$ to generate an image for a world tile with $x > 0$, $y > 0$. Right: Result of inpainting.

Because we wish to ensure continuity of the ground, before rendering this contextual image, we trim any 3D geometry that is sufficiently high to occlude the tile we wish to generate, as shown in Fig. 6 (see the result in Fig. 5).

Figure 6. Trimming tall structures for 2D prompting.

The appendix discusses a special case for tiles at the boundaries of the world (see Appendix A.2).

3.3. Prompting the 3D Generator

Given the tile image $I(x, y)$ obtained from the 2D image generator in Sec. 3.2, the next goal is to generate a corresponding 3D reconstruction $G(x, y)$ of the tile utilizing an image-to-3D model $\Phi_{3D}$. We opt for using a robust 3D generator and select TRELLIS [61] due to its good performance, ability to generate both shape and texture, and latent space structure, which will be easy to manipulate for blending as we show later in Sec. 3.4.

Hence, 3D reconstruction amounts to drawing a sample $G(x, y) \sim \Phi_{3D}(J(x, y))$ from the image-to-3D generator $\Phi_{3D}$. Rather than conditioning on the image $I(x, y)$, we use a pre-processed version $J(x, y)$, as described next.

2D foreground extraction and rebasing. Recall that the image $I(x, y)$ output by the 2D generator of Sec. 3.2 is an image of the tile and its context. However, the 3D generator $\Phi_{3D}$ expects the input image to only show the object that needs to be reconstructed, i.e., the new tile. The first step is thus extracting only that part from $I(x, y)$ that corresponds to the new tile, which we do by applying the inpainting mask and then running rembg [15] with alpha matting [4] to remove the background, as shown in Fig. 7.

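The two context rules of Sec. 3.2 (leaving tiles to the west unmasked, and trimming tall geometry before rendering the context image) can be illustrated with a small sketch. The representation of geometry as a list of `(x, y, z)` points, the choice of `z` as the vertical axis and the `max_height` threshold are assumptions made purely for illustration; the text does not state the threshold or the data layout actually used.

```python
def west_tiles(i, j):
    """Tiles {(x, y) : x < i and y == j}, i.e. those to the west of tile (i, j).

    These tiles are already generated, so the inpainting mask should not
    cover them.
    """
    return [(x, j) for x in range(i)]


def trim_tall_geometry(points, max_height):
    """Drop points above max_height so they cannot occlude the new tile.

    `points` is a list of (x, y, z) tuples with z pointing up (an assumption);
    `max_height` is a hypothetical threshold.
    """
    return [p for p in points if p[2] <= max_height]


# Toy context: one low ground point and one tall tower point.
context_points = [(0.2, 0.3, 0.05), (0.5, 0.5, 1.8)]
visible_context = trim_tall_geometry(context_points, max_height=0.5)
unmasked = west_tiles(2, 1)
```

Trimming only affects the rendered context image used for prompting in this sketch; the stored world geometry is left untouched.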
Figure 7. Left: Isolating the image of the new tile from $I(x, y)$. Right: Placing a slightly larger base underneath.

The resulting image is narrowly cropped around the new tile. Similar to Sec. 3.2, we found it beneficial to hallucinate a base for the tile, an operation that we call ‘rebasing’, as shown in Fig. 7. We simply compose the image of the tile with a slightly larger gray slab (in 2D) to obtain $J(x, y)$, which in effect provides a ‘frame’ for the 3D generator to work with. The base is reconstructed as part of the tile’s geometry, which can be used for validation and as a simple-to-detect handle for further 3D processing.

The ‘rebased’ image $J(x, y)$ is fed to the 3D generator $\Phi_{3D}$ to obtain the 3D reconstruction $G(x, y)$ of the tile, which is a set of 3D Gaussian Splats (3DGS). The effect of rebasing on the 3D result is shown in Fig. 8.

Figure 8. Columns: TRELLIS input (2D); TRELLIS output (3D), front view and back view. Top: 3D reconstruction using a tight base. Bottom: the same, but with a slightly larger base, which helps to contain the tile’s geometry above ground (see the back of the reconstruction), and creates an easy-to-detect 3D base.

3D geometric validation. Because the generators are imperfect, we verify the 3D reconstruction $G(x, y)$ to ensure that it is of sufficient quality. If not, we discard it and generate the tile again using a different random seed. To verify the tile, we use a few heuristics that check that the tile’s geometry occupies a square region of sufficient size and that the base of the tile has been reconstructed faithfully. Please see Appendix A.3.

3D post-processing. At this point, we have verified the 3D reconstruction $G(x, y)$ for the tile as a mixture of 3D Gaussians. However, the actual 3D footprint, orientation, and size of the tile are controlled by the 3D generator and are inconsistent.
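The validate-and-regenerate behaviour of the geometric validation step can be sketched as a simple retry loop. The callables `generate` and `is_valid`, the sequential seed schedule and the `max_attempts` cap are hypothetical: the text only says that a failed tile is discarded and regenerated with a different random seed, and defers the actual heuristics to Appendix A.3.

```python
def generate_valid_tile(generate, is_valid, max_attempts=5, first_seed=0):
    """Regenerate a tile with new seeds until it passes validation.

    generate(seed) -> tile      stands in for the 2D + 3D generation of a tile
    is_valid(tile) -> bool      stands in for the geometric heuristics
    Returns the accepted tile and the seed that produced it.
    """
    for attempt in range(max_attempts):
        seed = first_seed + attempt  # a different seed on every attempt
        tile = generate(seed)
        if is_valid(tile):
            return tile, seed
    raise RuntimeError(f"no valid tile after {max_attempts} attempts")


# Toy stand-ins: a "tile" is just its seed, and seeds below 2 are rejected.
accepted_tile, accepted_seed = generate_valid_tile(
    generate=lambda seed: {"seed": seed},
    is_valid=lambda tile: tile["seed"] >= 2,
)
```

Capping the number of attempts is a design choice of this sketch, so that a systematically failing prompt raises an error instead of looping forever.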
The post-processing step applies simple heuristics to refine the 3D Gaussian representation by first cropping out the added base, then rescaling the tile to a unit size, and finally reorienting it to match the 2D image prompt. We explain this in more detail in Appendix A.4.

3.4. 3D Blending

At this point of the pipeline, we have reconstructed all 3D tiles $G(x, y)$, $(x, y) \in \mathcal{T}$. As a result of the prompting and post-processing steps in Secs. 3.2 and 3.3, the tiles are already approximately aligned and correctly oriented, with their ground level at roughly the same height in 3D space. Because the 3D reconstructions are 3D Gaussian Splats, it is easy to simply take their union as the reconstruction of the whole world.

Even so, the boundaries of the tiles may not match perfectly. This is largely due to the fact that TRELLIS does not reconstruct the input images exactly and to the fact that only a single view of each tile is provided to it, which only indirectly controls the reconstruction of the back of the tile. In this section, we thus propose a method to improve the blending of tiles, ensuring that the 3D world is coherent and continuous.

In particular, we regenerate the boundary region of two adjacent tiles in the latent space of TRELLIS, in essence blending it, and then decode it into 3DGS. Next, we discuss the specifics of this process.

Figure 9. Upsampling sparse latents (panels: original tile, naïve upsampling, proposed upsampling). We need to resize or upsample sparse latents in order to stitch them. Due to the sparsity of the latents and the behaviour of the latent decoder, naively resampling in latent space leads to artifacts. Our proposed resizing of the sparse latents better preserves textures and fine structures.

Blending in 2D. To blend the latents of two neighbouring tiles, we first predict the appearance of the boundary between the two tiles.
To that end, we place the two 3D tiles next to each other, render a frontal view, and inpaint the middle region of the rendering (Fig. 2) with $\Phi_{2D}$. This leads to a well-blended image, which we use to condition $\Phi_{3D}$.

Blending in 3D. Next, we use $\Phi_{3D}$ to blend the latents. We take the latents of the two tiles, $\gamma^1$ and $\gamma^2$, where $\gamma^1, \gamma^2 \in \mathbb{R}^{D \times R \times R \times R}$ are $D$-dimensional features in the $R$-sized 3D grid that TRELLIS denoises. We put them together in a new volume $\gamma$, where the side where they meet is in the middle:

$$\gamma_{:,x,y,z} = \begin{cases} \gamma^1_{:,x+R/2,y,z}, & \text{if } x < R/2 \\ \gamma^2_{:,x-R/2,y,z}, & \text{if } x \ge R/2. \end{cases}$$

We apply the denoising function $\Omega$, which is the latent denoiser of $\Phi_{3D}$, to the volume $\gamma$, but only within the middle region where we have applied the stitch, i.e., for $x \in [R/2 - r, R/2 + r]$ for some $r < R/2$, while keeping the rest fixed. Formally, we initialise $\tilde{\gamma} \sim \mathcal{N}(0, I)$ and at each denoising step $t$, we update $\tilde{\gamma}$ as:

$$\tilde{\gamma}_{t+1,:,x,y,z} = \begin{cases} \Omega(\tilde{\gamma}_{t,:,x,y,z}), & \text{if } |x - R/2| \le r \\ \gamma_{t+1,:,x,y,z}, & \text{otherwise,} \end{cases}$$

where $\gamma_t$ is obtained by adding noise to the original $\gamma$ at the corresponding noise level for step $t$. In practice, we only denoise the second stage of TRELLIS, keeping the occupancy latents of the first stage fixed. The reason for that is that the first stage is at a very low spatial resolution ($R = 16$, compared to $R = 64$ of the second stage), which gives little flexibility for the size of the denoised region $r$.

Table 1. Win rates of our method against BlockFusion. We asked participants to select which scene they prefer overall, as well as which one has better geometry, would be more interesting to explore, is more diverse, and has better realism.

                Overall   Geometry   Exploration   Diversity   Realism
Win Rate (%)    90.9      81.8       90.9          90.9        86.4

Upsampling the latents. Remember that due to the rebasing, $G(x, y)$ contains a 3D base. While we have removed the base in 3DGS space, we have yet to do the same in the latent space. We use the same cuts we applied in 3DGS space to now crop the latents, rounding the cuts to account
for the discrete nature of the latent voxel grid. In the previous step, however, we assumed that the latents $\gamma^1$ and $\gamma^2$ have the same spatial resolution, $\gamma \in \mathbb{R}^{D \times R \times R \times R}$. After cropping, this is not the case any more if the cuts of neighbouring tiles differ. Thus, similarly to how we resize the tiles’ 3DGS to a unit size, we have to upsample the now-cropped latents back to the original grid resolution $R$.

We found that naively upsampling the latents by interpolation leads to poor reconstructions, as shown in Fig. 9. We attribute this to the sparse structure of the latents and quirks of the latent decoder of TRELLIS.

We propose the following upsampling scheme. First, we upsample the cropped occupancy volume that TRELLIS predicted to the original resolution $V \in \{0, 1\}^{R \times R \times R}$. Next, we denoise a new set of latents $\gamma$ on the upsampled occupancy volume. To preserve the details and textures of the original 3D tile, we render it from multiple views and jointly condition the denoising on all of them. In practice, when denoising with multiple conditioning views, at each timestep, the denoising step is computed as the average denoising step across all views. We show that this upsampling scheme leads to superior reconstructions in Fig. 9.

Figure 10. Left (with context): 2×2 grid generated with our method, where context is taken into account as described in Sec. 3.2. Right (without context): Generated with our method using the same prompts, but not taking into account context; here, the scale of the buildings is not consistent.

4. Experiments

Experimental details. We generate the text prompts using ChatGPT o3-mini-high. For the 2D inpainter, we use the Flux ControlNet of [1].

Human preference. We evaluate human preference for the results generated by our method and those obtained with BlockFusion [59]. In particular, we compare a ‘city’ scene, showing the entire scene as well as close-up detail views. As seen in Tab.
1, participants ($n = 22$) find our method better overall, with better geometry, realism, and diversity.

Table 2. Average tile 3D geometry metrics for an approach without rebasing and our method. Rebasing is crucial to ensure a tile is square and its base has been reconstructed faithfully. The metrics we use are the area of the base in voxels, a measure for the ‘squareness’ of the base, and how many border voxels have been faithfully reconstructed. For details, please refer to the appendix.

Method        Base Area   Squareness ↑   Completeness ↑
No Rebasing   2271        0.92           0.73
Ours          4096        1.00           1.00

4.1. Ablations

Here we ablate several components of our approach, showing the importance of each of them.

Building a grid. A naive approach to generating a 3D scene is querying the image generator to produce an image of a large-scale scene (using our 2D image prompt setup) and then obtaining the entire 3D world directly with TRELLIS. To achieve the same level of control provided by our method, the textual prompt needs to be highly detailed and include layout instructions. However, we found neither precise nor abstract prompts to be effective at steering the generations of Flux (for details, see A.4).

2D prompting context. We remove context from neighbouring tiles, as described in Sec. 3.2. Doing that, each tile is sampled independently, and the relative scale between objects is inconsistent, as shown in Fig. 10.

Rebasing. To place tiles on a grid, they need to be square (otherwise the grid would be jagged) and their base needs to have been reconstructed faithfully (clearly delimiting where the tile stops). Without rebasing, the geometry generated by TRELLIS might extend beyond the base and make the tile’s ‘true’ extent difficult to detect, as shown in Fig. 8. We ablate the effect of rebasing, using a small 2×2 scene to curb the effect of error accumulation. As seen in Tab.
2, no rebasing causes TRELLIS to generate tiles that are, on average, neither perfectly square nor have a solid border.

Table 3. Perceptual metrics for our methods and the naive approach. Lower values for LPIPS [69], FID [16], and KID [3] are better, while higher values for SSIM are better. We see that even using a single conditioning frame leads to better upsampling results, and multiple frames further improve performance.

Method                LPIPS ↓   SSIM ↑   FID ↓   KID ↓
Naive upsampling      0.5914    0.3093   200.5   0.243
Ours (single frame)   0.3517    0.5149   111.6   0.069
Ours (multi frame)    0.3212    0.5312   89.1    0.051

Figure 11. Exploring a 3D world. We show trajectories exploring the 3D worlds we generate. Please see the supplement for videos.

Figure 12. Left: Tiles before applying the 3D blending step. Right: After the 3D blending step. We see that where boundaries between tiles were obvious, they are now well-blended.

Latent upsampling. We sample 10 random views each from 200 tiles generated by TRELLIS and compute perceptual metrics in Tab. 3 when upsampling with our proposed upsampling approach in Sec. 3.4 and a naive interpolation. We see that the proposed method leads to better results across a range of metrics, even when using a single conditioning frame.

3D blending. In Fig. 12, we generate a scene where we do not apply the 3D blending (Sec. 3.4), resulting in discontinuities between the tiles.

4.2. Qualitative Results

We present example scenes generated by our method in Fig. 1. Further, we show detail views, highlighting the quality and diversity of the scene. Please see the supplementary material for many more examples.

Exploring a generated world. We can sample trajectories exploring the generated 3D worlds (Fig. 11). A skybox has been added for visual effect. Unlike the trajectories generated by world video models [38], ours are guaranteed to be consistent and do not suffer from semantic drift.
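For readers who want a concrete picture of the 3D blending step of Sec. 3.4 (the step ablated in Sec. 4.1), here is a minimal one-dimensional sketch of the stitch and of the masked denoising update. It is not the TRELLIS implementation: latents are reduced to a list of numbers along the x axis, and `denoise` and `add_noise` are hypothetical callables standing in for the latent denoiser $\Omega$ (conditioned on the blended image) and for the forward noising of the stitched volume $\gamma$.

```python
import random


def stitch(gamma1, gamma2):
    """Build the stitched volume: second half of tile 1, then first half of tile 2.

    Mirrors gamma[x] = gamma1[x + R/2] if x < R/2 else gamma2[x - R/2].
    """
    size = len(gamma1)
    half = size // 2
    return [gamma1[x + half] if x < half else gamma2[x - half] for x in range(size)]


def blend(gamma, denoise, add_noise, steps, r, rng):
    """Re-denoise only the band |x - R/2| <= r; keep the rest tied to gamma.

    denoise(latents, t)  -> latents after one denoising step (hypothetical)
    add_noise(gamma, t)  -> gamma noised to the level of step t (hypothetical);
                            it is expected to return gamma itself at t == steps.
    """
    size = len(gamma)
    half = size // 2
    current = [rng.gauss(0.0, 1.0) for _ in range(size)]  # start from N(0, I)
    for t in range(steps):
        denoised = denoise(current, t)
        reference = add_noise(gamma, t + 1)
        current = [
            denoised[x] if abs(x - half) <= r else reference[x]
            for x in range(size)
        ]
    return current


# Toy run with trivial stand-ins for the denoiser and the noising schedule.
gamma = stitch(list(range(8)), list(range(10, 18)))
blended = blend(
    gamma,
    denoise=lambda latents, t: [0.0] * len(latents),
    add_noise=lambda g, t: list(g),
    steps=3,
    r=1,
    rng=random.Random(0),
)
```

With these stand-ins, only the three positions around the seam are rewritten, while positions outside the band are copied from the stitched volume, which is the behaviour the update rule describes.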
Different from other systems that only generate a bubble, our method creates spaces sufficiently large to be navigated in a non-trivial manner.

5. Conclusion

We have introduced SynCity, an approach to generate diverse, high-quality, and complex 3D worlds with fine-grained control over their layout and appearance. SynCity creates worlds by autoregressively generating tiles on a grid, enabling scalability to arbitrary grid sizes. By accounting for local context and by means of 3D inpainting, the tiles are seamlessly stitched together into coherent scenes. SynCity is flexible: it can either generate worlds from a brief ‘world’ text prompt or allow control of the individual tiles via tile-specific instructions, all the while maintaining an overall thematic consistency of the generated world. The rich detail of the generated worlds can be fully explored, not restricted to a single ‘3D bubble’ as in many prior works.

We have demonstrated the effectiveness of off-the-shelf generation by utilizing pre-trained language, 2D, and 3D generators through carefully designed prompting strategies and without requiring retraining of any of these components. Nevertheless, we expect that in cases where 3D scene-scale data is available, fine-tuning some components would result in further improved results and simplifications in the alignment and rebasing steps of the pipeline. Future work could also consider relaxing the tile structure, for example, by randomly shifting and scaling tiles and using coarse-to-fine modeling to ensure coherent global structure and fine-grained local details.

Ethics. For details on ethics, data protection, and copyright, please see https://www.robots.ox.ac.uk/~vedaldi/research/union/ethics.html.

Acknowledgments. The authors of this work are supported by ERC 101001212-UNION, AIMS EP/S024050/1, and Meta Research.

References

[1] AlimamaCreative. Flux-controlnet-inpainting. https://github.com/alimama-creative/FLUX-Controlnet-Inpainting, 2024. GitHub repository.
[2] Miguel Ángel Bautista et al. GAUDI: A neural architect for immersive 3D scene generation. arXiv.cs, abs/2207.13751, 2022.
[3] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel and Arthur Gretton. Demystifying mmd gans. arXiv preprint arXiv:1801.01401, 2018.
[4] Ron Brinkmann. The art and science of digital compositing: Techniques for visual effects, animation and motion graphics. Morgan Kaufmann, 2008.
[5] Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad Shahbazi, Anton Obukhov, Luc Van Gool and Gordon Wetzstein. DiffDreamer: Towards consistent unsupervised single-view scene extrapolation with conditional diffusion models. In Proc. ICCV, 2023.
[6] Lucy Chai, Richard Tucker, Zhengqi Li, Phillip Isola and Noah Snavely. Persistent nature: A generative model of unbounded 3d worlds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20863–20874, 2023.
[7] Zhaoxi Chen, Guangcong Wang and Ziwei Liu. Scenedreamer: Unbounded 3d scene generation from 2d image collections. IEEE transactions on pattern analysis and machine intelligence, 45(12):15562–15576, 2023.
[8] Jaeyoung Chung, Suyoung Lee, Hyeongjin Nam, Jaerin Lee and Kyoung Mu Lee. Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. arXiv.cs, abs/2311.13384, 2023.
[9] Dana Cohen-Bar, Elad Richardson, Gal Metzer, Raja Giryes and Daniel Cohen-Or. Set-the-scene: Global-local training for generating controllable nerf scenes. In Proc. ICCV Workshops, 2023.
[10] Deemos. Rodin text-to-3D gen-1 (0525) v0.5, 2024.
[11] Matt Deitke et al. Objaverse-XL: A universe of 10M+ 3D objects. CoRR, abs/2307.05663, 2023.
[12] Paul Engstler, Andrea Vedaldi, Iro Laina and Christian Rupprecht. Invisible stitch: Generating smooth 3D scenes with depth inpainting. In Proceedings of the International Conference on 3D Vision (3DV), 2025.
[13] Rafail Fridman, Amit Abecasis, Yoni Kasten and Tali Dekel. Scenescape: Text-driven consistent scene generation. CoRR, abs/2302.01133, 2023.
[14] Ruiqi Gao, Aleksander Holynski, Philipp Henzler, Arthur Brussee, Ricardo Martin-Brualla, Pratul Srinivasan, Jonathan T. Barron and Ben Poole. CAT3D: create anything in 3d with multi-view diffusion models. arXiv, 2405.10314, 2024.
[15] Daniel Gatis. rembg, 2025.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Proc. NeurIPS, 2017.
[17] Lukas Höllein, Ang Cao, Andrew Owens, Justin Johnson and Matthias Nießner. Text2Room: Extracting textured 3D meshes from 2D text-to-image models. In Proc. ICCV, 2023.
[18] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. In ACM SIGGRAPH 2024 conference papers, pages 1–11, 2024.
[19] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler and George Drettakis. 3D Gaussian Splatting for real-time radiance field rendering. Proc. SIGGRAPH, 42(4), 2023.
[20] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2023.
[21] World Labs. Generating worlds, 2024.
[22] Jiabao Lei, Jiapeng Tang and Kui Jia. RGBD2: generative scene synthesis via incremental view inpainting using RGBD diffusion models. In Proc. CVPR, 2023.
[23] Jiaxin Li, Zijian Feng, Qi She, Henghui Ding, Changhu Wang and Gim Hee Lee. MINE: towards continuous depth MPI with NeRF for novel view synthesis. In Proc. ICCV, 2021.
[24] Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich and Sai Bi. Instant3D: Fast text-to-3D with sparse-view generation and large reconstruction model. Proc. ICLR, 2024.
[25] Weiyu Li, Jiarui Liu, Rui Chen, Yixun Liang, Xuelin Chen, Ping Tan and Xiaoxiao Long.
CraftsMan: high-fidelity mesh generation with 3d native generation and interactive geome- try refiner. arXiv , 2405.14979, 2024. 2, 4 [26] Yangguang Li et al. TripoSG: high-fidelity 3D shape synthesis using large-scale rectified flow models. arXiv , 2502.06608, 2025. 2, 4 [27] Zhengqi Li, Qianqian Wang, Noah Snavely and Angjoo Kanazawa. Infinitenature-zero: Learning perpetual view generation of natural scenes from single images. In Eu- ropean Conference on Computer Vision , pages 515–534. Springer, 2022. 3 [28] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. arXiv.cs , abs/2211.10440, 2022. 4 Page 10: [29] Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang and Sergey Tulyakov. Infinicity: Infinite-scale city synthesis. In Pro- ceedings of the IEEE/CVF international conference on com- puter vision , pages 22808–22818, 2023. 3 [30] A Liu, R Tucker, V Jampani, A Makadia and N Snavely. . . . Infinite nature: Perpetual view generation of natural scenes from a single image. In Proc. ICCV , 2021. 2, 3 [31] Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tok- makov, Sergey Zakharov and Carl V ondrick. Zero-1-to-3: Zero-shot one image to 3D object. In Proc. ICCV , 2023. 4 [32] Luke Melas-Kyriazi, Iro Laina, Christian Rupprecht, Natalia Neverova, Andrea Vedaldi, Oran Gafni and Filippos Kokki- nos. IM-3D: Iterative multiview diffusion and reconstruction for high-quality 3D generation. In Proceedings of the Inter- national Conference on Machine Learning (ICML) , 2024. 2 [33] Quan Meng, Lei Li, Matthias Nießner and Angela Dai. LT3SD: latent trees for 3D scene diffusion. arXiv , 2409.08215, 2024. 2, 3 [34] Ben Mildenhall, Pratul P. Srinivasan, Rodrigo Ortiz Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng and Abhishek Kar. 
Local light field fusion: practical view syn- thesis with prescriptive sampling guidelines. Proc. SIG- GRAPH , 38(4), 2019. 2 [35] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi and Ren Ng. NeRF: Representing scenes as neural radiance fields for view syn- thesis. In Proc. ECCV , 2020. 2 [36] OpenAI et al. GPT-4 technical report. arXiv , 2303.08774, 2024. 4 [37] Hao Ouyang, Kathryn Heal, Stephen Lombardi and Tiancheng Sun. Text2Immersion: Generative immersive scene with 3D gaussians. arXiv.cs , abs/2312.09242, 2023. 2, 3 [38] Jack Parker-Holder et al. Genie 2: A large-scale foundation world model, 2024. 8 [39] Ben Poole, Ajay Jain, Jonathan T. Barron and Ben Milden- hall. DreamFusion: Text-to-3D using 2D diffusion. In Proc. ICLR , 2023. 2 [40] Guocheng Qian et al. Magic123: One image to high-quality 3D object generation using both 2D and 3D diffusion priors. arXiv.cs , abs/2306.17843, 2023. 4 [41] Amit Raj et al. DreamBooth3D: subject-driven text-to-3D generation. In Proc. ICCV , 2023. 4 [42] Chris Rockwell, David F. Fouhey and Justin Johnson. Pixel- Synth: Generating a 3D-consistent experience from a single image. In Proc. ICCV , 2021. 2, 3 [43] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser and Björn Ommer. High-resolution image syn- thesis with latent diffusion models. In Proc. CVPR , 2022. 2 [44] Kyle Sargent et al. ZeroNVS: Zero-shot 360-degree view synthesis from a single real image. arXiv.cs , abs/2310.17994, 2023. 2, 3 [45] Junyoung Seo, Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, Chieh-Hsin Lai, Se- ungryong Kim and Yuki Mitsufuji. GenWarp: single imageto novel views with semantic-preserving generative warping. arXiv , 2405.17251, 2024. 2 [46] Yu Shang, Yuming Lin, Yu Zheng, Hangyu Fan, Jingtao Ding, Jie Feng, Jiansheng Chen, Li Tian and Yong Li. Urban- world: An urban world model for 3d city generation. arXiv preprint arXiv:2407.11965 , 2024. 
3 [47] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. In Proc. ICLR , 2024. 2, 4 [48] Meng-Li Shih, Shih-Yang Su, Johannes Kopf and Jia-Bin Huang. 3d photography using context-aware layered depth inpainting. In Proc. CVPR , 2020. 2 [49] Jaidev Shriram, Alex Trevithick, Lingjie Liu and Ravi Ra- mamoorthi. RealmDreamer: text-driven 3d scene generation with inpainting and depth diffusion. In Proc. 3DV , 2025. 3 [50] Pratul P Srinivasan, Richard Tucker, Jonathan T Barron, Ravi Ramamoorthi, Ren Ng and Noah Snavely. Pushing the boundaries of view extrapolation with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition , pages 175–184, 2019. 2 [51] Gabriela Ben Melech Stan et al. LDM3D: Latent diffusion model for 3D. arXiv.cs , (2305.10853), 2023. 3 [52] Stanislaw Szymanowicz, Christian Rupprecht and Andrea Vedaldi. Viewset diffusion: (0-)image-conditioned 3D gen- erative models from 2D data. In Proceedings of the Interna- tional Conference on Computer Vision (ICCV) , 2023. 2 [53] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng and Ziwei Liu. LGM: Large multi-view Gaus- sian model for high-resolution 3D content creation. arXiv , 2402.05054, 2024. 4 [54] Richard Tucker and Noah Snavely. Single-view view synthe- sis with multiplane images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 551–560, 2020. 2 [55] Shubham Tulsiani, Richard Tucker and Noah Snavely. Layer-structured 3d scene inference via view synthesis. In Proc. ECCV , 2018. 2 [56] Guangcong Wang, Peng Wang, Zhaoxi Chen, Wenping Wang, Chen Change Loy and Ziwei Liu. PERF: panoramic neural radiance field from a single panorama. tpami , 2024. 3 [57] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distilla- tion. 
Advances in Neural Information Processing Systems , 36:8406–8441, 2023. 3 [58] Olivia Wiles, Georgia Gkioxari, Richard Szeliski and Justin Johnson. Synsin: End-to-end view synthesis from a single image. In Proc. CVPR , 2020. 2 [59] Zhennan Wu et al. BlockFusion: Expandable 3D scene gen- eration using latent tri-plane extrapolation. arXiv.cs , 2024. 2, 3, 7 [60] Jianfeng Xiang, Jiaolong Yang, Binbin Huang and Xin Tong. 3D-aware image generation using 2D diffusion mod- els.arXiv.cs , abs/2303.17905, 2023. 3 [61] Jianfeng Xiang, Zelong Lv, Sicheng Xu, Yu Deng, Ruicheng Wang, Bowen Zhang, Dong Chen, Xin Tong and Jiaolong Yang. Structured 3D latents for scalable and versatile 3d generation. arXiv , 2412.01506, 2024. 2, 4, 5 Page 11: [62] Haozhe Xie, Zhaoxi Chen, Fangzhou Hong and Ziwei Liu. Citydreamer: Compositional generative model of unbounded 3d cities. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 9666–9675, 2024. 3 [63] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen and Gordon Wetzstein. GRM: Large gaussian reconstruction model for efficient 3D reconstruction and generation. arXiv , 2403.14621, 2024. 2 [64] Hong-Xing Yu et al. Wonderjourney: Going from anywhere to everywhere. arXiv.cs , abs/2312.03884, 2023. 2, 3 [65] Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T Freeman and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. arXiv preprint arXiv:2406.09394 , 2024. 3 [66] Biao Zhang, Jiapeng Tang, Matthias Niessner and Peter Wonka. 3dshape2vecset: A 3d shape representation for neu- ral fields and generative diffusion models. In ACM Transac- tions on Graphics , 2023. 2 [67] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, Can Wang and Jing Liao. Text2NeRF: Text-driven 3D scene generation with neural radiance fields. arXiv.cs , abs/2305.11588, 2023. 
2, 3
[68] Longwen Zhang, Ziyu Wang, Qixuan Zhang, Qiwei Qiu, Anqi Pang, Haoran Jiang, Wei Yang, Lan Xu and Jingyi Yu. Clay: A controllable large-scale generative model for creating high-quality 3d assets. ACM Transactions on Graphics (TOG), 43(4):1–20, 2024. 4
[69] Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. CVPR, pages 586–595, 2018. 7, 13
[70] Zibo Zhao et al. Hunyuan3d 2.0: Scaling diffusion models for high resolution textured 3d assets generation. arXiv preprint arXiv:2501.12202, 2025. 4

Page 12:

SynCity: Training-Free Generation of 3D Worlds
Supplementary Material

A. Appendix

A.1. Language Model Prompting Details

While the prompt p can be constructed manually, an LLM may also be employed. For it to understand the task it is asked to solve, we utilize the following prompt for ChatGPT o3-mini-high:

    Assume you had access to an AI model that can generate small-scale cities on an isometric grid by creating individual tiles. For each of these tiles (identified by their 2D position), a short but expressive text prompt has to be provided. Additionally, a global prompt is used, which provides context, lighting, time of day, as well as the art style. The prompts of the tiles can be generic but they might have a semantic connection to neighbouring tiles (such that a river can flow through the city on multiple tiles). The format for the instructions to the AI model is JSON. Consider the following example: <TEMPLATE> The art style and perspective mentioned in the prompt should be maintained. The rest may be freely adapted. There are no limits to the setting, the sky is your limit. Now, please generate a 3×3 grid.

In this prompt, a template or a ‘seed’ prompt is provided. We use a simple JSON file, as exemplified in Fig. 13.
    {
      "tiles": [
        {"prompt": "ancient stone bridge over a stream", "x": 0, "y": 0},
        {"prompt": "lively stream past mossy banks", "x": 1, "y": 0},
        {"prompt": "serene pond reflecting moonlight", "x": 0, "y": 1},
        {"prompt": "bustling medieval market street", "x": 1, "y": 1}
      ],
      "prompt": "{tile_prompt}, medieval setting, isometric view, glowing lanterns, soft shading, vibrant colors, detailed textures"
    }

Figure 13. Example JSON file to describe each tile in a 2×2 world.

A.2. 2D Prompting Details

In Sec. 3.2, we describe how tiles are generated in the context of those that already exist. There is a special case that we address separately, where context has to be bootstrapped, namely tiles $\mathcal{L} := \{(x, y) \in \mathcal{T} : x = 0 \land y > 0\}$. Due to our build order and the trimming of obstructing 3D geometry, these tiles might lack sufficient contextual cues. As a remedy, we temporarily provide context with a previously generated tile: for a tile $(0, y) \in \mathcal{L}$, we duplicate the tile $(0, y-1) \in \mathcal{T}$ and place the copy at position $(-1, y)$. During inpainting, this tile serves to provide context in terms of scale and general appearance. Once inpainting is completed, this copy is removed.

Figure 14. Tile geometry validation. To check the geometric qualities of a reconstructed tile, we look at the occupancy grid $V \in \{0,1\}^{R \times R \times R}$ generated by TRELLIS. Activated voxels are indicated in orange (■). Left: The extent of an object at height $w$ (slice visualized in 2D). Right: An example of a 3D tile base template $V_B$.

A.3. 3D Geometry Validation Details

TRELLIS is a two-stage method and produces an occupancy volume $V \in \{0,1\}^{R \times R \times R}$ in the first stage (before the 3D Gaussian mixture is output), where $R = 64$ is the resolution of the grid. To perform geometric validation, we utilize this occupancy volume, which captures the rough 3D geometry of the tile, and check that it conforms to the desired geometry.
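As a rough illustration of this validation, the two checks specified in the remainder of this section (footprint size and squareness, then coverage of the base border) could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the occupancy grid is assumed to be a NumPy array indexed as `V[u, v, w]`, and the footprint is taken over all heights, following the footnote's definition of $u_{\min}$.

```python
import numpy as np


def footprint_extents(V):
    """Width and height (in voxels) of the bounding box of the top-down footprint."""
    us, vs = np.nonzero(V.any(axis=2))  # project the occupancy along the height axis w
    if us.size == 0:
        return 0, 0
    return int(us.max() - us.min() + 1), int(vs.max() - vs.min() + 1)


def base_template(V):
    """Indicator V_B: border of the bounding box of the largest slice (height w*)."""
    best_area, best = 0, None
    for w in range(V.shape[2]):
        us, vs = np.nonzero(V[:, :, w])
        if us.size == 0:
            continue
        u0, u1, v0, v1 = us.min(), us.max(), vs.min(), vs.max()
        area = (u1 - u0 + 1) * (v1 - v0 + 1)  # ext_u(w) * ext_v(w)
        if area > best_area:
            best_area, best = area, (w, u0, u1, v0, v1)
    VB = np.zeros(V.shape, dtype=np.int64)
    if best is not None:
        w, u0, u1, v0, v1 = best
        VB[u0:u1 + 1, v0:v1 + 1, w] = 1  # fill the bounding box at height w* ...
        VB[u0 + 1:u1, v0 + 1:v1, w] = 0  # ... and clear its interior, keeping the border
    return VB


def is_valid_tile(V, alpha=1.0, beta=0.95):
    """Return True if the occupancy grid V passes both checks (hypothetical helper)."""
    V = (np.asarray(V) > 0).astype(np.int64)
    R = V.shape[0]
    ext_u, ext_v = footprint_extents(V)
    if ext_u * ext_v < (R / 2) ** 2:  # footprint too small
        return False
    if min(ext_u, ext_v) / max(ext_u, ext_v) < alpha:  # footprint not square
        return False
    VB = base_template(V)
    coverage = (V * VB).sum() / (VB * VB).sum()  # (V . V_B) / (V_B . V_B)
    return bool(coverage >= beta)
```

With the thresholds stated in this section ($\alpha = 1$, $\beta = 0.95$), a tile passes only if its footprint is exactly square and at least 95% of the border voxels of its largest slice are occupied.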
First, we test whether the reconstructed tile is supported by a square by computing its 2D rectangular footprint and ensuring that the latter is sufficiently large and isotropic. To this end, let $(u, v, w)$ index the $R \times R \times R$ voxel grid, where $u$ and $v$ correspond to world directions $x$ and $y$. Let $(u_{\min}, u_{\max}, v_{\min}, v_{\max})$ be the bounding box containing all the active voxels at height $w$.* Let $\mathrm{ext}_u = \max\{0, 1 + u_{\max} - u_{\min}\}$ be the width of the bounding box and $\mathrm{ext}_v$ its height. We discard the tile if the area is too small, i.e., if $\mathrm{ext}_u \cdot \mathrm{ext}_v < (R/2)^2$. We also discard it if it is not square, i.e., if $\min\{\mathrm{ext}_u, \mathrm{ext}_v\} / \max\{\mathrm{ext}_u, \mathrm{ext}_v\} < \alpha = 1$.

* So, for instance, $u_{\min} = \min\{u : \exists v, w : V(u, v, w) = 1\}$.

Second, we check that the base that we have added in 2D in the ‘rebasing’ step has been faithfully reconstructed in 3D.

Page 13:

We define a 3D tile base template, which we call $V_B$. Let $u_{\min}(w)$ be the minimum $u$ of the bounding box that contains the volume slice at height $w$, and define $u_{\max}(w)$ and so on in a similar manner, so for instance $u_{\min}(w) = \min\{u : \exists v : V(u, v, w) = 1\}$. Let $w^*$ be the height at which the base is the largest, i.e., $w^* = \operatorname{argmax}_w \mathrm{ext}_u(w) \cdot \mathrm{ext}_v(w)$. Then, $V_B$ is the indicator function of the voxels $(u, v, w)$ such that $w = w^*$ and

$$\max\left\{ \frac{|2u - u_{\max}(w^*) - u_{\min}(w^*)|}{u_{\max}(w^*) - u_{\min}(w^*)},\ \frac{|2v - v_{\max}(w^*) - v_{\min}(w^*)|}{v_{\max}(w^*) - v_{\min}(w^*)} \right\} = 1.$$

Note that the template $V_B$ is constructed adaptively to match the input tile $V$. We discard a generated tile if $(V \cdot V_B)/(V_B \cdot V_B) < \beta = 0.95$, where $\cdot$ denotes the inner product of tensors.

A.4. 3DGS Post-Processing Details

3D cropping, resizing and centering. Given the 3D Gaussian mixture $G(x, y)$ initially output by the 3D generator, we first identify the extent of the tile ‘proper’ (discounting the extended base). We consider the $xy$ footprint of the tile (i.e., we look at the tile from above) and seek to identify four cuts (from the left, right, top, and bottom) that define an axis-aligned rectangle strictly containing the tile.
For example, to determine the location of the left cut $x^*$, we consider slices $V_x = \{(x', y', z') \in \mathbb{R}^3 : x - \delta \le x' < x + \delta\}$. We find the 3D Gaussians whose centers fall within $V_x$ and compute their average color $c_x$. Then, we compute the distance $d(x) = \|c_x - c_{x_{\min}}\|$, where $c_{x_{\min}}$ is the average color of the leftmost slice (used as a reference). We set $x^* = \min\{x : d(x) > \tau\}$, where $\tau$ is a threshold; this corresponds to the first slice that transitions from the ‘background’ color to something else.

We find the four cuts in this manner, keep only the Gaussians contained in the resulting rectangular footprint, and recenter and resize this footprint to fill the standard tile size.

Additionally, the base allows us to figure out the position of the tile’s surface: as TRELLIS centers objects vertically, the ground surface level of any two tiles will vary. We use the average height of the tile’s four corners to determine the position of the surface, allowing us to align it with others.

3D reorientation. We also note that TRELLIS generates the 3D object with an arbitrary orientation with respect to the input image $\tilde{I}(x, y)$. However, the tile must be inserted with the correct orientation in the 3D world, as otherwise the continuity between tiles, which the inpainting method of Fig. 4 encourages, will be lost. In practice, the ambiguity is limited to 90-degree rotations around the vertical axis and is very easy to remove.* To do so, we test four possible 90-degree rotations of the tile around the vertical axis, rendering the corresponding views and comparing them to $\tilde{I}(x, y)$ using the LPIPS [69] loss. The minimizer is taken as the correct orientation.

* This is likely due to the implicit bias in the TRELLIS training set that consists of synthetic 3D objects which are almost invariably axis aligned.

A.5. Ablation Details

Building a Grid. In the following, we present results from our experiments attempting to generate a large scene non-iteratively.
Here, we generate a single image with Flux that is used as conditioning for TRELLIS to directly create the desired scene.

Figure 15. Non-iterative city building. We obtain conditioning images generated by Flux (left) and directly use them to build a large-scale scene with TRELLIS (center). While the generated 3D structures are visually appealing, the level of detail (right) is very limited. The first row uses generic prompting for the conditioning image (“a city scene on top of a base”), whereas the second row uses a more involved prompt with an explicit layout (e.g., “a house in the bottom left corner”, “a pharmacy in the top right corner”).

In the first set of experiments, we do not use our 2D prompting design. To obtain an isolated 3D object that can be generated by TRELLIS, we use prompts with the prefix “a 3d object of”. We show those results in Fig. 15. While the generated objects are visually appealing, they have several limitations: (i) The resolution of the conditioning image and of the 3D structures TRELLIS can generate is limited. Therefore, this approach is not scalable to arbitrarily large scenes. (ii) Due to the lack of perfect control over the base structure, the result cannot be easily extended or edited. (iii) The layout instructions are mostly ignored, thus severely limiting the level of control over the generation.

For the second set of experiments, we use our 2D prompting design along with the Flux ControlNet for inpainting (Fig. 16). However, with this setup, the quality of the results is not improved, and the layout instructions in the prompt are again mostly ignored.

Querying Flux to generate large-scale scenes directly has not been successful in our experiments, which motivates our grid-based method that allows fine-grained layout and appearance control for each tile.

Page 14:

Figure 16. Non-iterative city building (with our 2D prompting).
We obtain conditioning images generated by Flux (left) and directly use them to build a large-scale scene with TRELLIS (center). Despite the initial visual appeal, the structures lack detail. The first row uses generic prompting for the conditioning image (“a vibrant city scene”), whereas the second row uses a more involved prompt with an explicit layout (e.g., “a house in the bottom left corner”, “a pharmacy in the top right corner”).

A.6. Additional Qualitative Results

In Figs. 17 to 20, we show additional results of our method. As we leverage a pre-trained 2D image generator trained on a very large dataset, we are able to generate highly diverse scenes. Thanks to our fine-grained control at the tile level, we can generate interesting patterns, such as a transition between seasons across a grid (observe the largest grid in Fig. 20).

A.7. Limitations

While our method allows creating large and diverse scenes, there are some limitations to be addressed in future work.

Atomic tiles. Although we inpaint tiles conditioned on their surroundings, they remain individual units. While structures can be created that span multiple tiles, this requires harmonious cooperation between Flux and TRELLIS.

Use of heuristics. To determine the ground surface height for each tile and to remove the base we added during rebasing, we employ heuristics. We have designed these carefully with fallback mechanisms, but they are not infallible.

Inherited limitations. As our method builds on top of Flux and TRELLIS, their limitations also apply to ours. During our experiments, we have observed that, despite good inpainting results, TRELLIS at times only vaguely adheres to the conditioning image in terms of appearance, in particular color. Thus, transitions between tiles might not look perfectly smooth (even if they were generated that way in the inpainting result).

Page 15:

Figure 17. Exploring a 3D world. We show trajectories exploring the 3D worlds we generate.
A skybox has been added for visual effect.

Page 16:

Figure 18. Generated scenes. We show scenes generated with the same prompts, but different seeds in 2D inpainting.

Page 17:

Figure 19. Generated scenes.

Page 18:

Figure 20. Generated scenes. Our method can easily generate large scenes. Further, interesting patterns can be injected thanks to fine-grained control over each tile. Top: The scene transitions in season, from winter to spring to summer to autumn. Bottom: The scenery transitions from a city-like to a rural environment.
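The left-cut search and the reorientation step of Sec. A.4 can be illustrated with a short Python sketch. It is only a sketch under several assumptions that are not taken from the paper: the function names are hypothetical, the candidate slice positions are sampled uniformly (`num_slices`), the values of `delta` and `tau` are placeholders, and the orientation test rotates a 2D array with a caller-supplied `distance` function instead of rendering the rotated 3D tile and scoring it with LPIPS.

```python
import numpy as np


def find_left_cut(centers, colors, delta=0.01, tau=0.1, num_slices=200):
    """Smallest slice position x whose mean color differs from the leftmost slice by more than tau."""
    xs = centers[:, 0]
    reference = None
    for x in np.linspace(xs.min(), xs.max(), num_slices):
        in_slice = (xs >= x - delta) & (xs < x + delta)  # Gaussians whose centers lie in V_x
        if not in_slice.any():
            continue  # no Gaussian centers fall into this slice
        mean_color = colors[in_slice].mean(axis=0)  # c_x
        if reference is None:
            reference = mean_color  # leftmost slice: background (base) color
        elif np.linalg.norm(mean_color - reference) > tau:  # d(x) > tau
            return float(x)
    return None  # the color never departs from the reference


def best_quarter_turn(view, target, distance):
    """Number k of 90-degree turns of `view` that minimizes distance(rotated, target)."""
    scores = [distance(np.rot90(view, k), target) for k in range(4)]
    return int(np.argmin(scores))
```

The remaining three cuts would follow by applying the same search along the other directions, for example by negating or swapping the coordinate passed in as the first column of `centers`.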

---