
arxiv

Paper 2412.14897

Diffusion priors for Bayesian 3D reconstruction from incomplete measurements

Authors: Julian L. Möbius, Michael Habeck

Published: 2024-12-19

Abstract:

Many inverse problems are ill-posed and need to be complemented by prior information that restricts the class of admissible models. Bayesian approaches encode this information as prior distributions that impose generic properties on the model such as sparsity, non-negativity or smoothness. However, in case of complex structured models such as images, graphs or three-dimensional (3D) objects, generic prior distributions tend to favor models that differ largely from those observed in the real world. Here we explore the use of diffusion models as priors that are combined with experimental data within a Bayesian framework. We use 3D point clouds to represent 3D objects such as household items or biomolecular complexes formed from proteins and nucleic acids. We train diffusion models that generate coarse-grained 3D structures at a medium resolution and integrate these with incomplete and noisy experimental data. To demonstrate the power of our approach, we focus on the reconstruction of biomolecular assemblies from cryo-electron microscopy (cryo-EM) images, which is an important inverse problem in structural biology. We find that posterior sampling with diffusion model priors allows for 3D reconstruction from very sparse, low-resolution and partial observations.

Paper Content: on Alphaxiv
DIFFUSION PRIORS FOR BAYESIAN 3D RECONSTRUCTION FROM INCOMPLETE MEASUREMENTS

Julian L. Möbius & Michael Habeck
Microscopic Image Analysis Group, Jena University Hospital, Friedrich Schiller University Jena, 07743 Jena, Germany
{julian.moebius,michael.habeck}@uni-jena.de

arXiv:2412.14897v1 [cs.LG] 19 Dec 2024

1 INTRODUCTION

Inverse problems are encountered in many different scientific fields. The basic setting is that we observe noisy and incomplete data $y$ and seek to find a model $x$ that predicts mock data via a forward model $A$ such that $y \approx A(x)$. An important subclass are linear models where $A$ is a linear operator. Well-known inverse problems are deconvolution or tomography.

The challenge in solving inverse problems stems from the fact that they tend to be ill-posed, meaning that many models can produce highly similar data and/or the reconstructed model can be very sensitive to noise. The remedy is to combine a reconstruction loss with a regularizer. Well-studied regularizers are Tikhonov regularization (aka ridge regression), sparsity, and non-negativity.

Bayesian inference offers a powerful framework to tackle inverse problems. The conditional probability $p(y|x)$, the likelihood, relates the data $y$ to the mock data $A(x)$ via a noise model. A common assumption is independent Gaussian noise resulting in the likelihood

$$p(y|x) \propto \exp\left(-\frac{\|y - A(x)\|^2}{2\sigma^2}\right). \tag{1}$$

Maximizing the likelihood is then equivalent to standard least-squares fitting. The prior probability $p(x)$ encodes data-independent knowledge about a particular model $x$; its negative logarithm $-\log p(x)$ can be viewed as a regularizer. The posterior of the model is

$$p(x|y) = \frac{p(y|x)\,p(x)}{p(y)} \tag{2}$$

with model evidence $p(y) = \int p(y|x)\,p(x)\,dx$. In case of a Gaussian likelihood, maximization of $\log p(x|y)$ is equivalent to regularized least-squares fitting.

Often detailed knowledge about reasonable solutions is available but difficult to capture by the standard priors that are typically used to tackle inverse problems. For example, cryo-electron microscopy (cryo-EM) aims to reconstruct the three-dimensional structure of macromolecular complexes from two-dimensional (2D) projections. Cryo-EM images are typically very noisy with signal-to-noise ratios (SNR) far below one. On the other hand, a large body of knowledge has been accumulated over the past six decades, including hundreds of thousands of experimentally determined biomolecular structures that are stored in the Protein Data Bank (PDB) (Berman et al., 2000).
Experimentally determined structures exhibit recurrent features such as alpha-helices and beta-strands and preferences for the proximity and packing of amino acids and entire subunits. This detailed information is not captured by standard priors used in cryo-EM reconstruction packages such as CryoDRGN (Zhong et al., 2021a;b), cryoSPARC (Punjani et al., 2017) or RELION (Scheres, 2012). These approaches represent the structure as a voxel grid and use generic priors enforcing non-negativity or penalizing high-frequency contributions. If one were to sample volumes from the corresponding prior, the sampled structures would not resemble any of the known biomolecular structures.

Here we try to encode the rich knowledge available in the PDB as a diffusion model prior. We test 3D reconstruction from sparse, low-resolution and partial measurements by posterior sampling with diffusion models as priors.

1.1 CONTRIBUTION

Our contributions are as follows: We propose a method to reconstruct 3D structures from 2D projections that utilizes diffusion models as priors. Using diffusion priors has previously not been explored to solve the 2D to 3D reconstruction problem in cryo-EM. The combination of the diffusion model prior with a likelihood allows us to reconstruct 3D structures from very sparse observations such as 2D projections, low-resolution structures and known structures of subunits. This is achieved with diffusion-based posterior sampling (DPS) (Chung et al., 2023), a method that has not yet been investigated in the context of 3D point cloud data. We combine DPS with optimized diffusion schedules and second-order correction steps with adaptable noise injection (Karras et al., 2022) to improve sample quality and runtime. We demonstrate the fidelity and flexibility of our method on highly complex and diverse datasets of 3D point clouds from ShapeNet and the PDB.
We emphasize that the reconstruction problem which we solve differs from the problem of reconstructing a 3D surface from a 2D surface color image, which is tackled by, for example, Point-E (Nichol et al., 2022), Shap-E (Jun & Nichol, 2023), PC² (Melas-Kyriazi et al., 2023), One-2-3-45 (Liu et al., 2023) and BDM (Xu et al., 2024). The main difference is that in our work the 2D observations are projections that provide information about the density across the full volume rather than only information about the surface. In addition, our approach is also capable of conditioning on coarse-grained or partial observations of the 3D structure. Moreover, the reconstruction problem in cryo-EM aims to reconstruct the internal structure, not only the surface.

It is important to note that diffusion models have been used in prior work in the context of cryo-EM but have not been used for the 2D to 3D reconstruction problem. DiffModeler (Wang et al., 2024) uses a diffusion model as part of its pipeline to trace the backbone of protein chains within already reconstructed 3D cryo-EM maps to model large protein complexes. Zhang et al. (2024) utilized a diffusion model to denoise and restore 2D cryo-EM images of single particles. Kreis et al. (2022) trained diffusion models on the CryoDRGN latent conformational space of single proteins to tackle the prior hole problem. Ingraham et al. (2023) used diffusion models to generate protein complexes at the atomic level and also showed generation conditioned on shape and symmetry constraints. So far, their diffusion model has not been used for the 2D to 3D reconstruction task.
2 BACKGROUND ON DIFFUSION MODELS

Diffusion models have gained wide recognition in the field of generative modeling (Sohl-Dickstein et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021), particularly in image synthesis, where diffusion models have demonstrated their capability by surpassing former leading models in key metrics (Dhariwal & Nichol, 2021) and continue to set new records (Karras et al., 2024). In generative modeling, the main goal is to learn a sampler for an unknown distribution $p_0$ from i.i.d. samples $x(0)_i \sim p_0$ that serve as training data. A diffusion model tries to achieve this goal by approximating a probability flow from a latent Gaussian distribution $p_T$ to the unknown target $p_0$. For this purpose, a forward process from the target distribution $p_0$ to the latent distribution $p_T$ is defined in terms of a stochastic differential equation (SDE) of the form

$$dx = f(x, t)\,dt + g(t)\,dw_t, \tag{3}$$

where $w_t$ is a Wiener process, $f(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is the drift of $x(t)$ and $g: \mathbb{R} \to \mathbb{R}$ is the diffusion coefficient (Song et al., 2021). Starting at time $t = 0$ with samples $x(0) \sim p_0$ from the target distribution, process (3) is designed such that it gradually destroys the information content of the samples $x(0)$ by transforming them into samples $x(T)$ from an isotropic Gaussian.

A diffusion model aims to represent the reverse process from $p_T$ to $p_0$ such that we can draw noise from a Gaussian distribution and slowly transform it into samples from the data distribution $p_0$. Anderson (1982) showed that the forward process (3) has a reverse process of the form

$$dx = \left[f(x, t) - g(t)^2\,\nabla_x \log p_t(x)\right] dt + g(t)\,dw_t \tag{4}$$

with $p_t(x(t)) = \int p_{0t}(x(t)|x(0))\,p_0(x(0))\,dx(0)$ being the marginal distribution of $x(t)$, where $p_{0t}$ is the perturbation kernel from time $0$ to $t$. The score $\nabla_x \log p_t(x)$ of the marginals is unknown and has to be approximated with a parametric score model $s_\theta(x(t), t)$.
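As a minimal sketch of the forward process in Eq. (3), the following pure-Python snippet simulates a scalar SDE with the Euler–Maruyama scheme. It assumes the drift $f = 0$ and diffusion coefficient $g(t) = \sqrt{2t}$ that are adopted later in Section 3.1; the function name `simulate_forward`, the step count and the sample size are illustrative choices and do not come from the paper.

```python
import math
import random


def simulate_forward(x0, t_end, n_steps, rng,
                     f=lambda x, t: 0.0,
                     g=lambda t: math.sqrt(2.0 * t)):
    """Euler-Maruyama simulation of dx = f(x, t) dt + g(t) dw from t = 0 to t_end."""
    dt = t_end / n_steps
    x = x0
    for i in range(n_steps):
        t = i * dt
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment, distributed as N(0, dt)
        x = x + f(x, t) * dt + g(t) * dw
    return x


rng = random.Random(0)
t_end = 2.0
samples = [simulate_forward(0.5, t_end, 200, rng) for _ in range(4000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# With f = 0 and g(t) = sqrt(2 t) the perturbation kernel is N(x(0), t^2),
# so the mean should stay near 0.5 and the variance should be close to t_end**2.
print(mean, var)
```

The empirical variance only approximates $t^2$ because of the finite step size and the finite number of samples; increasing both tightens the match.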
Diffusion model training works by applying gradient descent to the denoising score matching (DSM) objective to train $s_\theta$:

$$\min_\theta \; \mathbb{E}_{t,\,x(0),\,x(t)} \left[ \lambda(t) \left\| \nabla_{x(t)} \log p_{0t}(x(t)|x(0)) - s_\theta(x(t), t) \right\|^2 \right] \tag{5}$$

where $t \sim p_{\mathrm{train}}$, $x(t) \sim p_t$, $x(0) \sim p_0$ and $x(t) \sim p_{0t}(\cdot|x(0))$ with the loss weighting $\lambda: \mathbb{R}_+ \to \mathbb{R}_+$. In DSM, we only need to evaluate the score of the perturbation kernel $p_{0t}$, which is easy to calculate for suitable choices of the drift and diffusion coefficient (consider, for instance, the variance exploding or variance preserving schedules in Song et al. (2021)). More background on the training process can be found in Appendix A.2.

After training, the score model $s_\theta$ can be used as a replacement for $\nabla_x \log p_t(x)$ to generate new data by sampling the latent model $p_T$ and simulating the reverse SDE in equation (4) backward in time. The reverse SDE can be simulated with numerical methods such as Euler-Maruyama, starting from $T$ and ending shortly before $0$ to avoid numerical errors.

2.1 DIFFUSION MODELS IN 3D

Apart from the 2D image domain, diffusion models have been employed to estimate the distribution of 3D objects. Various representations have been used including point clouds (Luo & Hu, 2021; Vahdat et al., 2022; Nichol et al., 2022; Zhou et al., 2021), meshes and implicit neural representations (Jain et al., 2021; Jun & Nichol, 2023; Erkoç et al., 2023) such as neural radiance fields (Mildenhall et al., 2021). Here, we employ a point cloud representation and adopt the point transformer architecture from Nichol et al. (2022). This representation allows us to model 3D volume densities such as cryo-EM maps efficiently, unlike meshes, which only model the surface. Furthermore, the point cloud representation simplifies the process of developing likelihoods for the cryo-EM reconstruction problem. In addition, we avoid any kind of latent diffusion (as, for example, proposed by Vahdat et al. (2022)) for which likelihood guidance is more difficult (Song et al., 2024).
2.2 DIFFUSION POSTERIOR SAMPLING

In many practical applications such as text-to-image or class-to-image generation, the focus is on sampling from the posterior $x(0) \sim p_0(\cdot|y)$ given some input $y$. In this case, our unconditional score $\nabla_{x(t)} \log p_t(x(t))$ will be extended to

$$\nabla_{x(t)} \log p_t(x(t)|y) \overset{\text{Bayes' rule}}{=} \nabla_{x(t)} \log p_t(y|x(t)) + \nabla_{x(t)} \log p_t(x(t)). \tag{6}$$

Given pairs of training data $\{(x(0)_i, y_i)\}$, we could train a diffusion prior plus a classifier $p_t(y|x(t))$ and use its score $\nabla_{x(t)} \log p_t(y|x(t))$ during inference for classifier guidance (Dhariwal & Nichol, 2021). Another popular option is to perform classifier-free guidance and directly train $\nabla_{x(t)} \log p_t(x(t)|y)$ (Ho & Salimans, 2022). For example, Zhou et al. (2021) used this approach for 3D shape completion and 3D shape reconstruction from a single depth map.

Another line of work attempts to avoid task-specific training and instead uses the known forward model to guide the generation process (Chung et al., 2023; 2022; Ho et al., 2022; Lugmayr et al., 2022; Song et al., 2021; Trippe et al., 2023a;b; Dou & Song, 2024; Cardoso et al., 2023). In tasks with a known forward model like inpainting, shape completion or colorization, we have access to a likelihood $p_0(y|x(0))$ based on the noiseless data $x(0)$. Chung et al. (2023) make use of this likelihood by approximating the score of the posterior by

$$\nabla_{x(t)} \log p_t(x(t)|y) \approx \zeta\,\nabla_{x(t)} \log p_0(y|D_\theta(x(t), t)) + \nabla_{x(t)} \log p_t(x(t)) \tag{7}$$

with weighting $\zeta > 0$ and denoising function $D_\theta$, which is an estimator of $D(x(t), t) := \mathbb{E}_{x(0) \sim p(\cdot|x(t))}[x(0)]$ that is learned during the training of the diffusion model (see Section 3.1 and Appendix A.2). This approximation approach, called reconstruction guidance in Ho et al. (2022), has been applied across multiple contexts with prominent results in ill-posed inverse problems of 2D images (Chung et al., 2023; 2022; Ho et al., 2022; Trippe et al., 2023a). Simpler approaches such as the replacement method of Song et al. (2021) are computationally cheaper because they do not need additional backpropagation. However, the replacement method sometimes suffers from severe artifacts (Lugmayr et al., 2022; Chung et al., 2023). Most recently, several approaches used reweighting schemes within the Sequential Monte Carlo (SMC) framework to derive exact methods for diffusion posterior sampling (Trippe et al., 2023a;b; Cardoso et al., 2023; Dou & Song, 2024). However, the guarantee of exactness is in our case of limited practical relevance (due to high computational demands) because the required number of particles in SMC tends to be excessively large (Gupta et al., 2024).

3 THEORETICAL FRAMEWORK

Our theoretical framework is inspired by existing diffusion models and uses reconstruction guidance provided by forward models for 3D reconstruction from sparse observations in 2D and 3D.

3.1 3D DIFFUSION PRIOR TRAINING AND SAMPLING

We follow the design choice recommendations of Karras et al. (2022) using $f(x, t) = 0$ and $g(t) = \sqrt{2t}$, which yield the forward diffusion SDE $dx = \sqrt{2t}\,dw_t$ and the perturbation kernel $p_{0t}(x(t)|x(0)) = \mathcal{N}(x(t); x(0), t^2 I)$. For the loss weighting $\lambda(t)$, $p_{\mathrm{train}}(t)$ and the score model parameterization $s_\theta(x(t), t) = (D_\theta(x(t), t) - x(t))/t^2$ we also followed Karras et al. (2022) (more details can be found in Appendix A.2). During inference time, we utilize the more general version of the reverse SDE presented by Karras et al. (2022), which has the same marginals as $dx = \sqrt{2t}\,dw_t$ and gives us more flexibility in choosing favorable sampling schemes:

$$dx = -\left[t + \beta(t)\,t^2\right] \nabla_x \log p_t(x)\,dt + \sqrt{2\beta(t)\,t^2}\,dw_t \tag{8}$$

where $\beta: \mathbb{R}_+ \to \mathbb{R}_+$ is a function that controls how noisy the trajectory behaves. The choice $\beta(t) = 1/t$ results in Eq. (4) as a special case, whereas $\beta(t) = 0$ yields an ordinary differential equation (ODE) called the flow ODE.
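As a quick check of the two special cases just mentioned, the choices of $\beta$ can be substituted into Eq. (8). This worked step uses only $f(x, t) = 0$ and $g(t) = \sqrt{2t}$ from the beginning of this section:

```latex
\beta(t) = \tfrac{1}{t}: \quad
dx = -\left[t + \tfrac{1}{t}\,t^2\right] \nabla_x \log p_t(x)\,dt
     + \sqrt{2\,\tfrac{1}{t}\,t^2}\;dw_t
   = -2t\,\nabla_x \log p_t(x)\,dt + \sqrt{2t}\,dw_t
   = \left[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\right] dt + g(t)\,dw_t
\\[6pt]
\beta(t) = 0: \quad
dx = -t\,\nabla_x \log p_t(x)\,dt
```

The first line reproduces Eq. (4) term by term, and the second line contains no Wiener increment, which is why it is a deterministic ODE.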
In practice, the score of the marginals $\nabla_x \log p_t(x)$ is replaced by the score estimator $(D_\theta(x, t) - x)/t^2$ and the differential equation must be solved backward in time by a numerical integrator such as Euler-Maruyama for a specific time interval $t \in [t_{\min}, t_{\max}]$ where $t_{\min} > 0$. The time interval must be discretized into $N$ time steps $\{t_{\max} = t_0 > \dots > t_{N-1} = t_{\min} > t_N = 0\}$. More time steps result in a more accurate simulation of the SDE, but also increase the number of network function evaluations (NFE). Accurate simulation of the SDE can be especially difficult in areas with a high curvature in the trajectory, which is typically prominent at smaller $t$. We therefore adopt the time step heuristic of Karras et al. (2022):

$$t_i = \left(t_{\max}^{1/\rho} + \frac{i}{N-1}\left(t_{\min}^{1/\rho} - t_{\max}^{1/\rho}\right)\right)^{\rho}$$

with $i < N$, $\rho \geq 1$ and $t_N = 0$, where an increase in $\rho$ leads to more time steps in the lower part of the time frame. We found that $\rho = 3$ works well for sampling 3D point clouds. Algorithm 1 with $\nabla \log \tilde{p}_t(x|y) = (D_\theta(x, t) - x)/t^2$ implements unguided diffusion prior sampling using Euler-Maruyama with correction step.

3.2 DIFFUSION POSTERIOR SAMPLING FOR 3D RECONSTRUCTION

To sample the trained diffusion prior in the light of observations $y$ originating from a known forward process, we use reconstruction guidance (Chung et al., 2023). In contrast to Chung et al. (2023), we apply a more advanced diffusion schedule (EDM (Karras et al., 2022) instead of VP-SDE (Song et al., 2021)) to enhance the capabilities of the proposed guidance strategy. We supplement the schedule with a stochastic Euler-Maruyama integrator that uses a second-order correction step, because both the use of stochasticity (solving an SDE rather than an ODE) and second-order samplers have been shown to improve image generation performance in the unconditional setting (Karras et al., 2022). We observed that this also holds for our conditional setting in 3D (see Table 2).
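The time step heuristic and the unguided sampler of Algorithm 1 can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions, not the authors' implementation: the state is a scalar rather than a point cloud, the prior is a Gaussian $\mathcal{N}(0, S^2)$ whose ideal denoiser is known in closed form, and $t_{\min} = 0.002$ and $t_{\max} = 80$ are placeholder values (the paper's settings are given in its appendix). The function names are hypothetical; the threshold in `beta` follows the schedule quoted in the caption of Table 2.

```python
import math
import random


def time_steps(n, t_min=0.002, t_max=80.0, rho=3.0):
    """Return [t_0 = t_max > ... > t_{n-1} = t_min > t_n = 0]."""
    a, b = t_max ** (1 / rho), t_min ** (1 / rho)
    return [(a + i / (n - 1) * (b - a)) ** rho for i in range(n)] + [0.0]


def beta(t, threshold=0.15):
    """Noise control function: 1/t above the threshold, otherwise 0."""
    return 1.0 / t if t > threshold else 0.0


def sample(score, ts, rng):
    """Scalar version of Algorithm 1; score(x, t) plays the role of grad log p~_t."""
    x = rng.gauss(0.0, ts[0])                    # x(t_0) ~ N(0, t_0^2)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        dt = t_cur - t_next
        s_cur = score(x, t_cur)
        x_next = x + t_cur * s_cur * dt          # Euler predictor
        if t_next != 0.0:                        # correction step + noise injection
            b = beta(t_cur)
            d = (t_cur + b * t_cur ** 2) * (s_cur + score(x_next, t_next)) * dt / 2
            n = rng.gauss(0.0, math.sqrt(2 * b * t_cur ** 2 * dt))
            x_next = x + d + n
        x = x_next
    return x


S = 0.5  # standard deviation of the toy prior (assumption)


def toy_score(x, t):
    """(D(x, t) - x) / t^2 with the exact denoiser D(x, t) = x S^2 / (S^2 + t^2)."""
    return (x * S ** 2 / (S ** 2 + t ** 2) - x) / t ** 2


rng = random.Random(0)
ts = time_steps(40)
draws = [sample(toy_score, ts, rng) for _ in range(2000)]
mean = sum(draws) / len(draws)
var = sum((v - mean) ** 2 for v in draws) / len(draws)
print(ts[0], ts[-2], mean, var)  # var should be roughly S**2 = 0.25
```

With only 40 time steps the sample variance is close to, but not exactly, $S^2$; the discretization error shrinks as more steps are used. For guided sampling, `score` would additionally include the likelihood term of the approximate posterior score.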
For conditional generation, we extend the score of the marginals from the diffusion prior $\nabla_{x(t)} \log p_t(x(t))$ with an approximate score of the perturbed likelihoods:

$$\nabla_{x(t)} \log p_t(x(t)) + \zeta\,\nabla_{x(t)} \log p_0(y|D_\theta(x(t), t)) =: \nabla_{x(t)} \log \tilde{p}_t(x(t)|y) \tag{9}$$

with $\zeta = \alpha(t) / \sqrt{\log p_0(y|D_\theta(x(t), t))}$ following Chung et al. (2023). Algorithm 1 illustrates our method for conditional generation with reconstruction guidance. In order to apply this methodology to reconstruct partially observed 3D volumes represented as point clouds, we now describe the forward processes that we consider.

Single 2D projection to 3D. In the simplest version of the reconstruction problem, we partially observe a single 2D projection of a 3D object in a known orientation. Here we represent the structure of an object as a 3D point cloud $x(0) \in \mathbb{R}^{N \times 3}$ with $N$ points and the corresponding 2D projection $y_1 \in \mathbb{R}^{M \times 2}$ as a 2D point cloud consisting of $M$ points. We define the likelihood of observing the projection $y_1$ given $x(0)$ as $p_0(y_1|x(0)) \propto \exp(-E_1(x(0)))$ where the energy is defined as

$$E_k(x(0)) := \min_{P \in \mathcal{P}_{N \times N}} \left\| P U y_k - (x(0) R_k)_{:,(1,2)} \right\|_F^2 \tag{10}$$

for $k = 1$ with permutation matrices $\mathcal{P}_{N \times N} \subset \{0, 1\}^{N \times N}$, orthogonal matrix $R_k \in O(3)$, Frobenius norm $\|\cdot\|_F$ and the linear operator $U \in \{0, 1\}^{N \times M}$ that upsamples¹ $y_k$ by randomly redrawing points. The permutation matrix $P$ assigns each point in $U y_k$ to a single point in the rotated and projected object $(x(0) R_k)_{:,(1,2)}$. The introduction of $P$ arises from the assumption of a hidden one-to-one correspondence between the upsampled points $U y_k$ and the points in $x(0)$. The inner optimization problem is a linear assignment problem (Crouse, 2016) that can be solved exactly in polynomial time by using the Hungarian method (Kuhn, 1955). We stress that due to the missing correspondence information, the 3D reconstruction problem from 2D projections with known orientations is non-trivial and severely ill-posed.

¹ Here we look at the case $M \leq N$; however, this formulation can also be used to downsample $y_k$ if $M > N$.

Multiple 2D projections to 3D. To generalize the above forward process, we consider the case of observing $K$ projections $y = \{y_1, \dots, y_K\}$ of an object $x(0)$ from known orientations $R = \{R_1, \dots, R_K\}$. Then the likelihood of observing the set of projections $y$ from orientations $R$ given $x(0)$ is the product of all independent observations:

$$p_0(y|x(0)) = \prod_k p_0(y_k|x(0)) \propto \exp\left(-\sum_k E_k(x(0))\right) \tag{11}$$

Coarse to fine grained. We can also guide the diffusion prior by a 3D point cloud with fewer points $M < N$ representing a low-resolution version $y_{cg} \in \mathbb{R}^{M \times 3}$ of $x(0) \in \mathbb{R}^{N \times 3}$. From this coarser observation, we want to infer a higher resolution structure. In order to characterize the relation between different resolutions, we employ a likelihood similar to the one used for 2D projections, $p_0(y_{cg}|x(0)) \propto \exp(-E_{cg}(x(0)))$ where the energy is defined as

$$E_{cg}(x(0)) := \min_{P \in \mathcal{P}_{N \times N}} \left\| P U y_{cg} - x(0) \right\|_F^2. \tag{12}$$

From subunit to full 3D reconstruction. If available, we can further update our prior knowledge encoded in the diffusion model by utilizing information about parts or subunits of the unknown 3D structure. Thus we define the energy for the likelihood $p_0(y_{su}|x(0)) \propto \exp(-E_{su}(x(0)))$ of observing the subunit $y_{su} \in \mathbb{R}^{L \times 3}$ given $x(0) \in \mathbb{R}^{N \times 3}$ by

$$E_{su}(x(0)) := \min_{P \in \mathcal{P}_{L \times N}} \left\| P x(0) - y_{su} \right\|_F^2 \tag{13}$$

with partial permutation matrices $\mathcal{P}_{L \times N} \subset \{0, 1\}^{L \times N}$ that pick $L$ out of $N$ points in $x(0)$ to create a one-to-one correspondence to the $L$ points in $y_{su}$.

We can also combine likelihoods for all possible observations $y = \{y_{su}, y_{cg}, y_1, \dots, y_K\}$ of the 3D structure to update the prior knowledge encoded in the diffusion prior. To enable the assignment of importance or uncertainty to each dataset, we can weight the corresponding energies:

$$p_0(y|x(0)) \propto \exp\left(-w_{su} E_{su}(x(0)) - w_{cg} E_{cg}(x(0)) - \sum_k w_k E_k(x(0))\right) \tag{14}$$

with weights $w_{su}, w_{cg}, w_k \geq 0$, coarse-grained structure $y_{cg}$, subunit $y_{su}$, 2D observations $\{y_1, \dots, y_K\}$, orientations $R$ and 3D structure $x(0)$. In the experiments of this work, we apply equal weighting of $1/|y|$ to all the observations.
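To make the projection energy of Eq. (10) concrete, the sketch below evaluates it for a tiny point cloud. It is an illustration under stated assumptions: the minimum over permutations is found by brute force, which is only feasible for a handful of points and stands in for the Hungarian method named in the text (in practice a linear assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the loop). The helper names `upsample` and `projection_energy` and the example coordinates are hypothetical.

```python
import itertools
import random


def upsample(y, n, rng):
    """Pad y to n points by randomly redrawing existing points (the role of U)."""
    return list(y) + [rng.choice(y) for _ in range(n - len(y))]


def projection_energy(x, y_up, rotation):
    """Brute-force E_k: minimum over permutations of the squared Frobenius
    distance between upsampled 2D observations and the projected cloud."""
    # Row vectors times the 3x3 rotation matrix: x R_k
    rotated = [[sum(p[i] * rotation[i][j] for i in range(3)) for j in range(3)]
               for p in x]
    proj = [(p[0], p[1]) for p in rotated]  # keep the first two columns
    best = float("inf")
    for perm in itertools.permutations(range(len(proj))):
        cost = sum((y_up[i][0] - proj[j][0]) ** 2 + (y_up[i][1] - proj[j][1]) ** 2
                   for i, j in enumerate(perm))
        best = min(best, cost)
    return best


x = [[0.1, 0.2, 0.3], [-0.5, 0.4, 0.0], [0.7, -0.6, 0.2],
     [0.0, 0.0, -0.9], [0.3, 0.3, 0.3]]
rot_z90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # maps (a, b, c) to (b, -a, c)

rng = random.Random(0)
y_full = [(p[1], -p[0]) for p in x]  # noiseless projection of the rotated cloud
rng.shuffle(y_full)                  # the point correspondence is hidden
e_full = projection_energy(x, y_full, rot_z90)

y_sparse = upsample(y_full[:3], len(x), rng)  # only M = 3 observed points
e_sparse = projection_energy(x, y_sparse, rot_z90)
print(e_full, e_sparse)
```

A complete, noiseless projection yields zero energy regardless of the point order, whereas the upsampled sparse observation leaves a positive residual because redrawn duplicates cannot all be matched to distinct points. The energies of several observations can then be added with weights as in Eq. (14).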
The likelihood guidance of the diffusion prior allows us to flexibly incorporate all this information with varying shapes, thereby avoiding task-specific retraining.

Algorithm 1 Approximate posterior sampling with correction step
Input: noise control function $\beta$, time steps $\{t_0 > t_1 > \dots > t_N = 0\}$, observations $y$
Output: approximate sample from $p_0(x(0)|y)$
    $x(t_0) \sim \mathcal{N}(0, t_0^2 I)$
    for $i \in \{0, \dots, N-1\}$ do
        $\Delta t \leftarrow t_i - t_{i+1}$
        $x(t_{i+1}) \leftarrow x(t_i) + t_i\,\nabla \log \tilde{p}_t(x(t_i)|y)\,\Delta t$
        if $t_{i+1} \neq 0$ then    ▷ correction step + noise injection
            $d \leftarrow (t_i + \beta(t_i)\,t_i^2)\,[\nabla \log \tilde{p}_t(x(t_i)|y) + \nabla \log \tilde{p}_t(x(t_{i+1})|y)]\,\Delta t / 2$
            $n \sim \mathcal{N}(0, 2\beta(t_i)\,t_i^2\,\Delta t\,I)$
            $x(t_{i+1}) \leftarrow x(t_i) + d + n$
        end if
    end for
    return $x(t_N)$

4 EXPERIMENTS

To demonstrate the fidelity and flexibility of our approach, we conducted multiple experiments. For this, we trained diffusion priors on multiple different 3D datasets and tested their usefulness on a variety of 3D reconstruction tasks.

4.1 DIFFUSION PRIOR TRAINING

We trained a diffusion prior for each of three datasets from multiple domains, which differ in their level of complexity.

(A) ShapeNet-Chair: 2658 point clouds from the training split of the ShapeNet dataset in the category “Chair” accessed via PyTorch Geometric (Chang et al., 2015; Fey & Lenssen, 2019). During training, we randomly subsampled 1024 points from each point cloud and applied random orthogonal transformations to augment the dataset.

(B) ShapeNet-Mixed: 10693 point clouds from the training split of the ShapeNet dataset in the categories “Airplane”, “Bag”, “Cap”, “Car”, “Chair”, “Guitar”, “Laptop”, “Motorbike”, “Mug”, “Pistol”, “Rocket”, “Skateboard” and “Table” (all categories from ShapeNet with point clouds of at least 1024 points) accessed via PyTorch Geometric (Chang et al., 2015; Fey & Lenssen, 2019). Again, we applied subsampling and augmentation with random orthogonal transformations to the training data.
(C) CryoStruct: 6629 point clouds representing mixture models of size 1024 constructed from the 3D atom positions of biomolecular complexes from the PDB contained in the train split of the curated Cryo2StructDataset (Giri et al., 2024). The mixture models were created using the scikit-learn GaussianMixture method with covariance matrix shared among the components (Pedregosa et al., 2011). We also augmented the dataset by randomly rotating the biomolecular complexes.

The point clouds in all three datasets are centered and scaled so as to fit into the $[-1, 1]$ cube. Figures 3, 4, and 5 in the appendix present images of unconditional samples from the diffusion priors. Following the methodology of Yang et al. (2019), we present the 1-nearest neighbor accuracy (1-NNA), coverage (COV), and minimum matching distance (MMD) in Table 3 in the appendix to quantify the performance of the diffusion model.

4.2 3D RECONSTRUCTION ON SHAPENET

To demonstrate the performance and flexibility of our method on the widely used ShapeNet benchmark (Chang et al., 2015), we conducted experiments across nine different configurations. An advantage of the ShapeNet reconstruction tasks is that it is easier to visually judge the quality of the reconstructions than for the CryoStruct reconstruction tasks. In each setting, we took the first 100 instances from both the ShapeNet-Chair and ShapeNet-Mixed test sets as ground truth and created sparse observations $y$. These observations include 2D projections, coarse-grained point clouds, or subunits. The 2D projections are constructed by sampling points from the ground truth and applying a random orthogonal transformation to the sub-sampled points before projecting them onto the xy-plane. The coarse-grained point clouds are constructed by taking the means from a mixture model fitted to the ground truth point cloud. A subunit, i.e. a partial structure, corresponds to a single k-means cluster selected randomly from the ground truth.
We applied our version of approximate DPS (see Algorithm 1) to generate ten 3D reconstructions per instance using only 40 time steps (additional details on the parameters can be found in Section A.3 of the Appendix). We compared our method to the ML approach obtained by maximizing the same log-likelihood that was also used to guide the diffusion prior during approximate DPS. Starting from 10 different random clouds with points uniformly distributed in $[-1, 1]^3$, we performed gradient descent for 100 steps using the Adam optimizer with a learning rate of 0.01 (Kingma & Ba, 2014). By using the same likelihoods without the diffusion model, we can assess how much we gain in 3D reconstruction performance by utilizing a diffusion prior.

Similar to the approach of Yang et al. (2019), we measure the 3D reconstruction error between a reconstructed point cloud and the ground truth with the Chamfer Distance (CD) and the Earth Movers Distance (EMD). The values in Table 1 are the means and standard deviations of all $100 \times 10$ reconstruction errors measured in CD and EMD as well as the negative log-likelihood (energy $E$) of the corresponding forward model.

Table 1 shows that, as expected, in most cases the maximum likelihood approach creates 3D reconstructions with a higher likelihood (lower energy $E$) of observing the input data $y$ than DPS. However, in the face of the ill-posedness of the reconstruction tasks, it is not sufficient to simply optimize the likelihood. This explains why the incorporation of the diffusion prior consistently results in better reconstruction errors in all test cases for both EMD and CD, although for most test cases the likelihoods obtained with DPS are worse than those obtained with ML. Prominent example reconstructions that demonstrate the superior performance of DPS are shown in Figure 1.
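For readers unfamiliar with the Chamfer Distance used above, the following sketch shows one common convention: the mean squared distance from each point to its nearest neighbour in the other cloud, summed over both directions. Whether the paper uses squared or unsquared distances, or sums instead of means, is not stated in this section, so the exact normalization here is an assumption, as are the function name and the toy coordinates.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance with squared Euclidean nearest-neighbour terms."""
    def sq_dist(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q))

    def one_sided(src, dst):
        # Average over src of the squared distance to the closest point in dst
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(a, b) + one_sided(b, a)


truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]
cd = chamfer_distance(truth, recon)
print(cd)
```

Unlike the Earth Movers Distance, this measure does not require a one-to-one matching between the two clouds, so several points may share the same nearest neighbour.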
The diffusion prior helps navigate the space of possible 3D reconstructions with high likelihood toward those with typical ShapeNet structures, information that is not sufficiently provided by the observations $y$ themselves. Therefore the structural models obtained with DPS are also visually much closer to the ground truth with notable ShapeNet-like characteristics.

Figure 1: Results for five different reconstruction tasks. In all examples, the ML reconstruction has a higher likelihood of observing the input data than the models obtained with approximate DPS. However, the ML-based models show a higher reconstruction error than those from DPS. The results are also part of the tests presented in Table 1 and correspond to rows 9, 8, 1, 8 and 2 (from left to right).

Table 1: Results from the 3D reconstruction task from sparse data. Tests were conducted on the test partition of the ShapeNet (Mixed, Chair) datasets under various configurations, altering the number of points per projection, coarse-grained structure and subunit. We compared our variant of approximate diffusion posterior sampling (DPS) to the maximum likelihood (ML) approach. To quantify the error between the reconstructions and the ground truth point clouds we calculated the mean Chamfer Distance (CD) and mean Earth Movers Distance (EMD) over in total 1k reconstructions (10 samples for each of the 100 test instances). For further analysis we also show the energy of the forward model ($E$).

Row  Category  Method  Projection points  Projections  Coarse-grained points  Subunit points  CD (×10², ↓)    EMD (×10², ↓)   E (×10³, ↓)
1    Chair     DPS     200                5            -                      -               9.98 ± 2.38     8.32 ± 1.77     3.80 ± 1.02
1    Chair     ML      200                5            -                      -               13.30 ± 2.00    11.03 ± 1.69    3.68 ± 0.88
2    Chair     DPS     200                6            -                      -               9.71 ± 1.81     7.78 ± 1.35     3.98 ± 0.87
2    Chair     ML      200                6            -                      -               12.53 ± 1.53    10.04 ± 1.30    3.87 ± 0.79
3    Mixed     DPS     400                4            -                      -               10.56 ± 4.09    8.18 ± 3.40     2.41 ± 1.01
3    Mixed     ML      400                4            -                      -               12.37 ± 2.34    10.21 ± 2.09    2.38 ± 0.79
4    Mixed     DPS     400                5            -                      -               9.29 ± 2.65     7.00 ± 2.22     2.30 ± 0.89
4    Mixed     ML      400                5            -                      -               11.78 ± 1.88    9.39 ± 1.77     2.72 ± 1.27
5    Chair     DPS     300                1            30                     -               10.38 ± 2.24    10.66 ± 3.23    6.21 ± 1.76
5    Chair     ML      300                1            30                     -               12.40 ± 2.16    11.89 ± 2.84    5.04 ± 1.52
6    Mixed     DPS     300                1            30                     -               9.36 ± 2.23     9.21 ± 2.47     5.57 ± 2.11
6    Mixed     ML      300                1            30                     -               11.99 ± 1.89    11.08 ± 2.01    5.13 ± 1.69
7    Chair     DPS     200                2            -                      ≈256            13.21 ± 5.69    12.86 ± 5.62    2.51 ± 0.66
7    Chair     ML      200                2            -                      ≈256            16.98 ± 4.58    16.63 ± 4.85    1.84 ± 0.63
8    Mixed     DPS     200                2            -                      ≈256            11.11 ± 4.13    11.19 ± 5.09    2.47 ± 0.86
8    Mixed     ML      200                2            -                      ≈256            18.14 ± 6.28    17.99 ± 6.59    2.22 ± 1.05
9    Mixed     DPS     200                1            30                     ≈128            8.55 ± 1.97     9.38 ± 1.96     4.17 ± 1.64
9    Mixed     ML      200                1            30                     ≈128            11.19 ± 1.82    10.90 ± 1.88    3.80 ± 1.49

Table 2: Evaluation of the improvement we obtain by switching from integrating the flow ODE ($\beta(t) = 0$) using Euler's method in A to the integration of the SDE ($\beta(t) = 1/t$ if $t > 0.15$ and else $0$) using the Euler–Maruyama method in B. In C, we observe that the reconstruction error is lowered further by adding a second-order correction step. The test errors have been studied on the ShapeNet-Mixed reconstruction task given a subunit with ≈256 points and two projection images with 200 points each (row 8 in Table 1). In all three schedules, we used 79 NFE, which amounts to 79 time steps in A and B and 40 time steps in C.

Schedule              CD (×10², ↓)    EMD (×10², ↓)   E (×10³, ↓)
A  Euler ODE          14.38 ± 5.64    13.98 ± 5.94    3.09 ± 1.19
B  + noise            11.80 ± 3.94    11.71 ± 5.03    2.61 ± 0.85
C  + correction step  11.11 ± 4.13    11.19 ± 5.09    2.47 ± 0.86

4.3 DIFFUSION POSTERIOR SAMPLING FOR CRYO-EM

We also benchmark posterior sampling with diffusion priors on various reconstruction tasks arising in the context of cryo-EM reconstruction. We are mostly interested in sparse data scenarios. This might appear to be at odds with the fact that cryo-EM tends to produce many hundreds of thousands of images. Our interest is in reconstructing intermediate resolution structures from very few images, with the goal of elucidating structural differences between individual copies of the biomolecule.
These structural variations are expected to occur because biomolecular complexes are flexible and undergo conformational changes. Conformational heterogeneity is often linked to the biological function of a macromolecular complex and is of particular interest to the structural biologist (Toader et al., 2023).

We designed various benchmarks based on a held-out set of 100 structures from Cryo2StructDataset that were not used in the training of the diffusion prior. The reconstruction tasks involve sparse 2D and/or 3D information. Again, as a baseline we used ML models obtained by maximizing the likelihood without the diffusion prior (a detailed presentation of the results can be found in the Supplementary Material, Sections A.4.1 to A.4.7). We generated ten models with and without the diffusion model per reconstruction task. To assess the accuracy of the model structures, we compare them against the atomic structure deposited in the PDB and against the point cloud obtained by fitting a 1024-component mixture of Gaussians, which was used for the generation of the input measurements.

The 3D points generated by ML and DPS tend to concentrate in $[-1, 1]^3$. Before a meaningful comparison between the ground truth and model structures can be made, we first need to scale the model points so as to match the physical units of the coordinates in the PDB file (which are in Å). We achieve this by matching the radii of gyration. However, there could still be a mismatch between the ground truth and the scaled model resulting from a relative rotation and translation (rigid transformation) between the two point clouds. We estimate the optimal alignment of both point clouds by maximizing the kernel correlation (Tsin & Kanade, 2004). After scaling and superposition, we can meaningfully compare model point clouds against the atomic and coarse-grained ground truth structures.

Figure 2: Outcomes for five cryo-EM reconstruction tasks. The top row shows the sparse input measurements. The second row shows all ten point clouds generated with DPS. The third row shows the 1024 component means of a mixture model fitted to the atomic models (last row). (A) Nucleosome-CHD4 from five projections (PDB code 6ryr). (B) F-ATP Synthase from four projections (PDB code 6rdm). (C) RNA polymerase transcription open promoter complex with Sorangicin from three projections (PDB code 6vvy). (D) Human spliceosome after Prp43 loaded from one projection and a low-resolution structure consisting of 40 particles (PDB code 6id1). (E) 26S proteasome from three projections and a known 20S structure (PDB code 6fvt).

We assess the accuracy of the models with the root mean square deviation (RMSD), which is commonly used to compare biomolecular structures. Since there is no one-to-one correspondence between the points in the cloud representing the ground truth (all heavy atoms in the PDB file or component means of the Gaussian mixture) and the models computed with ML or DPS, we compute

$$\mathrm{RMSD} = \left( \frac{1}{N} \sum_{n=1}^{N} \left\| x_n - x'_{\ell_n} \right\|^2 \right)^{1/2}$$

where $\ell_n \in \{1, \ldots, 1024\}$ encodes the correspondence between points $x_n$ representing the ground truth and points $x'_m$ representing the model (where $m \in \{1, \ldots, 1024\}$). In case the ground truth is represented by all heavy atoms, we set $\ell_n = \operatorname{argmin}_m \| x_n - x'_m \|$ (where the argmin runs over all $m$) and $N$ is the total number of heavy atoms in the PDB file. The 100 PDB structures in the test set vary widely in the number of heavy atoms, from $N = 2178$ to $N = 110541$. In case the ground truth is represented by the 1024 component means of the Gaussian mixture (also referred to as "subsampled structure" in the following), we compute $\ell_n$ by solving the linear assignment problem that matches the 1024 points representing the ground truth against the 1024 points in the model (in this case $N = 1024$).

Figure 2 shows representative cryo-EM reconstructions for five different sparse data scenarios.
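
As a rough illustration of the evaluation procedure described above (scaling by the radius of gyration, then RMSD under the two correspondence rules), the following Python sketch may help. It is not the authors' code: the function names are hypothetical, NumPy and SciPy are assumed to be available, centering both clouds at the origin and using squared Euclidean distances as assignment costs are assumptions, and the kernel-correlation superposition step is left out.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist


def radius_of_gyration(points):
    """Root mean square distance of the points from their centroid."""
    centered = points - points.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())


def match_scale(model, reference):
    """Center both clouds and rescale the model to the reference's radius of gyration."""
    model_c = model - model.mean(axis=0)
    reference_c = reference - reference.mean(axis=0)
    scale = radius_of_gyration(reference_c) / radius_of_gyration(model_c)
    return model_c * scale, reference_c


def rmsd_nearest(reference, model):
    """RMSD with l_n = argmin_m ||x_n - x'_m|| (all-heavy-atom reference)."""
    distances, _ = cKDTree(model).query(reference)  # nearest model point per reference point
    return np.sqrt((distances ** 2).mean())


def rmsd_assignment(reference, model):
    """RMSD with a one-to-one correspondence from the linear assignment problem."""
    cost = cdist(reference, model, metric="sqeuclidean")  # assumed cost: squared distances
    rows, cols = linear_sum_assignment(cost)
    return np.sqrt(cost[rows, cols].mean())


# Toy usage: a shuffled, rescaled copy of the reference should give an RMSD near zero.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1024, 3)) * 40.0      # stand-in for coordinates in Å
model = rng.permutation(reference) / 40.0          # same shape in "model units"
model_scaled, reference_centered = match_scale(model, reference)
print(rmsd_assignment(reference_centered, model_scaled))
print(rmsd_nearest(reference_centered, model_scaled))
```

Because every reference point may pick its closest model point independently, the nearest-neighbour RMSD can never exceed the assignment-based RMSD for the same pair of equally sized clouds.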
Figure 2A shows the results for a nucleosome-CHD4 complex (PDB code 6ryr, 17820 heavy atoms). Five 2D projections served as input for DPS reconstruction. The RMSD between the ten DPS models and the ground truth is 3.56 ± 0.04 Å (atomic structure) and 2.05 ± 0.09 Å (subsampled structure). We also inferred the structure of F-ATP synthase (PDB code 6rdm, 33891 heavy atoms) from four projections (Fig. 2B). The RMSD between the DPS models and the ground truth is 4.46 ± 0.02 Å (atomic structure) and 2.83 ± 0.03 Å (subsampled structure). The RNA polymerase transcription open promoter complex with Sorangicin (PDB code 6vvy, 30033 heavy atoms) was inferred from three projections (Fig. 2C). The RMSD between the DPS models and the ground truth is 4.87 ± 0.77 Å (atomic structure) and 3.39 ± 1.21 Å (subsampled structure). These tests show that intermediate resolution structures can be computed from very few 2D projections.

A common scenario in cryo-EM is that a low-resolution structure is already known and the goal of a cryo-EM study is to furnish structural details at higher resolution. This scenario was tested on the human intron lariat spliceosome (PDB code 6id1, 79882 heavy atoms). The structural models were computed from a single projection and a low-resolution structure represented by only 40 points (Fig. 2D). The RMSD between the DPS models and the ground truth is 10.35 ± 0.20 Å (atomic structure) and 9.82 ± 0.15 Å (1024 component means). Because the structure is huge and the input data for DPS are very sparse, the RMSD is worse than in the previous examples. Nevertheless, it is remarkable that such sparse information allows us to refine the coarse-grained spliceosome structure to a medium resolution.

The final example shows the power of DPS for 3D reconstruction from few projections and a subunit structure. This is a common scenario in structural biology where many partial structures have been determined and the challenge is to determine the full structure.
To test this scenario, we model the 26S proteasome (PDB code 6fvt, 110541 heavy atoms). Historically, a large part of the 26S proteasome, the 20S proteasome, was determined before the complete 26S structure could be elucidated by cryo-EM. In our tests, we use three projections and the structure of the 20S proteasome as input (Fig. 2E). The RMSD between the models obtained with DPS and the ground truth is 8.14 ± 0.12 Å (atomic structure) and 5.66 ± 0.19 Å (subsampled structure).

4.4 LIMITATIONS

A major limitation of the proposed method concerns its runtime. In each approximate DPS step with correction, we have to evaluate the gradient of the energy from our forward model twice. Overall, this means that we need $2 \times \#\text{timesteps} - 1$ network function evaluations and have to solve $(2 \times \#\text{timesteps} - 1) \times \#\text{observations}$ linear assignment problems to obtain a single 3D reconstruction. In practice, reconstructing a 3D structure from 6 input projections with 40 time steps within a batch of 10 takes ≈1.2 min per sample on an A100 GPU in combination with an Intel Xeon Platinum 8360Y 2.40 GHz CPU.

5 CONCLUSION

We propose a Bayesian approach for 3D reconstruction from sparse measurements such as 2D projections, coarse-grained structures, and/or substructures, using diffusion models as priors. Diffusion models are capable of encoding rich prior information about 3D structures and enable us to reconstruct meaningful 3D models from very sparse input data via approximate diffusion posterior sampling. Diffusion priors can distill rich data sources and thereby complement existing regularization techniques whenever such training data are available. The goal of future research is to improve the resolution of the 3D reconstructions.

ACKNOWLEDGMENTS

This work was supported by the Carl Zeiss Stiftung within the project “Interactive Inference”.
In addition, Michael Habeck acknowledges funding by the Carl Zeiss Stiftung within the program “CZS Stiftungsprofessuren” and is grateful for the support of the DFG within project 432680300 - SFB 1456 subproject A05.

REFERENCES

Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982. ISSN 0304-4149. doi: https://doi.org/10.1016/0304-4149(82)90051-5. URL https://www.sciencedirect.com/science/article/pii/0304414982900515.
Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
Gabriel Cardoso, Yazid Janati El Idrissi, Sylvain Le Corff, and Eric Moulines. Monte Carlo guided diffusion for Bayesian linear inverse problems. arXiv preprint arXiv:2308.07983, 2023.
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012 [cs.GR], Stanford University, Princeton University, Toyota Technological Institute at Chicago, 2015.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving Diffusion Models for Inverse Problems using Manifold Constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=nJJjv0JDJju.
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul Ye. Diffusion Posterior Sampling for General Noisy Inverse Problems. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k.
David Frederic Crouse. On implementing 2D rectangular assignment algorithms.
IEEE Transactions on Aerospace and Electronic Systems, 52:1679–1696, 2016. URL https://api.semanticscholar.org/CorpusID:20649848.
Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion Models Beat GANs on Image Synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=AAWuCvzaVt.
Zehao Dou and Yang Song. Diffusion Posterior Sampling for Linear Inverse Problem Solving: A Filtering Perspective. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=tplXNcHZs1.
Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating implicit neural fields with weight-space diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14300–14310, 2023.
Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric. arXiv preprint arXiv:1903.02428, 2019.
Nabin Giri, Liguo Wang, and Jianlin Cheng. Cryo2structdata: A large labeled cryo-em density map dataset for ai-based modeling of protein structures. Scientific Data, 11(1):458, 2024.
Shivam Gupta, Ajil Jalal, Aditya Parulekar, Eric Price, and Zhiyang Xun. Diffusion Posterior Sampling is Computationally Intractable. arXiv preprint arXiv:2402.12727, 2024.
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf.
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video Diffusion Models, 2022.
Aapo Hyvärinen.
Estimation of Non-Normalized Statistical Models by Score Matching. Journal of Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html.
John B. Ingraham, Max Baranov, Zak Costello, Karl W. Barber, Wujie Wang, Ahmed Ismail, Vincent Frappier, Dana M. Lord, Christopher Ng-Thow-Hing, Erik R. Van Vlack, Shan Tie, Vincent Xue, Sarah C. Cowles, Alan Leung, João V. Rodrigues, Claudio L. Morales-Perez, Alex M. Ayoub, Robin Green, Katherine Puentes, Frank Oplinger, Nishant V. Panwar, Fritz Obermeyer, Adam R. Root, Andrew L. Beam, Frank J. Poelwijk, and Gevorg Grigoryan. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, November 2023. ISSN 1476-4687. doi: 10.1038/s41586-023-06728-8. URL http://dx.doi.org/10.1038/s41586-023-06728-8.
Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-Shot Text-Guided Object Generation with Dream Fields. CoRR, abs/2112.01455, 2021. URL https://arxiv.org/abs/2112.01455.
Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of Diffusion-Based Generative Models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=k7FuTOWMOc7.
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and Improving the Training Dynamics of Diffusion Models, 2024.
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:6628106.
Karsten Kreis, Tim Dockhorn, Zihao Li, and Ellen Zhong. Latent space diffusion models of cryo-em structures. arXiv preprint arXiv:2211.14169, 2022.
H. W. Kuhn.
The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109.
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su. One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 22226–22246. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/4683beb6bab325650db13afd05d1a14a-Paper-Conference.pdf.
Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using Denoising Diffusion Probabilistic Models. CoRR, abs/2201.09865, 2022. URL https://arxiv.org/abs/2201.09865.
Shitong Luo and Wei Hu. Diffusion Probabilistic Models for 3D Point Cloud Generation. CoRR, abs/2103.01458, 2021. URL https://arxiv.org/abs/2103.01458.
Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi. Pc2: Projection-conditioned point cloud diffusion for single-image 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12923–12932, 2023.
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1):99–106, dec 2021. ISSN 0001-0782. doi: 10.1145/3503250. URL https://doi.org/10.1145/3503250.
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts, 2022.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M.
Perrot, and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Ali Punjani, John L Rubinstein, David J Fleet, and Marcus A Brubaker. cryoSPARC: algorithms for rapid unsupervised cryo-EM structure determination. Nature methods, 14(3):290–296, 2017.
Sjors HW Scheres. RELION: implementation of a Bayesian approach to cryo-EM structure determination. Journal of structural biology, 180(3):519–530, 2012.
LLC Schrödinger and Warren DeLano. PyMOL, 2020. URL http://www.pymol.org/pymol.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning using Nonequilibrium Thermodynamics. In Francis Bach and David Blei (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 2256–2265, Lille, France, 07–09 Jul 2015. PMLR. URL https://proceedings.mlr.press/v37/sohl-dickstein15.html.
Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, and Liyue Shen. Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=j8hdRqOUhN.
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
Bogdan Toader, Fred J Sigworth, and Roy R Lederman. Methods for cryo-EM single particle reconstruction of macromolecules having continuous heterogeneity. Journal of Molecular Biology, 435(9):168020, 2023.
Brian L. Trippe, Luhuan Wu, Christian A.
Naesseth, David Blei, and John Patrick Cunningham. Practical and Asymptotically Exact Conditional Sampling in Diffusion Models. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023a. URL https://openreview.net/forum?id=r9s3Gbxz7g.
Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and Tommi S. Jaakkola. Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-scaffolding problem. In The Eleventh International Conference on Learning Representations, 2023b. URL https://openreview.net/forum?id=6TxBxqNME1Y.
Yanghai Tsin and Takeo Kanade. A correlation-based approach to robust point set registration. In Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech Republic, May 11-14, 2004. Proceedings, Part III 8, pp. 558–569. Springer, 2004.
Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: Latent point diffusion models for 3d shape generation. Advances in Neural Information Processing Systems, 35:10021–10039, 2022.
Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
Xiao Wang, Han Zhu, Genki Terashi, Manav Taluja, and Daisuke Kihara. Diffmodeler: large macromolecular structure modeling for cryo-em maps using a diffusion model. Nature Methods, October 2024. ISSN 1548-7105. doi: 10.1038/s41592-024-02479-0. URL http://dx.doi.org/10.1038/s41592-024-02479-0.
Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, and Zhuowen Tu. Bayesian Diffusion Models for 3D Shape Reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10628–10638, 2024.
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan. Pointflow: 3d point cloud generation with continuous normalizing flows.
In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4541–4550, 2019.
Jing Zhang, Tengfei Zhao, ShiYu Hu, and Xin Zhao. Robust single-particle cryo-em image denoising and restoration. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2995–2999, 2024. doi: 10.1109/ICASSP48485.2024.10447135.
Ellen D. Zhong, Tristan Bepler, Bonnie Berger, and Joseph H. Davis. Cryodrgn: reconstruction of heterogeneous cryo-em structures using neural networks. Nature Methods, 18(2):176–185, February 2021a. ISSN 1548-7105. doi: 10.1038/s41592-020-01049-4. URL http://dx.doi.org/10.1038/s41592-020-01049-4.
Ellen D. Zhong, Adam Lerer, Joseph H. Davis, and Bonnie Berger. Cryodrgn2: Ab initio neural reconstruction of 3d protein structures from real cryo-em images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4066–4075, October 2021b.
Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 5826–5835, 2021.

A APPENDIX

A.1 RECONSTRUCTION GUIDANCE

For completeness, we show how to derive the approximation $p_t(y \mid x(t)) \approx p_0(y \mid D_\theta(x(t), t))$ used in reconstruction guidance to perform approximate diffusion posterior sampling. To take advantage of the likelihood on the noiseless data, $p_0(y \mid x(0))$, we follow the argument of Chung et al. (2023) and first express $p_t(y \mid x(t))$ as the marginal

$$p_t(y \mid x(t)) = \int p(y \mid x(t), x(0))\, p(x(0) \mid x(t))\, dx(0). \tag{15}$$

Given $x(0)$, $y$ is independent of $x(t)$. Therefore, we can simplify $p(y \mid x(t), x(0))$ to $p_0(y \mid x(0))$ and obtain

$$p_t(y \mid x(t)) = \int p_0(y \mid x(0))\, p(x(0) \mid x(t))\, dx(0). \tag{16}$$

Chung et al. (2023) note that $p(x(0) \mid x(t))$ is intractable in the general case and therefore propose the approximation

$$p(x(0) \mid x(t)) \approx \delta\big(D(x(t), t) - x(0)\big), \tag{17}$$

where $D(x(t), t) := \mathbb{E}_{x(0) \sim p(\cdot \mid x(t))}[x(0)]$.
The intuition behind $D$ is that it denoises the noisy input $x(t)$. This so-called denoising function is in general intractable and has to be replaced by an estimator $D_\theta$ (Karras et al., 2022). Learning $D_\theta$ is an essential part of diffusion model training and is discussed in Section 3.1. By plugging (17) into (16) we obtain

$$p_t(y \mid x(t)) \approx \int p_0(y \mid x(0))\, \delta\big(D_\theta(x(t), t) - x(0)\big)\, dx(0) = p_0(y \mid D_\theta(x(t), t)). \tag{18}$$

A.2 TRAINING

In this section, we provide additional background on the objective optimized during diffusion model training and elaborate on the specifics of training our 3D diffusion priors. In diffusion model training we use gradient descent to find a good score model $s_\theta$ via explicit score matching (ESM):

$$\min_\theta \mathbb{E}_{t, x(t)} \left[ \lambda(t) \left\| \nabla_{x(t)} \log p_t(x(t)) - s_\theta(x(t), t) \right\|^2 \right] \tag{19}$$

where $t \sim p_{\text{train}}$ (e.g. $p_{\text{train}} = U[0, T]$) and $x(t) \sim p_t$, with the loss weighting $\lambda: \mathbb{R}_+ \to \mathbb{R}_+$ (e.g. $\lambda(t) = 1/2$) (Hyvärinen, 2005; Vincent, 2011). However, the score of the marginals $\nabla_{x(t)} \log p_t(x(t))$ is unknown; therefore, we do not have an explicit regression target for $s_\theta$. Fortunately, Vincent (2011) showed that the ESM optimisation problem is equivalent to the denoising score matching (DSM) objective:

$$\min_\theta \mathbb{E}_{t, x(0), x(t)} \left[ \lambda(t) \left\| \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0)) - s_\theta(x(t), t) \right\|^2 \right] \tag{20}$$

where $x(0) \sim p_0$ and $x(t) \sim p_{0t}(\cdot \mid x(0))$. In DSM we only need to evaluate the score of the perturbation kernel $p_{0t}$, which is easy to calculate for suitable choices of the drift and diffusion coefficients (see for example the variance exploding or variance preserving schedules in Song et al. (2021)).

Figure 3: Unconditional samples from the diffusion prior trained on the ShapeNet-Chair dataset. Sampled with Algorithm 1 using $\beta(t) = 1/t$ if $t > 1$ else $1$, $t_{\max} = 80$ and 100 time steps. The images are created using the surface mode of PyMOL (Schrödinger & DeLano, 2020).

Figure 4: Unconditional samples from the diffusion prior trained on the ShapeNet-Mixed dataset. Sampled with Algorithm 1 using $\beta(t) = 1/t$ if $t > 1$ else $1$, $t_{\max} = 80$ and 100 time steps. The images are created using the surface mode of PyMOL (Schrödinger & DeLano, 2020).

Table 3: All metrics (1-NNA, COV and MMD) used to quantify the diffusion prior generation performance are based on the Chamfer Distance.

| Dataset | Samples from | 1-NNA (%, ↓) | COV (%, ↑) | MMD [×10²] (↓) |
|---|---|---|---|---|
| ShapeNet-Chair | Diffusion prior | 78.34 | 44.89 | 27.11 |
| ShapeNet-Chair | Training set | 47.80 | 60.23 | 22.97 |
| ShapeNet-Mixed | Diffusion prior | 66.76 | 44.59 | 24.83 |
| ShapeNet-Mixed | Training set | 45.91 | 55.41 | 20.63 |
| CryoStruct | Diffusion prior | 54.29 | 42.38 | 18.68 |
| CryoStruct | Training set | 44.94 | 53.05 | 18.13 |

Figure 5: Unconditional samples from the diffusion prior trained on the CryoStruct dataset. Sampled with Algorithm 1 using $\beta(t) = 1/t$ if $t > 0.8$ else $1$, $t_{\max} = 80$ and 100 time steps. The images are created using the surface mode of PyMOL (Schrödinger & DeLano, 2020).

We follow the design choice recommendations of Karras et al. (2022) by using $f(x, t) = 0$ and $g(t) = \sqrt{2t}$, which yields the forward diffusion SDE $dx = \sqrt{2t}\, dw_t$. The resulting $x(t)$ is the sum of the starting position $x(0)$ and infinitely many independent, infinitesimally small Gaussian contributions. The perturbation kernel is therefore itself Gaussian: $p_{0t}(x(t) \mid x(0)) = \mathcal{N}(x(t); x(0), t^2 I)$. Plugging it into the DSM objective in (20) we obtain

$$\min_\theta \mathbb{E}_{t, x(0), x(t)} \left[ \lambda(t) \left\| \frac{x(0) - x(t)}{t^2} - s_\theta(x(t), t) \right\|^2 \right]. \tag{21}$$

This motivates the score-model parameterization $s_\theta(x(t), t) = (D_\theta(x(t), t) - x(t))/t^2$. Plugging it into the objective, and absorbing the resulting factor $1/t^4$ into the weighting $\lambda(t)$, further simplifies it to

$$\min_\theta \mathbb{E}_{t, x(0), x(t)} \left[ \lambda(t) \left\| x(0) - D_\theta(x(t), t) \right\|^2 \right], \tag{22}$$

which justifies the terminology denoising function for $D_\theta$. Following Karras et al. (2022), we defined the weighting $\lambda(t) := 1/c_{\text{out}}^2$ with $c_{\text{out}} := 0.25\,t/\sqrt{t^2 + 0.25}$ and selected $p_{\text{train}}(t)$ during training as a log-normal distribution, $\ln(t) \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$, to focus on the relevant parts of the noise schedule.

Following Karras et al.
(2022), we define the denoiser by

$$D_\theta(x, t) := c_{\text{skip}}(t)\, x + c_{\text{out}}(t)\, F_\theta\big(c_{\text{in}}(t)\, x,\; c_{\text{noise}}(t)\big) \tag{23}$$

where $c_{\text{skip}} := 0.25/(t^2 + 0.25)$ and $c_{\text{in}} := 1/\sqrt{t^2 + 0.25}$, with $c_{\text{noise}} := 1000 \cdot t$ for the ShapeNet-Chair diffusion prior and $c_{\text{noise}} := 1000 \cdot t / t_{\max}$ for the ShapeNet-Mixed and CryoStruct diffusion priors. To model $F_\theta$, we used the point transformer architecture of Nichol et al. (2022) with a width of 512 and 24 layers, where each of the 1024 points has its 3D coordinates as features. This architecture provides us with a highly flexible model with permutation equivariance. To account for the lack of rotation equivariance, we used data augmentation during training. For ShapeNet-Chair and ShapeNet-Mixed we performed data augmentation by transforming each point cloud during training with a random orthogonal matrix. In the case of the CryoStruct dataset, each point cloud is transformed by a random proper rotation matrix, an element of $SO(3)$, to avoid training on unnatural mirrored biomolecules.

To train the diffusion prior on the ShapeNet-Chair dataset, we selected $P_{\text{mean}} = -4$, $P_{\text{std}} = 1.2$ with $t_{\max} = 1$ for roughly the first 90% of the training steps and $P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$ with $t_{\max} = 80$ for the rest. For the ShapeNet-Mixed and CryoStruct datasets, we selected $P_{\text{mean}} = -2.8$, $P_{\text{std}} = 0.9$ with $t_{\max} = 1$ for approximately the first 90% of the training steps and $P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$ with $t_{\max} = 80$ for the remaining ones.

We trained the ShapeNet-Chair diffusion prior for 2827 epochs on the ShapeNet-Chair training split with random orthogonal augmentations. During training, we manually adjusted the batch size (ranging from 100 to 200) and the learning rate ($1 \times 10^{-4}$ to $1 \times 10^{-5}$). The ShapeNet-Mixed diffusion prior was also trained on the respective training split with random orthogonal augmentations, for 998 epochs with a batch size ranging from 200 to 360 and a constant learning rate of $8 \times 10^{-5}$. For the CryoStruct diffusion prior, we used the training split of Giri et al. (2024) with random rotational augmentations.
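
To make the preconditioning in Eq. (23) and the two augmentation schemes more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the training code: all function names are hypothetical, `network` is a placeholder callable standing in for $F_\theta$, the coefficients are copied as stated in the text (including the prefactor 0.25 in $c_{\text{out}}$), and drawing the random matrices via a QR decomposition of a Gaussian matrix is an assumed choice that the text does not specify.

```python
import numpy as np

CONST = 0.25  # constant appearing in c_skip and c_in in the text


def preconditioning(t, t_max=None):
    """Coefficients of Eq. (23); pass t_max to use c_noise = 1000 * t / t_max."""
    denom = t ** 2 + CONST
    c_skip = CONST / denom
    c_out = 0.25 * t / np.sqrt(denom)   # prefactor taken as stated in the text
    c_in = 1.0 / np.sqrt(denom)
    c_noise = 1000.0 * t if t_max is None else 1000.0 * t / t_max
    return c_skip, c_out, c_in, c_noise


def denoise(network, x, t, t_max=None):
    """D_theta(x, t) = c_skip * x + c_out * F_theta(c_in * x, c_noise)."""
    c_skip, c_out, c_in, c_noise = preconditioning(t, t_max)
    return c_skip * x + c_out * network(c_in * x, c_noise)


def random_orthogonal(rng):
    """Random 3x3 orthogonal matrix (may include a reflection)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))      # sign fix so that the draw is not biased by QR


def random_rotation(rng):
    """Random proper rotation in SO(3): flip one column if the determinant is -1."""
    q = random_orthogonal(rng)
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q


def augment(points, rng, proper=True):
    """Apply a random rotation (or general orthogonal map) to an (N, 3) point cloud."""
    matrix = random_rotation(rng) if proper else random_orthogonal(rng)
    return points @ matrix.T


# Toy usage with a dummy network that always predicts zeros.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
rotated = augment(cloud, rng, proper=True)
denoised = denoise(lambda x, c_noise: np.zeros_like(x), cloud, t=0.5)
```

With `proper=True` the transformation preserves handedness, which corresponds to the CryoStruct setting; `proper=False` also allows mirror images, as in the ShapeNet settings.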
Throughout the 2070 epochs of training, we manually increased the batch size from 120 to 200 and decreased the learning rate from $1 \times 10^{-4}$ to $8 \times 10^{-5}$.

A.3 TEST PARAMETERS

We consistently used 40 time steps and $\rho = 3$ for the reconstruction tasks involving the ShapeNet-Chair, ShapeNet-Mixed, and CryoStruct datasets. An exception was made for the results depicted in Figure 2, where the number of time steps was doubled to 80. Moreover, we found that our likelihood-based guidance is most effective at lower values of $t$. Thus, to avoid superfluous computational work, we set $t_{\max}$ to 1 for the reconstruction tasks. We used $\beta(t) = 1/t$ for most tasks, but noticed that for input projections with a low number of points, setting $\beta(t)$ to 0 during the final time steps produced sharper 3D point clouds. The guidance strength $\alpha$ is selected roughly according to the amount of information contained in the input data $y$. Refer to Table 4 for further details.

Figure 6: Comparison of the reconstruction error (CD and EMD, [×10²], ↓) in two scenarios as a function of the number of time steps (10, 20, 40, 80) used in Algorithm 1. The plot on the left corresponds to the test case of row 8 in Table 1 with the parameters of row 8 in Table 4. The plot on the right shows the test errors for the configuration outlined in row 3 of Table 1, using the parameters specified in row 4 of Table 4.

Table 4: Parameters used during approximate diffusion posterior sampling.
| Dataset | Projection points | Number of projections | Coarse-grained points | Subunit points | β(t) | α |
|---|---|---|---|---|---|---|
| ShapeNet-Chair | 200 | 5 | - | - | 1/t if t > 0.15, else 0 | 10k |
| ShapeNet-Chair | 200 | 6 | - | - | 1/t if t > 0.15, else 0 | 10k |
| ShapeNet-Mixed | 400 | 5 | - | - | 1/t | 10k |
| ShapeNet-Mixed | 400 | 4 | - | - | 1/t | 5k |
| ShapeNet-Chair | 300 | 1 | 30 | - | 1/t | 4k |
| ShapeNet-Mixed | 300 | 1 | 30 | - | 1/t | 4k |
| ShapeNet-Chair | 200 | 2 | - | ≈256 | 1/t if t > 0.15, else 0 | 4k |
| ShapeNet-Mixed | 200 | 2 | - | ≈256 | 1/t if t > 0.15, else 0 | 4k |
| ShapeNet-Mixed | 200 | 2 | 30 | ≈128 | 1/t if t > 0.15, else 0 | 4k |
| CryoStruct | 900 | 2 | 30 | - | 1/t | 4k |
| CryoStruct | 300 | 1 | 40 | - | 1/t | 4k |
| CryoStruct | 1024 | 1 | 40 | - | 1/t | 4k |
| CryoStruct | 1024 | 3 | - | - | 1/t | 40k |
| CryoStruct | 800 | 4 | - | - | 1/t | 40k |
| CryoStruct | 1024 | 4 | - | - | 1/t | 60k |
| CryoStruct | 1024 | 5 | - | - | 1/t | 80k |

A.4 CRYOSTRUCT BENCHMARKS

The following figures summarize the CryoStruct benchmark for various input measurements. The PDB codes of the 100 test structures are indicated on the x-axis. The red dashed line shows the average RMSD of the ML-based models. The box plots and dashed orange lines show the RMSD of the models generated with DPS.
Each figure in Sections A.4.1 to A.4.7 consists of four panels showing the RMSD [Å] of the ML and DPS models for 25 test structures each. The PDB codes on the x-axes are, in order: 7syx, 6xz7, 7rze, 5w0s, 6v22, 7s0c, 6cs3, 6c23, 6ifr, 7mdt, 6ezm, 6rdm, 7o3j, 6ap1, 6pv8, 7e7d, 7bp9, 6bm0, 6uz3, 7x1t, 7jzy, 7bxu, 7kwg, 6caa, 7f7f; 6ebk, 7tci, 5m3l, 5ogw, 7wli, 7y9s, 6nyg, 7fec, 7xmr, 7w4o, 7l88, 6uza, 8cy6, 7lij, 7k9x, 7czp, 6cv3, 6x6c, 6q2o, 7czt, 7liw, 7emf, 6urg, 7nt9, 6nq2; 7r0z, 5lk8, 7dxc, 6vae, 6drk, 7ew3, 6rwl, 7t9n, 7r87, 7dty, 7wr8, 7kiy, 5mmj, 7jk3, 7jwb, 6l54, 6ryr, 6uja, 7xjk, 7l6q, 6uze, 7s9a, 7kf5, 7w59, 7m6m; 6dnh, 6xld, 6yt5, 7rky, 6x40, 5vzl, 6utf, 7rja, 6uwf, 7kne, 7ssz, 6vvy, 6f1u, 6m71, 7ej0, 7e1x, 5an9, 6wm4, 6zkc, 7mis, 7fie, 7k8x, 6w62, 6eit, 7sq0.

A.4.1 INPUT DATA: THREE 2D PROJECTIONS, 1024 POINTS PER PROJECTION

[Figure: RMSD [Å] of ML and DPS models per test structure.]

A.4.2 INPUT DATA: FOUR 2D PROJECTIONS, 1024 POINTS PER PROJECTION

[Figure: RMSD [Å] of ML and DPS models per test structure.]

A.4.3 INPUT DATA: FIVE 2D PROJECTIONS, 1024 POINTS PER PROJECTION

[Figure: RMSD [Å] of ML and DPS models per test structure.]

A.4.4 INPUT DATA: FOUR 2D PROJECTIONS, 800 POINTS PER PROJECTION

[Figure: RMSD [Å] of ML and DPS models per test structure.]

A.4.5 INPUT DATA: TWO PROJECTIONS (900 POINTS PER PROJECTION) + A LOW-RESOLUTION STRUCTURE (30 POINTS)

[Figure: RMSD [Å] of ML and DPS models per test structure.]

A.4.6 INPUT DATA: A SINGLE PROJECTION (1024 POINTS) + A LOW-RESOLUTION STRUCTURE (40 POINTS)

[Figure: RMSD [Å] of ML and DPS models per test structure.]

A.4.7 INPUT DATA: A SINGLE PROJECTION (300 POINTS) + A LOW-RESOLUTION STRUCTURE (40 POINTS)

[Figure: RMSD [Å] of ML and DPS models per test structure.]

---