Authors: Julian L. Möbius, Michael Habeck
DIFFUSION PRIORS FOR BAYESIAN 3D RECONSTRUCTION FROM INCOMPLETE MEASUREMENTS
Julian L. Möbius & Michael Habeck
Microscopic Image Analysis Group
Jena University Hospital
Friedrich Schiller University Jena
07743 Jena, Germany
{julian.moebius,michael.habeck}@uni-jena.de
ABSTRACT
Many inverse problems are ill-posed and need to be complemented by prior in-
formation that restricts the class of admissible models. Bayesian approaches en-
code this information as prior distributions that impose generic properties on the
model such as sparsity, non-negativity or smoothness. However, in case of com-
plex structured models such as images, graphs or three-dimensional (3D) objects,
generic prior distributions tend to favor models that differ largely from those ob-
served in the real world. Here we explore the use of diffusion models as priors
that are combined with experimental data within a Bayesian framework. We use
3D point clouds to represent 3D objects such as household items or biomolecu-
lar complexes formed from proteins and nucleic acids. We train diffusion models
that generate coarse-grained 3D structures at a medium resolution and integrate
these with incomplete and noisy experimental data. To demonstrate the power of
our approach, we focus on the reconstruction of biomolecular assemblies from
cryo-electron microscopy (cryo-EM) images, which is an important inverse prob-
lem in structural biology. We find that posterior sampling with diffusion model
priors allows for 3D reconstruction from very sparse, low-resolution and partial
observations.
1 INTRODUCTION
Inverse problems are encountered in many different scientific fields. The basic setting is that we
observe noisy and incomplete data $y$ and seek a model $x$ that predicts mock data via a
forward model $\mathcal{A}$ such that $y \approx \mathcal{A}(x)$. An important subclass consists of linear models, where $\mathcal{A}$ is a linear
operator. Well-known examples of inverse problems are deconvolution and tomography.
The challenge in solving inverse problems stems from the fact that they tend to be ill-posed, meaning
that many models can produce highly similar data and/or the reconstructed model can be very
sensitive to noise. The remedy is to combine a reconstruction loss with a regularizer. Well-studied
regularizers include Tikhonov regularization (also known as ridge regression), sparsity, and non-negativity.
Bayesian inference offers a powerful framework to tackle inverse problems. The conditional
probability $p(y \mid x)$, the likelihood, relates the data $y$ to the mock data $\mathcal{A}(x)$ via a noise model. A
common assumption is independent Gaussian noise, resulting in the likelihood
$$p(y \mid x) \propto \exp\left( -\frac{\| y - \mathcal{A}(x) \|^2}{2\sigma^2} \right). \tag{1}$$
Maximizing the likelihood is then equivalent to standard least-squares fitting.
The prior probability $p(x)$ encodes data-independent knowledge about a particular model $x$; its
negative logarithm $-\log p(x)$ can be viewed as a regularizer. The posterior of the model is
$$p(x \mid y) = \frac{p(y \mid x)\, p(x)}{p(y)} \tag{2}$$
arXiv:2412.14897v1 [cs.LG] 19 Dec 2024
with model evidence $p(y) = \int p(y \mid x)\, p(x)\, dx$. In the case of a Gaussian likelihood, maximization of
$\log p(x \mid y)$ is equivalent to regularized least-squares fitting.
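To make this equivalence concrete, the following NumPy sketch considers a hypothetical linear forward model with independent Gaussian noise and a zero-mean Gaussian prior. These are assumptions made purely for illustration: the matrix `A`, the problem sizes and the values of `sigma` and `tau` are not taken from the experiments. Under these assumptions, the maximizer of $\log p(x \mid y)$ is the ridge regression solution with regularization strength $\sigma^2/\tau^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear forward model y = A x + noise (all sizes are illustrative).
n_obs, n_params = 20, 50           # fewer observations than unknowns: ill-posed
A = rng.normal(size=(n_obs, n_params))
x_true = rng.normal(size=n_params)
sigma = 0.1                        # noise standard deviation in the likelihood
tau = 1.0                          # standard deviation of the Gaussian prior
y = A @ x_true + sigma * rng.normal(size=n_obs)

def neg_log_posterior(x):
    """-log p(y|x) - log p(x), dropping constants that do not depend on x."""
    data_term = np.sum((y - A @ x) ** 2) / (2 * sigma**2)
    prior_term = np.sum(x**2) / (2 * tau**2)
    return data_term + prior_term

# Closed-form minimizer: ridge regression with lambda = sigma^2 / tau^2.
lam = sigma**2 / tau**2
x_map = np.linalg.solve(A.T @ A + lam * np.eye(n_params), A.T @ y)

# The gradient of the negative log-posterior vanishes at the MAP estimate.
grad = -A.T @ (y - A @ x_map) / sigma**2 + x_map / tau**2
```

Without the prior term, `A.T @ A` is singular in this setting (20 observations, 50 unknowns), which is one way to see why a regularizer is needed.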
Often detailed knowledge about reasonable solutions is available but difficult to capture by the stan-
dard priors that are typically used to tackle inverse problems. For example, cryo-electron microscopy
(cryo-EM) aims to reconstruct the three-dimensional structure of macromolecular complexes from
two-dimensional (2D) projections. Cryo-EM images are typically very noisy with signal-to-noise ra-
tios (SNR) far below one. On the other hand, a large body of knowledge has been accumulated over
the past six decades, including hundreds of thousands of experimentally determined biomolecular
structures that are stored in the Protein Data Bank (PDB) (Berman et al., 2000). Experimentally de-
termined structures exhibit recurrent features such as alpha-helices and beta-strands and preferences
for the proximity and packing of amino acids and entire subunits. This detailed information is not
captured by standard priors used in cryo-EM reconstruction packages such as CryoDRGN (Zhong
et al., 2021a;b), cryoSPARC (Punjani et al., 2017) or RELION (Scheres, 2012). These approaches
represent the structure as a voxel grid and use generic priors enforcing non-negativity or penalizing
high-frequency contributions. If one were to sample volumes from the corresponding prior, the sam-
pled structures would not resemble any of the known biomolecular structures. Here we try to encode
the rich knowledge available in the PDB as a diffusion model prior. We test 3D reconstruction from
sparse, low-resolution and partial measurements by posterior sampling with diffusion models as
priors.
1.1 CONTRIBUTION
Our contributions are as follows: We propose a method to reconstruct 3D structures from 2D projections
that utilizes diffusion models as priors. Diffusion priors have not previously been explored
for the 2D-to-3D reconstruction problem in cryo-EM. The combination of the diffusion model
prior with a likelihood allows us to reconstruct 3D structures from very sparse observations such as
2D projections, low-resolution structures and known structures of subunits. This is achieved with
diffusion-based posterior sampling (DPS) (Chung et al., 2023), a method that has not yet been
investigated in the context of 3D point cloud data. We combine DPS with optimized diffusion schedules
and second-order correction steps with adaptable noise injection (Karras et al., 2022) to improve
sample quality and runtime. We demonstrate the fidelity and flexibility of our method on highly
complex and diverse datasets of 3D point clouds from ShapeNet and the PDB.
We emphasize that the reconstruction problem which we solve differs from the problem of recon-
structing a 3D surface from a 2D surface color image, which is tackled by, for example, Point-E
(Nichol et al., 2022), Shap-E (Jun & Nichol, 2023), PC² (Melas-Kyriazi et al., 2023), One-2-3-45
(Liu et al., 2023) and BDM (Xu et al., 2024). The main difference is that in our work the 2D ob-
servations are projections that provide information about the density across the full volume rather
than only information about the surface. In addition, our approach is also capable of conditioning
on coarse-grained or partial observations of the 3D structure. Moreover, the reconstruction problem
in cryo-EM aims to recover the internal structure, not only the surface.
It is important to note that diffusion models have been used in prior work in the context of cryo-EM
but have not been used for the 2D to 3D reconstruction problem. DiffModeler (Wang et al., 2024)
uses a diffusion model as part of their pipeline to trace the backbone of protein chains within already
reconstructed 3D cryo-EM maps to model large protein complexes. Zhang et al. (2024) utilized a
diffusion model to denoise and restore 2D cryo-EM images of single particles. Kreis et al. (2022)
trained diffusion models on the CryoDRGN latent conformational space of single proteins to tackle
the prior hole problem. Ingraham et al. (2023) used diffusion models to generate protein
complexes at the atomic level and also demonstrated generation conditioned on shape and symmetry
constraints. So far, however, their diffusion model has not been applied to the 2D-to-3D reconstruction task.
2 BACKGROUND ON DIFFUSION MODELS
Diffusion models have gained wide recognition in the field of generative modeling (Sohl-Dickstein
et al., 2015; Song & Ermon, 2019; Ho et al., 2020; Song et al., 2021), particularly in image synthesis,
where diffusion models have demonstrated their capability by surpassing former leading models in
key metrics (Dhariwal & Nichol, 2021) and continue to set new records (Karras et al., 2024). In
generative modeling, the main goal is to learn a sampler for an unknown distribution $p_0$ from i.i.d.
samples $x(0)_i \sim p_0$ that serve as training data. A diffusion model tries to achieve this goal by
approximating a probability flow from a latent Gaussian distribution $p_T$ to the unknown target $p_0$.
For this purpose, a forward process from the target distribution $p_0$ to the latent distribution $p_T$ is
defined in terms of a stochastic differential equation (SDE) of the form
$$dx = f(x, t)\, dt + g(t)\, dw_t, \tag{3}$$
where $w_t$ is a Wiener process, $f(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is the drift of $x(t)$ and $g: \mathbb{R} \to \mathbb{R}$ is the diffusion
coefficient (Song et al., 2021). Starting at time $t = 0$ with samples $x(0) \sim p_0$ from the target
distribution, process (3) is designed such that it gradually destroys the information content of the
samples $x(0)$ by transforming them into samples $x(T)$ from an isotropic Gaussian.
A diffusion model aims to represent the reverse process from $p_T$ to $p_0$ such that we can draw noise
from a Gaussian distribution and slowly transform it into samples from the data distribution $p_0$.
Anderson (1982) showed that the forward process (3) has a reverse process of the form
$$dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t)\, dw_t \tag{4}$$
with $p_t(x(t)) = \int p_{0t}(x(t) \mid x(0))\, p_0(x(0))\, dx(0)$ being the marginal distribution of $x(t)$, where
$p_{0t}$ is the perturbation kernel from time $0$ to $t$. The score $\nabla_x \log p_t(x)$ of the marginals is unknown
and has to be approximated with a parametric score model $s_\theta(x(t), t)$.
Diffusion model training works by applying gradient descent to the denoising score matching
(DSM) objective to train $s_\theta$:
$$\min_\theta \; \mathbb{E}_{t,\, x(0),\, x(t)} \left[ \lambda(t) \left\| \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0)) - s_\theta(x(t), t) \right\|^2 \right] \tag{5}$$
where $t \sim p_{\text{train}}$, $x(t) \sim p_t$, $x(0) \sim p_0$ and $x(t) \sim p_{0t}(\cdot \mid x(0))$, with the loss weighting $\lambda: \mathbb{R}_+ \to \mathbb{R}_+$.
In DSM, we only need to evaluate the score of the perturbation kernel $p_{0t}$, which is easy
to calculate for suitable choices of the drift and diffusion coefficient (consider, for instance, the
variance exploding or variance preserving schedules in Song et al. (2021)). More background on
the training process can be found in Appendix A.2. After training, the score model $s_\theta$ can be
used as a replacement for $\nabla_x \log p_t(x)$ to generate new data by sampling the latent model $p_T$ and
simulating the reverse SDE in equation (4) backward in time. The reverse SDE can be simulated
with numerical methods such as Euler-Maruyama, starting from $T$ and ending shortly before $0$ to
avoid numerical errors.
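The DSM objective can be illustrated on a toy problem in which the marginal score is known in closed form. The sketch below assumes a one-dimensional Gaussian data distribution $p_0 = \mathcal{N}(0, s_0^2)$, a Gaussian perturbation kernel $\mathcal{N}(x(t); x(0), t^2)$, a single fixed noise level and $\lambda(t) = 1$. All of these are illustrative assumptions; no neural network is trained. It estimates the objective by Monte Carlo and compares the exact marginal score with a deliberately wrong one.

```python
import numpy as np

rng = np.random.default_rng(0)

s0 = 0.5   # std of the toy 1D data distribution p_0 = N(0, s0^2) (assumption)
t = 1.0    # a single fixed noise level, for simplicity

def kernel_score(x_t, x_0, t):
    """Score of the Gaussian perturbation kernel N(x_t; x_0, t^2) w.r.t. x_t."""
    return (x_0 - x_t) / t**2

def marginal_score(x_t, t):
    """Exact score of the marginal p_t = N(0, s0^2 + t^2) for the toy data."""
    return -x_t / (s0**2 + t**2)

def dsm_loss(score_fn, n=200_000):
    """Monte Carlo estimate of the DSM objective at noise level t (lambda = 1)."""
    x_0 = s0 * rng.normal(size=n)          # x(0) ~ p_0
    x_t = x_0 + t * rng.normal(size=n)     # x(t) ~ p_0t(. | x(0))
    return np.mean((kernel_score(x_t, x_0, t) - score_fn(x_t, t)) ** 2)

loss_exact = dsm_loss(marginal_score)
loss_wrong = dsm_loss(lambda x, t: 0.5 * marginal_score(x, t))
```

The loss of the exact marginal score is not zero, because the kernel score is a noisy regression target, but it is lower than the loss of any other candidate. This is why minimizing the objective recovers the marginal score although only the kernel score is ever evaluated.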
2.1 DIFFUSION MODELS IN 3D
Apart from the 2D image domain, diffusion models have been employed to estimate the distribution
of 3D objects. Various representations have been used including point clouds (Luo & Hu, 2021; Vah-
dat et al., 2022; Nichol et al., 2022; Zhou et al., 2021), meshes and implicit neural representations
(Jain et al., 2021; Jun & Nichol, 2023; Erkoç et al., 2023) such as neural radiance fields (Mildenhall
et al., 2021). Here, we employ a point cloud representation and adopt the point transformer archi-
tecture from Nichol et al. (2022). This representation allows us to model 3D volume densities such
as cryo-EM maps efficiently, unlike meshes, which only model the surface. Furthermore, the point
cloud representation simplifies the process of developing likelihoods for the cryo-EM reconstruction
problem. In addition, we avoid any kind of latent diffusion (as, for example, proposed by Vahdat
et al. (2022)) for which likelihood guidance is more difficult (Song et al., 2024).
2.2 DIFFUSION POSTERIOR SAMPLING
In many practical applications such as text-to-image or class-to-image generation, the focus is on
sampling from the posterior $x(0) \sim p_0(\cdot \mid y)$ given some input $y$. In this case, our unconditional
score $\nabla_{x(t)} \log p_t(x(t))$ will be extended to
$$\nabla_{x(t)} \log p_t(x(t) \mid y) \overset{\text{Bayes' rule}}{=} \nabla_{x(t)} \log p_t(y \mid x(t)) + \nabla_{x(t)} \log p_t(x(t)). \tag{6}$$
Given pairs of training data $\{(x(0)_i, y_i)\}$, we could train a diffusion prior plus a classifier
$p_t(y \mid x(t))$ and use its score $\nabla_{x(t)} \log p_t(y \mid x(t))$ during inference for classifier guidance
(Dhariwal & Nichol, 2021). Another popular option is to perform classifier-free guidance and directly
train $\nabla_{x(t)} \log p_t(x(t) \mid y)$ (Ho & Salimans, 2022). For example, Zhou et al. (2021) used this
approach for 3D shape completion and 3D shape reconstruction from a single depth map.
Another line of work attempts to avoid task-specific training and instead uses the known forward
model to guide the generation process (Chung et al., 2023; 2022; Ho et al., 2022; Lugmayr et al.,
2022; Song et al., 2021; Trippe et al., 2023a;b; Dou & Song, 2024; Cardoso et al., 2023). In tasks
with a known forward model, such as inpainting, shape completion or colorization, we have access to
a likelihood $p_0(y \mid x(0))$ based on the noiseless data $x(0)$. Chung et al. (2023) make use of this
likelihood by approximating the score of the posterior by
$$\nabla_{x(t)} \log p_t(x(t) \mid y) \approx \zeta \nabla_{x(t)} \log p_0(y \mid D_\theta(x(t), t)) + \nabla_{x(t)} \log p_t(x(t)) \tag{7}$$
with weighting $\zeta > 0$ and denoising function $D_\theta$, which is an estimator of
$D(x(t), t) := \mathbb{E}_{x(0) \sim p(\cdot \mid x(t))}[x(0)]$ that is learned during the training of the diffusion model (see Section 3.1 and
Appendix A.2). This approximation approach, called reconstruction guidance in Ho et al. (2022),
has been applied across multiple contexts with prominent results in ill-posed inverse problems of
2D images (Chung et al., 2023; 2022; Ho et al., 2022; Trippe et al., 2023a). Simpler approaches
such as the replacement method of Song et al. (2021) are computationally cheaper because they
do not need additional backpropagation. However, the replacement method sometimes suffers from
severe artifacts (Lugmayr et al., 2022; Chung et al., 2023). Most recently, several approaches used
reweighting schemes within the Sequential Monte Carlo (SMC) framework to derive exact methods
for diffusion posterior sampling (Trippe et al., 2023a;b; Cardoso et al., 2023; Dou & Song, 2024).
However, in our case the guarantee of exactness is of limited practical relevance: the number of
particles required by SMC tends to be excessively large (Gupta et al., 2024), which makes these
methods computationally too demanding.
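The structure of the approximation in Eq. (7) can be sketched on a toy problem in which every quantity is available in closed form. The example below assumes an isotropic Gaussian prior $p_0 = \mathcal{N}(\mu, s_0^2 I)$, for which the denoiser is linear, together with a hypothetical linear forward model with Gaussian noise. The names `denoise`, `guidance` and `guided_score`, as well as all numerical values, are illustrative assumptions. With a learned denoiser $D_\theta$, the hand-written chain rule in `guidance` would be replaced by backpropagation through the network.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 3                      # dimension of x and of the observation y (toy sizes)
mu = rng.normal(size=d)          # mean of the toy Gaussian prior p_0 = N(mu, s0^2 I)
s0, sigma, zeta = 0.5, 0.1, 1.0  # prior std, noise std, guidance weight (assumptions)
A = rng.normal(size=(m, d))      # hypothetical linear forward model
y = A @ (mu + s0 * rng.normal(size=d)) + sigma * rng.normal(size=m)

def denoise(x, t):
    """Posterior mean E[x(0) | x(t)] for the Gaussian prior (stand-in for D_theta)."""
    return mu + s0**2 / (s0**2 + t**2) * (x - mu)

def prior_score(x, t):
    """Score of the marginal p_t = N(mu, (s0^2 + t^2) I), written via the denoiser."""
    return (denoise(x, t) - x) / t**2

def log_lik(x0):
    """log p_0(y | x0) up to an additive constant."""
    return -np.sum((y - A @ x0) ** 2) / (2 * sigma**2)

def guidance(x, t):
    """Gradient of log p_0(y | D(x, t)) with respect to x (chain rule by hand)."""
    jac = s0**2 / (s0**2 + t**2)   # dD/dx is a scalar multiple of the identity here
    return jac * (A.T @ (y - A @ denoise(x, t))) / sigma**2

def guided_score(x, t):
    """Right-hand side of Eq. (7): weighted likelihood term plus prior score."""
    return zeta * guidance(x, t) + prior_score(x, t)

x_test, t_test = rng.normal(size=d), 0.7
score_value = guided_score(x_test, t_test)
```

The guidance term requires the Jacobian of the denoiser, which is the additional backpropagation cost that the replacement method avoids.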
3 THEORETICAL FRAMEWORK
Our theoretical framework is inspired by existing diffusion models and uses reconstruction guidance
provided by forward models for 3D reconstruction from sparse observations in 2D and 3D.
3.1 3D DIFFUSION PRIOR TRAINING AND SAMPLING
We follow the design choices recommended by Karras et al. (2022), using $f(x, t) = 0$ and
$g(t) = \sqrt{2t}$, which yield the forward diffusion SDE $dx = \sqrt{2t}\, dw_t$ and the perturbation kernel
$p_{0t}(x(t) \mid x(0)) = \mathcal{N}(x(t); x(0), t^2 I)$. For the loss weighting $\lambda(t)$, $p_{\text{train}}(t)$ and the score model
parameterization $s_\theta(x(t), t) = (D_\theta(x(t), t) - x(t))/t^2$, we also follow Karras et al. (2022) (more
details can be found in Appendix A.2).
During inference, we utilize the more general version of the reverse SDE presented by Karras
et al. (2022), which has the same marginals as $dx = \sqrt{2t}\, dw_t$ and gives us more flexibility in
choosing favorable sampling schemes:
$$dx = -\left[ t + \beta(t)\, t^2 \right] \nabla_x \log p_t(x)\, dt + \sqrt{2 \beta(t)\, t^2}\; dw_t \tag{8}$$
where $\beta: \mathbb{R}_+ \to \mathbb{R}_+$ is a function that controls how noisy the trajectory is. The choice
$\beta(t) = 1/t$ results in Eq. (4) as a special case, whereas $\beta(t) = 0$ yields an ordinary differential
equation (ODE) called the flow ODE. In practice, the score of the marginals $\nabla_x \log p_t(x)$ is replaced
by the score estimator $(D_\theta(x, t) - x)/t^2$ and the differential equation must be solved backward in
time by a numerical integrator such as Euler-Maruyama over a time interval $t \in [t_{\min}, t_{\max}]$
where $t_{\min} > 0$. The time interval must be discretized into $N$ time steps
$\{t_{\max} = t_0 > \ldots > t_{N-1} = t_{\min} > t_N = 0\}$. More time steps result in a more accurate simulation of the SDE,
but also increase the number of network function evaluations (NFE). Accurate simulation of the
SDE can be especially difficult in regions where the trajectory has a high curvature, which is typically
the case at smaller $t$. We therefore adopt the time step heuristic of Karras et al. (2022):
$$t_i = \left( t_{\max}^{1/\rho} + \frac{i}{N-1} \left( t_{\min}^{1/\rho} - t_{\max}^{1/\rho} \right) \right)^{\rho}$$
with $i < N$, $\rho \geq 1$ and $t_N = 0$, where an increase in $\rho$ leads to more
time steps in the lower part of the time interval. We found that $\rho = 3$ works well for sampling 3D
point clouds. Algorithm 1 with $\nabla \log \tilde{p}_t(x \mid y) = (D_\theta(x, t) - x)/t^2$ implements unguided diffusion
prior sampling using Euler-Maruyama with correction step.
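The time step heuristic is straightforward to compute. The following sketch is a direct transcription of the formula; the function name `karras_time_steps` and the default values of `t_min` and `t_max` are placeholders chosen for illustration, since the values used in the experiments are not stated in this section.

```python
import numpy as np

def karras_time_steps(n_steps, t_min=0.002, t_max=80.0, rho=3.0):
    """Return t_0 = t_max > ... > t_{N-1} = t_min > t_N = 0 with N = n_steps.

    The defaults for t_min and t_max are placeholder assumptions.
    """
    i = np.arange(n_steps)                      # i = 0, ..., N-1
    inv_rho = 1.0 / rho
    t = (t_max**inv_rho + i / (n_steps - 1) * (t_min**inv_rho - t_max**inv_rho)) ** rho
    return np.append(t, 0.0)                    # append the final step t_N = 0

steps = karras_time_steps(40)                   # 41 values: 40 steps plus t_N = 0
gaps = -np.diff(steps[:-1])                     # step sizes between positive times
```

For $\rho > 1$ the entries of `gaps` shrink monotonically, so the time steps become denser toward `t_min`, which is the region where the trajectory curvature tends to be highest.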
3.2 DIFFUSION POSTERIOR SAMPLING FOR 3D RECONSTRUCTION
To sample the trained diffusion prior in the light of observations $y$ originating from a known
forward process, we use reconstruction guidance (Chung et al., 2023). In contrast to Chung et al.
(2023), we apply a more advanced diffusion schedule (EDM (Karras et al., 2022) instead of VP-
SDE (Song et al., 2021)) to enhance the capabilities of the proposed guidance strategy. We supple-
ment the schedule with a stochastic Euler-Maruyama integrator that uses a second-order correction
step, because both the use of stochasticity (solving an SDE rather than an ODE) and second-order
samplers have been shown to improve image generation performance in the unconditional setting
(Karras et al., 2022). We observed that this also holds for our conditional setting in 3D (see
Table 2). For conditional generation, we extend the score of the marginals from the diffusion prior
$\nabla_{x(t)} \log p_t(x(t))$ with an approximate score of the perturbed likelihoods:
$$\nabla_{x(t)} \log p_t(x(t)) + \zeta \nabla_{x(t)} \log p_0(y \mid D_\theta(x(t), t)) =: \nabla_{x(t)} \log \tilde{p}_t(x(t) \mid y) \tag{9}$$
with $\zeta = \alpha(t) / \sqrt{\log p_0(y \mid D_\theta(x(t), t))}$ following Chung et al. (2023). Algorithm 1 illustrates our
method for conditional generation with reconstruction guidance. To apply this methodology
to the reconstruction of partially observed 3D volumes represented as point clouds, we now define the
corresponding forward processes.
Single 2D projection to 3D. In the simplest version of the reconstruction problem, we partially
observe a single 2D projection of a 3D object in a known orientation. Here we represent the structure
of an object as a 3D point cloud $x(0) \in \mathbb{R}^{N \times 3}$ with $N$ points and the corresponding 2D projection
$y_1 \in \mathbb{R}^{M \times 2}$ as a 2D point cloud consisting of $M$ points. We define the likelihood of observing the
projection $y_1$ given $x(0)$ as $p_0(y_1 \mid x(0)) \propto \exp(-E_1(x(0)))$ where the energy is defined as
$$E_k(x(0)) := \min_{P \in \mathcal{P}_{N \times N}} \left\| P U y_k - (x(0) R_k)_{:,(1,2)} \right\|_F^2 \tag{10}$$
for $k = 1$ with permutation matrices $\mathcal{P}_{N \times N} \subset \{0,1\}^{N \times N}$, orthogonal matrix $R_k \in O(3)$,
Frobenius norm $\|\cdot\|_F$ and the linear operator $U \in \{0,1\}^{N \times M}$ that upsamples$^1$ $y_k$ by randomly redrawing
points. The permutation matrix $P$ assigns each point in $U y_k$ to a single point in the rotated and
projected object $(x(0) R_k)_{:,(1,2)}$. The introduction of $P$ arises from the assumption of a hidden
one-to-one correspondence between the upsampled points $U y_k$ and the points in $x(0)$. The inner
optimization problem is a linear assignment problem (Crouse, 2016) that can be solved exactly in
polynomial time by using the Hungarian method (Kuhn, 1955). We stress that due to the missing
correspondence information, the 3D reconstruction problem from 2D projections with known
orientations is non-trivial and severely ill-posed.
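As an illustration of how the energy in Eq. (10) might be evaluated, the sketch below uses SciPy's `linear_sum_assignment` to solve the inner assignment problem. The function name `projection_energy` and the particular realization of the upsampling operator $U$ (keep all $M$ observed points and redraw $N - M$ of them uniformly at random) are assumptions; other ways of redrawing points are equally compatible with the description above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def projection_energy(x0, y_k, R_k, rng):
    """E_k(x0): squared assignment distance between U y_k and the projected cloud."""
    n, m = x0.shape[0], y_k.shape[0]
    # Upsample y_k to n points: keep all m points and redraw n - m of them at random
    # (one possible reading of the operator U; requires m <= n).
    extra = rng.integers(0, m, size=n - m)
    y_up = np.concatenate([y_k, y_k[extra]], axis=0)
    # Rotate the cloud, then keep the first two coordinates (projection onto xy).
    proj = (x0 @ R_k)[:, :2]
    # cost[a, b] = squared distance between upsampled point a and projected point b.
    cost = np.sum((y_up[:, None, :] - proj[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)    # optimal one-to-one assignment
    return cost[rows, cols].sum()

# Toy check: a shuffled, noise-free projection of the cloud itself has zero energy.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 3))
y = x0[rng.permutation(8), :2]
e_exact = projection_energy(x0, y, np.eye(3), rng)
e_shifted = projection_energy(x0 + 1.0, y, np.eye(3), rng)
```

Because the optimal assignment is piecewise constant in `x0`, the gradient of the energy with respect to the point cloud can be computed by holding the assignment fixed, which is what makes this energy usable for guidance.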
Multiple 2D projections to 3D. To generalize the above forward process, we consider the case
of observing $K$ projections $y = \{y_1, \ldots, y_K\}$ of an object $x(0)$ from known orientations
$R = \{R_1, \ldots, R_K\}$. Then the likelihood of observing the set of projections $y$ from orientations $R$ given
$x(0)$ is the product over all independent observations:
$$p_0(y \mid x(0)) = \prod_k p(y_k \mid x(0)) \propto \exp\left( -\sum_k E_k(x(0)) \right) \tag{11}$$
Coarse to fine grained. We can also guide the diffusion prior by a 3D point cloud with fewer
points $M < N$ representing a low-resolution version $y_{\text{cg}} \in \mathbb{R}^{M \times 3}$ of $x(0) \in \mathbb{R}^{N \times 3}$. From this
coarser observation, we want to infer a higher resolution structure. In order to characterize the
relation between different resolutions, we employ a likelihood similar to the one used for 2D projections,
$p_0(y_{\text{cg}} \mid x(0)) \propto \exp(-E_{\text{cg}}(x(0)))$ where the energy is defined as
$$E_{\text{cg}}(x(0)) := \min_{P \in \mathcal{P}_{N \times N}} \left\| P U y_{\text{cg}} - x(0) \right\|_F^2. \tag{12}$$
From subunit to full 3D reconstruction. If available we can further update our prior knowledge
encoded in the diffusion model by utilizing information about parts or subunits of the unknown 3D
1 Here we consider the case $M \leq N$; however, this formulation can also be used to downsample $y_k$ if $M > N$.
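structure. Thus we define the energy for the likelihood $p_0(y_{\text{su}} \mid x(0)) \propto \exp(-E_{\text{su}}(x(0)))$ of observing
the subunit $y_{\text{su}} \in \mathbb{R}^{L \times 3}$ given $x(0) \in \mathbb{R}^{N \times 3}$ by
$$E_{\text{su}}(x(0)) := \min_{P \in \mathcal{P}_{L \times N}} \left\| P x(0) - y_{\text{su}} \right\|_F^2 \tag{13}$$
with partial permutation matrices $\mathcal{P}_{L \times N} \subset \{0,1\}^{L \times N}$ that pick $L$ out of $N$ points in $x(0)$ to create
a one-to-one correspondence to the $L$ points in $y_{\text{su}}$.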
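The partial assignment in Eq. (13) corresponds to a rectangular linear assignment problem, which SciPy's `linear_sum_assignment` handles directly. The following is a minimal sketch under that reading; the function name `subunit_energy` and the toy data are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def subunit_energy(x0, y_su):
    """E_su(x0): match each of the L subunit points to a distinct point of x0."""
    # cost[l, n] = squared distance between subunit point l and model point n.
    cost = np.sum((y_su[:, None, :] - x0[None, :, :]) ** 2, axis=-1)
    # For an L x N cost matrix with L <= N, every row is assigned to a distinct
    # column, which corresponds to a partial permutation matrix P.
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum(), cols

# Toy check: a subunit cut out of the cloud itself is matched back exactly.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(10, 3))
energy, picked = subunit_energy(x0, x0[[2, 5, 7]])
```

The returned column indices identify which $L$ points of $x(0)$ are currently explained by the subunit; the remaining $N - L$ points are constrained only by the prior and by any other observations.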
We can also combine the likelihoods of all available observations $y = \{y_{\text{su}}, y_{\text{cg}}, y_1, \ldots, y_K\}$ of the 3D
structure to update the prior knowledge encoded in the diffusion prior. To enable the assignment of
importance or uncertainty to each dataset, we can weight the corresponding energies:
$$p_0(y \mid x(0)) \propto \exp\left( -w_{\text{su}} E_{\text{su}}(x(0)) - w_{\text{cg}} E_{\text{cg}}(x(0)) - \sum_k w_k E_k(x(0)) \right) \tag{14}$$
with weights $w_{\text{su}}, w_{\text{cg}}, w_k \geq 0$, coarse-grained structure $y_{\text{cg}}$, subunit $y_{\text{su}}$, 2D observations
$\{y_1, \ldots, y_K\}$, orientations $R$ and 3D structure $x(0)$. In the experiments of this work, we apply
equal weighting of $1/|y|$ to all the observations. The likelihood guidance of the diffusion prior
allows us to flexibly incorporate all this information with varying shapes, thereby avoiding
task-specific retraining.
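Algorithm 1 Approximate posterior sampling with correction step
1: Input: noise control function $\beta$, time steps $\{t_0 > t_1 > \ldots > t_N = 0\}$, observations $y$
2: Output: approximate sample from $p_0(x(0) \mid y)$
3: $x(t_0) \sim \mathcal{N}(0, t_0^2 I)$
4: for $i \in \{0, \ldots, N-1\}$ do
5:     $\Delta t \leftarrow t_i - t_{i+1}$
6:     $x(t_{i+1}) \leftarrow x(t_i) + t_i \nabla \log \tilde{p}_t(x(t_i) \mid y)\, \Delta t$
7:     if $t_{i+1} \neq 0$ then    ▷ correction step + noise injection
8:         $d \leftarrow (t_i + \beta(t_i)\, t_i^2) \left[ \nabla \log \tilde{p}_t(x(t_i) \mid y) + \nabla \log \tilde{p}_t(x(t_{i+1}) \mid y) \right] \Delta t / 2$
9:         $n \sim \mathcal{N}(0, 2 \beta(t_i)\, t_i^2\, \Delta t\, I)$
10:        $x(t_{i+1}) \leftarrow x(t_i) + d + n$
11:    end if
12: end for
13: return $x$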
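The control flow of Algorithm 1 can be transcribed into a few lines of NumPy. The sketch below is not the implementation used in the experiments: the function name `sample` is hypothetical, the learned, guided score is replaced by the closed-form score of a toy Gaussian data distribution $\mathcal{N}(0, s_0^2)$, and the values of `t_min` and `t_max` are assumptions. The noise control function is the one quoted in the caption of Table 2.

```python
import numpy as np

def sample(score, time_steps, beta, shape, rng):
    """Transcription of Algorithm 1; `score(x, t)` stands in for grad log p~_t(x|y)."""
    t = time_steps
    x = t[0] * rng.normal(size=shape)                 # x(t_0) ~ N(0, t_0^2 I)
    for i in range(len(t) - 1):
        dt = t[i] - t[i + 1]
        s_i = score(x, t[i])
        x_next = x + t[i] * s_i * dt                  # Euler step (line 6)
        if t[i + 1] != 0:                             # correction step + noise injection
            b = beta(t[i])
            d = (t[i] + b * t[i] ** 2) * (s_i + score(x_next, t[i + 1])) * dt / 2
            n = np.sqrt(2 * b * t[i] ** 2 * dt) * rng.normal(size=shape)
            x_next = x + d + n
        x = x_next
    return x

# Toy check: for data ~ N(0, s0^2) the marginal score is known in closed form.
s0 = 0.5
score = lambda x, t: -x / (s0**2 + t**2)
beta = lambda t: 1.0 / t if t > 0.15 else 0.0         # schedule quoted in Table 2
rho, t_min, t_max, n_steps = 3.0, 0.002, 10.0, 40     # t_min and t_max are assumptions
idx = np.arange(n_steps)
ts = (t_max ** (1 / rho) + idx / (n_steps - 1) * (t_min ** (1 / rho) - t_max ** (1 / rho))) ** rho
ts = np.append(ts, 0.0)
samples = sample(score, ts, beta, shape=20_000, rng=np.random.default_rng(0))
```

In this toy setting the empirical standard deviation of `samples` should be close to `s0`. Each iteration with a correction step evaluates the score twice, which is consistent with 40 time steps corresponding to 79 network function evaluations.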
4 EXPERIMENTS
To demonstrate the fidelity and flexibility of our approach, we conducted multiple experiments. We
trained diffusion priors on several 3D datasets and tested their usefulness on a variety
of 3D reconstruction tasks.
4.1 DIFFUSION PRIOR TRAINING
We trained one diffusion prior for each of three datasets, which come from different domains and differ in
their level of complexity.
(A) ShapeNet-Chair : 2658 point clouds from the training split of the ShapeNet dataset in the cate-
gory “Chair” accessed via PyTorch Geometric (Chang et al., 2015; Fey & Lenssen, 2019). During
training, we randomly subsampled 1024 points from each point cloud and applied random orthogo-
nal transformations to augment the dataset.
(B) ShapeNet-Mixed : 10693 point clouds from the training split of the ShapeNet dataset in the
categories “Airplane”, “Bag”, “Cap”, “Car”, “Chair”, “Guitar”, “Laptop”, “Motorbike”, “Mug”,
“Pistol”, “Rocket”, “Skateboard” and “Table” (all categories from ShapeNet whose point clouds contain
at least 1024 points) accessed via PyTorch Geometric (Chang et al., 2015; Fey & Lenssen, 2019).
Again, we applied subsampling and augmentation with random orthogonal transformations to the
training data.
(C) CryoStruct : 6629 point clouds representing mixture models of size 1024 constructed from
the 3D atom positions of biomolecular complexes from the PDB contained in the train split of the
curated Cryo2StructDataset (Giri et al., 2024). The mixture models were created using the scikit-
learn GaussianMixture method with covariance matrix shared among the components (Pedregosa
et al., 2011). We also augmented the dataset by randomly rotating the biomolecular complexes.
The point clouds in all three datasets are centered and scaled so as to fit into the [−1, 1] cube.
Figures 3, 4, and 5 in the appendix present images of unconditional samples from the diffusion
priors. Following the methodology of Yang et al. (2019), we present the 1-nearest neighbor accuracy
(1-NNA), coverage (COV), and minimum matching distance (MMD) in Table 3 in the appendix to
quantify the performance of the diffusion model.
4.2 3D RECONSTRUCTION ON SHAPENET
To demonstrate the performance and flexibility of our method on the widely used ShapeNet bench-
mark (Chang et al., 2015), we conducted experiments across nine different configurations. An ad-
vantage of the ShapeNet reconstruction tasks is that it is easier to visually judge the quality of the
reconstructions than for the CryoStruct reconstruction tasks. In each setting, we took the first 100
instances from both the ShapeNet-Chair and ShapeNet-Mixed test set as ground truth and created
sparse observations y. These observations include 2D projections, coarse-grained point clouds, or
subunits. The 2D projections are constructed by sampling points from the ground truth and applying
a random orthogonal transformation to the sub-sampled points before projecting them onto the xy-
plane. The coarse-grained point clouds are constructed by taking the means from a mixture model
fitted to the ground truth point cloud. A subunit, i.e. a partial structure, corresponds to a single
k-means cluster selected randomly from the ground truth.
We applied our version of approximate DPS (see Algorithm 1) to generate ten 3D reconstructions per
instance using only 40 time steps (additional details on the parameters can be found in Section A.3
of the Appendix). We compared our method to the ML approach obtained by maximizing the same
log-likelihood that was also used to guide the diffusion prior during approximate DPS. Starting from
10 different random clouds with points uniformly distributed in $[-1, 1]^3$, we performed gradient
descent for 100 steps using the Adam optimizer with a learning rate of 0.01 (Kingma & Ba, 2014).
By using the same likelihoods without the diffusion model, we can assess how much we gain in
3D reconstruction performance by utilizing a diffusion prior. Similar to the approach of Yang et al.
(2019), we measure the 3D reconstruction error between a reconstructed point cloud and the ground
truth with the Chamfer Distance (CD) and the Earth Mover's Distance (EMD). The values in Table
1 are the means and standard deviations of all $100 \times 10$ reconstruction errors measured in CD and
EMD as well as the negative log-likelihood (energy $E$) of the corresponding forward model.
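For reference, a common form of the Chamfer Distance can be written in a few lines of NumPy. Conventions differ between papers (squared versus unsquared distances, sums versus means), and the exact convention used for Table 1 is not specified in this section, so the following should be read as one plausible variant rather than the definition used in the experiments.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point clouds a (n, 3) and b (m, 3).

    Uses mean squared nearest-neighbor distances; this convention is an assumption.
    """
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)   # (n, m) squared distances
    # Mean squared distance to the nearest neighbor, accumulated in both directions.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

rng = np.random.default_rng(0)
cloud = rng.uniform(-1, 1, size=(256, 3))
cd_same = chamfer_distance(cloud, cloud)
cd_shifted = chamfer_distance(cloud, cloud + 0.1)
```

In contrast to the EMD, the Chamfer Distance does not enforce a one-to-one matching, so several points of one cloud may share the same nearest neighbor in the other.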
Table 1 shows that, as expected, in most cases the maximum likelihood approach creates 3D re-
constructions with a higher likelihood (lower energy E) of observing the input data ythan DPS.
However, in the face of the ill-posedness of the reconstruction tasks, it is not sufficient to simply
optimize the likelihood. This explains why incorporating the diffusion prior consistently
yields lower reconstruction errors in all test cases for both EMD and CD, although for most test
cases the likelihoods obtained with DPS are worse than those obtained with ML. Prominent example
reconstructions that demonstrate the superior performance of DPS are shown in Figure 1. The diffu-
sion prior helps navigate the space of possible 3D reconstructions with high likelihood toward those
with typical ShapeNet structures, information that is not sufficiently provided by the observations y
themselves. Therefore the structural models obtained with DPS are also visually much closer to the
ground truth with notable ShapeNet-like characteristics.
4.3 DIFFUSION POSTERIOR SAMPLING FOR CRYO-EM
We also benchmark posterior sampling with diffusion priors on various reconstruction tasks arising
in the context of cryo-EM reconstruction. We are mostly interested in sparse data scenarios. This
might appear to be at odds with the fact that cryo-EM tends to produce many hundreds of thousands
Figure 1: Results for five different reconstruction tasks. In all examples, the ML reconstruction
has a higher likelihood of observing the input data than the models obtained with approximate DPS.
However, the ML-based models show a higher reconstruction error than those from DPS. The results
are also part of the tests presented in Table 1 and correspond to rows 9, 8, 1, 8 and 2 (from left to
right).
Table 1: Results from the 3D reconstruction task from sparse data. Tests were conducted on the test
partition of the ShapeNet (Mixed, Chair) datasets under various configurations, altering the number
of points per projection, coarse-grained structure and subunit. We compared our variant of approx-
imate diffusion posterior sampling (DPS) to the maximum likelihood (ML) approach. To quantify
the error between the reconstructions and the ground truth point clouds we calculated the mean
Chamfer Distance (CD) and mean Earth Mover's Distance (EMD) over a total of 1,000 reconstructions
(10 samples for each of the 100 test instances). For further analysis we also show the energy of the
forward model ($E$).
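| Row | ShapeNet category | Method | Projection points | Number of projections | Coarse-grained points | Subunit points | CD (×10², ↓) | EMD (×10², ↓) | E (×10³, ↓) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Chair | DPS | 200 | 5 | - | - | 9.98 ± 2.38 | 8.32 ± 1.77 | 3.80 ± 1.02 |
| 1 | Chair | ML | 200 | 5 | - | - | 13.30 ± 2.00 | 11.03 ± 1.69 | 3.68 ± 0.88 |
| 2 | Chair | DPS | 200 | 6 | - | - | 9.71 ± 1.81 | 7.78 ± 1.35 | 3.98 ± 0.87 |
| 2 | Chair | ML | 200 | 6 | - | - | 12.53 ± 1.53 | 10.04 ± 1.30 | 3.87 ± 0.79 |
| 3 | Mixed | DPS | 400 | 4 | - | - | 10.56 ± 4.09 | 8.18 ± 3.40 | 2.41 ± 1.01 |
| 3 | Mixed | ML | 400 | 4 | - | - | 12.37 ± 2.34 | 10.21 ± 2.09 | 2.38 ± 0.79 |
| 4 | Mixed | DPS | 400 | 5 | - | - | 9.29 ± 2.65 | 7.00 ± 2.22 | 2.30 ± 0.89 |
| 4 | Mixed | ML | 400 | 5 | - | - | 11.78 ± 1.88 | 9.39 ± 1.77 | 2.72 ± 1.27 |
| 5 | Chair | DPS | 300 | 1 | 30 | - | 10.38 ± 2.24 | 10.66 ± 3.23 | 6.21 ± 1.76 |
| 5 | Chair | ML | 300 | 1 | 30 | - | 12.40 ± 2.16 | 11.89 ± 2.84 | 5.04 ± 1.52 |
| 6 | Mixed | DPS | 300 | 1 | 30 | - | 9.36 ± 2.23 | 9.21 ± 2.47 | 5.57 ± 2.11 |
| 6 | Mixed | ML | 300 | 1 | 30 | - | 11.99 ± 1.89 | 11.08 ± 2.01 | 5.13 ± 1.69 |
| 7 | Chair | DPS | 200 | 2 | - | ≈256 | 13.21 ± 5.69 | 12.86 ± 5.62 | 2.51 ± 0.66 |
| 7 | Chair | ML | 200 | 2 | - | ≈256 | 16.98 ± 4.58 | 16.63 ± 4.85 | 1.84 ± 0.63 |
| 8 | Mixed | DPS | 200 | 2 | - | ≈256 | 11.11 ± 4.13 | 11.19 ± 5.09 | 2.47 ± 0.86 |
| 8 | Mixed | ML | 200 | 2 | - | ≈256 | 18.14 ± 6.28 | 17.99 ± 6.59 | 2.22 ± 1.05 |
| 9 | Mixed | DPS | 200 | 1 | 30 | ≈128 | 8.55 ± 1.97 | 9.38 ± 1.96 | 4.17 ± 1.64 |
| 9 | Mixed | ML | 200 | 1 | 30 | ≈128 | 11.19 ± 1.82 | 10.90 ± 1.88 | 3.80 ± 1.49 |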
Table 2: Evaluation of the improvement we obtain by switching from integrating the flow ODE
($\beta(t) = 0$) using Euler's method in A to integrating the SDE ($\beta(t) = 1/t$ if $t > 0.15$ and
$0$ otherwise) using the Euler–Maruyama method in B. In C, we observe that the reconstruction error is
lowered further by adding a second-order correction step. The test errors were evaluated on the
ShapeNet-Mixed reconstruction task given a subunit with ≈256 points and two projection images
with 200 points each (row 8 in Table 1). In all three schedules, we used 79 NFE, which amounts to
79 time steps in A and B and 40 time steps in C.
| Schedule | CD (×10², ↓) | EMD (×10², ↓) | E (×10³, ↓) |
|---|---|---|---|
| A: Euler ODE | 14.38 ± 5.64 | 13.98 ± 5.94 | 3.09 ± 1.19 |
| B: + noise | 11.80 ± 3.94 | 11.71 ± 5.03 | 2.61 ± 0.85 |
| C: + correction step | 11.11 ± 4.13 | 11.19 ± 5.09 | 2.47 ± 0.86 |
of images. Our interest is in reconstructing intermediate resolution structures from very few images,
with the goal of elucidating structural differences between individual copies of the biomolecule.
These structural variations are expected to occur, because biomolecular complexes are flexible and
undergo conformational changes. Conformational heterogeneity is often linked to the biological
function of a macromolecular complex and of particular interest to the structural biologist (Toader
et al., 2023).
We designed various benchmarks based on a held-out set of 100 structures from Cryo2StructDataset
that were not used in the training of the diffusion prior. The reconstruction tasks involve sparse
2D and/or 3D information. Again, as a baseline we used ML models obtained by maximizing
the likelihood without the diffusion prior (a detailed presentation of the results can be found in
the Supplementary Material, Sections A.4.1 to A.4.7). We generated ten models with and without
diffusion model per reconstruction task. To assess the accuracy of the model structures, we compare
them against the atomic structure deposited in the PDB and the point cloud obtained by fitting a
1024-component mixture of Gaussians used for the generation of the input measurements. The 3D
points generated by ML and DPS tend to concentrate in $[-1, 1]^3$. Before a meaningful comparison
between the ground truth and model structures can be made, we first need to scale the model points
Figure 2: Outcomes for five cryo-EM reconstruction tasks. The top row shows the sparse input mea-
surements. The second row shows all ten point clouds generated with DPS. The third row shows the
1024 component means of a mixture model fitted to the atomic models (last row). (A) Nucleosome-CHD4
from five projections (PDB code 6ryr). (B) F-ATP Synthase from four projections (PDB
code 6rdm). (C) RNA polymerase transcription open promoter complex with Sorangicin from three
projections (PDB code 6vvy). (D) Human spliceosome after Prp43 loaded from one projection and a
low-resolution structure consisting of 40 particles (PDB code 6id1). (E) 26S proteasome from three
projections and a known 20S structure (PDB code 6fvt).
so as to match the physical units of the coordinates in the PDB file (which are in Å). We achieve this
by matching the radii of gyration. However, there could still be a mismatch between the ground truth
and the scaled model resulting from a relative rotation and translation (rigid transformation) between
the two point clouds. We estimate the optimal alignment of both point clouds by maximizing the
kernel correlation (Tsin & Kanade, 2004).
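The scaling step described above can be sketched as follows. This is a minimal illustration, assuming that the radius of gyration is the root-mean-square distance of the points from their centroid and that the model is rescaled about its own centroid; the function names and the synthetic point clouds are hypothetical, and the subsequent kernel-correlation alignment is not shown.

```python
import numpy as np

def radius_of_gyration(points):
    """Root-mean-square distance of the points from their centroid."""
    centered = points - points.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())

def scale_to_reference(model, reference):
    """Rescale `model` about its centroid so that its radius of gyration
    equals that of `reference` (assumed convention, see lead-in)."""
    center = model.mean(axis=0)
    factor = radius_of_gyration(reference) / radius_of_gyration(model)
    return center + factor * (model - center)

rng = np.random.default_rng(0)
reference = 50.0 * rng.normal(size=(1024, 3))    # stand-in for coordinates in Å
model = rng.uniform(-1.0, 1.0, size=(1024, 3))   # stand-in for a generated point cloud
scaled = scale_to_reference(model, reference)
```

After this step the two point clouds share the same overall size but may still differ by a rigid transformation, which is why an alignment step follows.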
After scaling and superposition, we can meaningfully compare model point clouds against the
atomic and coarse-grained ground truth structures. We assess the accuracy of the models with
the root mean square deviation (RMSD) which is commonly used to compare biomolecular struc-
tures. Since there is no one-to-one correspondence between the points in the cloud representing the
ground truth (all heavy atoms in the PDB file or component means of the Gaussian mixture) and
the models computed with ML or DPS, we compute
$$\mathrm{RMSD} = \Big( \frac{1}{N} \sum_{n=1}^{N} \| x_n - x'_{\ell_n} \|^2 \Big)^{1/2},$$
where $\ell_n \in \{1, \ldots, 1024\}$ encodes the correspondence between the points $x_n$ representing the
ground truth and the points $x'_m$ representing the model (where $m \in \{1, \ldots, 1024\}$). In case the
ground truth is represented by all heavy atoms, we set $\ell_n = \operatorname{argmin}_m \| x_n - x'_m \|$ (where
"argmin" runs over all $m$) and $N$ is the total number of heavy atoms in the PDB file. The 100 PDB
structures in the test set vary widely in the number of heavy atoms, from $N = 2178$ to $N = 110541$.
In case the ground truth is represented by the 1024 component means of the Gaussian mixture (also
referred to as “subsampled structure” in the following), we compute $\ell_n$ by solving the linear
assignment problem that matches the 1024 points representing the ground truth against the 1024
points in the model (in this case $N = 1024$).
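Both RMSD variants can be sketched in a few lines. This is an illustrative sketch rather than the evaluation code used for the benchmarks: the function names are hypothetical, and using squared Euclidean distances as the cost of the linear assignment problem is an assumption. The assignment is solved with `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between all rows of a and all rows of b."""
    diff = a[:, None, :] - b[None, :, :]
    return (diff ** 2).sum(axis=-1)

def rmsd_nearest(truth, model):
    """RMSD where each ground-truth point is paired with its nearest model point
    (the heavy-atom case; several atoms may share one model point)."""
    d2 = pairwise_sq_dists(truth, model)
    return np.sqrt(d2.min(axis=1).mean())

def rmsd_assignment(truth, model):
    """RMSD under a one-to-one matching found by linear assignment
    (the subsampled case; assumed cost: squared distances)."""
    d2 = pairwise_sq_dists(truth, model)
    rows, cols = linear_sum_assignment(d2)
    return np.sqrt(d2[rows, cols].mean())

rng = np.random.default_rng(0)
truth = rng.normal(size=(64, 3))
model = truth[rng.permutation(64)] + 0.1 * rng.normal(size=(64, 3))
print(rmsd_nearest(truth, model), rmsd_assignment(truth, model))
```

The dense distance matrix is fine for 1024 points; for structures with more than 100,000 heavy atoms a chunked or tree-based nearest-neighbour search would be the more memory-friendly choice.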
Figure 2 shows representative cryo-EM reconstructions for five different sparse data scenarios. Fig-
ure 2A shows the results for a nucleosome-CHD4 complex (PDB code 6ryr, 17820 heavy atoms).
Five 2D projections served as input for DPS reconstruction. The RMSD between the ten DPS models
and the ground truth is 3.56 ± 0.04 Å (atomic structure) and 2.05 ± 0.09 Å (subsampled structure).
We also inferred the structure of F-ATP synthase (PDB code 6rdm, 33891 heavy atoms) from four
projections (Fig. 2B). The RMSD between the DPS models and the ground truth is 4.46 ± 0.02 Å
(atomic structure) and 2.83 ± 0.03 Å (subsampled structure). The RNA polymerase transcription open
promoter complex with Sorangicin (PDB code 6vvy, 30033 heavy atoms) was inferred from three
projections (Fig. 2C). The RMSD between the DPS models and the ground truth is 4.87 ± 0.77 Å
(atomic structure) and 3.39 ± 1.21 Å (subsampled structure). These tests show that
intermediate-resolution structures can be computed from very few 2D projections.
A common scenario in cryo-EM is that a low-resolution structure is already known and the goal of
a cryo-EM study is to furnish structural details at higher resolution. This scenario was tested on the
human intron lariat spliceosome (PDB code 6id1, 79882 heavy atoms). The structural models were
computed from a single projection and a low-resolution structure represented by only 40 points (Fig.
2D). The RMSD between DPS models and the ground truth is 10.35 ± 0.20 Å (atomic structure) and
9.82 ± 0.15 Å (1024 component means). Because the structure is huge and the input data for DPS are
very sparse, the RMSD is worse than in the previous examples. Nevertheless, it is remarkable that
such sparse information allows us to refine the coarse-grained spliceosome structure to a medium
resolution.
The final example shows the power of DPS for 3D reconstruction from few projections and a subunit
structure. This is a common scenario in structural biology where many partial structures have been
determined and the challenge is to determine the full structure. To test this scenario, we model the
26S proteasome (PDB code 6fvt, 110541 heavy atoms). Historically, a huge part of the 26S protea-
some, the 20S proteasome, was determined before the complete 26S structure could be elucidated
by cryo-EM. In our tests, we use three projections and the structure of the 20S proteasome as input
(Fig. 2E). The RMSD between the models obtained with DPS and the ground truth is 8.14 ± 0.12 Å
(atomic structure) and 5.66 ± 0.19 Å (subsampled structure).
4.4 LIMITATIONS
A major limitation of the proposed method concerns its runtime. In each approximate DPS step
with correction, we have to evaluate the gradient of the energy from our forward model twice.
Overall, this means that we need 2 × #timesteps − 1 network function evaluations and have to solve
(2 × #timesteps − 1) × #observations linear assignment problems to obtain a single 3D reconstruction.
In practice, reconstructing a 3D structure from 6 input projections with 40 timesteps in a batch of
10 takes ≈ 1.2 min per sample on an A100 GPU in combination with an Intel Xeon Platinum 8360Y
2.40 GHz CPU.
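As a worked example of the cost formula above, the following snippet counts the network evaluations and linear assignment problems for the quoted setting of 40 timesteps and 6 projections. The function name is hypothetical; the arithmetic follows the two expressions stated in the text.

```python
def dps_cost(num_timesteps, num_observations):
    """Return (network function evaluations, linear assignment problems)
    needed for one 3D reconstruction."""
    evaluations = 2 * num_timesteps - 1
    return evaluations, evaluations * num_observations

network_evals, assignments = dps_cost(num_timesteps=40, num_observations=6)
print(network_evals, assignments)  # 79 474
```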
5 CONCLUSION
We propose a Bayesian approach for 3D reconstruction from sparse measurements such as 2D pro-
jections, coarse-grained structures, and/or substructures, using diffusion models as priors. Diffusion
models are capable of encoding rich prior information about 3D structures and enable us to re-
construct meaningful 3D models from very sparse input data via approximate diffusion posterior
sampling. Diffusion priors can distill rich data sources and thereby complement existing regulariza-
tion techniques whenever such training data are available. The goal of future research is to improve
the resolution of the 3D reconstructions.
ACKNOWLEDGMENTS
This work was supported by the Carl Zeiss Stiftung within the project “Interactive Inference”. In
addition, Michael Habeck acknowledges funding by the Carl Zeiss Stiftung within the program
“CZS Stiftungsprofessuren” and is grateful for the support of the DFG within project 432680300 -
SFB 1456 subproject A05.
REFERENCES
Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and
their Applications , 12(3):313–326, 1982. ISSN 0304-4149. doi: https://doi.org/10.
1016/0304-4149(82)90051-5. URL https://www.sciencedirect.com/science/
article/pii/0304414982900515 .
Helen M Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig,
Ilya N Shindyalov, and Philip E Bourne. The protein data bank. Nucleic acids research , 28(1):
235–242, 2000.
Gabriel Cardoso, Yazid Janati El Idrissi, Sylvain Le Corff, and Eric Moulines. Monte Carlo guided
diffusion for Bayesian linear inverse problems. arXiv preprint arXiv:2308.07983 , 2023.
Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li,
Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu.
ShapeNet: An Information-Rich 3D Model Repository. Technical Report arXiv:1512.03012
[cs.GR], Stanford University — Princeton University — Toyota Technological Institute at
Chicago, 2015.
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving Diffusion Models for
Inverse Problems using Manifold Constraints. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave,
and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems , 2022. URL
https://openreview.net/forum?id=nJJjv0JDJju .
Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky, and Jong Chul
Ye. Diffusion Posterior Sampling for General Noisy Inverse Problems. In The Eleventh Interna-
tional Conference on Learning Representations , 2023. URL https://openreview.net/
forum?id=OnD9zGAGT0k .
David Frederic Crouse. On implementing 2D rectangular assignment algorithms. IEEE Trans-
actions on Aerospace and Electronic Systems , 52:1679–1696, 2016. URL https://api.
semanticscholar.org/CorpusID:20649848 .
Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion Models Beat GANs on Image Synthesis.
In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan (eds.), Advances in Neural
Information Processing Systems, 2021. URL https://openreview.net/forum?id=AAWuCvzaVt.
Zehao Dou and Yang Song. Diffusion Posterior Sampling for Linear Inverse Problem Solving:
A Filtering Perspective. In The Twelfth International Conference on Learning Representations ,
2024. URL https://openreview.net/forum?id=tplXNcHZs1 .
Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, and Angela Dai. Hyperdiffusion: Generating
implicit neural fields with weight-space diffusion. In Proceedings of the IEEE/CVF International
Conference on Computer Vision, pp. 14300–14310, 2023.
Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with PyTorch Geometric.
arXiv preprint arXiv:1903.02428 , 2019.
Nabin Giri, Liguo Wang, and Jianlin Cheng. Cryo2structdata: A large labeled cryo-em density map
dataset for ai-based modeling of protein structures. Scientific Data , 11(1):458, 2024.
Shivam Gupta, Ajil Jalal, Aditya Parulekar, Eric Price, and Zhiyang Xun. Diffusion Posterior Sam-
pling is Computationally Intractable. arXiv preprint arXiv:2402.12727 , 2024.
Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance, 2022.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In
H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (eds.), Advances in Neu-
ral Information Processing Systems , volume 33, pp. 6840–6851. Curran Associates, Inc.,
2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/
file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf .
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J.
Fleet. Video Diffusion Models, 2022.
Aapo Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. Journal of
Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html.
John B. Ingraham, Max Baranov, Zak Costello, Karl W. Barber, Wujie Wang, Ahmed Ismail, Vincent
Frappier, Dana M. Lord, Christopher Ng-Thow-Hing, Erik R. Van Vlack, Shan Tie, Vincent Xue,
Sarah C. Cowles, Alan Leung, João V. Rodrigues, Claudio L. Morales-Perez, Alex M. Ayoub,
Robin Green, Katherine Puentes, Frank Oplinger, Nishant V. Panwar, Fritz Obermeyer, Adam R.
Root, Andrew L. Beam, Frank J. Poelwijk, and Gevorg Grigoryan. Illuminating protein space
with a programmable generative model. Nature , 623(7989):1070–1078, November 2023. ISSN
1476-4687. doi: 10.1038/s41586-023-06728-8. URL http://dx.doi.org/10.1038/
s41586-023-06728-8 .
Ajay Jain, Ben Mildenhall, Jonathan T. Barron, Pieter Abbeel, and Ben Poole. Zero-Shot Text-
Guided Object Generation with Dream Fields. CoRR , abs/2112.01455, 2021. URL https:
//arxiv.org/abs/2112.01455 .
Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint
arXiv:2305.02463 , 2023.
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the Design Space of
Diffusion-Based Generative Models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and
Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems , 2022. URL
https://openreview.net/forum?id=k7FuTOWMOc7 .
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyz-
ing and Improving the Training Dynamics of Diffusion Models, 2024.
Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization.
CoRR , abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:
6628106 .
Karsten Kreis, Tim Dockhorn, Zihao Li, and Ellen Zhong. Latent space diffusion models of cryo-em
structures. arXiv preprint arXiv:2211.14169 , 2022.
H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics
Quarterly , 2(1-2):83–97, 1955. doi: https://doi.org/10.1002/nav.3800020109. URL https:
//onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800020109 .
Minghua Liu, Chao Xu, Haian Jin, Linghao Chen, Mukund Varma T, Zexiang Xu, and Hao Su.
One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization. In
A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (eds.), Advances in
Neural Information Processing Systems , volume 36, pp. 22226–22246. Curran Associates, Inc.,
2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/
file/4683beb6bab325650db13afd05d1a14a-Paper-Conference.pdf .
Andreas Lugmayr, Martin Danelljan, Andrés Romero, Fisher Yu, Radu Timofte, and Luc Van Gool.
RePaint: Inpainting using Denoising Diffusion Probabilistic Models. CoRR , abs/2201.09865,
2022. URL https://arxiv.org/abs/2201.09865 .
Shitong Luo and Wei Hu. Diffusion Probabilistic Models for 3D Point Cloud Generation. CoRR ,
abs/2103.01458, 2021. URL https://arxiv.org/abs/2103.01458 .
Luke Melas-Kyriazi, Christian Rupprecht, and Andrea Vedaldi. Pc2: Projection-conditioned point
cloud diffusion for single-image 3d reconstruction. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition , pp. 12923–12932, 2023.
Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and
Ren Ng. NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM ,
65(1):99–106, dec 2021. ISSN 0001-0782. doi: 10.1145/3503250. URL https://doi.org/
10.1145/3503250 .
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system
for generating 3d point clouds from complex prompts, 2022.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,
and E. Duchesnay. Scikit-learn: Machine Learning in Python. Journal of Machine Learning
Research , 12:2825–2830, 2011.
Ali Punjani, John L Rubinstein, David J Fleet, and Marcus A Brubaker. cryoSPARC: algorithms for
rapid unsupervised cryo-EM structure determination. Nature methods , 14(3):290–296, 2017.
Sjors HW Scheres. RELION: implementation of a Bayesian approach to cryo-EM structure deter-
mination. Journal of structural biology , 180(3):519–530, 2012.
LLC Schrödinger and Warren DeLano. PyMOL, 2020. URL http://www.pymol.org/pymol.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised
Learning using Nonequilibrium Thermodynamics. In Francis Bach and David Blei (eds.), Pro-
ceedings of the 32nd International Conference on Machine Learning , volume 37 of Proceedings
of Machine Learning Research , pp. 2256–2265, Lille, France, 07–09 Jul 2015. PMLR. URL
https://proceedings.mlr.press/v37/sohl-dickstein15.html .
Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, and Liyue Shen. Solving Inverse
Problems with Latent Diffusion Models via Hard Data Consistency. In The Twelfth International
Conference on Learning Representations , 2024. URL https://openreview.net/forum?
id=j8hdRqOUhN .
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution.
Advances in neural information processing systems , 32, 2019.
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben
Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In Interna-
tional Conference on Learning Representations , 2021. URL https://openreview.net/
forum?id=PxTIG12RRHS .
Bogdan Toader, Fred J Sigworth, and Roy R Lederman. Methods for cryo-EM single particle re-
construction of macromolecules having continuous heterogeneity. Journal of Molecular Biology ,
435(9):168020, 2023.
Brian L. Trippe, Luhuan Wu, Christian A. Naesseth, David Blei, and John Patrick Cunningham.
Practical and Asymptotically Exact Conditional Sampling in Diffusion Models. In ICML 2023
Workshop on Structured Probabilistic Inference & Generative Modeling , 2023a. URL https:
//openreview.net/forum?id=r9s3Gbxz7g .
Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, and
Tommi S. Jaakkola. Diffusion Probabilistic Modeling of Protein Backbones in 3D for the motif-
scaffolding problem. In The Eleventh International Conference on Learning Representations ,
2023b. URL https://openreview.net/forum?id=6TxBxqNME1Y .
Yanghai Tsin and Takeo Kanade. A correlation-based approach to robust point set registration. In
Computer Vision-ECCV 2004: 8th European Conference on Computer Vision, Prague, Czech
Republic, May 11-14, 2004. Proceedings, Part III 8 , pp. 558–569. Springer, 2004.
Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis, et al. Lion: La-
tent point diffusion models for 3d shape generation. Advances in Neural Information Processing
Systems , 35:10021–10039, 2022.
Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural
Computation, 23(7):1661–1674, 2011. doi: 10.1162/NECO_a_00142.
Xiao Wang, Han Zhu, Genki Terashi, Manav Taluja, and Daisuke Kihara. Diffmodeler: large
macromolecular structure modeling for cryo-em maps using a diffusion model. Nature Methods,
October 2024. ISSN 1548-7105. doi: 10.1038/s41592-024-02479-0. URL
http://dx.doi.org/10.1038/s41592-024-02479-0.
Haiyang Xu, Yu Lei, Zeyuan Chen, Xiang Zhang, Yue Zhao, Yilin Wang, and Zhuowen Tu. Bayesian
Diffusion Models for 3D Shape Reconstruction. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition , pp. 10628–10638, 2024.
Guandao Yang, Xun Huang, Zekun Hao, Ming-Yu Liu, Serge Belongie, and Bharath Hariharan.
Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the
IEEE/CVF international conference on computer vision , pp. 4541–4550, 2019.
Jing Zhang, Tengfei Zhao, ShiYu Hu, and Xin Zhao. Robust single-particle cryo-em image denoising
and restoration. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) , pp. 2995–2999, 2024. doi: 10.1109/ICASSP48485.2024.10447135.
Ellen D. Zhong, Tristan Bepler, Bonnie Berger, and Joseph H. Davis. Cryodrgn: reconstruction
of heterogeneous cryo-em structures using neural networks. Nature Methods , 18(2):176–185,
February 2021a. ISSN 1548-7105. doi: 10.1038/s41592-020-01049-4. URL http://dx.
doi.org/10.1038/s41592-020-01049-4 .
Ellen D. Zhong, Adam Lerer, Joseph H. Davis, and Bonnie Berger. Cryodrgn2: Ab initio neural re-
construction of 3d protein structures from real cryo-em images. In Proceedings of the IEEE/CVF
International Conference on Computer Vision (ICCV) , pp. 4066–4075, October 2021b.
Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel
diffusion. In Proceedings of the IEEE/CVF international conference on computer vision , pp.
5826–5835, 2021.
A APPENDIX
A.1 RECONSTRUCTION GUIDANCE
For completeness we show how to derive the approximation $p_t(y \mid x(t)) \approx p_0(y \mid D_\theta(x(t), t))$ used
in reconstruction guidance to perform approximate diffusion posterior sampling. To take advantage
of the likelihood on the noiseless data, $p_0(y \mid x(0))$, we follow the argument of Chung et al. (2023)
and first express $p_t(y \mid x(t))$ as the marginal
$$p_t(y \mid x(t)) = \int p(y \mid x(t), x(0))\, p(x(0) \mid x(t))\, dx(0). \tag{15}$$
Given $x(0)$, $y$ is independent of $x(t)$. Therefore, we can simplify $p(y \mid x(t), x(0))$ to $p_0(y \mid x(0))$
and obtain
$$p_t(y \mid x(t)) = \int p_0(y \mid x(0))\, p(x(0) \mid x(t))\, dx(0). \tag{16}$$
Chung et al. (2023) note that $p(x(0) \mid x(t))$ is intractable in the general case and therefore
propose the approximation
$$p(x(0) \mid x(t)) \approx \delta(D(x(t), t) - x(0)), \tag{17}$$
where $D(x(t), t) := \mathbb{E}_{x(0) \sim p(\cdot \mid x(t))}[x(0)]$. The intuition behind $D$ is that it denoises the noisy
input $x(t)$. This so-called denoising function is in general intractable and has to be replaced by an
estimator $D_\theta$ (Karras et al., 2022). Learning $D_\theta$ is an essential part of diffusion model training
and is discussed in Section 3.1. By plugging (17) into (16), with $D$ replaced by $D_\theta$, we obtain
$$p_t(y \mid x(t)) \approx \int p_0(y \mid x(0))\, \delta(D_\theta(x(t), t) - x(0))\, dx(0) = p_0(y \mid D_\theta(x(t), t)). \tag{18}$$
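To make approximation (18) concrete, the following toy sketch evaluates the guidance term, the gradient of $\log p_0(y \mid D(x(t), t))$ with respect to $x(t)$, in a setting where everything is available in closed form. All modelling choices here are assumptions for illustration only: a standard normal prior on $x(0)$ (so that the denoiser under the kernel $\mathcal{N}(x(0), t^2 I)$ is $D(x, t) = x / (1 + t^2)$), a linear forward model `A`, and Gaussian measurement noise with standard deviation `sigma`. The forward model and learned denoiser used in the paper are different.

```python
import numpy as np

def denoiser(x_t, t):
    """Exact posterior mean E[x(0) | x(t)] for the toy prior x(0) ~ N(0, I)."""
    return x_t / (1.0 + t ** 2)

def approx_log_likelihood(x_t, t, y, A, sigma):
    """log p_0(y | D(x(t), t)) up to an additive constant."""
    residual = y - A @ denoiser(x_t, t)
    return -0.5 * residual @ residual / sigma ** 2

def guidance_gradient(x_t, t, y, A, sigma):
    """Gradient of approx_log_likelihood w.r.t. x(t), via the chain rule
    through the (here linear) denoiser."""
    residual = y - A @ denoiser(x_t, t)
    return (A.T @ residual) / (sigma ** 2 * (1.0 + t ** 2))

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))      # hypothetical linear forward model
x_t = rng.normal(size=3)         # noisy state at time t
y = rng.normal(size=2)           # hypothetical measurement
grad = guidance_gradient(x_t, t=0.7, y=y, A=A, sigma=0.1)
```

With a neural denoiser the same gradient would be obtained by automatic differentiation through the network instead of the closed-form chain rule used here.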
A.2 TRAINING
In this section, we aim to provide additional background on the objective optimized during diffusion
model training and elaborate on the specifics of training our 3D diffusion priors. In diffusion model
training we use gradient descent to find a good score model $s_\theta$ via explicit score matching (ESM):
$$\min_\theta \mathbb{E}_{t, x(t)} \left[ \lambda(t) \left\| \nabla_{x(t)} \log p_t(x(t)) - s_\theta(x(t), t) \right\|^2 \right] \tag{19}$$
Figure 3: Unconditional samples from the diffusion prior trained on the ShapeNet-Chair dataset.
Sampled with Algorithm 1 using β(t) = 1/t if t > 1 else 1, tmax = 80 and 100 time steps. The
images are created using the surface mode of PyMOL (Schrödinger & DeLano, 2020).
Figure 4: Unconditional samples from the diffusion prior trained on the ShapeNet-Mixed dataset.
Sampled with Algorithm 1 using β(t) = 1/t if t > 1 else 1, tmax = 80 and 100 time steps. The
images are created using the surface mode of PyMOL (Schrödinger & DeLano, 2020).
Table 3: All metrics (1-NNA, COV and MMD) used to quantify the diffusion prior generation
performance are based on the Chamfer Distance.
Dataset          Samples from      1-NNA (%, ↓)   COV (%, ↑)   MMD (×10², ↓)
ShapeNet-Chair   Diffusion prior   78.34          44.89        27.11
ShapeNet-Chair   Training set      47.80          60.23        22.97
ShapeNet-Mixed   Diffusion prior   66.76          44.59        24.83
ShapeNet-Mixed   Training set      45.91          55.41        20.63
CryoStruct       Diffusion prior   54.29          42.38        18.68
CryoStruct       Training set      44.94          53.05        18.13
Figure 5: Unconditional samples from the diffusion prior trained on the CryoStruct dataset. Sampled
with Algorithm 1 using β(t) = 1/t if t > 0.8 else 1, tmax = 80 and 100 time steps. The images are
created using the surface mode of PyMOL (Schrödinger & DeLano, 2020).
where $t \sim p_{\text{train}}$ (e.g. $p_{\text{train}} = \mathcal{U}[0, T]$), $x(t) \sim p_t$, and $\lambda: \mathbb{R}_+ \to \mathbb{R}_+$ is the loss weighting
(e.g. $\lambda(t) = 1/2$) (Hyvärinen, 2005; Vincent, 2011). However, the score of the marginals,
$\nabla_{x(t)} \log p_t(x(t))$, is unknown; therefore, we do not have an explicit regression target for $s_\theta$.
Fortunately, the results of Vincent (2011) state that the optimisation problem of ESM is equivalent
to the denoising score matching (DSM) objective:
$$\min_\theta \mathbb{E}_{t, x(0), x(t)} \left[ \lambda(t) \left\| \nabla_{x(t)} \log p_{0t}(x(t) \mid x(0)) - s_\theta(x(t), t) \right\|^2 \right] \tag{20}$$
where $x(0) \sim p_0$ and $x(t) \sim p_{0t}(\cdot \mid x(0))$. In DSM we only need to evaluate the score of the
perturbation kernel $p_{0t}$, which is easy to calculate for suitable choices of the drift and diffusion
coefficients (see for example the variance exploding or variance preserving schedules in Song et al. (2021)).
We follow the design recommendations of Karras et al. (2022) and use $f(x, t) = 0$
and $g(t) = \sqrt{2t}$, which yields the forward diffusion SDE $dx = \sqrt{2t}\, dw_t$. The resulting
perturbation kernel $p_{0t}(x(t) \mid x(0))$ describes the sum of the starting position $x(0)$ and infinitely
many independent, infinitesimally small Gaussian contributions. The perturbation kernel is therefore
itself Gaussian: $p_{0t}(x(t) \mid x(0)) = \mathcal{N}(x(t); x(0), t^2 I)$. Plugging it into the DSM objective in (20), we obtain
$$\min_\theta \mathbb{E}_{t, x(0), x(t)} \left[ \lambda(t) \left\| \frac{x(0) - x(t)}{t^2} - s_\theta(x(t), t) \right\|^2 \right]. \tag{21}$$
This motivates the score-model parameterization $s_\theta(x(t), t) = (D_\theta(x(t), t) - x(t)) / t^2$.
Plugging it into the objective and absorbing the resulting factor $1/t^4$ into the weighting $\lambda(t)$
further simplifies it to
$$\min_\theta \mathbb{E}_{t, x(0), x(t)} \left[ \lambda(t) \left\| x(0) - D_\theta(x(t), t) \right\|^2 \right], \tag{22}$$
which justifies the terminology denoising function for $D_\theta$. Following Karras et al. (2022), we
define the weighting $\lambda(t) := 1 / c_{\text{out}}^2$ with $c_{\text{out}} := 0.25\, t / \sqrt{t^2 + 0.25}$ and select $p_{\text{train}}(t)$ during
training as a log-normal distribution, $\ln(t) \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$, to focus on the relevant parts of the
noise schedule.
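A Monte Carlo estimate of objective (22) for one batch can be sketched as below. This is a minimal NumPy illustration, not the training code: `denoiser` is a hypothetical stand-in for $D_\theta$, the batch layout (batch, points, 3) is an assumption, and in practice the loss would be computed in an automatic-differentiation framework so that it can be minimized by gradient descent.

```python
import numpy as np

def sample_noise_levels(rng, batch, p_mean, p_std):
    """Draw noise levels t with ln(t) ~ N(p_mean, p_std^2)."""
    return np.exp(p_mean + p_std * rng.normal(size=batch))

def c_out(t):
    """Output scaling as stated in the text."""
    return 0.25 * t / np.sqrt(t ** 2 + 0.25)

def dsm_loss(denoiser, x0, t, rng):
    """Monte Carlo estimate of objective (22) for point clouds x0 of shape
    (batch, points, 3) and noise levels t of shape (batch,)."""
    noise = rng.normal(size=x0.shape)
    x_t = x0 + t[:, None, None] * noise            # sample from N(x(0), t^2 I)
    weight = 1.0 / c_out(t) ** 2                   # lambda(t) = 1 / c_out^2
    sq_err = ((x0 - denoiser(x_t, t)) ** 2).sum(axis=(1, 2))
    return (weight * sq_err).mean()

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 1024, 3))     # stand-in training batch
t = sample_noise_levels(rng, batch=8, p_mean=-1.2, p_std=1.2)
loss = dsm_loss(lambda x_t, t: x_t, x0, t, rng)    # identity map as a dummy denoiser
```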
Following Karras et al. (2022), we define the denoiser by
$$D_\theta(x, t) := c_{\text{skip}}(t)\, x + c_{\text{out}}(t)\, F_\theta(c_{\text{in}}(t)\, x, c_{\text{noise}}(t)), \tag{23}$$
where $c_{\text{skip}} := 0.25 / (t^2 + 0.25)$ and $c_{\text{in}} := 1 / \sqrt{t^2 + 0.25}$, with $c_{\text{noise}} := 1000 \cdot t$ for the
ShapeNet-Chair diffusion prior and $c_{\text{noise}} := 1000 \cdot t / t_{\max}$ for the ShapeNet-Mixed and CryoStruct
diffusion priors. To model $F_\theta$, we used the point transformer architecture of Nichol et al. (2022) with a
width of 512 and 24 layers, where each of the 1024 points has its 3D coordinates as features. This
architecture provides us with a highly flexible model with permutation equivariance. To account for
the lack of rotation equivariance, we used data augmentation during training. For ShapeNet-Chair
and ShapeNet-Mixed we performed data augmentation by transforming each point cloud during
training with a random orthogonal matrix. In the case of the CryoStruct dataset, each point cloud is
transformed by a random proper rotation matrix, an element of SO(3), to avoid training on unnatural
mirrored biomolecules.
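The two augmentation variants can be sketched as follows. This is an illustrative NumPy sketch, assuming that the random matrices are drawn via a QR decomposition of a Gaussian matrix; how the matrices were actually sampled during training is not stated, and the function names are hypothetical.

```python
import numpy as np

def random_orthogonal(rng):
    """Random 3x3 orthogonal matrix; its determinant may be +1 or -1."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))   # sign fix on the columns of q

def random_rotation(rng):
    """Random proper rotation, i.e. an orthogonal matrix with determinant +1."""
    q = random_orthogonal(rng)
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]           # flip one column to remove the reflection
    return q

def augment(points, rng, allow_reflection):
    """Transform a point cloud of shape (n, 3) by a random orthogonal matrix
    (ShapeNet case) or by a random proper rotation (CryoStruct case)."""
    matrix = random_orthogonal(rng) if allow_reflection else random_rotation(rng)
    return points @ matrix.T

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(1024, 3))
rotated = augment(cloud, rng, allow_reflection=False)
```

Excluding reflections matters for biomolecules because a mirrored structure has the opposite handedness and is therefore not a physically valid training example.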
To train the diffusion prior on the ShapeNet-Chair dataset, we selected Pmean = −4, Pstd = 1.2
with tmax = 1 for roughly the first 90% of the training steps and Pmean = −1.2, Pstd = 1.2 with
tmax = 80 for the rest.
For the ShapeNet-Mixed and CryoStruct datasets, we selected Pmean = −2.8, Pstd = 0.9 with tmax = 1
for approximately the first 90% of the training steps and Pmean = −1.2, Pstd = 1.2 with tmax = 80
for the remaining ones.
We trained the ShapeNet-Chair diffusion prior for 2827 epochs on the ShapeNet-Chair training
split with random orthogonal augmentations. During training, we manually adjusted the batch size
(ranging from 100 to 200) and the learning rate (from 1 × 10^−4 to 1 × 10^−5). The ShapeNet-Mixed
diffusion prior was also trained on the respective training split with random orthogonal augmentations,
for 998 epochs with a batch size ranging from 200 to 360 and a constant learning rate of 8 × 10^−5.
For the CryoStruct diffusion prior, we used the training split of Giri et al. (2024) with random
rotational augmentations. Throughout the 2070 epochs of training, we manually increased the batch
size from 120 to 200 and decreased the learning rate from 1 × 10^−4 to 8 × 10^−5.
A.3 TEST PARAMETERS
We consistently used 40 time steps and ρ = 3 for the reconstruction tasks involving the ShapeNet-
Chair, ShapeNet-Mixed, and CryoStruct datasets. An exception was made for the results depicted in
Figure 2, where the number of time steps was doubled to 80. Moreover, we found that our likelihood-
based guidance is most effective at lower values of t. Thus, to avoid superfluous computation, we
set tmax to 1 for the reconstruction tasks. We used β(t) = 1/t for most tasks, but noticed that for
input projections with a low number of points, setting β(t) to 0 during the final time steps produced
sharper 3D point clouds. The guidance strength α is chosen roughly according to the amount of
information contained in the input data y. Refer to Table 4 for further details.
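The β(t) choices and a time grid for the reconstruction tasks can be sketched as below. Two assumptions are made here and should be checked against Algorithm 1: that ρ is the exponent of the time discretization of Karras et al. (2022), and that the grid ends at a small positive `t_min` (the value 0.002 is hypothetical). The cutoff 0.15 is the one listed for some rows of Table 4.

```python
import numpy as np

def beta(t, cutoff=None):
    """beta(t) = 1/t; if a cutoff is given, beta is set to 0 for t <= cutoff."""
    t = np.asarray(t, dtype=float)
    value = 1.0 / t
    if cutoff is None:
        return value
    return np.where(t > cutoff, value, 0.0)

def time_steps(num_steps, t_max, t_min, rho):
    """Grid decreasing from t_max to t_min, spaced uniformly in t^(1/rho)
    (assumed Karras-style discretization, see lead-in)."""
    i = np.arange(num_steps)
    a = t_max ** (1.0 / rho)
    b = t_min ** (1.0 / rho)
    return (a + i / (num_steps - 1) * (b - a)) ** rho

ts = time_steps(num_steps=40, t_max=1.0, t_min=0.002, rho=3.0)
betas = beta(ts, cutoff=0.15)   # zero during the final (small-t) steps
```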
[Two plots: CD (×10², ↓) and EMD (×10², ↓) versus the number of time steps (10, 20, 40, 80).]
Figure 6: Comparison of reconstruction error in two scenarios over the number of time steps used
in Algorithm 1. The plot on the left corresponds to the test case of row 8 in Table 1 with parameters
of row 8 in Table 4. The right side depicts the test errors for the configuration outlined in row 3 of
Table 1, utilizing the parameters specified in row 4 of Table 4.
Table 4: Parameters used during approximate diffusion posterior sampling.
Dataset          Projection   Number of     Coarse-grained   Subunit   β(t)                      α
                 points       projections   points           points
ShapeNet-Chair   200          5             -                -         1/t if t > 0.15, else 0   10k
ShapeNet-Chair   200          6             -                -         1/t if t > 0.15, else 0   10k
ShapeNet-Mixed   400          5             -                -         1/t                       10k
ShapeNet-Mixed   400          4             -                -         1/t                       5k
ShapeNet-Chair   300          1             30               -         1/t                       4k
ShapeNet-Mixed   300          1             30               -         1/t                       4k
ShapeNet-Chair   200          2             -                ≈256      1/t if t > 0.15, else 0   4k
ShapeNet-Mixed   200          2             -                ≈256      1/t if t > 0.15, else 0   4k
ShapeNet-Mixed   200          2             30               ≈128      1/t if t > 0.15, else 0   4k
CryoStruct       900          2             30               -         1/t                       4k
CryoStruct       300          1             40               -         1/t                       4k
CryoStruct       1024         1             40               -         1/t                       4k
CryoStruct       1024         3             -                -         1/t                       40k
CryoStruct       800          4             -                -         1/t                       40k
CryoStruct       1024         4             -                -         1/t                       60k
CryoStruct       1024         5             -                -         1/t                       80k
A.4 CRYOSTRUCT BENCHMARKS
The following figures summarize the CryoStruct benchmark for various input measurements. The
PDB codes of the 100 test structures are indicated on the x-axis. The red dashed line shows the
average RMSD of the ML-based models. The box plots and dashed orange lines show the RMSD
of the models generated with DPS.
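A.4.1 INPUT DATA: THREE 2D PROJECTIONS, 1024 POINTS PER PROJECTION
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures. PDB codes on the x-axis, 25 per panel:
Panel 1: 7syx, 6xz7, 7rze, 5w0s, 6v22, 7s0c, 6cs3, 6c23, 6ifr, 7mdt, 6ezm, 6rdm, 7o3j, 6ap1, 6pv8,
7e7d, 7bp9, 6bm0, 6uz3, 7x1t, 7jzy, 7bxu, 7kwg, 6caa, 7f7f.
Panel 2: 6ebk, 7tci, 5m3l, 5ogw, 7wli, 7y9s, 6nyg, 7fec, 7xmr, 7w4o, 7l88, 6uza, 8cy6, 7lij, 7k9x,
7czp, 6cv3, 6x6c, 6q2o, 7czt, 7liw, 7emf, 6urg, 7nt9, 6nq2.
Panel 3: 7r0z, 5lk8, 7dxc, 6vae, 6drk, 7ew3, 6rwl, 7t9n, 7r87, 7dty, 7wr8, 7kiy, 5mmj, 7jk3, 7jwb,
6l54, 6ryr, 6uja, 7xjk, 7l6q, 6uze, 7s9a, 7kf5, 7w59, 7m6m.
Panel 4: 6dnh, 6xld, 6yt5, 7rky, 6x40, 5vzl, 6utf, 7rja, 6uwf, 7kne, 7ssz, 6vvy, 6f1u, 6m71, 7ej0,
7e1x, 5an9, 6wm4, 6zkc, 7mis, 7fie, 7k8x, 6w62, 6eit, 7sq0.]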
A.4.2 INPUT DATA: FOUR 2D PROJECTIONS, 1024 POINTS PER PROJECTION
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures (PDB codes on the x-axis).]
A.4.3 INPUT DATA: FIVE 2D PROJECTIONS, 1024 POINTS PER PROJECTION
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures (PDB codes on the x-axis).]
A.4.4 INPUT DATA: FOUR 2D PROJECTIONS, 800 POINTS PER PROJECTION
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures (PDB codes on the x-axis).]
A.4.5 INPUT DATA: TWO PROJECTIONS (900 POINTS PER PROJECTION) + A
LOW-RESOLUTION STRUCTURE (30 POINTS)
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures (PDB codes on the x-axis).]
A.4.6 INPUT DATA: A SINGLE PROJECTION (1024 POINTS) + A LOW-RESOLUTION
STRUCTURE (40 POINTS)
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures (PDB codes on the x-axis).]
A.4.7 INPUT DATA: A SINGLE PROJECTION (300 POINTS) + A LOW-RESOLUTION
STRUCTURE (40 POINTS)
[Figure: four panels showing the RMSD [Å] of the ML and DPS models for each of the 100 test
structures (PDB codes on the x-axis).]