The Curse of Conditions: Analyzing and Improving Optimal Transport for
Conditional Flow-Based Generation
Ho Kei Cheng Alexander Schwing
University of Illinois Urbana-Champaign
{hokeikc2,aschwing}@illinois.edu
[Figure 1 panels. Columns: Unconditional (FM, OT); With discrete conditions (FM, OT, C2OT (ours)); With continuous conditions, target x coordinates (FM, OT, C2OT (ours)). Rows: Euler-1 step; Adaptive.
Values below the plots, Euler-1 step: 6.232±0.186, 0.072±0.025, 2.573±0.112, 0.483±0.059, 0.048±0.010, 0.732±0.052, 8.276±6.510, 0.077±0.024.
Values below the plots, Adaptive: 0.120±0.045, 0.050±0.040, 0.059±0.029, 0.462±0.025, 0.018±0.006, 0.028±0.010, 2.143±1.993, 0.013±0.003.]
Figure 1. We visualize the flows learned by different algorithms using the 8gaussians→moons dataset. We compare flow matching
(FM), minibatch optimal transport FM (OT), and our proposed conditional optimal transport FM (C2OT) using one Euler step and an
adaptive ODE solver, respectively. Below each plot, we show the 2-Wasserstein distance (lower is better; mean ±std over 10 runs). For
unconditional generation, OT achieves significantly straighter flows, outperforming FM. However, paradoxically, OT performs worse when
conditions are introduced. This paper analyzes this degradation and finds that it occurs because optimal transport disregards conditions,
leading to a train-test gap. To address this, we propose a simple fix (C2OT) that outperforms FM and OT in conditional generation.
Abstract
Minibatch optimal transport coupling straightens paths in
unconditional flow matching. This leads to computationally
less demanding inference as fewer integration steps and less
complex numerical solvers can be employed when numeri-
cally solving an ordinary differential equation at test time.
However, in the conditional setting, minibatch optimal trans-
port falls short. This is because the default optimal transport
mapping disregards conditions, resulting in a conditionally
skewed prior distribution during training. In contrast, at
test time, we have no access to the skewed prior, and instead
sample from the full, unbiased prior distribution. This gap
between training and testing leads to a subpar performance.
To bridge this gap, we propose conditional optimal transport
(C2OT) that adds a conditional weighting term in the cost
matrix when computing the optimal transport assignment.
Experiments demonstrate that this simple fix works with both
discrete and continuous conditions in 8gaussians→moons,
CIFAR-10, ImageNet-32×32, and ImageNet-256×256. Our
method performs better overall compared to the existing
baselines across different function evaluation budgets. Code
is available at hkchengrex.github.io/C2OT.
1. Introduction
We focus on flow-matching-based conditional generative
models, i.e., generation guided by an input condition.¹ Examples
of such conditions include class labels, text, or video.
Recently, flow matching has been applied in many areas
of computer vision, including text-to-image/video [10, 35],
vision-language applications [3], image restoration [32], and
video-to-audio [4]. However, these methods can be slow at
¹ We use condition to denote an input condition, not to be confused with
the ‘condition’ in conditional flow matching (CFM) [27] (where the second
‘C’ in our acronym comes from); there, it refers to a data sample. To avoid
ambiguity, we refer to CFM simply as flow matching (FM) in this paper.
arXiv:2503.10636v1 [cs.LG] 13 Mar 2025
test-time – obtaining a solution requires numerically integrat-
ing the flow with an ODE solver, typically involving many
steps, each requiring a deep network forward pass. Naïvely
reducing the number of steps degrades performance, as the
underlying flow is often curved (Figure 1, leftmost column),
necessitating a small step size for accurate numerical inte-
gration. One way to address this issue is by straightening
the flow. In the context of unconditional generation, Tong
et al. [42] and Pooladian et al. [36] concurrently proposed
“minibatch optimal transport” (OT), which deterministically
couples data and samples from the prior within a minibatch
via optimal transport (replacing random coupling) to mini-
mize flow path lengths and to straighten the flow.
While OT improves unconditional generation, we find
that it paradoxically and consistently harms conditional gen-
eration (Figure 1). This occurs because OT disregards the
conditions when computing the coupling. As a result, during
training, OT samples from a prior distribution that is skewed
by the condition, as we show in Section 3.2. However, at
test time, we have no access to this skewed distribution and
instead sample from the full prior distribution. This mis-
match creates a gap between training and testing, leading to
performance degradation. To bridge this gap, we propose
conditional optimal transport FM (C2OT) which introduces
a simple yet effective condition-aware weighting term when
computing the cost matrix for OT. Additionally, we propose
two techniques: adaptive weight finding to simplify hyper-
parameter tuning and efficient oversampling of OT batches
to counteract the reduced effective OT batch size introduced
by the weighting term.
We conduct extensive experiments on both two-dimensional
synthetic data and high-dimensional image data, including
CIFAR-10, ImageNet-32×32, and ImageNet-256×256.
The results demonstrate that our method performs
well across different condition types (discrete and continuous),
datasets, network architectures (UNets and transformers),
and data spaces (image space and latent space). C2OT
achieves better overall performance than the existing baselines
across different inference computation budgets, including
few-step Euler integration and an adaptive
ordinary differential equation (ODE) solver [9].
2. Related Works
Flow Matching. Recently, flow matching [1, 2, 27] has
become a popular choice for generative modeling, in part due
to its simple training objective and ability to generate high-quality
samples. However, at test time, flow-based methods
typically require many computationally expensive forward
passes through a deep net to numerically integrate the flow.
This is because the underlying flow is usually curved and
therefore cannot be approximated well with a few integration
steps.
Optimal Transport Coupling. To straighten the flow, in
the context of unconditional generation, Tong et al. [42] and
Pooladian et al. [36] concurrently proposed minibatch op-
timal transport (OT). While OT has proven effective in the
unconditional setting, we show in Section 3.2 that it skews
the prior distribution and fails in the conditional setting if
used naively. Our method, C2OT, specifically addresses this
failure, aiming to extend the success of OT in unconditional
generation to conditional generation. Note, OT has also been
used in equivariant flow matching [22, 41], which exploits
domain-specific data symmetry (e.g., in molecules). In contrast,
our method targets general data. Further, Hui et al.
[17] find that mixing OT and independent coupling improves
shape completion, which is an orthogonal contribution to
this paper.
Straightening Flow Paths. Other methods of straightening
flow paths exist. Reflow [30] achieves this by retraining
a flow matching network multiple times, at the cost of additional
training overhead. Variational flow matching [13]
reduces flow ambiguity with a latent variable and hierarchical
flow matching [47] straightens flows by modeling
higher-order flows. These approaches are orthogonal to our
focus on prior-data coupling and, in principle, can be combined
with our method.
Learning the Prior Distribution. An alternative approach
to improving flow is to jointly learn a prior distribution with
the flow matching network. However, these methods [26, 29,
39] typically rely on a variational autoencoder (VAE) [16,
21] to learn the prior. This introduces yet another density
estimation problem, which complicates training. In this
work, we focus on improving training-free coupling plans
without learning a new prior.
Consistency Models. Consistency models [40] have been
proposed to accelerate diffusion model sampling, and can
be extended to flow matching [45]. Recent advancements
have improved training efficiency [12], stability [31], and
quality [19]. These methods are orthogonal to our contribution,
which focuses on improving prior-data coupling, and
can be integrated, as demonstrated by Silvestri et al. [39]
in their combination of consistency models with OT. In this
paper, we focus on the base form of flow matching to avoid
distractions from other formulations.
3. Method
3.1. Preliminaries
[Figure 2 panels. Columns: FM, OT, C2OT (ours), each with a train plot and a test plot running from $t = 0$ to $t = 1$. At $t = 0$, the label $q(x_0) = q(x_0 \mid c)$ appears five times and the pair $q(x_0 \mid c = 0)$, $q(x_0 \mid c = 1)$ appears once; at $t = 1$, each method is labeled $q(x_1 \mid c = 0)$ and $q(x_1 \mid c = 1)$.]
Figure 2. We visualize the coupling during training and the learned flow from $t = 0$ to $t = 1$ during testing for FM, OT, and our proposed
method C2OT. The prior is defined as $q(x_0) = \mathcal{N}(0, 1)$, while the target distribution is given by
$q(x_1) = \frac{1}{2} q(x_1 \mid c = 0) + \frac{1}{2} q(x_1 \mid c = 1) = \frac{1}{2} \mathcal{N}(-2, 0.5) + \frac{1}{2} \mathcal{N}(2, 0.5)$.
Note, FM results in curved flows during testing, while OT degenerates due to skewed training samples
($q(x_0 \mid c) \neq q(x_0)$). In contrast, our method successfully learns straight flows without degeneration.
Flow Matching (FM). We base our discussion on FM [27,
42] (also called rectified flow [30]) for generative modeling
and refer readers to [30, 42] for details. At test time, a sample
$\tilde{x}_1$ is characterized by an ordinary differential equation
(ODE) with an initial boundary value specified via noise $x_0$,
randomly drawn from a prior distribution (e.g., the standard
normal). To compute the sample $\tilde{x}_1$, this ODE is commonly
solved by numerically integrating a learned flow/velocity field
$v_\theta$ from time $t = 0$ to time $t = 1$, where $\theta$ denotes the
set of trainable deep net parameters. For conditional generation,
we additionally obtain a condition $c$, e.g., a class
label from user input, which adjusts the flow. We use $v_{\theta,c}$ to
denote this conditional flow. At training time, we find the
deep net parameters $\theta$ by minimizing the objective
$$\mathbb{E}_{t,\,(x_0, x_1, c) \sim \pi(x_0, x_1, c)} \left\| v_{\theta,c}(t, x_t) - u(x_t \mid x_0, x_1) \right\|^2, \tag{1}$$
where $t$ is uniformly drawn from the interval $[0, 1]$, and
where $\pi(x_0, x_1, c)$ denotes the joint distribution of the prior
and the training data. In practice, independent coupling
is often used, i.e., the prior and the training data are sampled
independently, such that
$$\pi(x_0, x_1, c) = q(x_0)\, q(x_1, c). \tag{2}$$
Further,
$$x_t = t x_1 + (1 - t) x_0 \tag{3}$$
defines a linear interpolation between noise and data, and
$$u(x_t \mid x_0, x_1) = x_1 - x_0 \tag{4}$$
denotes its corresponding flow velocity at $x_t$. Note, simply
dropping the condition $c$ leads to the unconditional flow
matching formulation.
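As an illustration of Equations (1), (3), and (4), the following is a minimal NumPy sketch of how a training pair and a Monte Carlo loss estimate could be formed. The function names `fm_training_pair` and `fm_loss` are hypothetical and chosen for this example only; the network prediction is passed in as a plain array rather than computed by a deep net.

```python
import numpy as np

def fm_training_pair(x0, x1, t):
    """Return the interpolant x_t (Eq. 3) and the target velocity u (Eq. 4)."""
    # One time value per sample, broadcast over the feature dimension.
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, 1)
    xt = t * x1 + (1.0 - t) * x0
    u = x1 - x0
    return xt, u

def fm_loss(v_pred, u):
    """Monte Carlo estimate of Eq. (1): squared error to the target velocity."""
    return float(np.mean(np.sum((v_pred - u) ** 2, axis=1)))

# Independent coupling (Eq. 2): prior and data are drawn separately.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 2)) + 3.0   # stand-in for a data minibatch
x0 = rng.normal(size=(4, 2))         # samples from a standard normal prior
t = rng.uniform(size=4)              # t drawn uniformly from [0, 1]
xt, u = fm_training_pair(x0, x1, t)
```

In a real training loop, `v_pred` would be the output of $v_{\theta,c}(t, x_t)$; here the loss is zero only when the prediction equals $x_1 - x_0$ for every sample.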
Although rectified flow matching is trained with straight
paths, the learned flow is often curved (Figure 1, bottom-left)
since there are many possible paths induced by the
independently sampled couplings $(x_0, x_1)$ through any $x_t$ at
time $t$ [36]. Curved paths require more numerical integration
steps, and therefore more network evaluations, which increases
inference cost.
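To make the link between path shape and step count concrete, the sketch below integrates two hand-written velocity fields with the explicit Euler method. Both fields are hypothetical stand-ins for a learned $v_\theta$: one is constant in time (a straight, constant-speed path) and one varies with $t$ while covering the same total displacement of 2.

```python
import numpy as np

def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(t, x) from t = 0 to t = 1 with explicit Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity(t, x)
    return x

def straight_velocity(t, x):
    # Constant velocity: a single Euler step is already exact.
    return np.full_like(x, 2.0)

def time_varying_velocity(t, x):
    # Integrates to the same displacement (2.0) but is not constant in t.
    return np.full_like(x, 6.0 * t ** 2)

start = np.zeros(3)
one_step_straight = euler_sample(straight_velocity, start, 1)
one_step_varying = euler_sample(time_varying_velocity, start, 1)
many_step_varying = euler_sample(time_varying_velocity, start, 100)
```

With the constant field, one step reaches the exact endpoint. With the time-varying field, one step does not move at all because the velocity at $t = 0$ is zero, and many steps are needed to approach the endpoint. This is only a toy analogue of the effect described above, not a reproduction of the flows in Figure 1.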
Optimal Transport (OT) Coupling. To address this issue
for unconditional generation, Tong et al. [42] and Pooladian
et al. [36] concurrently proposed to use minibatch optimal
transport to deterministically couple the sampled prior
and a sampled batch of the data. This minimizes flow path
lengths and straightens the flow. Concretely, given a minibatch
of $b$ $e$-dimensional samples, an OT coupling seeks a
$b \times b$ permutation matrix $P$ such that the squared Euclidean
distance is minimized, i.e.,
$$\min_P \| X_0 - P X_1 \|_2^2. \tag{5}$$
Here, $X_0, X_1 \in \mathbb{R}^{b \times e}$ are matrices containing a minibatch
of samples from the prior and the data. Instead of an “independent
coupling”, in the unconditional setting we then
optimize a variant of Equation (1) using the coupled tuple
$(X_0, P X_1)$. Note that this algorithm corresponds to sampling
tuples $(x_0, x_1)$ from a joint distribution $\pi_{ot}(x_0, x_1)$.
Importantly, for unconditional generation, i.e., when
using the joint distribution $\pi_{ot}(x_0, x_1)$, the marginals remain
unchanged from independent coupling (see [42]):
$$\int \pi_{ot}(x_0, x_1)\, q(x_0)\, \mathrm{d}x_0 = q(x_1), \quad \text{and} \quad \int \pi_{ot}(x_0, x_1)\, q(x_1)\, \mathrm{d}x_1 = q(x_0). \tag{6}$$
Hence, during test-time, we can sample from the prior $q(x_0)$
and integrate the flow $v_\theta$ to simulate data from the data
distribution $q(x_1)$.
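The minibatch coupling in Equation (5) can be sketched with a linear assignment solver. The example below uses `scipy.optimize.linear_sum_assignment`; the helper names `pairwise_sq_dist` and `ot_coupling` are hypothetical, and this is an illustrative sketch rather than the implementation used in [36, 42].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pairwise_sq_dist(x0, x1):
    """cost[i, j] = ||x0[i] - x1[j]||^2 for all pairs in the minibatch."""
    diff = x0[:, None, :] - x1[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def ot_coupling(x0, x1):
    """Permute x1 so that the summed squared distance to x0 is minimal (Eq. 5)."""
    cost = pairwise_sq_dist(x0, x1)
    rows, cols = linear_sum_assignment(cost)  # optimal one-to-one assignment
    return x0[rows], x1[cols], cols

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 2))                               # prior samples
x1 = rng.normal(size=(64, 2)) + np.array([4.0, 0.0])        # stand-in data
x0_m, x1_m, perm = ot_coupling(x0, x1)

independent_cost = np.sum((x0 - x1) ** 2)    # cost of the identity pairing
ot_cost = np.sum((x0_m - x1_m) ** 2)         # cost after OT re-pairing
```

Because the identity pairing is one of the candidate permutations, `ot_cost` can never exceed `independent_cost`; shorter pairings are what straighten the learned flow.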
Unfortunately, as shown in the next section, this does not
hold for conditional generation. Said differently, naively
extending the unconditional joint distribution $\pi_{ot}(x_0, x_1)$ to
a conditional joint distribution $\pi_{ot}(x_0, x_1, c)$ by using the
coupled tuple $(X_0, P X_1, P C)$, where $C \in \mathbb{R}^{b \times h}$ is a batch
of $h$-dimensional conditions, does not maintain marginals
and leads to a gap between training time and test time. We
discuss this next.
3.2. OT Coupling Skews The Prior
Intuition. Consider the one-dimensional example illustrated
in Figure 2: the prior distribution $q(x_0)$ follows a
Gaussian distribution, and the target distribution $q(x_1)$ consists
of a mixture of two Gaussians translated left and right,
with the translation direction indicated by a binary condition
$c$. When training with a $\pi_{ot}(x_0, x_1, c)$ that represents the
coupled tuple $(X_0, P X_1, P C)$, samples from the prior and
the sampled conditions become correlated through the OT
coupling: samples from the left half of the prior distribution
are always associated with $c = 0$, while those from the right
half always correspond to $c = 1$. As a result, the model
overfits on data pairs $(x_0, c)$ where either $(x_0 < 0, c = 0)$
or $(x_0 > 0, c = 1)$. However, during testing, this skewed
distribution is no longer accessible, and instead, $x_0$ and $c$ are
sampled independently. Consequently, the model encounters
previously unseen data pairs, i.e., $(x_0 > 0, c = 0)$ and
$(x_0 < 0, c = 1)$, leading to failure. A similar phenomenon
can be observed in the two-dimensional example illustrated
in Figure 1.
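The correlation between prior samples and conditions can be reproduced numerically. The sketch below mimics the one-dimensional setup of Figure 2 and applies plain OT coupling that ignores $c$. It assumes that the second argument of $\mathcal{N}(\pm 2, 0.5)$ is a standard deviation, which the text does not specify; all variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
b = 256
c = rng.integers(0, 2, size=b)                       # binary condition per data point
# Target: left mode for c = 0, right mode for c = 1 (0.5 taken as std, an assumption).
x1 = rng.normal(loc=np.where(c == 0, -2.0, 2.0), scale=0.5)
x0 = rng.normal(size=b)                              # prior N(0, 1), independent of c

# Plain OT coupling on (x0, x1); the condition simply follows x1.
cost = (x0[:, None] - x1[None, :]) ** 2
rows, cols = linear_sum_assignment(cost)
x0_ot, c_ot = x0[rows], c[cols]

# Under independent coupling x0 is unrelated to c; under OT it is not.
mean_x0_c0_indep = x0[c == 0].mean()
mean_x0_c0_ot = x0_ot[c_ot == 0].mean()
mean_x0_c1_ot = x0_ot[c_ot == 1].mean()
```

After OT coupling, the prior samples paired with $c = 0$ have a clearly negative mean and those paired with $c = 1$ a clearly positive mean, which is the skew $q(x_0 \mid c) \neq q(x_0)$ discussed above.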
Skewed Prior. Formally, imagine that we consider the
most general OT coupling joint distribution $\pi_{ot}(x_0, x_1, c)$ at
training time. At test time, the user wants to specify a condition
$c_1$ to obtain samples from the conditional distribution
$q(x_1 \mid c = c_1)$. What is the prior distribution that we need to
sample from such that we arrive at the correct conditional
distribution $q(x_1 \mid c = c_1)$? To compute this, imagine we
are given an arbitrary condition $c_1$, i.e., we look at the induced
distribution $\pi_{ot}(x_0, x_1, c \mid c = c_1)$. Marginalizing this
induced distribution over $x_1$ yields
$$\int \pi_{ot}(x_0, x_1, c \mid c = c_1)\, \mathrm{d}x_1 = \int \pi_{ot}(x_0, x_1 \mid c = c_1)\, \mathrm{d}x_1 = q(x_0 \mid c = c_1), \tag{7}$$
which implies that the prior required to arrive at the conditional
distribution $q(x_1 \mid c = c_1)$ is $q(x_0 \mid c = c_1)$, a distribution
that we cannot access at test time. In general, the
prior that we need to sample from, given condition $c$, is
skewed from $q(x_0)$ to $q(x_0 \mid c)$, which is inaccessible at test
time. As a result, accurately capturing $q(x_1 \mid c)$ at test time
becomes infeasible unless $x_0$ and $c$ are independent,
i.e., if $q(x_0 \mid c) = q(x_0)$. However, this condition is generally
not satisfied by the most general OT coupling: for instance,
in Figure 2, clearly $q(x_0 \mid c) \neq q(x_0)$. To address this issue,
we propose a method, C2OT, to unskew the prior. We discuss
this in the next section.
3.3. Conditional Optimal Transport FM
3.3.1. Formulation
We propose conditional optimal transport FM (C2OT) to
unskew the prior $q(x_0 \mid c)$ such that $q(x_0 \mid c) = q(x_0)$. This
ensures that at test time, we can sample from the full
prior $q(x_0)$, irrespective of the condition $c$. Importantly,
at the same time, we also aim to preserve the straight
flow paths provided by OT. Conceptually, we construct
a prior distribution independently for each condition, i.e.,
$q(x_0 \mid c_1) = q(x_0 \mid c_2) = q(x_0), \ \forall c_1, c_2$. This can be achieved
by sampling from the prior and computing OT independently
for each condition, as shown in Figure 2 (right).
Formally, we construct the joint distribution as
$$\pi_{c2ot}(x_0, x_1, c) := q(x_0)\, q(c)\, \pi_{ot}(x_1 \mid x_0, c). \tag{8}$$
Here, the prior $q(x_0)$ and the condition $q(c)$ are sampled
independently, while the data $x_1$ is conditioned on $x_0$ (via
OT) and $c$ (via the dataset construction). Different from independent
coupling in Equation (2), $x_1$ and $x_0$ are associated
through OT and therefore lead to straighter flow paths. Also
different from the most general OT coupling $\pi_{ot}(x_0, x_1, c)$,
we explicitly enforce the independence between $x_0$ and $c$.
To see that this joint distribution provides an unskewed
prior, imagine we are given an arbitrary input condition $c_1$;
marginalizing over $x_1$ yields
$$\int q(x_0)\, q(c \mid c = c_1)\, \pi_{ot}(x_1 \mid x_0, c = c_1)\, \mathrm{d}x_1 = \int q(x_0)\, \pi_{ot}(x_1 \mid x_0, c = c_1)\, \mathrm{d}x_1 = q(x_0). \tag{9}$$
This implies that, at test time, we can sample from the entire
prior $q(x_0)$, regardless of the condition $c$, to arrive at the
desired data distribution $q(x_1 \mid c)$. During training, in practice,
we sample from $\pi_{c2ot}$ by modifying the optimal transport cost
function in Equation (5). This process is exact for discrete
conditions (Section 3.3.2) and approximate for continuous
conditions (Section 3.3.3). Specifically, given a minibatch of
$b$ $e$-dimensional samples and a minibatch of $b$ $h$-dimensional
conditions $C \in \mathbb{R}^{b \times h}$, C2OT seeks a $b \times b$ permutation
matrix $P$ that minimizes the following cost function:
$$\min_P \| X_0 - P X_1 \|_2^2 + \sum_{i=1}^{b} f(C_i, [PC]_i), \tag{10}$$
where $f$ is a symmetric, non-negative cost function satisfying
the property $f(c, c) = 0 \ \forall c$. The key differences between
independent coupling, OT coupling, and C2OT coupling
are highlighted in Algorithms 1 to 3. In the following two
sections, we discuss our choice of $f$ for both discrete and
continuous conditions.
Algorithm 1 Independent Coupling
function INDEPENDENT(X1, C)
    X0 ∼ N                                ▷ Sample from prior
    return X0, X1, C

Algorithm 2 OT Coupling
function OT(X1, C)
    X0 ∼ N                                ▷ Sample from prior
    Cost ← pairwise_dist(X0, X1)
    i, j ← Hungarian_matching(Cost)
    return X0[i], X1[j], C[j]

Algorithm 3 C2OT Coupling (Ours)
function C2OT(X1, C)
    X0 ∼ N                                ▷ Sample from prior
    Cost ← pairwise_dist(X0, X1) + pairwise_dist(C, C)
    i, j ← Hungarian_matching(Cost)
    return X0[i], X1[j], C[j]

3.3.2. Discrete Conditions
For discrete conditions, we assume the conditions correspond
to class labels from a finite, discrete set, i.e.,
$c \in \{0, 1, \dots, k-1\}$. For example, in Figure 2, the labels 0 and
1 represent the left and right groups in the target distribution,
respectively. We define $f = f_d$ such that no transport is
allowed to modify the conditions (i.e., $PC = C$). This is
achieved via
$$f_d(c_1, c_2) = \begin{cases} \infty, & c_1 \neq c_2, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$
As a result, the unordered set of prior samples for any given
condition $c$ remains unchanged: $\{X_{0i} \mid c_i = c\} \overset{\text{iid}}{\sim} q(x_0)$.
This formulation samples exactly from our ideal (Equation (8)),
as OT is computed independently within each
class, and the data sample from each class is exposed to the
full prior without skew. This effect is illustrated in Figure 1
(middle) and Figure 2 (right). Note that, as expected, the
results differ from FM’s curved flow paths and OT’s inability
to properly capture the target distribution.
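A minimal sketch of Algorithm 3 with the discrete penalty of Equation (11) follows. As an implementation choice made for this example only, the infinite penalty is replaced by a finite value larger than the total cost of the identity assignment, which is enough to rule out any pairing that changes a label. The function name `c2ot_discrete_coupling` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def c2ot_discrete_coupling(x0, x1, c):
    """Couple x0 with (x1, c) so that row i keeps its original label c[i]."""
    dist = np.sum((x0[:, None, :] - x1[None, :, :]) ** 2, axis=-1)
    mismatch = c[:, None] != c[None, :]           # True where labels differ
    # Finite stand-in for the infinite cost in Eq. (11): one mismatch already
    # costs more than the whole identity assignment, so none is ever chosen.
    big = dist.max() * len(c) + 1.0
    cost = dist + big * mismatch
    rows, cols = linear_sum_assignment(cost)
    return x0[rows], x1[cols], c[cols], cols

rng = np.random.default_rng(0)
b = 128
c = rng.integers(0, 2, size=b)
x1 = rng.normal(size=(b, 2)) + np.where(c[:, None] == 0, -2.0, 2.0)
x0 = rng.normal(size=(b, 2))
x0_m, x1_m, c_m, perm = c2ot_discrete_coupling(x0, x1, c)
```

Since the labels are unchanged (`c_m` equals `c`), each prior sample keeps the condition it was drawn with, while data points are still re-paired within each class to shorten the paths.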
It is important to note that enforcing $PC = C$ does not
necessarily imply $P = I$ since we can still swap rows $i$ and $j$
in $P$ as long as $c_i = c_j$. Setting $P = I$ would correspond to
a degeneration back to independent coupling in Equation (2)
and forfeit the benefits of straightened flow provided by OT.
Unfortunately, using $f_d$ with continuous conditions leads
to exactly this degeneration, since, in general, no $c_i = c_j$
unless $i = j$. Hence, in the next section, we devise a relaxed
cost for handling continuous conditions.
3.3.3. Continuous Conditions
For continuous conditions, the condition is typically a feature
embedding, e.g., computed from a text prompt. Directly
applying $f_d$ in Equation (11) leads to a degeneration back to
independent coupling as discussed earlier. To avoid this, we
propose a relaxed penalty function $f_c$ based on the cosine
distance:
$$f_c(c_1, c_2) = w \left( 1 - \frac{c_1 \cdot c_2}{\|c_1\| \|c_2\|} \right) = w \cdot \mathrm{cdist}(c_1, c_2). \tag{12}$$
Here, $w > 0$ is a scaling hyperparameter. Note, when $w = 0$,
this formulation degenerates to regular OT; when $w \to \infty$,
this formulation approaches Equation (11), i.e., independent
coupling if no two conditions are equal. The right section
of Figure 1 illustrates an example where the x-coordinate
of the target data point serves as the condition. Notably,
C2OT preserves straight flow paths without exhibiting OT’s
apparent degradation.
Finding $w$. The optimal $w$ is highly dependent on the data
distribution and the magnitude of features. To alleviate the
need for tuning the hyperparameter $w$, we propose to find
$w$ adaptively for each minibatch. For this, we develop a
ratio $r(w)$ which measures the proportion of sample pairs that
are considered “potential optimal transport candidates”, i.e.,
$$r(w) = \frac{1}{B^2} \sum_{i,j} \mathbb{1}\left( \|X_{0i} - X_{1j}\|_2^2 + w \cdot \mathrm{cdist}(c_i, c_j) \le \|X_{0i} - X_{1i}\|_2^2 \right), \tag{13}$$
where $\mathbb{1}(\cdot)$ is an indicator function that evaluates to one if
the condition is satisfied and to zero otherwise. Then, we
introduce a hyperparameter $r_{tar}$ as the target ratio and find
$w$ such that $r(w) \approx r_{tar}$. Note that setting $r_{tar}$ is invariant
to the scaling of distances between $x_0$ and $x_1$. Namely, if
$\|X_{0i} - X_{1j}\|_2^2$ is scaled by $s$, $w$ can also be scaled by $s$ to
preserve the same $r_{tar}$. This invariance is particularly useful
when transitioning to a new dataset, such as changing image
resolutions in image generation tasks. In our implementation,
we search for $w$ in two steps: first with exponential
search to establish the upper/lower bounds, then with binary
search to locate a precise value. The final $w$ is used as the
initial value in the subsequent minibatch. In practice, we
observe $w$ differs very little between minibatches (< 1%) and
the search converges quickly within 10 iterations with a negligible
overhead. Next, we introduce oversampling, a simple
technique to obtain more accurate OT couplings.
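The two-step search for $w$ can be sketched as follows. This is an illustrative reading of Equations (12) and (13), not the released implementation: the helper names, the doubling schedule, the starting value `w_init`, and the iteration caps are all assumptions. Note that in Equation (13) the diagonal pairs $i = j$ always satisfy the indicator, so targets below $1/b$ cannot be reached in this sketch.

```python
import numpy as np

def cosine_distance_matrix(c):
    """cdist(c_i, c_j) = 1 - cosine similarity, with an exact zero diagonal."""
    unit = c / np.linalg.norm(c, axis=1, keepdims=True)
    cd = np.clip(1.0 - unit @ unit.T, 0.0, 2.0)
    np.fill_diagonal(cd, 0.0)       # guard against round-off on the diagonal
    return cd

def candidate_ratio(dist, cd, w):
    """r(w) from Eq. (13): share of pairs (i, j) no costlier than pair (i, i)."""
    own = np.diag(dist)[:, None]    # ||x0_i - x1_i||^2, broadcast along j
    return float(np.mean(dist + w * cd <= own))

def find_w(dist, cd, r_tar, w_init=1.0, bisect_iters=10, max_doublings=60):
    """Exponential search for a bracket, then bisection; r(w) is non-increasing."""
    lo, hi = 0.0, w_init
    for _ in range(max_doublings):
        if candidate_ratio(dist, cd, hi) <= r_tar:
            break
        lo, hi = hi, 2.0 * hi
    for _ in range(bisect_iters):
        mid = 0.5 * (lo + hi)
        if candidate_ratio(dist, cd, mid) > r_tar:
            lo = mid
        else:
            hi = mid
    return hi

rng = np.random.default_rng(0)
b = 64
x0 = rng.normal(size=(b, 2))
x1 = rng.normal(size=(b, 2))
c = rng.normal(size=(b, 8))         # stand-in for continuous condition embeddings
dist = np.sum((x0[:, None, :] - x1[None, :, :]) ** 2, axis=-1)
cd = cosine_distance_matrix(c)
w = find_w(dist, cd, r_tar=0.05)
```

The returned `w` would then scale the condition term added to `dist` before matching, and could be passed as `w_init` for the next minibatch, mirroring the warm start described above.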
3.3.4. Oversampling OT Batches
OT vs. Deep Net Batch Sizes. Typically, the “OT batch
size” $b$ used to compute optimal transport is set equal to the
“deep net batch size” $b_n$ used to compute forward/backward
passes of the deep net [36, 42]. While the OT batch size controls
the dynamics of prior-to-data assignments, the deep net
batch size affects gradient variances, training efficiency, and
memory usage. There is no reason why setting them equal
should be optimal. In particular, with our proposed C2OT, the
effective OT batch size is reduced since we restrict transport
between samples with divergent conditions, which calls for
an increased OT batch size $b$. At the same time, keeping the
deep net batch size $b_n$ unchanged prevents unwarranted side
effects in other aspects of training.
Reduced Effective OT Batch Size. In the case of
discrete conditions, OT is effectively performed within
$\{X_{0i}, X_{1i} \mid c_i = c\}$ for each condition $c$. With $k$ classes,
the expected effective OT batch size is reduced to
$\mathbb{E}(|\{i \mid c_i = c\}|) = b/k$. Empirically, we find that using a much larger
OT batch size $b$ than the network batch size $b_n$ is helpful.
This observation motivates oversampling.

Table 1. Results on the 8gaussians→moons dataset with different
training coupling methods and different conditioning. Note
that our method C2OT does not apply in the unconditional setting.
Method         Euler-1 ($W_2^2$ ↓)   Adaptive ($W_2^2$ ↓)   NFE ↓
Unconditional
FM             6.232±0.186           0.120±0.045            93.22±1.81
OT             0.072±0.025           0.050±0.040            36.08±2.57
Binary conditions
FM             2.573±0.112           0.059±0.029            91.10±2.77
OT             0.483±0.059           0.462±0.025            32.89±0.95
C2OT (ours)    0.048±0.010           0.018±0.006            58.88±1.48
Continuous conditions
FM             0.732±0.052           0.028±0.010            93.86±1.76
OT             8.276±6.510           2.143±1.993            42.07±2.59
C2OT (ours)    0.077±0.024           0.013±0.003            89.87±1.53
Efficiency. A natural concern with increasing $b$ is the computational
overhead, since we compute optimal transport
with Hungarian matching [24], which has a time complexity
of $O(b^3)$. However,
• Since each OT batch can be used across $(b/b_n)$ forward/backward
passes, the amortized cost per minibatch
reduces to $O(b^2 b_n)$.
• We offload OT computation from the main process to
the data loading processes, enabling data preparation on
the CPU in parallel with forward/backward passes on the
GPU.
With a modest choice of 8 data workers per GPU, we observe
no wall-clock overhead for OT batch sizes up to 6,400 when
training on CIFAR-10 and ImageNet, as data preparation is
completed faster than the GPU can process batches.
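The amortization argument can be illustrated with a small generator that solves one assignment problem on an OT batch of size $b$ and then serves it in $b / b_n$ network minibatches. This is a sketch under assumptions: the names `ot_couple` and `oversampled_minibatches` are hypothetical, plain OT is used as the default coupling for brevity, and the random reordering before slicing is a choice made for this example rather than something stated in the text.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_couple(x0, x1, c):
    """Plain minibatch OT coupling; a condition-aware cost could be swapped in."""
    cost = np.sum((x0[:, None, :] - x1[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return x0[rows], x1[cols], c[cols]

def oversampled_minibatches(x1, c, bn, rng, couple=ot_couple):
    """Solve one O(b^3) matching, then yield b / bn minibatches of size bn."""
    b = len(x1)
    x0 = rng.normal(size=x1.shape)            # prior samples for the OT batch
    x0_m, x1_m, c_m = couple(x0, x1, c)
    order = rng.permutation(b)                # assumption: shuffle before slicing
    for start in range(0, b, bn):
        idx = order[start:start + bn]
        yield x0_m[idx], x1_m[idx], c_m[idx]

rng = np.random.default_rng(0)
x1 = rng.normal(size=(256, 2))                # OT batch of size b = 256
c = rng.integers(0, 10, size=256)
batches = list(oversampled_minibatches(x1, c, bn=64, rng=rng))
```

Here one matching of cost $O(b^3)$ serves $b / b_n = 4$ training steps, i.e., $O(b^2 b_n)$ per step. In a data-loading setup such a generator could run inside worker processes so that the matching overlaps with GPU computation.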
4. Experiments
We experimentally verify our proposed method C2OT in
two sections: first on the two-dimensional synthetic dataset
8gaussians→moons, and then on high-dimensional image
data, including CIFAR-10 [23] and ImageNet-1K [7].
For ImageNet-1K, we conduct experiments on both 32×32
images in image space and 256×256 images in latent space.
Implementation details are provided in the appendix.
4.1. Two-Dimensional Data
The two-dimensional 8gaussians→moons dataset maps
a mixture of eight Gaussian distributions to two interleaving
half-circles, as visualized in Figure 1. We consider
three conditioning scenarios: (a) unconditional; (b) binary
class conditions, where each class corresponds to one of
the half-circles; and (c) continuous conditions, where the
x-coordinate of the desired target point is given as the condition.
For (b), we apply our setup for discrete conditions
(Section 3.3.2); for (c), we apply our setup for continuous
conditions (Section 3.3.3) with $r_{tar} = 0.01$.
Metrics. We report the 2-Wasserstein distance ($W_2^2$) (following
[42]) between 10K generated data points and 10K
samples from the ground-truth moons distribution. We experiment
with single-step Euler’s method (Euler-1) and the
adaptive Dormand–Prince method [9] (dopri5, referred to as
adaptive) for numerical integration. Additionally, we report
the number of function evaluations (NFE) in the adaptive
method, which is the number of forward passes through
the neural network. All experiments are averaged over ten
training runs with ten different random seeds, and we report
mean±std.
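For intuition about the metric, the squared 2-Wasserstein distance between two equally sized point sets with uniform weights reduces to an optimal assignment problem, which the sketch below solves directly. This is not necessarily how the evaluation following [42] is implemented, and a dense 10K × 10K cost matrix would be memory-hungry, so the example uses a small sample. The name `w2_squared` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w2_squared(a, b):
    """Squared 2-Wasserstein distance between two equal-size empirical samples."""
    cost = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
    return float(cost[rows, cols].mean())

rng = np.random.default_rng(0)
generated = rng.normal(size=(50, 2))
shift = np.array([3.0, 4.0])
# A pure translation by (3, 4): the optimal plan moves every point by the
# same vector, so the squared distance is 3^2 + 4^2 = 25.
reference = (generated + shift)[rng.permutation(50)]
value = w2_squared(generated, reference)
```

The result does not depend on the order of the points in either set, which is why the reference sample is shuffled in the example.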
Results. The results are summarized in Table 1 and visualized
in Figure 1. In the unconditional setting, OT produces
straighter flows, leading to better distribution matching with
fewer NFEs, and our method does not apply. However, in
the conditional setting, OT degrades and performs worse
than regular FM in the adaptive setting due to the aforementioned
train-test gap in OT coupling. Our method, C2OT,
effectively models the target distribution with both Euler-1
and the adaptive method, improving upon the baselines.
4.2. High-Dimensional Image Data
We conduct experiments on CIFAR-10 [23] (image space,
class-conditioned), ImageNet-32×32 [7] (image space,
caption-conditioned), and ImageNet-256×256 [7] (latent
space, caption-conditioned). For ImageNet, we use the
captions provided by [25] and encode text features using
a CLIP [37]-like DFN [11] text encoder as input conditions.
Our results are summarized in Table 2. More implementation
details can be found in the appendix.
Metrics. We report Fréchet inception distance (FID) [15]
and the number of function evaluations (NFE) in the adaptive
solver for all three datasets. To assess how well the
generations adhere to the input conditions, we additionally
compute condition adherence metrics. For CIFAR-10, we
run a pretrained classifier [5] on the generated images to
obtain logits and report the average cross entropy (CE) with
the input class label. For ImageNet, we compute CLIP features
[37] using SigLIP-2 [43] from the generated images
and the input captions respectively, and report the average
cosine similarities (CLIP).
Table 2. Results of different training algorithms on image generation tasks. We bold the best-performing entry and underline the second-best.
Our proposed method, C2OT, achieves the best overall performance: better than FM with few sampling steps and better than OT with more
sampling steps. In cases where C2OT performs second, our performance is often comparable to the best entry. Additionally, C2OT has the
most stable performance, with the smallest deviations across runs in most entries. Variations in CLIP scores are minimal.

CIFAR-10 Class-Conditioned Generation
Metric   Method        Euler-2         Euler-5        Euler-10       Euler-25      Euler-50      Euler-100     Adaptive      NFE ↓
FID ↓    FM            105.481±1.619   21.968±0.400   10.479±0.237   5.640±0.117   4.128±0.069   3.429±0.054   2.922±0.039   124.0±2.5
FID ↓    OT            64.417±0.434    18.194±0.409   10.778±0.270   6.821±0.186   5.368±0.113   4.671±0.066   4.144±0.018   125.7±0.4
FID ↓    C2OT (ours)   64.694±0.379    17.565±0.202   9.762±0.136    5.644±0.081   4.152±0.061   3.447±0.049   2.904±0.035   127.7±0.7
CE ↓     FM            3.343±0.022     0.546±0.020    0.323±0.007    0.272±0.004   0.267±0.002   0.269±0.003   0.276±0.004   124.0±2.5
CE ↓     OT            2.529±0.010     0.624±0.012    0.426±0.005    0.369±0.004   0.363±0.004   0.364±0.005   0.372±0.005   125.7±0.4
CE ↓     C2OT (ours)   2.264±0.022     0.475±0.009    0.322±0.004    0.276±0.005   0.271±0.003   0.272±0.003   0.279±0.004   127.7±0.7

ImageNet 32×32 Caption-Conditioned Generation
Metric   Method        Euler-2         Euler-5        Euler-10       Euler-25      Euler-50      Euler-100     Adaptive      NFE ↓
FID ↓    FM            113.250±0.793   23.015±0.123   11.141±0.051   7.165±0.087   6.142±0.085   5.673±0.074   5.358±0.059   146.9±3.5
FID ↓    OT            81.317±0.369    20.763±0.154   11.360±0.089   7.854±0.077   6.899±0.079   6.469±0.071   6.220±0.065   137.7±0.2
FID ↓    C2OT (ours)   102.380±0.279   21.965±0.035   10.897±0.033   7.069±0.027   6.084±0.016   5.638±0.018   5.350±0.017   134.2±1.4
CLIP ↑   FM            0.067±0.000     0.076±0.000    0.074±0.000    0.071±0.000   0.071±0.000   0.070±0.000   0.069±0.000   146.9±3.5
CLIP ↑   OT            0.067±0.000     0.071±0.000    0.070±0.000    0.068±0.000   0.068±0.000   0.067±0.000   0.067±0.000   137.7±0.2
CLIP ↑   C2OT (ours)   0.068±0.000     0.075±0.000    0.073±0.000    0.071±0.000   0.070±0.000   0.070±0.000   0.069±0.000   134.2±1.4

ImageNet 256×256 Caption-Conditioned Generation in Latent Space
Metric   Method        Euler-2         Euler-5        Euler-10       Euler-25      Euler-50      Euler-100     Adaptive      NFE ↓
FID ↓    FM            203.355±0.075   31.740±0.100   10.088±0.020   3.898±0.016   5.225±0.011   3.474±0.008   3.484±0.014   133.5±10.9
FID ↓    OT            190.893±0.213   46.510±0.140   18.221±0.078   7.477±0.045   9.333±0.046   7.630±0.020   7.181±0.022   115.9±15.3
FID ↓    C2OT (ours)   201.010±0.067   30.578±0.124   10.032±0.007   3.702±0.017   5.075±0.013   3.335±0.010   3.290±0.012   126.7±12.7
CLIP ↑   FM            0.027±0.000     0.120±0.000    0.133±0.000    0.138±0.000   0.137±0.000   0.139±0.000   0.139±0.000   133.5±10.9
CLIP ↑   OT            0.031±0.000     0.101±0.000    0.114±0.000    0.118±0.000   0.117±0.000   0.118±0.000   0.118±0.000   115.9±15.3
CLIP ↑   C2OT (ours)   0.029±0.000     0.120±0.000    0.133±0.000    0.138±0.000   0.137±0.000   0.139±0.000   0.138±0.000   126.7±12.7

CIFAR-10. We base our UNet architecture on Tong et al.
[42], with details provided in the appendix. The CIFAR-10
training set contains 50,000 32×32 color images across 10
object classes. During testing, we generate 50,000 images
evenly distributed among classes and assess FID against the
training set following [42]. Each model is trained five times
with different random seeds and the results are reported
as mean±std. We use an OT batch size $b$ of 640 and a
ratio $r_{tar}$ of 0.01. Compared to FM, our method achieves
better performance (FID and CE) in few-step settings and
comparable performance in many-step settings. Compared
to OT, our method performs significantly better in the many-step
setting. Although OT achieves slightly better FID in
Euler-2, it performs worse in CE, indicating that it fails to
follow the condition.
ImageNet 32×32. We base our UNet architecture
on Pooladian et al. [36], with details provided in the appendix.
We train and evaluate on the face-blurred version
[44] of ImageNet, following [36]. In contrast to prior
works that assess FID against the training set, we assess
FID against the validation set (49,997 images) to mitigate
overfitting, as the fine-grained nature of captions may lead
to overfitting on the training set. Each model is trained
three times with different random seeds and the results are
reported as mean±std. Compared to CIFAR-10, there is
much more variation in this dataset (e.g., 1000 classes vs.
10 classes in CIFAR-10). Thus, we adopt a larger OT batch
size $b$ of 6400 while keeping the same target ratio $r_{tar}$ of 0.01
as before. Similar to our findings in CIFAR-10, our method
performs better than or comparably to FM; in few-step settings,
OT achieves better FID but worse condition following
(CLIP). Our method strikes the best overall balance.
ImageNet 256 ×256 in Latent Space. Here, we explore
latent flow matching models [ 38]. During training, the flow
matching model learns to generate data in the latent space,
which is encoded from images. At test time, the generated
latents are decoded into images using a pretrained decoder. We
base our experiment on a recent transformer-based model,
LightningDiT [ 46], demonstrating that our method is effec-
tive across different network architectures and data spaces.
We use the officially provided “64 epochs” configuration for
all experiments (due to computational constraints), which
is not directly comparable to methods that are trained with
800 epochs. We follow the same evaluation protocol as
ImageNet-32×32, and adopt the same OT batch size b of
6400 and the target ratio r_tar of 0.01. Each model is evaluated
three times with different random seeds and the results
are reported as mean ±std. Again, C2OT has the best overall
performance compared to both FM and OT. We visualize
the generated images in Figure 3 and provide additional
uncurated samples in the appendix.
[Figure 3: image grid. Panel labels: Euler-2, Euler-5, Euler-10, Euler-25, Euler-50, Euler-100; FM, OT, C2OT (ours).]
Figure 3. Visual comparisons of 256×256 images generated by
the baselines and our approach with different numbers of sampling
steps. Our approach converges faster and produces cleaner outputs.
The input caption is “a small dog sitting on a carpet”.
4.3. Ablation Studies
4.3.1. Varying OT Batch Size b
Here, we analyze the effect of varying the OT batch size
b. We conduct these experiments on the CIFAR-10 dataset,
training each model three times with different random seeds
for every choice of b. We plot the results in Figure 4. In-
creasing the OT batch size clearly improves FID in the few-
sampling-steps setting (Euler-2). This is intuitive, because
a larger OT batch size enables the coupling to approach
the true OT result, i.e., dataset-level optimal transport. In
turn, this leads to straighter paths. However, in the adaptive
setting, increasing the OT batch size slightly worsens FID,
though the effect is much less significant (the range of the
blue y-axis is much smaller than the range of the red y-axis).
We think this occurs because a more accurate OT coupling
reduces variation in the training data (similar to using fewer
data augmentations), which harms performance because the benefit
of straighter flows diminishes in the adaptive setting. This
observation aligns with the findings of Pooladian et al. [36]:
computing OT coupling across GPUs ( i.e., having a larger
effective OT batch size) slightly harms results. In contrast
to Pooladian et al. [36], though, we seek to counteract the
reduced effective OT batch size induced by conditional opti-
mal transport (as discussed in Section 3.3.4), and we do find
reasonably large (640 for CIFAR, 6400 for ImageNet) OT
batch sizes useful while not adding wall-clock overhead.
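To make the role of the OT batch size b concrete, the following minimal sketch couples b prior samples with b data samples by solving the assignment problem under a squared Euclidean cost. It is an illustration under stated assumptions, not our training code: the function names are hypothetical, the data are toy Gaussians, and the brute-force search over permutations is only feasible for very small b (a practical implementation would use a dedicated assignment solver such as the Hungarian method [24]).

```python
import itertools
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def ot_coupling(x0, x1):
    """Return the permutation p minimizing sum_i ||x0[i] - x1[p[i]]||^2.

    Brute force over all b! permutations, so only usable for very small b.
    """
    b = len(x0)
    return min(
        itertools.permutations(range(b)),
        key=lambda p: sum(sq_dist(x0[i], x1[p[i]]) for i in range(b)),
    )


def coupling_cost(x0, x1, perm):
    """Mean squared transport distance of a given coupling."""
    return sum(sq_dist(x0[i], x1[perm[i]]) for i in range(len(x0))) / len(x0)


rng = random.Random(0)
b = 6  # toy OT batch size (the experiments use 640 or 6400)
x0 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(b)]  # prior samples
x1 = [(rng.gauss(3, 1), rng.gauss(0, 1)) for _ in range(b)]  # stand-in data samples

identity = tuple(range(b))    # independent pairing, as in plain FM
perm = ot_coupling(x0, x1)    # minibatch OT pairing
fm_cost = coupling_cost(x0, x1, identity)
ot_cost = coupling_cost(x0, x1, perm)
```

By construction, `ot_cost` never exceeds `fm_cost` within a batch; a larger b gives the solver more candidates per sample, which is the mechanism behind the Euler-2 improvement discussed above.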
4.3.2. Varying Target Ratio r_tar
Here, we analyze the effects of varying the target ratio r_tar,
as defined in Equation (13). We conduct these experiments
on the ImageNet-32×32 dataset, training each model three
times with different random seeds for every choice of r_tar.
We plot the results in Figure 5. Recall that when r → 0,
w → ∞ and the method approaches FM; when r → 0.5,²
w → 0 and the method approaches OT (Section 3.3.3). Our
findings are similar to those in Section 4.3.1: as we approach
true OT (whether by increasing the OT batch size or increasing
r_tar), FID improves in the few-step setting and worsens in
the adaptive setting.
² There is a 0.5 chance that the distance between a random prior-data
pair is smaller than that of another random prior-data pair.
[Figure 4: plot. x-axis: OT batch size b (128, 640, 1,280, 2,560, 6,400); y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: OT (Euler-2), FM (adaptive), Ours (Euler-2), Ours (adaptive).]
Figure 4. Changes in FID with respect to varying OT batch sizes b.
We plot mean ± std over three runs and represent std with a shaded
region. OT (adaptive) and FM (Euler-2) perform worse than all
other results in this plot, hence they are not shown (full plot in
the appendix).
[Figure 5: plot. x-axis: target ratio r_tar (0.0005 to 0.1); y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: OT (Euler-2), FM (Euler-2), FM (adaptive), Ours (Euler-2), Ours (adaptive).]
Figure 5. Changes in FID with respect to varying target ratio r_tar.
We plot mean ± std over three runs and represent std with a shaded
region. OT (adaptive) performs worse than all other results in this
plot, hence it is not shown (full plot in the appendix).
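The two limits of w recalled in Section 4.3.2 can be illustrated with a small sketch. It assumes a coupling cost of the form ||x0_i − x1_j||² + w·|c_i − c_j|, where c_i is the condition of the data sample originally paired with prior sample i; the absolute difference is a stand-in chosen for this example and need not match the condition distance of Section 3.3. The mapping from r_tar to w (Equation (13)) is not reproduced here, and all names are hypothetical.

```python
import itertools
import random


def coupled_permutation(x0, x1, cond, w):
    """Brute-force assignment under cost ||x0_i - x1_j||^2 + w * |c_i - c_j|.

    cond[i] is the condition attached to the i-th original (x0, x1) pair.
    Only feasible for very small batches.
    """
    n = len(x0)

    def cost(i, j):
        spatial = sum((a - b) ** 2 for a, b in zip(x0[i], x1[j]))
        return spatial + w * abs(cond[i] - cond[j])

    return min(
        itertools.permutations(range(n)),
        key=lambda p: sum(cost(i, p[i]) for i in range(n)),
    )


rng = random.Random(0)
n = 5
x0 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]  # prior samples
x1 = [(rng.gauss(0, 2), rng.gauss(0, 2)) for _ in range(n)]  # data samples
cond = [x[0] for x in x1]  # continuous condition: target x coordinate

perm_ot = coupled_permutation(x0, x1, cond, w=0.0)  # w -> 0: plain minibatch OT
perm_fm = coupled_permutation(x0, x1, cond, w=1e9)  # w -> infinity: original pairing
```

With a very large w and distinct continuous conditions, any re-pairing is penalized so heavily that the original (independent) pairing is kept, which corresponds to FM; with w = 0 the condition term vanishes and the result is the usual minibatch OT coupling.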
5. Conclusion
We first investigate and formalize the struggle of minibatch
optimal transport in conditional generation. Based on the
findings, we propose C2OT, a simple yet effective addi-
tion that corrects the degeneration of OT while maintaining
straight integration paths. Extensive experiments ranging
from simple two-dimensional synthetic data with MLPs to
high-dimensional 256×256 image generation in latent space
with transformers verify the effectiveness of our method. We
believe that this simple technique will be broadly useful in
future flow-based conditional generative tasks.
Acknowledgment. This work is supported in part by NSF
grants 2008387, 2045586, 2106825, NIFA award 2020-
67021-32799, and OAC 2320345.
References
[1]Michael Albergo and Eric Vanden-Eijnden. Building normal-
izing flows with stochastic interpolants. In Proc. ICLR , 2023.
2
[2]Michael Albergo, Nicholas Boffi, and Eric Vanden-Eijnden.
Stochastic interpolants: A unifying framework for flows and
diffusions. arXiv preprint arXiv:2303.08797 , 2023. 2
[3]Kevin Black, Noah Brown, Danny Driess, Adnan Esmail,
Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom,
Karol Hausman, Brian Ichter, et al. pi0: A vision-language-
action flow model for general robot control. arXiv , 2024.
1
[4]Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi
Shibuya, Alexander Schwing, and Yuki Mitsufuji. Taming
multimodal joint training for high-quality video-to-audio syn-
thesis. CVPR , 2025. 1
[5] chenyaofo. Pytorch cifar models. https://github.com/chenyaofo/pytorch-cifar-models, 2023. 6
[6]Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A
downsampled variant of imagenet as an alternative to the cifar
datasets. arXiv , 2017. 14
[7]Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li
Fei-Fei. Imagenet: A large-scale hierarchical image database.
In CVPR, 2009. 6, 14
[8]Prafulla Dhariwal and Alexander Nichol. Diffusion models
beat gans on image synthesis. NeurIPS , 2021. 14
[9]John R Dormand and Peter J Prince. A family of embedded
runge-kutta formulae. Journal of computational and applied
mathematics , 1980. 2, 6
[10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim En-
tezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz,
Axel Sauer, Frederic Boesel, et al. Scaling rectified flow
transformers for high-resolution image synthesis. In ICML ,
2024. 1
[11] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig
Schmidt, Alexander Toshev, and Vaishaal Shankar. Data
filtering networks. ICLR , 2024. 6, 14
[12] Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin,
and J Zico Kolter. Consistency models made easy. ICLR ,
2025. 2
[13] Pengsheng Guo and Alexander G Schwing. Variational recti-
fied flow matching. arXiv , 2025. 2
[14] Dan Hendrycks and Kevin Gimpel. Gaussian error linear
units (gelus). arXiv , 2016. 12
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bern-
hard Nessler, and Sepp Hochreiter. Gans trained by a two
time-scale update rule converge to a local nash equilibrium.
NeurIPS , 2017. 6
[16] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess,
Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and
Alexander Lerchner. beta-vae: Learning basic visual concepts
with a constrained variational framework. In ICLR, 2017. 2
[17] Ka-Hei Hui, Chao Liu, Xiaohui Zeng, Chi-Wing Fu, and
Arash Vahdat. Not-so-optimal transport flows for 3d point
cloud generation. arXiv , 2025. 2
[18] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade
Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal
Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi,
Ali Farhadi, and Ludwig Schmidt. Openclip, 2021. 14
[19] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Mu-
rata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki
Mitsufuji, and Stefano Ermon. Consistency trajectory models:
Learning probability flow ode trajectory of diffusion. ICLR ,
2024. 2
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. ICLR , 2015. 12, 13
[21] Diederik P Kingma, Max Welling, et al. Auto-encoding
variational bayes. In ICLR , 2014. 2
[22] Leon Klein, Andreas Krämer, and Frank Noé. Equivariant
flow matching. NeurIPS , 36, 2023. 2
[23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple
layers of features from tiny images, 2009. 6
[24] Harold W. Kuhn. The hungarian method for the assignment
problem. Naval research logistics quarterly , 1955. 6
[25] Visual Layer. Imagenet-1k-vl-enriched. https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched, 2024. 6, 14
[26] Sangyun Lee, Beomsu Kim, and Jong Chul Ye. Minimizing
trajectory curvature of ode-based generative models. In ICML ,
2023. 2
[27] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian
Nickel, and Matt Le. Flow matching for generative modeling.
ICLR , 2023. 1, 2, 13
[28] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul,
Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz,
Heli Ben-Hamu, and Itai Gat. Flow matching guide and code.
arXiv , 2024. 14
[29] Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, and Mannat
Singh. Flowing from words to pixels: A framework for cross-
modality evolution. arXiv preprint arXiv:2412.15213 , 2024.
2
[30] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight
and fast: Learning to generate and transfer data with rectified
flow. ICLR , 2023. 2
[31] Cheng Lu and Yang Song. Simplifying, stabilizing and scaling
continuous-time consistency models. ICLR , 2024. 2
[32] Ségolène Martin, Anne Gagneux, Paul Hagemann, and
Gabriele Steidl. Pnp-flow: Plug-and-play image restoration
with flow matching. ICLR , 2025. 1
[33] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased
resizing and surprising subtleties in gan evaluation. In CVPR ,
2022. 14
[34] Michael Poli, Stefano Massaroli, Atsushi Yamashita, Ha-
jime Asama, Jinkyoo Park, and Stefano Ermon. Torchdyn:
Implicit models and neural numerical methods in pytorch.
Physical Reasoning and Inductive Biases for the Real World
at NeurIPS , 2021. 12
[35] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra,
Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-
Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary,
Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan
Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng
Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt
Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Pe-
ter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly,
Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak
Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly
Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu,
Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang
Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao
Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Al-
bert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya,
Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce
Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, El-
liot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang,
John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar,
Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopou-
los, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone
Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and
Yuming Du. Movie gen: A cast of media foundation models.
arXiv , 2024. 1
[36] Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles
Domingo-Enrich, Brandon Amos, Yaron Lipman, and
Ricky TQ Chen. Multisample flow matching: Straighten-
ing flows with minibatch couplings. ICML , 2023. 2, 3, 5, 7,
8, 13, 14
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
transferable visual models from natural language supervision.
In ICML, 2021. 6
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Patrick Esser, and Björn Ommer. High-resolution image
synthesis with latent diffusion models. In CVPR , 2022. 7
[39] Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta
Takida, and Yuki Mitsufuji. Training consistency models with
variational noise coupling. arXiv preprint arXiv:2502.18197 ,
2025. 2
[40] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever.
Consistency models. In ICML , 2023. 2
[41] Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan
Lan, Stefano Ermon, Hao Zhou, and Wei-Ying Ma. Equiv-
ariant flow matching with hybrid probability transport for 3d
molecule generation. NeurIPS , 36, 2023. 2
[42] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume
Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and
Yoshua Bengio. Improving and generalizing flow-based gen-
erative models with minibatch optimal transport. TMLR , 2024.
2, 3, 5, 6, 7, 12, 13, 14
[43] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muham-
mad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil
Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil
Mustafa, et al. Siglip 2: Multilingual vision-language encoders
with improved semantic understanding, localization,
and dense features. arXiv preprint arXiv:2502.14786, 2025.
6, 14
[44] Kaiyu Yang, Jacqueline H Yau, Li Fei-Fei, Jia Deng, and
Olga Russakovsky. A study of face obfuscation in imagenet.
In ICML, 2022. 7, 14
[45] Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu,
Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Ermon,
and Bin Cui. Consistency flow matching: Defining straight
flows with velocity consistency. arXiv , 2024. 2
[46] Jingfeng Yao and Xinggang Wang. Reconstruction vs. genera-
tion: Taming optimization dilemma in latent diffusion models.
CVPR , 2025. 7, 14
[47] Yichi Zhang, Yici Yan, Alex Schwing, and Zhizhen Zhao.
Towards hierarchical rectified flow. ICLR , 2025. 2
Table of Contents
1 Introduction
2 Related Works
3 Method
  3.1 Preliminaries
  3.2 OT Coupling Skews The Prior
  3.3 Conditional Optimal Transport FM
    3.3.1 Formulation
    3.3.2 Discrete Conditions
    3.3.3 Continuous Conditions
    3.3.4 Oversampling OT Batches
4 Experiments
  4.1 Two-Dimensional Data
  4.2 High-Dimensional Image Data
  4.3 Ablation Studies
    4.3.1 Varying OT Batch Size
    4.3.2 Varying Target Ratio
5 Conclusion
A Extended Plots
B Data Coupling in 8 Gaussians→moons
C Implementation Details
  C.1 Two-Dimensional Data
  C.2 CIFAR-10
  C.3 ImageNet-32
  C.4 ImageNet-256
D Additional Generated Images
  D.1 CIFAR-10
  D.2 ImageNet-32
  D.3 ImageNet-256
E DeltaAI Acknowledgment
A. Extended Plots
Figures A1 and A2 extend Figures 4 and 5 of the main paper to include all data points.
[Figure A1: plot. x-axis: OT batch size b (128, 640, 1,280, 2,560, 6,400); y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: OT (Euler-2), FM (Euler-2), FM (adaptive), OT (adaptive), Ours (Euler-2), Ours (adaptive).]
Figure A1. Changes in FID with respect to varying OT batch sizes b.
We plot mean ± std over three runs and represent std with a shaded
region.
[Figure A2: plot. x-axis: target ratio r_tar (0.0005 to 0.1); y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: OT (Euler-2), FM (Euler-2), FM (adaptive), OT (adaptive), Ours (Euler-2), Ours (adaptive).]
Figure A2. Changes in FID with respect to varying target ratio r_tar.
We plot mean ± std over three runs and represent std with a shaded
region.
B. Data Coupling in 8 Gaussians →moons
Figure A3 extends Figure 1 with an additional row that shows coupling during training. Clearly, OT samples form a biased
distribution at training time in conditional generation as discussed. Since we cannot sample from this biased distribution at
test-time, we obtain a gap between training and testing. This gap degrades the performance of OT.
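The skew described above can be reproduced in a one-dimensional toy analogue (not the 8 Gaussians→moons setup itself). The sketch below assumes a standard normal prior and two well-separated data modes labeled −1 and +1; all names and constants are illustrative. In one dimension with a squared cost, the optimal coupling matches sorted prior samples to sorted data samples, so no solver is needed.

```python
import random
import statistics

rng = random.Random(0)
n = 1000

# Prior samples from N(0, 1); data from two separated modes with a discrete label.
x0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
labels = [1 if i % 2 == 0 else -1 for i in range(n)]
x1 = [2.0 * c + rng.gauss(0.0, 0.1) for c in labels]

# Independent (FM) coupling: x0[i] is paired with x1[i].
fm_prior_pos = [x0[i] for i in range(n) if labels[i] == 1]

# OT coupling in 1D with squared cost: match sorted prior samples to sorted data.
order0 = sorted(range(n), key=lambda i: x0[i])
order1 = sorted(range(n), key=lambda j: x1[j])
ot_prior_pos = [x0[i] for i, j in zip(order0, order1) if labels[j] == 1]

fm_mean = statistics.fmean(fm_prior_pos)  # close to 0: the prior stays unbiased
ot_mean = statistics.fmean(ot_prior_pos)  # clearly positive: the prior is skewed
```

Under OT, the prior samples paired with label +1 are essentially the upper half of the Gaussian, whereas at test time the prior is sampled from the full Gaussian regardless of the label. This mismatch is the train-test gap discussed above.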
C. Implementation Details
C.1. Two-Dimensional Data
Data. Following the implementation of Tong et al. [42], we generate the “moons” data using the torchdyn library [ 34],
and the “8 Gaussians” using torchcfm [42].
Network. We employ a simple multi-layer perceptron (MLP) network for this dataset. Initially, the two-dimensional input (x
and y coordinates) and the flow timestep (a scalar uniformly sampled from [0,1]) are projected into the hidden dimension using
individual linear layers. When an input condition is provided, it is similarly projected into the hidden dimension. Discrete
conditions are encoded as -1 or +1, while continuous conditions are represented by the x-coordinate of the target data point.
After projection, all input features are summed and processed through a network comprising three MLP modules. Each MLP
module consists of two linear layers, where the first layer uses an expansion ratio of 4, and is followed by a GELU activation
function [ 14]. We use a residual connection to incorporate the output of each MLP block. Finally, another linear layer projects
the features to two dimensions to produce the output velocity. The hidden dimension is set to 128.
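The architecture described above can be sketched as follows. This is a dependency-free illustration of the forward pass only: the weight initialization, the use of biases, and the class and function names are assumptions, and an actual implementation would use a deep-learning framework rather than plain Python lists.

```python
import math
import random


def make_linear(rng, n_in, n_out):
    """Create (weights, bias) for a linear layer; the init scheme is an assumption."""
    bound = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a linear layer to a vector given as a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


class VelocityMLP:
    def __init__(self, hidden=128, expansion=4, num_blocks=3, seed=0):
        rng = random.Random(seed)
        self.proj_x = make_linear(rng, 2, hidden)  # (x, y) coordinates
        self.proj_t = make_linear(rng, 1, hidden)  # flow timestep in [0, 1]
        self.proj_c = make_linear(rng, 1, hidden)  # scalar condition
        self.blocks = [
            (make_linear(rng, hidden, expansion * hidden),
             make_linear(rng, expansion * hidden, hidden))
            for _ in range(num_blocks)
        ]
        self.out = make_linear(rng, hidden, 2)     # two-dimensional output velocity

    def __call__(self, xy, t, cond=None):
        # Project each input separately and sum the projected features.
        h = [a + b for a, b in zip(linear(self.proj_x, xy), linear(self.proj_t, [t]))]
        if cond is not None:
            h = [a + b for a, b in zip(h, linear(self.proj_c, [cond]))]
        # Three MLP modules: linear (x4 expansion) -> GELU -> linear, with a residual.
        for fc1, fc2 in self.blocks:
            update = linear(fc2, [gelu(v) for v in linear(fc1, h)])
            h = [a + b for a, b in zip(h, update)]
        return linear(self.out, h)


net = VelocityMLP()
v_uncond = net([0.5, -1.0], t=0.3)
v_discrete = net([0.5, -1.0], t=0.3, cond=+1.0)  # discrete condition encoded as +1
```

For continuous conditions, `cond` would instead be the x-coordinate of the target data point, as described above.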
Training. We train each network for 20,000 iterations with the Adam [20] optimizer, a learning rate of 3e-4 without weight
decay, and a “deep net” batch size of 256 for computing forward/backward passes and gradient updates. We use an OT batch size b
of 1024 and a target ratio r_tar of 0.01.
[Figure A3: flow visualizations. Columns: Unconditional (FM, OT); with discrete conditions (FM, OT, C2OT (ours)); with continuous conditions, i.e., target x coordinates (FM, OT, C2OT (ours)). Rows: Training, Euler-1 step, Adaptive.]
2-Wasserstein distances shown below the plots (Euler-1 step / Adaptive):
Unconditional, FM: 6.232±0.186 / 0.120±0.045
Unconditional, OT: 0.072±0.025 / 0.050±0.040
Discrete conditions, FM: 2.573±0.112 / 0.059±0.029
Discrete conditions, OT: 0.483±0.059 / 0.462±0.025
Discrete conditions, C2OT (ours): 0.048±0.010 / 0.018±0.006
Continuous conditions, FM: 0.732±0.052 / 0.028±0.010
Continuous conditions, OT: 8.276±6.510 / 2.143±1.993
Continuous conditions, C2OT (ours): 0.077±0.024 / 0.013±0.003
Figure A3. We visualize the flows learned by different algorithms using the 8gaussians→moons dataset. Below each plot, we show the
2-Wasserstein distance (lower is better; mean ± std over 10 runs). Compared to Figure 1, we have added a first row illustrating the prior-data
coupling during training. Note that the OT-coupled paths during training (first row) do cross. This is expected: the commonly cited
“no-crossing” property of OT coupling refers to the uniqueness of the pair (x0, x1) given x and t, i.e., at the same timestep t, no two paths may
cross at x (see Proposition 3.4 in Tong et al. [42] and Theorem D.2 in Pooladian et al. [36]). Since we plot all timesteps simultaneously in
this figure, there are apparent crossings. However, the intersecting paths do not share the same timestep t at the point of intersection.
Hyperparameter            CIFAR-10               ImageNet-32
Channels                  128                    256
Depth                     2                      3
Channels multiple         1, 2, 2, 2             1, 2, 2, 2
Heads                     4                      4
Heads channels            64                     64
Attention resolution      16                     4
Dropout                   0.0                    0.0
Use scale shift norm      True                   True
Batch size / GPU          128                    128
GPUs                      2                      4
Effective batch size      256                    512
Iterations                100k                   300k
Learning rate             2.0e-4                 1.0e-4
Learning rate scheduler   Warmup then constant   Warmup then linear decay
Warmup steps              5k                     20k
OT batch size (per GPU)   640                    6400
Table A1. Hyperparameter settings for training on CIFAR-10 and ImageNet-32.
C.2. CIFAR-10
In CIFAR-10, we employ the UNet architecture used by Tong et al. [42]. We list our hyperparameters in Table A1, following
the format in [27, 36, 42]. To accelerate training, we use bf16. We use the Adam [20] optimizer with the following parameters:
β1 = 0.9, β2 = 0.95, weight decay = 0.0, and ϵ = 1e-8. For learning rate scheduling, we linearly increase the learning rate
from 1.0e-8 to 2.0e-4 over 5,000 iterations and then keep the learning rate constant. The “use scale shift norm” setting denotes
employing adaptive layer normalization to incorporate the input condition, as implemented in [8]. To stabilize training, we clip
the gradient norm to 1.0. We report the results using an exponential moving average (EMA) model with a decay factor of
0.9999. For FID computation, we use the clean_fid library [33] in legacy_tensorflow mode following [42].
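Two of the training details above, gradient-norm clipping at 1.0 and EMA with a decay factor of 0.9999, can be sketched on flat lists of numbers. The function names and the small `eps` stabilizer are assumptions made for illustration; a framework implementation would operate on parameter tensors.

```python
import math


def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale a flat list of gradients so its global L2 norm is at most max_norm.

    Returns the rescaled gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


def ema_update(ema_params, params, decay=0.9999):
    """One EMA step: ema <- decay * ema + (1 - decay) * params."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


grads = [3.0, 4.0]                        # global norm is 5.0
clipped, norm_before = clip_grad_norm(grads)
ema = ema_update([1.0, 1.0], [0.0, 2.0])  # EMA weights move slightly toward params
```

With a decay of 0.9999, each update moves the EMA weights by only 0.01% of the gap to the current weights, so the evaluated model averages over roughly the last ten thousand iterations.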
C.3. ImageNet-32
Data. We use face-blurred ImageNet-1K [7, 44] following [28], and apply the downsampling script from [6]. Images are
downsampled to 32×32 using the ‘box’ algorithm, and reference FID statistics are computed with respect to the downsampled
validation set images. For text input, we use the captions provided by [25].
Network and Training. We largely follow the training pipeline described in Appendix C.2. We use a larger network
following [36] and list our hyperparameters in Table A1. To encode text input, we use the openclip library [18] and the text
encoder of the “DFN5B-CLIP-ViT-H-14” checkpoint [11], a pretrained CLIP-like model. CLIP feature vectors are normalized
to unit norm before being used as input conditions. For learning rate scheduling, after the initial warmup phase, we linearly
decay the learning rate to 1.0e-8 over time.
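The two learning rate schedules in Table A1 can be written as one small function. This is a sketch: the function name and signature are hypothetical, and it assumes that the linear decay reaches 1.0e-8 exactly at the final training iteration, which the text above does not state explicitly.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps,
                  floor_lr=1.0e-8, linear_decay=False):
    """Linear warmup from floor_lr to peak_lr, then constant or linear decay to floor_lr."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return floor_lr + frac * (peak_lr - floor_lr)
    if not linear_decay:
        return peak_lr
    # Assumption: the decay spans all remaining iterations after warmup.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr + min(1.0, frac) * (floor_lr - peak_lr)


# CIFAR-10: warmup over 5k steps, then constant at 2.0e-4.
cifar_lr = [learning_rate(s, 2.0e-4, 5_000, 100_000)
            for s in (0, 2_500, 5_000, 100_000)]

# ImageNet-32: warmup over 20k steps, then linear decay from 1.0e-4.
imagenet_lr = [learning_rate(s, 1.0e-4, 20_000, 300_000, linear_decay=True)
               for s in (0, 20_000, 160_000, 300_000)]
```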
Evaluation. As stated in the main paper, we use 49,997 images from the validation set to compute FID. This is because
the fine-grained nature of image captions might lead to overfitting, i.e., memorizing the training set. For CLIP score
computation, we evaluate the cosine similarity between the input caption and the generated image using SigLIP-2 [43], with
the ViT-SO400M-16-SigLIP2-256 checkpoint via the openclip library [18].
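The similarity computation itself is simple once embeddings are available. The sketch below uses toy three-dimensional vectors as stand-ins for the text and image embeddings that the SigLIP-2 encoders would produce; the encoder calls and the aggregation over the evaluation set are not shown, and the function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(text_emb, image_emb):
    """Cosine similarity, computed as the dot product of unit-normalized embeddings."""
    t, i = l2_normalize(text_emb), l2_normalize(image_emb)
    return sum(a * b for a, b in zip(t, i))


# Toy embeddings standing in for encoder outputs.
score_same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
score_orth = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])  # orthogonal vectors
```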
C.4. ImageNet-256
For this dataset, we use the open-source implementation of LightningDiT [ 46] and train the models under the ‘64 epochs’ setting
with minimal modifications to change the network from class-conditioned to caption-conditioned. In addition to integrating the
coupling algorithms (OT and C2OT, while the original LightningDiT [46] already employs FM), our modifications include:
1. Changing the input conditional mapping layer from an embedding layer (that takes a class label as input) to a linear layer
(that takes CLIP features as input).
2. Adjusting the classifier-free guidance (CFG) scale. We find that the model benefits from a higher CFG scale when using
caption conditioning. Specifically, we increase the CFG scale from 10.0 to 17.0, and adjust the CFG interval start parameter
from 0.11 to 0.10.
For data and evaluation, we follow the same setup as described in Appendix C.3.
D. Additional Generated Images
We present additional image generation results in this section. All showcased images are uncurated, meaning they were
sampled completely at random. For consistency and direct comparison, we used the same random seed for each generation
across different methods.
D.1. CIFAR-10
[Figure A4: image grids. Panel labels: Euler-10, Adaptive; FM, OT, C2OT (ours).]
Figure A4. Uncurated generations in CIFAR-10, 10 per class. We compare FM, OT, and C2OT with both 10-step Euler’s method and an
adaptive solver for test-time numerical integration.
D.2. ImageNet-32
[Figure A5: image grids. Panel labels: Euler-10, Adaptive; FM, OT, C2OT (ours).]
Figure A5. Uncurated generations in ImageNet-32. We compare FM, OT, and C2OT with both 10-step Euler’s method and an adaptive
solver for test-time numerical integration.
D.3. ImageNet-256
[Figure A6: image grids. Panel labels: Euler-2, Euler-5, Euler-10, Euler-25, Euler-50, Euler-100; FM, OT, C2OT (ours).]
Figure A6. Uncurated generations in ImageNet-256. We compare FM, OT, and C2OT with different numbers of sampling steps.
[Figure A7: image grids. Panel labels: FM, OT, C2OT (ours).]
Figure A7. Uncurated generations in ImageNet-256. We compare FM, OT, and C2OT with an adaptive solver for test-time numerical
integration.
E. DeltaAI Acknowledgment
This work used the DeltaAI system at the National Center for Supercomputing Applications through allocation CIS250008
from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported
by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.