
Paper 2503.10636

The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

Authors: Ho Kei Cheng, Alexander Schwing

Published: 2025-03-13

Abstract:

Minibatch optimal transport coupling straightens paths in unconditional flow matching. This leads to computationally less demanding inference, as fewer integration steps and less complex numerical solvers can be employed when numerically solving an ordinary differential equation at test time. However, in the conditional setting, minibatch optimal transport falls short. This is because the default optimal transport mapping disregards conditions, resulting in a conditionally skewed prior distribution during training. In contrast, at test time, we have no access to the skewed prior and instead sample from the full, unbiased prior distribution. This gap between training and testing leads to subpar performance. To bridge this gap, we propose conditional optimal transport (C²OT), which adds a conditional weighting term in the cost matrix when computing the optimal transport assignment. Experiments demonstrate that this simple fix works with both discrete and continuous conditions in 8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method performs better overall compared to the existing baselines across different function evaluation budgets. Code is available at https://hkchengrex.github.io/C2OT

Paper Content:
The Curse of Conditions: Analyzing and Improving Optimal Transport for Conditional Flow-Based Generation

Ho Kei Cheng, Alexander Schwing
University of Illinois Urbana-Champaign
{hokeikc2,aschwing}@illinois.edu
arXiv:2503.10636v1 [cs.LG] 13 Mar 2025

[Figure 1 panels, left to right: Unconditional (FM, OT); With discrete conditions (FM, OT, C²OT (ours)); With continuous conditions, target x-coordinates (FM, OT, C²OT (ours)). Rows: Euler-1 step, Adaptive. 2-Wasserstein distances shown below the plots, in panel order:
Euler-1 step: 6.232±0.186, 0.072±0.025, 2.573±0.112, 0.483±0.059, 0.048±0.010, 0.732±0.052, 8.276±6.510, 0.077±0.024
Adaptive: 0.120±0.045, 0.050±0.040, 0.059±0.029, 0.462±0.025, 0.018±0.006, 0.028±0.010, 2.143±1.993, 0.013±0.003]
Figure 1. We visualize the flows learned by different algorithms using the 8gaussians→moons dataset. We compare flow matching (FM), minibatch optimal transport FM (OT), and our proposed conditional optimal transport FM (C²OT) using one Euler step and an adaptive ODE solver, respectively. Below each plot, we show the 2-Wasserstein distance (lower is better; mean ± std over 10 runs). For unconditional generation, OT achieves significantly straighter flows, outperforming FM. However, paradoxically, OT performs worse when conditions are introduced. This paper analyzes this degradation and finds that it occurs because optimal transport disregards conditions, leading to a train-test gap. To address this, we propose a simple fix (C²OT) that outperforms FM and OT in conditional generation.

1. Introduction

We focus on flow-matching-based conditional generative models, i.e., generation guided by an input condition.¹ Examples of such conditions include class labels, text, or video. Recently, flow matching has been applied in many areas of computer vision, including text-to-image/video [10, 35], vision-language applications [3], image restoration [32], and video-to-audio [4]. However, these methods can be slow at test time: obtaining a solution requires numerically integrating the flow with an ODE solver, typically involving many steps, each requiring a deep network forward pass. Naïvely reducing the number of steps degrades performance, as the underlying flow is often curved (Figure 1, leftmost column), necessitating a small step size for accurate numerical integration.

¹ We use "condition" to denote an input condition, not to be confused with the "condition" in conditional flow matching (CFM) [27] (where the second "C" in our acronym comes from), which refers to a data sample. To avoid ambiguity, we refer to CFM simply as flow matching (FM) in this paper.

One way to address this issue is by straightening the flow. In the context of unconditional generation, Tong et al. [42] and Pooladian et al.
[36] concurrently proposed “minibatch optimal transport” (OT), which deterministically couples data and samples from the prior within a minibatch via optimal transport (replacing random coupling) to minimize flow path lengths and to straighten the flow.

While OT improves unconditional generation, we find that it paradoxically and consistently harms conditional generation (Figure 1). This occurs because OT disregards the conditions when computing the coupling. As a result, during training, OT samples from a prior distribution that is skewed by the condition, as we show in Section 3.2. However, at test time, we have no access to this skewed distribution and instead sample from the full prior distribution. This mismatch creates a gap between training and testing, leading to performance degradation. To bridge this gap, we propose conditional optimal transport FM (C²OT), which introduces a simple yet effective condition-aware weighting term when computing the cost matrix for OT. Additionally, we propose two techniques: adaptive weight finding to simplify hyperparameter tuning, and efficient oversampling of OT batches to counteract the reduced effective OT batch size introduced by the weighting term.

We conduct extensive experiments on both two-dimensional synthetic data and high-dimensional image data, including CIFAR-10, ImageNet-32×32, and ImageNet-256×256. The results demonstrate that our method performs well across different condition types (discrete and continuous), datasets, network architectures (UNets and transformers), and data spaces (image space and latent space). C²OT achieves better overall performance than the existing baselines across different inference computation budgets, including with few-step Euler's method and with an adaptive ordinary differential equation (ODE) solver [9].

2. Related Works

Flow Matching.
Recently, flow matching [1, 2, 27] has become a popular choice for generative modeling, in part due to its simple training objective and ability to generate high-quality samples. However, at test time, flow-based methods typically require many computationally expensive forward passes through a deep net to numerically integrate the flow. This is because the underlying flow is usually curved and therefore cannot be approximated well with a few integration steps.

Optimal Transport Coupling. To straighten the flow, in the context of unconditional generation, Tong et al. [42] and Pooladian et al. [36] concurrently proposed minibatch optimal transport (OT). While OT has proven effective in the unconditional setting, we show in Section 3.2 that it skews the prior distribution and fails in the conditional setting if used naively. Our method, C²OT, specifically addresses this failure, aiming to extend the success of OT in unconditional generation to conditional generation. Note that OT has also been used in equivariant flow matching [22, 41], which exploits domain-specific data symmetry (e.g., in molecules). In contrast, our method targets general data. Further, Hui et al. [17] find that mixing OT and independent coupling improves shape completion, which is an orthogonal contribution to this paper.

Straightening Flow Paths. Other methods of straightening flow paths exist. Reflow [30] achieves this by retraining a flow matching network multiple times, at the cost of additional training overhead. Variational flow matching [13] reduces flow ambiguity with a latent variable, and hierarchical flow matching [47] straightens flows by modeling higher-order flows. These approaches are orthogonal to our focus on prior-data coupling and, in principle, can be combined with our method.

Learning the Prior Distribution. An alternative approach to improving flow is to jointly learn a prior distribution with the flow matching network.
However, these methods [26, 29, 39] typically rely on a variational autoencoder (VAE) [16, 21] to learn the prior. This introduces yet another density estimation problem, which complicates training. In this work, we focus on improving training-free coupling plans without learning a new prior.

Consistency Models. Consistency models [40] have been proposed to accelerate diffusion model sampling, and can be extended to flow matching [45]. Recent advancements have improved training efficiency [12], stability [31], and quality [19]. These methods are orthogonal to our contribution, which focuses on improving prior-data coupling, and can be integrated, as demonstrated by Silvestri et al. [39] in their combination of consistency models with OT. In this paper, we focus on the base form of flow matching to avoid distractions from other formulations.

3. Method

3.1. Preliminaries

Flow Matching (FM). We base our discussion on FM [27, 42] (also called rectified flow [30]) for generative modeling and refer readers to [30, 42] for details. At test time, a sample $\tilde{x}_1$ is characterized by an ordinary differential equation (ODE) with an initial boundary value specified via noise $x_0$, randomly drawn from a prior distribution (e.g., the standard normal). To compute the sample $\tilde{x}_1$, this ODE is commonly solved by numerically integrating a learned flow/velocity field $v_\theta$ from time $t = 0$ to time $t = 1$, where $\theta$ denotes the set of trainable deep net parameters. For conditional generation, we additionally obtain a condition $c$, e.g., a class label from user input, which adjusts the flow. We use $v_{\theta,c}$ to denote this conditional flow.

At training time, we find the deep net parameters $\theta$ by minimizing the objective

$$\mathbb{E}_{t,\,(x_0, x_1, c) \sim \pi(x_0, x_1, c)} \left\| v_{\theta,c}(t, x_t) - u(x_t \mid x_0, x_1) \right\|^2, \tag{1}$$

where $t$ is uniformly drawn from the interval $[0, 1]$, and where $\pi(x_0, x_1, c)$ denotes the joint distribution of the prior and the training data. In practice, independent coupling is often used, i.e., the prior and the training data are sampled independently, such that

$$\pi(x_0, x_1, c) = q(x_0)\, q(x_1, c). \tag{2}$$

Further,

$$x_t = t x_1 + (1 - t) x_0 \tag{3}$$

defines a linear interpolation between noise and data, and

$$u(x_t \mid x_0, x_1) = x_1 - x_0 \tag{4}$$

denotes its corresponding flow velocity at $x_t$. Note that simply dropping the condition $c$ leads to the unconditional flow matching formulation.

Although rectified flow matching is trained with straight paths, the learned flow is often curved (Figure 1, bottom-left), since there are many possible paths induced by the independently sampled couplings $x_0, x_1$ through any $x_t$ at time $t$ [36]. Curved paths require more numerical integration and network evaluation steps, which is not ideal.

[Figure 2 panels: FM, OT, C²OT (ours), each shown for train and test. Panel labels: $t = 0$, $t = 1$; $q(x_0) = q(x_0 \mid c)$; $q(x_0 \mid c = 0)$, $q(x_0 \mid c = 1)$; $q(x_1 \mid c = 0)$, $q(x_1 \mid c = 1)$.]
Figure 2. We visualize the coupling during training and the learned flow from $t = 0$ to $t = 1$ during testing for FM, OT, and our proposed method C²OT. The prior is defined as $q(x_0) = \mathcal{N}(0, 1)$, while the target distribution is given by $q(x_1) = \frac{1}{2} q(x_1 \mid c = 0) + \frac{1}{2} q(x_1 \mid c = 1) = \frac{1}{2} \mathcal{N}(-2, 0.5) + \frac{1}{2} \mathcal{N}(2, 0.5)$. Note that FM results in curved flows during testing, while OT degenerates due to skewed training samples ($q(x_0 \mid c) \neq q(x_0)$). In contrast, our method successfully learns straight flows without degeneration.

Optimal Transport (OT) Coupling. To address this issue for unconditional generation, Tong et al. [42] and Pooladian et al.
[36] concurrently proposed to use minibatch optimal transport to deterministically couple the sampled prior and a sampled batch of the data. This minimizes flow path lengths and straightens the flow. Concretely, given a minibatch of $b$ samples of dimension $e$, an OT coupling seeks a $b \times b$ permutation matrix $P$ such that the squared Euclidean distance is minimized, i.e.,

$$\min_P \left\| X_0 - P X_1 \right\|_2^2. \tag{5}$$

Here, $X_0, X_1 \in \mathbb{R}^{b \times e}$ are matrices containing a minibatch of samples from the prior and the data. Instead of an “independent coupling”, in the unconditional setting we then optimize a variant of Equation (1) using the coupled tuple $(X_0, P X_1)$. Note that this algorithm corresponds to sampling tuples $(x_0, x_1)$ from a joint distribution $\pi_{\text{ot}}(x_0, x_1)$.

Importantly, for unconditional generation, i.e., when using the joint distribution $\pi_{\text{ot}}(x_0, x_1)$, the marginals remain unchanged from independent coupling (see [42]):

$$\int \pi_{\text{ot}}(x_0, x_1)\, q(x_0)\, \mathrm{d}x_0 = q(x_1), \quad \text{and} \quad \int \pi_{\text{ot}}(x_0, x_1)\, q(x_1)\, \mathrm{d}x_1 = q(x_0). \tag{6}$$

Hence, at test time, we can sample from the prior $q(x_0)$ and integrate the flow $v_\theta$ to simulate data from the data distribution $q(x_1)$.

Unfortunately, as shown in the next section, this does not hold for conditional generation. Said differently, naively extending the unconditional joint distribution $\pi_{\text{ot}}(x_0, x_1)$ to a conditional joint distribution $\pi_{\text{ot}}(x_0, x_1, c)$ by using the coupled tuple $(X_0, P X_1, P C)$, where $C \in \mathbb{R}^{b \times h}$ is a batch of $h$-dimensional conditions, does not maintain marginals and leads to a gap between training time and test time. We discuss this next.

3.2. OT Coupling Skews The Prior

Intuition. Consider the one-dimensional example illustrated in Figure 2: the prior distribution $q(x_0)$ follows a Gaussian distribution, and the target distribution $q(x_1)$ consists of a mixture of two Gaussians translated left and right, with the translation direction indicated by a binary condition $c$.
When training with a $\pi_{\text{ot}}(x_0, x_1, c)$ that represents the coupled tuple $(X_0, P X_1, P C)$, samples from the prior and the sampled conditions become correlated through the OT coupling: samples from the left half of the prior distribution are always associated with $c = 0$, while those from the right half always correspond to $c = 1$. As a result, the model overfits on data pairs $(x_0, c)$ where either $(x_0 < 0, c = 0)$ or $(x_0 > 0, c = 1)$. However, during testing, this skewed distribution is no longer accessible, and instead, $x_0$ and $c$ are sampled independently. Consequently, the model encounters previously unseen data pairs, i.e., $(x_0 > 0, c = 0)$ and $(x_0 < 0, c = 1)$, leading to failure. A similar phenomenon can be observed in the two-dimensional example illustrated in Figure 1.

Skewed Prior. Formally, imagine that we consider the most general OT coupling joint distribution $\pi_{\text{ot}}(x_0, x_1, c)$ at training time. At test time, the user wants to specify a condition $c_1$ to obtain samples from the conditional distribution $q(x_1 \mid c = c_1)$. What is the prior distribution that we need to sample from such that we arrive at the correct conditional distribution $q(x_1 \mid c = c_1)$? To compute this, imagine that we are given an arbitrary condition $c_1$, i.e., we look at the induced distribution $\pi_{\text{ot}}(x_0, x_1, c \mid c = c_1)$. Marginalizing this induced distribution over $x_1$ yields

$$\int \pi_{\text{ot}}(x_0, x_1, c \mid c = c_1)\, \mathrm{d}x_1 = \int \pi_{\text{ot}}(x_0, x_1 \mid c = c_1)\, \mathrm{d}x_1 = q(x_0 \mid c = c_1), \tag{7}$$

which implies that the prior required to arrive at the conditional distribution $q(x_1 \mid c = c_1)$ is $q(x_0 \mid c = c_1)$, a distribution that we cannot access at test time.

In general, the prior that we need to sample from, given condition $c$, is skewed from $q(x_0)$ to $q(x_0 \mid c)$, which is inaccessible at test time. As a result, accurately capturing $q(x_1 \mid c)$ at test time becomes infeasible unless $x_0$ and $c$ are independent, i.e., unless $q(x_0 \mid c) = q(x_0)$. However, this condition is generally not satisfied by the most general OT coupling; for instance, in Figure 2, clearly $q(x_0 \mid c) \neq q(x_0)$.
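The skew described above can be reproduced numerically. The sketch below is not from the paper: it assumes the one-dimensional setup of Figure 2, treats 0.5 as the standard deviation of each target mode (the text does not say whether 0.5 is a variance or a standard deviation), and relies on the fact that in one dimension with squared Euclidean cost the optimal assignment pairs the k-th smallest prior sample with the k-th smallest data sample, so sorting stands in for a general OT solver. The function name `ot_prior_means_per_condition` is hypothetical.

```python
import random
import statistics


def ot_prior_means_per_condition(b=2000, seed=0):
    """Mean of the OT-coupled prior samples x0, split by the condition they end up with."""
    rng = random.Random(seed)
    cond = [rng.randint(0, 1) for _ in range(b)]                    # binary condition c
    x1 = [rng.gauss(-2.0 if c == 0 else 2.0, 0.5) for c in cond]    # target samples (std 0.5 assumed)
    x0 = [rng.gauss(0.0, 1.0) for _ in range(b)]                    # prior samples from N(0, 1)

    # 1D squared-cost OT: match the k-th smallest x0 with the k-th smallest x1.
    order0 = sorted(range(b), key=x0.__getitem__)
    order1 = sorted(range(b), key=x1.__getitem__)

    by_cond = {0: [], 1: []}
    for i, j in zip(order0, order1):
        by_cond[cond[j]].append(x0[i])   # x0[i] inherits the condition of its matched x1[j]
    return statistics.mean(by_cond[0]), statistics.mean(by_cond[1])


mean_c0, mean_c1 = ot_prior_means_per_condition()
print(f"mean of x0 coupled with c=0: {mean_c0:.3f}")
print(f"mean of x0 coupled with c=1: {mean_c1:.3f}")
```

Under independent coupling both means would be close to zero. Here they fall on opposite sides of zero, which is the statement $q(x_0 \mid c) \neq q(x_0)$ in sample form.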
To address this issue, we propose a method, C²OT, to unskew the prior. We discuss this in the next section.

3.3. Conditional Optimal Transport FM

3.3.1. Formulation

We propose conditional optimal transport FM (C²OT) to unskew the prior $q(x_0 \mid c)$ such that $q(x_0 \mid c) = q(x_0)$. This ensures that at test time, we can sample from the full prior $q(x_0)$, irrespective of the condition $c$. Importantly, at the same time, we also aim to preserve the straight flow paths provided by OT. Conceptually, we construct a prior distribution independently for each condition, i.e., $q(x_0 \mid c_1) = q(x_0 \mid c_2) = q(x_0),\ \forall c_1, c_2$. This can be achieved by sampling from the prior and computing OT independently for each condition, as shown in Figure 2 (right).

Formally, we construct the joint distribution as

$$\pi_{\text{c2ot}}(x_0, x_1, c) := q(x_0)\, q(c)\, \pi_{\text{ot}}(x_1 \mid x_0, c). \tag{8}$$

Here, the prior sample $x_0 \sim q(x_0)$ and the condition $c \sim q(c)$ are sampled independently, while the data $x_1$ is conditioned on $x_0$ (via OT) and $c$ (via the dataset construction). Different from independent coupling in Equation (2), $x_1$ and $x_0$ are associated through OT and therefore lead to straighter flow paths. Also different from the most general OT coupling $\pi_{\text{ot}}(x_0, x_1, c)$, we explicitly enforce the independence between $x_0$ and $c$. To see that this joint distribution provides an unskewed prior, imagine that we are given an arbitrary input condition $c_1$. Marginalizing over $x_1$ yields

$$\int q(x_0)\, q(c \mid c = c_1)\, \pi_{\text{ot}}(x_1 \mid x_0, c = c_1)\, \mathrm{d}x_1 = \int q(x_0)\, \pi_{\text{ot}}(x_1 \mid x_0, c = c_1)\, \mathrm{d}x_1 = q(x_0). \tag{9}$$

This implies that, at test time, we can sample from the entire prior $q(x_0)$, regardless of the condition $c$, to arrive at the desired data distribution $q(x_1 \mid c)$.

During training, in practice, we sample from $\pi_{\text{c2ot}}$ by modifying the optimal transport cost function in Equation (5). This process is exact for discrete conditions (Section 3.3.2) and approximate for continuous conditions (Section 3.3.3).
Specifically, given a minibatch of $b$ samples of dimension $e$ and a minibatch of $b$ conditions of dimension $h$, $C \in \mathbb{R}^{b \times h}$, C²OT seeks a $b \times b$ permutation matrix $P$ that minimizes the following cost function:

$$\min_P \left\| X_0 - P X_1 \right\|_2^2 + \sum_{i=1}^{b} f(C_i, [PC]_i), \tag{10}$$

where $f$ is a symmetric, non-negative cost function satisfying the property $f(c, c) = 0\ \forall c$. The key differences between independent coupling, OT coupling, and C²OT coupling are highlighted in Algorithms 1 to 3. In the following two sections, we discuss our choice of $f$ for both discrete and continuous conditions.

Algorithm 1 Independent Coupling
    function INDEPENDENT(X1, C)
        X0 ∼ N                                  ▷ Sample from prior
        return X0, X1, C

Algorithm 2 OT Coupling
    function OT(X1, C)
        X0 ∼ N                                  ▷ Sample from prior
        Cost ← pairwise_dist(X0, X1)
        i, j ← Hungarian_matching(Cost)
        return X0[i], X1[j], C[j]

Algorithm 3 C²OT Coupling (Ours)
    function C2OT(X1, C)
        X0 ∼ N                                  ▷ Sample from prior
        Cost ← pairwise_dist(X0, X1) + pairwise_dist(C, C)
        i, j ← Hungarian_matching(Cost)
        return X0[i], X1[j], C[j]

3.3.2. Discrete Conditions

For discrete conditions, we assume the conditions correspond to class labels from a finite, discrete set, i.e., $c \in \{0, 1, \ldots, k - 1\}$. For example, in Figure 2, the labels 0 and 1 represent the left and right groups in the target distribution, respectively. We define $f = f_d$ such that no transport is allowed to modify the conditions (i.e., $PC = C$). This is achieved via

$$f_d(c_1, c_2) = \begin{cases} \infty, & c_1 \neq c_2, \\ 0, & \text{otherwise}. \end{cases} \tag{11}$$

As a result, the unordered set of prior samples for any given condition $c$ remains unchanged: $\{X_{0,i} \mid c_i = c\} \overset{\text{iid}}{\sim} q(x_0)$. This formulation samples exactly from our ideal (Equation (8)), as OT is computed independently within each class, and the data sample from each class is exposed to the full prior without skew. This effect is illustrated in Figure 1 (middle) and Figure 2 (right). Note that, as expected, the results differ from FM's curved flow paths and OT's inability to properly capture the target distribution.
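Algorithms 2 and 3 can be sketched for the discrete case as follows. This is a minimal illustration, not the authors' released code: it assumes NumPy arrays, uses SciPy's `linear_sum_assignment` as the Hungarian-style solver, and replaces the infinite penalty of Equation (11) with a large finite constant `BIG` (an assumption made for numerical convenience). The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

BIG = 1e9  # finite stand-in for the infinite penalty in Equation (11) (assumption)


def squared_dist(x0, x1):
    """cost[i, j] = ||x0_i - x1_j||^2 for row-wise samples."""
    return ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)


def ot_coupling(x0, x1, c):
    """Algorithm 2: the coupling ignores the conditions."""
    i, j = linear_sum_assignment(squared_dist(x0, x1))
    return x0[i], x1[j], c[j]


def c2ot_coupling_discrete(x0, x1, c):
    """Algorithm 3 with f = f_d: matches across different labels are penalized."""
    cost = squared_dist(x0, x1) + BIG * (c[:, None] != c[None, :])
    i, j = linear_sum_assignment(cost)
    return x0[i], x1[j], c[j]


rng = np.random.default_rng(0)
b = 64
c = rng.integers(0, 3, size=b)                     # class labels in {0, 1, 2}
x1 = rng.normal(size=(b, 2)) + 3.0 * c[:, None]    # toy data, shifted per class
x0 = rng.normal(size=(b, 2))                       # prior samples

_, _, c_ot = ot_coupling(x0, x1, c)
_, _, c_c2ot = c2ot_coupling_discrete(x0, x1, c)
print("OT keeps each slot's label:  ", np.array_equal(c_ot, c))
print("C2OT keeps each slot's label:", np.array_equal(c_c2ot, c))
```

In the C²OT variant, prior sample `x0[i]` is tied to the label in slot `i` before any matching happens, so the label it receives does not depend on where `x0[i]` lies in the prior. The identity matching is always free of penalties, so the solver never needs a cross-label pair as long as `BIG` exceeds the total transport cost.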
It is important to note that enforcing $PC = C$ does not necessarily imply $P = I$, since we can still swap rows $i$ and $j$ in $P$ as long as $c_i = c_j$. Setting $P = I$ would correspond to a degeneration back to independent coupling in Equation (2) and forfeit the benefits of straightened flow provided by OT. Unfortunately, using $f_d$ with continuous conditions leads to exactly this degeneration, since, in general, $c_i \neq c_j$ unless $i = j$. Hence, in the next section, we devise a relaxed cost for handling continuous conditions.

3.3.3. Continuous Conditions

For continuous conditions, the condition is typically a feature embedding, e.g., computed from a text prompt. Directly applying $f_d$ in Equation (11) leads to a degeneration back to independent coupling, as discussed earlier. To avoid this, we propose a relaxed penalty function $f_c$ based on the cosine distance:

$$f_c(c_1, c_2) = w \left( 1 - \frac{c_1 \cdot c_2}{\|c_1\| \|c_2\|} \right) = w \cdot \text{cdist}(c_1, c_2). \tag{12}$$

Here, $w > 0$ is a scaling hyperparameter. Note that when $w = 0$, this formulation degenerates to regular OT; when $w \to \infty$, this formulation approaches Equation (11), i.e., independent coupling if no two conditions are equal. The right section of Figure 1 illustrates an example where the x-coordinate of the target data point serves as the condition. Notably, C²OT preserves straight flow paths without exhibiting OT's apparent degradation.

Finding $w$. The optimal $w$ is highly dependent on the data distribution and the magnitude of features. To alleviate the need for tuning the hyperparameter $w$, we propose to find $w$ adaptively for each minibatch. For this, we develop a ratio $r(w)$ which measures the proportion of sample pairs that are considered “potential optimal transport candidates”, i.e.,

$$r(w) = \frac{1}{b^2} \sum_{i,j} \mathbb{1}\left( \|X_{0,i} - X_{1,j}\|_2^2 + w \cdot \text{cdist}(c_i, c_j) \leq \|X_{0,i} - X_{1,i}\|_2^2 \right), \tag{13}$$

where $\mathbb{1}(\cdot)$ is an indicator function that evaluates to one if the condition is satisfied and to zero otherwise. Then, we introduce a hyperparameter $r_{\text{tar}}$ as the target ratio and find $w$ such that $r(w) \approx r_{\text{tar}}$.
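The ratio in Equation (13) and a search for $w$ can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: it uses NumPy, assumes every condition vector is non-zero, and brackets $w$ by doubling from a hypothetical starting value `w_init` before bisecting. The names `cosine_dist`, `candidate_ratio`, and `find_w` are hypothetical.

```python
import numpy as np


def cosine_dist(c):
    """Pairwise cosine distance between the rows of c (assumes non-zero rows)."""
    unit = c / np.linalg.norm(c, axis=1, keepdims=True)
    dist = np.clip(1.0 - unit @ unit.T, 0.0, 2.0)
    np.fill_diagonal(dist, 0.0)  # enforce cdist(c_i, c_i) = 0 despite rounding
    return dist


def candidate_ratio(x0, x1, c, w):
    """r(w) from Equation (13) for one minibatch."""
    d = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(-1)  # d[i, j] = ||x0_i - x1_j||^2
    lhs = d + w * cosine_dist(c)
    rhs = np.diag(d)[:, None]  # ||x0_i - x1_i||^2, broadcast along j
    return float((lhs <= rhs).mean())


def find_w(x0, x1, c, r_tar, w_init=1.0, n_bisect=10):
    """Bracket w by doubling, then bisect, so that r(w) <= r_tar at the returned value."""
    lo, hi = 0.0, w_init
    for _ in range(60):  # r(w) is non-increasing in w; capped in case r_tar is unreachable
        if candidate_ratio(x0, x1, c, hi) <= r_tar:
            break
        lo, hi = hi, 2.0 * hi
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if candidate_ratio(x0, x1, c, mid) > r_tar:
            lo = mid
        else:
            hi = mid
    return hi


rng = np.random.default_rng(0)
x0 = rng.normal(size=(128, 2))   # prior minibatch
x1 = rng.normal(size=(128, 2))   # data minibatch
c = rng.normal(size=(128, 8))    # toy continuous conditions
w = find_w(x0, x1, c, r_tar=0.05)
print(w, candidate_ratio(x0, x1, c, w))
```

Because the pair $i = j$ always satisfies the inequality, $r(w)$ never drops below $1/b$, so a target ratio smaller than $1/b$ cannot be reached for that batch size.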
Note that the choice of $r_{\text{tar}}$ is invariant to the scaling of distances between $x_0$ and $x_1$. Namely, if $\|X_{0,i} - X_{1,j}\|_2^2$ is scaled by $s$, $w$ can also be scaled by $s$ to preserve the same $r_{\text{tar}}$. This invariance is particularly useful when transitioning to a new dataset, such as changing image resolutions in image generation tasks. In our implementation, we search for $w$ in two steps: first with exponential search to establish the upper/lower bounds, then with binary search to locate a precise value. The final $w$ is used as the initial value in the subsequent minibatch. In practice, we observe that $w$ differs very little between minibatches (<1%) and that the search converges quickly, within 10 iterations, with negligible overhead. Next, we introduce oversampling, a simple technique to obtain more accurate OT couplings.

3.3.4. Oversampling OT Batches

OT vs. Deep Net Batch Sizes. Typically, the “OT batch size” $b$ used to compute optimal transport is set equal to the “deep net batch size” $b_n$ used to compute forward/backward passes of the deep net [36, 42]. While the OT batch size controls the dynamics of prior-to-data assignments, the deep net batch size affects gradient variances, training efficiency, and memory usage. There is no reason why setting them equal should be optimal. In particular, with our proposed C²OT, the effective OT batch size is reduced since we restrict transport between samples with divergent conditions, which calls for an increased OT batch size $b$. At the same time, keeping the deep net batch size $b_n$ unchanged prevents unwarranted side effects in other aspects of training.

Reduced Effective OT Batch Size. In the case of discrete conditions, OT is effectively performed within $\{X_{0,i}, X_{1,i} \mid c_i = c\}$ for each condition $c$. With $k$ classes, the expected effective OT batch size is reduced to $\mathbb{E}(|\{i \mid c_i = c\}|) = b/k$. Empirically, we find that using a much larger OT batch size $b$ than the network batch size $b_n$ is helpful. This observation motivates oversampling.

Efficiency. A natural concern with increasing $b$ is the computational overhead, since we compute optimal transport with Hungarian matching [24], which has a time complexity of $O(b^3)$. However,
• Since each OT batch can be used across $(b/b_n)$ forward/backward passes, the amortized cost per minibatch reduces to $O(b^2 b_n)$.
• We offload OT computation from the main process to the data loading processes, enabling data preparation on the CPU in parallel with forward/backward passes on the GPU.
With a modest choice of 8 data workers per GPU, we observe no wall-clock overhead for OT batch sizes up to 6,400 when training on CIFAR-10 and ImageNet, as data preparation is completed faster than the GPU can process batches.

4. Experiments

We experimentally verify our proposed method C²OT in two sections: first on the two-dimensional synthetic dataset 8gaussians→moons, and then on high-dimensional image data, including CIFAR-10 [23] and ImageNet-1K [7]. For ImageNet-1K, we conduct experiments on both 32×32 images in image space and 256×256 images in latent space. Implementation details are provided in the appendix.

Table 1. Results on the 8gaussians→moons dataset with different training coupling methods and different conditioning. Note that our method C²OT does not apply in the unconditional setting.

| Conditioning | Method | Euler-1 ($W_2^2$ ↓) | Adaptive ($W_2^2$ ↓) | NFE ↓ |
|---|---|---|---|---|
| Unconditional | FM | 6.232±0.186 | 0.120±0.045 | 93.22±1.81 |
| Unconditional | OT | 0.072±0.025 | 0.050±0.040 | 36.08±2.57 |
| Binary conditions | FM | 2.573±0.112 | 0.059±0.029 | 91.10±2.77 |
| Binary conditions | OT | 0.483±0.059 | 0.462±0.025 | 32.89±0.95 |
| Binary conditions | C²OT (ours) | 0.048±0.010 | 0.018±0.006 | 58.88±1.48 |
| Continuous conditions | FM | 0.732±0.052 | 0.028±0.010 | 93.86±1.76 |
| Continuous conditions | OT | 8.276±6.510 | 2.143±1.993 | 42.07±2.59 |
| Continuous conditions | C²OT (ours) | 0.077±0.024 | 0.013±0.003 | 89.87±1.53 |

4.1. Two-Dimensional Data

The two-dimensional 8gaussians→moons dataset maps a mixture of eight Gaussian distributions to two interleaving half-circles, as visualized in Figure 1.
We consider three conditioning scenarios: (a) unconditional; (b) binary class conditions, where each class corresponds to one of the half-circles; and (c) continuous conditions, where the x-coordinate of the desired target point is given as the condition. For (b), we apply our setup for discrete conditions (Section 3.3.2); for (c), we apply our setup for continuous conditions (Section 3.3.3) with $r_{\text{tar}} = 0.01$.

Metrics. We report the 2-Wasserstein distance ($W_2^2$) (following [42]) between 10K generated data points and 10K samples from the ground-truth moons distribution. We experiment with single-step Euler's method (Euler-1) and the adaptive Dormand–Prince method [9] (dopri5, referred to as adaptive) for numerical integration. Additionally, we report the number of function evaluations (NFE) in the adaptive method, which is the number of forward passes through the neural network. All experiments are averaged over ten training runs with ten different random seeds, and we report mean ± std.

Results. The results are summarized in Table 1 and visualized in Figure 1. In the unconditional setting, OT produces straighter flows, leading to better distribution matching with fewer NFEs, and our method does not apply. However, in the conditional setting, OT degrades and performs worse than regular FM in the adaptive setting due to the aforementioned train-test gap in OT coupling. Our method, C²OT, effectively models the target distribution with both Euler-1 and the adaptive method, improving upon the baselines.

4.2. High-Dimensional Image Data

We conduct experiments on CIFAR-10 [23] (image space, class-conditioned), ImageNet-32×32 [7] (image space, caption-conditioned), and ImageNet-256×256 [7] (latent space, caption-conditioned). For ImageNet, we use the captions provided by [25] and encode text features using a CLIP [37]-like DFN [11] text encoder as input conditions. Our results are summarized in Table 2.
More implementation details can be found in the appendix.

Metrics. We report Fréchet inception distance (FID) [15] and the number of function evaluations (NFE) in the adaptive solver for all three datasets. To assess how well the generations adhere to the input conditions, we additionally compute condition adherence metrics. For CIFAR-10, we run a pretrained classifier [5] on the generated images to obtain logits and report the average cross entropy (CE) with the input class label. For ImageNet, we compute CLIP features [37] using SigLIP-2 [43] from the generated images and the input captions respectively, and report the average cosine similarities (CLIP).

Table 2. Results of different training algorithms on image generation tasks. We bold the best-performing entry and underline the second-best. Our proposed method, C²OT, achieves the best overall performance: better than FM with few sampling steps and better than OT with more sampling steps. In cases where C²OT performs second, our performance is often comparable to the best entry. Additionally, C²OT has the most stable performance, with the smallest deviations across runs in most entries. Variations in CLIP scores are minimal.

CIFAR-10 Class-Conditioned Generation

| Metric | Method | Euler-2 | Euler-5 | Euler-10 | Euler-25 | Euler-50 | Euler-100 | Adaptive | NFE ↓ |
|---|---|---|---|---|---|---|---|---|---|
| FID ↓ | FM | 105.481±1.619 | 21.968±0.400 | 10.479±0.237 | 5.640±0.117 | 4.128±0.069 | 3.429±0.054 | 2.922±0.039 | 124.0±2.5 |
| FID ↓ | OT | 64.417±0.434 | 18.194±0.409 | 10.778±0.270 | 6.821±0.186 | 5.368±0.113 | 4.671±0.066 | 4.144±0.018 | 125.7±0.4 |
| FID ↓ | C²OT (ours) | 64.694±0.379 | 17.565±0.202 | 9.762±0.136 | 5.644±0.081 | 4.152±0.061 | 3.447±0.049 | 2.904±0.035 | 127.7±0.7 |
| CE ↓ | FM | 3.343±0.022 | 0.546±0.020 | 0.323±0.007 | 0.272±0.004 | 0.267±0.002 | 0.269±0.003 | 0.276±0.004 | 124.0±2.5 |
| CE ↓ | OT | 2.529±0.010 | 0.624±0.012 | 0.426±0.005 | 0.369±0.004 | 0.363±0.004 | 0.364±0.005 | 0.372±0.005 | 125.7±0.4 |
| CE ↓ | C²OT (ours) | 2.264±0.022 | 0.475±0.009 | 0.322±0.004 | 0.276±0.005 | 0.271±0.003 | 0.272±0.003 | 0.279±0.004 | 127.7±0.7 |

ImageNet 32×32 Caption-Conditioned Generation

| Metric | Method | Euler-2 | Euler-5 | Euler-10 | Euler-25 | Euler-50 | Euler-100 | Adaptive | NFE ↓ |
|---|---|---|---|---|---|---|---|---|---|
| FID ↓ | FM | 113.250±0.793 | 23.015±0.123 | 11.141±0.051 | 7.165±0.087 | 6.142±0.085 | 5.673±0.074 | 5.358±0.059 | 146.9±3.5 |
| FID ↓ | OT | 81.317±0.369 | 20.763±0.154 | 11.360±0.089 | 7.854±0.077 | 6.899±0.079 | 6.469±0.071 | 6.220±0.065 | 137.7±0.2 |
| FID ↓ | C²OT (ours) | 102.380±0.279 | 21.965±0.035 | 10.897±0.033 | 7.069±0.027 | 6.084±0.016 | 5.638±0.018 | 5.350±0.017 | 134.2±1.4 |
| CLIP ↑ | FM | 0.067±0.000 | 0.076±0.000 | 0.074±0.000 | 0.071±0.000 | 0.071±0.000 | 0.070±0.000 | 0.069±0.000 | 146.9±3.5 |
| CLIP ↑ | OT | 0.067±0.000 | 0.071±0.000 | 0.070±0.000 | 0.068±0.000 | 0.068±0.000 | 0.067±0.000 | 0.067±0.000 | 137.7±0.2 |
| CLIP ↑ | C²OT (ours) | 0.068±0.000 | 0.075±0.000 | 0.073±0.000 | 0.071±0.000 | 0.070±0.000 | 0.070±0.000 | 0.069±0.000 | 134.2±1.4 |

ImageNet 256×256 Caption-Conditioned Generation in Latent Space

| Metric | Method | Euler-2 | Euler-5 | Euler-10 | Euler-25 | Euler-50 | Euler-100 | Adaptive | NFE ↓ |
|---|---|---|---|---|---|---|---|---|---|
| FID ↓ | FM | 203.355±0.075 | 31.740±0.100 | 10.088±0.020 | 3.898±0.016 | 5.225±0.011 | 3.474±0.008 | 3.484±0.014 | 133.5±10.9 |
| FID ↓ | OT | 190.893±0.213 | 46.510±0.140 | 18.221±0.078 | 7.477±0.045 | 9.333±0.046 | 7.630±0.020 | 7.181±0.022 | 115.9±15.3 |
| FID ↓ | C²OT (ours) | 201.010±0.067 | 30.578±0.124 | 10.032±0.007 | 3.702±0.017 | 5.075±0.013 | 3.335±0.010 | 3.290±0.012 | 126.7±12.7 |
| CLIP ↑ | FM | 0.027±0.000 | 0.120±0.000 | 0.133±0.000 | 0.138±0.000 | 0.137±0.000 | 0.139±0.000 | 0.139±0.000 | 133.5±10.9 |
| CLIP ↑ | OT | 0.031±0.000 | 0.101±0.000 | 0.114±0.000 | 0.118±0.000 | 0.117±0.000 | 0.118±0.000 | 0.118±0.000 | 115.9±15.3 |
| CLIP ↑ | C²OT (ours) | 0.029±0.000 | 0.120±0.000 | 0.133±0.000 | 0.138±0.000 | 0.137±0.000 | 0.139±0.000 | 0.138±0.000 | 126.7±12.7 |

CIFAR-10. We base our UNet architecture on Tong et al. [42], with details provided in the appendix. The CIFAR-10 training set contains 50,000 32×32 color images across 10 object classes. During testing, we generate 50,000 images evenly distributed among classes and assess FID against the training set following [42]. Each model is trained five times with different random seeds and the results are reported as mean ± std. We use an OT batch size $b$ of 640 and a ratio $r_{\text{tar}}$ of 0.01. Compared to FM, our method achieves better performance (FID and CE) in few-step settings and comparable performance in many-step settings. Compared to OT, our method performs significantly better in the many-step setting. Although OT achieves slightly better FID in Euler-2, it performs worse in CE, indicating that it fails to follow the condition.

ImageNet 32×32. We base our UNet architecture on Pooladian et al. [36], with details provided in the appendix. We train and evaluate on the face-blurred version [44] of ImageNet, following [36]. In contrast to prior works that assess FID against the training set, we assess FID against the validation set (49,997 images) to mitigate overfitting, as the fine-grained nature of captions may lead to overfitting on the training set. Each model is trained three times with different random seeds and the results are reported as mean ± std. Compared to CIFAR-10, there is much more variation in this dataset (e.g., 1000 classes vs. 10 classes in CIFAR-10). Thus, we adopt a larger OT batch size $b$ of 6400 while keeping the same target ratio $r_{\text{tar}}$ of 0.01 as before. Similar to our findings in CIFAR-10, our method performs better than or comparably to FM; in few-step settings, OT achieves better FID but worse condition following (CLIP). Our method strikes the best overall balance.

ImageNet 256×256 in Latent Space. Here, we explore latent flow matching models [38]. During training, the flow matching model learns to generate data in the latent space, which is encoded from images.
At test time, the generated latents are decoded into images using a pretrained decoder. We base our experiment on a recent transformer-based model, LightningDiT [46], demonstrating that our method is effective across different network architectures and data spaces. We use the officially provided “64 epochs” configuration for all experiments (due to computational constraints), so our results are not directly comparable to those of methods trained for 800 epochs. We follow the same evaluation protocol as ImageNet-32×32, and adopt the same OT batch size b of 6400 and target ratio r_tar of 0.01. Each model is evaluated three times with different random seeds and the results are reported as mean ± std. Again, C2OT has the best overall performance compared to both FM and OT. We visualize the generated images in Figure 3 and provide additional uncurated samples in the appendix.

Page 8:

Figure 3. Visual comparisons of 256×256 images generated by the baselines and our approach with different numbers of sampling steps. Our approach converges faster and produces cleaner outputs. The input caption is “a small dog sitting on a carpet”. [Image grid. Columns: Euler-2, Euler-5, Euler-10, Euler-25, Euler-50, Euler-100. Rows: FM, OT, C2OT (ours).]

4.3. Ablation Studies

4.3.1. Varying OT Batch Size b

Here, we analyze the effect of varying the OT batch size b. We conduct these experiments on the CIFAR-10 dataset, training each model three times with different random seeds for every choice of b. We plot the results in Figure 4. Increasing the OT batch size clearly improves FID in the few-sampling-steps setting (Euler-2). This is intuitive, because a larger OT batch size enables the coupling to approach the true OT result, i.e., dataset-level optimal transport. In turn, this leads to straighter paths. However, in the adaptive setting, increasing the OT batch size slightly worsens FID, though the effect is much less significant (the range of the blue y-axis is much smaller than the range of the red y-axis).
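As a rough illustration of why a larger OT batch size b moves the minibatch coupling closer to dataset-level OT, the sketch below couples toy two-dimensional samples with an exact assignment solver and reports the average squared transport distance for several values of b. It is not the authors' implementation: the function names, the toy distributions, and the optional `cond_dist` and `w` arguments (a stand-in for a conditional weighting term added to the cost matrix) are assumptions made for this example. The solver is `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def couple_minibatch(x0, x1, cond_dist=None, w=0.0):
    """Couple prior samples x0 with data samples x1 inside one OT batch.

    Returns `perm` such that x0[i] is paired with x1[perm[i]], plus the mean
    squared transport distance of that pairing.
    """
    # Pairwise squared Euclidean distances, shape (b, b).
    sq_dist = ((x0[:, None, :] - x1[None, :, :]) ** 2).sum(axis=-1)
    cost = sq_dist.copy()
    if cond_dist is not None:
        # Hypothetical conditional term: penalise pairings whose conditions differ.
        cost = cost + w * cond_dist
    rows, perm = linear_sum_assignment(cost)  # exact minimum-cost assignment
    return perm, float(sq_dist[rows, perm].mean())


def mean_transport_cost(b, n_batches=50, dim=2, seed=0):
    """Average squared transport distance of minibatch OT for OT batch size b."""
    rng = np.random.default_rng(seed)
    costs = []
    for _ in range(n_batches):
        x0 = rng.standard_normal((b, dim))        # toy prior samples
        x1 = 0.5 * rng.standard_normal((b, dim))  # toy "data" samples (assumed)
        costs.append(couple_minibatch(x0, x1)[1])
    return float(np.mean(costs))


if __name__ == "__main__":
    for b in (8, 64, 256):
        print(b, round(mean_transport_cost(b), 3))
```

In this toy setting the average cost should shrink as b grows. Passing `cond_dist` as one minus the identity matrix together with a very large `w` forces the assignment back to the original independent pairing, while `w = 0` recovers plain minibatch OT.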
We think this slight degradation in the adaptive setting occurs because a more accurate OT coupling reduces variation in the training data (similar to having fewer data augmentations), which harms performance as the benefit of straighter flows diminishes in the adaptive setting. This observation aligns with the findings of Pooladian et al. [36]: computing the OT coupling across GPUs (i.e., having a larger effective OT batch size) slightly harms results. In contrast to Pooladian et al. [36], though, we seek to counteract the reduced effective OT batch size induced by conditional optimal transport (as discussed in Section 3.3.4), and we do find reasonably large OT batch sizes (640 for CIFAR, 6400 for ImageNet) useful while not adding wall-clock overhead.

Figure 4. Changes in FID with respect to varying OT batch sizes b. We plot mean ± std over three runs and represent std with a shaded region. OT (adaptive) and FM (Euler-2) perform worse than all other results in this plot, hence results are not shown (full plot in the appendix). [Plot. x-axis: OT batch size b (128, 640, 1,280, 2,560, 6,400). y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: Ours, Euler-2 and Ours, adaptive. Reference levels: OT (Euler-2) and FM (adaptive).]

4.3.2. Varying Target Ratio r_tar

Here, we analyze the effects of varying the target ratio r_tar, as defined in Equation (13). We conduct these experiments on the ImageNet-32×32 dataset, training each model three times with different random seeds for every choice of r_tar. We plot the results in Figure 5. Recall that when r → 0, w → ∞ and the method approaches FM; when r → 0.5 [footnote 2], w → 0 and the method approaches OT (Section 3.3.3). Our findings are similar to those in Section 4.3.1: as we approach true OT (whether by increasing the OT batch size or by increasing r_tar), FID improves in the few-step setting and worsens in the adaptive setting.

Footnote 2: There is a 0.5 chance that the distance between a random prior-data pair is closer to that of another random prior-data pair.

Figure 5. Changes in FID with respect to varying target ratio r_tar. We plot mean ± std over three runs and represent std with a shaded region. OT (adaptive) performs worse than all other results in this plot, hence results are not shown (full plot in the appendix). [Plot. x-axis: target ratio r_tar (0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1). y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: Ours, Euler-2 and Ours, adaptive. Reference levels: OT (Euler-2), FM (Euler-2), and FM (adaptive).]

5. Conclusion

We first investigate and formalize the struggle of minibatch optimal transport in conditional generation. Based on the findings, we propose C2OT, a simple yet effective addition that corrects the degeneration of OT while maintaining straight integration paths. Extensive experiments ranging from simple two-dimensional synthetic data with MLPs to high-dimensional 256×256 image generation in latent space with transformers verify the effectiveness of our method. We believe that this simple technique will be broadly useful in future flow-based conditional generative tasks.

Page 9:

Acknowledgment. This work is supported in part by NSF grants 2008387, 2045586, 2106825, NIFA award 2020-67021-32799, and OAC 2320345.

References

[1] Michael Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. In Proc. ICLR, 2023.
[2] Michael Albergo, Nicholas Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023.
[3] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. pi0: A vision-language-action flow model for general robot control. arXiv, 2024.
[4] Ho Kei Cheng, Masato Ishii, Akio Hayakawa, Takashi Shibuya, Alexander Schwing, and Yuki Mitsufuji. Taming multimodal joint training for high-quality video-to-audio synthesis. CVPR, 2025.
[5] chenyaofo. Pytorch cifar models. https://github.com/chenyaofo/pytorch-cifar-models, 2023.
[6] Patryk Chrabaszcz, Ilya Loshchilov, and Frank Hutter. A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv, 2017.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021.
[9] John R Dormand and Peter J Prince. A family of embedded runge-kutta formulae. Journal of computational and applied mathematics, 1980.
[10] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In ICML, 2024.
[11] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. ICLR, 2024.
[12] Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. ICLR, 2025.
[13] Pengsheng Guo and Alexander G Schwing. Variational rectified flow matching. arXiv, 2025.
[14] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv, 2016.
[15] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.
[16] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[17] Ka-Hei Hui, Chao Liu, Xiaohui Zeng, Chi-Wing Fu, and Arash Vahdat. Not-so-optimal transport flows for 3d point cloud generation. arXiv, 2025.
[18] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021. If you use this software, please cite it as below.
[19] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. ICLR, 2024.
[20] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[21] Diederik P Kingma, Max Welling, et al. Auto-encoding variational bayes. In ICLR, 2014.
[22] Leon Klein, Andreas Krämer, and Frank Noé. Equivariant flow matching. NeurIPS, 36, 2023.
[23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
[24] Harold W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 1955.
[25] Visual Layer. Imagenet-1k-vl-enriched. https://huggingface.co/datasets/visual-layer/imagenet-1k-vl-enriched, 2024.
[26] Sangyun Lee, Beomsu Kim, and Jong Chul Ye. Minimizing trajectory curvature of ode-based generative models. In ICML, 2023.
[27] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. ICLR, 2023.
[28] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky TQ Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code. arXiv, 2024.
[29] Qihao Liu, Xi Yin, Alan Yuille, Andrew Brown, and Mannat Singh. Flowing from words to pixels: A framework for cross-modality evolution. arXiv preprint arXiv:2412.15213, 2024.
[30] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. ICLR, 2023.
[31] Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. ICLR, 2024.
[32] Ségolène Martin, Anne Gagneux, Paul Hagemann, and Gabriele Steidl. Pnp-flow: Plug-and-play image restoration with flow matching. ICLR, 2025.
[33] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022.
[34] Michael Poli, Stefano Massaroli, Atsushi Yamashita, Hajime Asama, Jinkyoo Park, and Stefano Ermon. Torchdyn: Implicit models and neural numerical methods in pytorch. Physical Reasoning and Inductive Biases for the Real World at NeurIPS, 2021.

Page 10:

[35] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, David Yan, Dhruv Choudhary, Dingkang Wang, Geet Sethi, Guan Pang, Haoyu Ma, Ishan Misra, Ji Hou, Jialiang Wang, Kiran Jagadeesh, Kunpeng Li, Luxin Zhang, Mannat Singh, Mary Williamson, Matt Le, Matthew Yu, Mitesh Kumar Singh, Peizhao Zhang, Peter Vajda, Quentin Duval, Rohit Girdhar, Roshan Sumbaly, Sai Saketh Rambhatla, Sam Tsai, Samaneh Azadi, Samyak Datta, Sanyuan Chen, Sean Bell, Sharadh Ramaswamy, Shelly Sheynin, Siddharth Bhattacharya, Simran Motwani, Tao Xu, Tianhe Li, Tingbo Hou, Wei-Ning Hsu, Xi Yin, Xiaoliang Dai, Yaniv Taigman, Yaqiao Luo, Yen-Cheng Liu, Yi-Chiao Wu, Yue Zhao, Yuval Kirstain, Zecheng He, Zijian He, Albert Pumarola, Ali Thabet, Artsiom Sanakoyeu, Arun Mallya, Baishan Guo, Boris Araya, Breena Kerr, Carleigh Wood, Ce Liu, Cen Peng, Dimitry Vengertsev, Edgar Schonfeld, Elliot Blanchard, Felix Juefei-Xu, Fraylie Nord, Jeff Liang, John Hoffman, Jonas Kohler, Kaolin Fire, Karthik Sivakumar, Lawrence Chen, Licheng Yu, Luya Gao, Markos Georgopoulos, Rashel Moritz, Sara K. Sampson, Shikai Li, Simone Parmeggiani, Steve Fine, Tara Fowler, Vladan Petrovic, and Yuming Du. Movie gen: A cast of media foundation models. arXiv, 2024.
[36] Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky TQ Chen. Multisample flow matching: Straightening flows with minibatch couplings. ICML, 2023.
[37] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[39] Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, and Yuki Mitsufuji. Training consistency models with variational noise coupling. arXiv preprint arXiv:2502.18197, 2025.
[40] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In ICML, 2023.
[41] Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan Lan, Stefano Ermon, Hao Zhou, and Wei-Ying Ma. Equivariant flow matching with hybrid probability transport for 3d molecule generation. NeurIPS, 36, 2023.
[42] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. TMLR, 2024.
[43] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
[44] Kaiyu Yang, Jacqueline H Yau, Li Fei-Fei, Jia Deng, and Olga Russakovsky. A study of face obfuscation in imagenet. In ICML, 2022.
[45] Ling Yang, Zixiang Zhang, Zhilong Zhang, Xingchao Liu, Minkai Xu, Wentao Zhang, Chenlin Meng, Stefano Ermon, and Bin Cui. Consistency flow matching: Defining straight flows with velocity consistency. arXiv, 2024.
[46] Jingfeng Yao and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. CVPR, 2025.
[47] Yichi Zhang, Yici Yan, Alex Schwing, and Zhizhen Zhao. Towards hierarchical rectified flow. ICLR, 2025.

Page 11:

Table of Contents

1 Introduction (p. 1)
2 Related Works (p. 2)
3 Method (p. 2)
3.1 Preliminaries (p. 2)
3.2 OT Coupling Skews The Prior (p. 4)
3.3 Conditional Optimal Transport FM (p. 4)
3.3.1 Formulation (p. 4)
3.3.2 Discrete Conditions (p. 4)
3.3.3 Continuous Conditions (p. 5)
3.3.4 Oversampling OT Batches (p. 5)
4 Experiments (p. 6)
4.1 Two-Dimensional Data (p. 6)
4.2 High-Dimensional Image Data (p. 6)
4.3 Ablation Studies (p. 8)
4.3.1 Varying OT Batch Size (p. 8)
4.3.2 Varying Target Ratio (p. 8)
5 Conclusion (p. 8)
A Extended Plots (p. 12)
B Data Coupling in 8 Gaussians → moons (p. 12)
C Implementation Details (p. 12)
C.1 Two-Dimensional Data (p. 12)
C.2 CIFAR-10 (p. 13)
C.3 ImageNet-32 (p. 14)
C.4 ImageNet-256 (p. 14)
D Additional Generated Images (p. 14)
D.1 CIFAR-10 (p. 15)
D.2 ImageNet-32 (p. 16)
D.3 ImageNet-256 (p. 17)
E DeltaAI Acknowledgment (p. 19)

Page 12:

A. Extended Plots

Figures A1 and A2 extend Figures 4 and 5 of the main paper to include all data points.

Figure A1. Changes in FID with respect to varying OT batch sizes b. We plot mean ± std over three runs and represent std with a shaded region. [Plot. x-axis: OT batch size b (128, 640, 1,280, 2,560, 6,400). y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: Ours, Euler-2 and Ours, adaptive. Reference levels: OT (Euler-2), FM (Euler-2), FM (adaptive), and OT (adaptive).]

Figure A2. Changes in FID with respect to varying target ratio r_tar. We plot mean ± std over three runs and represent std with a shaded region. [Plot. x-axis: target ratio r_tar (0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1). y-axes: Euler-2 FID ↓ and Adaptive FID ↓. Curves: Ours, Euler-2 and Ours, adaptive. Reference levels: OT (Euler-2), FM (Euler-2), FM (adaptive), and OT (adaptive).]

B. Data Coupling in 8 Gaussians → moons

Figure A3 extends Figure 1 with an additional row that shows the coupling during training. Clearly, OT samples form a biased distribution at training time in conditional generation, as discussed. Since we cannot sample from this biased distribution at test time, we obtain a gap between training and testing. This gap degrades the performance of OT.

C. Implementation Details

C.1. Two-Dimensional Data

Data. Following the implementation of Tong et al.
[42], we generate the “moons” data using the torchdyn library [34], and the “8 Gaussians” data using torchcfm [42].

Network. We employ a simple multi-layer perceptron (MLP) network for this dataset. Initially, the two-dimensional input (x and y coordinates) and the flow timestep (a scalar uniformly sampled from [0, 1]) are projected into the hidden dimension using individual linear layers. When an input condition is provided, it is similarly projected into the hidden dimension. Discrete conditions are encoded as -1 or +1, while continuous conditions are represented by the x-coordinate of the target data point. After projection, all input features are summed and processed through a network comprising three MLP modules. Each MLP module consists of two linear layers, where the first layer uses an expansion ratio of 4 and is followed by a GELU activation function [14]. We use a residual connection to incorporate the output of each MLP block. Finally, another linear layer projects the features to two dimensions to produce the output velocity. The hidden dimension is set to 128.

Training. We train each network for 20,000 iterations with the Adam [20] optimizer, a learning rate of 3e-4 without weight decay, and a “deep net” batch size of 256 for computing forward/backward passes and gradient updates. We use an OT batch size b of 1024 and a target ratio r_tar of 0.01.

Page 13:

[Figure A3 panels. Rows: Training, Euler-1 step, Adaptive. Columns: Unconditional (FM, OT); With discrete conditions (FM, OT, C2OT (ours)); With continuous conditions, target x coordinates (FM, OT, C2OT (ours)). 2-Wasserstein values printed below the plots: 6.232±0.186, 0.072±0.025, 2.573±0.112, 0.483±0.059, 0.048±0.010, 0.732±0.052, 8.276±6.510, 0.077±0.024, 0.120±0.045, 0.050±0.040, 0.059±0.029, 0.462±0.025, 0.018±0.006, 0.028±0.010, 2.143±1.993, 0.013±0.003.]

Figure A3. We visualize the flows learned by different algorithms using the 8gaussians→moons dataset. Below each plot, we show the 2-Wasserstein distance (lower is better; mean ± std over 10 runs).
Compared to Figure 1, we have added a first row illustrating the prior-data coupling during training. Note that the OT coupled paths during training (first row) do cross. This is expected: the commonly cited “no-crossing” property of OT coupling refers to the uniqueness of the pair (x_0, x_1) given x and t. That is, at the same timestep t, no two paths may cross at x (see Proposition 3.4 in Tong et al. [42] and Theorem D.2 in Pooladian et al. [36]). Since we plot all timesteps simultaneously in this figure, there are apparent crossings. However, the intersecting paths do not share the same timestep t at the point of intersection.

| Hyperparameter | CIFAR-10 | ImageNet-32 |
| --- | --- | --- |
| Channels | 128 | 256 |
| Depth | 2 | 3 |
| Channels multiple | 1, 2, 2, 2 | 1, 2, 2, 2 |
| Heads | 4 | 4 |
| Heads channels | 64 | 64 |
| Attention resolution | 16 | 4 |
| Dropout | 0.0 | 0.0 |
| Use scale shift norm | True | True |
| Batch size / GPU | 128 | 128 |
| GPUs | 2 | 4 |
| Effective batch size | 256 | 512 |
| Iterations | 100k | 300k |
| Learning rate | 2.0e−4 | 1.0e−4 |
| Learning rate scheduler | Warmup then constant | Warmup then linear decay |
| Warmup steps | 5k | 20k |
| OT batch size (per GPU) | 640 | 6400 |

Table A1. Hyperparameter settings for training on CIFAR-10 and ImageNet-32.

C.2. CIFAR-10

In CIFAR-10, we employ the UNet architecture used by Tong et al. [42]. We list our hyperparameters in Table A1, following the format in [27, 36, 42]. To accelerate training, we use bf16. We use the Adam [20] optimizer with the following parameters: β1 = 0.9, β2 = 0.95, weight decay = 0.0, and ϵ = 1e−8. For learning rate scheduling, we linearly increase the learning rate from 1.0e−8 to 2.0e−4 over 5,000 iterations and then keep the learning rate constant.

Page 14:

The “use scale shift norm” entry denotes employing adaptive layer normalization to incorporate the input condition, as implemented in [8]. To stabilize training, we clip the gradient norm to 1.0. We report the results using an exponential moving average (EMA) model with a decay factor of 0.9999.
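The CIFAR-10 training recipe described above (linear warmup followed by a constant learning rate, gradient-norm clipping at 1.0, and an EMA with decay 0.9999) can be sketched in a few framework-free lines. This is an illustrative NumPy version under stated assumptions, not the authors' training code: the function names are hypothetical, and applying the schedule per iteration and the EMA update once per step are assumptions.

```python
import numpy as np


def warmup_then_constant(step, start_lr=1.0e-8, peak_lr=2.0e-4, warmup_steps=5000):
    """Linear warmup from start_lr to peak_lr, then a constant learning rate."""
    if step >= warmup_steps:
        return peak_lr
    return start_lr + (peak_lr - start_lr) * step / warmup_steps


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum((g ** 2).sum() for g in grads)))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]


def ema_update(ema_params, params, decay=0.9999):
    """One exponential-moving-average step over a list of parameter arrays."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    # Learning rate at the start, midway through warmup, and after warmup.
    print(warmup_then_constant(0), warmup_then_constant(2500), warmup_then_constant(10_000))
    # A gradient of norm 5 is rescaled to (approximately) norm 1.
    clipped = clip_by_global_norm([np.array([3.0, 4.0])])
    print(np.linalg.norm(clipped[0]))
    # A single EMA step moves the average only slightly toward the new weights.
    print(ema_update([np.zeros(2)], [np.ones(2)])[0])
```

With a decay of 0.9999, each step moves the averaged weights only a small fraction of the way toward the current weights, which is why the EMA model is the one used for reporting.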
For FID computation, we use the clean_fid library [33] in legacy_tensorflow mode following [42].

C.3. ImageNet-32

Data. We use face-blurred ImageNet-1K [7, 44] following [28], and apply the downsampling script from [6]. Images are downsampled to 32×32 using the ‘box’ algorithm, and reference FID statistics are computed with respect to the downsampled validation set images. For text input, we use the captions provided by [25].

Network and Training. We largely follow the training pipeline described in Appendix C.2. We use a larger network following [36] and list our hyperparameters in Table A1. To encode text input, we use the openclip library [18] and the text encoder of the “DFN5B-CLIP-ViT-H-14” checkpoint [11], a pretrained CLIP-like model. CLIP feature vectors are normalized to unit norm before being used as input conditions. For learning rate scheduling, after the initial warmup phase, we linearly decay the learning rate to 1.0e−8 over time.

Evaluation. As stated in the main paper, we use 49,997 images from the validation set to compute FID. This is because the fine-grained nature of image captions might lead to overfitting, i.e., memorizing the training set. For CLIP score computation, we evaluate the cosine similarity between the input caption and the generated image using SigLIP-2 [43], with the ViT-SO400M-16-SigLIP2-256 checkpoint via the openclip library [18].

C.4. ImageNet-256

For this dataset, we use the open-source implementation of LightningDiT [46] and train the models under the ‘64 epochs’ setting with minimal modifications to change the network from class-conditioned to caption-conditioned. In addition to integrating the coupling algorithms (OT and C2OT; the original LightningDiT [46] already employs FM), our modifications include:
1. Changing the input conditional mapping layer from an embedding layer (that takes a class label as input) to a linear layer (that takes CLIP features as input).
2. Adjusting the classifier-free guidance (CFG) scale. We find that the model benefits from a higher CFG scale when using caption conditioning. Specifically, we increase the CFG scale from 10.0 to 17.0, and adjust the CFG interval start parameter from 0.11 to 0.10.

For data and evaluation, we follow the same setup as described in Appendix C.3.

D. Additional Generated Images

We present additional image generation results in this section. All showcased images are uncurated, meaning they were sampled completely at random. For consistency and direct comparison, we used the same random seed for each generation across different methods.

Page 15:

D.1. CIFAR-10

Figure A4. Uncurated generations in CIFAR-10, 10 per class. We compare FM, OT, and C2OT with both the 10-step Euler method and an adaptive solver for test-time numerical integration. [Image grids: Euler-10 and Adaptive, each for FM, OT, and C2OT (ours).]

Page 16:

D.2. ImageNet-32

Figure A5. Uncurated generations in ImageNet-32. We compare FM, OT, and C2OT with both the 10-step Euler method and an adaptive solver for test-time numerical integration. [Image grids: Euler-10 and Adaptive, each for FM, OT, and C2OT (ours).]

Page 17:

D.3. ImageNet-256

Figure A6. Uncurated generations in ImageNet-256. We compare FM, OT, and C2OT with different numbers of sampling steps. [Three image grids. Columns: Euler-2, Euler-5, Euler-10, Euler-25, Euler-50, Euler-100. Rows in each grid: FM, OT, C2OT (ours).]

Page 18:

Figure A7. Uncurated generations in ImageNet-256. We compare FM, OT, and C2OT with an adaptive solver for test-time numerical integration. [Image grid: FM, OT, C2OT (ours).]

Page 19:

E. DeltaAI Acknowledgment

This work used the DeltaAI system at the National Center for Supercomputing Applications through allocation CIS250008 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants 2138259, 2138286, 2138307, 2137603, and 2138296.

---