Combining Incomplete Observational and Randomized Data
for Heterogeneous Treatment Effects
Dong Yao
Ant Group
Hangzhou, China
yaodong.yao@antgroup.com
Caizhi Tang
Ant Group
Hangzhou, China
caizhi.tcz@antgroup.com
Qing Cui
Ant Group
Hangzhou, China
cuiqing.cq@antgroup.com
Longfei Li∗
Ant Group
Hangzhou, China
longyao.llf@antgroup.com
Abstract
Data from observational studies (OSs) is widely available and readily obtainable, yet it frequently contains confounding biases. Data derived from randomized controlled trials (RCTs), on the other hand, helps to reduce these biases; however, it is expensive to gather, so randomized datasets tend to be small. For this reason, effectively fusing observational data and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data require complete observational data, meaning that both treated and untreated subjects must be included in the OSs. This prerequisite confines the applicability of such methods to very specific situations, given that including all subjects, whether treated or untreated, in observational studies is not consistently achievable. In this paper, we propose a resilient approach to Combine Incomplete Observational data and randomized data for HTE estimation, which we abbreviate as CIO. CIO is capable of estimating HTEs efficiently regardless of whether the observational data is complete or partial. Concretely, a confounding bias function is first derived using the pseudo-experimental group from OSs, in conjunction with the pseudo-control group from RCTs, via an effect estimation procedure. This function is subsequently used as a corrective residual to rectify the observed outcomes of the observational data during HTE estimation, which combines the available observational data with all of the randomized data. To validate our approach, we have conducted experiments on a synthetic dataset and two semi-synthetic datasets.
CCS Concepts
• Computing methodologies → Machine learning; Causal reasoning and diagnostics; Machine learning; • Mathematics of computing → Causal networks.
∗Corresponding Authors.
CIKM ’24, October 21–25, 2024, Boise, ID, USA
©2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0436-9/24/10
https://doi.org/10.1145/3627673.3679593
Keywords
Causal Inference, Heterogeneous Treatment Effects, Observational
Data, Random Control Trial Data
ACM Reference Format:
Dong Yao, Caizhi Tang, Qing Cui, and Longfei Li. 2024. Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24), October 21–25, 2024, Boise, ID, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3627673.3679593
1 Introduction
Heterogeneous treatment effects (HTEs) refer to the variations in
causal effects of a treatment or an intervention across different
sub-populations, based on their distinct characteristics or contexts.
It is of great importance to estimate HTEs in various fields, such as medicine [12, 14], marketing [4, 7] and epidemiology [27].
There are two types of data in causal inference: observational data and randomized data. Observational data is collected without any intervention from observational studies (OSs), reflecting real-world conditions as they naturally occur. Given its low acquisition cost and large volume, most existing literature [6, 18, 21, 25, 26, 32, 34] focuses on estimating HTEs from observational data. Although it is valuable for its real-world relevance and the volume of data it can provide, observational data is often prone to confounding biases that can challenge causal interpretations. In practice, the methods mentioned above depend on certain assumptions, e.g., the absence of unobserved confounders, which is not testable and is difficult to satisfy.
For instance, when doctors prescribe medication, they consider
various patient-specific factors, some of which may not be captured
in the medical records. Relying solely on observational data in this
case can result in confounding bias, since unrecorded influences on
both the treatment decisions and outcomes remain unaccounted
for. This leads to challenges in identifying HTEs accurately and can
introduce bias into the estimates of treatment effects. Therefore, it is
unreasonable to make use of observational data for HTE estimation
in practice without making the unconfoundedness assumption.
Randomized data, particularly from randomized controlled trials (RCTs), is generated in a randomized controlled experimental setting where participants are randomly assigned to different groups to isolate the effect of a treatment or intervention. Trial data is deemed the gold standard for estimating HTEs. However, trial data is often limited by cost, law, and ethics. Taking medicine as an example, it is impractical to conduct large-scale clinical trials, especially for drugs with side effects.
arXiv:2410.21343v1 [stat.ME] 28 Oct 2024
[Figure 1 graphic. Left panel: complete observational data (treatment group and control group) together with randomized data, handled by data fusion methods, e.g., RHC, IntR, CorNet. Right panel: incomplete observational data together with randomized data, with data fusion marked as an open question.]
Figure 1: The data composition under the two situations: complete and incomplete OS data. For illustration, the right subfigure demonstrates a case where the control group is missing. It should be noted that in practice, the treatment group could also be absent.
Considering the unique attributes of RCTs and OSs, the integration of data from both sources has gained traction as a method to estimate HTEs [5, 8, 11, 13, 15, 19, 29, 35, 36]. Nevertheless, the prevalent techniques for HTE estimation largely hinge on the availability of complete observational datasets. Certain methods begin by constructing an HTE estimator exclusively based on observational data [8, 15, 19], whereas others require a propensity score model derived from such data [35]. For example, RHC [19] supposes that the confounding bias is a parametric function that can be learned. It learns a biased estimator by training on observational data, and then uses randomized data to remove the bias. In another instance, Yang [35] also adopts a parametric approach to model the confounding bias when estimating HTEs, using an integrative $R$-learner (IntR) that merges data from RCTs and OSs. This method additionally requires a pre-learned propensity score estimator based on both randomized and observational data. Nonetheless, OSs often suffer from incompleteness due to the intricate nature of real-world scenarios. Complete observational data is a dataset that contains both control data and treatment data. When either the control group or the treatment group is absent from the observational data, we refer to it as incomplete observational data. Figure 1 illustrates the difference between the two situations. As an illustration, consider a scenario where a new experimental drug is introduced to treat a chronic illness such as diabetes. Due to the drug's recent entry into the market and lack of an extensive track record, patients might be skeptical about its benefits and potential side effects. Consequently, many of them may choose to stick with their current treatment regimens rather than try the new medication. This reluctance can result in scarce or incomplete data for the treatment group within the OS, as the majority of patients remain within the untreated or control cohorts. In other words, it is possible to gather data from patients who have not been treated with the new experimental drug (control data in OSs), while lacking data from patients who have undergone treatment (treatment data in OSs). Under such circumstances, traditional data fusion approaches are ill-suited for assessing the impact of the new experimental drug. The absence of the treatment group or control group in OSs can result in a considerable decrease in the efficacy of these methods or may even cause them to fail.
In this paper, we introduce a robust technique, termed CIO, designed to integrate incomplete observational data with randomized trial data for estimating HTEs. CIO is also applicable when complete observational data is combined with randomized data for HTE estimation. Our approach overcomes the limitation of current data fusion methodologies, which require complete observational data for HTE estimation, and thus offers a more practical solution for real-world scenarios. First, by creating a dummy treatment, we designate the treatment group or control group of the observational data as a pseudo-experimental group, and all of the randomized data constitutes the pseudo-control group. Next, we follow the learning pattern of effect estimation to learn a confounding bias function from the pseudo-experimental group and the pseudo-control group. Finally, we integrate all of the data at our disposal, including the available observational data and all of the randomized data, to derive an HTE estimator. At the same time, we debias the observational data with the help of the confounding bias function, which is used as a residual to correct the observed outcomes of the observational data. The training process is detailed in Section 3. The main contributions of our work are as follows:
• We introduce a robust approach, termed CIO, that fully leverages the strengths of both observational data and randomized data while addressing the shortcoming of existing data fusion techniques, i.e., their reliance on complete observational data for estimating HTEs.
• We form pseudo-experimental and pseudo-control groups from another perspective to train an estimator, which is intended to serve as a confounding bias function. This is an innovative tactic in the assessment of confounding bias.
• We validate our approach through extensive experimentation on one synthetic dataset and two real-world datasets, demonstrating that our method not only combines observational data and randomized data more effectively for HTE estimation but also retains its efficacy in scenarios where the data from OSs is partially missing.
2 Related Work
2.1 Heterogeneous Treatment Effect Estimation
Accurately estimating heterogeneous treatment effects is highly important in medicine, marketing, epidemiology and other related areas. For that reason, a large number of machine learning methods have been proposed to estimate HTEs. We classify these methods into three categories: tree-based methods [6, 34], Bayesian algorithms [2, 3, 40] and deep learning algorithms [18, 32, 38, 39]. However, these methods all make a strong unconfoundedness assumption for observational data, which cannot be verified and often does not stand up to real-world scrutiny. As a result, this limits the use of the aforementioned techniques in practical settings. To solve this common problem, certain techniques [22, 24] aim to identify
the actual confounders by working with noisy indicators that serve as proxies for these confounders. Yet, it remains uncertain whether the covariates we observe genuinely act as surrogates for the actual confounding variables. Other strategies aim to infer missing confounders through the assignment data from multiple treatments (Wang and Blei, 2019; Bica et al., 2019) or over time with treatments administered in sequence (Hatt and Feuerriegel, 2021b). Nevertheless, these methods also rely on hypotheses such as single strong ignorability and constancy of confounders over time, which presume an absence of hidden confounders. Since these suppositions are not verifiable in real-world applications, the practical applicability of such techniques is impeded. GBCT [33] is another attempt to integrate current observational data and their historical controls.
2.2 Combining Observational and Randomized
Data
Recently, only a few methods [5, 8, 11, 13, 15, 19, 29, 35, 36] have been proposed to combine observational data and randomized data for estimating HTEs. The RHC approach [19] assumes that confounding bias can be represented as a learnable parametric function. It involves training a biased estimator with observational data and subsequently using data from randomized trials to correct the confounding bias. In a related vein, [8] suggests obtaining one estimate from observational data and another from randomized data, and then combining these two estimates through a weighted average. Nonetheless, calibrating the weights for this averaging requires a substantial randomized validation set. Our experimental observations indicate that this requirement is at odds with the typically limited size of randomized data samples. [35] parametrically formulates the confounding bias function and an effect estimator for HTE analysis, leveraging an integrative $R$-learner that fuses data from RCTs and OSs. [15] introduces CorNet, a two-phase framework that exploits a common structural aspect of both kinds of data. More recently, FAST [13], a tree-based method, draws on the statistical principle of shrinkage estimation. It crafts a weighting strategy that is optimized to strike a balance between the unbiased estimator derived from trial data and the estimator from observational data, which may carry bias.
3 Method
In this section, we provide a comprehensive introduction to the proposed CIO method. We begin by presenting the foundational elements, which encompass the definition of variables and the essential assumptions required for our approach. Following that, we illustrate the training steps involved in CIO and discuss the specific training loss function employed in the process. Finally, we give a theoretical analysis of why our proposed method is effective.
3.1 Preliminary
In our study, we concentrate on a scenario where the treatment variable $T$ is binary, taking values in the set $\{0, 1\}$. We denote by $X \in \mathbb{R}^p$ the vector of covariates measured before treatment, and by $Y \in \mathbb{R}$ the outcome variable of interest. Employing the potential outcomes framework [30], we define causal effects using $Y(t)$ to represent the potential outcome if the subject were to receive treatment $t$, with $t$ being either 0 or 1. Thus, the Heterogeneous Treatment Effect (HTE) is expressed as $\tau(X) = \mathbb{E}(Y(1) - Y(0) \mid X)$, which captures the expected treatment effect conditional on the covariates $X$. We set $S = 0$ and $S = 1$ to denote OSs and RCTs, respectively. We aim to determine an estimator for $\tau(X)$, provided that $X$ is within the range of values it can take in the RCTs. To maximize the efficiency benefits derived from the OSs, we proceed under the premise that the range of $X$ values in the RCTs is either a subset of or intersects with the range in the OSs (overlap assumption). This is because $\tau(X)$ cannot be identified for values of $X$ that fall outside the scope of the RCTs. For brevity, in the subsequent text, "OS data" refers to observational data, while "RCT data" denotes randomized data.
Before introducing CIO, we present two common assumptions in the causal inference [17, 28, 40] and data fusion literature [9, 10, 35]:
Assumption 3.1 (Consistency, ignorability and overlap). For any individual $i$ assigned to treatment $t_i$, we observe $Y_i = Y(t_i)$. Further, $\{Y(t)\}_{t \in T}$ and the data generating process $p(X, T, Y, S)$ satisfy strong ignorability: $T \perp \{Y(0), Y(1)\} \mid (X, S = 1)$, and overlap: $\forall x,\ 0 < P(T \mid X) < 1$.
Assumption 3.2 (Transportability of the HTE). $\mathbb{E}(Y(1) - Y(0) \mid X, S = s) = \tau(X)$, $s = 0, 1$.
The ignorability assumption in Assumption 3.1, often known as the no unmeasured confounders condition, assumes that all variables influencing both the treatment $T$ and the outcome $Y$ are observed. It is inherently satisfied in an RCT owing to the random allocation of treatments. If Assumption 3.1 holds, it is possible to determine the HTEs from RCT data. Conversely, in the case of OSs, the assumption of ignorable treatment assignment is not required, recognizing that such a requirement might be too stringent for practical applications.
As stated in Assumption 3.1, we do not presume that the treatment assignment is ignorable in the context of OS data. Following Yang [35], under Assumptions 3.1 and 3.2, the confounding bias function can be defined as the discrepancy between the difference in conditional mean outcomes derived from OS data and the HTE:
$$c(X) = \mathbb{E}(Y \mid X, T = 1, S = 0) - \mathbb{E}(Y \mid X, T = 0, S = 0) - \tau(X). \tag{1}$$
By the consistency clause of Assumption 3.1, we have $\mathbb{E}(Y \mid X, T = t, S = 0) = \mathbb{E}(Y(t) \mid X, T = t, S = 0)$. Supposing the treatment assignment in the OS data is ignorable, namely $T \perp \{Y(0), Y(1)\} \mid (X, S = 0)$, it follows that $\mathbb{E}(Y(t) \mid X, T = t, S = 0) = \mathbb{E}(Y(t) \mid X, S = 0)$. Given Assumption 3.2, this implies that the confounding bias function $c(X)$ equals 0. That is, if no confounding bias exists within the OS data, the confounding bias function $c(X)$ is equal to zero.
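The step from ignorability to a vanishing confounding bias function can be written out line by line. The following is a short derivation sketch that uses only Equation 1, Assumptions 3.1 and 3.2, and the hypothetical extra condition that treatment assignment in the OS data is ignorable; it introduces no new symbols.

```latex
\begin{align*}
c(X) &= \mathbb{E}(Y \mid X, T=1, S=0) - \mathbb{E}(Y \mid X, T=0, S=0) - \tau(X) \\
     &= \mathbb{E}(Y(1) \mid X, T=1, S=0) - \mathbb{E}(Y(0) \mid X, T=0, S=0) - \tau(X)
        && \text{(consistency)} \\
     &= \mathbb{E}(Y(1) \mid X, S=0) - \mathbb{E}(Y(0) \mid X, S=0) - \tau(X)
        && \text{(ignorability in the OS)} \\
     &= \mathbb{E}(Y(1) - Y(0) \mid X, S=0) - \tau(X) \\
     &= \tau(X) - \tau(X) = 0
        && \text{(Assumption 3.2)}
\end{align*}
```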
We focus on a binary-treatment scenario; thus, for RCT data:
$$Y = \mathbb{E}(Y \mid X, T = 0, S = 1) + T\tau(X) + \epsilon_r. \tag{2}$$
The conditional expectation of the residual $\epsilon_r$ is $\mathbb{E}(\epsilon_r \mid X, T, S = 1) = \mathbb{E}(Y \mid X, T, S = 1) - \mathbb{E}(Y \mid X, T = 0, S = 1) - T \cdot \mathbb{E}(Y(1) - Y(0) \mid X, S = 1)$. According to Assumptions 3.1 and 3.2, we obtain $\mathbb{E}(\epsilon_r \mid X, T, S = 1) = 0$. The proofs can be found in Appendix A. Similarly, we formulate the outcome model for OS data based on Equations 1 and 2:
$$Y = \mathbb{E}(Y \mid X, T = 0, S = 0) + T \cdot [\tau(X) + c(X)] + \epsilon_o. \tag{3}$$
Conditional on Assumptions 3.1 and 3.2, we likewise derive $\mathbb{E}(\epsilon_o \mid X, T = 0, S = 0) = 0$. Its proof is also given in Appendix A. Therefore, when combining OS data and RCT data for HTE estimation, we integrate Equations 2 and 3:
$$\begin{aligned} Y &= T[\tau(X) + (1 - S)c(X)] + \mathbb{E}(Y \mid X, T = 0, S) + \epsilon \\ &= T[\tau(X) + (1 - S)c(X)] + \mu_0(X) + \epsilon, \end{aligned} \tag{4}$$
where we set $\mu_0(X) = \mathbb{E}(Y \mid X, T = 0, S)$. From $\mathbb{E}(\epsilon_r \mid X, T, S = 1) = 0$ and $\mathbb{E}(\epsilon_o \mid X, T = 0, S = 0) = 0$, we can infer that the conditional expectation of $\epsilon$ equals 0, i.e., $\mathbb{E}(\epsilon \mid X, T, S) = 0$.
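As a quick consistency check, Equation 4 can be specialized to each data source. The sketch below only substitutes $S = 1$ and $S = 0$ into Equation 4, under the assumption that the residual $\epsilon$ coincides with $\epsilon_r$ on RCT data and with $\epsilon_o$ on OS data.

```latex
\begin{align*}
S = 1:&\quad Y = T\tau(X) + \mathbb{E}(Y \mid X, T = 0, S = 1) + \epsilon_r
      && \text{(recovers Equation 2)} \\
S = 0:&\quad Y = T[\tau(X) + c(X)] + \mathbb{E}(Y \mid X, T = 0, S = 0) + \epsilon_o
      && \text{(recovers Equation 3)}
\end{align*}
```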
3.2 The Identification of Confounding Bias and
HTEs
Before introducing our method, we state the basic assumption, which is deduced from Assumption 3.1:
Assumption 3.3 (Basic assumption). Assuming $S \perp Y \mid X$ (meaning $S$ and $Y$ are independent given $X$), and combining the ignorability and overlap assumptions, we deduce that $T(1 - S) \perp Y \mid X$ and $0 < P(T(1 - S)) < 1$.
Confounding bias estimation. Based on Equation 4, Yang [35] created an integrative $R$-learner to estimate both the HTE and the confounding function. This learner harnesses randomized data for accurate identification while utilizing observational data to enhance its efficiency. However, such an approach requires complete OS data, i.e., data from both the treatment and control groups for learning the HTE estimator. As stated previously, obtaining complete OS data in complex real-world scenarios is often impractical. Therefore, there is a need for a robust HTE estimation method capable of data fusion, applicable to scenarios with complete OS data and, more crucially, adaptable to situations with incomplete OS data. For this purpose, we have developed CIO to overcome the limitations of current data fusion techniques. When integrating OS and RCT data, learning and eliminating the confounding biases hidden in OS data is a critical step that cannot be circumvented. We decompose Equation 4 as follows:
$$Y = T(1 - S)c(X) + T\tau(X) + \mu_0(X) + \epsilon. \tag{5}$$
We define $D = T(1 - S)$ as an artificially generated treatment variable, where $D = 1$ is assigned when $T = 1$ and $S = 0$, and $D = 0$ otherwise. With this proxy treatment mechanism, we form a pseudo-experimental group where $D = 1$ and a pseudo-control group where $D = 0$. Based on Assumption 3.3, we then use the dummy data to learn the confounding bias function $c(X)$, for which the learning process is the same as effect estimation. Therefore, we denote the learned $c(X)$ as $\tau_c(X)$. It should be noted that the pseudo-control group comprises all the samples from the RCT dataset.
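The construction of the dummy treatment can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `pseudo_groups` and the toy arrays are hypothetical, and the pseudo-control mask is taken to be all RCT samples, following the note in the text.

```python
import numpy as np

def pseudo_groups(t, s):
    """Build the dummy treatment D = T(1 - S) and the two pseudo-group masks."""
    t = np.asarray(t)
    s = np.asarray(s)
    d = t * (1 - s)            # D = 1 only for treated OS units (T = 1, S = 0)
    experimental = d == 1      # pseudo-experimental group
    control = s == 1           # pseudo-control group: all RCT samples
    return d, experimental, control

# Toy example: 3 treated OS units followed by 4 RCT units (OS control group missing).
t = np.array([1, 1, 1, 1, 0, 1, 0])
s = np.array([0, 0, 0, 1, 1, 1, 1])
d, exp_mask, ctl_mask = pseudo_groups(t, s)
```

In this toy layout, the treated RCT units receive D = 0 and therefore fall into the pseudo-control group together with the untreated RCT units.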
HTE estimation. After training the confounding bias function $\tau_c(X)$, we freeze its parameters. At this stage, Equation 5 can be rearranged as:
$$\tilde{Y} = T\tau(X) + \mu_0(X) + \epsilon, \qquad \tilde{Y} = Y - T(1 - S)\tau_c(X), \tag{6}$$
where $T(1 - S)\tau_c(X)$ is a constant for an individual. Following this equation, we then integrate the OS data with the RCT data to train an effect estimator $\tau(\cdot)$. This formula allows us to adjust and rectify the observed outcomes for the treatment group in the OS data. Thus, the OS data can be combined with RCT data to estimate HTEs without the inclusion of confounding biases.
Diverging from existing data fusion approaches for HTE estimation, the proposed CIO method does not require a propensity model to be learned beforehand, as in FAST [13] and the integrative $R$-learner [35], nor does it initially demand an effect estimator derived from OS data, which is a prerequisite for techniques such as RHC [19] and CorNet [15]. The training process outlined above reveals the following insights:
• For estimating confounding biases, it is sufficient to use only the treated subset of OS data and RCT data.
• In estimating HTEs, even without access to the untreated group within the OS data, we have at our disposal a composite treated group (encompassing the treated individuals from both the OS and the RCT) and a separate control group (consisting of the untreated units from the RCT). These groups can be employed to train an effect estimator.
These attributes endow the CIO method with resilience. In the above description of CIO, we assumed that the control group of the OS is missing for simplicity. However, CIO can also adapt to situations where the treated cohort from the OS data is missing. This is achieved by inverting the treatment assignments for the treated and untreated samples, that is, assigning $T = 0$ for the initially treated subjects and $T = 1$ for the originally untreated subjects.
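The relabeling trick can be illustrated on a toy randomized sample. This is a hedged sketch: the helper names are hypothetical, a plain difference in means stands in for the effect estimator, and the final sign flip reflects our reading that an effect estimated under inverted labels corresponds to the negative of the original effect, which the text does not state explicitly.

```python
import numpy as np

def invert_treatment(t):
    """Swap the roles of treated and untreated units: T' = 1 - T."""
    return 1 - np.asarray(t)

def diff_in_means(y, t):
    """Naive effect estimate: mean outcome of treated minus mean outcome of untreated."""
    y = np.asarray(y, dtype=float)
    t = np.asarray(t)
    return y[t == 1].mean() - y[t == 0].mean()

# Toy sample with a constant effect of 2 (hypothetical values).
t = np.array([1, 0, 1, 0])
y = np.array([3.0, 1.0, 3.0, 1.0])

effect = diff_in_means(y, t)                            # effect under the original labels
effect_flipped = diff_in_means(y, invert_treatment(t))  # effect of the relabeled treatment
effect_restored = -effect_flipped                       # sign flipped back (assumption)
```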
3.3 Training Loss
We denote the observational data as $\{x_i^{o}, t_i^{o}, y_i^{o}\}_{i=1}^{m}$; its treated data and control data are $\{(x_i^{ot}, y_i^{ot}) \mid t_i^{o} = 1\}_{i=1}^{m_t}$ and $\{(x_i^{oc}, y_i^{oc}) \mid t_i^{o} = 0\}_{i=1}^{m_c}$, respectively. In the same way, we denote the randomized data as $\{x_i^{r}, t_i^{r}, y_i^{r}\}_{i=1}^{n}$, its treated data as $\{(x_i^{rt}, y_i^{rt}) \mid t_i^{r} = 1\}_{i=1}^{n_t}$ and its control data as $\{(x_i^{rc}, y_i^{rc}) \mid t_i^{r} = 0\}_{i=1}^{n_c}$. Here $m$, $m_t$, $m_c$, $n$, $n_t$ and $n_c$ denote the sizes of the OS data, OS treated data, OS control data, RCT data, RCT treated data and RCT control data, respectively.
Stage 1: Confounding bias estimation. During this phase, various regression techniques can be used to model the data from individuals in OSs and RCTs, including ridge regression, random forests, neural networks, etc. We initialize $p_1(\cdot)$ and $p_0(\cdot)$ to fit the pseudo-experimental data and the pseudo-controls, respectively, i.e., the treated data of OSs and all of the RCT data. According to Equation 5, the optimization objectives of this stage can be written as follows:
$$\hat{p}_1(\cdot) = \arg\min_{p_1} \frac{1}{m_t} \sum_{i=1}^{m_t} \left[ y_i^{ot} - p_1(x_i^{ot}) \right]^2, \tag{7}$$
$$\hat{p}_0(\cdot) = \arg\min_{p_0} \left( \frac{1}{n_t} \sum_{i=1}^{n_t} \left[ y_i^{rt} - p_0(x_i^{rt}) \right]^2 + \frac{1}{n_c} \sum_{i=1}^{n_c} \left[ y_i^{rc} - p_0(x_i^{rc}) \right]^2 \right). \tag{8}$$
After that, we calculate the confounding bias function $\hat{\tau}_c(x_i) = \hat{p}_1(x_i) - \hat{p}_0(x_i)$ for each unit from the OS treatment group.
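Stage 1 can be sketched with ridge regression, one of the base learners the text mentions. The code below is an illustrative sketch under stated assumptions, not the authors' implementation: `fit_ridge` and `estimate_confounding_bias` are hypothetical names, the closed-form weighted ridge solver and the penalty `lam` are our choices, and the toy data is noise-free with a zero treatment effect so that the shift of 3 in the OS treated outcomes is pure confounding bias.

```python
import numpy as np

def fit_ridge(x, y, weights=None, lam=1e-3):
    """Weighted ridge regression with an unpenalized intercept; returns a predictor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    xa = np.hstack([x, np.ones((len(x), 1))])   # append intercept column
    penalty = lam * np.eye(xa.shape[1])
    penalty[-1, -1] = 0.0                       # leave the intercept unpenalized
    xtw = xa.T * w                              # X^T W via broadcasting
    beta = np.linalg.solve(xtw @ xa + penalty, xtw @ y)

    def predict(x_new):
        x_new = np.asarray(x_new, dtype=float)
        return np.hstack([x_new, np.ones((len(x_new), 1))]) @ beta

    return predict

def estimate_confounding_bias(x_ot, y_ot, x_rt, y_rt, x_rc, y_rc, lam=1e-3):
    """Stage 1: return tau_c(x) = p1(x) - p0(x)."""
    p1 = fit_ridge(x_ot, y_ot, lam=lam)                          # Equation 7: OS treated data
    x_r = np.vstack([x_rt, x_rc])
    y_r = np.concatenate([y_rt, y_rc])
    w_r = np.concatenate([np.full(len(y_rt), 1.0 / len(y_rt)),   # Equation 8: 1/n_t weights
                          np.full(len(y_rc), 1.0 / len(y_rc))])  # and 1/n_c weights
    p0 = fit_ridge(x_r, y_r, weights=w_r, lam=lam)
    return lambda x: p1(x) - p0(x)

# Hypothetical toy data: linear outcomes, OS treated outcomes shifted by 3.
rng = np.random.default_rng(0)
coef = np.array([1.0, 2.0])
x_ot = rng.normal(size=(50, 2))
x_rt = rng.normal(size=(20, 2))
x_rc = rng.normal(size=(20, 2))
y_ot = x_ot @ coef + 3.0
y_rt = x_rt @ coef
y_rc = x_rc @ coef

tau_c = estimate_confounding_bias(x_ot, y_ot, x_rt, y_rt, x_rc, y_rc, lam=1e-8)
bias_hat = tau_c(x_ot)   # one value per OS treated unit, close to 3 on this toy data
```

Any regressor with a fit and predict step could replace `fit_ridge` here; the only structural requirements are that one model is fit on the OS treated data and the other on all RCT data.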
Stage 2: HTE estimation. At this stage, we combine the entire dataset from OSs with that from RCTs while also adjusting the observed outcomes for the treated group in OSs using the estimated confounding bias function $\tau_c(\cdot)$. Similarly, we set up the functions $f_1(\cdot)$ and $f_0(\cdot)$ to be trained on all of the treatment data and all of the control data, respectively, drawn from the aggregate of OSs and RCTs. It should be noted that, for effective debiasing with $\hat{\tau}_c(\cdot)$, we adopt the parameters of $\hat{p}_1(\cdot)$ as the initial values for $f_1(\cdot)$. Additionally, it is necessary to conduct initial training for $f_0(\cdot)$ using control data, ensuring that the number of training epochs matches that of Stage 1. According to Equation 6, the loss functions are as follows:
$$\hat{f}_1(\cdot) = \arg\min_{f_1} \left( \frac{1}{m_t} \sum_{i=1}^{m_t} \left[ \tilde{y}_i^{ot} - f_1(x_i^{ot}) \right]^2 + \frac{1}{n_t} \sum_{i=1}^{n_t} \left[ y_i^{rt} - f_1(x_i^{rt}) \right]^2 \right), \tag{9}$$
$$\hat{f}_0(\cdot) = \arg\min_{f_0} \left( \frac{1}{m_c} \sum_{i=1}^{m_c} \left[ y_i^{oc} - f_0(x_i^{oc}) \right]^2 + \frac{1}{n_c} \sum_{i=1}^{n_c} \left[ y_i^{rc} - f_0(x_i^{rc}) \right]^2 \right), \tag{10}$$
where $\tilde{y}_i^{ot} = y_i^{ot} - \hat{\tau}_c(x_i^{ot})$. Finally, we obtain the HTE estimator $\hat{\tau}(x_i) = \hat{f}_1(x_i) - \hat{f}_0(x_i)$ for each unit.
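Stage 2 can be sketched in the same spirit with weighted least squares. This is a simplified illustration, not the authors' implementation: `fit_wls`, `group_weights` and `fit_hte` are hypothetical names, a linear model replaces the base learners, the warm start from the Stage 1 parameters and the pre-training of the control model are omitted because they matter mainly for iteratively trained models, and dropping the OS control term when that group is missing is our reading of the incomplete-data case.

```python
import numpy as np

def fit_wls(x, y, w):
    """Weighted least squares with an intercept; returns a predictor."""
    xa = np.hstack([x, np.ones((len(x), 1))])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(xa * sw[:, None], y * sw, rcond=None)
    return lambda x_new: np.hstack([x_new, np.ones((len(x_new), 1))]) @ beta

def group_weights(*sizes):
    """Per-sample weights 1/size for each group, matching the 1/m and 1/n factors."""
    return np.concatenate([np.full(k, 1.0 / k) for k in sizes])

def fit_hte(tau_c, x_ot, y_ot, x_rt, y_rt, x_rc, y_rc, x_oc=None, y_oc=None):
    """Stage 2: debias OS treated outcomes, fit f1 and f0, return tau(x) = f1(x) - f0(x)."""
    y_ot_tilde = y_ot - tau_c(x_ot)                       # corrected OS treated outcomes
    f1 = fit_wls(np.vstack([x_ot, x_rt]),                 # Equation 9
                 np.concatenate([y_ot_tilde, y_rt]),
                 group_weights(len(y_ot), len(y_rt)))
    if x_oc is None:                                      # OS control group missing
        f0 = fit_wls(x_rc, y_rc, group_weights(len(y_rc)))
    else:                                                 # Equation 10
        f0 = fit_wls(np.vstack([x_oc, x_rc]),
                     np.concatenate([y_oc, y_rc]),
                     group_weights(len(y_oc), len(y_rc)))
    return lambda x: f1(x) - f0(x)

# Hypothetical noise-free toy data with true effect 2 and confounding bias 3.
rng = np.random.default_rng(1)
coef = np.array([1.0, -1.0])
x_ot = rng.normal(size=(60, 2))
x_rt = rng.normal(size=(15, 2))
x_rc = rng.normal(size=(15, 2))
y_rc = x_rc @ coef
y_rt = x_rt @ coef + 2.0
y_ot = x_ot @ coef + 2.0 + 3.0

tau_c = lambda x: np.full(len(x), 3.0)    # stand-in for the Stage 1 estimate
tau_hat = fit_hte(tau_c, x_ot, y_ot, x_rt, y_rt, x_rc, y_rc)
effects = tau_hat(rng.normal(size=(5, 2)))
```

On this toy data the corrected OS treated outcomes agree with the RCT treated outcomes, so the pooled treated model and the RCT-only control model differ by the true effect.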
3.4 Theoretical Effectiveness
In this paper, following the integrative $R$-learner of Yang [35], we introduce the confounding function $c$ to describe the confounding bias in observational studies (OSs), as shown in Equation 5, $Y = T(1 - S)c(X) + T\tau(X) + \mu_0(X) + \epsilon$. By treating $T(1 - S)$ as a dummy treatment variable, we are able to estimate $c(X)$, which constitutes the first stage of our method. Assuming $S \perp Y \mid X$ (meaning $S$ and $Y$ are independent given $X$), and combining the ignorability and overlap assumptions, we deduce that $T(1 - S) \perp Y \mid X$ and $0 < P(T(1 - S)) < 1$. This implies that, under the Potential Outcome Framework (POF), $c$ is theoretically identifiable. Since $c$ is identified under the POF, we denote the identified $c$ as $\tau_c$. After acquiring the confounding bias function $\tau_c$, we use it to calibrate the outcomes of the OS data and reduce the confounding bias. Finally, in the second stage, HTE estimation is likewise based on the POF and is also identifiable. In summary, we decompose the process of HTE estimation combining OS and RCT data into two stages under the POF, both of which are underpinned by the POF.
4 Experiment
In this section, we perform experiments on a synthetic dataset and two real-world datasets to demonstrate the performance of CIO for HTE estimation. The outcomes of these datasets are all simulated by a certain strategy. We present the results of various experiments designed to address the following three research questions:
• RQ1: Is our proposed approach CIO effective at combining observational data and randomized data for HTE estimation in both situations, i.e., when the observational data is complete and when it is incomplete?
• RQ2: If the strength of the confounding bias present in OS data intensifies, does the proposed approach maintain its superior performance relative to current data fusion techniques?
• RQ3: What is the impact of inverted treatment assignment on HTE estimation?
The following text includes three subsections: Experimental Setup, Datasets, and Experimental Analysis. Within the Experimental Analysis subsection, we conduct a large number of experiments and the corresponding analysis to answer the above research questions.
4.1 Experimental Setup
Baselines and architectures. To assess the performance of CIO, we choose RHC [19], the integrative $R$-learner [35] and CorNet [15] as baselines for comparison. Following RHC [19], we select ridge regression (Ridge) and random forest (RF) as base models for comparison. In addition, given that methods based on representation learning have demonstrated notable efficacy in estimating heterogeneous treatment effects (HTEs), we follow the approach of CFR [32] by integrating the Treatment-Agnostic Representation Network (TARNet) into our suite of baseline models for experimental purposes. For brevity, we refer to the integrative $R$-learner as IntR. RHC, IntR, CorNet and CIO all follow a two-phase training approach that leverages OS and RCT data for estimating HTEs and are implemented using Ridge, RF and TARNet as underlying models. Since CorNet is implemented with a deep neural network that includes a representation layer, we only compare with it when the other baselines and CIO are implemented with TARNet. Furthermore, to determine the benefits of combining OS and RCT data, we train baseline estimators solely on each data type, referred to as SF$_{OS}$ for OS data and SF$_{RCT}$ for RCT data. We also conduct an experiment where OS and RCT data are simply pooled for training, to serve as a reference for the effectiveness of modeling confounding bias, indicated as SI.
Incomplete OS data. Most importantly, the original purpose of CIO is to combine the OS and RCT data for HTE estimation in the situation where the OS data is not fully available. Therefore, to create the incomplete condition, we deliberately omit either the untreated or the treated group from the OS datasets of the Simulation and STAR experiments during the training phase. This modified approach is referred to as CIO$_{IO}$.
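The incomplete condition described above amounts to filtering the OS training set by treatment status. The helper below is a hypothetical sketch of that step; the function name `drop_os_group` and the toy arrays are ours, not the authors'.

```python
import numpy as np

def drop_os_group(x, t, y, drop="control"):
    """Remove the OS control (T = 0) or treated (T = 1) group to mimic incomplete OS data."""
    if drop not in ("control", "treated"):
        raise ValueError("drop must be 'control' or 'treated'")
    t = np.asarray(t)
    keep = (t == 1) if drop == "control" else (t == 0)
    return np.asarray(x)[keep], t[keep], np.asarray(y)[keep]

# Toy OS sample with two treated and three untreated units (hypothetical values).
x = np.arange(10, dtype=float).reshape(5, 2)
t = np.array([1, 0, 1, 0, 0])
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

x_inc, t_inc, y_inc = drop_os_group(x, t, y, drop="control")
```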
Metrics. When developing models to predict the Individual Treatment Effect (ITE), the main objective is to minimize the Precision in the Estimation of Heterogeneous Effect (PEHE), as outlined in [16, 31]. In a binary treatment scenario, PEHE quantifies the accuracy with which a model can predict the differential impact of two treatments, $t_0$ and $t_1$, for a given set of samples $X$. To calculate PEHE, we determine the mean squared error across $N$ samples by comparing the actual difference in outcomes, $y_1(n) - y_0(n)$, which is obtained from the simulation strategy, with the predicted difference, $\hat{y}_1(n) - \hat{y}_0(n)$, where $n$ denotes the sample index:
$$\epsilon_{\mathrm{PEHE}} = \frac{1}{N} \sum_{n=1}^{N} \left( [y_1(n) - y_0(n)] - [\hat{y}_1(n) - \hat{y}_0(n)] \right)^2. \tag{11}$$
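The metric in Equation 11 translates directly into code. The sketch below is a minimal NumPy version with hypothetical function names and toy values; it also returns the square root, since Table 1 reports the square root of PEHE.

```python
import numpy as np

def pehe(y1, y0, y1_hat, y0_hat):
    """Mean squared error between true and predicted individual effects (Equation 11)."""
    true_effect = np.asarray(y1, dtype=float) - np.asarray(y0, dtype=float)
    pred_effect = np.asarray(y1_hat, dtype=float) - np.asarray(y0_hat, dtype=float)
    return float(np.mean((true_effect - pred_effect) ** 2))

def sqrt_pehe(y1, y0, y1_hat, y0_hat):
    """Square root of PEHE."""
    return float(np.sqrt(pehe(y1, y0, y1_hat, y0_hat)))

# Toy example: true effects are [2, 3], predicted effects are [1, 4].
score = pehe([3.0, 5.0], [1.0, 2.0], [2.0, 5.0], [1.0, 1.0])
root_score = sqrt_pehe([3.0, 5.0], [1.0, 2.0], [2.0, 5.0], [1.0, 1.0])
```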
Table 1: Comparison of methods for combining OS and RCT data on Simulation, STAR and NSW data. We report the mean value ± the standard deviation of $\sqrt{\epsilon_{\mathrm{PEHE}}}$ on test data over 10 repeated runs for the three datasets, with a proportion $p_r = 0.2$ of RCT data. In addition, we remove the control group of the OS data in the Simulation and STAR datasets and present the results as CIO$_{IO}$. The best performance is marked in bold.

Architecture  Method              Simulation    STAR           NSW
Ridge         SF$_{OS}$           21.97±1.06    25.05±2.92     -
Ridge         SF$_{RCT}$          9.74±2.39     4.06±0.52      2.49±1.06
Ridge         SI                  21.43±1.06    12.82±3.48     15.85±0.32
Ridge         RHC                 14.62±6.18    9.7±2.68       -
Ridge         IntR                7.68±0.34     8.52±0.71      -
Ridge         CIO (Ours)          5.75±0.34     2.36±0.48      -
Ridge         CIO$_{IO}$ (Ours)   8.96±0.63     2.14±0.65      1.48±0.38
RF            SF$_{OS}$           18.35±0.67    48.17±0.47     -
RF            SF$_{RCT}$          10.41±2.87    6.96±1.30      2.30±0.70
RF            SI                  17.83±0.73    7.83±0.87      3.84±1.08
RF            RHC                 11.76±1.28    18.59±0.50     -
RF            IntR                8.84±1.93     9.51±0.22      -
RF            CIO (Ours)          6.65±0.26     5.61±0.93      -
RF            CIO$_{IO}$ (Ours)   10.18±2.94    5.35±0.96      2.29±0.70
TARNet        SF$_{OS}$           23.64±1.30    30.40±11.62    -
TARNet        SF$_{RCT}$          10.84±6.41    5.15±3.38      5.14±0.46
TARNet        SI                  22.83±1.08    23.81±7.96     22.47±0.60
TARNet        RHC                 7.89±3.40     7.06±2.65      -
TARNet        CorNet              12.24±2.25    7.00±1.01      -
TARNet        IntR                6.97±0.72     4.93±1.81      -
TARNet        CIO (Ours)          6.61±0.35     3.54±1.38      -
TARNet        CIO$_{IO}$ (Ours)   6.92±0.41     4.17±1.39      5.11±0.39
4.2 Datasets
In line with prior approaches [13, 15, 35, 37], we create a simulated dataset and choose two real-world datasets, STAR and NSW, for our experimental evaluations. Due to the absence of ground truth for the effects in STAR and NSW, we simulate the outcomes for these datasets rather than relying on their actual outcomes. In this subsection, we introduce the three experimental datasets in more detail, covering their data construction and outcome simulation.
4.2.1 Simulation Dataset. A synthetic dataset in which the covariates
and outcomes are all simulated:
•Data construction and outcome simulation. We generate independent
covariates of dimension $p$, denoted as $X_i$, from a standard normal
distribution, $X_i \sim \mathcal{N}(0, 1)$ for $i = 1, 2, \ldots, p$. In this
experiment, we set $p = 5$. Following this procedure, we produce
200 samples for RCT data, 3000 for OS data, and 1000 for test data.
The potential outcomes for each sample are then simulated using
the equation
$$Y(t) = t\,\tau(\mathbf{X}) + 1 + 2\sum_{i=1}^{p} X_i^3 + \sum_{i=1}^{p} X_i + 5(1 - s)U + \epsilon(t),$$
where $\tau(\mathbf{X}) = 1 + \sum_{i=1}^{p} X_i + \sum_{i=1}^{p} X_i^2$,
$t \in \{0, 1\}$ indicates the treatment status, $s = 0$ refers to OS data
and $s = 1$ to RCT data, and $\epsilon(t)$ is drawn from $\mathcal{N}(0, 1)$.
Treatment assignment for RCT and OS data is governed by
$T \mid (\mathbf{X}, S = 1) \sim \mathrm{Bernoulli}(0.5)$ and
$T \mid (\mathbf{X}, S = 0) \sim \mathrm{Bernoulli}\big(1 / (1 + \exp(-\sum_{i=1}^{p} X_i))\big)$,
respectively. Echoing the simulation approach in [37], the unobserved
variable $U$ is sampled from $\mathcal{N}(\mathbf{X}^T \mathbf{v}\,\beta(2T - 1), 1)$,
with $\mathbf{v}$ being the all-ones vector $(1, \ldots, 1)^T$ and $\beta$ being
a coefficient that modulates the magnitude of confounding bias in OS data.
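The generating process above can be sketched with the Python standard library alone. This is an illustrative reading of the equations, not the authors' code: the value `beta=1.0`, the seeds, the record layout, the choice to draw $U$ once from the factual treatment and share it across both potential outcomes, and the choice to generate test data with $s = 1$ are all assumptions made here.

```python
import math
import random

def tau(x):
    """Treatment effect tau(X) = 1 + sum_i X_i + sum_i X_i^2."""
    return 1.0 + sum(x) + sum(v * v for v in x)

def simulate_sample(s, beta, p=5, rng=random):
    """Draw one sample; s=1 denotes RCT data, s=0 denotes OS data."""
    x = [rng.gauss(0.0, 1.0) for _ in range(p)]
    # Treatment: fair coin in the RCT, logistic in sum(X) for the OS.
    prob = 0.5 if s == 1 else 1.0 / (1.0 + math.exp(-sum(x)))
    t = 1 if rng.random() < prob else 0
    # Unobserved variable U ~ N(X^T v * beta * (2T - 1), 1), v = all ones.
    u = rng.gauss(sum(x) * beta * (2 * t - 1), 1.0)
    base = 1.0 + 2.0 * sum(v ** 3 for v in x) + sum(x)
    # Assumption: U comes from the factual T and enters both potential outcomes.
    y0 = base + 5.0 * (1 - s) * u + rng.gauss(0.0, 1.0)
    y1 = tau(x) + base + 5.0 * (1 - s) * u + rng.gauss(0.0, 1.0)
    return {"x": x, "t": t, "s": s, "y": y1 if t == 1 else y0, "y0": y0, "y1": y1}

def simulate_dataset(n, s, beta, seed=0):
    """Draw n independent samples with a dedicated random generator."""
    rng = random.Random(seed)
    return [simulate_sample(s, beta, rng=rng) for _ in range(n)]

rct = simulate_dataset(200, s=1, beta=1.0, seed=1)
obs = simulate_dataset(3000, s=0, beta=1.0, seed=2)
# Assumption: test data is drawn without the confounding term (s = 1).
test = simulate_dataset(1000, s=1, beta=1.0, seed=3)
print(len(rct), len(obs), len(test))
```

Because the factor $5(1 - s)$ vanishes for $s = 1$, the unobserved variable only affects outcomes in the OS data, which is where the confounding bias enters.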
4.2.2 STAR. A semi-synthetic dataset. Beyond the synthetic dataset
experiments, we also evaluate the performance of CIO using a
real-world dataset in this subsection:
•Tennessee Student/Teacher Achievement Ratio (STAR). The
STAR Experiment [20] was a randomized controlled trial conducted
in the late 1980s. Its objective was to measure the impact of
class size on students' academic performance. We follow RHC
[19] and FAST [13] to split STAR into observational and
randomized data. We focus on two experimental classroom size
conditions: small classes of 13-17 students and regular classes
of 22-25 students. Because a significant number of students
entered the study in the first grade, we designate the type of
class they were placed in at that time as their initial treatment.
For each student, we consider the following covariates: gender,
race, birth month, birth day, birth year, free lunch given or not,
and teacher id. We exclude any students with missing data for any
of these covariates, as well as students whose combined scores for
standardized tests in listening, reading, and math are missing.
In total, we retain 4139 students: 1774 assigned to
treatment (small class, T = 1) and 2365 to control (regular size
class, T = 0).
•Data construction. We first obtain an outcome for each student through
the simulation strategy given under Outcome simulation below. Then, we
follow the settings of RHC [19] and FAST [13] to construct the OS data,
RCT data, and test data. To introduce a confounding bias, we divide the
study population based on a variable U: students living in rural or
inner-city areas (U = 1, totaling 2811 students) versus those
in urban or suburban areas (U = 0, totaling 1407 students).
We then create the RCT data by randomly selecting a proportion of 0.5
of the students with U = 1. The OS data is compiled as follows. For
students with U = 1, we include the control students (T = 0) who are
not part of the RCT data, along with the treated students (T = 1)
whose simulated outcomes are in the bottom 50% among their peers with
T = 1 and U = 1. For students with U = 0, we include all of the control
students (T = 0) and the treated students (T = 1) whose simulated
outcomes are in the bottom 50% of the group with T = 1 and U = 0.
Finally, the test data is a reserved subset of the entire sample,
excluding the individuals included in the RCT data.
•Outcome simulation. Following RHC [19] and FAST [13], we use the
actual covariates $\mathbf{X} = (X_1, X_2, \cdots, X_p)^T$, where $p = 7$.
While the STAR dataset contains actual outcome data, it lacks a ground
truth for the treatment effect. Consequently, we simulate both
the outcome and the treatment effect function specifically for the
STAR dataset. Concretely, we set
$$Y(t) = t\,\tau(\mathbf{X}) + 2\sum_{i=1}^{p} X_i + \mathbf{X}^T\mathbf{X} + \epsilon(t)$$
as potential outcomes, where
$\tau(\mathbf{X}) = \sum_{i=1}^{p} X_i + \sqrt{\left|\sum_{i=1}^{p} X_i\right|}$
and $\epsilon(t) \sim \mathcal{N}(0, 1)$.
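The STAR outcome simulation and the OS/RCT split described in the items above can be sketched as follows. The toy covariates, the sample size of 400, the 0.67 share of U = 1, and the helper names are hypothetical stand-ins for the real STAR records. One detail is an assumption made here: the bottom-50% rule is applied to all treated students in a U group, and students already drawn into the RCT data are then left out of the OS data.

```python
import math
import random

def star_tau(x):
    """tau(X) = sum_i X_i + sqrt(|sum_i X_i|)."""
    s = sum(x)
    return s + math.sqrt(abs(s))

def star_outcome(x, t, rng):
    """Y(t) = t*tau(X) + 2*sum_i X_i + X^T X + eps, with eps ~ N(0, 1)."""
    return t * star_tau(x) + 2.0 * sum(x) + sum(v * v for v in x) + rng.gauss(0.0, 1.0)

def bottom_half(indices, y):
    """Indices whose outcome lies in the lower 50% of the given group."""
    ranked = sorted(indices, key=lambda i: y[i])
    return ranked[: len(ranked) // 2]

def split_star(t, u, y, rng):
    """Return (rct, os) index lists following the construction described above."""
    n = len(t)
    u1 = [i for i in range(n) if u[i] == 1]
    u0 = [i for i in range(n) if u[i] == 0]
    rct = set(rng.sample(u1, len(u1) // 2))  # half of the U = 1 students
    # U = 1: controls outside the RCT, plus low-outcome treated students.
    os_idx = [i for i in u1 if t[i] == 0 and i not in rct]
    # Assumption: treated students already in the RCT are not reused in the OS.
    os_idx += [i for i in bottom_half([i for i in u1 if t[i] == 1], y) if i not in rct]
    # U = 0: all controls, plus low-outcome treated students.
    os_idx += [i for i in u0 if t[i] == 0]
    os_idx += bottom_half([i for i in u0 if t[i] == 1], y)
    return sorted(rct), sorted(os_idx)

# Toy stand-in for the real STAR covariates (hypothetical values).
rng = random.Random(0)
n = 400
x = [[rng.gauss(0.0, 1.0) for _ in range(7)] for _ in range(n)]
t = [rng.randint(0, 1) for _ in range(n)]
u = [1 if rng.random() < 0.67 else 0 for _ in range(n)]
y = [star_outcome(x[i], t[i], rng) for i in range(n)]
rct_idx, os_idx = split_star(t, u, y, rng)
print(len(rct_idx), len(os_idx))
```

Keeping only low-outcome treated students in the OS data makes treatment look less beneficial there than it is, which is the intended source of confounding bias.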
4.2.3 NSW. A semi-synthetic dataset with incomplete OS data.
The datasets described in the previous subsections include complete
OS data. Although Table 1 shows that CIO remains robust when the OS
data is made incomplete by omitting the OS control group, this does
not fully convey the method's practical utility. We therefore also
evaluate on an inherently incomplete dataset, NSW, in which the
treated group of the OS data is absent:
•National Supported Work (NSW) Demonstration. The Na-
tional Supported Work (NSW) Demonstration [ 23] was a ran-
domized experiment investigating the effect of job training on
income and employment status. Following [ 1], we combine ran-
domized samples (297 treated, 425 control) with the 2490 PSID
observational controls in our experiments.
•Data construction. The incomplete nature of the NSW OS data makes it
an apt choice for testing the resilience of CIO. We randomly draw
100 samples, encompassing both the treated group and the control
group, from the pool of 722 randomized samples to serve as RCT data.
To inject additional bias into the OS data, we retain only those of
the 2490 PSID observational controls whose simulated outcomes rank in
the upper 50%. The rest of the randomized samples and
observational controls are allocated for testing.
•Outcome simulation. We use the actual covariates
$\mathbf{X} = (X_1, X_2, \cdots, X_p)^T$ of NSW: age, level of education,
ethnicity (split into two covariates), marital status, and educational
degree, where $p = 6$. We generate potential outcomes for each person by
$$Y(t) = t\,\tau(\mathbf{X}) + 2\sum_{i=1}^{p} \exp(X_i) + \epsilon(t),$$
where $\tau(\mathbf{X}) = \mathbf{X}^T\mathbf{X}$, $t \in \{0, 1\}$, and
$\epsilon(t) \sim \mathcal{U}(-1, 1)$. Note that because the OS data
contains no treated samples, we invert the treatment labels: we set the
treatment value of the RCT treated data to 0 and that of the OS and RCT
control data to 1.
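A minimal sketch of the NSW outcome simulation and of the upper-50% selection of observational controls is given below. The covariates are toy values drawn from [0, 1), a hypothetical normalization chosen here; the record layout and function names are likewise illustrative.

```python
import math
import random

def nsw_potential_outcomes(x, rng):
    """Return (Y(0), Y(1)) with tau(X) = X^T X and eps(t) ~ Uniform(-1, 1)."""
    base = 2.0 * sum(math.exp(v) for v in x)
    effect = sum(v * v for v in x)
    y0 = base + rng.uniform(-1.0, 1.0)
    y1 = effect + base + rng.uniform(-1.0, 1.0)
    return y0, y1

def upper_half(records, key):
    """Keep the records whose `key` value ranks in the upper 50%."""
    ranked = sorted(records, key=lambda r: r[key], reverse=True)
    return ranked[: len(ranked) // 2]

# Toy observational controls (hypothetical covariates, all untreated).
rng = random.Random(0)
psid = []
for i in range(100):
    x = [rng.random() for _ in range(6)]
    y0, y1 = nsw_potential_outcomes(x, rng)
    psid.append({"id": i, "x": x, "t": 0, "y": y0, "y0": y0, "y1": y1})

os_train = upper_half(psid, "y")  # biased OS training pool
kept = {r["id"] for r in os_train}
held_out = [r for r in psid if r["id"] not in kept]  # goes to the test pool
print(len(os_train), len(held_out))
```

Selecting on the simulated outcome is what injects the additional bias: the retained controls systematically have higher outcomes than the population they came from.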
4.3 Experiment Analysis
Preliminary Trials. Initially, we run CIO alongside the benchmark
methods using Ridge, RF, and TARNet as base models to estimate HTEs.
Our investigation aims to determine whether integrating even a small
quantity of RCT data with OS data yields any advantage for HTE
estimation. To this end, we randomly select a subset of the RCT data,
with proportion $pr = 0.2$, and combine it with the entirety of the OS
data for training. We ensure that the selected RCT data includes both
treated and control instances. This strategy is designed to reflect
the common real-world scenario in which RCT data is considerably less
abundant than OS data. The results are presented in Table 1. Relying
solely on OS data during training introduces significant bias,
resulting in poor performance. Likewise, even though SI integrates OS
data with RCT data, it shows only a marginal improvement over
SF$_{OS}$. This is because SI merges OS data and RCT data directly
without any procedure to mitigate bias. RHC and IntR, in contrast, are
designed to correct for biases in OS data by leveraging RCT data in
the estimation of HTEs. The results in Table 1 indicate that CIO
significantly outperforms IntR when complete OS data is used,
highlighting CIO's effectiveness in mitigating confounding bias from
OS data. For example, in a significance test between IntR and CIO,
the p-value on the Simulation dataset is 4.27e-10 for Ridge and
3.43e-3 for RF; on the STAR dataset, it is 1.79e-6 for Ridge and
7.83e-3 for RF. Another important aspect of our analysis involves
removing the control group from the OS data and merging the remaining
OS data with the full RCT dataset to assess CIO's resilience. Even
when faced with incomplete OS data, CIO is capable of leveraging the
available OS data for training. Despite a noticeable decline in the
performance of CIO$_{IO}$ compared with the complete CIO, it still
surpasses SF$_{RCT}$. This underscores CIO's ability to preserve
robust performance in HTE estimation by making use of the available
data. Moreover, as previously discussed, neither RHC nor IntR can
integrate RCT data with OS data when part of the OS data is missing.
As a result, on the NSW dataset we can assess only CIO$_{IO}$, SI, and
SF$_{RCT}$. With Ridge as the underlying model, our proposed approach
attains superior performance; the significance test yields a p-value
of 0.01, falling below the threshold of 0.05. When RF is employed as
the underlying architecture, the performance of CIO$_{IO}$ is on par
with that of SF$_{RCT}$.
[Figure 2 panels: (a) Simulation dataset, (b) STAR dataset, (c) NSW
dataset. Each panel contains a Ridge plot and an RF plot of PEHE
(y-axis) versus pr (x-axis, 0.2 to 1.0). Legend: SI, RHC, IntR, CIO,
CIO$_{IO}$ in (a) and (b); SI, CIO$_{IO}$ in (c).]
Figure 2: Comparison among data-fusion baselines under Ridge and RF
with an increasing ratio of RCT data for training. We plot the results
on the Simulation, STAR, and NSW datasets in Figure 2(a), 2(b), and
2(c), respectively.
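The text does not state which significance test is used for the comparison between IntR and CIO. As one plausible way to probe such a comparison from the summary statistics in Table 1, the sketch below computes Welch's t statistic and its degrees of freedom. Both the choice of Welch's test and the treatment of the reported ± values as sample standard deviations over 10 runs are assumptions made here, so the result is not expected to reproduce the reported p-values exactly.

```python
import math

def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of group a
    var_b = std_b ** 2 / n_b  # squared standard error of group b
    t_stat = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, df

# Ridge on the Simulation dataset (Table 1): IntR 7.68±0.34 vs. CIO 5.75±0.34.
t_stat, df = welch_t(7.68, 0.34, 10, 5.75, 0.34, 10)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

Turning the statistic into a p-value requires the CDF of Student's t distribution, which the Python standard library does not provide; a statistics package would be needed for that last step.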
Sensitivity to RCT data volume. In the preceding experiments, CIO
demonstrated its efficacy and resilience when provided with a small
amount of RCT data for training. Yet it remains unclear whether it can
sustain this performance when supplied with varying volumes of RCT
data. To investigate this, we vary the proportion of RCT data used in
training, $pr$, from 0.1 to 1.0. The results of all methods implemented
with Ridge regression and RF are shown in Figure 2. We summarize three
observations from the figure:
•As expected, the $\sqrt{\epsilon_{PEHE}}$ of all methods decreases as
the amount of RCT data grows, given that RCT data is free of
unobserved confounders. Our approach consistently surpasses
alternative data-fusion techniques when applied with Ridge and
RF models, thereby affirming the efficacy of CIO in estimating
HTEs.
•More importantly, to assess the robustness of CIO in scenarios
where the OS data is incomplete, we conduct experiments
without the controls from the OS data. The results indicate that
CIO$_{IO}$ maintains superior performance even when part of the OS
data is absent.
•Under the Ridge model, RHC and IntR demonstrate enhanced
performance compared with SI, whereas this performance edge is
not observed with RF. In contrast, CIO consistently outperforms
SI regardless of whether Ridge or RF is employed, highlighting
CIO's stable debiasing capability. Overall, the figure shows that
CIO consistently ranks in the top performance tier, regardless of
the amount of RCT data used.
Furthermore, it is notable that the standard deviations in
Figure 2 vary significantly across data ratios within the
same dataset. At lower ratios, only a minimal amount of RCT data
is provided for training, which can result in substantial variation
in the training samples across experimental runs and, consequently,
higher variance in performance. As the volume of RCT data
incorporated into training increases, the fluctuation in the training
set decreases, leading to more stable performance across runs.
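The subsampling step used throughout these experiments (a proportion $pr$ of the RCT data that contains both treated and control samples) can be sketched as follows. Redrawing until both arms are present is one simple way to enforce the constraint and is an assumption made here, as are the function name, the record layout, and the grid of $pr$ values.

```python
import random

def sample_rct(rct, pr, rng, max_tries=1000):
    """Draw round(pr * len(rct)) RCT records that contain both treatment arms."""
    k = max(2, round(pr * len(rct)))
    for _ in range(max_tries):
        subset = rng.sample(rct, k)
        if {r["t"] for r in subset} == {0, 1}:
            return subset
    raise RuntimeError("could not draw a subset containing both arms")

# Toy RCT pool of 200 records with alternating treatment labels.
rng = random.Random(0)
rct = [{"id": i, "t": i % 2} for i in range(200)]
for pr in (0.1, 0.2, 0.4, 0.6, 0.8, 1.0):
    subset = sample_rct(rct, pr, rng)
    print(pr, len(subset))
```

At small $pr$ the drawn subsets differ considerably from run to run, which is consistent with the larger standard deviations observed at low ratios.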
Impact of the strength of confounding bias. Our approach
identifies the confounding bias present in OS data and uses it as a
residual adjustment for the observed outcomes. Recognizing the
confounding bias in OS data is essential when integrating OS and RCT
data. To study this, we manipulate the intensity of the confounding
bias in the OS data by varying the value of $\beta$. The results are
illustrated in Figure 3, where the confounding bias in the OS data
grows as $\beta$ rises. The performance of SI declines sharply as
$\beta$ increases, whereas the performance of the other data-fusion
methods deteriorates more gradually, which demonstrates the benefit of
identifying the confounding bias. In particular, CIO maintains a low
$\sqrt{\epsilon_{PEHE}}$ despite high values of $\beta$, regardless of
whether the OS data is complete. Notably, once $\beta$ reaches a
certain threshold, the performance of CIO$_{IO}$ surpasses that of
CIO. This phenomenon may be attributed to the trade-off between the
quantity of OS control data and the strength of the associated
confounding bias. This characteristic can be exploited when the OS
data lacks control or treatment data and suffers from substantial
confounding bias.
[Figure 3 panels: Ridge and RF plots of PEHE (y-axis, 0 to 200) versus
$\beta$ (x-axis, 2 to 10). Legend: SI, RHC, IntR, CIO, CIO$_{IO}$.]
Figure 3: For all data-fusion techniques using Ridge and RF,
we observe $\sqrt{\epsilon_{PEHE}}$ across a range of $\beta$ values that
modulate the intensity of the confounding bias in the training OS data.
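To see why a larger $\beta$ means stronger confounding in the Simulation dataset, one can work out the naive treated-versus-control contrast in the OS data ($s = 0$) directly from the outcome model of Section 4.2.1, using $\mathbb{E}[U \mid \mathbf{X}, T] = \mathbf{X}^T\mathbf{v}\,\beta(2T - 1)$ and $\mathbb{E}[\epsilon(t)] = 0$. This is a short derivation under that simulation design only, not a step of the CIO procedure:

```latex
\begin{align*}
\mathbb{E}[Y \mid \mathbf{X}, T=1, S=0] - \mathbb{E}[Y \mid \mathbf{X}, T=0, S=0]
  &= \tau(\mathbf{X}) + 5\big(\mathbb{E}[U \mid \mathbf{X}, T=1] - \mathbb{E}[U \mid \mathbf{X}, T=0]\big) \\
  &= \tau(\mathbf{X}) + 5\big(\beta\,\mathbf{X}^T\mathbf{v} - (-\beta\,\mathbf{X}^T\mathbf{v})\big) \\
  &= \tau(\mathbf{X}) + 10\,\beta \sum_{i=1}^{p} X_i .
\end{align*}
```

The gap between the naive contrast and the true effect $\tau(\mathbf{X})$ is therefore $10\,\beta \sum_{i=1}^{p} X_i$: it grows linearly in $\beta$ and vanishes at $\beta = 0$. In the RCT data ($s = 1$) the factor $5(1 - s)$ removes $U$ from the outcome, so no such gap arises.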
Impact of the size of OS control data. When the OS control group is
entirely absent, CIO retains the unique ability to integrate the
available OS data with RCT data for estimating HTEs. It is therefore
of interest to examine how the performance of the baseline methods
and CIO changes when they are trained with varying amounts of OS
control data. For the Simulation dataset, we randomly select RCT
data with $pr = 0.05$ (ensuring the inclusion of treatment and control
samples) for training. The results are displayed in Figure 4(a). CIO
consistently surpasses the alternative methods across all sizes of OS
control data, demonstrating the applicability of CIO to data with
varying and intricate compositions. Following the trials on the
Simulation dataset, we vary the number of OS control samples over the
same range for the STAR dataset. To obtain a different composition of
RCT training data, we set $pr = 0.2$; the results are shown in
Figure 4(b). According to the line chart, CIO maintains a steady
performance advantage over the other baseline methods, irrespective
of the OS control data size. This illustrates CIO's robustness.
[Figure 4 panels: (a) Simulation dataset, pr = 0.05; (b) STAR dataset,
pr = 0.2. Each panel plots PEHE (y-axis) versus the number of OS
controls (x-axis). Legend: SI, RHC, IntR, CIO.]
Figure 4: We vary the quantity of OS control data used in the training
stage and evaluate the efficacy of various data-fusion techniques
implemented with Ridge regression. The number of OS controls takes
values in {1, 4, 16, 64, 256, 512}. Results for the Simulation dataset
are illustrated in Figure 4(a) and for the STAR dataset in
Figure 4(b).
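The sweep over the number of OS controls can be sketched as below: every treated OS record is kept, and only the requested number of controls is drawn at random. The function name, the toy OS pool, and its even treated/control split are hypothetical; the grid {1, 4, 16, 64, 256, 512} is the one given in the caption.

```python
import random

def limit_os_controls(os_data, n_controls, rng):
    """Keep every treated OS record and a random subset of n_controls controls."""
    treated = [r for r in os_data if r["t"] == 1]
    controls = [r for r in os_data if r["t"] == 0]
    kept = rng.sample(controls, min(n_controls, len(controls)))
    return treated + kept

# Toy OS pool: 1500 treated and 1500 control records (hypothetical split).
rng = random.Random(0)
os_data = [{"id": i, "t": 1 if i < 1500 else 0} for i in range(3000)]
for n_controls in (1, 4, 16, 64, 256, 512):
    train_os = limit_os_controls(os_data, n_controls, rng)
    print(n_controls, len(train_os))
```

With very few controls, the OS data is close to the incomplete setting in which the control group is missing altogether.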
Inverse of Treatment Assignment. In Section 3, for convenience in
describing CIO, we assume that the control group of the OS data is
unavailable. In practical situations, however, it is often the
treatment group's data that is missing. As previously detailed, in the
absence of treatment data, we can reverse the original treatment
assignments to adapt our methodology. This reversal enables the
flexible application of CIO for merging OS and RCT data to estimate
HTEs. While this approach provides flexibility, its efficacy after
inversion remains to be verified. To address this, we perform a series
of experiments to assess CIO's robustness. We first eliminate the
treatment group of the OS data from the Simulation and STAR datasets.
We then invert the treatment assignments within the OS data and RCT
data of these datasets to further examine the performance of the
proposed method. We report the experimental results in Table 2, where
the performance of CIO is nearly unchanged between 'original' and
'inverse'. Such consistency highlights that CIO retains its
effectiveness even in cases where the treatment group data from the
OS is absent.
Table 2: For both the Simulation and STAR datasets, we exclude the
control group from the OS data for the experiments. Conversely, the
treatment group of the OS data is eliminated from the Simulation
dataset and from STAR, with their treatment assignments being
inverted. 'original' denotes the original treatment assignment, while
'inverse' denotes the inverted treatment assignment.
Base Model   Simulation                  STAR
             original      inverse       original     inverse
Ridge        8.96±0.63     8.03±0.51     2.14±0.65    2.65±0.99
RF           10.18±2.94    9.79±2.52     5.35±0.96    5.24±0.61
TARNet       6.92±0.41     7.12±0.29     4.17±1.39    4.24±2.19
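The relabelling behind the 'inverse' setting can be illustrated with a small sketch. Flipping the treatment label swaps the roles of the two potential outcomes, so an effect measured under the inverted labels is the negative of the original one. Whether the authors map estimates back with a sign flip or evaluate directly in the inverted orientation is not stated in the text; the helper names and record layout below are illustrative.

```python
def invert_treatment(records):
    """Flip t -> 1 - t and swap the potential outcomes accordingly."""
    inverted = []
    for r in records:
        inverted.append({
            "t": 1 - r["t"],
            "y": r["y"],     # the observed outcome itself does not change
            "y0": r["y1"],   # the old treated outcome becomes the new control outcome
            "y1": r["y0"],   # the old control outcome becomes the new treated outcome
        })
    return inverted

def to_original_effect(effect_inverted):
    """Effect in the original orientation: Y(1) - Y(0) = -(Y'(1) - Y'(0))."""
    return -effect_inverted

# One treated record with Y(0) = 1 and Y(1) = 3, so the original effect is 2.
records = [{"t": 1, "y": 3.0, "y0": 1.0, "y1": 3.0}]
flipped = invert_treatment(records)
effect_inverted = flipped[0]["y1"] - flipped[0]["y0"]
print(flipped[0]["t"], effect_inverted, to_original_effect(effect_inverted))
```

After inversion, a dataset that lacks treated OS samples looks like one that lacks OS controls, which is the case assumed when CIO is described in Section 3.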
5 Conclusion
This paper posits that existing data-fusion techniques are deficient
in robustness, rendering them incapable of merging OS data with
RCT data for HTE estimation in instances where the OS training
data is incomplete. In response to this issue, we present CIO, a re-
silient method designed to harness the advantages of both OS and
RCT data. CIO circumvents the limitations of current data fusion
methods by effectively estimating HTEs without requiring fully pop-
ulated OS datasets. To achieve this, we form pseudo-experimental
and pseudo-control groups from another perspective to train an
estimator for effect measurement which is intended to serve as a
confounding bias function, an innovative tactic in assessing con-
founding bias. To confirm the robustness and effectiveness of our
method, we perform numerous tests that explore various aspects:
we examine the influence of RCT data volume, analyze the effect
of confounding bias intensity, and investigate how the quantity of
OS control data affects outcomes. These trials are conducted on a
synthetic dataset and two semi-synthetic datasets which use real-
world covariates and outcomes generated via specific strategies.
Across all experiments, CIO’s performance consistently surpasses
that of the baseline methods it is compared with, irrespective of the
dataset and architecture used. The consistent outperformance of
CIO when benchmarked against other data-fusion methods affirms
the effectiveness and robustness of our confounding bias estima-
tor. This tool, which calibrates the observed outcomes of OS data,
proves to be powerful in merging RCT and OS data for estimating
HTEs.
A Proofs
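Proof 1.
$$\mathbb{E}(\epsilon_r \mid \mathbf{X}, T=0, S=1) = \mathbb{E}(Y \mid \mathbf{X}, T=0, S=1) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=1) = 0.$$
Proof 2.
$$\begin{aligned}
\mathbb{E}(\epsilon_r \mid \mathbf{X}, T=1, S=1)
&= \mathbb{E}(Y \mid \mathbf{X}, T=1, S=1) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=1) - \mathbb{E}(Y(1) - Y(0) \mid \mathbf{X}, S=1) \\
&= \mathbb{E}(Y(1) \mid \mathbf{X}, T=1, S=1) - \mathbb{E}(Y(0) \mid \mathbf{X}, T=0, S=1) - \mathbb{E}(Y(1) - Y(0) \mid \mathbf{X}, S=1) \\
&= \mathbb{E}(Y(1) \mid \mathbf{X}, S=1) - \mathbb{E}(Y(0) \mid \mathbf{X}, S=1) - \mathbb{E}(Y(1) - Y(0) \mid \mathbf{X}, S=1) = 0.
\end{aligned}$$
Proof 3.
$$\mathbb{E}(\epsilon_o \mid \mathbf{X}, T=0, S=0) = \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) = 0.$$
Proof 4.
$$\mathbb{E}(\epsilon_o \mid \mathbf{X}, T=1, S=0) = \mathbb{E}(Y \mid \mathbf{X}, T=1, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) - \big[\mathbb{E}(Y \mid \mathbf{X}, T=1, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0)\big] = 0.$$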
References
[1]Jeffrey A. Smith and Petra E. Todd. 2005. Does Matching Overcome LaLonde’s
Critique of Nonexperimental Estimators? Journal of Econometrics 125, 1 (March
2005), 305–353. https://doi.org/10.1016/j.jeconom.2004.04.011
[2]Ahmed Alaa and Mihaela Schaar. 2018. Limits of Estimating Heterogeneous
Treatment Effects: Guidelines for Practical Algorithm Design. In Proceedings of
the 35th International Conference on Machine Learning . PMLR, 129–138.
[3]Ahmed M. Alaa and Mihaela van der Schaar. 2017. Bayesian Inference of In-
dividualized Treatment Effects Using Multi-task Gaussian Processes. https:
//doi.org/10.48550/arXiv.1704.02801 arXiv:1704.02801 [cs]
[4]Susan Athey. 2017. Beyond Prediction: Using Big Data for Policy Problems.
Science 355, 6324 (Feb. 2017), 483–485. https://doi.org/10.1126/science.aal4321
[5]Susan Athey, Raj Chetty, and Guido Imbens. 2020. Combining Experimental
and Observational Data to Estimate Treatment Effects on Long Term Outcomes.
https://doi.org/10.48550/arXiv.2006.09676 arXiv:2006.09676 [econ, stat]
[6]Susan Athey, Julie Tibshirani, and Stefan Wager. 2018. Generalized Random
Forests. https://doi.org/10.48550/arXiv.1610.01271 arXiv:1610.01271 [econ, stat]
[7]Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L.
Scott. 2015. Inferring Causal Impact Using Bayesian Structural Time-Series
Models. The Annals of Applied Statistics 9, 1 (March 2015). https://doi.org/10.
1214/14-AOAS788
[8]David Cheng and Tianxi Cai. 2021. Adaptive Combination of Random-
ized and Observational Data. https://doi.org/10.48550/arXiv.2111.15012
arXiv:2111.15012 [stat]
[9]Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël
Varoquaux, Jean-Philippe Vert, Julie Josse, and Shu Yang. 2023. Causal Inference
Methods for Combining Randomized Trials and Observational Studies: A Review.
https://doi.org/10.48550/arXiv.2011.08047 arXiv:2011.08047 [stat]
[10] Irina Degtiar and Sherri Rose. 2023. A Review of Generalizability and Transporta-
bility. Annual Review of Statistics and Its Application 10, 1 (March 2023), 501–524.
https://doi.org/10.1146/annurev-statistics-042522-103837 arXiv:2102.11904 [stat]
[11] AmirEmad Ghassami, Alan Yang, David Richardson, Ilya Shpitser, and Eric Tch-
etgen Tchetgen. 2022. Combining Experimental and Observational Data for
Identification and Estimation of Long-Term Causal Effects. https://doi.org/10.
48550/arXiv.2201.10743 arXiv:2201.10743 [econ, math, stat]
[12] Thomas A. Glass, Steven N. Goodman, Miguel A. Hernán, and Jonathan M. Samet.
2013. Causal Inference in Public Health. Annual Review of Public Health 34, 1
(2013), 61–75. https://doi.org/10.1146/annurev-publhealth-031811-124606
[13] Jia Gu, Caizhi Tang, Han Yan, Qing Cui, Longfei Li, and Jun Zhou. 2023. FAST:
A Fused and Accurate Shrinkage Tree for Heterogeneous Treatment Effects
Estimation. Thirty-seventh Conference on Neural Information Processing Systems
(2023).
[14] Margaret A. Hamburg and Francis S. Collins. 2010. The Path to Personalized
Medicine. The New England Journal of Medicine 363, 4 (July 2010), 301–304.
https://doi.org/10.1056/NEJMp1006304
[15] Tobias Hatt, Jeroen Berrevoets, Alicia Curth, Stefan Feuerriegel, and Mihaela van
der Schaar. 2022. Combining Observational and Randomized Data for Estimating
Heterogeneous Treatment Effects. arXiv:2202.12891 [cs, stat]
[16] Jennifer L. Hill. 2011. Bayesian Nonparametric Modeling for Causal Inference.
Journal of Computational and Graphical Statistics 20, 1 (Jan. 2011), 217–240. https:
//doi.org/10.1198/jcgs.2010.08162
[17] Fredrik D. Johansson, Nathan Kallus, Uri Shalit, and David Sontag. 2018. Learning
Weighted Representations for Generalization Across Designs. https://doi.org/10.
48550/arXiv.1802.08598 arXiv:1802.08598 [stat]
[18] Fredrik D Johansson, Uri Shalit, and David Sontag. [n. d.]. Learning Representa-
tions for Counterfactual Inference. ([n. d.]).
[19] Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. 2018. Removing Hidden
Confounding by Experimental Grounding. https://doi.org/10.48550/arXiv.1810.
11646 arXiv:1810.11646 [cs, stat]
[20] Alan B. Krueger. 1999. Experimental Estimates of Education Production Functions.
The Quarterly Journal of Economics 114, 2 (1999), 497–532. jstor:2587015
[21] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. 2019. Meta-Learners
for Estimating Heterogeneous Treatment Effects Using Machine Learning. Pro-
ceedings of the National Academy of Sciences 116, 10 (March 2019), 4156–4165.
https://doi.org/10.1073/pnas.1804597116 arXiv:1706.03461 [math, stat]
[22] Milan Kuzmanovic, Tobias Hatt, and Stefan Feuerriegel. 2021. Deconfounding
Temporal Autoencoder: Estimating Treatment Effects over Time Using Noisy
Proxies. https://doi.org/10.48550/arXiv.2112.03013 arXiv:2112.03013 [cs, stat]
[23] Robert J. LaLonde. 1986. Evaluating the Econometric Evaluations of Training
Programs with Experimental Data. The American Economic Review 76, 4 (1986),
604–620. jstor:1806062
[24] Christos Louizos, Uri Shalit, Joris Mooij, David Sontag, Richard Zemel, and
Max Welling. 2017. Causal Effect Inference with Deep Latent-Variable Models.
https://doi.org/10.48550/arXiv.1705.08821 arXiv:1705.08821 [cs, stat]
[25] Xinkun Nie and Stefan Wager. 2020. Quasi-Oracle Estimation of
Heterogeneous Treatment Effects. https://doi.org/10.48550/arXiv.1712.04912
arXiv:1712.04912 [econ, math, stat]
[26] Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H. Shah,
Trevor Hastie, and Robert Tibshirani. 2018. Some Methods for Heterogeneous
Treatment Effect Estimation in High Dimensions. Statistics in Medicine 37, 11
(May 2018), 1767–1787. https://doi.org/10.1002/sim.7623
[27] James M. Robins, Miguel Ángel Hernán, and Babette Brumback. 2000. Marginal
Structural Models and Causal Inference in Epidemiology:. Epidemiology 11, 5
(Sept. 2000), 550–560. https://doi.org/10.1097/00001648-200009000-00011
[28] Paul R. Rosenbaum and Donald B. Rubin. 1983. The Central Role of the Propensity
Score in Observational Studies for Causal Effects. Biometrika 70, 1 (1983), 41–55.
https://doi.org/10.2307/2335942 jstor:2335942
[29] Evan Rosenman, Guillaume Basse, Art Owen, and Michael Baiocchi. 2020. Com-
bining Observational and Experimental Datasets Using Shrinkage Estimators.
https://doi.org/10.48550/arXiv.2002.06708 arXiv:2002.06708 [math, stat]
[30] Donald B. Rubin. 1974. Estimating Causal Effects of Treatments in Randomized
and Nonrandomized Studies. Journal of Educational Psychology 66, 5 (Oct. 1974),
688–701. https://doi.org/10.1037/h0037350
[31] Patrick Schwab, Lorenz Linhardt, and Walter Karlen. 2019. Perfect Match: A
Simple Method for Learning Representations For Counterfactual Inference With
Neural Networks. https://doi.org/10.48550/arXiv.1810.00656 arXiv:1810.00656 [cs,
stat]
[32] Uri Shalit, Fredrik D. Johansson, and David Sontag. 2017. Estimating Individual
Treatment Effect: Generalization Bounds and Algorithms. In Proceedings of the
34th International Conference on Machine Learning . PMLR, 3076–3085.
[33] Caizhi Tang, Huiyuan Wang, Xinyu Li, Qing Cui, Ya-Lin Zhang, Feng Zhu, Longfei
Li, Jun Zhou, and Linbo Jiang. 2022. Debiased Causal Tree: Heterogeneous
Treatment Effects Estimation with Unmeasured Confounding. Advances in Neural
Information Processing Systems 35 (2022), 5628–5640.
[34] Stefan Wager and Susan Athey. 2017. Estimation and Inference of Heterogeneous
Treatment Effects Using Random Forests. https://doi.org/10.48550/arXiv.1510.
04342 arXiv:1510.04342 [math, stat]
[35] Shu Yang. 2022. Integrative $R$-Learner of Heterogeneous Treatment Effects
Combining Experimental and Observational Studies. In Proceedings of the First
Conference on Causal Learning and Reasoning . PMLR, 904–926.
[36] Shu Yang and Peng Ding. 2021. Combining Multiple Observational Data Sources
to Estimate Causal Effects. arXiv:1801.00802 [stat]
[37] Shu Yang, Donglin Zeng, and Xiaofei Wang. 2022. Improved Inference for Het-
erogeneous Treatment Effects Using Real-World Data Subject to Hidden Con-
founding. https://doi.org/10.48550/arXiv.2007.12922 arXiv:2007.12922 [stat]
[38] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. 2018.
Representation Learning for Treatment Effect Estimation from Observational
Data. In Advances in Neural Information Processing Systems , Vol. 31. Curran
Associates, Inc.
[39] Jinsung Yoon and James Jordon. 2018. GANITE: Estimation of
Individualized Treatment Effects Using Generative Adversarial. (2018).
[40] Yao Zhang, Alexis Bellot, and Mihaela van der Schaar. 2020. Learning Overlapping
Representations for the Estimation of Individualized Treatment Effects. https:
//doi.org/10.48550/arXiv.2001.04754 arXiv:2001.04754 [cs, stat]