arxiv

Paper 2410.21343

Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects

Authors: Dong Yao, Caizhi Tang, Qing Cui, Longfei Li

Published: 2024-10-28

Abstract:

Data from observational studies (OSs) is widely available and readily obtainable, yet it frequently contains confounding biases. Data derived from randomized controlled trials (RCTs), on the other hand, helps to reduce these biases; however, it is expensive to gather, so randomized datasets tend to be very small. For this reason, effectively fusing observational data and randomized data to better estimate heterogeneous treatment effects (HTEs) has gained increasing attention. However, existing methods for integrating observational data with randomized data require complete observational data, meaning that both treated subjects and untreated subjects must be included in OSs. This prerequisite confines the applicability of such methods to very specific situations, given that including all subjects, whether treated or untreated, in observational studies is not consistently achievable. In our paper, we propose a resilient approach to Combine Incomplete Observational data and randomized data for HTE estimation, which we abbreviate as CIO. CIO is capable of estimating HTEs efficiently regardless of whether the observational data is complete or partial. Concretely, a confounding bias function is first derived using the pseudo-experimental group from OSs, in conjunction with the pseudo-control group from RCTs, via an effect estimation procedure. This function is subsequently used as a corrective residual to rectify the observed outcomes of observational data during HTE estimation, which combines the available observational data with all of the randomized data. To validate our approach, we have conducted experiments on a synthetic dataset and two semi-synthetic datasets.

Paper Content:
Dong Yao, Ant Group, Hangzhou, China, yaodong.yao@antgroup.com
Caizhi Tang, Ant Group, Hangzhou, China, caizhi.tcz@antgroup.com
Qing Cui, Ant Group, Hangzhou, China, cuiqing.cq@antgroup.com
Longfei Li∗, Ant Group, Hangzhou, China, longyao.llf@antgroup.com

∗Corresponding Authors.

CCS Concepts
• Computing methodologies → Machine learning; Causal reasoning and diagnostics; Machine learning; • Mathematics of computing → Causal networks.

Keywords
Causal Inference, Heterogeneous Treatment Effects, Observational Data, Random Control Trial Data

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CIKM ’24, October 21–25, 2024, Boise, ID, USA
©2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0436-9/24/10
https://doi.org/10.1145/3627673.3679593

ACM Reference Format:
Dong Yao, Caizhi Tang, Qing Cui, and Longfei Li. 2024. Combining Incomplete Observational and Randomized Data for Heterogeneous Treatment Effects. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24), October 21–25, 2024, Boise, ID, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3627673.3679593

1 Introduction

Heterogeneous treatment effects (HTEs) refer to the variations in causal effects of a treatment or an intervention across different sub-populations, based on their distinct characteristics or contexts. It is of great importance to estimate HTEs in various fields, such as medicine [12, 14], marketing [4, 7] and epidemiology [27].

There are two types of data in causal inference: observational data and randomized data. Observational data is collected without any intervention from observational studies (OSs), reflecting real-world conditions as they naturally occur.
Considering the advantages of low acquisition cost and vast quantities, most existing literature [6, 18, 21, 25, 26, 32, 34] focuses on estimating HTEs from observational data. Although it is valuable for its real-world relevance and the volume of data it can provide, observational data is often prone to confounding biases that can challenge causal interpretations. In practice, the methods mentioned above depend on some assumptions, e.g., the absence of unobserved confounders, which is not testable and is difficult to satisfy. For instance, when doctors prescribe medication, they consider various patient-specific factors, some of which may not be captured in the medical records. Relying solely on observational data in this case can result in confounding bias, since unrecorded influences on both the treatment decisions and outcomes remain unaccounted for. This leads to challenges in identifying HTEs accurately and can introduce bias into the estimates of treatment effects. Therefore, observational data alone cannot reasonably be used for HTE estimation in practice unless the unconfoundedness assumption is made.

Randomized data, particularly from randomized controlled trials (RCTs), is generated in a randomized controlled experimental setting where participants are randomly assigned to different groups to isolate the effect of a treatment or intervention. Trial data is deemed the gold standard for estimating HTEs. However, trial data is often limited by costs, laws, and ethics. Taking medicine as an example, it is impossible to conduct large-scale clinical trials, especially for drugs with side effects.

arXiv:2410.21343v1 [stat.ME] 28 Oct 2024

[Figure 1: The data composition under the two situations: complete and incomplete OS data. The left sub-figure shows complete observational data (treatment group and control group) fused with randomized data by data fusion methods, e.g., RHC, IntR, CorNet. For illustration, the right sub-figure demonstrates a case where the control group is missing, for which data fusion is an open question. It should be noted that in practice, the treatment group could also be absent.]

Considering the unique attributes of RCTs and OSs, the integration of data from both sources has gained traction as a method to estimate HTEs. References such as [5, 8, 11, 13, 15, 19, 29, 35, 36] highlight this growing trend. Nevertheless, the prevalent techniques for HTE estimation largely hinge on the availability of complete observational datasets. Certain methods begin by constructing an HTE estimator exclusively based on observational data [8, 15, 19], whereas others necessitate the creation of a propensity score model derived from such data [35]. For example, RHC [19] supposes that the confounding bias is a parametric function that can be learned. It learns a biased estimator by training on observational data, and then uses randomized data to remove the bias. In another instance, Yang [35] also adopts a parametric approach to model the confounding bias when estimating HTEs. They achieve this by using an integrative $R$-learner (IntR) that merges data from RCTs and OSs. Additionally, this method necessitates a pre-learned propensity score estimator based on both randomized and observational data.

Nonetheless, OSs often suffer from incompleteness due to the intricate nature of real-world scenarios. Complete observational data is a dataset that contains both control data and treatment data. Accordingly, when either the control group or the treatment group is absent from the observational data, we refer to it as incomplete observational data. Figure 1 illustrates the difference between the two situations.
As an illustration, consider a scenario where a new experimental drug is introduced to treat a chronic illness such as diabetes. Due to the drug’s recent entry into the market and lack of an extensive track record, patients might be skeptical about its benefits and potential side effects. Consequently, many of them may choose to stick with their current treatment regimens rather than try the new medication. This reluctance can result in scarce or incomplete data for the treatment group within the OS, as the majority of patients remain within the untreated or control cohorts. In other words, it is possible to gather data from patients who have not been treated with the new experimental drug (control data in OSs), while lacking data from patients who have undergone treatment (treatment data in OSs). Under such circumstances, traditional data fusion approaches are ill-suited for assessing the impact of the new experimental drug. The absence of the treatment group or the control group in OSs can considerably reduce the efficacy of these methods or may even cause them to fail.

In this paper, we introduce a robust technique, termed CIO, designed to integrate incomplete observational data with randomized trial data for estimating HTEs. CIO also extends to the combination of complete observational data with randomized data for HTE estimation. Our approach overcomes the limitations of current data fusion methodologies, which require complete observational data for HTE estimation, thus offering a more practical solution for real-world scenarios. Initially, by creating a dummy treatment, we designate the treatment group or control group of the observational data as a pseudo-experimental group, and all of the randomized data is used to constitute the pseudo-control group.
Subsequently, we use the learning pattern of effect estimation to learn a confounding bias function from the pseudo-experimental group and the pseudo-control group. Finally, we integrate all of the data at our disposal, including the available observational data and all of the randomized data, to derive an HTE estimator. Simultaneously, we debias the observational data with the help of the confounding bias function, which is employed as a residual to correct the observed outcomes of observational data. The specific training process and details are given in Section 3.

The main contributions of our work are as follows:
• We introduce a robust approach, termed CIO, that fully leverages the strengths of both observational data and randomized data while addressing the shortcomings of existing data fusion techniques, i.e., their reliance on complete observational data for estimating HTEs.
• We form pseudo-experimental and pseudo-control groups from another perspective to train an estimator, which is intended to serve as a confounding bias function. This is an innovative tactic in the assessment of confounding bias.
• We validate our approach through extensive experimentation on one synthetic dataset and two real-world datasets, demonstrating that our method not only combines observational data and randomized data more effectively for HTE estimation but also retains its efficacy in scenarios where the data from OSs is partially missing.

2 Related Work

2.1 Heterogeneous Treatment Effect Estimation

Accurately estimating heterogeneous treatment effects is of considerable significance for medicine, marketing, epidemiology and other related areas. For that reason, a large number of machine learning methods have been proposed to estimate HTEs. We classify these methods into three categories: tree-based methods [6, 34], Bayesian algorithms [2, 3, 40] and deep learning algorithms [18, 32, 38, 39].
However, these methods all make a strong unconfoundedness premise for observational data, which cannot be verified and often does not stand up to real-world scrutiny. As a result, this prevents the aforementioned techniques from being implemented in practical settings. To solve this common problem, certain techniques [22, 24] aim to identify the actual confounders by working with noisy indicators that serve as proxies for these confounders. Yet, it remains uncertain whether the covariates we observe genuinely act as surrogates for the actual confounding variables. Other strategies aim to infer missing confounders through the assignment data from multiple treatments (Wang and Blei, 2019; Bica et al., 2019) or over time with treatments administered in sequence (Hatt and Feuerriegel, 2021b). Nevertheless, these methods also rely on hypotheses such as single strong ignorability and constancy of confounders over time, which presume an absence of hidden confounders. Since these suppositions are not verifiable in real-world applications, the practical applicability of such techniques is impeded. GBCT [33] is another attempt to integrate current observational data and their historical controls.

2.2 Combining Observational and Randomized Data

Recently, only a few methods [5, 8, 11, 13, 15, 19, 29, 35, 36] have been proposed to combine observational data and randomized data for estimating HTEs. The RHC approach, as described by [19], assumes that confounding bias can be represented as a learnable parametric function. It involves training a biased estimator with observational data and subsequently using data from randomized trials to correct the confounding bias.
In a related vein, [8] suggest obtaining one estimate from observational data and another from randomized data, subsequently combining these two estimates through a weighted average. Nonetheless, calibrating the weights for this averaging requires a substantial randomized validation set. Our experimental observations indicate that this requirement is at odds with the typically limited size of randomized data samples. [35] parametrically formulates the confounding bias function and an effect estimator for HTE analysis, leveraging an integrative $R$-learner that fuses data from RCTs and OSs. [15] introduces CorNet, a dual-phase framework that exploits a common structural aspect of both kinds of data. More recently, FAST [13], a tree-based method, draws on the statistical principle of shrinkage estimation. It crafts a weighting strategy that is optimized to strike a balance between the unbiased estimator derived from trial data and the estimator from observational data, which may carry bias.

3 Method

In this section, our goal is to provide a comprehensive introduction to the proposed CIO method. We begin by presenting the foundational elements, which encompass the definition of variables and the essential assumptions required for our approach. Following that, we illustrate the training steps involved in CIO and discuss the specific training loss function employed in the process. Finally, we give an in-depth theoretical analysis of why our proposed method is effective.

3.1 Preliminary

In our study, we concentrate on a scenario where the treatment variable $T$ is binary, taking values in the set $\{0, 1\}$. We denote $\mathbf{X} \in \mathbb{R}^{p}$ as the vector of covariates measured before treatment, and $Y \in \mathbb{R}$ as the outcome variable of interest. Employing the potential outcomes framework [30], we define causal effects using $Y(t)$ to represent the potential outcome if the subject were to receive treatment $t$, with $t$ being either 0 or 1.
Thus, the Heterogeneous Treatment Effect (HTE) is expressed as $\tau(\mathbf{X}) = \mathbb{E}(Y(1) - Y(0) \mid \mathbf{X})$, which captures the expected treatment effect conditional on the covariates $\mathbf{X}$. We set $S = 0$ and $S = 1$ to denote OSs and RCTs, respectively. We aim to determine an estimator for $\tau(\mathbf{X})$, provided that $\mathbf{X}$ is within the range of values it can take in the RCTs. To maximize the efficiency benefits derived from the OSs, we proceed under the premise that the range of $\mathbf{X}$ values in the RCTs is either a subset of or intersects with the range in the OSs (overlap assumption). This is because $\tau(\mathbf{X})$ cannot be identified for values of $\mathbf{X}$ that fall outside the scope of the RCTs. For brevity, within the subsequent text, "OS data" refers to observational data, while "RCT data" denotes randomized data.

Before introducing CIO, we present two common assumptions in the causal inference [17, 28, 40] and data fusion literature [9, 10, 35]:

Assumption 3.1 (Consistency, ignorability and overlap). For any individual $i$ assigned to treatment $t_i$, we observe $Y_i = Y(t_i)$. Further, $\{Y(t)\}_{t \in T}$ and the data generating process $p(\mathbf{X}, T, Y, S)$ satisfy strong ignorability: $T \perp \{Y(0), Y(1)\} \mid (\mathbf{X}, S = 1)$, and overlap: $\forall x,\ 0 < P(T \mid \mathbf{X}) < 1$.

Assumption 3.2 (Transportability of the HTE). $\mathbb{E}(Y(1) - Y(0) \mid \mathbf{X}, S = s) = \tau(\mathbf{X})$, $s = 0, 1$.

The ignorability assumption, which is part of Assumption 3.1 and is often known as the no-unmeasured-confounders condition, states that all variables influencing both the treatment $T$ and the outcome $Y$ are observed. It is inherently satisfied in an RCT owing to the random allocation of treatments. If Assumption 3.1 holds, it is possible to determine the HTEs from RCT data. Conversely, in the case of OSs, the assumption of ignorable treatment assignment is not mandated, recognizing that such a requirement might be too stringent for practical applications.

As mentioned in 3.1, we do not presume that the treatment assignment is ignorable in the context of OS data.
Following Yang [35], under Assumptions 3.1 and 3.2, the confounding bias function is defined as the discrepancy between the difference of the conditional mean outcomes derived from OS data and the HTE:

$$c(\mathbf{X}) = \mathbb{E}(Y \mid \mathbf{X}, T = 1, S = 0) - \mathbb{E}(Y \mid \mathbf{X}, T = 0, S = 0) - \tau(\mathbf{X}). \tag{1}$$

By the consistency clause within Assumption 3.1, we have $\mathbb{E}(Y \mid \mathbf{X}, T = t, S = 0) = \mathbb{E}(Y(t) \mid \mathbf{X}, T = t, S = 0)$. Supposing the treatment assignment in the OS data is ignorable, namely $T \perp \{Y(0), Y(1)\} \mid (\mathbf{X}, S = 0)$, it follows that $\mathbb{E}(Y(t) \mid \mathbf{X}, T = t, S = 0) = \mathbb{E}(Y(t) \mid \mathbf{X}, S = 0)$. Given Assumption 3.2, this implies that the confounding bias function $c(\mathbf{X})$ equals 0. That is, if no confounding bias exists within the OS data, the confounding bias function $c(\mathbf{X})$ is zero.

We focus on a binary-treatment scenario; thus, for RCT data:

$$Y = \mathbb{E}(Y \mid \mathbf{X}, T = 0, S = 1) + T\,\tau(\mathbf{X}) + \epsilon_r. \tag{2}$$

The conditional expectation of the residual $\epsilon_r$ equals $\mathbb{E}(\epsilon_r \mid \mathbf{X}, T, S = 1) = \mathbb{E}(Y \mid \mathbf{X}, T, S = 1) - \mathbb{E}(Y \mid \mathbf{X}, T = 0, S = 1) - T \cdot \mathbb{E}(Y(1) - Y(0) \mid \mathbf{X}, S = 1)$. According to Assumptions 3.1 and 3.2, we obtain $\mathbb{E}(\epsilon_r \mid \mathbf{X}, T, S = 1) = 0$. The proof can be found in Appendix A. Similarly, we formulate the outcome model for OS data based on Equations 1 and 2:

$$Y = \mathbb{E}(Y \mid \mathbf{X}, T = 0, S = 0) + T \cdot [\tau(\mathbf{X}) + c(\mathbf{X})] + \epsilon_o. \tag{3}$$

Conditional on Assumptions 3.1 and 3.2, we likewise derive $\mathbb{E}(\epsilon_o \mid \mathbf{X}, T = 0, S = 0) = 0$. Its proof is also given in Appendix A. Therefore, when combining OS data and RCT data for HTE estimation, we integrate Equations 2 and 3:

$$\begin{aligned} Y &= T[\tau(\mathbf{X}) + (1 - S)\,c(\mathbf{X})] + \mathbb{E}(Y \mid \mathbf{X}, T = 0, S) + \epsilon \\ &= T[\tau(\mathbf{X}) + (1 - S)\,c(\mathbf{X})] + \mu_0(\mathbf{X}) + \epsilon, \end{aligned} \tag{4}$$

where we set $\mu_0(\mathbf{X}) = \mathbb{E}(Y \mid \mathbf{X}, T = 0, S)$. From $\mathbb{E}(\epsilon_r \mid \mathbf{X}, T, S = 1) = 0$ and $\mathbb{E}(\epsilon_o \mid \mathbf{X}, T = 0, S = 0) = 0$, we can infer that the conditional expectation of $\epsilon$ equals 0, i.e., $\mathbb{E}(\epsilon \mid \mathbf{X}, T, S) = 0$.
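The claim that $c(\mathbf{X})$ vanishes when OS treatment assignment is ignorable can be written out line by line. The chain below is a sketch that uses only Equation 1, the consistency clause of Assumption 3.1, the hypothetical OS ignorability condition, and Assumption 3.2; the intermediate lines are an expansion added here for illustration and are not taken from the paper's appendix.

```latex
\begin{aligned}
c(\mathbf{X})
  &= \mathbb{E}(Y \mid \mathbf{X}, T=1, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) - \tau(\mathbf{X})
     && \text{(Equation 1)} \\
  &= \mathbb{E}(Y(1) \mid \mathbf{X}, T=1, S=0) - \mathbb{E}(Y(0) \mid \mathbf{X}, T=0, S=0) - \tau(\mathbf{X})
     && \text{(consistency)} \\
  &= \mathbb{E}(Y(1) \mid \mathbf{X}, S=0) - \mathbb{E}(Y(0) \mid \mathbf{X}, S=0) - \tau(\mathbf{X})
     && \text{(ignorability in the OS, if it held)} \\
  &= \mathbb{E}(Y(1) - Y(0) \mid \mathbf{X}, S=0) - \tau(\mathbf{X})
     && \text{(linearity of expectation)} \\
  &= \tau(\mathbf{X}) - \tau(\mathbf{X}) = 0
     && \text{(Assumption 3.2 with } s = 0\text{)}
\end{aligned}
```

Read in the other direction, a nonzero $c(\mathbf{X})$ signals that at least one of these steps fails for the OS data, and the ignorability step is the one the text does not assume.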
3.2 The Identification of Confounding Bias and HTEs

Before introducing our method, we give the basic assumption, which is deduced through Assumption 3.1:

Assumption 3.3 (Basic assumption). Assuming $S \perp Y \mid X$ (meaning $S$ and $Y$ are independent given $X$), and combining the ignorability and overlap assumptions, we deduce that $T(1 - S) \perp Y \mid X$ and $0 < P(T(1 - S)) < 1$.

Confounding bias estimation. Based on Equation 4, Yang [35] has created an integrative $R$-learner to estimate both the HTE and the confounding function. This learner harnesses randomized data for accurate identification while utilizing observational data to enhance its efficiency. However, such an approach necessitates comprehensive OS data, i.e., it requires data from both the treatment and control groups for learning the HTE estimator. As we previously stated, obtaining complete OS data in complex real-world scenarios is impractical. Therefore, there is a need for a robust HTE estimation method capable of data fusion, applicable to scenarios with complete OS data and, more crucially, adaptable to situations with incomplete OS data. For this purpose, we have developed the robust CIO method to overcome the limitations of current data fusion techniques.

When integrating OS and RCT data, learning and eliminating the confounding biases hidden in OS data is a critical step that cannot be circumvented. We expand Equation 4 as follows:

$$Y = T(1 - S)\,c(\mathbf{X}) + T\,\tau(\mathbf{X}) + \mu_0(\mathbf{X}) + \epsilon. \tag{5}$$

We define $D = T(1 - S)$ as an artificially generated treatment variable, where $D = 1$ is assigned when $T = 1$ and $S = 0$, and $D = 0$ otherwise. Following the creation of this proxy treatment mechanism, we form a pseudo-experimental group where $D = 1$ and a pseudo-control group where $D = 0$. Based on Assumption 3.3, we then use the dummy data to learn the confounding bias function $c(\mathbf{X})$, for which the learning process is the same as effect estimation. Therefore, we denote the learned $c(\mathbf{X})$ as $\tau_c(\mathbf{X})$. It should be noted that the pseudo-control group comprises all the samples from the RCT dataset.
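The construction of the dummy treatment and the two pseudo groups can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `dummy_treatment` and `pseudo_groups` and the toy arrays are assumptions introduced here, and the toy data mimics the incomplete setting in which the OS control group is missing.

```python
from typing import List, Sequence, Tuple


def dummy_treatment(t: int, s: int) -> int:
    """Return D = T * (1 - S): 1 only for treated units that come from the OS (S = 0)."""
    return t * (1 - s)


def pseudo_groups(T: Sequence[int], S: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Return (pseudo-experimental indices, pseudo-control indices).

    pseudo-experimental: units with D = 1, i.e. treated OS units.
    pseudo-control: all RCT units (S = 1), as stated in the text. When the
    OS control group is missing, this is exactly the set of units with D = 0.
    """
    experimental = [i for i, (t, s) in enumerate(zip(T, S)) if dummy_treatment(t, s) == 1]
    control = [i for i, s in enumerate(S) if s == 1]
    return experimental, control


# Hypothetical toy data: two treated OS units, one treated and one control RCT unit.
T = [1, 1, 1, 0]
S = [0, 0, 1, 1]
experimental_idx, control_idx = pseudo_groups(T, S)
print("pseudo-experimental:", experimental_idx)
print("pseudo-control:", control_idx)
```

Any effect estimator can then be trained with these two index sets playing the roles of treatment and control, which is how the text proposes to learn $\tau_c(\mathbf{X})$.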
HTE estimation. After training the confounding bias function $\tau_c(\mathbf{X})$, we freeze its parameters. At this stage, Equation 5 can be rearranged as:

$$\tilde{Y} = T\,\tau(\mathbf{X}) + \mu_0(\mathbf{X}) + \epsilon, \qquad \tilde{Y} = Y - T(1 - S)\,\tau_c(\mathbf{X}), \tag{6}$$

where $T(1 - S)\,\tau_c(\mathbf{X})$ is a constant for an individual. Following this equation, we then integrate the OS data with the RCT data to train an effect estimator $\tau(\cdot)$. Utilizing this formula allows us to adjust and rectify the observed outcomes for the treatment group in the OS data. Thus, the OS data can be combined with RCT data to estimate HTEs without the inclusion of confounding biases.

Diverging from existing data fusion approaches for HTE assessment, the proposed CIO method does not require a propensity model to be learned beforehand, as seen in methods like FAST [13] and the integrative $R$-learner [35], nor does it initially demand an effect estimator derived from OS data, which is a prerequisite for techniques such as RHC [19] and CorNet [15]. The training process outlined above reveals the following insights:
• For estimating confounding biases, it is sufficient to use only the treated subset of OS data and the RCT data.
• In estimating HTEs, even without access to the untreated group within the OS data, we have at our disposal a composite treated group (encompassing the treated individuals from both the OS and the RCT) and a separate control group (consisting of the untreated units from the RCT). These groups can be employed to train an effect estimator.

These attributes endow the CIO method with resilience. In the above description of CIO, we assumed for simplicity that the control group of the OS is missing. However, CIO can also adapt to situations where the treated cohort from the OS data is missing. This is achieved by inverting the treatment assignments for the treated and untreated samples, that is, assigning $T = 0$ to the initially treated subjects and $T = 1$ to the originally untreated subjects.
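The two stages described above can be sketched end to end with a deliberately simple base learner. The following Python sketch is an illustration under stated assumptions, not the authors' code: it assumes a one-dimensional covariate and uses a closed-form weighted least-squares line in place of the regression models the paper mentions (ridge regression, random forests, neural networks), and the names `fit_line` and `cio_fit` are hypothetical. Passing an empty OS control group corresponds to the incomplete setting in which the OS control group is missing.

```python
from typing import Callable, List, Sequence, Tuple

Group = Tuple[Sequence[float], Sequence[float]]  # (covariates xs, outcomes ys)
Model = Callable[[float], float]


def fit_line(groups: List[Group]) -> Model:
    """Fit y = a + b * x by weighted least squares.

    Each non-empty group contributes its own mean squared error, so every
    sample is weighted by 1 / (size of its group). Assumes the pooled x
    values are not all identical (otherwise the slope is undefined).
    """
    sw = swx = swy = swxx = swxy = 0.0
    for xs, ys in groups:
        if len(xs) == 0:
            continue  # e.g. a missing OS control group
        w = 1.0 / len(xs)
        for x, y in zip(xs, ys):
            sw += w
            swx += w * x
            swy += w * y
            swxx += w * x * x
            swxy += w * x * y
    if sw == 0.0:
        raise ValueError("no data to fit")
    b = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    a = (swy - b * swx) / sw
    return lambda x: a + b * x


def cio_fit(os_treated: Group, os_control: Group,
            rct_treated: Group, rct_control: Group) -> Tuple[Model, Model]:
    """Return (tau_hat, tau_c) following the two-stage description."""
    # Stage 1: p1 on the pseudo-experimental group (treated OS units),
    # p0 on the pseudo-control group (all RCT units); tau_c = p1 - p0.
    p1 = fit_line([os_treated])
    p0 = fit_line([rct_treated, rct_control])
    tau_c: Model = lambda x: p1(x) - p0(x)

    # Stage 2: correct the treated OS outcomes with the frozen tau_c,
    # then fit one outcome model per arm on the pooled OS and RCT data.
    xs_ot, ys_ot = os_treated
    corrected: Group = (xs_ot, [y - tau_c(x) for x, y in zip(xs_ot, ys_ot)])
    f1 = fit_line([corrected, rct_treated])
    f0 = fit_line([os_control, rct_control])
    tau_hat: Model = lambda x: f1(x) - f0(x)
    return tau_hat, tau_c


# Hypothetical toy data; the OS control group is missing (incomplete OS data).
tau_hat, tau_c = cio_fit(
    os_treated=([0.0, 1.0, 2.0, 3.0], [4.1, 5.0, 6.2, 6.9]),
    os_control=([], []),
    rct_treated=([0.5, 2.5], [2.4, 4.6]),
    rct_control=([1.0, 3.0], [2.0, 4.1]),
)
print("estimated bias at x = 1.5:", tau_c(1.5))
print("estimated effect at x = 1.5:", tau_hat(1.5))
```

The sketch follows the procedure as described and makes no claim about statistical guarantees; it also omits training details that only matter for iterative learners, such as parameter initialization and epoch counts.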
3.3 Training Loss

We denote the observational data as $\{\mathbf{x}_i^{o}, t_i^{o}, y_i^{o}\}_{i=1}^{m}$; its treated data and control data are $\{(\mathbf{x}_i^{ot}, y_i^{ot}) \mid t_i^{o} = 1\}_{i=1}^{m_t}$ and $\{(\mathbf{x}_i^{oc}, y_i^{oc}) \mid t_i^{o} = 0\}_{i=1}^{m_c}$, respectively. In the same way, we denote the randomized data as $\{\mathbf{x}_i^{r}, t_i^{r}, y_i^{r}\}_{i=1}^{n}$, its treated data as $\{(\mathbf{x}_i^{rt}, y_i^{rt}) \mid t_i^{r} = 1\}_{i=1}^{n_t}$ and its control data as $\{(\mathbf{x}_i^{rc}, y_i^{rc}) \mid t_i^{r} = 0\}_{i=1}^{n_c}$. Here $m$, $m_t$, $m_c$, $n$, $n_t$, $n_c$ denote the sizes of the OS data, OS treated data, OS control data, RCT data, RCT treated data and RCT control data, respectively.

Stage 1: Confounding bias estimation. During this phase, various regression techniques can be used to model the data from individuals in OSs and RCTs, including ridge regression, random forests, neural networks, etc. We initialize $p_1(\cdot)$ and $p_0(\cdot)$ to fit the pseudo-experimental data and the pseudo-controls, respectively, i.e., the treated data of OSs and all of the data of RCTs. According to Equation 5, the optimization objectives for this stage are:

$$\hat{p}_1(\cdot) = \arg\min_{p_1} \frac{1}{m_t} \sum_{i=1}^{m_t} \left[ y_i^{ot} - p_1(\mathbf{x}_i^{ot}) \right]^2, \tag{7}$$

$$\hat{p}_0(\cdot) = \arg\min_{p_0} \left( \frac{1}{n_t} \sum_{i=1}^{n_t} \left[ y_i^{rt} - p_0(\mathbf{x}_i^{rt}) \right]^2 + \frac{1}{n_c} \sum_{i=1}^{n_c} \left[ y_i^{rc} - p_0(\mathbf{x}_i^{rc}) \right]^2 \right). \tag{8}$$

After that, we calculate the confounding bias function $\hat{\tau}_c(\mathbf{x}_i) = \hat{p}_1(\mathbf{x}_i) - \hat{p}_0(\mathbf{x}_i)$ for each unit from the OS treatment group.

Stage 2: HTE estimation. At this stage, we amalgamate the entire dataset from OSs with that from RCTs while also adjusting the observed outcomes for the treated group in OSs using the estimated confounding bias function $\hat{\tau}_c(\cdot)$. In a similar vein, we set up the functions $f_1(\cdot)$ and $f_0(\cdot)$ to be trained on the entirety of the treatment data and control data, respectively, drawn from the aggregate of OSs and RCTs. It should be noted that, for effective debiasing with $\hat{\tau}_c(\cdot)$, we adopt the parameters of $\hat{p}_1(\cdot)$ as the initial values for $f_1(\cdot)$.
Additionally, it is imperative to conduct initial training for $f_0(\cdot)$ using control data, ensuring that the number of training epochs aligns with that of Stage 1. According to Equation 6, the loss functions are:

$$\hat{f}_1(\cdot) = \arg\min_{f_1} \left( \frac{1}{m_t} \sum_{i=1}^{m_t} \left[ \tilde{y}_i^{ot} - f_1(\mathbf{x}_i^{ot}) \right]^2 + \frac{1}{n_t} \sum_{i=1}^{n_t} \left[ y_i^{rt} - f_1(\mathbf{x}_i^{rt}) \right]^2 \right), \tag{9}$$

$$\hat{f}_0(\cdot) = \arg\min_{f_0} \left( \frac{1}{m_c} \sum_{i=1}^{m_c} \left[ y_i^{oc} - f_0(\mathbf{x}_i^{oc}) \right]^2 + \frac{1}{n_c} \sum_{i=1}^{n_c} \left[ y_i^{rc} - f_0(\mathbf{x}_i^{rc}) \right]^2 \right), \tag{10}$$

where $\tilde{y}_i^{ot} = y_i^{ot} - \hat{\tau}_c(\mathbf{x}_i^{ot})$. Finally, we obtain the HTE estimator $\hat{\tau}(\mathbf{x}_i) = \hat{f}_1(\mathbf{x}_i) - \hat{f}_0(\mathbf{x}_i)$ for each unit.

3.4 Theoretical Effectiveness

In this paper, following the approach of the integrative $R$-learner of Yang [35], we introduce the confounding function $c$ to describe the confounding bias in observational studies (OSs), as shown by Equation 5, $Y = T(1 - S)\,c(X) + T\,\tau(X) + \mu_0(X) + \epsilon$. When treating $T(1 - S)$ as a dummy treatment variable, we are able to estimate $c(X)$, which constitutes the first stage of our method. Assuming $S \perp Y \mid X$ (meaning $S$ and $Y$ are independent given $X$), and combining the ignorability and overlap assumptions, we deduce that $T(1 - S) \perp Y \mid X$ and $0 < P(T(1 - S)) < 1$. This implies that, under the Potential Outcome Framework (POF), $c$ is theoretically identifiable. Since $c$ is identified under the POF, we denote the identified $c$ as $\tau_c$. After acquiring the confounding bias function $\tau_c$, we use it to calibrate the outcomes of the OS data and thereby reduce the confounding bias. Finally, in the second stage, HTE estimation is likewise based on the POF and is also identifiable. In summary, we decompose the process of HTE estimation combining OS and RCT data into two stages under the POF, each of which is supported by the identification results of the POF.

4 Experiment

In this section, we perform experiments on a synthetic dataset and two real-world datasets to demonstrate the performance of CIO for HTE estimation. The outcomes of these datasets are all simulated by a certain strategy.
We present the results of various experiments designed to address the following three research questions:
• RQ1: Is the proposed approach CIO effective at combining observational data and randomized data for HTE estimation in both situations, i.e., when the observational data is complete and when it is incomplete?
• RQ2: If the strength of the confounding bias present in OS data intensifies, does the proposed approach maintain its superior performance relative to current data fusion techniques?
• RQ3: What is the impact of inverted treatment assignment on HTE estimation?

The following text includes three subsections: Experimental Setup, Datasets, and Experimental Analysis. Within the Experimental Analysis subsection, we conduct a large number of experiments and the corresponding analysis to answer the above research questions.

4.1 Experimental Setup

Baselines and architectures. To assess the performance of CIO, we choose RHC [19], the integrative $R$-learner [35] and CorNet [15] as baselines for comparison. Following RHC [19], we select Ridge and RF as our base models for comparison. In addition, given that methods based on representation learning have demonstrated notable efficacy in estimating heterogeneous treatment effects (HTEs), we follow the approach of CFR [32] by integrating the Treatment-Agnostic Representation Network (TARNet) into our suite of base models for experimental purposes. For brevity, we refer to the integrative $R$-learner as IntR. RHC, IntR, CorNet and CIO all follow a dual-phase training approach that leverages OS and RCT data for estimating HTEs, and they are implemented using ridge regression (Ridge), random forest (RF) and TARNet as underlying models. Since CorNet is implemented with a DNN that includes a representation layer, we only compare with it when the other baselines and CIO are implemented with TARNet.
Furthermore, to determine the benefits of combining OS and RCT data, we train baseline estimators solely on each data type, referred to as SF_OS for OS data and SF_RCT for RCT data. We also conduct an experiment where OS and RCT data are simply fused for training, indicated as SI, to serve as a reference for the effectiveness of modeling confounding bias.

Incomplete OS data. Most importantly, the original purpose of CIO is to combine OS and RCT data for HTE estimation in the situation where the OS data is not fully available. Therefore, in order to create the incomplete condition, we deliberately omit either the untreated or the treated group from the OS datasets of the Simulation and STAR experiments during the training phase. This modified approach is referred to as CIO_IO.

Metrics. When developing models to predict the Individual Treatment Effect (ITE), the main objective is to minimize the Precision in the Estimation of Heterogeneous Effect (PEHE), as outlined in [16, 31]. In a binary treatment scenario, PEHE quantifies the accuracy with which a model can predict the differential impact of two treatments, $t_0$ and $t_1$, for a given set of samples $X$. To calculate PEHE, we determine the mean squared error across $N$ samples by comparing the actual difference in outcomes, $y_1(n) - y_0(n)$, which is obtained from the simulation strategy, with the predicted difference, $\hat{y}_1(n) - \hat{y}_0(n)$, where $n$ denotes the sample index:

$$\epsilon_{\mathrm{PEHE}} = \frac{1}{N} \sum_{n=1}^{N} \left( [y_1(n) - y_0(n)] - [\hat{y}_1(n) - \hat{y}_0(n)] \right)^2 \tag{11}$$

Table 1: Comparison of methods for combining OS and RCT data on the Simulation, STAR and NSW data. We report the mean value ± the standard deviation of $\sqrt{\epsilon_{\mathrm{PEHE}}}$ on test data over 10 repeated runs for the three datasets, with proportion $p_r = 0.2$ of RCT data. In addition, we remove the control group of the OS in the Simulation and STAR datasets and present the results as CIO_IO.
We run trials for 10 times and the best performance is marked in bold. Architecture Method𝑝𝑟=0.2 Simulation STAR NSW RidgeSF𝑂𝑆 21.97±1.06 25.05±2.92 - SF𝑅𝐶𝑇 9.74±2.39 4.06±0.52 2.49±1.06 SI 21.43±1.06 12.82±3.48 15.85±0.32 RHC 14.62±6.18 9.7±2.68 - IntR 7.68±0.34 8.52±0.71 - CIO(Ours) 5.75±0.34 2.36±0.48 - CIO𝐼𝑂(Ours) 8.96±0.63 2.14±0.65 1.48±0.38 RFSF𝑂𝑆 18.35±0.67 48.17±0.47 - SF𝑅𝐶𝑇 10.41±2.87 6.96±1.30 2.30±0.70 SI 17.83±0.73 7.83±0.87 3.84±1.08 RHC 11.76±1.28 18.59±0.50 - IntR 8.84±1.93 9.51±0.22 - CIO(Ours) 6.65±0.26 5.61±0.93 - CIO𝐼𝑂(Ours) 10.18±2.94 5.35±0.96 2.29±0.70 TARNetSF𝑂𝑆 23.64±1.30 30.40±11.62 - SF𝑅𝐶𝑇 10.84±6.41 5.15±3.38 5.14±0.46 SI 22.83±1.08 23.81±7.96 22.47±0.60 RHC 7.89±3.40 7.06±2.65 - CorNet 12.24±2.25 7.00±1.01 - IntR 6.97±0.72 4.93±1.81 - CIO(Ours) 6.61±0.35 3.54±1.38 - CIO𝐼𝑂(Ours) 6.92±0.41 4.17±1.39 5.11±0.39 4.2 Datasets In line with prior approaches [ 13,15,35,37], we create a simulated dataset and choose two real-world datasets, STAR and NSW, for our experimental evaluations. Due to the absence of ground truth for the effects in STAR and NSW, we simulate the outcomes for these datasets rather than relying on their actual outcomes. In this subsection, we will introduce more details of three experimental datasets about their data construction and outcome simulation. 4.2.1 Simulation Dataset. A synthetic dataset, of which the covari- ates and outcomes are all simulated: •Data construction and outcome simulation. We generate in- dependent covariates of dimension 𝑝, denoted as 𝑋𝑖, from a stan- dard normal distribution 𝑋𝑖∼N( 0,1)for𝑖=1,2,...,𝑝 . In this experiment, we set 𝑝=5. Following this procedure, we produce 200 samples for RCT data, 3000 for OS data, and 1000 for test data. 
The potential outcomes for each sample are then simulated using the equation $Y(t) = t\,\tau(\mathbf{X}) + 1 + 2\sum_{i=1}^{p} X_i^3 + \sum_{i=1}^{p} X_i + 5(1-s)U + \epsilon(t)$, where $\tau(\mathbf{X}) = 1 + \sum_{i=1}^{p} X_i + \sum_{i=1}^{p} X_i^2$, $t \in \{0,1\}$ indicates the treatment status, $s=0$ refers to OS data and $s=1$ to RCT data, and $\epsilon(t)$ is drawn from $\mathcal{N}(0,1)$. Treatment assignment for RCT and OS data is governed by $T \mid (\mathbf{X}, S=1) \sim \mathrm{Bernoulli}(0.5)$ and $T \mid (\mathbf{X}, S=0) \sim \mathrm{Bernoulli}\big(1/(1+\exp(-\sum_{i=1}^{p} X_i))\big)$, respectively. Echoing the simulation approach in [37], the unobserved variable $U$ is sampled from $\mathcal{N}(\mathbf{X}^T \mathbf{v}\,\beta(2T-1), 1)$, with $\mathbf{v}$ being the all-ones vector $(1,\ldots,1)^T$ and $\beta$ a coefficient that modulates the magnitude of confounding bias in OS data.
4.2.2 STAR. A semi-synthetic dataset. Beyond the synthetic dataset experiments, we also evaluate the performance of CIO on a real-world dataset:
•Tennessee Student/Teacher Achievement Ratio (STAR). The STAR Experiment [20] was a randomized controlled trial conducted in the late 1980s. Its objective was to measure the impact of class size on students' academic performance. We follow RHC [19] and FAST [13] in splitting STAR into observational and randomized data. We focus on two experimental class-size conditions: small classes of 13-17 students and regular classes of 22-25 students. Because a significant number of students entered the study in the first grade, we take the type of class they were placed in at that time as their treatment. For each student, we consider the following variables: gender, race, birth month, birthday, birth year, free lunch given or not, and teacher id. We exclude students with missing values for any of these covariates, as well as students whose combined scores on the standardized tests in listening, reading, and math are missing. In total, we retain 4139 students: 1774 assigned to treatment (small class, T = 1) and 2365 to control (regular-size class, T = 0).
•Data construction.
Through a specific simulation strategy, we obtain the outcome for each student. We then follow the settings of RHC [19] and FAST [13] to construct the OS data, RCT data and test data. To introduce confounding bias, we divide the study population by a variable: students living in rural or inner-city areas (U = 1, 2811 students) versus those in urban or suburban areas (U = 0, 1407 students). We create the RCT data by randomly selecting a proportion of 0.5 of the students with U = 1. The OS data is compiled as follows. For students with U = 1, we include those who are not part of the RCT data and have control status (D = 0), along with the treated students (D = 1) whose simulated outcomes are in the bottom 50% among their peers with D = 1 and U = 1. For students with U = 0, we include all control students (D = 0) and the treated students (D = 1) whose simulated outcomes are in the bottom 50% of the group with D = 1 and U = 0. Finally, the test data is a held-out subset of the entire sample, excluding the individuals included in the RCT data.
•Outcome simulation. Following RHC [19] and FAST [13], we use the actual covariates $\mathbf{X} = (X_1, X_2, \cdots, X_p)^T$, where $p=7$. While the STAR dataset contains actual outcome data, it lacks a ground truth for the treatment effect. Consequently, we simulate both the outcome and the treatment effect function for the STAR dataset. Concretely, we set $Y(t) = t\,\tau(\mathbf{X}) + 2\sum_{i=1}^{p} X_i + \mathbf{X}^T\mathbf{X} + \epsilon(t)$ as potential outcomes, where $\tau(\mathbf{X}) = \sum_{i=1}^{p} X_i + \sqrt{\big|\sum_{i=1}^{p} X_i\big|}$ and $\epsilon(t) \sim \mathcal{N}(0,1)$.
4.2.3 NSW. A semi-synthetic dataset with incomplete OS data. The datasets described in the previous subsections include complete OS data.
While Table 1 shows that CIO remains robust when faced with incomplete OS data, created by omitting the control group of the OS data, this does not fully convey the method's practical utility. We therefore also evaluate on an inherently incomplete dataset, NSW, in which the treated group of the OS data is absent:
•National Supported Work (NSW) Demonstration. The National Supported Work (NSW) Demonstration [23] was a randomized experiment investigating the effect of job training on income and employment status. Following [1], we combine the randomized samples (297 treated, 425 control) with the 2490 PSID observational controls in our experiments.
•Data construction. The incomplete nature of NSW's OS data makes it an apt choice for testing the resilience of the CIO method. We randomly draw 100 samples, covering both the treated group and the control group, from the pool of 722 randomized samples to serve as RCT data. To inject additional bias into the OS data, we keep only those of the 2490 PSID observational controls whose simulated outcomes rank in the upper 50%. The remaining randomized samples and observational controls are used for testing.
•Outcome simulation. We use the actual covariates $\mathbf{X} = (X_1, X_2, \cdots, X_p)^T$ of NSW: age, level of education, ethnicity (split into two covariates), marital status, and educational degree, where $p=6$. We generate potential outcomes for each person by $Y(t) = t\,\tau(\mathbf{X}) + 2\sum_{i=1}^{p} \exp(X_i) + \epsilon(t)$, where $\tau(\mathbf{X}) = \mathbf{X}^T\mathbf{X}$, $t \in \{0,1\}$ and $\epsilon(t) \sim \mathcal{U}(-1,1)$. Note that since the treated group of the OS data does not exist, we invert the treatment labels: we set the treatment value of the RCT treated data to 0, and that of the OS and RCT control data to 1.
4.3 Experimental Analysis
Preliminary Trials. Initially, we run CIO alongside the baseline methods using Ridge, RF, and TARNet to estimate HTEs, the core task of these methods.
Our investigation aims to determine whether integrating even a small quantity of RCT data with OS data benefits HTE estimation. To this end, we randomly select a subset of the RCT data with proportion $p_r=0.2$ and combine it with the entirety of the OS data for training. We ensure that the selected RCT data includes both treated and control instances. This setup reflects the common real-world scenario in which RCT data is considerably less abundant than OS data. The results are presented in Table 1. Relying solely on OS data during training introduces significant bias and results in poor performance. Likewise, even though SI combines OS data with RCT data, it shows only a marginal improvement over SF$_{OS}$, because SI merges OS data and RCT data directly without any procedure to mitigate bias. RHC and IntR, by contrast, are designed to correct for biases in OS data by leveraging RCT data in the estimation of HTEs.
Figure 2: Comparison among data-fusion baselines under Ridge and RF with an increasing ratio of RCT data for training. Results on the Simulation, STAR and NSW datasets are plotted in Figure 2(a), 2(b) and 2(c), respectively. (Each panel plots PEHE against $p_r$ for SI, RHC, IntR, CIO and CIO$_{IO}$; the NSW panels show SI and CIO$_{IO}$ only.)
The results presented in Table 1 indicate that CIO significantly outperforms IntR when complete OS data is used, highlighting CIO's effectiveness in mitigating confounding bias from OS data.
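For reference, the quantity reported in Table 1 is the square root of the PEHE in Eq. (11), averaged over repeated runs. The following is a minimal sketch of that computation using only the Python standard library. The function names `sqrt_pehe` and `summarize` are illustrative and do not come from the paper, and the use of the sample standard deviation is an assumption, since the text does not state which variant is reported.

```python
import math
import statistics


def sqrt_pehe(y0, y1, y0_hat, y1_hat):
    """Square root of Eq. (11): RMSE between true and predicted effects."""
    n = len(y0)
    if n == 0 or not (len(y1) == len(y0_hat) == len(y1_hat) == n):
        raise ValueError("all inputs must be non-empty and of equal length")
    squared_errors = [
        ((b - a) - (b_hat - a_hat)) ** 2
        for a, b, a_hat, b_hat in zip(y0, y1, y0_hat, y1_hat)
    ]
    return math.sqrt(sum(squared_errors) / n)


def summarize(scores):
    """Mean and sample standard deviation over repeated runs (assumed variant)."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Because the true potential outcomes are needed for both arms, this metric can only be evaluated on simulated or semi-synthetic outcomes, which is why the outcomes of STAR and NSW are simulated.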
For example, we conducted a significance test between IntR and CIO: on the Simulation dataset, the p-value is 4.27e-10 for Ridge and 3.43e-3 for RF; on the STAR dataset, the p-value is 1.79e-6 for Ridge and 7.83e-3 for RF. Another important aspect of our analysis involves removing the control group from the OS data and merging the remaining OS data with the full RCT dataset to assess CIO's resilience. Even with incomplete OS data, CIO can leverage the available OS data for training. Despite a noticeable decline in the performance of CIO$_{IO}$ compared with the complete CIO, it still surpasses SF$_{RCT}$. This underscores CIO's ability to preserve robust performance in HTE estimation by making use of the available data. Moreover, as discussed earlier, RHC and IntR cannot combine RCT with OS data when part of the OS dataset is missing. As a result, on the NSW dataset we can only assess CIO$_{IO}$, SI, and SF$_{RCT}$. With Ridge as the underlying model, our proposed approach attains superior performance; the significance test yields a p-value of 0.01, below the threshold of 0.05. With RF as the underlying architecture, the performance of CIO$_{IO}$ is on par with that of SF$_{RCT}$.
Sensitivity of RCT data volume. In the experiments above, CIO demonstrated its efficacy and resilience when provided with a small amount of RCT data for training. It remains unclear, however, whether it can sustain this performance with varying volumes of RCT data. To investigate this, we vary the proportion of RCT data used in training, $p_r$, from 0.1 to 1.0. The results of all methods implemented with Ridge and RF are shown in Figure 2.
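The subsampling step behind this sweep can be sketched as follows. This is an illustrative reading rather than the authors' code: the function name `subsample_rct`, the rejection-sampling strategy used to guarantee that both arms are present, and the grid of $p_r$ values in the usage note are all assumptions, since the text only states the range 0.1 to 1.0 and that treated and control instances are both included.

```python
import random


def subsample_rct(treatments, p_r, rng):
    """Return sorted indices of a random RCT subset containing both arms.

    `treatments` is a list of 0/1 labels, one per RCT unit; roughly a
    fraction `p_r` of the units is kept (at least two).
    """
    n = len(treatments)
    if not 0.0 < p_r <= 1.0:
        raise ValueError("p_r must lie in (0, 1]")
    if set(treatments) != {0, 1}:
        raise ValueError("labels must be binary with both arms present")
    k = max(2, round(p_r * n))
    # Redraw until the subset contains at least one treated and one control unit.
    while True:
        idx = rng.sample(range(n), k)
        if {treatments[i] for i in idx} == {0, 1}:
            return sorted(idx)
```

A sweep would then call `subsample_rct(treatments, p_r, rng)` for each value in a hypothetical grid such as `[0.1, 0.2, ..., 1.0]` and retrain every method on the selected units together with the OS data.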
We summarize three observations from the figure:
•As expected, the $\sqrt{\epsilon_{\mathrm{PEHE}}}$ of all methods decreases as the size of the RCT data grows, given that RCT data are free of unobserved confounders. Our approach consistently surpasses the alternative data-fusion techniques with both Ridge and RF models, affirming the efficacy of CIO in estimating HTEs.
•More importantly, to assess the robustness of CIO when the OS data are incomplete, we run experiments without the controls from the OS data. The results indicate that CIO$_{IO}$ maintains superior performance even when part of the OS data is absent.
•Under the Ridge model, RHC and IntR perform better than SI, whereas this edge is not observed with RF. In contrast, CIO consistently outperforms SI regardless of whether Ridge or RF is employed, highlighting CIO's stable debiasing capability.
The figure shows that CIO is consistently among the top performers, regardless of the amount of RCT data used. Furthermore, the standard deviations in Figure 2 vary significantly across data ratios within the same dataset. At lower ratios, only a small amount of RCT data is available for training, so the training samples differ substantially across experimental runs, which leads to higher variance in performance. As the volume of RCT data used in training increases, the variation in the training set decreases and performance stabilizes across runs.
Impact of the strength of confounding bias. Our approach identifies the confounding bias present in OS data and uses it as a residual adjustment for the observed outcomes. Recognizing the confounding bias in OS data is essential when integrating OS and RCT data.
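To make the role of the confounder concrete, the simulation of Section 4.2.1 can be sketched as below, where $\beta$ scales the mean of the unobserved variable $U$ and $U$ enters the outcome only for OS units ($s=0$). This is a minimal standard-library sketch under stated assumptions: the name `draw_unit`, the dictionary layout, the seed, and the value `beta=2.0` in the usage lines are illustrative and are not taken from the paper.

```python
import math
import random

P = 5  # covariate dimension used in the simulation


def tau(x):
    """True effect: tau(X) = 1 + sum_i X_i + sum_i X_i^2."""
    return 1.0 + sum(x) + sum(v * v for v in x)


def draw_unit(s, beta, rng):
    """Draw one unit; s = 1 denotes RCT data and s = 0 denotes OS data."""
    x = [rng.gauss(0.0, 1.0) for _ in range(P)]
    # Randomized assignment in the RCT, logistic in sum_i X_i for the OS.
    p_treat = 0.5 if s == 1 else 1.0 / (1.0 + math.exp(-sum(x)))
    t = 1 if rng.random() < p_treat else 0
    # U ~ N(beta * (2T - 1) * sum_i X_i, 1); the factor (1 - s) removes it for RCT units.
    u = rng.gauss(beta * (2 * t - 1) * sum(x), 1.0)
    base = 1.0 + 2.0 * sum(v ** 3 for v in x) + sum(x) + 5.0 * (1 - s) * u
    y0 = base + rng.gauss(0.0, 1.0)
    y1 = tau(x) + base + rng.gauss(0.0, 1.0)
    return {"x": x, "t": t, "y": y1 if t == 1 else y0, "y0": y0, "y1": y1}


rng = random.Random(0)  # hypothetical seed
rct = [draw_unit(1, beta=2.0, rng=rng) for _ in range(200)]
os_data = [draw_unit(0, beta=2.0, rng=rng) for _ in range(3000)]
```

Since the mean of $U$ depends on the assigned treatment through $(2T-1)$, a larger $\beta$ widens the gap between treated and control OS outcomes that is not attributable to $\tau(\mathbf{X})$.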
To this end, we manipulate the intensity of the confounding bias in the OS data by varying the value of $\beta$. The results are illustrated in Figure 3; the confounding bias in the OS data grows as $\beta$ increases. The performance of SI declines sharply as $\beta$ increases, whereas the performance of the other data-fusion methods deteriorates more gradually, which demonstrates the effectiveness of identifying the confounding bias. In particular, CIO maintains a low $\sqrt{\epsilon_{\mathrm{PEHE}}}$ despite high values of $\beta$, regardless of whether the OS data is complete. Notably, once $\beta$ reaches a certain threshold, the performance of CIO$_{IO}$ surpasses that of CIO. This phenomenon may be attributed to the trade-off between the quantity of OS control data and the strength of the associated confounding bias. This characteristic can be exploited when the OS data lacks control or treatment data and suffers from substantial confounding bias.
Figure 3: For all data-fusion techniques using Ridge and RF, we observe $\sqrt{\epsilon_{\mathrm{PEHE}}}$ across a range of $\beta$ values that modulate the intensity of the confounding bias in the training OS data. (Panels: Ridge and RF; y-axis: PEHE; methods: SI, RHC, IntR, CIO, CIO$_{IO}$.)
Impact of the size of OS control data. When the OS control group is absent, CIO retains the unique ability to integrate the available OS data with RCT data for estimating HTEs. It is therefore of interest to examine how the performance of the baseline methods and CIO changes when they are trained with varying volumes of OS control data. We randomly select RCT data with $p_r=0.05$ (ensuring that both treatment and control samples are included) for training. The results are displayed in Figure 4(a). CIO consistently surpasses the alternative methods across all sizes of OS control data, demonstrating its applicability to data with varying and intricate compositions.
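The construction of OS training sets with a limited number of controls can be sketched as follows. This is an assumed reading of the setup: the helper name `limit_os_controls` and the row format (dictionaries with a binary key `"t"`) are illustrative, and the text does not specify how the retained controls are chosen, so uniform sampling without replacement is assumed here.

```python
import random


def limit_os_controls(os_rows, n_controls, rng):
    """Keep every treated OS row and at most `n_controls` random control rows."""
    treated = [row for row in os_rows if row["t"] == 1]
    controls = [row for row in os_rows if row["t"] == 0]
    kept = rng.sample(controls, min(n_controls, len(controls)))
    return treated + kept
```

With `n_controls=0` this reduces to the incomplete-OS setting evaluated as CIO$_{IO}$, so the sweep over control counts can be read as moving from incomplete toward more complete OS data.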
Following the trials conducted on the Simulation dataset, we vary the number of OS control samples over the same range for the STAR dataset. To obtain a different composition of RCT training data, we set $p_r=0.2$; the results are shown in Figure 4(b). According to the line chart, CIO maintains a steady improvement over the baseline methods irrespective of the OS control data size, which illustrates CIO's robustness.
Figure 4: We vary the quantity of OS control data used in the training stage and evaluate the efficacy of the various data-fusion techniques implemented with Ridge regression. The number of OS controls takes values in {1, 4, 16, 64, 256, 512}. Results for the Simulation dataset ($p_r=0.05$) are shown in Figure 4(a) and for the STAR dataset ($p_r=0.2$) in Figure 4(b). (Each panel plots PEHE against the number of OS controls for SI, RHC, IntR and CIO.)
Inverse of Treatment Assignment. In Section 3, for convenience in describing CIO, we assume that the control group of the OS data is unavailable. In practice, however, it is often the treatment group's data that is missing. As previously detailed, in the absence of treatment data, we can reverse the original treatment assignments to adapt our methodology. This reversal enables the flexible application of CIO to combine OS and RCT data for estimating HTEs. Whether CIO remains effective after the inversion, however, must be verified empirically. To address this, we perform a series of experiments to assess CIO's robustness. Initially, we eliminate the treatment group of OSs from the Simulation and STAR datasets for our analysis.
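The label reversal described above can be sketched as follows. This is a hedged illustration, not the CIO estimator itself: a plain difference in means stands in for the effect estimator, the names `invert_treatment` and `diff_in_means` are hypothetical, and mapping the estimate back by negation is an assumption that follows from $Y(0)-Y(1) = -(Y(1)-Y(0))$ rather than something the text states explicitly.

```python
def invert_treatment(rows):
    """Return copies of the rows with treatment labels flipped (1 -> 0, 0 -> 1)."""
    return [{**row, "t": 1 - row["t"]} for row in rows]


def diff_in_means(rows):
    """Stand-in effect estimate: mean outcome of t=1 minus mean outcome of t=0."""
    treated = [row["y"] for row in rows if row["t"] == 1]
    control = [row["y"] for row in rows if row["t"] == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


rows = [{"t": 1, "y": 3.0}, {"t": 1, "y": 5.0}, {"t": 0, "y": 1.0}]
flipped = invert_treatment(rows)
# The effect under flipped labels is the negative of the original effect.
effect_original_scale = -diff_in_means(flipped)
```

After flipping, an OS dataset that lacks treated units looks like one that lacks controls, which is the case the method is described for in Section 3.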
Subsequently, with that group removed, we invert the treatment assignments within the OS data and RCT data of these datasets to further examine the performance of the proposed method. We report the experimental results in Table 2, where the performance of CIO is nearly unchanged between 'original' and 'inverse'. This consistency shows that CIO retains its effectiveness even when the treatment group data from OSs is absent.
Table 2: For both the Simulation and STAR datasets, we exclude the control group from the OS data for the 'original' experiments. Conversely, for the 'inverse' experiments, the treatment group of the OS data is eliminated from the Simulation and STAR datasets and the treatment assignments are inverted. 'original' denotes the original treatment assignment, while 'inverse' denotes the inverted treatment assignment.
Base Model    Simulation (original)    Simulation (inverse)    STAR (original)    STAR (inverse)
Ridge         8.96±0.63                8.03±0.51               2.14±0.65          2.65±0.99
RF            10.18±2.94               9.79±2.52               5.35±0.96          5.24±0.61
TARNet        6.92±0.41                7.12±0.29               4.17±1.39          4.24±2.19
5 Conclusion
This paper posits that existing data-fusion techniques are deficient in robustness, rendering them incapable of merging OS data with RCT data for HTE estimation when the OS training data is incomplete. In response to this issue, we present CIO, a resilient method designed to harness the advantages of both OS and RCT data. CIO circumvents the limitations of current data-fusion methods by effectively estimating HTEs without requiring complete OS datasets. To achieve this, we form pseudo-experimental and pseudo-control groups to train an effect estimator that serves as a confounding bias function, a novel way of assessing confounding bias.
To confirm the robustness and effectiveness of our method, we perform numerous experiments that explore various aspects: we examine the influence of RCT data volume, analyze the effect of confounding bias intensity, and investigate how the quantity of OS control data affects outcomes. These experiments are conducted on a synthetic dataset and two semi-synthetic datasets, which use real-world covariates and outcomes generated via specific strategies. Across all experiments, CIO's performance consistently surpasses that of the baseline methods it is compared with, irrespective of the dataset and architecture used. The consistent outperformance of CIO relative to other data-fusion methods affirms the effectiveness and robustness of our confounding bias estimator. This estimator, which calibrates the observed outcomes of OS data, proves to be powerful in merging RCT and OS data for estimating HTEs.
A Proofs
Proof 1. $\mathbb{E}(\epsilon_r \mid \mathbf{X}, T=0, S=1) = \mathbb{E}(Y \mid \mathbf{X}, T=0, S=1) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=1) = 0$.
Proof 2. $\mathbb{E}(\epsilon_r \mid \mathbf{X}, T=1, S=1) = \mathbb{E}(Y \mid \mathbf{X}, T=1, S=1) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=1) - \mathbb{E}(Y(1)-Y(0) \mid \mathbf{X}, S=1) = \mathbb{E}(Y(1) \mid \mathbf{X}, T=1, S=1) - \mathbb{E}(Y(0) \mid \mathbf{X}, T=0, S=1) - \mathbb{E}(Y(1)-Y(0) \mid \mathbf{X}, S=1) = \mathbb{E}(Y(1) \mid \mathbf{X}, S=1) - \mathbb{E}(Y(0) \mid \mathbf{X}, S=1) - \mathbb{E}(Y(1)-Y(0) \mid \mathbf{X}, S=1) = 0$.
Proof 3. $\mathbb{E}(\epsilon_o \mid \mathbf{X}, T=0, S=0) = \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) = 0$.
Proof 4. $\mathbb{E}(\epsilon_o \mid \mathbf{X}, T=1, S=0) = \mathbb{E}(Y \mid \mathbf{X}, T=1, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0) - [\mathbb{E}(Y \mid \mathbf{X}, T=1, S=0) - \mathbb{E}(Y \mid \mathbf{X}, T=0, S=0)] = 0$.
References
[1] Jeffrey A. Smith and Petra E. Todd. 2005. Does Matching Overcome LaLonde’s Critique of Nonexperimental Estimators? Journal of Econometrics 125, 1 (March 2005), 305–353. https://doi.org/10.1016/j.jeconom.2004.04.011
[2] Ahmed Alaa and Mihaela Schaar. 2018. Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 129–138.
[3] Ahmed M. Alaa and Mihaela van der Schaar. 2017.
Bayesian Inference of Individualized Treatment Effects Using Multi-task Gaussian Processes. https://doi.org/10.48550/arXiv.1704.02801 arXiv:1704.02801 [cs]
[4] Susan Athey. 2017. Beyond Prediction: Using Big Data for Policy Problems. Science 355, 6324 (Feb. 2017), 483–485. https://doi.org/10.1126/science.aal4321
[5] Susan Athey, Raj Chetty, and Guido Imbens. 2020. Combining Experimental and Observational Data to Estimate Treatment Effects on Long Term Outcomes. https://doi.org/10.48550/arXiv.2006.09676 arXiv:2006.09676 [econ, stat]
[6] Susan Athey, Julie Tibshirani, and Stefan Wager. 2018. Generalized Random Forests. https://doi.org/10.48550/arXiv.1610.01271 arXiv:1610.01271 [econ, stat]
[7] Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott. 2015. Inferring Causal Impact Using Bayesian Structural Time-Series Models. The Annals of Applied Statistics 9, 1 (March 2015). https://doi.org/10.1214/14-AOAS788
[8] David Cheng and Tianxi Cai. 2021. Adaptive Combination of Randomized and Observational Data. https://doi.org/10.48550/arXiv.2111.15012 arXiv:2111.15012 [stat]
[9] Bénédicte Colnet, Imke Mayer, Guanhua Chen, Awa Dieng, Ruohong Li, Gaël Varoquaux, Jean-Philippe Vert, Julie Josse, and Shu Yang. 2023. Causal Inference Methods for Combining Randomized Trials and Observational Studies: A Review. https://doi.org/10.48550/arXiv.2011.08047 arXiv:2011.08047 [stat]
[10] Irina Degtiar and Sherri Rose. 2023. A Review of Generalizability and Transportability. Annual Review of Statistics and Its Application 10, 1 (March 2023), 501–524. https://doi.org/10.1146/annurev-statistics-042522-103837 arXiv:2102.11904 [stat]
[11] AmirEmad Ghassami, Alan Yang, David Richardson, Ilya Shpitser, and Eric Tchetgen Tchetgen. 2022. Combining Experimental and Observational Data for Identification and Estimation of Long-Term Causal Effects. https://doi.org/10.48550/arXiv.2201.10743 arXiv:2201.10743 [econ, math, stat]
[12] Thomas A. Glass, Steven N.
Goodman, Miguel A. Hernán, and Jonathan M. Samet. 2013. Causal Inference in Public Health. Annual Review of Public Health 34, 1 (2013), 61–75. https://doi.org/10.1146/annurev-publhealth-031811-124606
[13] Jia Gu, Caizhi Tang, Han Yan, Qing Cui, Longfei Li, and Jun Zhou. 2023. FAST: A Fused and Accurate Shrinkage Tree for Heterogeneous Treatment Effects Estimation. Thirty-seventh Conference on Neural Information Processing Systems (2023).
[14] Margaret A. Hamburg and Francis S. Collins. 2010. The Path to Personalized Medicine. The New England Journal of Medicine 363, 4 (July 2010), 301–304. https://doi.org/10.1056/NEJMp1006304
[15] Tobias Hatt, Jeroen Berrevoets, Alicia Curth, Stefan Feuerriegel, and Mihaela van der Schaar. 2022. Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects. arXiv:2202.12891 [cs, stat]
[16] Jennifer L. Hill. 2011. Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and Graphical Statistics 20, 1 (Jan. 2011), 217–240. https://doi.org/10.1198/jcgs.2010.08162
[17] Fredrik D. Johansson, Nathan Kallus, Uri Shalit, and David Sontag. 2018. Learning Weighted Representations for Generalization Across Designs. https://doi.org/10.48550/arXiv.1802.08598 arXiv:1802.08598 [stat]
[18] Fredrik D Johansson, Uri Shalit, and David Sontag. [n. d.]. Learning Representations for Counterfactual Inference. ([n. d.]).
[19] Nathan Kallus, Aahlad Manas Puli, and Uri Shalit. 2018. Removing Hidden Confounding by Experimental Grounding. https://doi.org/10.48550/arXiv.1810.11646 arXiv:1810.11646 [cs, stat]
[20] Alan B. Krueger. 1999. Experimental Estimates of Education Production Functions. The Quarterly Journal of Economics 114, 2 (1999), 497–532. jstor:2587015
[21] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. 2019. Meta-Learners for Estimating Heterogeneous Treatment Effects Using Machine Learning. Proceedings of the National Academy of Sciences 116, 10 (March 2019), 4156–4165.
https://doi.org/10.1073/pnas.1804597116 arXiv:1706.03461 [math, stat]
[22] Milan Kuzmanovic, Tobias Hatt, and Stefan Feuerriegel. 2021. Deconfounding Temporal Autoencoder: Estimating Treatment Effects over Time Using Noisy Proxies. https://doi.org/10.48550/arXiv.2112.03013 arXiv:2112.03013 [cs, stat]
[23] Robert J. LaLonde. 1986. Evaluating the Econometric Evaluations of Training Programs with Experimental Data. The American Economic Review 76, 4 (1986), 604–620. jstor:1806062
[24] Christos Louizos, Uri Shalit, Joris Mooij, David Sontag, Richard Zemel, and Max Welling. 2017. Causal Effect Inference with Deep Latent-Variable Models. https://doi.org/10.48550/arXiv.1705.08821 arXiv:1705.08821 [cs, stat]
[25] Xinkun Nie and Stefan Wager. 2020. Quasi-Oracle Estimation of Heterogeneous Treatment Effects. https://doi.org/10.48550/arXiv.1712.04912 arXiv:1712.04912 [econ, math, stat]
[26] Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H. Shah, Trevor Hastie, and Robert Tibshirani. 2018. Some Methods for Heterogeneous Treatment Effect Estimation in High Dimensions. Statistics in Medicine 37, 11 (May 2018), 1767–1787. https://doi.org/10.1002/sim.7623
[27] James M. Robins, Miguel Ángel Hernán, and Babette Brumback. 2000. Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology 11, 5 (Sept. 2000), 550–560. https://doi.org/10.1097/00001648-200009000-00011
[28] Paul R. Rosenbaum and Donald B. Rubin. 1983. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika 70, 1 (1983), 41–55. https://doi.org/10.2307/2335942 jstor:2335942
[29] Evan Rosenman, Guillaume Basse, Art Owen, and Michael Baiocchi. 2020. Combining Observational and Experimental Datasets Using Shrinkage Estimators. https://doi.org/10.48550/arXiv.2002.06708 arXiv:2002.06708 [math, stat]
[30] Donald B. Rubin. 1974. Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies.
Journal of Educational Psychology 66, 5 (Oct. 1974), 688–701. https://doi.org/10.1037/h0037350
[31] Patrick Schwab, Lorenz Linhardt, and Walter Karlen. 2019. Perfect Match: A Simple Method for Learning Representations For Counterfactual Inference With Neural Networks. https://doi.org/10.48550/arXiv.1810.00656 arXiv:1810.00656 [cs, stat]
[32] Uri Shalit, Fredrik D. Johansson, and David Sontag. 2017. Estimating Individual Treatment Effect: Generalization Bounds and Algorithms. In Proceedings of the 34th International Conference on Machine Learning. PMLR, 3076–3085.
[33] Caizhi Tang, Huiyuan Wang, Xinyu Li, Qing Cui, Ya-Lin Zhang, Feng Zhu, Longfei Li, Jun Zhou, and Linbo Jiang. 2022. Debiased Causal Tree: Heterogeneous Treatment Effects Estimation with Unmeasured Confounding. Advances in Neural Information Processing Systems 35 (2022), 5628–5640.
[34] Stefan Wager and Susan Athey. 2017. Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests. https://doi.org/10.48550/arXiv.1510.04342 arXiv:1510.04342 [math, stat]
[35] Shu Yang. 2022. Integrative $R$-Learner of Heterogeneous Treatment Effects Combining Experimental and Observational Studies. In Proceedings of the First Conference on Causal Learning and Reasoning. PMLR, 904–926.
[36] Shu Yang and Peng Ding. 2021. Combining Multiple Observational Data Sources to Estimate Causal Effects. arXiv:1801.00802 [stat]
[37] Shu Yang, Donglin Zeng, and Xiaofei Wang. 2022. Improved Inference for Heterogeneous Treatment Effects Using Real-World Data Subject to Hidden Confounding. https://doi.org/10.48550/arXiv.2007.12922 arXiv:2007.12922 [stat]
[38] Liuyi Yao, Sheng Li, Yaliang Li, Mengdi Huai, Jing Gao, and Aidong Zhang. 2018. Representation Learning for Treatment Effect Estimation from Observational Data. In Advances in Neural Information Processing Systems, Vol. 31. Curran Associates, Inc.
[39] Jinsung Yoon and James Jordon. 2018.
GANITE: Estimation of Individualized Treatment Effects Using Generative Adversarial. (2018).
[40] Yao Zhang, Alexis Bellot, and Mihaela van der Schaar. 2020. Learning Overlapping Representations for the Estimation of Individualized Treatment Effects. https://doi.org/10.48550/arXiv.2001.04754 arXiv:2001.04754 [cs, stat]

---