Thursday, Aug 7: 10:30 AM - 12:20 PM
0663
Topic-Contributed Paper Session
Music City Center
Room: CC-102B
Applied: Yes
Main Sponsor: Biometrics Section
Co Sponsors: ENAR; Section on Statistics in Epidemiology
Presentations
Randomized controlled trials (RCTs) are widely regarded as the gold standard for causal inference in biomedical research. For instance, when estimating the average treatment effect on the treated (ATT), a doubly robust estimation procedure can be applied, requiring either the propensity score model or the control outcome model to be correctly specified. In this paper, we address scenarios where external control data, often with a much larger sample size, are available. Such data are typically easier to obtain from historical records or third-party sources. However, we find that incorporating external controls into the standard doubly robust estimator for ATT may paradoxically result in reduced efficiency compared to using the estimator without external controls. This counterintuitive outcome suggests that the naive incorporation of external controls could be detrimental to estimation efficiency. To resolve this issue, we propose a novel doubly robust estimator that guarantees higher efficiency than the standard approach without external controls, even under model misspecification. When all models are correctly specified, this estimator aligns with the standard doubly robust estimator that incorporates external controls and achieves semiparametric efficiency. The asymptotic theory developed in this work applies to high-dimensional confounder settings, which are increasingly common with the growing prevalence of electronic health record data. We demonstrate the effectiveness of our methodology through extensive simulation studies and a real-world data application.
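For readers unfamiliar with the baseline this abstract refers to, the following is a minimal sketch of the standard doubly robust ATT estimator without external controls. It is not the estimator proposed in the talk. The function name `dr_att`, the argument names `e_hat` and `m0_hat`, and the simulated data-generating process are all assumptions made for illustration; the nuisance values are taken as already fitted.

```python
import numpy as np

def dr_att(y, a, e_hat, m0_hat):
    """Standard doubly robust ATT estimate from fitted nuisance values.

    y: outcomes; a: treatment indicators (1 = treated, 0 = control);
    e_hat: fitted propensity scores P(A = 1 | X);
    m0_hat: fitted control outcome means E[Y | A = 0, X].
    """
    y, a = np.asarray(y, dtype=float), np.asarray(a, dtype=float)
    e_hat, m0_hat = np.asarray(e_hat, dtype=float), np.asarray(m0_hat, dtype=float)
    resid = y - m0_hat                 # residual from the control outcome model
    odds = e_hat / (1.0 - e_hat)       # reweights controls toward the treated
    return float(np.sum(a * resid - (1.0 - a) * odds * resid) / np.sum(a))

# Simulated check under a hypothetical data-generating process (true ATT = 2).
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
e_true = 1.0 / (1.0 + np.exp(-0.5 * x))     # assumed true propensity score
a = rng.binomial(1, e_true)
y = 1.0 + x + 2.0 * a + rng.normal(size=n)  # control mean is 1 + x
att_hat = dr_att(y, a, e_true, 1.0 + x)     # both nuisance models correct here
```

The estimate remains consistent if only one of `e_hat` or `m0_hat` is correct, which is the double robustness property the abstract mentions.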
Semiparametric functional response models for between-subject, or pairwise, attributes are attracting attention owing to their robustness in inference, feasibility in computation, and versatility in interpreting rank-based and high-dimensional variables. This paper develops the semiparametric efficiency bound for this emerging class of regressions, whose modeling units are pairwise observations that violate independence. To handle the correlations among these observations when making inferences, U-statistics-based generalized estimating equations (UGEE) have previously been established to provide consistent and asymptotically normal estimators. These estimators have performed well in simulations and in various applications, including microbiome and neuroimaging studies. However, a vital gap is their understudied asymptotic efficiency, which is the key to ensuring optimality and sensitivity in signal detection. Although semiparametric efficiency has been thoroughly studied in the traditional setting of i.i.d. observations, the efficiency bound for pairwise attributes remains open. By enriching the theory built upon Hilbert spaces, we show that UGEE estimators are asymptotically efficient. Our theory will not only fill the critical gap in efficiency for this emerging model class but also propel applications that use this optimal inference technique. More importantly, this work will serve as a building block for future efficiency development in more complex settings such as missing data in longitudinal studies.
Complex survey data are essential in modern experimental design and healthcare research, enabling cost-effective sampling while mitigating selection bias and improving estimators' statistical efficiency. However, current statistical methodologies offer few nonlinear approaches specifically tailored to complex survey designs—particularly those that model the conditional distribution of multivariate continuous outcomes and possibly their functionals. To bridge this gap, we introduce a novel distributional random forest regression algorithm equipped with strong theoretical guarantees. We illustrate the practical utility of this new algorithm with various biomedical examples from the American NHANES survey cohort.
Speaker
Yating Zou, University of North Carolina at Chapel Hill
Regression models for continuous outcomes frequently require a transformation, which is often specified a priori or estimated from a parametric family. Cumulative probability models (CPMs) nonparametrically estimate the transformation by treating the continuous outcome as if it were ordinal. These models are especially useful for analyzing skewed outcomes and mixed outcomes (e.g., data with detection limits). The models give fitted distributions, from which the fitted mean, median, and quantiles can be computed. We now extend CPMs by making the regressor side of the model nonparametric. Specifically, we have implemented random forest CPMs, in which each tree partitions the regressor space into terminal nodes and a CPM is then fit on the partition. The fitted cumulative distribution functions are averaged across trees to obtain the final fitted distribution. We will evaluate the performance of this approach and demonstrate it in data applications.
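The averaging step described above can be sketched as follows. This is an illustration only, not the speakers' implementation: it assumes each tree contributes a fitted CDF evaluated on a shared grid of outcome values, and it uses the empirical CDF of the outcomes in a terminal node as a stand-in for the per-node CPM fit. All function names are hypothetical.

```python
import numpy as np

def ecdf_on_grid(sample, grid):
    """Empirical CDF of `sample` evaluated at each point of the sorted `grid`."""
    sample = np.sort(np.asarray(sample, dtype=float))
    return np.searchsorted(sample, grid, side="right") / sample.size

def average_cdfs(cdfs):
    """Pointwise average of per-tree CDFs, each given on the same grid."""
    return np.mean(np.asarray(cdfs, dtype=float), axis=0)

def quantile_from_cdf(grid, cdf, p):
    """Smallest grid value whose CDF is at least p."""
    idx = np.searchsorted(cdf, p, side="left")
    return grid[min(idx, len(grid) - 1)]

def mean_from_cdf(grid, cdf):
    """Mean of the discrete distribution that places its mass on the grid."""
    mass = np.diff(np.concatenate(([0.0], cdf)))
    return float(np.sum(grid * mass))

# Two hypothetical trees: the outcomes that share a terminal node with the
# target point differ from tree to tree.
grid = np.array([1.0, 2.0, 3.0, 4.0])
tree_cdfs = [ecdf_on_grid([1.0, 2.0], grid), ecdf_on_grid([3.0, 4.0], grid)]
forest_cdf = average_cdfs(tree_cdfs)
fitted_median = quantile_from_cdf(grid, forest_cdf, 0.5)
fitted_mean = mean_from_cdf(grid, forest_cdf)
```

Because the averaged object is a full distribution, the mean, median, and other quantiles all come from the same `forest_cdf` without refitting.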
An increased number of outliers presents a major challenge in data analysis, yielding uninterpretable and often biased results when using mean-based statistical models. In causal inference, the classic average causal effect is defined as the population mean difference between counterfactual outcomes under treatment and control, so it inherits this sensitivity to outliers. Causal inference for the Mann-Whitney-Wilcoxon rank sum test (MWWRST) has therefore been proposed to address these outlier-related challenges and provide more robust inference. However, these methods assume logistic regression for outcome modeling, which imposes restrictions when the number of covariates increases or when complex interactions are present. In this talk, I will introduce a novel approach to estimating the causal effects for MWWRST using random forests. By extending the causal forest framework of Wager and Athey (2018), we demonstrate that our estimator is pointwise consistent and asymptotically normal, providing a flexible and robust alternative for causal inference in nonparametric settings.
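To make the rank-based idea concrete, here is a minimal sketch of the unadjusted Mann-Whitney quantity, the proportion of treated-control pairs in which the treated outcome is larger (ties counted as one half). It assumes a simple randomized comparison with no covariate adjustment and is not the forest-based estimator described in the talk; the function name `mww_probability` is hypothetical.

```python
import numpy as np

def mww_probability(y_treated, y_control):
    """U-statistic estimate of P(Y1 > Y0) + 0.5 * P(Y1 = Y0) over all pairs."""
    y1 = np.asarray(y_treated, dtype=float)[:, None]   # column of treated outcomes
    y0 = np.asarray(y_control, dtype=float)[None, :]   # row of control outcomes
    return float(np.mean((y1 > y0) + 0.5 * (y1 == y0)))

# The estimate depends only on the ordering of outcomes, so turning the largest
# treated value into an extreme outlier leaves it unchanged, unlike a mean difference.
base = mww_probability([3.0, 4.0], [1.0, 2.0, 3.0])
with_outlier = mww_probability([3.0, 4000.0], [1.0, 2.0, 3.0])
```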
Keywords
Mann-Whitney-Wilcoxon rank sum test
U-statistics
Functional response models
Speaker
Tuo Lin, University of Florida