Novel Statistical Methods for the Design and Analysis of Two-Phase Studies

Ran Tao Chair
Vanderbilt University Medical Center
 
Ran Tao Organizer
Vanderbilt University Medical Center
 
Sunday, Aug 3: 4:00 PM - 5:50 PM
0672 
Topic-Contributed Paper Session 
Music City Center 
Room: CC-207C 

Applied

Yes

Main Sponsor

Biometrics Section

Co-Sponsors

ENAR
Section on Statistics in Epidemiology

Presentations

A Maximin Optimal Approach for Sampling Designs in Two-phase Studies

Data collection costs can vary widely across variables in data science tasks, and two-phase designs are often employed to reduce these costs. In two-phase studies, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured for a subset of subjects in the second phase based on a predetermined sampling rule. The estimation efficiency under two-phase designs relies heavily on the sampling rule. Existing literature primarily focuses on designing sampling rules for estimating a scalar parameter in some parametric models or specific estimation problems. However, in real-world scenarios the model is usually unknown, and two-phase designs are needed for model-free estimation of a scalar or multi-dimensional parameter. In this talk, we present a maximin criterion to design an optimal sampling rule based on semiparametric efficiency bounds. The proposed method is model-free and applicable to general estimation problems. The resulting sampling rule can minimize the semiparametric efficiency bound when the parameter is scalar and improve the bound for every component when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce the variance of the resulting estimator in various settings. The implementation of the proposed design is illustrated in a real data analysis.
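As a rough illustration of the two-phase setup described in this abstract (not of the maximin design proposed in the talk), the following Python sketch simulates a cohort, applies a fixed sampling rule that depends on a cheap phase-one variable, and estimates the mean of the expensive variable by inverse probability weighting. The variable names, the sampling probabilities, and the data-generating model are all assumptions made for illustration.

```python
import random


def two_phase_ipw_mean(n=20000, seed=0):
    """Simulate a two-phase study and estimate E[Z] by inverse probability weighting.

    Hypothetical setup (not from the abstract): X is a cheap binary variable
    seen for everyone, Z is an expensive variable measured only in phase 2.
    """
    rng = random.Random(seed)
    weight_sum = 0.0      # sum of 1/pi over phase-2 subjects
    weighted_z_sum = 0.0  # sum of Z/pi over phase-2 subjects
    n_phase2 = 0
    for _ in range(n):
        # Phase 1: cheap variable, collected for all subjects.
        x = 1 if rng.random() < 0.2 else 0
        # Expensive variable; it exists for everyone but is measured only if sampled.
        z = rng.gauss(2.0 if x == 1 else 0.0, 1.0)
        # Predetermined sampling rule (assumed): oversample the rarer stratum X = 1.
        pi = 0.5 if x == 1 else 0.1
        if rng.random() < pi:
            n_phase2 += 1
            weight_sum += 1.0 / pi
            weighted_z_sum += z / pi
    # Normalized (Hajek-type) inverse-probability-weighted estimate of E[Z].
    return weighted_z_sum / weight_sum, n_phase2


estimate, n_phase2 = two_phase_ipw_mean()
print(f"IPW estimate of E[Z]: {estimate:.3f} (true value 0.4), phase-2 size: {n_phase2}")
```

Changing the two values of `pi` changes the variance of the estimate at a given phase-two sample size, which is the kind of trade-off a design criterion would aim to optimize.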

Keywords

Cost-effective sampling

Efficient influence function

Incomplete data

Semiparametric efficiency

Subsample 

Speaker

Ruoyu Wang, Harvard University

Efficient Estimation of the Cox Model with Time-Varying Effects Under Two-Phase Designs

Two-phase designs are often used in large epidemiological or clinical studies with potentially censored time-to-event outcomes when certain covariates are too expensive to be collected on all participants. Important examples include the case-cohort design, which selects all cases and a random subcohort for the measurement of the expensive covariate, and the nested case-control design, which selects a small number of controls at each observed event time. Existing research on two-phase studies with time-to-event outcomes largely focuses on estimating time-fixed covariate effects. In this talk, we propose a semiparametric approach that uses B-splines to estimate time-varying effects of expensive covariates under two-phase sampling. We devise a computationally efficient and numerically stable EM algorithm to maximize the semiparametric likelihood. In addition, we establish the consistency, asymptotic normality, and asymptotic efficiency of the estimators. Furthermore, we demonstrate the superiority of the proposed methods over existing ones through extensive simulation studies. Finally, we apply our method to data from a large cohort study to examine the association between oxidative stress and colorectal cancer incidence.
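The case-cohort selection rule mentioned in this abstract (all cases plus a random subcohort) can be sketched in a few lines of Python. This is only a minimal illustration of the sampling step, not the estimation method of the talk; the function name, the Bernoulli subcohort sampling, and the simulated event indicators are assumptions.

```python
import random


def case_cohort_sample(events, subcohort_fraction, seed=0):
    """Return (phase-2 indices, subcohort indices) for a case-cohort design.

    events: list of 0/1 event indicators, one per cohort member.
    subcohort_fraction: hypothetical Bernoulli sampling probability for the subcohort.
    """
    rng = random.Random(seed)
    # Random subcohort, drawn without regard to case status.
    subcohort = {i for i in range(len(events)) if rng.random() < subcohort_fraction}
    # All cases are selected, whether or not they fall in the subcohort.
    cases = {i for i, d in enumerate(events) if d == 1}
    return sorted(subcohort | cases), sorted(subcohort)


# Toy cohort with a rare event (about 5% cases); values are simulated, not real data.
rng = random.Random(1)
events = [1 if rng.random() < 0.05 else 0 for _ in range(1000)]
phase2, subcohort = case_cohort_sample(events, subcohort_fraction=0.1)
print(len(phase2), "of", len(events), "subjects get the expensive covariate measured")
```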

Keywords

Missing data

Biased sampling design

Case-cohort design

Nested case-control design

EM algorithm

Semiparametric efficiency 

Speaker

Ran Tao, Vanderbilt University Medical Center

Improving Estimation Efficiency for Case-cohort Studies with a Cure Fraction

In studies of time-to-event outcomes, it often happens that a fraction of subjects will never experience the event of interest, and these subjects are said to be cured. Studies with a cure fraction often yield a low event rate. To reduce cost and enhance study power, two-phase sampling designs are often adopted, especially when the covariates of interest are expensive to measure or obtain. In this paper, we consider the generalized case-cohort design for studies with a cure fraction. Under this design, the expensive covariates are measured for a subset of the study cohort, called the subcohort, and for all or a subset of the remaining subjects outside the subcohort who have experienced the event during the study, called cases. We propose a two-step estimation procedure under a class of semiparametric transformation mixture cure models. We first develop a sieve maximum weighted likelihood method based only on the complete data and also devise an EM algorithm for implementation. We then update the resulting estimator via a working model between the outcome and cheap covariates or auxiliary variables using the full data. We show that the proposed updated estimator is consistent and asymptotically at least as efficient as the complete-data estimator, regardless of whether the working model is correctly specified or not. We also propose a weighted bootstrap procedure for variance estimation. Extensive simulation studies demonstrate the superior performance of the proposed method in finite samples. An application to the National Wilms' Tumor Study is provided for illustration.
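To make the notion of a cure fraction concrete, the Python sketch below evaluates the population survival function of a simple mixture cure model, in which cured subjects never experience the event and susceptible subjects follow an ordinary survival distribution. The logistic incidence model, the exponential latency distribution, and all parameter values are assumptions chosen for illustration; the talk considers a more general class of semiparametric transformation models.

```python
import math


def mixture_cure_survival(t, x, gamma0=-0.5, gamma1=1.0, rate=0.3):
    """Population survival P(T > t | x) under a simple mixture cure model.

    Assumed (illustrative) components, not those of the talk:
      - logistic model for the probability of being susceptible (uncured);
      - exponential survival with a fixed rate among susceptible subjects.
    """
    p_susceptible = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * x)))
    surv_susceptible = math.exp(-rate * t)
    # Cured subjects never experience the event, so they contribute survival 1.
    return (1.0 - p_susceptible) + p_susceptible * surv_susceptible


for t in (0.0, 5.0, 50.0):
    print(t, round(mixture_cure_survival(t, x=1.0), 4))
```

The survival curve levels off at the cure probability instead of decaying to zero, which is why such studies tend to have low event rates.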

Keywords

Auxiliary variable

Missing data

Mixture cure model

Robust estimation

Semiparametric inference

Survival analysis 

Co-Author

Xu Cao, University of California at Riverside

Speaker

Qingning Zhou

Joint Semiparametric Regression Models for Secondary Responses in Case-Cohort Studies

Case-cohort studies are widely used as a cost-effective sampling strategy. It is often of interest to analyze the association between the secondary responses and the main exposures in a case-cohort study. However, the analysis of secondary responses using case-cohort data has not been well studied. We propose a joint model of the time-to-event survival outcome, the continuous secondary responses, and the multivariate mixed-type expensive exposures. Specifically, a Cox proportional hazards model, a multivariate linear regression model, and a semiparametric density ratio model are assumed for the failure time, the secondary responses, and the expensive exposures, respectively. The density ratio model is flexible in modeling multivariate mixed-type data without specifying the baseline distribution function. We develop nonparametric maximum likelihood-based estimation and inference procedures. The resulting nonparametric maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Extensive simulation studies demonstrate that the asymptotic approximations are accurate under practical settings. The proposed methods are also shown to be reasonably robust to some model misspecifications. We apply the proposed methods to the National Wilms' Tumor Study data.

Co-Author

Weibin Zhong, Regeneron

Speaker

Guoqing Diao, George Washington University

Valid and Efficient Inference for Nonparametric Variable Importance in Two-Phase Studies

We consider a common nonparametric regression setting where the data consist of a response variable Y, some easily obtainable covariates X, and a set of costly covariates Z. Prior to large-scale data collection for developing a model to predict Y with (X, Z), we wish to conduct preliminary investigations to infer the importance of Z for predicting Y given X. To achieve this goal, we propose a nonparametric variable importance measure for Z, defined as a population parameter that quantifies the contribution through general loss functions. Considering two-phase data that consist of a large number of observations for (Y, X) with Z being measured only in a relatively small subsample, we propose a novel semiparametric method for estimating the proposed importance measure. Our method accommodates the missing Z for each individual in the two-phase data by imputing their contribution to the loss function. Our imputation method, inspired by similarities with semi-supervised learning methods, involves challenging two-stage nonparametric estimation. We establish theoretical results and demonstrate the performance of our method via extensive numerical studies.
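One common way to formalize a loss-based importance measure is as the reduction in expected loss when Z is added to X as a predictor. The abstract does not give the exact definition, so the Python sketch below should be read as an assumed instance with squared-error loss, binary covariates, and fully observed data. It does not implement the two-phase imputation method of the talk, and the simulated model is hypothetical.

```python
import random
from collections import defaultdict


def in_sample_squared_loss(keys, y):
    """Mean squared error when Y is predicted by its sample mean within each key cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, yi in zip(keys, y):
        sums[k] += yi
        counts[k] += 1
    return sum((yi - sums[k] / counts[k]) ** 2 for k, yi in zip(keys, y)) / len(y)


rng = random.Random(0)
n = 20000
x = [rng.randint(0, 1) for _ in range(n)]  # cheap covariate
z = [rng.randint(0, 1) for _ in range(n)]  # costly covariate (fully observed here)
# Assumed data-generating model: Y = X + 2Z + standard normal noise.
y = [xi + 2.0 * zi + rng.gauss(0.0, 1.0) for xi, zi in zip(x, z)]

loss_x = in_sample_squared_loss(x, y)                 # predict Y from X only
loss_xz = in_sample_squared_loss(list(zip(x, z)), y)  # predict Y from (X, Z)
importance = loss_x - loss_xz
print(f"Estimated importance of Z given X: {importance:.3f}")
```

Under the assumed model the population value of this difference is Var(2Z) = 1, so the printed estimate should be close to 1. In a two-phase study Z would be missing for most subjects, which is the gap the proposed imputation approach addresses.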

Speaker

Jinbo Chen, University of Pennsylvania