Sunday, Aug 3: 4:00 PM - 5:50 PM
0672
Topic-Contributed Paper Session
Music City Center
Room: CC-207C
Applied
Yes
Main Sponsor
Biometrics Section
Co Sponsors
ENAR
Section on Statistics in Epidemiology
Presentations
Data collection costs can vary widely across variables in data science tasks. Two-phase designs are often employed to save data collection costs. In two-phase studies, inexpensive variables are collected for all subjects in the first phase, and expensive variables are measured for a subset of subjects in the second phase based on a predetermined sampling rule. The estimation efficiency under two-phase designs relies heavily on the sampling rule. Existing literature primarily focuses on designing sampling rules for estimating a scalar parameter in certain parametric models or specific estimation problems. However, in real-world scenarios the underlying model is usually unknown, and two-phase designs are needed for model-free estimation of a scalar or multi-dimensional parameter. In this talk, we present a maximin criterion to design an optimal sampling rule based on semiparametric efficiency bounds. The proposed method is model-free and applicable to general estimation problems. The resulting sampling rule minimizes the semiparametric efficiency bound when the parameter is scalar and improves the bound for every component when the parameter is multi-dimensional. Simulation studies demonstrate that the proposed designs reduce the variance of the resulting estimator in various settings. The implementation of the proposed design is illustrated in a real data analysis.
Keywords
Cost-effective sampling
Efficient influence function
Incomplete data
Semiparametric efficiency
Subsample
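To make the two-phase setup concrete, the following is a minimal Python sketch of the general idea only: a cheap variable observed for everyone, an expensive variable measured on a subsample chosen by a sampling rule, and an inverse-probability-weighted (Horvitz-Thompson) estimate. The sampling probabilities here are an arbitrary hypothetical choice, not the maximin-optimal rule proposed in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Phase 1: the cheap covariate X is observed for all n subjects.
x = rng.normal(size=n)
# The expensive variable Z is correlated with X but costly to measure.
z = 0.8 * x + rng.normal(scale=0.6, size=n)

# A hypothetical sampling rule: oversample subjects with extreme X values,
# which tend to be more informative. (Not the proposed maximin rule.)
pi = np.clip(0.1 + 0.3 * np.abs(x), 0.1, 1.0)
selected = rng.random(n) < pi

# Phase 2: Z is "measured" only on the selected subsample.
# Horvitz-Thompson (inverse-probability-weighted) estimate of E[Z].
ht_mean = np.sum(z[selected] / pi[selected]) / n
print(round(ht_mean, 3))  # close to the true mean of 0
```

The efficiency of `ht_mean` depends directly on how `pi` is chosen, which is exactly the design question the maximin criterion addresses.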
Two-phase designs are often used in large epidemiological or clinical studies with potentially censored time-to-event outcomes when certain covariates are too expensive to be collected on all participants. Important examples include the case-cohort design, which selects all cases and a random subcohort for the measurement of the expensive covariate, and the nested case-control design, which selects a small number of controls at each observed event time. Existing research on two-phase studies with time-to-event outcomes largely focuses on estimating time-fixed covariate effects. In this talk, we propose a semiparametric approach to estimate time-varying expensive covariate effects under two-phase sampling using B-splines. We devise a computationally efficient and numerically stable EM algorithm to maximize the semiparametric likelihood. In addition, we establish the consistency, asymptotic normality, and asymptotic efficiency of the estimators. Furthermore, we demonstrate the superiority of the proposed methods over existing ones through extensive simulation studies. Finally, we illustrate our method using data from a large cohort study, examining the association between oxidative stress and colorectal cancer incidence.
Keywords
Missing data
Biased sampling design
Case-cohort design
Nested case-control design
EM algorithm
Semiparametric efficiency
Speaker
Ran Tao, Vanderbilt University Medical Center
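The B-spline device in this abstract represents a time-varying effect as a finite linear combination of basis functions. Below is a small Python sketch of that representation alone, using scipy's `BSpline`; the knot locations and coefficient values are illustrative placeholders, since in the proposed method the coefficients would be estimated inside the EM algorithm.

```python
import numpy as np
from scipy.interpolate import BSpline

# Time grid over a hypothetical follow-up window [0, 5].
t = np.linspace(0, 5, 200)

# Cubic B-spline basis: interior knots plus boundary knots repeated
# (degree + 1) times, as scipy's BSpline expects for a clamped spline.
degree = 3
interior = np.array([1.0, 2.5, 4.0])
knots = np.concatenate(([0.0] * (degree + 1), interior, [5.0] * (degree + 1)))
n_basis = len(knots) - degree - 1  # number of basis functions (here 7)

# Spline coefficients gamma_k; arbitrary illustrative values standing in
# for quantities the EM algorithm would estimate.
gamma = np.array([0.5, 0.3, 0.0, -0.2, -0.4, 0.1, 0.2])
assert len(gamma) == n_basis

# Time-varying effect beta(t) = sum_k gamma_k B_k(t).
beta_t = BSpline(knots, gamma, degree)(t)
print(beta_t.shape)
```

The key payoff is dimension reduction: estimating the smooth curve beta(t) reduces to estimating the finite coefficient vector gamma within the semiparametric likelihood.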
In studies of time-to-event outcomes, it often happens that a fraction of subjects will never experience the event of interest; these subjects are said to be cured. Studies with a cure fraction often yield a low event rate. To reduce cost and enhance study power, two-phase sampling designs are often adopted, especially when the covariates of interest are expensive to measure or obtain. In this paper, we consider the generalized case-cohort design for studies with a cure fraction. Under this design, the expensive covariates are measured for a subset of the study cohort, called the subcohort, and for all or a subset of the remaining subjects outside the subcohort who have experienced the event during the study, called the cases. We propose a two-step estimation procedure under a class of semiparametric transformation mixture cure models. We first develop a sieve maximum weighted likelihood method based only on the complete data and devise an EM algorithm for its implementation. We then update the resulting estimator via a working model between the outcome and cheap covariates or auxiliary variables using the full data. We show that the proposed updated estimator is consistent and asymptotically at least as efficient as the complete-data estimator, regardless of whether the working model is correctly specified. We also propose a weighted bootstrap procedure for variance estimation. Extensive simulation studies demonstrate the superior performance of the proposed method in finite samples. An application to the National Wilms' Tumor Study is provided for illustration.
Keywords
Auxiliary variable
Missing data
Mixture cure model
Robust estimation
Semiparametric inference
Survival analysis
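The weighted bootstrap mentioned in this abstract perturbs each subject's contribution with i.i.d. positive random weights rather than resampling subjects. As a minimal Python sketch of that mechanism, here it is applied to a plain sample mean (a stand-in for the actual two-step estimator, which is far more involved):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Stand-in complete-data statistic; the real estimator would come from
# the sieve maximum weighted likelihood procedure.
u = rng.normal(loc=2.0, scale=1.0, size=n)

def weighted_mean(values, weights):
    return np.sum(weights * values) / np.sum(weights)

point_estimate = weighted_mean(u, np.ones(n))

# Weighted bootstrap: replace resampling with i.i.d. positive weights of
# mean 1 (exponential here) and recompute the estimator each time.
B = 400
replicates = np.empty(B)
for b in range(B):
    w = rng.exponential(scale=1.0, size=n)
    replicates[b] = weighted_mean(u, w)

# The variance of the replicates estimates the sampling variance of the
# point estimate (for the sample mean, about sigma^2 / n = 1/500).
variance_estimate = replicates.var(ddof=1)
print(point_estimate, variance_estimate)
```

Avoiding resampling keeps every subject in every replicate, which is convenient when the estimator involves iterative fitting such as an EM algorithm.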
Case-cohort studies are widely used as a cost-effective sampling strategy. It is often of interest to analyze the association between secondary responses and the main exposures in a case-cohort study, but the analysis of secondary responses using case-cohort data is not well studied. We propose a joint model of the time-to-event survival outcome, the continuous secondary responses, and the multivariate mixed-type expensive exposures. Specifically, a Cox proportional hazards model, a multivariate linear regression model, and a semiparametric density ratio model are assumed for the failure time, the secondary responses, and the expensive exposures, respectively. The density ratio model is flexible in modeling multivariate mixed-type data without specifying the baseline distribution function. We develop nonparametric maximum likelihood-based estimation and inference procedures. The resulting nonparametric maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Extensive simulation studies demonstrate that the asymptotic approximations are accurate under practical settings. The proposed methods are also shown to be reasonably robust to some model misspecifications. We apply the proposed methods to the National Wilms Tumor Study data.
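A density ratio model posits f1(z)/f0(z) = exp(alpha + beta'z) between two groups without specifying either baseline density. A well-known consequence, sketched below in Python under a simple assumed setting (two equal-variance normals, which satisfy the model exactly), is that the ratio parameters can be estimated by logistic regression of group membership on Z. This is only a one-dimensional illustration of the density ratio component, not the full joint model of the abstract.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Two groups whose densities satisfy a log-linear density ratio:
# f1(z)/f0(z) = exp(alpha + beta * z). For N(0,1) vs N(1,1),
# beta = (mu1 - mu0) / sigma^2 = 1.0.
n0, n1 = 1000, 1000
z0 = rng.normal(0.0, 1.0, n0)   # group 0 ("controls")
z1 = rng.normal(1.0, 1.0, n1)   # group 1 ("cases"); true beta = 1.0
z = np.concatenate([z0, z1])
y = np.concatenate([np.zeros(n0), np.ones(n1)])

# Logistic-regression negative log-likelihood for group membership on Z;
# its slope estimates the density ratio parameter beta.
def negloglik(theta):
    eta = theta[0] + theta[1] * z
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

fit = minimize(negloglik, x0=np.zeros(2), method="BFGS")
alpha_hat, beta_hat = fit.x
print(round(beta_hat, 2))  # should be near the true slope 1.0
```

The intercept absorbs the sampling fractions of the two groups, which is what makes this formulation natural for biased designs such as case-cohort sampling.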
We consider a common nonparametric regression setting where the data consist of a response variable Y, some easily obtainable covariates X, and a set of costly covariates Z. Prior to large-scale data collection for developing a model to predict Y with (X, Z), we wish to conduct preliminary investigations to infer the importance of Z for predicting Y given X. To achieve this goal, we propose a nonparametric variable importance measure for Z, defined as a population parameter that quantifies the contribution through general loss functions. Considering two-phase data that consist of a large number of observations for (Y, X) with Z being measured only in a relatively small subsample, we propose a novel semiparametric method for estimating the proposed importance measure. Our method accommodates the missing Z for each individual in the two-phase data by imputing their contribution to the loss function. Our imputation method, inspired by its similarity to semi-supervised learning methods, involves challenging two-stage nonparametric estimation. We establish theoretical results and demonstrate the performance of our method via extensive numerical results.
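The importance parameter described here contrasts the expected loss of predicting Y from X alone with that of predicting Y from (X, Z). The Python sketch below computes a plug-in version of this contrast under squared-error loss on fully observed simulated data with an assumed linear data-generating model; it illustrates the target parameter only, not the proposed two-phase semiparametric estimator or its imputation step.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000

# Hypothetical data-generating model: cheap covariate X, costly covariate
# Z carrying genuine signal beyond X, and response Y.
x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)
y = x + 1.0 * z + rng.normal(size=n)

def lstsq_predict(design, response):
    # Fitted values from an ordinary least-squares fit.
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return design @ coef

ones = np.ones(n)
pred_x = lstsq_predict(np.column_stack([ones, x]), y)       # approx E[Y | X]
pred_xz = lstsq_predict(np.column_stack([ones, x, z]), y)   # approx E[Y | X, Z]

# Importance of Z under squared-error loss: the drop in expected loss
# when Z is added to the predictor set (about 1.0 in this model).
mse_x = np.mean((y - pred_x) ** 2)
mse_xz = np.mean((y - pred_xz) ** 2)
importance = mse_x - mse_xz
print(round(importance, 2))
```

In the two-phase setting of the abstract, `pred_xz` and its loss contribution are unavailable for subjects missing Z, which is precisely what the proposed imputation-based estimator addresses.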