Measurement error and missing data problems

Chair

Molin Wang, Harvard T.H. Chan School of Public Health
 
Tuesday, Aug 5: 10:30 AM - 12:20 PM
4110 
Contributed Papers 
Music City Center 
Room: CC-Davidson Ballroom A2 

Main Sponsor

Section on Statistics in Epidemiology

Presentations

Assessment of Methods for Handling Missing Data Using NIDA Clinical Trials Studies on Substance Use

Missing data are inevitable in clinical trials and may bias analyses. Here we describe an analysis of missing data in eight National Institute on Drug Abuse (NIDA) clinical trials of substance use disorder (SUD). Rates and patterns of missingness in longitudinal urine drug screen (UDS) data were compared across trials, and predictors of missingness were assessed. Replicate datasets were synthesized using classification and regression trees and then analyzed either directly with maximum likelihood (ML) or after multiple imputation (MI). Missingness in UDS was 33% overall (15-52% per study); 28% of participants had no missing data, but some had up to 90%. Most participants (83%) had only intermittent missingness (p<0.001), although dropouts occurred in all studies. Missingness was more common in females (p=0.042) and younger participants (p<0.001). On the synthetic data, MI and ML gave similar results, although ML required fewer assumptions and was more efficient overall. We show that although missing outcome data arise through both random and non-random mechanisms, consistent predictors of missingness exist, and ML is an efficient approach for handling missing values.
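
To fix ideas, the sketch below contrasts the two strategies compared in the abstract: direct likelihood-based estimation on the observed data versus multiple imputation pooled with Rubin's rules. It is a minimal sketch, not the study's analysis code; a linear mixed model stands in for the actual UDS analysis, and the file name and columns (uds, week, arm, subj) are hypothetical.

```python
# Minimal sketch, not the study's analysis code: direct ML via a linear mixed
# model (valid under MAR) vs. multiple imputation pooled with Rubin's rules.
# File name and columns (uds, week, arm, subj) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.imputation.mice import MICEData

df = pd.read_csv("trial_long.csv")        # one row per subject-visit

# --- Direct ML: mixed model fit to the observed data only ---
obs = df.dropna(subset=["uds"])
ml_fit = smf.mixedlm("uds ~ week * arm", data=obs, groups=obs["subj"]).fit()

# --- MI: chained-equations imputation, analyze each dataset, then pool ---
imp = MICEData(df[["uds", "week", "arm"]])
ests, variances = [], []
for _ in range(20):                       # 20 imputed datasets
    imp.update_all()                      # one imputation cycle
    fit = sm.OLS.from_formula("uds ~ week * arm", data=imp.data).fit()
    ests.append(fit.params["week:arm"])
    variances.append(fit.bse["week:arm"] ** 2)

m, qbar = len(ests), np.mean(ests)        # pooled point estimate
total_var = np.mean(variances) + (1 + 1 / m) * np.var(ests, ddof=1)  # Rubin's rules
print(ml_fit.params["week:arm"], qbar, np.sqrt(total_var))
```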

Keywords

missing data

maximum likelihood

clinical trials

substance use disorder

imputation 

Co-Author(s)

Amy Hahn, The Emmes Company
Ashley Vena, The Emmes Company
Abigail Matthews, The Emmes Company
Kathryn Hefner, The Emmes Company

First Author

Michael Otterstatter, The Emmes Company

Presenting Author

Michael Otterstatter, The Emmes Company

Functional linear regression model with an error-prone zero-inflated functional predictor

Measures of physical activity derived from accelerometers are widely used to monitor physical activity behavior, but the data may contain measurement error. Due to sedentary behavior, non-wear, or device malfunction, the data may also contain excess zeros. Limited options exist for analyzing zero-inflated functional data measured with error. Prior estimation methods assumed that the zero-inflated data were observed without error and relied on specific marginal distributions, such as a mixture of a degenerate distribution and a Gaussian distribution; these methods therefore cannot reduce the bias arising from error-prone zero-inflated functional data. We propose semi-parametric Bayesian approaches that incorporate more flexible marginal distributions and priors while accounting for measurement error. We conduct simulations and sensitivity analyses to assess the performance of the proposed methods and compare them with current approaches. The proposed method reduces bias due to measurement error across the simulation settings considered. We apply our methods to investigate the relationship between school-based physical activity and body mass index.
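
As a point of reference for the data structure the abstract describes, the sketch below simulates a zero-inflated functional covariate observed with error and fits a naive scalar-on-function regression by basis expansion. It is not the proposed semi-parametric Bayesian estimator, only an illustration of the setting (and of the naive fit such methods correct); all quantities are simulated and hypothetical.

```python
# Sketch of the data structure only: a zero-inflated functional covariate
# observed with measurement error, feeding a scalar-on-function regression.
# This is NOT the semi-parametric Bayesian estimator from the abstract.
import numpy as np

rng = np.random.default_rng(0)
n, T = 200, 100
t = np.linspace(0, 1, T)

# True activity curves: smooth positive signal, zeroed at random (non-wear)
X = np.abs(np.sin(2 * np.pi * np.outer(rng.uniform(0.5, 1.5, n), t)))
X[rng.random((n, T)) < 0.3] = 0.0           # zero inflation
W = X + rng.normal(0, 0.2, (n, T))          # error-prone observed curves

beta = np.cos(np.pi * t)                    # true coefficient function
y = X @ beta / T + rng.normal(0, 0.1, n)    # scalar outcome

# Naive fit using W: project onto a small basis, then least squares
B = np.column_stack([np.ones(T), np.sin(2 * np.pi * t), np.cos(np.pi * t)])
Z = W @ B / T
coef = np.linalg.lstsq(Z, y, rcond=None)[0]
print("naive basis coefficients:", coef)    # biased relative to the truth
```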

Keywords

measurement error

zero-inflation

functional data

physical activity

wearable device

accelerometer 

Co-Author(s)

Roger Zoh, Indiana University
Ufuk Beyaztas, Marmara University
Lan Xue, Oregon State University
Mark Benden, Texas A&M University
Carmen Tekwe, Indiana University

First Author

Heyang Ji, Indiana University

Presenting Author

Heyang Ji, Indiana University

Measurement errors in Gaussian mixture models using high-dimensional air pollution constituents data

Assessing the impact of air pollution components on health outcomes is a crucial aspect of air pollution epidemiology. It is well established that exposure to fine particulate matter (PM2.5) can cause a variety of adverse health outcomes, but the specific impacts of PM2.5 constituents are less well studied. Statistical analyses of air pollution data are subject to bias due to measurement error in the exposures. We propose a novel approach to correct the bias arising from measurement errors in multiple correlated air pollutants with a complex correlated error structure. Within the framework of main and external validation designs, our method specifies a joint model for the data-generating process that integrates a Gaussian mixture model, a measurement error model, and an outcome model. We employ the expectation-maximization (EM) algorithm to estimate all parameters in the joint model simultaneously. We illustrate the method in a study of the association between air pollution exposures and cognitive function in the Nurses' Health Study (NHS).
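
The joint EM model itself is involved; as a much simpler point of reference, the sketch below applies regression calibration in the same main study / external validation design the abstract uses, correcting the naive exposure-outcome slope with a calibration model fit in the validation data. All numbers and variable names are simulated and hypothetical.

```python
# Sketch: a main/external-validation measurement-error correction using simple
# regression calibration -- far simpler than the joint Gaussian mixture / EM
# model in the abstract, shown only to fix ideas.
import numpy as np

rng = np.random.default_rng(1)

# External validation study: true exposure x and its error-prone surrogate w
x_val = rng.normal(10, 2, 300)
w_val = x_val + rng.normal(0, 1.5, 300)

# Main study: only w and the health outcome y are observed
x_main = rng.normal(10, 2, 2000)
w_main = x_main + rng.normal(0, 1.5, 2000)
y = 0.5 * x_main + rng.normal(0, 1, 2000)

# Calibration model E[x | w], fit in the validation data
lam, mu = np.polyfit(w_val, x_val, 1)        # slope and intercept

# Naive vs. calibrated exposure-outcome slope in the main study
naive = np.polyfit(w_main, y, 1)[0]
calibrated = np.polyfit(lam * w_main + mu, y, 1)[0]
print(f"naive={naive:.3f}  calibrated={calibrated:.3f}  truth=0.5")
```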

Keywords

measurement errors

Gaussian mixture model

air pollution

Bayesian joint modeling

main and validation studies 

First Author

Zhibing He

Presenting Author

Zhibing He

Methods for Handling Outcome Misclassification in Cancer Survival Analysis

In epidemiological studies of cancer risk factors, the cancer subtype (i.e., fatal or nonfatal) is often determined by vital status at the end of the study, which is not always observed because of censoring before the end of follow-up. There are five possible scenarios for the cancer outcome: (1) censored before cancer diagnosis; (2) observed fatal cancer; (3) unobserved fatal cancer; (4) observed nonfatal cancer; and (5) unobserved nonfatal cancer. Existing studies treat both scenarios 3 and 5 as nonfatal cancer, leading to possible misclassification of the outcome status. To address outcome subtype misclassification due to censoring in post-cancer-diagnosis survival data, we propose a weighted partial likelihood method for estimating the parameters in cause-specific Cox proportional hazards models. In a simulation study, we compare the relative bias and efficiency of the proposed method with the existing method that ignores potential misclassification. We illustrate the method using a fatal cancer example from an ongoing cohort study.
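
A weighted partial likelihood is straightforward to fit once weights are in hand. The sketch below uses lifelines' weighted Cox fit with robust standard errors; the weight column is only a stand-in for the authors' construction (e.g., an estimated probability that a censored case is truly fatal), and the file and column names are hypothetical.

```python
# Sketch: a weighted cause-specific Cox fit with lifelines. The weight column
# w stands in for the authors' weights (their construction is not reproduced).
# File and column names are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("cohort.csv")   # time, event_fatal, age, exposure, w

cph = CoxPHFitter()
cph.fit(df[["time", "event_fatal", "age", "exposure", "w"]],
        duration_col="time", event_col="event_fatal",
        weights_col="w", robust=True)   # robust SEs for the weighted fit
cph.print_summary()
```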

Keywords

fatal cancer

survival analysis

censoring

outcome misclassification 

Co-Author(s)

Lantian Xu, Harvard T.H. Chan School of Public Health
David Zucker, Hebrew University
Molin Wang, Harvard T.H. Chan School of Public Health

First Author

Zhuoran Wei, Harvard T.H. Chan School of Public Health

Presenting Author

Zhuoran Wei, Harvard T.H. Chan School of Public Health

Pooled Testing for Imperfect Multiplex Assays: The Optimal Pool Size for Estimating Coinfections

Pooled testing combines biomaterial from multiple individuals into 'pools', which are then tested for evidence of infection (e.g., pathogens or antibodies). Pooled testing is widely used to estimate the prevalence of infectious diseases and can offer substantial cost savings over individual testing. The optimal pool size has been well studied in the singleplex case and depends on the test accuracy and disease prevalence. We develop a method to determine the optimal pool size for multiplex pooled testing data with imperfect assays. We use an expectation-maximization algorithm to estimate the prevalence of infection with every subset of the pathogens under consideration, and Louis's method to obtain the asymptotic covariance matrix of these estimators. Our approach reliably estimates the prevalence of co-infections and can determine whether infections with different pathogens are independent. We also present an approach for determining the optimal pool size for estimating co-infection prevalence. We apply our method to pooled testing data from a multiplex assay for four pathogens conducted on lone star ticks (Amblyomma americanum) collected in South Carolina to investigate co-infections between Rickettsia spp. and Ehrlichia spp. pathogens.
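
As a simplified reference point, the sketch below computes the maximum-likelihood prevalence estimate from pooled tests with an imperfect assay in the singleplex case; the multiplex EM algorithm and Louis's method in the abstract generalize this. The pool size, assay accuracy, and counts are hypothetical.

```python
# Sketch: ML prevalence estimation from pooled tests with an imperfect assay,
# singleplex case. For pool size k, a pool tests positive with probability
# theta(p) = Se * (1 - (1-p)^k) + (1 - Sp) * (1-p)^k.
import numpy as np
from scipy.optimize import minimize_scalar

k, Se, Sp = 5, 0.95, 0.98          # pool size, sensitivity, specificity
n_pools, n_pos = 200, 63           # observed pooled-test results (hypothetical)

def neg_loglik(p):
    q = (1 - p) ** k                       # P(pool is truly negative)
    theta = Se * (1 - q) + (1 - Sp) * q    # P(pool tests positive)
    return -(n_pos * np.log(theta) + (n_pools - n_pos) * np.log(1 - theta))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(f"estimated prevalence: {res.x:.4f}")
```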

Keywords

pooled testing

group testing

imperfect multiplex assays

optimal pool size 

Co-Author(s)

Melissa Nolan, University of South Carolina
Kayla Bramlett, University of South Carolina
Kia Zellars, University of South Carolina
Brian Herrin, Kansas State University
Christopher McMahan

First Author

Stella Self, University of South Carolina

Presenting Author

Stella Self, University of South Carolina

Selection Bias in Hazard Ratios: Evidence from a Large Precision Oncology Trial

Hazard ratios (HRs) are key effect measures in observational studies and randomized controlled trials (RCTs). While randomization balances treatment groups at baseline, HRs lack a causal interpretation because of selection bias from conditioning on survivors. A recent study of 27 large RCTs found that selection bias in HRs had little practical impact. We examined selection bias in HRs in a precision oncology RCT of olaparib (treatment) vs. control for metastatic castration-resistant prostate cancer (n=387). Homologous recombination repair (HRR) gene status strongly predicted the treatment effect, with HRs of 0.36 (95% CI 0.27-0.50) in HRR-positive and 0.88 (95% CI 0.58-1.34) in HRR-negative patients. In the overall trial, HRR-positive patients were preferentially retained in the treatment arm: their share of the risk set rose over six months from 63% in both arms at randomization to 79% in the treatment arm and 68% in the control arm. This selection led to only minor differences between unweighted and inverse-probability-weighted HRs; differences emerged by four months, when the risk sets in the two arms became more imbalanced. Even with strong effect modification, marginal HRs can appear stable unless selection severely distorts the risk sets at event times.
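
The diagnostic behind the 63% to 79%/68% figures is simple to compute: track the composition of each arm's risk set over follow-up. A minimal sketch, assuming hypothetical file and column names (time, arm, hrr_pos):

```python
# Sketch: how the composition of the risk sets drifts over follow-up --
# the diagnostic behind the abstract's 63% -> 79%/68% HRR-positive figures.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("trial.csv")    # one row per patient

for month in [0, 2, 4, 6]:
    at_risk = df[df["time"] >= month]              # patients still in the risk set
    frac = at_risk.groupby("arm")["hrr_pos"].mean()
    print(f"month {month}: HRR-positive share by arm:\n{frac}\n")
```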

Keywords

Hazard ratios

Selection bias

Randomized-controlled trial 

Co-Author

Konrad H. Stopsack, Department of Epidemiologic Methods and Etiologic Research, Leibniz Institute for Prevention Research

First Author

Yuliya Leontyeva, Harvard T.H. Chan School of Public Health

Presenting Author

Yuliya Leontyeva, Harvard T.H. Chan School of Public Health

Surrogate-powered Regularized Estimation: Semi-Supervised Modeling with Multi-Wave Sampling

Surrogate-powered modeling is an emerging approach in semi-supervised learning that improves statistical efficiency by integrating large-scale unlabeled data with a small labeled dataset through multiple surrogate outcomes. This framework is particularly useful for risk modeling with electronic health records (EHR), where gold-standard outcomes are limited by costly chart review while algorithm-generated surrogates are widely available. Key challenges include effectively combining labeled and unlabeled data with multiple surrogates and designing efficient sampling rules for chart review. To address these challenges, we propose a multi-wave sampling strategy that adaptively approximates the optimal sampling rule, and we introduce a novel semi-supervised estimator with first-order bias correction and sparse regularization to reduce estimation error. The estimator is asymptotically normal and unbiased and improves statistical efficiency. Extensive numerical studies demonstrate its effectiveness in reducing mean-squared error.
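
The generic surrogate-assisted idea can be seen in a toy mean-estimation version: fit a model for the gold-standard label from surrogates on the small labeled set, apply it to the large unlabeled set, and debias with the labeled residuals. This sketch omits the paper's multi-wave sampling and sparse regularization, and all quantities are simulated.

```python
# Toy sketch of the surrogate-assisted idea (mean estimation only), NOT the
# paper's regularized regression estimator. All quantities are simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n_lab, n_unlab = 300, 10000

S_lab = rng.normal(size=(n_lab, 3))                    # surrogates, labeled set
y_lab = (S_lab @ [1.0, 0.5, -0.5] + rng.logistic(size=n_lab) > 0).astype(int)
S_unlab = rng.normal(size=(n_unlab, 3))                # surrogates, unlabeled set

model = LogisticRegression().fit(S_lab, y_lab)
mu_hat = model.predict_proba(S_unlab)[:, 1].mean()     # plug-in from unlabeled data
bias = (y_lab - model.predict_proba(S_lab)[:, 1]).mean()  # first-order correction
print("semi-supervised estimate of P(Y=1):", mu_hat + bias)
```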

Keywords

EHR data

semi-supervised learning

surrogate regression

bias-reduction 

Co-Author(s)

Huiyuan Wang, University of Pennsylvania
Thomas Lumley, University of Auckland
Yong Chen, University of Pennsylvania, Perelman School of Medicine

First Author

Jianmin Chen, University of Pennsylvania, Perelman School of Medicine

Presenting Author

Jianmin Chen, University of Pennsylvania, Perelman School of Medicine