Tuesday, Aug 5: 10:30 AM - 12:20 PM
4110
Contributed Papers
Music City Center
Room: CC-Davidson Ballroom A2
Main Sponsor
Section on Statistics in Epidemiology
Presentations
Missing data are inevitable in clinical trials and may bias analyses. Here we describe an analysis of missing data in 8 trials of substance use disorder (SUD) from the National Institute on Drug Abuse (NIDA). Rates and patterns of missingness in longitudinal urine drug screen (UDS) data were compared and predictors were assessed. Replicate datasets were synthesized using classification and regression trees and analyzed with maximum likelihood (ML) or first processed with multiple imputation (MI). Missingness in UDS was 33% overall (15-52% per study); 28% of participants had no missingness, but some had up to 90%. Most (83%) participants had only intermittent missingness (p<0.001), although dropouts occurred in all studies. Missingness was more common in females (p=0.042) and younger participants (p<0.001). Based on the synthetic data, MI and ML gave similar results, although ML required fewer assumptions and was more efficient overall. We show that although missing outcome data arise through both random and non-random mechanisms, consistent predictors of missingness exist, and ML is an efficient approach for handling missing values.
Keywords
missing data
maximum likelihood
clinical trials
substance use disorder
imputation
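As a toy illustration of the patterns the abstract distinguishes (no missingness, intermittent missingness, and dropout), the classification can be sketched in a few lines of Python. The function names and the encoding of a missed visit as `None` are illustrative assumptions, not part of the study:

```python
# Hypothetical sketch: summarize a participant's longitudinal UDS record,
# where each entry is an observed result ("+" or "-") or None for a
# missed scheduled visit.

def missingness_rate(visits):
    """Fraction of scheduled visits with a missing result."""
    return sum(v is None for v in visits) / len(visits)

def missingness_pattern(visits):
    """'none' if fully observed; 'dropout' if missing from some visit
    onward (monotone missingness); otherwise 'intermittent'."""
    missing = [v is None for v in visits]
    if not any(missing):
        return "none"
    first = missing.index(True)          # first missed visit
    if all(missing[first:]):             # missing ever after -> dropout
        return "dropout"
    return "intermittent"

print(missingness_pattern(["+", None, "-", "+"]))   # intermittent
print(missingness_pattern(["+", "-", None, None]))  # dropout
```

A monotone ("dropout") pattern permits simpler imputation strategies than the intermittent pattern that dominated these trials, which is one reason the pattern itself is worth tabulating.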
Accelerometer-derived measures of physical activity are widely used to monitor physical activity behavior, but the data may contain measurement error. Due to sedentary behavior, non-wear, or device malfunctions, the data may also contain excess zeroes. Limited options exist for analyzing zero-inflated functional data measured with error. Prior estimation methods assumed that the zero-inflated data were observed without error and imposed marginal distributions such as a mixture of a degenerate distribution and a Gaussian distribution. These methods therefore cannot reduce the bias in error-prone zero-inflated functional data. We propose semi-parametric Bayesian approaches that incorporate more flexible marginal distributions and priors while accounting for measurement error biases. We conduct simulations and sensitivity analyses to assess the performance of the proposed methods and compare them with current approaches. Our proposed method reduces biases due to measurement error across the different simulation settings. We apply our methods to investigate the relationship between school-based physical activity and body mass index.
Keywords
measurement error
zero-inflation
functional data
physical activity
wearable device
accelerometer
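The degenerate-plus-Gaussian marginal that the abstract says prior methods assume can be written down directly: a point mass at zero with some probability, and a Gaussian otherwise. A minimal sketch (function name and parameterization are hypothetical, for illustration only):

```python
import math

def zig_loglik(x, p_zero, mu, sigma):
    """Log-likelihood of one observation under a zero-inflated Gaussian:
    a point mass at 0 with probability p_zero, else N(mu, sigma^2)."""
    dens = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    if x == 0:
        # Zero can come from the point mass or from the Gaussian density at 0.
        return math.log(p_zero + (1 - p_zero) * dens)
    return math.log((1 - p_zero) * dens)
```

The proposed semi-parametric Bayesian approach replaces this rigid marginal with more flexible distributions and, crucially, models the observed value as the truth plus error rather than treating it as error-free.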
Assessing the impact of air pollution components on health outcomes is a crucial aspect of air pollution epidemiology. Statistical analyses of air pollution data are subject to bias from measurement errors in exposure. It is well established that exposure to fine particles (PM2.5) can cause a variety of adverse health outcomes; however, the specific impacts of PM2.5 constituents are less studied. We propose a novel approach to correct the bias resulting from measurement errors associated with multiple correlated air pollutants and a complex correlated error structure. Within the framework of main and external validation designs, our method specifies a general joint-model structure for data generation that integrates a Gaussian mixture model, a measurement error model, and an outcome model. We employ the expectation-maximization (EM) algorithm to simultaneously estimate all parameters in the joint model. The application of this method is illustrated in a study investigating the association between air pollution exposures and cognitive function within the Nurses' Health Study (NHS).
Keywords
measurement errors
Gaussian mixture model
air pollution
Bayesian joint modeling
main and validation studies
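The EM algorithm the abstract employs can be illustrated on the simplest related problem: a one-dimensional two-component Gaussian mixture with a known common variance. This toy sketch is not the authors' joint model (which couples the mixture with measurement error and outcome models), only the E/M iteration at its core:

```python
import math

def em_gmm_1d(data, mu1, mu2, sigma=1.0, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture with known common sigma.
    Returns (weight of component 1, mu1, mu2)."""
    w = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each observation.
        resp = []
        for x in data:
            d1 = w * math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
            d2 = (1 - w) * math.exp(-0.5 * ((x - mu2) / sigma) ** 2)
            resp.append(d1 / (d1 + d2))
        # M-step: update weight and component means.
        s = sum(resp)
        w = s / len(data)
        mu1 = sum(r * x for r, x in zip(resp, data)) / s
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - s)
    return w, mu1, mu2
```

In the full joint model, the E-step would additionally average over the latent true exposures, but the alternating expectation/maximization structure is the same.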
In epidemiological studies of cancer risk factors, the cancer subtype (i.e., fatal or nonfatal) is often determined by whether death is observed by the end of the study, which is not always possible because of censoring before the end of follow-up. There are five possible scenarios for the status of the cancer outcome: (1) censored before cancer diagnosis; (2) observed fatal cancer; (3) unobserved fatal cancer; (4) observed nonfatal cancer; and (5) unobserved nonfatal cancer. In existing studies, the cancer status in both scenarios 3 and 5 is treated as nonfatal cancer, leading to possible misclassification of the outcome. To address outcome-subtype misclassification due to censoring in post-cancer-diagnosis survival data, we propose a weighted partial likelihood method for estimating the parameters in cause-specific Cox proportional hazards models. In a simulation study, we compare the relative bias and efficiency of our proposed method and the existing method that ignores the potential misclassification. We illustrate the method using a fatal cancer example in an ongoing cohort study.
Keywords
fatal cancer
survival analysis
censoring
outcome misclassification
Co-Author(s)
Lantian Xu, Harvard T.H. Chan School of Public Health
David Zucker, Hebrew University
Molin Wang, Harvard T.H. Chan School of Public Health
First Author
Zhuoran Wei, Harvard T.H. Chan School of Public Health
Presenting Author
Zhuoran Wei, Harvard T.H. Chan School of Public Health
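The five scenarios can be mapped to what is actually recoverable from observed data; the sketch below (function and argument names are hypothetical) makes explicit why a diagnosed case censored before death cannot be assigned a subtype, which is the misclassification the weighted partial likelihood method targets:

```python
# Hypothetical sketch: the outcome status recoverable from observed data.
# Scenarios 3 and 5 of the abstract collapse into a single observable state,
# which existing analyses treat as "nonfatal".

def observed_subtype(diagnosed, cancer_death_observed, followup_complete):
    if not diagnosed:
        return "no cancer observed"          # scenario 1 (or disease-free)
    if cancer_death_observed:
        return "fatal"                       # scenario 2
    if followup_complete:
        return "nonfatal"                    # scenario 4
    return "subtype unknown (censored)"      # scenario 3 or 5: indistinguishable
```

Calling the last state "nonfatal", as existing studies do, misclassifies every truly fatal case in it; a weighting approach instead distributes such cases across the two cause-specific models.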
Pooled testing combines biomaterial from multiple individuals into 'pools,' which are tested for evidence of infection (e.g., pathogens or antibodies). Pooled testing is widely used to estimate the prevalence of infectious diseases and can offer substantial cost savings over individual testing. The optimal pool size has been well studied for the singleplex case and depends on the test accuracy and disease prevalence. We develop a method to determine the optimal pool size for multiplex pooled testing data with imperfect assays. We use an expectation-maximization algorithm to estimate the prevalences of infection with all subsets of pathogens under consideration, and we use Louis's method to obtain the asymptotic covariance matrix of these estimators. Our approach reliably estimates the prevalence of co-infections and can determine whether infections with different pathogens are independent. We also present an approach for determining the optimal pool size for estimating co-infection prevalence. We apply our method to pooled testing data from a multiplex assay for four pathogens conducted on lone star ticks (Amblyomma americanum) collected in South Carolina to investigate co-infections between Rickettsia spp. and Ehrlichia spp. pathogens.
Keywords
pooled testing
group testing
imperfect multiplex assays
optimal pool size
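For the well-studied singleplex case with a perfect assay, the prevalence estimator from pooled results has a closed form: a pool of size k formed from independent individuals with prevalence p tests negative with probability (1-p)^k, so inverting the observed fraction of positive pools gives the MLE. A minimal sketch (the multiplex, imperfect-assay setting of the abstract requires the EM approach instead):

```python
# Singleplex, perfect-assay sketch: MLE of individual-level prevalence
# from the number of positive pools.

def prevalence_from_pools(n_positive, n_pools, pool_size):
    q = n_positive / n_pools                 # fraction of positive pools
    return 1.0 - (1.0 - q) ** (1.0 / pool_size)
```

For example, if 41% of pools of size 5 test positive, the estimated individual prevalence is about 10%. Imperfect sensitivity/specificity and multiple pathogens break this closed form, which motivates the EM treatment.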
Hazard ratios (HRs) are key effect measures in observational studies and randomized controlled trials (RCTs). While randomization balances treatment groups at baseline, HRs lack a causal interpretation because of selection bias from conditioning on survivors. A recent study of 27 large RCTs found that selection bias in HRs had little practical impact. We examined selection bias in HRs in a precision oncology RCT of olaparib (treatment) vs. control for metastatic castration-resistant prostate cancer (n=387). Homologous recombination repair (HRR) gene status strongly predicted the treatment effect, with HRs of 0.36 (95% CI 0.27-0.50) in HRR-positive and 0.88 (95% CI 0.58-1.34) in HRR-negative patients. In the overall trial, we observed preferential retention of HRR-positive patients: from 63% in both arms at randomization, their proportion increased over six months to 79% in the treatment arm and 68% in the control arm. This selection led to minor differences between unweighted and inverse-probability-weighted HRs. Differences between weighted and unweighted HRs appeared by four months, when the risk sets in the two treatment arms became more imbalanced. Even with strong effect modification, marginal HRs can appear stable unless selection severely distorts the risk sets at event times.
Keywords
Hazard ratios
Selection bias
Randomized-controlled trial
Co-Author
Konrad H. Stopsack, Department of Epidemiologic Methods and Etiologic Research, Leibniz Institute for Prevention Research
First Author
Yuliya Leontyeva, Harvard T.H. Chan School of Public Health
Presenting Author
Yuliya Leontyeva, Harvard T.H. Chan School of Public Health
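The risk-set shift described in the abstract is simple arithmetic: if an arm starts with fraction f of effect-modifier-positive patients and the subgroups have event-free probabilities s_pos and s_neg at time t, the positive fraction among survivors follows directly. A sketch with hypothetical survival probabilities (not trial data):

```python
# Fraction of effect-modifier-positive patients in the risk set at time t,
# given baseline fraction f and subgroup survival probabilities s_pos, s_neg.

def risk_set_fraction(f, s_pos, s_neg):
    return f * s_pos / (f * s_pos + (1 - f) * s_neg)
```

Starting from 63% positive at randomization, a survival advantage for the positive subgroup (larger in the treatment arm than in the control arm) mechanically pushes the two arms' risk sets apart, which is the conditioning-on-survivors selection the abstract quantifies.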
Surrogate-powered modeling is an emerging approach in semi-supervised learning that improves statistical efficiency by integrating large-scale unlabeled data with a small labeled dataset through multiple surrogate outcomes. This framework is particularly useful in risk modeling with electronic health records (EHR), where gold-standard outcomes are limited by costly chart review while algorithm-generated surrogates are widely available. Key challenges include effectively combining labeled and unlabeled data with multiple surrogates and designing efficient sampling rules for chart review. To address these, we propose a multi-wave sampling strategy that adaptively approximates the optimal sampling rule, and we introduce a novel semi-supervised estimator with first-order bias correction and sparse regularization to reduce estimation error. The estimator is asymptotically normal and unbiased, and it improves statistical efficiency over analysis of the labeled data alone. Extensive numerical studies demonstrate its effectiveness in reducing mean-squared error.
Keywords
EHR data
semi-supervised learning
surrogate regression
bias-reduction
Co-Author(s)
Huiyuan Wang, University of Pennsylvania
Thomas Lumley, University of Auckland
Yong Chen, University of Pennsylvania, Perelman School of Medicine
First Author
Jianmin Chen, University of Pennsylvania, Perelman School of Medicine
Presenting Author
Jianmin Chen, University of Pennsylvania, Perelman School of Medicine
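The efficiency gain from surrogates can be seen in the simplest semi-supervised setting: estimating the mean of a gold-standard outcome y observed only on a labeled subsample, using a surrogate s observed on everyone as a control variate. This generic sketch is not the authors' estimator (which handles multiple surrogates, bias correction, and sparse regularization), only the underlying idea:

```python
# Control-variate sketch: adjust the labeled-sample mean of y using the
# surrogate s, whose mean is known much more precisely from all subjects.

def ss_mean(y_lab, s_lab, s_all):
    n = len(y_lab)
    ybar = sum(y_lab) / n
    sbar_lab = sum(s_lab) / n
    sbar_all = sum(s_all) / len(s_all)
    # Regress y on s in the labeled sample to get the adjustment slope.
    cov = sum((yi - ybar) * (si - sbar_lab) for yi, si in zip(y_lab, s_lab)) / n
    var = sum((si - sbar_lab) ** 2 for si in s_lab) / n
    beta = cov / var
    # Shift the labeled mean by how atypical the labeled surrogates were.
    return ybar + beta * (sbar_all - sbar_lab)
```

The more predictive the surrogate, the larger the variance reduction relative to the labeled-only mean, which is why accurate algorithm-generated EHR phenotypes make chart-review budgets go further.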