Monday, Aug 3: 2:00 PM - 3:50 PM
6414
Contributed Papers
Thomas M. Menino Convention & Exhibition Center
Room: CC-208
Main Sponsor
Section on Statistics in Epidemiology
Presentations
In environmental epidemiology, individual-level exposure measurements are rarely observed and are commonly assigned from spatiotemporal prediction models. This paradigm introduces measurement error due to spatial misalignment between exposure predictions and health outcome locations, potentially leading to biased inference and undercoverage. Existing correction methods typically rely on access to monitoring data and have limited capacity to accommodate complex survey designs or settings where only gridded exposure predictions are available. We propose a flexible bootstrap-based measurement error correction framework for two-stage environmental health analyses that operates entirely on grid-based exposure predictions and does not require direct access to monitoring measurements. The approach embeds a scalable spatial model within a resampling scheme that reconstructs exposure uncertainty through pseudo-monitor sampling, exposure regeneration, and reassignment to individual locations.
Keywords
Spatial statistics
Environmental Epidemiology
Measurement Error
Count time series arising in public health, epidemiology, and finance often exhibit serial dependence, excess zeros, and time-varying dispersion, features that are inadequately captured by standard Poisson models. We propose a time-varying zero-inflated generalized Poisson INGARCH (TV-ZIGP-INGARCH) framework that simultaneously accommodates serial dependence, dynamic zero inflation, and evolving dispersion. In the proposed model, the conditional mean (intensity) follows an INGARCH-type evolution, while both the zero-inflation probability and the dispersion parameter are allowed to vary over time as functions of past observations and exogenous covariates. This structure enables the model to capture evolving structural zeros, periods of under- and over-dispersion, and nonstationary behavior within a unified count time-series framework. Parameter estimation is conducted using both maximum likelihood and expectation-maximization approaches. Applications to real-world count time series show substantial gains in goodness-of-fit and predictive accuracy, particularly during periods characterized by changing zero-inflation intensity and dispersion.
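The recursion described in this abstract can be illustrated with a minimal simulation sketch. It is not the authors' TV-ZIGP-INGARCH specification: the generalized Poisson is swapped for a plain Poisson (so dispersion is fixed), the zero-inflation probability is assumed to follow a logistic function of the previous count only, and every name and parameter value (`omega`, `alpha`, `beta`, `gamma0`, `gamma1`) is hypothetical.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_zip_ingarch(n, omega=1.0, alpha=0.3, beta=0.4,
                         gamma0=-1.5, gamma1=-0.5, seed=0):
    """Simulate a zero-inflated Poisson INGARCH(1,1) path (illustrative only).

    Assumed dynamics, not taken from the abstract:
      lam_t = omega + alpha * y_{t-1} + beta * lam_{t-1}
      pi_t  = logistic(gamma0 + gamma1 * y_{t-1})
      y_t   = 0 with probability pi_t, else Poisson(lam_t)
    """
    rng = random.Random(seed)
    # Starting value only: the stationary mean of the intensity
    # when there is no zero inflation.
    lam = omega / (1.0 - alpha - beta)
    y_prev = 0
    counts, intensities, zero_probs = [], [], []
    for _ in range(n):
        lam = omega + alpha * y_prev + beta * lam
        pi = 1.0 / (1.0 + math.exp(-(gamma0 + gamma1 * y_prev)))
        y = 0 if rng.random() < pi else poisson_draw(lam, rng)
        counts.append(y)
        intensities.append(lam)
        zero_probs.append(pi)
        y_prev = y
    return counts, intensities, zero_probs


counts, intensities, zero_probs = simulate_zip_ingarch(200, seed=1)
print(sum(c == 0 for c in counts), "zeros out of", len(counts))
```

With a negative `gamma1`, a large previous count lowers the chance of a structural zero, which is one simple way to make zero inflation depend on the past.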
Keywords
Count time series
Zero-inflated models
Generalized Poisson distribution
INGARCH models
Time-varying parameters
CDC's Active Bacterial Core surveillance (ABCs) monitors invasive bacterial diseases among about 45.9 million people across 10 U.S. sites. A challenge in ABCs is missing sociodemographic (e.g., race) and laboratory characterization (e.g., bacterial subtypes) data. Non-random missing data can bias stratified disease estimates. Therefore, we developed a multi-step multiple imputation approach utilizing random forest models to capture complex predictor interactions, mitigate multicollinearity, and account for the hierarchical structure of laboratory characteristics. The approach leverages sociodemographic and clinical characteristics to enhance imputation under non-random missingness. We further proposed a decision tree-based framework to characterize the complex missing data mechanisms inherent in ABCs, conducted simulations, and assessed multiple metrics of imputation accuracy. Results demonstrate improved precision and validity of imputed demographic data, leading to more reliable estimates; the race misclassification rate was only 2.4%, despite approximately 20% missingness. This framework can be broadly applicable to public health surveillance systems with non-random missing data.
Keywords
Multiple imputation
Missing not at random (MNAR)
Hierarchical data structure
Infectious disease surveillance
Randomized trials provide high-quality causal evidence but often enroll selective populations. By contrast, electronic health records capture broader clinical populations but frequently lack the outcomes needed for reliable treatment-effect estimation. This limits the ability to extend high-quality evidence to routine-care settings when the target outcome is unavailable or incompletely measured. Here we propose a debiased digital twin framework for estimating treatment effects in target populations with covariate-only data. The framework learns outcome models in a source dataset, applies them to the target population to generate paired digital twins, and then uses bias reference outcomes (BROs), whose population-level treatment effects are expected to be null, to detect and calibrate residual bias arising from limited generalizability. The calibration step is a wrapper and can be readily adapted across data settings and digital twin models. In a real-world application transporting brain imaging-based evidence from SPRINT-MIND to a Penn Medicine EHR cohort, evaluation through leave-one-out BRO falsification analyses showed that naive digital twin estimates were systematically biased and severely undercovered, whereas BRO-calibrated estimates were well centered and achieved near-nominal coverage, with coverage improving from 2.2% to 95.2%. Across 238 white matter lesion outcomes, the debiased estimates preserved the overall protective pattern associated with intensive blood pressure treatment while enabling inference when the primary outcomes were not routinely observed in the target population. These results support BRO-calibrated digital twins as a practical approach for extending treatment-effect evidence from selective studies to broader clinical populations.
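The idea of using outcomes with an expected null effect to detect and remove residual bias can be sketched with a simple method-of-moments calibration. This is an assumption-laden illustration, not the authors' BRO procedure: it treats the bias as roughly normal across reference outcomes, and the function names `fit_bias_model` and `calibrate` are hypothetical.

```python
import math
import statistics


def fit_bias_model(bro_estimates, bro_ses):
    """Estimate systematic bias from outcomes whose true effect is assumed null.

    Returns (bias_mean, bias_var): the average estimate across reference
    outcomes, and the spread left over after removing average sampling variance.
    """
    bias_mean = statistics.fmean(bro_estimates)
    total_var = statistics.variance(bro_estimates)
    sampling_var = statistics.fmean([se ** 2 for se in bro_ses])
    bias_var = max(0.0, total_var - sampling_var)
    return bias_mean, bias_var


def calibrate(estimate, se, bias_mean, bias_var, z=1.96):
    """Shift an estimate by the estimated bias and widen its standard error."""
    cal_estimate = estimate - bias_mean
    cal_se = math.sqrt(se ** 2 + bias_var)
    ci = (cal_estimate - z * cal_se, cal_estimate + z * cal_se)
    return cal_estimate, cal_se, ci


# Hypothetical numbers: five reference outcomes that all drift upward.
bias_mean, bias_var = fit_bias_model([0.5, 0.6, 0.4, 0.55, 0.45], [0.05] * 5)
print(calibrate(0.3, 0.05, bias_mean, bias_var))
```

A leave-one-out check in this spirit would refit the bias model without one reference outcome, calibrate that outcome, and record whether its interval covers zero.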
Keywords
Causal Inference
Negative control outcome calibration / Bias Reference Outcomes
Observational studies
Electronic health records
Integrating external studies can improve precision and statistical efficiency beyond using an internal study alone, yet heterogeneity across external studies can distort internal estimates. Existing methods may require individual-level external data or strong distributional assumptions and may yield invalid inference under between-study heterogeneity. We propose mGMM, a semiparametric framework that combines generalized method of moments with a random-effects model to borrow external summary information while accounting for heterogeneity. Unlike methods that require external studies to estimate the same target parameter, mGMM uses auxiliary external information to improve inference for the internal-study target. Simulations show consistent estimation, valid inference, and efficiency gains over internal-only analyses under heterogeneity. Applied to colorectal cancer tumor sequencing data while integrating external smoking summaries for established molecular subtypes, mGMM identifies smoking associations that vary by mutation-defined subtype and could not be detected using the internal data alone, providing new etiologic insight.
Keywords
Data integration
External summary information
Generalized method of moments
Random-effects model
Study heterogeneity
Statistical efficiency
Speaker
Yifei Wang, Fred Hutchinson Cancer Research Center
Co-Author(s)
Jiayin Zheng, University of Pennsylvania
Li Hsu, Fred Hutchinson Cancer Center
Estimating treatment effects with EHR data is complicated by the lack of validated outcome measures collected under standardized protocols. Typically, outcomes are approximated using putative rules or phenotyping models that are trained on small sub-samples. Errors in the inferred outcomes can introduce biases in downstream treatment effect estimates. We develop a semi-supervised method to calibrate inferred outcomes for estimation of average treatment effects when a sub-sample is labeled at random with validated outcomes. The calibration ensures that the subsequent treatment effect estimator remains consistent despite errors in the inferred outcomes. This problem is analogous to that of estimating mean outcomes in a longitudinal study subject to monotone missingness, and we demonstrate connections with existing augmented inverse probability weighting estimators. The proposed estimator is multiply robust and locally semiparametric efficient. It can achieve efficiency gains in finite samples due to an effective normalization of implicit augmentation terms. The performance is evaluated in simulations and illustrated in an analysis of anti-TNF therapies in rheumatoid arthritis.
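The calibration idea, correcting inferred outcomes with a randomly labeled sub-sample, can be sketched with a simple augmentation-style estimator. This is a simplified illustration rather than the proposed multiply robust estimator: it assumes labels are missing completely at random within each arm and ignores confounding adjustment, and the function names are hypothetical.

```python
from statistics import fmean


def calibrated_arm_mean(inferred, validated):
    """Mean outcome in one arm, corrected for errors in inferred outcomes.

    inferred:  inferred outcome for every patient in the arm
    validated: validated outcome for labeled patients, None otherwise
               (same order as `inferred`)
    The estimate is the mean inferred outcome plus the mean residual
    (validated minus inferred) among labeled patients.
    """
    residuals = [y - f for f, y in zip(inferred, validated) if y is not None]
    if not residuals:
        raise ValueError("at least one validated outcome is required")
    return fmean(inferred) + fmean(residuals)


def calibrated_ate(inferred_trt, validated_trt, inferred_ctl, validated_ctl):
    """Difference in calibrated arm means (treated minus control)."""
    return (calibrated_arm_mean(inferred_trt, validated_trt)
            - calibrated_arm_mean(inferred_ctl, validated_ctl))


# Hypothetical toy data: binary outcomes, a few labeled patients per arm.
ate = calibrated_ate([1, 1, 0, 1], [1, None, 0, 0],
                     [0, 1, 0, 0], [None, 1, 1, None])
print(round(ate, 4))
```

The residual term vanishes when the inferred outcomes are unbiased for the validated ones, so the correction only moves the estimate when the phenotyping errors are systematic.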
Keywords
Comparative effectiveness
Electronic health records
Semiparametric efficiency
Multiple robustness
Semi-supervised learning
Weighting methods are widely used to estimate causal effects in observational studies by balancing pre-treatment covariates across treatment groups. Traditional approaches, such as inverse propensity score weighting or moment matching, achieve balance only indirectly and do not ensure alignment of full joint covariate distributions. Recently proposed distributional balancing methods offer flexible, nonparametric alternatives that directly target entire covariate distributions, but they lack a unified framework, theoretical guarantees, and valid inference procedures. We introduce a unified framework for nonparametric distributional balancing based on characteristic function distance (CFD), showing that popular discrepancies such as maximum mean discrepancy and energy distance arise as special cases. We establish conditions under which the resulting CFD-based weighting estimator is √n-consistent. Because the standard bootstrap may fail, we propose subsampling for valid inference. We further extend the framework to instrumental variable settings to address unmeasured confounding. Simulations and a real-data analysis demonstrate strong empirical performance consistent with our theory.
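As a concrete instance of a distributional discrepancy named in this abstract, the sketch below computes the energy distance between a weighted sample and a target sample. It only evaluates the discrepancy for given weights; it does not perform the weight optimization or the CFD construction from the talk, and the function name `weighted_energy_distance` is hypothetical.

```python
import math


def weighted_energy_distance(x, w, y):
    """Energy distance between a weighted sample x and a uniform sample y.

    x, y: sequences of covariate vectors (tuples of floats)
    w:    non-negative weights for x; normalized here to sum to one
    Uses 2*E|X-Y| - E|X-X'| - E|Y-Y'| with Euclidean distance.
    """
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    w = [wi / total for wi in w]
    v = 1.0 / len(y)  # uniform weight on each target point

    cross = sum(wi * v * math.dist(xi, yj)
                for xi, wi in zip(x, w) for yj in y)
    within_x = sum(wi * wj * math.dist(xi, xj)
                   for xi, wi in zip(x, w) for xj, wj in zip(x, w))
    within_y = sum(v * v * math.dist(yi, yj) for yi in y for yj in y)
    return 2.0 * cross - within_x - within_y


# Hypothetical toy example: reweighting the sample toward the target
# reduces the discrepancy.
x = [(0.0,), (1.0,)]
y = [(1.0,)]
print(weighted_energy_distance(x, [0.5, 0.5], y))
print(weighted_energy_distance(x, [0.0, 1.0], y))
```

A balancing method in this family would search for the weights that minimize such a discrepancy, typically subject to constraints such as non-negativity and summing to one.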
Keywords
Energy distance
Local average treatment effect
Maximum mean discrepancy
Reproducing kernel Hilbert space
Quadratic programming
Subsampling