Denoising and Enhancing Data Analysis: Capturing True Signals and Addressing Errors with Advanced Statistical Approaches

Maria Kamenetsky, Chair
 
Sunday, Aug 3: 4:00 PM - 5:50 PM
4018 
Contributed Papers 
Music City Center 
Room: CC-201B 

Main Sponsor

ENAR

Presentations

An Algorithm for Randomizing Matched Sets Within and Between Batches

Variation in laboratory assays can contribute to measurement error, and careful planning can minimize differential errors in effect measures. Randomization helps ensure that the sequencing of samples within and across batches is independent of sample characteristics. Batches may comprise multiple plates. We developed an algorithm to assign samples to batches that: 1) allows for variation in plate sizes within batches; 2) treats samples from matched study subjects, such as cases and controls or exposed and unexposed individuals, as a set; 3) randomizes sets to assigned batches; and 4) orders samples randomly within sets. To evaluate variation within and between batches, quality control samples are assigned both within and across batches, and replicate quality control samples in the same batch are required to be placed a certain distance apart. An option in the tool allows for minimal rearrangement of samples when a sequence of assays requiring different batch sizes is being conducted. A validation step verifies that the algorithm's arguments are satisfied. An R package implementing this tool is being developed, including a vignette and a test dataset. 
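
For intuition, the minimal Python sketch below illustrates the core assignment steps (randomizing matched sets to batches and shuffling sample order within sets). It is illustrative only, not the forthcoming R package; the function and variable names are hypothetical, and quality-control sample placement is omitted.

```python
import random

def randomize_matched_sets(matched_sets, batch_sizes, seed=None):
    """Randomly assign matched sets to batches, then shuffle sample order
    within each set. Raises ValueError if the sets do not all fit."""
    rng = random.Random(seed)
    queue = [list(s) for s in matched_sets]   # copy so inputs are untouched
    rng.shuffle(queue)                        # randomize set-to-batch assignment

    batches = []
    for capacity in batch_sizes:
        batch, used = [], 0
        # Greedily fill the batch while keeping each matched set intact.
        while queue and used + len(queue[0]) <= capacity:
            s = queue.pop(0)
            rng.shuffle(s)                    # random sample order within the set
            batch.extend(s)
            used += len(s)
        batches.append(batch)
    if queue:
        raise ValueError("batch capacities too small for all matched sets")
    return batches

# Example: three case-control pairs split across batches of 4 and 2 samples
pairs = [["case1", "ctrl1"], ["case2", "ctrl2"], ["case3", "ctrl3"]]
print(randomize_matched_sets(pairs, batch_sizes=[4, 2], seed=7))
```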

Keywords

randomization

case-control

laboratory assays

bias

Disclaimer:
The authors have no conflicts of interest to disclose. The views, information or content, and conclusions presented do not necessarily represent the official position or policy of, nor should any official endorsement be inferred on the part of, the Uniformed Services University, the Department of Defense, the U.S. Government, or The Henry M. Jackson Foundation. 

Co-Author

Thaddeus Haight, Military Traumatic Brain Injury Initiative

First Author

Michelle Mellers, Center for Health Services Research, USU, HJF

Presenting Author

Michelle Mellers, Center for Health Services Research, USU, HJF

Average treatment effect in cluster-randomized trials with outcome misclassification and non-random validation subset

In ASPIRE, a cluster-randomized trial, pediatric primary care clinics receive either facilitation or no facilitation for delivering a secure firearm program. Under this program, clinicians provide both counseling and free gun locks to parents. Randomization should enable non-parametric estimation of the ATE, but clinicians document their own delivery of the program, which may not reflect true delivery. In a follow-up study to address this classification error, parents are asked to validate clinicians' documentation, but only a fraction volunteer. In this setting where a non-random internal validation set is available, we demonstrate that it is possible to use the relationship between gold-standard (parent) and silver-standard (clinician) measures to target the ATE without bias. Moreover, we show that our method is valid even when selection into the validation sample depends on the true outcome. Simulation studies demonstrate acceptable finite sample performance of our estimators with cluster-robust variance expressions in the presence of misclassification and selection bias in the validation set. We apply our methods to ASPIRE to assess the impact of facilitation on program delivery. 
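
For context only, the classical matrix-method correction below shows how a misclassified binary outcome can be de-biased when the sensitivity and specificity of the error-prone (silver-standard) measure are known; it is a textbook reference point, not the estimator proposed in this work.

```latex
% p*     : observed (silver-standard) proportion
% p      : true proportion
% se, sp : sensitivity and specificity of the error-prone measure
\[
  p^{*} = p \, se + (1 - p)(1 - sp)
  \quad\Longrightarrow\quad
  \hat{p} = \frac{p^{*} + sp - 1}{se + sp - 1}.
\]
```

The difficulty this work addresses is that se and sp must be estimated from a validation subset whose selection may depend on the true outcome, which naive plug-in corrections of this form ignore.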

Keywords

cluster-randomized trial

measurement error

selection bias

causal inference 

Co-Author(s)

Kristin Linn, University of Pennsylvania
Nandita Mitra, University of Pennsylvania

First Author

Dane Isenberg, University of Pennsylvania

Presenting Author

Dane Isenberg, University of Pennsylvania

Enhancing Change Point Detection with Skew-t Distributions: Applications in Finance, Healthcare & AI

Change point detection (CPD) is essential in identifying structural shifts in time-series data, with applications spanning finance, healthcare, and environmental monitoring. Traditional CPD methods often assume normality, which fails to capture real-world data that exhibit skewness and heavy tails. This talk explores using skew-t distributions in CPD, providing a more robust framework for detecting distributional shifts.

We introduce parametric and non-parametric CPD approaches, emphasizing a Bayesian Information Criterion (BIC)-based method tailored for skewed data. Applications include changes in financial market regimes, environmental monitoring of heavy metal contamination, and healthcare analytics such as glaucoma progression modeling. Additionally, we highlight the integration of CPD in machine learning and AI, including concept drift detection, anomaly detection, and reinforcement learning.

By leveraging skew-t distributions, we enhance the accuracy of CPD models in capturing asymmetric and long-tailed data, offering more reliable insights across disciplines. 
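
As a minimal sketch of the BIC-based idea, the Python snippet below searches for a single change point by comparing per-segment fits; it uses scipy's skew-normal family as a convenient stand-in for the skew-t (which adds a degrees-of-freedom parameter for heavy tails) and is illustrative, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def bic_changepoint(x, min_seg=20, step=5):
    """Return the split index minimizing total BIC over a coarse grid,
    or None if no split improves on the single-segment fit."""
    n = len(x)

    def seg_bic(seg):
        params = stats.skewnorm.fit(seg)                 # (shape, loc, scale)
        loglik = stats.skewnorm.logpdf(seg, *params).sum()
        # Per-segment BIC; summing these across segments is a simple
        # heuristic for the two-segment model's criterion.
        return -2 * loglik + len(params) * np.log(len(seg))

    best_tau, best_bic = None, seg_bic(x)
    for tau in range(min_seg, n - min_seg, step):
        bic = seg_bic(x[:tau]) + seg_bic(x[tau:])
        if bic < best_bic:
            best_tau, best_bic = tau, bic
    return best_tau

rng = np.random.default_rng(0)
x = np.concatenate([stats.skewnorm.rvs(4, size=150, random_state=rng),
                    stats.skewnorm.rvs(4, loc=2, size=150, random_state=rng)])
print(bic_changepoint(x))   # expect a split near index 150
```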

Keywords

Change Point Detection

Skew-T Distribution

Bayesian Information Criterion

Machine Learning

AI Model Adaptation

Concept Drift

Anomaly Detection 

First Author

Abeer Hasan

Presenting Author

Abeer Hasan

Forecasting and Downscaling Solar Irradiance Using Transformers

As solar energy continues to grow as a key component of the global energy mix, accurate forecasting of solar irradiance becomes more crucial for ensuring a reliable electricity supply. However, existing forecasting methods often fail to capture the fine temporal variations in solar irradiance, particularly in regions where local weather conditions play a significant role. This research addresses the growing need for accurate solar irradiance forecasting to optimize the integration of solar energy into the grid. By using raw data, we aim to preserve the important short-term fluctuations that are crucial for precise forecasts. The focus was on downscaling global solar irradiance data from a 15-minute resolution to a finer, 5-minute local resolution for Brookings, South Dakota. A transformer-based model was applied to forecast solar power output, utilizing different approaches to assess the effectiveness of various downscaling methods. The model was trained on historical data and used to generate short-term forecasts over a 24-hour horizon, with performance evaluated using standard error metrics. The findings highlight the potential of transformer models for improving solar irradiance forecasts. 
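
To make the downscaling step concrete, the sketch below interpolates a 15-minute irradiance series onto a 5-minute grid with pandas. This is the naive baseline against which learned downscalers can be compared; the series and all names are hypothetical, not the study's data.

```python
import pandas as pd
import numpy as np

# Hypothetical 15-minute global horizontal irradiance series (W/m^2),
# roughly tracing a clear-sky daytime curve.
idx15 = pd.date_range("2024-06-01 06:00", "2024-06-01 20:00", freq="15min")
ghi15 = pd.Series(np.clip(800 * np.sin(np.linspace(0, np.pi, len(idx15))), 0, None),
                  index=idx15, name="ghi")

# Baseline downscaling: reindex onto a 5-minute grid and interpolate in time.
idx5 = pd.date_range(idx15[0], idx15[-1], freq="5min")
ghi5 = ghi15.reindex(idx5).interpolate(method="time")

print(ghi5.head(7))
```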

Keywords

Solar Irradiance Forecast

Downscaling

Transformers

Time Series 

Co-Author(s)

Hossein Moradi Rekabdarkolaee, South Dakota State University
Abhilasha Suvedi, South Dakota State University
Timothy M. Hansen, South Dakota State University

First Author

Jesto Peter, South Dakota State University

Presenting Author

Jesto Peter, South Dakota State University

Heterogeneity-Adaptive Meta-Analysis

Meta-analytic methods tend to take all-or-nothing approaches to study-level heterogeneity, either limiting the influence of studies that are suspected to diverge from a shared model or assuming all studies are homogeneous. In this paper, we develop a heterogeneity-adaptive meta-analysis for linear models that adapts to the amount of information shared between datasets. The primary mechanism for information sharing is shrinkage of dataset-specific distributions towards a new "centroid" distribution through a Kullback-Leibler divergence penalty. The Kullback-Leibler divergence is uniquely geometrically suited for measuring relative information between datasets. We establish our estimator's desirable inferential properties without assuming homogeneity between dataset parameters. Among other things, we show that our estimator has a provably smaller mean squared error than the dataset-specific maximum likelihood estimators, and establish asymptotically valid inference procedures. A comprehensive set of simulations illustrates our estimator's versatility, and an analysis of data from the eICU Collaborative Research Database illustrates its performance in a real-world setting. 
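
Schematically, and in notation that is ours rather than necessarily the paper's, the estimator can be viewed as solving a penalized likelihood problem that shrinks each dataset's fitted distribution toward a shared centroid:

```latex
% K datasets with parameters theta_1, ..., theta_K and centroid gamma;
% ell_k is the log-likelihood of dataset k, and lambda >= 0 tunes sharing
% (lambda -> 0: separate fits; lambda -> infinity: a pooled fit).
\[
  (\hat\theta_1, \dots, \hat\theta_K, \hat\gamma)
  = \arg\max_{\theta_1, \dots, \theta_K, \, \gamma}
    \sum_{k=1}^{K} \Big\{ \ell_k(\theta_k)
      - \lambda \, \mathrm{KL}\!\left( f_{\theta_k} \,\middle\|\, f_{\gamma} \right) \Big\}.
\]
```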

Keywords

Data integration

Penalized regression

Information geometry

Stein shrinkage

Data privacy 

Co-Author

Emily Hector, North Carolina State University

First Author

Elizabeth Davis

Presenting Author

Elizabeth Davis

Robustify p-value Calibration in Observational Studies with Partially Valid Negative Control Outcomes

In observational studies, empirical calibration of p-values using negative control outcomes (NCOs) has emerged as a powerful tool for detecting and adjusting for systematic bias in treatment effect estimation. However, existing methods assume that all NCOs are valid (i.e., have a truly null effect), an assumption often violated in real-world settings. This study introduces a mixture model-based approach to account for the presence of invalid NCOs. Our method estimates the null distribution of effect estimates while accommodating heterogeneous NCO validity, enhancing robustness against bias. Through simulation studies, we demonstrate that our approach improves bias correction and controls false discoveries. We apply this methodology to real-world healthcare datasets, showcasing its practical benefits in ensuring reliable causal inference. Our findings underscore the importance of flexible p-value calibration strategies in observational research, particularly when some NCOs may deviate from the true null hypothesis. By tolerating partial misclassification of NCOs, our approach advances empirical calibration toward greater robustness and generalizability. 
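
As a sketch of the idea (not the authors' estimator), one can fit a two-component mixture to the negative-control estimates, treat the component nearest zero as the empirical null, and calibrate p-values against it. The data below are simulated and all names are hypothetical.

```python
import numpy as np
from scipy.stats import norm
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Hypothetical log hazard ratio estimates for 50 NCOs: 80% valid nulls
# shifted by systematic bias 0.1, 20% invalid with real effects near 0.5.
ncos = np.concatenate([rng.normal(0.1, 0.15, 40), rng.normal(0.5, 0.15, 10)])

gmm = GaussianMixture(n_components=2, random_state=0).fit(ncos.reshape(-1, 1))
null = int(np.argmin(np.abs(gmm.means_.ravel())))   # component nearest zero
mu = gmm.means_.ravel()[null]
sd = np.sqrt(gmm.covariances_.ravel()[null])

def calibrated_p(estimate):
    """Two-sided p-value against the estimated empirical null."""
    z = (estimate - mu) / sd
    return 2 * norm.sf(abs(z))

print(calibrated_p(0.4))   # calibrated against the bias-aware null
```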

Keywords

Hypothesis Testing

Mixture Models

Negative Control Outcomes

Observational Studies

p-value Calibration 

Co-Author(s)

Dazheng Zhang
Huiyuan Wang, University of Pennsylvania
Wenjie Hu, University of Pennsylvania School of Medicine - Philadelphia, PA
Qiong Wu, University of Pittsburgh
Howard Chan, University of Pennsylvania
Lu Li
Patrick Ryan, Johnson & Johnson
Marc Suchard, University of California-Los Angeles
Martijn Schuemie, Observational Health Data Sciences and Informatics
George Hripcsak, Columbia University
Yong Chen, University of Pennsylvania, Perelman School of Medicine

First Author

Bingyu Zhang

Presenting Author

Bingyu Zhang

Functional Data Analysis in Hearing Research: A Clinical Research Perspective

Tinnitus is the perception of sound in the ears or head without an external source. In a hearing clinical trial comparing tinnitus patients to a control group, noise exposure was recorded every 3.75 minutes over 7 days. We present an application-driven approach to time series denoising and group comparison in analyzing sound exposure patterns between the two groups. Instead of traditional two-sample comparison methods, functional data analysis (FDA) was employed. Noise exposure sequences were decomposed into group-specific mean and residual functions, preserving both group-level trends and individual variation. This FDA-based denoising procedure reduced random fluctuations, enhancing the detection of systematic group differences. For statistical inference, a basis function-based simultaneous confidence band was constructed from the denoised sequences. The simultaneous confidence band results closely aligned with pointwise Wilcoxon tests adjusted by the Benjamini-Hochberg procedure, revealing the most pronounced differences at particular times of day. This approach demonstrates the effectiveness of functional data analysis for time series denoising and structured group comparisons. 
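
A minimal sketch of the group-comparison logic appears below, using a bootstrap sup-norm band in place of the basis-function construction described above; the subjects, grid, and values are simulated and illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 96                                        # time points per day (illustrative)
t = np.arange(T)
tinnitus = rng.normal(60 + 5 * np.sin(2 * np.pi * t / T), 4, size=(30, T))
control  = rng.normal(60, 4, size=(30, T))

diff = tinnitus.mean(0) - control.mean(0)     # group mean difference curve

# Bootstrap the sup-norm of the centered difference to get a band that
# covers the whole curve simultaneously at ~95%.
sups = []
for _ in range(1000):
    bt = tinnitus[rng.integers(0, 30, 30)].mean(0)
    bc = control[rng.integers(0, 30, 30)].mean(0)
    sups.append(np.max(np.abs((bt - bc) - diff)))
half_width = np.quantile(sups, 0.95)

signif = np.abs(diff) > half_width            # times with a detected difference
print(f"band half-width {half_width:.2f}; {signif.sum()} of {T} points differ")
```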

Keywords

Tinnitus

Functional Data Analysis

Clinical Research

Time Series

Intensive Longitudinal Data 

Co-Author(s)

Kun Chen, University of Connecticut
Erika Skoe, University of Connecticut
Ofer Harel, University of Connecticut

First Author

Yifan Zhang, University of Connecticut

Presenting Author

Yifan Zhang, University of Connecticut