CS3b: Panel: Methods with Meaning: Celebrating the "Why" Behind Statistical Innovation

Conference: Women in Statistics and Data Science 2025
11/13/2025: 11:45 AM - 1:15 PM EST
Panel 

Description

Statistical innovation is often judged by novelty, but what if we instead celebrated our methods' meaning? This session highlights the real-world motivations, challenges, and impact that drive methodological development. Rather than diving deep into equations, four statisticians share the stories behind their work: why the methods were needed, what gaps they fill, and how they contribute to science, policy, or public health.

This session is for anyone who believes that statistical methods are not just intellectual exercises but tools to understand and improve the world. Talks will explore bias in risk prediction, the responsible use of predictions in research, scalable validation of electronic health records, and the optimization of cancer screening programs. Each project is rooted in a real-world setting, such as healthcare, clinical research, or data science practice, where thoughtful methods can lead to meaningful improvements.

Sponsored by the Caucus for Women in Statistics and Data Science, this session brings together researchers in our field to emphasize the "why" behind their statistical methods development and to celebrate the breadth of our applications.

Keywords

Statistical Applications

Electronic Health Records

Risk Prediction

Measurement Error

Cancer Screening

Machine Learning 

Organizer

Lucy D'Agostino McGowan, Wake Forest University

Target Audience

Mid-Level

Tracks

Knowledge
Women in Statistics and Data Science 2025

Presentations

ML-powered scientific research: Possibilities and pitfalls

From applications in structural biology to the analysis of electronic health record data, predictions from machine learning models increasingly complement costly gold-standard data in scientific inquiry. While "using predictions as data" enables scientific studies to scale in an unprecedented manner, appropriately accounting for inaccuracies in the predictions is critical to achieving trustworthy conclusions from downstream statistical inference.

In this talk, I will explore the methodological and practical impacts of using predictions as data across various applications. I will introduce our recently proposed method for bias correction and draw connections with modern methods and classical statistical approaches dating back to the 1960s. I will also discuss ethical challenges of using predictions as data, underscoring the need for careful and thoughtful adoption of this practice in scientific research.
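
To make the bias-correction idea concrete, below is a minimal Python sketch in the spirit of recent "predictions as data" estimators (such as prediction-powered inference): the mean of a large set of model predictions is debiased using a small sample with gold-standard outcomes. The simulated data, the mean as the target quantity, and all variable names are illustrative assumptions, not the speaker's specific method.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: a small labeled sample with gold-standard outcomes,
# and a large unlabeled sample where only model predictions are observed.
n, N = 100, 10_000
y_lab = rng.normal(1.0, 1.0, size=n)           # gold-standard outcomes (labeled)
f_lab = y_lab + rng.normal(0.3, 0.5, size=n)   # model predictions on labeled data
y_unl = rng.normal(1.0, 1.0, size=N)           # unobserved in practice
f_unl = y_unl + rng.normal(0.3, 0.5, size=N)   # predictions we actually observe

# Naive estimate: treat predictions as if they were data.
theta_naive = f_unl.mean()

# Bias-corrected estimate: add a "rectifier" (the mean prediction error
# estimated from the small labeled subset) to the naive estimate.
rectifier = (y_lab - f_lab).mean()
theta_corr = theta_naive + rectifier

# Standard error combines variability from both samples.
se = np.sqrt(f_unl.var(ddof=1) / N + (y_lab - f_lab).var(ddof=1) / n)
print(f"naive {theta_naive:.3f}, corrected {theta_corr:.3f} "
      f"+/- {1.96 * se:.3f} (true mean is 1.0)")

The naive estimate inherits the systematic prediction error, while the corrected estimate removes it and reports a standard error that reflects the extra uncertainty contributed by the small labeled sample.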

Speaker

Jesse Gronsbell, University of Toronto

Why and how we validate EHR data: Making them aud-it they can be

Data from electronic health records (EHRs) present a huge opportunity to operationalize a standardized whole-person health score in the learning health system and to identify at-risk patients on a large scale, but these data are prone to missingness and errors. Ignoring these data quality issues can lead to biased statistical results and incorrect clinical decisions. Validating EHR data (e.g., through chart reviews) can provide better-quality data, but realistically, only a subset of patients' data can be validated, and most protocols do not recover missing data.

Using a representative sample of 1000 patients from the EHR at a large learning health system (100 of whom could be validated), we propose methods to design, conduct, and analyze statistically efficient and robust studies of the allostatic load index (ALI) and healthcare utilization. Targeted validation with an enriched protocol allowed us to ensure the quality and promote the completeness of the EHR data. Findings from our validation study were incorporated into statistical models, which indicated that worse whole-person health was associated with higher odds of engaging with the healthcare system, adjusting for age.
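
As a stylized illustration of the design issues above, the Python sketch below targets validation effort at records where the error-prone EHR variable looks most suspect and then reweights the validated subset in the outcome model. The simulated data, validation probabilities, and variable names (including the generic score standing in for the ALI) are assumptions for illustration, not the authors' proposed design or estimator.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Hypothetical EHR cohort: error-prone health score, age, binary utilization.
N = 1000
age = rng.uniform(20, 80, size=N)
score = rng.beta(2, 5, size=N)                             # true (chart-review) score
score_ehr = np.clip(score + rng.normal(0, 0.1, N), 0, 1)   # error-prone EHR version
p = 1 / (1 + np.exp(-(-2 + 3 * score + 0.02 * age)))
util = rng.binomial(1, p)

# Targeted validation: oversample records with extreme EHR values, where
# errors are most consequential; expect roughly 100 validated charts.
pi = np.where((score_ehr < 0.1) | (score_ehr > 0.5), 0.25, 0.05)
pi = np.clip(pi * 100 / pi.sum(), 0.01, 1.0)
validated = rng.random(N) < pi

# Weighted logistic regression on the validated subset, using the
# chart-reviewed score and inverse-probability-of-validation weights.
X = sm.add_constant(np.column_stack([score[validated], age[validated]]))
fit = sm.GLM(util[validated], X, family=sm.families.Binomial(),
             freq_weights=1 / pi[validated]).fit()
print(fit.params)   # intercept, score, age (log-odds scale)

Weighting by the inverse of the validation probability keeps the estimates representative of the full cohort even though the validated subset was deliberately chosen to be unrepresentative.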

Speaker

Sarah Lotspeich, Wake Forest University

Toward Well-Calibrated Risk Estimation with Biased Training Data

The added value of candidate predictors for risk modeling is routinely evaluated by comparing the performance of models with and without the candidate predictors. Such comparisons are most meaningful when the estimated risk is unbiased in the target population. Often, data for the standard predictors in the base model are richly available from the target population, but data for the candidate predictors are available only from nonrepresentative convenience samples. If the base model is naively updated using the study data, without recognizing the discrepancy between the underlying distribution of the study data and that of the target population, the resulting risk estimates and the evaluation of the candidate predictors are biased. We propose a semiparametric method for model fitting that enables unbiased assessment of model improvement without requiring a representative sample from the target population, thereby overcoming a major practical bottleneck.

I will discuss how a data analysis project inspired this methodological effort, leading to a novel approach tailored to practical needs. I will also describe how this method underpinned a recent, well-scored scientific grant proposal, demonstrating how novel statistical methodology can drive and enable innovative scientific endeavors.
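
To see why naive updating fails, here is a small hypothetical Python simulation: a model including a candidate predictor is fit on an outcome-enriched convenience sample and then checked for calibration in the target population. The data-generating model, sampling rates, and the calibration check are illustrative assumptions, not the semiparametric method described above.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)

# Target population: outcome depends on a standard predictor x and a
# candidate predictor z (true risk model known here for illustration).
N = 50_000
x, z = rng.normal(size=N), rng.normal(size=N)
risk = 1 / (1 + np.exp(-(-2.0 + 1.0 * x + 0.8 * z)))
y = rng.binomial(1, risk)

# Convenience sample with z measured: events are heavily over-represented,
# so the sample's outcome distribution differs from the target population.
keep = rng.random(N) < np.where(y == 1, 0.5, 0.05)
Xs = sm.add_constant(np.column_stack([x[keep], z[keep]]))
updated = sm.GLM(y[keep], Xs, family=sm.families.Binomial()).fit()

# Naively apply the updated model across the target population, then check
# calibration by regressing the outcome on the logit of predicted risk;
# a well-calibrated model has intercept near 0 and slope near 1.
p_hat = updated.predict(sm.add_constant(np.column_stack([x, z])))
lp = np.log(p_hat / (1 - p_hat))
calib = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
print("calibration intercept and slope:", np.round(calib.params, 2))

Here the predicted risks are systematically too high (a calibration intercept well below zero) even though the slope is near one, mirroring the abstract's point that naively updating on a nonrepresentative sample biases risk estimates in the target population.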

Speaker

Jinbo Chen, University of Pennsylvania